
Rahman and Direkoglu BMC Medical Informatics and Decision Making (2025) 25:177
https://doi.org/10.1186/s12911-025-02978-w

RESEARCH  Open Access

A hybrid approach for binary and multi-class classification of voice disorders using a pre-trained model and ensemble classifiers

Mehtab Ur Rahman1,2* and Cem Direkoglu2

Abstract
Recent advances in artificial intelligence-based audio and speech processing have increasingly focused on the
binary and multi-class classification of voice disorders. Despite progress, achieving high accuracy in multi-class
classification remains challenging. This paper proposes a novel hybrid approach using a two-stage framework
to enhance voice disorders classification performance and achieve state-of-the-art accuracies in multi-class
classification. Our hybrid approach combines deep learning features with various powerful classifiers. In the first
stage, high-level feature embeddings are extracted from voice data spectrograms using a pre-trained VGGish
model. In the second stage, these embeddings are used as input to four different classifiers: Support Vector
Machine (SVM), Logistic Regression (LR), Multi-Layer Perceptron (MLP), and an Ensemble Classifier (EC). Experiments
are conducted on a subset of the Saarbruecken Voice Database (SVD) for male, female, and combined speakers.
For binary classification, VGGish-SVM achieved the highest accuracy for male speakers (82.45% for healthy vs.
disordered; 75.45% for hyperfunctional dysphonia vs. vocal fold paresis), while VGGish-EC performed best for
female speakers (71.54% for healthy vs. disordered; 68.42% for hyperfunctional dysphonia vs. vocal fold paresis).
In multi-class classification, VGGish-SVM outperformed other models, achieving mean accuracies of 77.81% for
male speakers, 63.11% for female speakers, and 70.53% for combined genders. We conducted a comparative
analysis against related works, including the Mel frequency cepstral coefficient (MFCC), MFCC-glottal features, and
features extracted using the wav2vec and HuBERT models with SVM classifier. Results demonstrate that our hybrid
approach consistently outperforms these models, especially in multi-class classification tasks. The results show the
feasibility of a hybrid framework for voice disorder classification, offering a foundation for refining automated tools
that could support clinical assessments with further validation.
Keywords Voice disorders, Multi-class classification, Ensemble classifier, VGGish

*Correspondence: Mehtab Ur Rahman, [email protected]
1 Department of Language and Communication, Radboud University, Houtlaan, Nijmegen, Gelderland 6525, Netherlands
2 Electrical and Electronics Engineering Department, Middle East Technical University, Northern Cyprus Campus, Kalkanli, Güzelyurt, Mersin 10, 99738, Turkey

© The Author(s) 2025. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.

Introduction
Voice production is the process by which humans produce sound to communicate ideas, meaning, opinions, and other information. The human voice production system includes the lungs, larynx, vocal tract, and vocal folds. The lungs provide the air pressure that is needed to vibrate the vocal folds. The vocal folds are located in the larynx, also known as the voice box. When the vocal folds vibrate, they create sound waves that travel through the vocal tract, which is the passage of air from the larynx to the mouth and nose. The shape of the vocal tract affects the timbre of the voice. Voice disorders can occur when there is a problem with any of the components of the voice production system causing changes in the pitch, loudness, or quality of the voice. These disorders can reduce the clarity of a person's oral communication ability. Voice disorders can vary in severity from minor hoarseness or alterations in vocal quality to the extreme outcome of complete voice loss [1].

Voice disorders can result from various factors. These disorders are commonly classified based on their underlying causes, which may include psychogenic, functional, or organic factors. Organic voice disorders are caused by structural or neurological problems that affect the vocal folds or other parts of the voice production system [2]. Functional voice disorders occur when the vocal mechanism is not used efficiently, even though the physical structure of the larynx and vocal tract is normal. Psychogenic voice disorders, on the other hand, stem from psychological factors such as sadness, anxiety, or emotional responses to traumatic or stressful situations [3].

Voice disorders can have a significant impact on people of all ages, potentially leading to stress, embarrassment, frustration, withdrawal, and depression. Professions that require frequent and demanding use of the voice, such as teaching, acting, and singing, are particularly susceptible to these disorders [4]. To ensure the right treatment, accurate classification of the voice disorders is crucial. A speech therapist typically evaluates the patient's voice quality for this purpose. However, this approach is subjective and relies on the speech therapist's expertise. Another approach to assess voice disorders is to use artificial intelligence (AI) to process acoustic features of voice signals, which provides an objective assessment. Automatic classification of voice disorders can provide speech therapists with a faster and more comfortable way to identify voice disorders in patients.

Recent advances in AI have enabled significant progress in audio and speech processing tasks, including speaker identification, speech emotion recognition, and voice disorder detection. For instance, Xie et al. [5] employed attention-based long short-term memory (LSTM) networks to classify speech emotions. Similarly, Keser and Gezer [6] conducted a comparative analysis of speaker identification methods, combining deep learning, machine learning, and subspace classifiers with diverse feature extraction techniques. Authors in [7] further demonstrate the potential of deep multiple instance learning for voice activity detection (VAD). These studies highlight the versatility of hybrid approaches, integrating feature engineering with classifiers like support vector machines (SVMs), logistic regression, and neural networks, to address the unique challenges of audio classification. However, voice disorder classification poses distinct difficulties due to the subtle acoustic variations between disorders and the need for high diagnostic precision. While existing works often focus on binary classification, multi-class frameworks remain underexplored. This gap motivates our proposed framework, which combines deep learning-based feature extraction with robust classifiers to improve both binary and multi-class classification performance.

Related work
In recent years, researchers have become increasingly interested in the automatic classification of voice disorders. These methods, utilizing computer algorithms to analyze speech signals, can revolutionize the classification and detection of voice disorders, making it more objective, efficient, and accessible. Most researchers focus on binary classification problems, such as classifying between healthy and pathological voices or the detection of a voice disorder. Spectrograms and cepstral analysis are two commonly used features for this purpose. However, in recent years, machine learning algorithms have gained popularity for their ability to learn and recognize patterns in acoustic features associated with various types of voice disorders. Fang et al. [8] used Mel Frequency Cepstral Coefficients (MFCCs) and three classifiers, namely Deep Neural Network (DNN), Gaussian Mixture Model (GMM) and SVM, for pathological voice detection. Cordeiro et al. [9] applied hierarchical classification for the identification of pathological voice, employing MFCCs and line spectral frequencies features. Kodrasi et al. [10] proposed a hierarchical multi-class automatic technique using handcrafted acoustic features to distinguish between speech apraxia, dysarthria and neurotypical speech. The approach utilizes two SVMs, with the first SVM distinguishing between neurotypical and impaired speech, while the second SVM discriminates between dysarthria and apraxia of speech.

Costa et al. [11] proposed combining the hidden Markov model (HMM) and modified MFCCs for the voice disorders caused by a vocal fold pathology. In [12, 13], the authors applied multilayer neural networks for the classification of MFCC features and demonstrated that results can be enhanced by considering the differentiation of the speaker's gender.
Ali et al. [14] introduced a method for the classification and detection of voice disorders, utilizing a Gaussian mixture model (GMM) classifier with running speech voice data. Benba et al. [15] investigated the detection of dysphonia using a Naive Bayes (NB) algorithm. They extracted acoustic features using MFCC. Authors in [16] used MFCC features to differentiate between Parkinson's disease (PD) and healthy voices. They extracted MFCC features from three different vowel sounds: /a/, /o/, and /u/. Authors in [17, 18] also explored binary classification of voice disorders in their research. In [19], the authors employ wavelet scattering features to capture time-frequency information from voice signals, which are then used for classifying neurological voice disorders.

In addition to binary classification, the majority of research studies use sustained vowel /a/ recordings from clinical settings for their investigations [20]. In [21], the introduction of continuous speech and vowel /a/ analysis for voice disorder identification is discussed. The authors conducted a comparison of glottal features extracted from the sustained vowel sound /a/ and voiced segments within continuous speech. Fujimura et al. [22] used an end-to-end 1D-CNN model to classify voice disorders using voice samples of the sustained vowel /a/. The research demonstrated that the 1D-CNN models were capable of consistently evaluating voice disorders, aligning with human assessments.

In recent years, deep learning has achieved impressive results in a variety of areas, including natural language processing, computer vision and audio analysis. Deep learning's ability to handle complex and high-dimensional acoustic features makes it well-suited for addressing the challenges of voice disorders classification. This has encouraged many researchers to explore the potential of deep learning for voice disorder classification. Wu et al. [23] developed a novel system using spectrograms of disordered and normal speech recordings as input. They employed Convolutional Deep Belief Networks for pre-training CNN weights as a generative model to understand the input data's structure statistically. Subsequently, they fine-tuned the CNN using supervised back-propagation. In [24], the authors propose the use of a CNN model along with short-time Fourier transform (STFT) features for the binary classification of voice disorders. Mohammed et al. [25] addressed the problem of voice disorder detection by using a CNN model. They specifically focused on the automatic detection of depression from speech. Chaiani et al. [26] analyzed an algorithm that extracts a chromagram acoustic feature from voice samples and uses it as input to a CNN-based classification system. The research in [27] proposed a two-stage framework for the classification of different voice disorders. The first stage uses speech enhancement to improve the voice signal quality by removing noise. The second stage employs a CNN with long short-term memory (CNN-LSTM) to learn complex features from spectrograms of the enhanced voice signals. Harar et al. [28] proposed a novel approach for voice pathology detection that uses convolutional and LSTM layers to learn directly from raw audio signals. Furthermore, recent studies have highlighted the beneficial impact of denoising for audio signals [29] and of advanced vocal feature extraction [30]. These approaches collectively suggest promising avenues for enhancing the automatic classification of voice disorders.

In voice disorders classification, limited data availability is a common challenge. To address this, some researchers have used pre-trained models [31–34]. In [35], the authors proposed a transfer learning framework that uses a pre-trained OpenL3-SVM model and linear local tangent space alignment (LLTSA) for dimensionality reduction. They first extracted the Mel spectrum of the voice signals and then fed it into the OpenL3 model to obtain high-level feature embeddings. Violeta et al. [36] investigated the performance of self-supervised pre-trained Wav2Vec 2.0 and WavLM models for automatic pathological speech recognition using different setups. Zhu et al. [37, 38] introduced pre-trained BERT and WavBERT models for the detection of dementia using human speech. Karaman et al. [39] employed the SqueezeNet1_1, ResNet101, and DenseNet161 networks for the detection of Parkinson's disease based on speech signals. The findings showed that the proposed networks, which utilize pre-trained models with a fine-tuning approach, achieved promising results. In [40], the authors used a pre-trained ResNet50 model for dysarthric speech detection.

Research gap and contribution
Most studies on the automatic classification of voice disorders have focused on binary classification, typically distinguishing between pathological and healthy voices. Some studies have taken a more specialized approach, aiming to identify particular pathological voices among all other pathological and healthy voices. A few studies have investigated multi-class classification of voice disorders, but the accuracy of these approaches is low. Multi-class classification of voice disorders is a challenging problem due to the limited training data and subtle differences between different types of disordered voices. In this study, we address both binary and multi-class classification of voice disorders. For binary classification, we distinguish between healthy and disordered voices, as well as between two different types of pathological voices. For multi-class classification, we have three classes: healthy, vocal fold paresis and hyperfunctional dysphonia.

Gender-specific classification of voice disorders has not been widely investigated.
We present classification results separately for male and female speakers, as well as combined results, for both binary and multi-class tasks. This enables us to analyze and compare gender-based differences in the classification of voice disorders.

Feature extraction is a crucial step in machine learning tasks, and it holds particular significance in the classification of voice disorders due to the small dataset size. We utilize the pre-trained VGGish model [41] to extract 128-dimensional high-level embedding features using the logarithmic mel spectrogram of voice data. As the name indicates, the VGGish network takes inspiration from the well-known VGG network and is adapted for audio classification. This model was trained on a large Audio set, which was a preliminary version of the YouTube-8M dataset. These embeddings are then utilized as input for machine learning classifiers.

Previous studies have employed transformer-based models like wav2vec and HuBERT for extracting audio embeddings. While these models perform well in general speech tasks, we found that VGGish, a CNN-based model, delivers better results for the classification of voice disorders. Consequently, our approach outperforms transformer-based models in this domain.

Our dataset is imbalanced, which mirrors the distribution often seen in real-world applications, where certain voice disorders are less common. This imbalance presents challenges in accurately classifying minority classes. To overcome this issue, we employed ensemble classifiers that combine the strengths of multiple models, improving performance on minority classes and enhancing overall classification accuracy.

We tested three machine learning classifiers: Logistic Regression (LR), Multi-Layer Perceptron (MLP), and Support Vector Machine (SVM). We also employed an ensemble classifier (EC) using SVM, LR, and MLP with soft voting to combine the predictions of the three classifiers. This allowed us to leverage the collective insights of these diverse classifiers and improve the overall classification performance.

This study demonstrates the effectiveness of utilizing embeddings from a pre-trained VGGish model and ensemble classifiers for both binary and multi-class classification of voice disorders. Additionally, we examine the impact of gender on the classification task. Our findings are compared to popular baseline methods, providing a comprehensive evaluation of our approach. The results show that our method outperforms the baseline approaches on both binary and multi-class classification tasks, demonstrating the superiority of the proposed method.

Paper outline
The rest of the paper is structured as follows. Section "Proposed method" presents an in-depth explanation of the proposed method. Section "Dataset and experimental setup" describes the experimental setup and the voice dataset used in this study. Section "Experiments and results" provides a comprehensive overview of the experiments conducted for both binary and multi-class classification tasks. We present the results and performance metrics achieved by our approach. Section "Discussion" presents the implications of our findings. Section "Conclusion" summarizes the key points and highlights the main contributions made by our study.

Proposed method
In this paper, we propose a novel hybrid two-stage framework for voice disorders classification. In the first stage, voice data is converted into logarithmic mel spectrograms and high-level feature embeddings are extracted from these spectrograms using the pre-trained VGGish model. In the second stage, we use classifiers, including an ensemble classifier, to classify the feature embeddings. Figure 1 provides an illustration of the proposed classification framework.

Fig. 1 The proposed voice disorders classification system

Preprocessing and feature extraction
Before extracting features, we apply several preprocessing steps. In the original dataset, as explained in Sect. 3, voice signals were recorded at 50 kHz sampling frequency. To align with our processing requirements, we resampled the audio to 16 kHz. The length of each audio recording in the original database is different. We trimmed the audio signals to 1 s. Audio signals that were less than 1 s were padded with zeros to ensure that all audio data had the same length.

The VGGish model takes the logarithmic mel spectrogram of an audio signal for feature extraction. To compute the mel spectrogram for each audio, we apply the Short-Time Fourier Transform (STFT) with a Hamming window lasting 25 milliseconds (ms) and a 10 ms shift. This resultant spectrogram is subsequently integrated into 64 frequency bins spaced along the Mel scale, and the magnitude of each bin is then transformed logarithmically. The configuration of the mel spectrogram draws inspiration from psychoacoustic analysis, which strives to replicate characteristics of the human auditory system. This procedure involves the application of a Mel filter bank denoted as $H_m(k)$ to filter the spectral line energy of the audio. The purpose of these filters is outlined by the following equations.

$$
H_m(k) =
\begin{cases}
0 & \text{if } k < f(m-1) \\
\dfrac{k - f(m-1)}{f(m) - f(m-1)} & \text{if } f(m-1) \le k \le f(m) \\
\dfrac{f(m+1) - k}{f(m+1) - f(m)} & \text{if } f(m) < k \le f(m+1) \\
0 & \text{if } k > f(m+1)
\end{cases}
\tag{1}
$$

Here, $0 \le m \le M$, and $M$ represents the count of filters. The central frequency $f(m)$ of the filters can be written as:

$$
f(m) = \frac{N}{f_s}\, F_{mel}^{-1}\!\left( F_{mel}(f_l) + m\,\frac{F_{mel}(f_h) - F_{mel}(f_l)}{M + 1} \right)
\tag{2}
$$

Here, $f_l$ denotes the lowest frequency within the filter's frequency domain. $f_h$ represents the highest frequency. $N$ corresponds to the length of the Fourier transform. $f_s$ stands for the sampling frequency. $F_{mel}$ signifies the Mel frequency. The transformation formula linking $F_{mel}$ and the regular frequency $f$ is given by:

$$
F_{mel} = 2595 \log\!\left( 1 + \frac{f}{700} \right)
\tag{3}
$$

The log mel spectrogram tensor (96 × 64) is the input to VGGish. Here, 96 is the number of frames within each time scale, and 64 is the number of frequency bands.
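To make the preprocessing and mel-scale configuration above concrete, the following Python sketch computes a 96 × 64 log mel patch for one recording. It is an illustrative assumption of how the described settings can be reproduced with a recent version of librosa (with `htk=True` matching the mel mapping of Eq. (3)); it is not the authors' code, and `wav_path` is a placeholder.

```python
import numpy as np
import librosa

SR = 16000  # target sampling rate (the original SVD audio is recorded at 50 kHz)

def log_mel_patch(wav_path: str) -> np.ndarray:
    """Return a 96 x 64 log mel spectrogram patch, the input shape expected by VGGish."""
    y, _ = librosa.load(wav_path, sr=SR)        # resample to 16 kHz
    y = librosa.util.fix_length(y, size=SR)     # trim or zero-pad to exactly 1 s

    mel = librosa.feature.melspectrogram(
        y=y, sr=SR,
        n_fft=400, win_length=400, hop_length=160,  # 25 ms Hamming window, 10 ms shift
        window="hamming",
        n_mels=64, htk=True,                        # 64 mel bands, HTK mel scale as in Eq. (3)
        power=1.0,                                  # magnitude (not power) spectrogram
    )
    log_mel = np.log(mel + 1e-2)                    # logarithmic compression
    return log_mel.T[:96]                           # keep 96 frames x 64 bands
```

The exact front end baked into the pre-trained VGGish release may differ in small details (e.g., the stabilizing offset inside the logarithm), so this block should be read as a sketch of the configuration described in the text rather than a bit-exact reimplementation.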
In Fig. 2, the VGGish [41] model's structure is illustrated. Batch normalization was implemented following each convolutional layer. The chosen loss function was cross-entropy, and the model employed the Adam optimizer. Dropout, weight decay, and other usual regularization methods were not utilized. This architecture was trained on a large Audio set which was a preliminary version of the YouTube-8M dataset. We extracted 128-dimensional high-level feature embeddings using a pre-trained VGGish model. These embeddings are then utilized as input for machine learning classifiers.

Fig. 2 VGGish model architecture
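A minimal sketch of this embedding-extraction step is given below. It assumes the community PyTorch port of VGGish exposed through torch.hub (`harritaylor/torchvggish`); the original TensorFlow release offers equivalent functionality. The port applies the log mel preprocessing internally and returns one 128-dimensional embedding per 0.96 s frame.

```python
import numpy as np
import torch

# Load the pre-trained VGGish model (assumption: the harritaylor/torchvggish hub port).
vggish = torch.hub.load("harritaylor/torchvggish", "vggish")
vggish.eval()

def vggish_embedding(wav_path: str) -> np.ndarray:
    """Return one 128-dimensional feature vector for a single recording."""
    with torch.no_grad():
        emb = vggish.forward(wav_path)          # shape: (num_frames, 128)
    emb = np.atleast_2d(emb.detach().cpu().numpy())
    # One-second clips yield a single 0.96 s frame; average if several frames are returned.
    return emb.mean(axis=0)
```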

Classifiers
We evaluated the performance of three classifiers: SVM,
LR, and MLP. The SVM classifies audio signals by map-
ping high-dimensional VGGish features into a new space
using a kernel, allowing it to create a nonlinear decision
boundary. LR assigns weight coefficients to features and
makes predictions based on probability scores. The MLP
model learns hierarchical representations through its
hidden layers, capturing complex patterns in the VGGish
embeddings.
In addition, we utilized an ensemble classifier (EC) that
incorporated SVM, LR, and MLP. Figure 3 shows the
EC model. Instead of relying on a single model, the EC
combines the predictions of multiple models to improve
accuracy and reduce the risk of overfitting. Soft voting
was employed to combine the predictions of these three
classifiers.
Soft voting is an ensemble strategy that combines the predictions of multiple classifiers by averaging their predicted probability scores for each class. In soft voting, each classifier outputs a probability distribution over the classes. For a given input sample, let $p_i(c)$ be the probability that classifier $i$ assigns to class $c$. With $K$ classifiers, the ensemble probability for class $c$ is computed as:

$$
P(c \mid x) = \frac{1}{K} \sum_{i=1}^{K} p_i(c)
\tag{4}
$$

Then, the final predicted class is the one with the highest averaged probability.

$$
\hat{y} = \arg\max_{c} P(c \mid x) = \arg\max_{c} \left( \frac{1}{K} \sum_{i=1}^{K} p_i(c) \right)
\tag{5}
$$

Fig. 3 Ensemble classifier
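Equations (4) and (5) amount to averaging the per-class probability vectors of the K base classifiers and taking the class with the largest average. A small NumPy illustration, using made-up probabilities for the three classes considered in this paper, is:

```python
import numpy as np

# Hypothetical class-probability outputs of K = 3 classifiers (e.g. SVM, LR, MLP)
# for one sample over the classes [healthy, hyperfunctional dysphonia, vocal fold paresis].
p = np.array([
    [0.70, 0.20, 0.10],   # p_1(c)
    [0.55, 0.30, 0.15],   # p_2(c)
    [0.60, 0.25, 0.15],   # p_3(c)
])

ensemble_prob = p.mean(axis=0)            # Eq. (4): P(c|x) = (1/K) * sum_i p_i(c)
predicted_class = ensemble_prob.argmax()  # Eq. (5): argmax over classes
print(ensemble_prob, predicted_class)     # approx. [0.62, 0.25, 0.13], class 0
```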
While the EC demonstrated superior performance for female speakers in the binary classification task (see Sect. 4), SVM emerged as the top-performing individual model for multiclass classification across genders and for male speakers in binary classification. This advantage can be attributed to SVM's ability to capture complex, nonlinear relationships within high-dimensional VGGish features while maintaining robust generalization through rigorous regularization.

In our experiments, we employed SVM with a radial basis function (RBF) kernel. The SVM was configured with a regularization parameter of '1', a 'scale' kernel coefficient, and utilized the 'ovr' (one-vs-rest) decision function shape. A logistic regression classifier with a maximum iteration count of 300, 'newton-cg' solver, L2 penalty and 'ovr' multi-class strategy was utilized to ensure convergence and prevent overfitting of the training data. Furthermore, we incorporated an MLP classifier with two hidden layers, stochastic gradient descent solver, a learning rate of 0.001 and ReLU activation function. To optimize the classifier's performance, we employed the grid search technique. We tested all classifiers for male and female speakers separately, as well as combined, for both binary and multi-class tasks.
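A hedged scikit-learn sketch of this configuration is given below. The hyperparameters mirror the settings listed above; the MLP hidden-layer sizes and the grid-search ranges are not reported in the paper, so those values are placeholders.

```python
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import VotingClassifier

def build_classifiers():
    svm = SVC(kernel="rbf", C=1.0, gamma="scale",
              decision_function_shape="ovr",
              probability=True)                      # probability estimates enable soft voting
    lr = LogisticRegression(max_iter=300, solver="newton-cg",
                            penalty="l2", multi_class="ovr")
    mlp = MLPClassifier(hidden_layer_sizes=(128, 64),  # two hidden layers (sizes are an assumption)
                        solver="sgd", learning_rate_init=0.001,
                        activation="relu", max_iter=500)
    ec = VotingClassifier(estimators=[("svm", svm), ("lr", lr), ("mlp", mlp)],
                          voting="soft")              # soft voting as in Eqs. (4)-(5)
    return {"SVM": svm, "LR": lr, "MLP": mlp, "EC": ec}
```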

Dataset and experimental setup
This section provides a comprehensive description of the voice dataset used in the study. We also describe the training and testing process of the classifiers.

Dataset
We selected a subset of voice data from the publicly available Saarbruecken Voice Database (SVD) [42, 43] for this study. The SVD database was created by researchers at the Institut für Phonetik at Saarland University and the Phoniatry Section of the Caritas Clinic St. Theresia in Saarbrücken. The database contains audio recordings of 71 different voice disorders. Speakers engage in various speaking tasks, including the pronunciation of the vowels 'a', 'i', and 'u' at normal, high, low, and rising-falling pitches, as well as saying the sentence "Guten Morgen, wie geht es Ihnen?" ("Good morning, how are you?"). This includes individuals who were recorded before and after recovery from a voice disorder. Every recording in the database was captured at a sampling frequency of 50 kHz and 16-bit resolution.

We extracted a subset from the SVD database containing three classes: healthy, hyperfunctional dysphonia, and vocal fold paresis. The healthy class includes 227 recordings of males and 360 recordings of females. The hyperfunctional dysphonia voice disorder class has 32 recordings of males and 114 recordings of females. The vocal fold paresis voice disorder class has 25 recordings of males and 60 recordings of females. We included the recordings from individuals whose ages ranged from 19 to 60 years at the time of recording. Table 1 provides details of the subset used in this work. We chose hyperfunctional dysphonia and vocal fold paresis because they are commonly found voice disorders [44]. By choosing these specific disorders and matching the number of recordings used in a previous study [45], we were able to directly compare our experimental results with existing research. This approach allowed for a more robust and fair evaluation of our findings.

Table 1 Details of voice recordings for each class
Class | Male recordings | Female recordings | Total recordings | Age range
Healthy | 227 | 360 | 587 | 19–60
Hyperfunctional dysphonia | 32 | 114 | 146 | 19–60
Vocal fold paresis | 25 | 60 | 85 | 19–60

Training and testing
To train the classifiers, we used 5-fold cross-validation. In each iteration, we held out one fold for evaluation and used the remaining folds for training. All samples from each speaker were consistently placed within a single fold to prevent the model from learning to classify voice samples based on speaker identity. We computed performance metrics based on the predictions generated for the evaluation fold. The evaluation metrics include mean accuracy and F1 score, as well as mean precision, recall, and F1 score for each class.

The dataset used in this study is imbalanced, which can be a problem for machine learning models, as they can learn to favor the majority classes and ignore the minority classes. To address this issue, we balanced the training set by oversampling the minority classes. This involved duplicating samples from the minority classes to ensure that each class had an equal number of samples in the training set, which helped to prevent the model from overfitting to the majority classes. We also applied StandardScaler to all feature embeddings to ensure that all features were on the same scale, thereby improving the performance of the machine learning models.
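One way to realize this protocol with scikit-learn is sketched below: speaker-grouped 5-fold splits, duplication-based oversampling of the minority classes on the training folds only, standardization of the embeddings, and per-fold metrics. The variable names (`X`, `y`, `speaker_ids`) and the oversampling helper are assumptions rather than the authors' implementation.

```python
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample
from sklearn.metrics import accuracy_score, f1_score

def oversample(X, y, seed=0):
    """Duplicate minority-class samples until every class matches the majority count."""
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    X_parts, y_parts = [X], [y]
    for c, n in zip(classes, counts):
        if n < n_max:
            extra = resample(X[y == c], replace=True,
                             n_samples=n_max - n, random_state=seed)
            X_parts.append(extra)
            y_parts.append(np.full(len(extra), c))
    return np.vstack(X_parts), np.concatenate(y_parts)

def evaluate(clf, X, y, speaker_ids, n_splits=5):
    """Speaker-grouped cross-validation; returns mean accuracy and mean (macro) F1."""
    accs, f1s = [], []
    for train_idx, test_idx in GroupKFold(n_splits).split(X, y, groups=speaker_ids):
        X_tr, y_tr = oversample(X[train_idx], y[train_idx])   # balance training fold only
        scaler = StandardScaler().fit(X_tr)                   # scale features
        clf.fit(scaler.transform(X_tr), y_tr)
        y_pred = clf.predict(scaler.transform(X[test_idx]))
        accs.append(accuracy_score(y[test_idx], y_pred))
        f1s.append(f1_score(y[test_idx], y_pred, average="macro"))
    return np.mean(accs), np.mean(f1s)
```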
Experiments and results
This section presents a comprehensive overview of all the experiments conducted to compare the performance of the proposed voice disorders classification framework to other state-of-the-art methods. The first two experiments address binary classification problems: healthy vs. disordered and vocal fold paresis vs. hyperfunctional dysphonia. The third experiment is a multi-class classification problem.

Healthy vs. disordered
As shown in Table 1, we have three classes: healthy, vocal fold paresis and hyperfunctional dysphonia. For this experiment, hyperfunctional dysphonia and vocal fold paresis are combined into a single class. Separate experiments have been conducted for male speakers, female speakers, and both genders combined. Table 2 shows the mean accuracy and F1 score, as well as the precision, recall, and F1 score of each class for male and female speakers. For male speakers, VGGish-SVM achieved the highest accuracy, closely followed by VGGish-EC. VGGish-SVM achieved an accuracy of 82.45%, while VGGish-EC reached 80.25%. Our method outperforms the approach presented in [45], as shown in Table 2, which uses SVM as a classifier and features extracted with wav2vec and HuBERT models, as well as SVM with MFCC and MFCC-glottal features.

For female speakers, VGGish-EC achieved the highest accuracy with 71.54%, followed closely by VGGish-SVM, VGGish-MLP and VGGish-LR with accuracies of 70.03%, 68.36%, and 66.31% respectively. It is worth noting that this is the only case where our model demonstrates a slightly lower accuracy compared to the existing method [45], which attains its highest accuracy of 74.50% using HuBERT-SVM.

Table 2 Performance metrics for the binary classification task of healthy vs. disordered for male and female speakers
Gender Model Accuracy F1 Score PR 0 RE 0 F1 0 PR 1 RE 1 F1 1
Male VGGish-SVM 82.45 ± 2.77 82.99 0.91 0.87 0.89 0.54 0.64 0.58
VGGish-LR 75.35 ± 4.30 75.45 0.85 0.84 0.84 0.41 0.41 0.40
VGGish-MLP 77.09 ± 5.75 76.96 0.86 0.86 0.86 0.43 0.42 0.42
VGGish-EC 80.25 ± 5.70 79.66 0.86 0.89 0.88 0.51 0.44 0.47
wav2vec-SVM [45] 75.65 ± 5.81 - 0.91 0.82 0.87 0.50 0.69 0.58
MFCC-glottal-SVM [45] 74.48 ± 5.85 - 0.90 0.84 0.87 0.51 0.64 0.57
MFCC-SVM [45] 72.02 ± 7.75 - 0.89 0.88 0.88 0.54 0.56 0.55
HuBERT-SVM [45] 72.14 ± 7.93 - 0.89 0.85 0.87 0.50 0.59 0.54
Female VGGish-SVM 70.03 ± 3.07 70.05 0.79 0.77 0.77 0.53 0.57 0.54
VGGish-LR 66.31 ± 4.86 66.68 0.77 0.72 0.74 0.48 0.55 0.51
VGGish-MLP 68.36 ± 3.76 68.11 0.76 0.78 0.77 0.51 0.49 0.50
VGGish-EC 71.54 ± 4.13 71.83 0.80 0.76 0.78 0.56 0.62 0.58
wav2vec-SVM [45] 73.80 ± 5.03 - 0.84 0.77 0.80 0.60 0.71 0.65
MFCC-glottal-SVM [45] 66.13 ± 3.11 - 0.80 0.66 0.72 0.49 0.66 0.56
MFCC-SVM [45] 68.15 ± 4.59 - 0.81 0.68 0.74 0.51 0.68 0.58
HuBERT-SVM [45] 74.50 ± 4.38 - 0.85 0.76 0.81 0.60 0.72 0.65
In the metric names, ‘0’ corresponds to the healthy class, and ‘1’ represents the disordered class. PR, RE and F1 represent Precision, Recall and F1 score respectively. The mean values over folds are presented for all metrics. The highest accuracy is indicated in bold. Additionally, standard deviations for accuracy are provided.

Table 3 Performance metrics for the binary classification task of healthy vs. disordered for male and female speakers combined
Gender Model Accuracy F1 Score PR 0 RE 0 F1 0 PR 1 RE 1 F1 1
Male & Female VGGish-SVM 73.35 ± 3.32 72.95 0.81 0.83 0.82 0.53 0.49 0.51
VGGish-LR 70.05 ± 3.08 70.80 0.82 0.75 0.78 0.48 0.58 0.52
VGGish-MLP 73.35 ± 3.93 73.57 0.82 0.80 0.81 0.53 0.56 0.54
VGGish-EC 73.84 ± 2.83 73.92 0.82 0.81 0.82 0.54 0.55 0.54
In the metric names, ‘0’ corresponds to the healthy class, and ‘1’ represents the disordered class. PR, RE and F1 represent Precision, Recall and F1 score respectively. The mean values over folds are presented for all metrics. The highest accuracy is indicated in bold. Additionally, standard deviations for accuracy are provided.

Fig. 4 Normalized confusion matrix for healthy vs. disordered. The predicted classes are represented on the horizontal axis, while the true classes are
represented on the vertical axis. Class labels: 0 for healthy and 1 for disordered
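The normalized matrices in these figures can be obtained, for example, by row-normalizing the confusion matrix of each evaluation fold and averaging over folds, as in the following sketch (an assumption about the bookkeeping, not the authors' plotting code):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def mean_normalized_confusion(fold_results, labels=(0, 1)):
    """Average row-normalized confusion matrices over CV folds.

    fold_results: iterable of (y_true, y_pred) pairs, one per fold.
    Rows correspond to true classes, columns to predicted classes.
    """
    mats = [confusion_matrix(y_true, y_pred, labels=labels, normalize="true")
            for y_true, y_pred in fold_results]
    return np.mean(mats, axis=0)
```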

Experiments were also conducted with male and female speakers combined. The mean accuracy, F1 score, precision and recall are shown in Table 3. The results demonstrate that VGGish-EC achieved the highest overall accuracy and F1 score, with values of 73.84% and 73.92%, respectively. It was closely followed by VGGish-MLP, VGGish-SVM, and then VGGish-LR in terms of accuracy. This study's results on combined male and female speakers cannot be directly compared to those of any other study because of the differences in the datasets and disorders studied. Figure 4 presents the normalized confusion matrices for each classifier and gender.

Hyperfunctional dysphonia vs. vocal fold paresis
To classify hyperfunctional dysphonia and vocal fold paresis, we used the same classification setup for male, female, and combined-gender speakers. The mean accuracy, F1 score, precision, and recall for male and female speakers are presented in Table 4. VGGish-SVM achieved the highest accuracy (75.45%) for male speakers, while VGGish-EC achieved 71.82%. For female speakers, VGGish-EC attained the highest accuracy at 68.42%, closely followed by VGGish-SVM, VGGish-MLP, and VGGish-LR, with respective accuracies of 68.37%, 64.97%, and 62.11%. Our method outperforms the approach presented in [45]. For male speakers, their highest accuracy was 71.95%, while for female speakers, their best accuracy was 63.06%, achieved with wav2vec-SVM.

Table 5 presents the mean accuracy, F1 score, precision, and recall for male and female speakers combined. VGGish-SVM achieved the highest overall accuracy of 68.80% and F1 score of 67.64%, followed by VGGish-EC with an accuracy of 67.10% and F1 score of 66.39%. These results cannot be directly compared to previous studies because of the differences in the datasets and disorders studied. Figure 5 illustrates the normalized confusion matrices for all classifiers and genders.

Table 4 Performance metrics for the binary classification task of hyperfunctional dysphonia and vocal fold paresis for male and female
speakers
Gender Model Accuracy F1 Score PR 0 RE 0 F1 0 PR 1 RE 1 F1 1
Male VGGish-SVM 75.45 ± 6.24 74.64 0.75 0.85 0.79 0.81 0.64 0.69
VGGish-LR 66.52 ± 7.24 64.92 0.68 0.75 0.69 0.70 0.56 0.59
VGGish-MLP 71.66 ± 11.02 70.97 0.72 0.81 0.75 0.74 0.60 0.65
VGGish-EC 71.82 ± 7.12 70.89 0.73 0.81 0.75 0.75 0.60 0.65
wav2vec-SVM [45] 71.95 ± 12.62 – 0.74 0.75 0.74 0.67 0.66 0.66
MFCC-glottal-SVM [45] 69.05 ± 9.67 – 0.73 0.74 0.73 0.66 0.64 0.65
MFCC-SVM [45] 61.60 ± 8.86 – 0.65 0.76 0.70 0.60 0.47 0.53
HuBERT-SVM [45] 71.88 ± 10.56 – 0.73 0.80 0.76 0.70 0.62 0.66
Female VGGish-SVM 68.37 ± 6.61 67.66 0.74 0.80 0.77 0.56 0.47 0.51
VGGish-LR 62.11 ± 7.77 61.83 0.70 0.73 0.71 0.46 0.42 0.44
VGGish-MLP 64.97 ± 3.51 64.96 0.74 0.72 0.73 0.50 0.52 0.50
VGGish-EC 68.42 ± 6.39 68.08 0.75 0.77 0.76 0.54 0.52 0.53
wav2vec-SVM [45] 63.06 ± 6.77 – 0.74 0.83 0.78 0.57 0.44 0.50
MFCC-glottal-SVM [45] 59.96 ± 7.91 – 0.72 0.81 0.76 0.52 0.40 0.45
MFCC-SVM [45] 57.09 ± 7.48 – 0.71 0.74 0.72 0.45 0.42 0.43
HuBERT-SVM [45] 61.31 ± 5.94 – 0.73 0.78 0.75 0.52 0.45 0.48
In the metric names, ‘0’ represents the hyperfunctional dysphonia class and ‘1’ represents vocal fold paresis. PR, RE and F1 represent Precision, Recall and F1 score respectively. The mean values over folds are presented for all metrics. The highest accuracy is indicated in bold. Additionally, standard deviations for accuracy are provided.

Table 5 Performance metrics for the binary classification task of hyperfunctional dysphonia and vocal fold paresis for male and female
speakers combined
Gender Model Accuracy F1 Score PR 0 RE 0 F1 0 PR 1 RE 1 F1 1
Male & Female VGGish-SVM 68.80 ± 6.79 67.64 0.72 0.82 0.77 0.60 0.46 0.52
VGGish-LR 63.20 ± 2.40 63.03 0.72 0.70 0.70 0.50 0.52 0.50
VGGish-MLP 65.37 ± 3.08 65.27 0.73 0.73 0.72 0.53 0.53 0.53
VGGish-EC 67.10 ± 3.93 66.39 0.72 0.78 0.75 0.57 0.48 0.52
In the metric names, ‘0’ represents the hyperfunctional dysphonia class and ‘1’ represents vocal fold paresis. PR, RE and F1 represent Precision, Recall and F1 score respectively. The mean values over folds are presented for all metrics. The highest accuracy is indicated in bold. Additionally, standard deviations for accuracy are provided.

Multi-class classification
The mean accuracy, F1 score, precision, recall, and F1 score of each class for male and female speakers are shown in Table 6 and for both genders combined in Table 7. For male speakers, the highest classification accuracy achieved was 77.81%, for female speakers, it was 63.11%, and when both genders were combined, the accuracy reached 70.53%. In the case of multi-class classification, VGGish-SVM outperformed all other classifiers, including those in [41], in terms of accuracy. While the accuracy of VGGish-EC is lower than that of VGGish-SVM, it demonstrates better performance for the minority classes, which is important when dealing with imbalanced datasets, as it ensures that the model effectively recognizes and classifies the minority classes. The normalized confusion matrices for all classifiers are illustrated in Fig. 6. It is clear that the ensemble classifier enhances the performance of the minority classes for both male and female speakers.

Discussion
The proposed voice disorders classification system demonstrates superior performance compared to state-of-the-art methods. In this study, we employed machine learning classifiers, particularly ensemble classifiers, to evaluate high-level feature embeddings extracted using a pre-trained VGGish model. To evaluate the effectiveness of our approach, the results were compared with those reported in [45], where the same dataset was used for evaluation. Our study shows that extracting features with a pre-trained model outperforms MFCC feature-based systems, which are the most commonly used features in the detection and classification of voice disorders [46–48]. This statement is also confirmed by [45], where the authors extract features with the wav2vec and HuBERT models and compare the results with MFCC features.

Our study also investigated the performance of the proposed system on male and female speakers separately for both binary and multi-class classification tasks. Interestingly, our findings reveal a consistent trend where the accuracy of male speakers outperforms that of female speakers. The best accuracy for healthy vs. disordered classification was 82.45% for male speakers and 71.54% for female speakers.

Fig. 5 Normalized confusion matrix for hyperfunctional dysphonia vs. vocal fold paresis. The predicted classes are represented on the horizontal axis,
while the true classes are represented on the vertical axis. Class labels: 0 for hyperfunctional dysphonia and 1 for vocal fold paresis

Similarly, the highest accuracy for hyperfunctional dysphonia vs. vocal fold paresis classification was 75.45% for male speakers and 68.42% for female speakers. In the multi-class classification scenario, the accuracy differences between male and female speakers continued similar trends. For male speakers, our model achieved an impressive accuracy of 77.81%; however, for female speakers, the highest accuracy observed was 63.11%. It is important to highlight that the binary classification of healthy vs. disordered voices for female speakers stands as the only case where our model exhibited a slightly lower accuracy compared to the results reported in [45].

VGGish-SVM achieved the highest accuracy for male speakers and VGGish-EC for female speakers in both binary classification tasks (i.e., healthy vs. disordered and hyperfunctional dysphonia vs. vocal fold paresis). In multi-class classification, VGGish-SVM performed better for both genders. However, while VGGish-EC achieved a lower overall accuracy than VGGish-SVM in multi-class classification, it outperformed VGGish-SVM on minority classes. For example, for male speakers, the precision and recall for hyperfunctional dysphonia with VGGish-SVM were 0.20 and 0.27, respectively, while with VGGish-EC, the precision and recall were 0.23 and 0.44, respectively. Similarly, VGGish-EC performed better for vocal fold paresis. The same trend was observed for female speakers. In multi-class classification, for male speakers, the lowest F1 score is recorded for hyperfunctional dysphonia, while for female speakers, the lowest F1 score is observed for vocal fold paresis. These classes presented particular challenges in terms of accuracy, probably because of the smaller number of samples available for these classes. This underlines the importance of addressing data imbalance in future research to further enhance classification performance.

As part of our future work, we plan to incorporate explainability techniques such as LIME, SHAP, and Grad-CAM. These methods will enable us to better understand the contribution of different features in the classification process and provide visual insights into the regions of the spectrograms that are most influential in decision-making. It will help build trust in the model's predictions and facilitate its integration into diagnostic workflows.

This study demonstrates the efficacy of hybrid frameworks for voice disorder classification using controlled datasets. However, it does not evaluate real-time performance, which is a critical factor for clinical deployment. Furthermore, the computational demands of the VGGish feature extractor and classifier pipeline may introduce latency in unoptimized implementations. Future work will focus on optimizing the framework for low-latency inference (e.g., via model lightweighting, edge-device deployment) and validating its performance on streaming audio data acquired in clinical or telehealth settings.

Table 6 Multi-class classification performance metrics for male and female speakers
Gender Model Accuracy F1 Score PR 0 RE 0 F1 0 PR 1 RE 1 F1 1 PR 2 RE 2 F1 2
Male VGGish-SVM 77.81 ± 2.71 78.34 0.88 0.87 0.88 0.20 0.27 0.22 0.73 0.56 0.62
VGGish-LR 74.62 ± 4.86 75.83 0.88 0.83 0.85 0.23 0.31 0.25 0.58 0.56 0.55
VGGish-MLP 77.46 ± 2.06 77.28 0.88 0.87 0.87 0.20 0.22 0.19 0.62 0.60 0.59
VGGish-EC 72.17 ± 7.34 74.73 0.89 0.77 0.82 0.23 0.44 0.30 0.65 0.64 0.62
wav2vec-LARGE-hier [45] 62.77 ± 10.94 - 0.92 0.79 0.85 0.30 0.55 0.39 0.47 0.53 0.50
MFCC-glottal-SVM-hier [45] 57.35 ± 6.79 - 0.91 0.84 0.87 0.28 0.38 0.33 0.41 0.48 0.44
MFCC-SVM-hier [45] 53.76 ± 9.35 - 0.89 0.87 0.88 0.26 0.31 0.29 0.43 0.42 0.42
Female VGGish-SVM 63.11 ± 3.92 62.89 0.76 0.77 0.76 0.38 0.37 0.37 0.32 0.32 0.31
VGGish-LR 55.99 ± 2.30 57.87 0.77 0.64 0.69 0.32 0.40 0.35 0.25 0.40 0.31
VGGish-MLP 61.98 ± 1.94 61.64 0.76 0.76 0.76 0.38 0.36 0.37 0.24 0.28 0.26
VGGish-EC 62.56 ± 3.41 64.28 0.82 0.69 0.75 0.39 0.50 0.44 0.33 0.47 0.38
wav2vec-LARGE-hier [45] 55.36 ± 4.99 - 0.84 0.78 0.81 0.39 0.50 0.43 0.43 0.39 0.41
MFCC-glottal-SVM-hier [45] 49.27 ± 5.80 - 0.80 0.65 0.72 0.30 0.49 0.37 0.37 0.33 0.35
MFCC-SVM-hier [45] 51.11 ± 7.08 - 0.82 0.69 0.75 0.34 0.47 0.39 0.31 0.37 0.34
In the metric names, ‘0’ represents the healthy class, ‘1’ represents hyperfunctional dysphonia, and ‘2’ represents vocal fold paresis. PR, RE and F1 represent Precision, Recall and F1 score respectively. The mean values over folds are presented for all metrics. The highest accuracy is indicated in bold. Additionally, standard deviations for accuracy are provided.
Table 7 Multi-class classification performance metrics for male and female speakers combined
Gender Model Accuracy F1 Score PR 0 RE 0 F1 0 PR 1 RE 1 F1 1 PR 2 RE 2 F1 2
Male & Female VGGish-SVM 70.53 ± 3.22 69.53 0.81 0.83 0.82 0.38 0.36 0.37 0.48 0.39 0.41
VGGish-LR 61.00 ± 6.29 63.52 0.83 0.67 0.74 0.30 0.43 0.35 0.34 0.52 0.40
VGGish-MLP 67.85 ± 2.65 67.80 0.80 0.80 0.80 0.38 0.37 0.37 0.36 0.35 0.35
VGGish-EC 68.34 ± 3.45 68.52 0.81 0.80 0.80 0.38 0.38 0.38 0.40 0.41 0.40
In the metric names, ‘0’ represents the healthy class, ‘1’ represents hyperfunctional dysphonia, and ‘2’ represents vocal fold paresis. PR, RE and F1 represent Precision, Recall and F1 score respectively. The mean values over folds are presented for all metrics. The highest accuracy is indicated in bold. Additionally, standard deviations for accuracy are provided.


Fig. 6 Normalized confusion matrix for multi-class classification. The predicted classes are represented on the horizontal axis, while the true classes are
represented on the vertical axis. Class labels: 0 for healthy, 1 for hyperfunctional dysphonia, and 2 for vocal fold paresis

Conclusion
In this paper, we proposed a two-stage hybrid framework for voice disorders classification. In the first stage, we utilized a pre-trained VGGish model to extract high-level feature embeddings from the log-mel spectrograms of voice data. In the second stage, we evaluated four classifiers: support vector machine (SVM), logistic regression (LR), multilayer perceptron (MLP), and an ensemble classifier.

The results of our study demonstrate the potential of using a pre-trained VGGish model to extract features for voice disorders classification. We achieved state-of-the-art results on the SVD dataset, outperforming the baseline systems that used MFCC features, MFCC-glottal features, as well as features extracted with pre-trained wav2vec and HuBERT models. Compared to the best baseline accuracy, we improved by 6.8% for male speakers in the healthy vs. disordered task, and by 3.5% and 5.36% for male and female speakers respectively in the hyperfunctional dysphonia vs. vocal fold paresis task. In the context of multi-class classification, our method significantly outperformed the baseline, achieving a 15.04% improvement for male speakers and a 7.75% improvement for female speakers.

While our model excelled in most scenarios, there was a slight exception. In the healthy vs. disordered task for female speakers, our model demonstrated an accuracy that was 2.96% lower when compared to the baseline. The accuracies for the combined dataset of male and female speakers are also promising in all three scenarios. It is important to note that these combined results cannot be directly compared to existing studies because of variations in the datasets and the types of voice disorders investigated.

In binary classification, VGGish-SVM exhibited the highest accuracy for male speakers, while VGGish-EC performed best for female speakers. However, in multi-class classification, VGGish-SVM outperformed other models for both genders. Notably, VGGish-EC demonstrated its strength in handling minority classes, an important aspect of medical applications. The results confirm that VGGish-EC provides more balanced accuracy by giving importance to the minority classes. Although we used oversampling to balance the classes, the accuracy of minority classes remains comparatively lower. Future research will focus on improving the robustness and generalizability of the proposed two-stage hybrid framework for voice disorders classification. Additionally, expanding the dataset to include a more diverse and broader range of voice disorders will be crucial for enhancing the model's applicability in real-world scenarios.

signal processing (ICASSP). 2021. ​h​t​t​p​​s​:​/​​/​d​o​i​​.​o​​r​g​/​​1​0​.​​1​1​0​9​​/​i​​c​a​s​​s​p​3​​9​7​2​8​​.​2​​0​2​1​.​9​


be crucial for enhancing the model’s applicability in real- 4​1​4​2​8​3
world scenarios. 11. Costa SC, Neto BGA, Fechine JM. Pathological voice discrimination using
CEPSTRAL analysis, vector quantization and Hidden Markov models. In: 2008
Acknowledgements 8th IEEE international conference on bioinformatics and bioengineering.
Not applicable. 2008. ​h​t​t​p​​s​:​/​​/​d​o​i​​.​o​​r​g​/​​1​0​.​​1​1​0​9​​/​b​​i​b​e​.​2​0​0​8​.​4​6​9​6​7​8​3.
12. Salhi L, Mourad T, Cherif A. Voice disorders classification using multilayer
Author contributions neural network. In: 2008 2nd International conference on signals, circuits and
Concept and design: MUR. Experiments and analysis: MUR. Drafting of the systems. 2008. ​h​t​t​p​​s​:​/​​/​d​o​i​​.​o​​r​g​/​​1​0​.​​1​1​0​9​​/​i​​c​s​c​s​.​2​0​0​8​.​4​7​4​6​9​5​3.
manuscript: MUR and CD. Critical revision of the manuscript: MUR and CD. 13. Fraile R, Sáenz-Lechón N, Godino-Llorente JI, Osma-Ruiz V, Fredouille C. Auto-
matic detection of laryngeal pathologies in records of sustained vowels by
Funding means of Mel-frequency cepstral coefficient parameters and differentiation
Not applicable. of patients by sex. Folia Phoniatr Logop. 2009;61(3):146–52. ​h​t​t​p​​s​:​/​​/​d​o​i​​.​o​​r​g​/​​1​0​
.​​1​1​5​9​​/​0​​0​0​2​1​9​9​5​0.
Data Availability 14. Ali Z, Elamvazuthi I, Alsulaiman M, Muhammad G. Automatic voice
The data used in this study were selected from the publicly available pathology detection with running speech by using estimation of auditory
Saarbruecken Voice Database (SVD). The full database can be accessed at the spectrum and CEPSTRAL coefficients based on the all-pole model. J Voice.
link: ​h​t​t​p​​s​:​/​​/​s​t​i​​m​m​​d​a​t​​e​n​b​​a​n​k​.​​c​o​​l​i​.​​u​n​i​​-​s​a​a​​r​l​​a​n​d​.​d​e​/. 2016;30(6):757.e7–757.e19. ​h​t​t​p​​s​:​/​​/​d​o​i​​.​o​​r​g​/​​1​0​.​​1​0​1​6​​/​j​​.​j​v​​o​i​c​​e​.​2​0​​1​5​​.​0​8​.​0​1​0.
15. Al-Dhief FT, Latiff NMA, Malik NNNA, Baki MM, Sabri N, Albadr MAA. Dyspho-
nia detection based on voice signals using naive bayes classifier. In: 2022 IEEE
Declarations 6th international symposium on telecommunication technologies (ISTT).
IEEE; 2022. pp. 56–61. ​h​t​t​p​​s​:​/​​/​d​o​i​​.​o​​r​g​/​​1​0​.​​1​1​0​9​​/​i​​s​t​t​​5​6​2​​8​8​.​2​​0​2​​2​.​9​9​6​6​5​3​5.
Ethics approval and consent to participate 16. Benba A, Jilbab A, Hammouch A. Analysis of multiple types of voice record-
Not applicable. ings in cepstral domain using MFCC for discriminating between patients with
Parkinson’s disease and healthy people. Int J Speech Technol. 2016;19(3):449–
Consent for publication 56. ​h​t​t​p​​s​:​/​​/​d​o​i​​.​o​​r​g​/​​1​0​.​​1​0​0​7​​/​s​​1​0​7​7​2​-​0​1​6​-​9​3​3​8​-​4.
Not applicable. 17. Gómez-García JA, Moro-Velázquez L, Godino-Llorente JI. On the design of
automatic voice condition analysis systems. Part ii: review of speaker rec-
Clinical trial number ognition techniques and study on the effects of different variability factors.
Not applicable. Biomed Signal Process Control. 2019;48:128–43. ​h​t​t​p​​s​:​/​​/​d​o​i​​.​o​​r​g​/​​1​0​.​​1​0​1​6​​/​j​​.​b​s​p​
c​.​2​0​1​8​.​0​9​.​0​0​3.
Competing interests 18. Reddy MK, Alku P. A comparison of CEPSTRAL features in the detection of
The authors declare no competing interests. pathological voices by varying the input and Filterbank of the cepstrum
computation. IEEE Access. 2021;9:135953–63. ​h​t​t​p​​s​:​/​​/​d​o​i​​.​o​​r​g​/​​1​0​.​​1​1​0​9​​/​a​​c​c​e​​s​s​.​​
Received: 28 November 2024 / Accepted: 18 March 2025 2​0​2​1​​.​3​​1​1​7​6​6​5.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.