A Hybrid Approach For Binary and Multi-Class Classification of Voice Disorders
BMC Medical Informatics and Decision Making
https://doi.org/10.1186/s12911-025-02978-w
Abstract
Recent advances in artificial intelligence-based audio and speech processing have increasingly focused on the
binary and multi-class classification of voice disorders. Despite progress, achieving high accuracy in multi-class
classification remains challenging. This paper proposes a novel hybrid approach using a two-stage framework
to enhance voice disorder classification performance and achieve state-of-the-art accuracies in multi-class
classification. Our hybrid approach combines deep learning features with various powerful classifiers. In the first
stage, high-level feature embeddings are extracted from voice data spectrograms using a pre-trained VGGish
model. In the second stage, these embeddings are used as input to four different classifiers: Support Vector
Machine (SVM), Logistic Regression (LR), Multi-Layer Perceptron (MLP), and an Ensemble Classifier (EC). Experiments
are conducted on a subset of the Saarbruecken Voice Database (SVD) for male, female, and combined speakers.
For binary classification, VGGish-SVM achieved the highest accuracy for male speakers (82.45% for healthy vs.
disordered; 75.45% for hyperfunctional dysphonia vs. vocal fold paresis), while VGGish-EC performed best for
female speakers (71.54% for healthy vs. disordered; 68.42% for hyperfunctional dysphonia vs. vocal fold paresis).
In multi-class classification, VGGish-SVM outperformed other models, achieving mean accuracies of 77.81% for
male speakers, 63.11% for female speakers, and 70.53% for combined genders. We conducted a comparative
analysis against related works, including the Mel frequency cepstral coefficient (MFCC), MFCC-glottal features, and
features extracted using the wav2vec and HuBERT models with SVM classifier. Results demonstrate that our hybrid
approach consistently outperforms these models, especially in multi-class classification tasks. The results show the
feasibility of a hybrid framework for voice disorder classification, offering a foundation for refining automated tools
that could support clinical assessments with further validation.
Keywords Voice disorders, Multi-class classification, Ensemble classifier, VGGish
*Correspondence: Mehtab Ur Rahman, [email protected]
1 Department of Language and Communication, Radboud University, Houtlaan, Nijmegen, Gelderland 6525, Netherlands
2 Electrical and Electronics Engineering Department, Middle East Technical University, Northern Cyprus Campus, Kalkanli, Güzelyurt, Mersin 10, 99738, Turkey
the classification and detection of voice disorders, utilizing a Gaussian mixture model (GMM) classifier with running speech voice data. Benba et al. [15] investigated the detection of dysphonia using a Naive Bayes (NB) algorithm. They extracted acoustic features using MFCC. The authors in [16] used MFCC features to differentiate between Parkinson's disease (PD) and healthy voices. They extracted MFCC features from three different vowel sounds: /a/, /o/, and /u/. The authors in [17, 18] also explored binary classification of voice disorders in their research. In [19], the authors employ wavelet scattering features to capture both time and frequency information from voice signals; these features are then used to classify neurological voice disorders.

In addition to binary classification, the majority of research studies use sustained vowel /a/ recordings from clinical settings for their investigations [20]. In [21], the introduction of continuous speech and vowel /a/ analysis for voice disorder identification is discussed. The authors conducted a comparison of glottal features extracted from the sustained vowel sound /a/ and voiced segments within continuous speech. Fujimura et al. [22] used an end-to-end 1D-CNN model to classify voice disorders using voice samples of the sustained vowel /a/. The research demonstrated that the 1D-CNN models were capable of consistently evaluating voice disorders, aligning with human assessments.

In recent years, deep learning has achieved impressive results in a variety of areas, including natural language processing, computer vision, and audio analysis. Deep learning's ability to handle complex and high-dimensional acoustic features makes it well-suited for addressing the challenges of voice disorder classification. This has encouraged many researchers to explore the potential of deep learning for voice disorder classification. Wu et al. [23] developed a novel system using spectrograms of disordered and normal speech recordings as input. They employed Convolutional Deep Belief Networks for pre-training CNN weights as a generative model to understand the input data's structure statistically. Subsequently, they fine-tuned the CNN using supervised back-propagation. In [24], the authors propose the use of a CNN model along with short-time Fourier transform (STFT) features for the binary classification of voice disorders. Mohammed et al. [25] addressed the problem of voice disorder detection by using a CNN model. They specifically focused on the automatic detection of depression from speech. In Chaiani et al. [26], the authors analyzed an algorithm that extracts a chromagram acoustic feature from voice samples and uses it as input to a CNN-based classification system. The research in [27] proposed a two-stage framework for the classification of different voice disorders. The first stage uses speech enhancement to improve the voice signal quality by removing noise. The second stage employs a CNN with long short-term memory (CNN-LSTM) to learn complex features from spectrograms of the enhanced voice signals. Harar et al. [28] proposed a novel approach for voice pathology detection that uses convolutional and LSTM layers to learn directly from raw audio signals. Furthermore, recent studies have highlighted the beneficial impact of denoising for audio signals [29] and advanced vocal feature extraction [30]. These approaches collectively suggest promising avenues for enhancing the automatic classification of voice disorders.

In voice disorder classification, limited data availability is a common challenge. To address this, some researchers have used pre-trained models [31–34]. In [35], the authors proposed a transfer learning framework that uses a pre-trained OpenL3-SVM model and linear local tangent space alignment (LLTSA) for dimensionality reduction. They first extracted the Mel spectrum of the voice signals and then fed it into the OpenL3 model to obtain high-level feature embeddings. Violeta et al. [36] investigated the performance of self-supervised pre-trained Wav2Vec 2.0 and WavLM models for automatic pathological speech recognition using different setups. Zhu et al. [37, 38] introduced pre-trained BERT and WavBERT models for the detection of dementia using human speech. Karaman et al. [39] employed the SqueezeNet1_1, ResNet101, and DenseNet161 networks for the detection of Parkinson's disease based on speech signals. The findings showed that the proposed networks, which utilize pre-trained models with a fine-tuning approach, achieved promising results. In [40], the authors used a pre-trained ResNet50 model for dysarthric speech detection.

Research gap and contribution
Most studies on the automatic classification of voice disorders have focused on binary classification, typically distinguishing between pathological and healthy voices. Some studies have taken a more specialized approach, aiming to identify particular pathological voices among all other pathological and healthy voices. A few studies have investigated multi-class classification of voice disorders, but the accuracy of these approaches is low. Multi-class classification of voice disorders is a challenging problem due to the limited training data and subtle differences between different types of disordered voices. In this study, we address both binary and multi-class classification of voice disorders. For binary classification, we distinguish between healthy and disordered voices, as well as between two different types of pathological voices. For multi-class classification, we have three classes: healthy, vocal fold paresis, and hyperfunctional dysphonia.

Gender-specific classification of voice disorders has not been widely investigated. We present classification results separately for male and female speakers, as well as combined results, for both binary and multi-class tasks. This enables us to analyze and compare gender-based differences in the classification of voice disorders.

Feature extraction is a crucial step in machine learning tasks, and it holds particular significance in the classification of voice disorders due to the small dataset size. We utilize the pre-trained VGGish model [41] to extract 128-dimensional high-level embedding features from the logarithmic mel spectrogram of the voice data. As the name indicates, the VGGish network takes inspiration from the well-known VGG network and is adapted for audio classification. This model was trained on a large audio dataset, which was a preliminary version of the YouTube-8M dataset. These embeddings are then utilized as input for machine learning classifiers.

Previous studies have employed transformer-based models like wav2vec and HuBERT for extracting audio embeddings. While these models perform well in general speech tasks, we found that VGGish, a CNN-based model, delivers better results for the classification of voice disorders. Consequently, our approach outperforms transformer-based models in this domain.

Our dataset is imbalanced, which mirrors the distribution often seen in real-world applications, where certain voice disorders are less common. This imbalance presents challenges in accurately classifying minority classes. To overcome this issue, we employed ensemble classifiers that combine the strengths of multiple models, improving performance on minority classes and enhancing overall classification accuracy.

We tested three machine learning classifiers: Logistic Regression (LR), Multi-Layer Perceptron (MLP), and Support Vector Machine (SVM). We also employed an ensemble classifier (EC) using SVM, LR, and MLP with soft voting to combine the predictions of the three classifiers. This allowed us to leverage the collective insights of these diverse classifiers and improve the overall classification performance.

This study demonstrates the effectiveness of utilizing embeddings from a pre-trained VGGish model and ensemble classifiers for both binary and multi-class classification of voice disorders. Additionally, we examine the impact of gender on the classification task. Our findings are compared to popular baseline methods, providing a comprehensive evaluation of our approach. The results show that our method outperforms the baseline approaches on both binary and multi-class classification tasks, demonstrating the superiority of the proposed method.

Paper outline
The rest of the paper is structured as follows. Section "Proposed method" presents an in-depth explanation of the proposed method. Section "Dataset and experimental setup" describes the experimental setup and the voice dataset used in this study. Section "Experiments and results" provides a comprehensive overview of the experiments conducted for both binary and multi-class classification tasks and presents the results and performance metrics achieved by our approach. Section "Discussion" presents the implications of our findings. Section "Conclusion" summarizes the key points and highlights the main contributions made by our study.

Proposed method
In this paper, we propose a novel hybrid two-stage framework for voice disorder classification. In the first stage, voice data is converted into logarithmic mel spectrograms, and high-level feature embeddings are extracted from these spectrograms using the pre-trained VGGish model. In the second stage, we use classifiers, including an ensemble classifier, to classify the feature embeddings. Figure 1 provides an illustration of the proposed classification framework.
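As a minimal illustration of the first stage, the 128-dimensional embeddings can be obtained from a publicly available port of the pre-trained VGGish model [41]. The sketch below assumes the community PyTorch port published on Torch Hub (harritaylor/torchvggish) and a hypothetical local WAV file path; the hub name, call signature, and file path are assumptions rather than details taken from this paper.

```python
# Minimal sketch of stage one: 128-D VGGish embeddings from an audio file.
# Assumes the community PyTorch port of VGGish on Torch Hub
# (harritaylor/torchvggish); hub id, forward() signature, and the example
# file name are illustrative assumptions, not details from the paper.
import torch

model = torch.hub.load("harritaylor/torchvggish", "vggish")
model.eval()

with torch.no_grad():
    # The port handles resampling and log-mel framing internally and
    # returns one 128-D embedding per ~0.96 s patch of audio.
    embeddings = model.forward("example_voice_sample.wav")

print(embeddings.shape)  # e.g. (num_patches, 128)
```

In the proposed framework, these per-patch embeddings (one per 1 s clip after the preprocessing described next) become the feature vectors passed to the second-stage classifiers.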
Preprocessing and feature extraction
Before extracting features, we apply several preprocessing steps. In the original dataset, as explained in Sect. 3, voice signals were recorded at a 50 kHz sampling frequency. To align with our processing requirements, we resampled the audio to 16 kHz. The length of each audio recording in the original database is different. We trimmed the audio signals to 1 s. Audio signals that were less than 1 s were padded with zeros to ensure that all audio data had the same length.

The VGGish model takes the logarithmic mel spectrogram of an audio signal for feature extraction. To compute the mel spectrogram for each audio, we apply the Short-Time Fourier Transform (STFT) with a Hamming window lasting 25 milliseconds (ms) and a 10 ms shift. This resultant spectrogram is subsequently integrated into 64 frequency bins spaced along the Mel scale, and the magnitude of each bin is then transformed logarithmically. The configuration of the mel spectrogram draws inspiration from psychoacoustic analysis, which strives to replicate characteristics of the human auditory system. This procedure involves the application of a Mel filter bank denoted as H_m(k) to filter the spectral line energy of the audio. The purpose of these filters is outlined by the following equations.

H_m(k) =
\begin{cases}
0 & k < f(m-1) \\
\dfrac{k - f(m-1)}{f(m) - f(m-1)} & f(m-1) \le k \le f(m) \\
\dfrac{f(m+1) - k}{f(m+1) - f(m)} & f(m) < k \le f(m+1) \\
0 & k > f(m+1)
\end{cases} \quad (1)

Here, 0 ≤ m ≤ M, and M represents the count of filters. The central frequency f(m) of the filters can be written as:

f(m) = \frac{N}{f_s} \, F_{mel}^{-1}\!\left( F_{mel}(f_l) + m \, \frac{F_{mel}(f_h) - F_{mel}(f_l)}{M + 1} \right) \quad (2)

Here, f_l denotes the lowest frequency within the filter's frequency domain, f_h represents the highest frequency, N corresponds to the length of the Fourier transform, f_s stands for the sampling frequency, and F_mel signifies the Mel frequency. The transformation formula linking F_mel and the regular frequency f is given by:

F_{mel} = 2595 \log_{10}\!\left( 1 + \frac{f}{700} \right) \quad (3)

The log mel spectrogram tensor (96 × 64) is the input to VGGish. Here, 96 is the number of frames within each time scale, and 64 is the number of frequency bands. In Fig. 2, the VGGish [41] model's structure is illustrated. Batch normalization was implemented following each convolutional layer. The chosen loss function was cross-entropy, and the model employed the Adam optimizer. Dropout, weight decay, and other usual regularization methods were not utilized. This architecture was trained on a large audio dataset, which was a preliminary version of the YouTube-8M dataset. We extracted 128-dimensional high-level feature embeddings using a pre-trained VGGish model. These embeddings are then utilized as input for machine learning classifiers.
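To make the preprocessing and Eqs. (1)–(3) concrete, the sketch below reconstructs a 96 × 64 log-mel patch from a waveform using the parameters stated above (16 kHz, 1 s clips, 25 ms Hamming window, 10 ms shift, 64 mel bands). It is a simplified re-implementation for illustration only: the filter-bank edge frequencies, the log offset, the use of librosa for loading and STFT, and the example file name are assumptions, and the authors' exact pipeline (as well as the official VGGish front end) may differ in such details.

```python
import numpy as np
import librosa

FS, CLIP_LEN = 16000, 16000          # 16 kHz sampling rate, 1 s clips
N_FFT, HOP = 400, 160                # 25 ms Hamming window, 10 ms shift
N_MELS, N_FRAMES = 64, 96            # 64 mel bands, 96-frame VGGish patch
F_LOW, F_HIGH = 125.0, 7500.0        # assumed filter-bank edge frequencies

def hz_to_mel(f):
    """Eq. (3): F_mel = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_fft=N_FFT, n_mels=N_MELS, fs=FS, fl=F_LOW, fh=F_HIGH):
    """Triangular filters H_m(k) of Eqs. (1)-(2)."""
    # Eq. (2): centre frequencies equally spaced on the mel scale,
    # mapped back to FFT bin indices.
    mel_pts = np.linspace(hz_to_mel(fl), hz_to_mel(fh), n_mels + 2)
    bin_pts = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    H = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        f_prev, f_c, f_next = bin_pts[m - 1], bin_pts[m], bin_pts[m + 1]
        for k in range(f_prev, f_c + 1):          # rising slope of Eq. (1)
            if f_c > f_prev:
                H[m - 1, k] = (k - f_prev) / (f_c - f_prev)
        for k in range(f_c + 1, f_next + 1):      # falling slope of Eq. (1)
            if f_next > f_c:
                H[m - 1, k] = (f_next - k) / (f_next - f_c)
    return H

def log_mel_patch(path):
    """Resample, trim/pad to 1 s, and return a (96, 64) log-mel patch."""
    y, _ = librosa.load(path, sr=FS)                           # resample to 16 kHz
    y = np.pad(y, (0, max(0, CLIP_LEN - len(y))))[:CLIP_LEN]   # pad/trim to 1 s
    spec = np.abs(librosa.stft(y, n_fft=N_FFT, hop_length=HOP,
                               win_length=N_FFT, window="hamming")) ** 2
    mel = mel_filter_bank() @ spec                # apply H_m(k) to spectral energy
    log_mel = np.log(mel + 1e-6)                  # logarithmic compression
    return log_mel.T[:N_FRAMES]                   # keep the first 96 frames

if __name__ == "__main__":
    patch = log_mel_patch("example_voice_sample.wav")
    print(patch.shape)                            # expected: (96, 64)
```

Note that this is only an approximation of the official VGGish front end; in practice the pre-trained model's bundled preprocessing should be preferred so that the inputs match the published weights.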
Classifiers
We evaluated the performance of three classifiers: SVM,
LR, and MLP. The SVM classifies audio signals by map-
ping high-dimensional VGGish features into a new space
using a kernel, allowing it to create a nonlinear decision
boundary. LR assigns weight coefficients to features and
makes predictions based on probability scores. The MLP
model learns hierarchical representations through its
hidden layers, capturing complex patterns in the VGGish
embeddings.
In addition, we utilized an ensemble classifier (EC) that
incorporated SVM, LR, and MLP. Figure 3 shows the
EC model. Instead of relying on a single model, the EC
combines the predictions of multiple models to improve
accuracy and reduce the risk of overfitting. Soft voting
was employed to combine the predictions of these three
classifiers. Soft voting is an ensemble strategy that combines the predictions of multiple classifiers by averaging their predicted class probabilities.
Fig. 2 VGGish model architecture
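A minimal sketch of such a soft-voting ensemble over the VGGish embeddings is shown below, using scikit-learn; the specific hyperparameters and the random stand-in data are illustrative assumptions, not the values used in the paper.

```python
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data: 200 clips x 128-D VGGish embeddings, 3 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 128))
y = rng.integers(0, 3, size=200)

# probability=True is required so the SVM can contribute class
# probabilities to the soft vote.
svm = make_pipeline(StandardScaler(), SVC(probability=True))
lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
mlp = make_pipeline(StandardScaler(), MLPClassifier(max_iter=1000))

# Soft voting: average the three classifiers' predicted probabilities
# and pick the class with the highest mean probability.
ensemble = VotingClassifier(
    estimators=[("svm", svm), ("lr", lr), ("mlp", mlp)],
    voting="soft",
)
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))
```

With voting="soft", the ensemble averages the per-class probabilities of SVM, LR, and MLP, so a confident minority-class prediction from one model can outweigh two lukewarm majority-class votes.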
are commonly found voice disorders [44]. By choosing these specific disorders and matching the number of recordings used in the previous study [45], we were able to directly compare our experimental results with existing research. This approach allowed for a more robust and fair evaluation of our findings.

Training and testing
To train the classifiers, we used 5-fold cross-validation. In each iteration, we held out one fold for evaluation and used the remaining folds for training. All samples from each speaker were consistently placed within a single fold to prevent the model from learning to classify voice samples based on speaker identity. We computed performance metrics based on the predictions generated for the evaluation fold. The evaluation metrics include mean accuracy and F1 score, as well as mean precision, recall, and F1 score for each class.

The dataset used in this study is imbalanced, which can be a problem for machine learning models, as they can learn to favor the majority classes and ignore the minority classes. To address this issue, we balanced the training set by oversampling the minority classes. This involved duplicating samples from the minority classes so that each class had an equal number of samples in the training set, which helps prevent the model from overfitting to the majority classes. We also applied StandardScaler to all feature embeddings to ensure that all features were on the same scale, thereby improving the performance of the machine learning models.
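The evaluation protocol described above can be sketched as follows with scikit-learn and imbalanced-learn: speaker-grouped 5-fold splits, oversampling of minority classes on the training folds only, standardization, and per-fold metrics. The random stand-in data, speaker IDs, and the choice of RandomOverSampler are illustrative assumptions; the paper only states that minority-class samples were duplicated.

```python
import numpy as np
from imblearn.over_sampling import RandomOverSampler
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GroupKFold
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data: 128-D embeddings, labels, and a speaker ID per clip.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 128))
y = rng.integers(0, 3, size=300)
speakers = rng.integers(0, 60, size=300)

accs, f1s = [], []
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=speakers):
    X_tr, y_tr = X[train_idx], y[train_idx]
    X_te, y_te = X[test_idx], y[test_idx]

    # Duplicate minority-class samples in the training folds only.
    X_tr, y_tr = RandomOverSampler(random_state=0).fit_resample(X_tr, y_tr)

    # Fit the scaler on training data and apply it to both splits.
    scaler = StandardScaler().fit(X_tr)
    clf = SVC().fit(scaler.transform(X_tr), y_tr)

    y_pred = clf.predict(scaler.transform(X_te))
    accs.append(accuracy_score(y_te, y_pred))
    f1s.append(f1_score(y_te, y_pred, average="macro"))

print(f"accuracy {np.mean(accs):.4f} ± {np.std(accs):.4f}, "
      f"macro F1 {np.mean(f1s):.4f}")
```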
Experiments and results
This section presents a comprehensive overview of all the experiments conducted to compare the performance of the proposed voice disorders classification framework to other state-of-the-art methods. The first two experiments address binary classification problems: healthy vs. disordered and vocal fold paresis vs. hyperfunctional dysphonia. The third experiment is a multi-class classification problem.

Healthy vs. disordered
As shown in Table 1, we have three classes: healthy, vocal fold paresis and hyperfunctional dysphonia. For this experiment, hyperfunctional dysphonia and vocal fold paresis are combined into a single class. Separate experiments have been conducted for male speakers, female speakers, and both genders combined. Table 2 shows the mean accuracy and F1 score, as well as the precision, recall, and F1 score of each class for male and female speakers. For male speakers, VGGish-SVM achieved the highest accuracy, closely followed by VGGish-EC. VGGish-SVM achieved an accuracy of 82.45%, while VGGish-EC reached 80.25%. Our method outperforms the approach presented in [45], as shown in Table 2, which uses SVM as a classifier and features extracted with wav2vec and HuBERT models, as well as SVM with MFCC and MFCC-glottal features.

For female speakers, VGGish-EC achieved the highest accuracy with 71.54%, followed closely by VGGish-SVM, VGGish-MLP and VGGish-LR with accuracies of 70.03%, 68.36%, and 66.31%, respectively. It is worth noting that this is the only case where our model demonstrates a slightly lower accuracy compared to the existing method [45].
Table 2 Performance metrics for the binary classification task of healthy vs. disordered for male and female speakers
Gender Model Accuracy F1 Score PR 0 RE 0 F1 0 PR 1 RE 1 F1 1
Male VGGish-SVM 82.45 ± 2.77 82.99 0.91 0.87 0.89 0.54 0.64 0.58
VGGish-LR 75.35 ± 4.30 75.45 0.85 0.84 0.84 0.41 0.41 0.40
VGGish-MLP 77.09 ± 5.75 76.96 0.86 0.86 0.86 0.43 0.42 0.42
VGGish-EC 80.25 ± 5.70 79.66 0.86 0.89 0.88 0.51 0.44 0.47
wav2vec-SVM [45] 75.65 ± 5.81 - 0.91 0.82 0.87 0.50 0.69 0.58
MFCC-glottal-SVM [45] 74.48 ± 5.85 - 0.90 0.84 0.87 0.51 0.64 0.57
MFCC-SVM [45] 72.02 ± 7.75 - 0.89 0.88 0.88 0.54 0.56 0.55
HuBERT-SVM [45] 72.14 ± 7.93 - 0.89 0.85 0.87 0.50 0.59 0.54
Female VGGish-SVM 70.03 ± 3.07 70.05 0.79 0.77 0.77 0.53 0.57 0.54
VGGish-LR 66.31 ± 4.86 66.68 0.77 0.72 0.74 0.48 0.55 0.51
VGGish-MLP 68.36 ± 3.76 68.11 0.76 0.78 0.77 0.51 0.49 0.50
VGGish-EC 71.54 ± 4.13 71.83 0.80 0.76 0.78 0.56 0.62 0.58
wav2vec-SVM [45] 73.80 ± 5.03 - 0.84 0.77 0.80 0.60 0.71 0.65
MFCC-glottal-SVM [45] 66.13 ± 3.11 - 0.80 0.66 0.72 0.49 0.66 0.56
MFCC-SVM [45] 68.15 ± 4.59 - 0.81 0.68 0.74 0.51 0.68 0.58
HuBERT-SVM [45] 74.50 ± 4.38 - 0.85 0.76 0.81 0.60 0.72 0.65
In the metric names, ‘0’ corresponds to the healthy class, and ‘1’ represents the disordered class. PR, RE and F1 represent Precision, Recall and F1 score, respectively. The mean values over folds are presented for all metrics. The highest accuracy is indicated in bold. Additionally, standard deviations for accuracy are provided.
Table 3 Performance metrics for the binary classification task of healthy vs. disordered for male and female speakers combined
Gender Model Accuracy F1 Score PR 0 RE 0 F1 0 PR 1 RE 1 F1 1
Male & Female VGGish-SVM 73.35 ± 3.32 72.95 0.81 0.83 0.82 0.53 0.49 0.51
VGGish-LR 70.05 ± 3.08 70.80 0.82 0.75 0.78 0.48 0.58 0.52
VGGish-MLP 73.35 ± 3.93 73.57 0.82 0.80 0.81 0.53 0.56 0.54
VGGish-EC 73.84 ± 2.83 73.92 0.82 0.81 0.82 0.54 0.55 0.54
In the metric names, ‘0’ corresponds to the healthy class, and ‘1’ represents the disordered class. PR, RE and F1 represent Precision, Recall and F1 score, respectively. The mean values over folds are presented for all metrics. The highest accuracy is indicated in bold. Additionally, standard deviations for accuracy are provided.
Fig. 4 Normalized confusion matrix for healthy vs. disordered. The predicted classes are represented on the horizontal axis, while the true classes are
represented on the vertical axis. Class labels: 0 for healthy and 1 for disordered
This method attains its highest accuracy of 74.50% using HuBERT-SVM.

Experiments were also conducted with male and female speakers combined. The mean accuracy, F1 score, precision and recall are shown in Table 3. The results demonstrate that VGGish-EC achieved the highest overall accuracy and F1 score, with values of 73.84% and 73.92%, respectively. It was closely followed by VGGish-MLP, VGGish-SVM, and then VGGish-LR in terms of accuracy. This study's results on combined male and female speakers cannot be directly compared to those of any other study because of the differences in the datasets and disorders studied. Figure 4 presents the normalized confusion matrices for each classifier and gender.

Hyperfunctional dysphonia vs. vocal fold paresis
To classify hyperfunctional dysphonia and vocal fold paresis, we used the same classification setup for male, female, and combined gender speakers. The mean accuracy, F1 score, precision, and recall for male and female speakers are presented in Table 4. VGGish-SVM achieved the highest accuracy (75.45%) for male speakers, while VGGish-EC achieved 71.82%. For female speakers, VGGish-EC attained the highest accuracy at 68.42%, closely followed by VGGish-SVM, VGGish-MLP, and VGGish-LR, with respective accuracies of 68.37%, 64.97%, and 62.11%. Our method outperforms the approach presented in [45]. For male speakers, their highest accuracy was 71.95%, while for female speakers, their best accuracy was 63.06%, achieved with wav2vec-SVM.

Table 5 presents the mean accuracy, F1 score, precision, and recall for male and female speakers combined. VGGish-SVM achieved the highest overall accuracy of 68.80% and F1 score of 67.64%, followed by VGGish-EC with an accuracy of 67.10% and F1 score of 66.39%. These results cannot be directly compared to previous studies because of the differences in the datasets and disorders studied.
Table 4 Performance metrics for the binary classification task of hyperfunctional dysphonia and vocal fold paresis for male and female
speakers
Gender Model Accuracy F1 Score PR 0 RE 0 F1 0 PR 1 RE 1 F1 1
Male VGGish-SVM 75.45 ± 6.24 74.64 0.75 0.85 0.79 0.81 0.64 0.69
VGGish-LR 66.52 ± 7.24 64.92 0.68 0.75 0.69 0.70 0.56 0.59
VGGish-MLP 71.66 ± 11.02 70.97 0.72 0.81 0.75 0.74 0.60 0.65
VGGish-EC 71.82 ± 7.12 70.89 0.73 0.81 0.75 0.75 0.60 0.65
wav2vec-SVM [45] 71.95 ± 12.62 – 0.74 0.75 0.74 0.67 0.66 0.66
MFCC-glottal-SVM [45] 69.05 ± 9.67 – 0.73 0.74 0.73 0.66 0.64 0.65
MFCC-SVM [45] 61.60 ± 8.86 – 0.65 0.76 0.70 0.60 0.47 0.53
HuBERT-SVM [45] 71.88 ± 10.56 – 0.73 0.80 0.76 0.70 0.62 0.66
Female VGGish-SVM 68.37 ± 6.61 67.66 0.74 0.80 0.77 0.56 0.47 0.51
VGGish-LR 62.11 ± 7.77 61.83 0.70 0.73 0.71 0.46 0.42 0.44
VGGish-MLP 64.97 ± 3.51 64.96 0.74 0.72 0.73 0.50 0.52 0.50
VGGish-EC 68.42 ± 6.39 68.08 0.75 0.77 0.76 0.54 0.52 0.53
wav2vec-SVM [45] 63.06 ± 6.77 – 0.74 0.83 0.78 0.57 0.44 0.50
MFCC-glottal-SVM [45] 59.96 ± 7.91 – 0.72 0.81 0.76 0.52 0.40 0.45
MFCC-SVM [45] 57.09 ± 7.48 – 0.71 0.74 0.72 0.45 0.42 0.43
HuBERT-SVM [45] 61.31 ± 5.94 – 0.73 0.78 0.75 0.52 0.45 0.48
In the metric names, ‘0’ represents the hyperfunctional dysphonia class and ‘1’ represents vocal fold paresis. PR, RE and F1 represent Precision, Recall and F1 score, respectively. The mean values over folds are presented for all metrics. The highest accuracy is indicated in bold. Additionally, standard deviations for accuracy are provided.
Table 5 Performance metrics for the binary classification task of hyperfunctional dysphonia and vocal fold paresis for male and female
speakers combined
Gender Model Accuracy F1 Score PR 0 RE 0 F1 0 PR 1 RE 1 F1 1
Male & Female VGGish-SVM 68.80 ± 6.79 67.64 0.72 0.82 0.77 0.60 0.46 0.52
VGGish-LR 63.20 ± 2.40 63.03 0.72 0.70 0.70 0.50 0.52 0.50
VGGish-MLP 65.37 ± 3.08 65.27 0.73 0.73 0.72 0.53 0.53 0.53
VGGish-EC 67.10 ± 3.93 66.39 0.72 0.78 0.75 0.57 0.48 0.52
In the metric names, ‘0’ represents the hyperfunctional dysphonia class and ‘1’ represents vocal fold paresis. PR, RE and F1 represent Precision, Recall and F1 score, respectively. The mean values over folds are presented for all metrics. The highest accuracy is indicated in bold. Additionally, standard deviations for accuracy are provided.
Fig. 5 Normalized confusion matrix for hyperfunctional dysphonia vs. vocal fold paresis. The predicted classes are represented on the horizontal axis,
while the true classes are represented on the vertical axis. Class labels: 0 for hyperfunctional dysphonia and 1 for vocal fold paresis
classification was 82.45% for male speakers and 71.54% for female speakers. Similarly, the highest accuracy for hyperfunctional dysphonia vs. vocal fold paresis classification was 75.45% for male speakers and 68.42% for female speakers. In the multi-class classification scenario, the accuracy differences between male and female speakers continued similar trends. For male speakers, our model achieved an impressive accuracy of 77.81%; however, for female speakers, the highest accuracy observed was 63.11%. It is important to highlight that the binary classification of healthy vs. disordered voices for female speakers stands as the only case where our model exhibited a slightly lower accuracy compared to the results reported in [45].

VGGish-SVM achieved the highest accuracy for male speakers and VGGish-EC for female speakers in both binary classification tasks (i.e., healthy vs. disordered and hyperfunctional dysphonia vs. vocal fold paresis). In multi-class classification, VGGish-SVM performed better for both genders. However, while VGGish-EC achieved a lower overall accuracy than VGGish-SVM in multi-class classification, it outperformed VGGish-SVM on minority classes. For example, for male speakers, the precision and recall for hyperfunctional dysphonia with VGGish-SVM were 0.20 and 0.27, respectively, while with VGGish-EC, the precision and recall were 0.23 and 0.44, respectively. Similarly, VGGish-EC performed better for vocal fold paresis. The same trend was observed for female speakers. In multi-class classification, for male speakers, the lowest F1 score is recorded for hyperfunctional dysphonia, while for female speakers, the lowest F1 score is observed for vocal fold paresis. These classes presented particular challenges in terms of accuracy, probably because of the smaller number of samples available for these classes. This underlines the importance of addressing data imbalance in future research to further enhance classification performance.

As part of our future work, we plan to incorporate explainability techniques such as LIME, SHAP, and Grad-CAM. These methods will enable us to better understand the contribution of different features in the classification process and provide visual insights into the regions of the spectrograms that are most influential in decision-making. This will help build trust in the model's predictions and facilitate its integration into diagnostic workflows.

This study demonstrates the efficacy of hybrid frameworks for voice disorder classification using controlled datasets. However, it does not evaluate real-time performance, which is a critical factor for clinical deployment. Furthermore, the computational demands of the VGGish feature extractor and classifier pipeline may introduce latency in unoptimized implementations. Future work will focus on optimizing the framework for low-latency inference (e.g., via model lightweighting and edge-device deployment).
Table 7 Multi-class classification performance metrics for male and female speakers combined
Gender Model Accuracy F1 Score PR 0 RE 0 F1 0 PR 1 RE 1 F1 1 PR 2 RE 2 F1 2
Male & Female VGGish-SVM 70.53 ± 3.22 69.53 0.81 0.83 0.82 0.38 0.36 0.37 0.48 0.39 0.41
VGGish-LR 61.00 ± 6.29 63.52 0.83 0.67 0.74 0.30 0.43 0.35 0.34 0.52 0.40
VGGish-MLP 67.85 ± 2.65 67.80 0.80 0.80 0.80 0.38 0.37 0.37 0.36 0.35 0.35
VGGish-EC 68.34 ± 3.45 68.52 0.81 0.80 0.80 0.38 0.38 0.38 0.40 0.41 0.40
In the metric names, ‘0’ represents the healthy class, ‘1’ represents hyperfunctional dysphonia, and ‘2’ represents vocal fold paresis. PR, RE and F1 represent Precision, Recall and F1 score, respectively. The mean values over folds are presented for all metrics. The highest accuracy is indicated in bold. Additionally, standard deviations for accuracy are provided.
Fig. 6 Normalized confusion matrix for multi-class classification. The predicted classes are represented on the horizontal axis, while the true classes are
represented on the vertical axis. Class labels: 0 for healthy, 1 for hyperfunctional dysphonia, and 2 for vocal fold paresis
Future work will also include validating the framework's performance on streaming audio data acquired in clinical or telehealth settings.

Conclusion
In this paper, we proposed a two-stage hybrid framework for voice disorders classification. In the first stage, we utilized a pre-trained VGGish model to extract high-level feature embeddings from the log-mel spectrograms of voice data. In the second stage, we evaluated four classifiers: support vector machine (SVM), logistic regression (LR), multilayer perceptron (MLP), and an ensemble classifier.

The results of our study demonstrate the potential of using a pre-trained VGGish model to extract features for voice disorders classification. We achieved state-of-the-art results on the SVD dataset, outperforming the baseline systems that used MFCC features, MFCC-glottal features, as well as features extracted with pre-trained wav2vec and HuBERT models. Compared to the best baseline accuracy, we improved by 6.8% for male speakers in the healthy vs. disordered task, and by 3.5% and 5.36% for male and female speakers, respectively, in the hyperfunctional dysphonia vs. vocal fold paresis task. In the context of multi-class classification, our method significantly outperformed the baseline, achieving a 15.04% improvement for male speakers and a 7.75% improvement for female speakers.

While our model excelled in most scenarios, there was a slight exception. In the healthy vs. disordered task for female speakers, our model demonstrated an accuracy that was 2.96% lower when compared to the baseline. The accuracies for the combined dataset of male and female speakers are also promising in all three scenarios. It is important to note that these combined results cannot be directly compared to existing studies because of variations in the datasets and the types of voice disorders investigated.

In binary classification, VGGish-SVM exhibited the highest accuracy for male speakers, while VGGish-EC performed best for female speakers. However, in multi-class classification, VGGish-SVM outperformed other models for both genders. Notably, VGGish-EC demonstrated its strength in handling minority classes, an important aspect of medical applications. The results confirm that VGGish-EC provides more balanced accuracy by giving importance to the minority classes. Although we used oversampling to balance the classes, the accuracy of minority classes remains comparatively lower. Future research will focus on improving the robustness and generalizability of the proposed two-stage hybrid framework for voice disorders classification. Additionally, expanding the dataset to include a more diverse and broader range of voice disorders will
29. Korkmaz Y. SS-ESC: a spectral subtraction denoising based deep network model on environmental sound classification. Signal Image Video Process. 2024;19:50. https://doi.org/10.1007/s11760-024-03649-5.
30. Korkmaz Y, Boyacı A. Classification of Turkish vowels based on formant frequencies. In: 2018 International conference on artificial intelligence and data processing (IDAP). 2018. pp. 1–4. https://doi.org/10.1109/IDAP.2018.8620877.
31. Hernandez A, Pérez-Toro PA, Nöth E, Orozco-Arroyave JR, Maier A, Yang SH. Cross-lingual self-supervised speech representations for improved dysarthric speech recognition. arXiv preprint arXiv:2204.01670. 2022. https://doi.org/10.21437/interspeech.2022-10674.
32. Rahman MU, Direkoglu C. Multi-class classification of voice disorders using deep transfer learning. In: Computing, internet of things and data analytics. ICCIDA 2023. Studies in computational intelligence, vol. 1145. Cham, Switzerland: Springer; 2024. https://doi.org/10.1007/978-3-031-53717-2_25.
33. Mallela J, Illa A, Suhas B, Udupa S, Belur Y, Atchayaram N, Yadav R, Reddy P, Gope D, Ghosh PK. Voice based classification of patients with amyotrophic lateral sclerosis, Parkinson's disease and healthy controls with CNN-LSTM using transfer learning. In: ICASSP 2020–2020 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE; 2020. pp. 6784–88. https://doi.org/10.1109/icassp40776.2020.9053682.
34. Zaman K, Sah M, Direkoglu C, Unoki M. A survey of audio classification using deep learning. IEEE Access. 2023;11:106620–49. https://doi.org/10.1109/ACCESS.2023.3318015.
35. Peng X, Xu H, Liu J, Wang J, He C. Voice disorder classification using convolutional neural network based on deep transfer learning. Sci Rep. 2023;13(1):7264. https://doi.org/10.1038/s41598-023-34461-9.
36. Violeta LP, Huang WC, Toda T. Investigating self-supervised pretraining frameworks for pathological speech recognition. In: Interspeech 2022. 2022. https://doi.org/10.21437/interspeech.2022-10043.
37. Zhu Y, Liang X, Batsis JA, Roth RM. Domain-aware intermediate pretraining for dementia detection with limited data. In: Interspeech 2022. 2022. https://doi.org/10.21437/interspeech.2022-10862.
38. Zhu Y, Obyat A, Liang X, Batsis JA, Roth RM. WavBERT: exploiting semantic and non-semantic speech using Wav2vec and BERT for dementia detection. In: Interspeech 2021. 2021. https://doi.org/10.21437/interspeech.2021-332.
39. Karaman O, Çakın H, Alhudhaif A, Polat K. Robust automated Parkinson disease detection based on voice signals with transfer learning. Expert Syst Appl. 2021;178:115013. https://doi.org/10.1016/j.eswa.2021.115013.
40. Sekhar SRM, Kashyap G, Bhansali A, Abishek AA, Singh K. Dysarthric-speech detection using transfer learning with convolutional neural networks. ICT Express. 2022;8(1):61–64. https://doi.org/10.1016/j.icte.2021.07.004.
41. Hershey S, Chaudhuri S, Ellis DP, Gemmeke JF, Jansen A, Moore RC, Plakal M, Platt D, Saurous RA, Seybold B, et al. CNN architectures for large-scale audio classification. In: 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE; 2017. pp. 131–35. https://doi.org/10.1109/icassp.2017.7952132.
42. Pützer M, Barry WJ. Saarbrücken Voice Database. Institute of Phonetics, University of Saarland. http://www.stimmdatenbank.coli.uni-saarland.de/.
43. Pützer M, Barry WJ. Instrumental dimensioning of normal and pathological phonation using acoustic measurements. Clin Linguist Phon. 2008;22(6):407–20. https://doi.org/10.1080/02699200701830869.
44. Walton C, Conway E, Blackshaw H, Carding P. Unilateral vocal fold paralysis: a systematic review of speech-language pathology management. J Voice. 2017;31(4):509.e7–509.e22. https://doi.org/10.1016/j.jvoice.2016.11.002.
45. Tirronen S, Kadiri SR, Alku P. Hierarchical multi-class classification of voice disorders using self-supervised models and glottal features. IEEE Open J Signal Process. 2023;4:80–88. https://doi.org/10.1109/ojsp.2023.3242862.
46. Amara F, Fezari M, Bourouba H. An improved GMM-SVM system based on distance metric for voice pathology detection. Appl Math Inf Sci. 2016;10(3):1061–70. https://doi.org/10.18576/amis/100324.
47. Laguarta J, Hueto F, Subirana B. COVID-19 artificial intelligence diagnosis using only cough recordings. IEEE Open J Eng Med Biol. 2020;1:275–81. https://doi.org/10.1109/ojemb.2020.3026928.
48. Rejaibi E, Komaty A, Meriaudeau F, Agrebi S, Othmani A. MFCC-based recurrent neural network for automatic clinical depression recognition and assessment from speech. Biomed Signal Process Control. 2022;71:103107. https://doi.org/10.1016/j.bspc.2021.103107.

Publisher's Note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.