1 Introduction

Speech disorders encompass a range of conditions that disrupt an individual’s ability to articulate sounds and form words coherently [1]. Dysarthria is a disorder that significantly influences an individual’s lifestyle and interactions. Originating from the Greek words “dys” (difficult) and “arthron” (joint) [2], it is characterized by muscle disturbances affecting articulation, respiration, phonation, and prosody, leading to symptoms like slurred speech and imprecise articulation [3, 4]. This disorder can be congenital or arise from neurological abnormalities affecting speech-related muscle neural pathways [5]. Dysarthria has multiple types, each tied to specific neurological conditions, making accurate classification vital for personalized treatment and interventions [6].

Over the past decade, researchers have made significant strides in dysarthria classification and severity assessment. They have identified several speech features, such as speech rate, pitch variation, articulation precision, and phonation quality, that can aid in distinguishing different types and severity levels of dysarthria [7,8,9,10]. However, despite these advancements, there are still gaps in understanding which features are most promising for precisely classifying and characterizing dysarthria severity. Another prevailing challenge is that many proposed techniques lean heavily on deep learning, which necessitates large amounts of labeled data. Providing such data is time-consuming and requires significant therapist effort, making it a substantial bottleneck in advancing the field. In addition, understanding the different sound patterns in speech disorders is difficult for speech therapists because of the subtlety of the differences in voice spectral characteristics among varying severity levels of dysarthria. With each severity level, from very low to medium and beyond, the voice exhibits distinct spectral patterns that deviate increasingly from what is considered normal. While these variations provide crucial insights into the disorder's progression, they also introduce complexities in the annotation process. The task requires not just keen auditory discernment but also a comprehensive understanding of the spectral representations, as shown in Fig. 1.

Fig. 1 Comparative analysis of voice waveforms and MFCC representations across normal and dysarthric male cases

This figure shows the audio waveforms and corresponding Mel-frequency cepstral coefficients (MFCCs) for the sentence 'Well, he is nearly ninety-three years old' for four participants: one control case and three dysarthric cases, one from each severity level (M03: very low, M05: low, and M01: medium). The topmost row depicts a normal male case, with a clear and consistent waveform alongside a well-defined MFCC representation. Subsequent rows display dysarthric male cases with increasing severity levels: very low, low, and medium. As the severity level increases, the waveform becomes more fragmented and less uniform. This is mirrored in the MFCC plots, where the color patterns become more dispersed and less consistent as severity intensifies. The variations in the MFCC representations across different severity levels may be indicative of changes in the spectral characteristics of the voice due to dysarthria. An AI model that can differentiate these levels precisely based on specific features would be a considerable asset to speech therapists in their work. This paper proposes a groundbreaking two-stage technique to address the prevailing challenges in dysarthria classification based on severity levels. Drawing inspiration from the potent capabilities of clustering, a form of unsupervised learning, we embark on a data-driven exploration of dysarthria cases. Clustering has proven instrumental in the medical domain, especially where data exploration is unguided by predefined labels. From discerning disease subtypes, as seen in the molecular subtyping of cancers [11], to patient stratification for personalized medical strategies, clustering unfurls latent patterns often overlooked by the human eye, leading to transformative insights [12].

The contributions of our paper are as follows:

  1. Two-stage technique: We introduce a pioneering two-stage approach, where the first stage focuses on binary classification, differentiating between control and dysarthria cases. This involves deploying classifiers like the Support Vector Machine (SVM) and Artificial Neural Network (ANN) combined with 8 different feature extraction techniques.

  2. Clustering for severity-level differentiation: The second stage of our methodology focuses specifically on dysarthria cases, leveraging the K-means clustering algorithm. This unsupervised learning application aids in discerning patterns within datasets, enabling precise differentiation of dysarthria severity levels.

  3. Data-driven insights: Through clustering, our approach facilitates a deeper understanding of dysarthria cases, unraveling intricate patterns that can guide more effective and targeted interventions.

  4. Comprehensive data exploration: By segmenting the data by stimulus type, restricted sentences and isolated words, into two separate datasets, our methodology ensures thoroughness, so that the derived insights are both holistic and actionable.

The remainder of the paper is structured as follows: Sect. 2 reviews the existing studies in this field. Section 3 details our methods, including the explanation of datasets, feature selection, and the presentation of our methodology. Section 4 presents the results for each stage and dataset. Finally, Sect. 5 discusses and analyzes the results, and Sect. 6 provides the conclusion.

2 Related work

Dysarthria has garnered significant attention from researchers, with many focusing on automating its severity classification for more objective and efficient assessment [13]. The drive towards automation has primarily revolved around two significant avenues: feature extraction and the application of machine learning models.

Joshy and Rajan [14] laid the foundational groundwork by exploring the potential of deep learning architectures, including the Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM), for dysarthria severity classification, primarily relying on MFCC features. Their findings confirmed the superiority of CNN and Deep Neural Network (DNN) models over LSTM-based systems in this context. Building on this foundational research, [15] extended the exploration, incorporating i-vectors and a broader spectrum of deep learning techniques. This advanced study also integrated more specific features, including prosody and articulation, into the classification framework. Their enhanced approach, particularly with the DNN classifier using MFCC-based i-vectors, yielded impressive results, further underscoring the potential of deep learning in dysarthria severity classification.

While acoustic features have been the cornerstone of most studies, there is a burgeoning interest in other feature sets. Hernandez et al. [7] underscore the significance of prosody-based measures for dysarthria severity assessment, particularly when combined with MFCC features. Moreover, [16] emphasizes rhythm-based metrics, highlighting their effectiveness in enhancing dysarthria detection and severity assessment across various languages. In the quest for innovative feature representations, [17] integrates auditory cues with MFCCs, demonstrating remarkable accuracy in speaker identification and dysarthria severity levels. This auditory-based approach exemplifies the potential for combining traditional features with novel representations.

The machine learning models applied to these features have seen varied applications. Karjigi and Sreedevi [18] employ Fisher vector encoding, revealing its superiority over temporal encoding, particularly when paired with an ANN classifier. On the other hand, [19] leans on an ANN to achieve high classification accuracies, emphasizing the promise of fusing selected feature sets. Yeo et al. [20] tackle the challenge of limited data by proposing a cross-lingual approach incorporating shared and language-specific features, using eXtreme Gradient Boosting (XGBoost) to handle potential missing data issues.

Despite the advancements, there are challenges to address. Joshy and Rajan [15] point out the potential limitation of training models with a small number of subjects per class, impacting performance in speaker-independent scenarios. Furthermore, [18] notes the potential non-standardization challenge in automated speech intelligibility assessment. The field is witnessing innovative feature extraction techniques and advanced machine learning models. The collective efforts aim to enhance the precision, objectivity, and efficiency of dysarthria severity classification, promising significant advancements in clinical evaluation and treatment planning.

3 Materials and methods

In this section, we describe the methodology of the proposed model and the datasets used for the experiments.

3.1 Dataset

We tested our model on two datasets: the TORGO dataset [21] and the UA Speech dataset [22].

The TORGO database, a collaboration involving the University of Toronto, Holland–Bloorview Kids Rehab Hospital, and the Ontario Federation for Cerebral Palsy, is a pivotal resource for dysarthric speech research. It houses data from eight dysarthric speakers (F01, F03, F04, M01, M02, M03, M04, M05) and matched controls (FC01, FC02, FC03, MC01, MC02, MC03, MC04), where 'F' denotes female, 'M' denotes male, 'C' denotes a control case, and the final two digits indicate the order in which that participant was recruited. It offers aligned acoustic and articulatory insights for individuals with cerebral palsy or amyotrophic lateral sclerosis. Subjects engaged with diverse stimuli on an LCD screen, ranging from non-words assessing articulatory control to unrestricted sentences from image descriptions. The database, enriched with data from standardized assessments, electromagnetic articulography, and 3D video sequences, aids in deciphering atypical speech production, refining ASR models, and deepening clinical explorations. Its comprehensive collection illuminates the nuances of dysarthric speech, advancing our grasp of its unique patterns and implications.

The UA Speech dataset [22] is a collection of audio and video recordings from 19 individuals with cerebral palsy and 13 healthy speakers. Each participant recorded 765 isolated words. The naming convention in the dataset includes participant IDs, where 'C' stands for healthy speakers, 'M' for males with cerebral palsy, and 'F' for females with the same condition. Notably, 16 participants (M01, M04, M05, M07, M08, M09, M10, M11, M12, M13, M14, F02, F03, F04, F05) are part of the available data, while M02, M03, and F01 are excluded as they were recorded under a different protocol. Importantly, participant M06 did not approve the redistribution of his data. The dataset comprises 300 unique uncommon words and repetitive words, making it valuable for speech analysis and recognition research, especially in individuals with cerebral palsy. It was recorded using an 8-microphone array, enabling diverse speech technology development and research applications.

For the exploration in this paper, we mainly analyze the TORGO dataset and validate the model on the UA Speech dataset. The primary reason for selecting the TORGO and UA Speech datasets is their widespread use in state-of-the-art classification techniques. Many leading studies in this domain base their experiments on these datasets, making them essential for ensuring a fair and meaningful comparison of our model’s performance. Additionally, the TORGO dataset was chosen due to its diversity in stimuli types, including words and sentences, enhancing our model’s training and generalization capabilities. The UA Speech dataset was selected for its adequate size and the availability of comparable research, making it an ideal choice for evaluating generalizability.

3.2 Feature selection

In dysarthric speech analysis, selecting a diverse set of features is crucial to comprehensively capture speech's acoustic and temporal characteristics. We chose the following features due to their ability to highlight various aspects of speech that are affected by dysarthria: Mel-Frequency Cepstral Coefficients (MFCC), Constant-Q Transform (CQT), Chromagrams, Tonnetz coefficients, Zero Crossing Rate (ZCR), Spectral Roll-off, Spectral Contrast, and Tempo-related Features. This selection of features was guided by a comprehensive review of existing literature on audio feature extraction techniques. Our goal was to incorporate a wide array of features to capture various characteristics of audio signals, ranging from basic to more sophisticated representations. These features collectively capture essential information such as spectral characteristics, harmonic structures, tonal content, pitch modulation, signal polarity changes, frequency distribution, energy variations, and timing irregularities. By incorporating this diverse set of features, we aim to ensure that our model can effectively differentiate between control and dysarthric speech by analyzing a wide range of speech properties, thereby enhancing the accuracy and robustness of our classification. Additionally, evaluating different feature types helps filter the data so that the largest possible number of dysarthric samples is retained for Stage 2.

To extract these features, we utilized the Librosa library, a Python package for music and audio analysis. Librosa provides a comprehensive set of tools to efficiently extract and manipulate audio features. It facilitated the extraction of all the mentioned features, ensuring consistency and reliability in our analysis process.
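The following is a minimal sketch of how these feature groups can be extracted with Librosa; the mean pooling over time, coefficient counts, and sampling rate are illustrative assumptions rather than the exact settings of our pipeline.

```python
import numpy as np
import librosa

def extract_features(path, sr=16000, n_mfcc=13):
    """Extract the eight feature groups from one audio file (illustrative sketch)."""
    y, sr = librosa.load(path, sr=sr)
    feats = {
        "mfcc": librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc),
        "cqt": np.abs(librosa.cqt(y=y, sr=sr)),
        "chroma": librosa.feature.chroma_stft(y=y, sr=sr),
        "tonnetz": librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr),
        "zcr": librosa.feature.zero_crossing_rate(y),
        "rolloff": librosa.feature.spectral_rolloff(y=y, sr=sr),
        "contrast": librosa.feature.spectral_contrast(y=y, sr=sr),
        "tempo": librosa.beat.tempo(y=y, sr=sr),   # tempo-related feature
    }
    # Mean-pool each feature over time to obtain a fixed-length vector per group.
    return {name: np.mean(f, axis=-1) for name, f in feats.items()}
```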

The features we extracted from both the sentence-based and word-based groups of the TORGO dataset are summarized in Table 1.

Table 1 Summary of feature extraction techniques

3.3 Methodology

Our methodology involves two primary stages: binary classification and clustering-based multi-class classification. We conducted our experiments exclusively on the TORGO dataset for the binary classification part. The data was split into separate stimuli types, specifically restricted sentences and words; Fig. 2 below shows the procedure of our model. Each type passes through the same procedure as follows:

Fig. 2 Proposed two-stage model architecture for clustering-based dysarthria severity classification

3.3.1 Stage-1: Binary classification

We employed a binary classification paradigm at this stage to segregate data into control and dysarthria categories. For this purpose, two prominent classifiers were adopted: the Support Vector Machine (SVM) and the Artificial Neural Network (ANN). During this phase, we experimented with a multitude of feature extraction methodologies.

Support vector machine (SVM) SVM is a powerful supervised machine learning algorithm known for its ability to generalize effectively. It works by finding an optimal hyperplane to separate data into classes but can handle non-linear data using the ‘Kernel Trick’ [23,24,25]. SVM plays a crucial role in medical applications, including disease detection and speech disorder diagnosis, by effectively processing complex, high-dimensional data [17, 26, 27]. It distinguishes between healthy and disordered speech, classifying varying degrees of dysarthria based on intricate speech features, ensuring accurate diagnostics and personalized treatments.

Artificial neural networks (ANN) Artificial neural networks (ANNs) are essential in machine learning [28], particularly for audio signal classification. ANNs consist of input, hidden, and output layers, with the input layer representing features extracted from data, such as spectral properties in audio classification. Hidden layers contain neurons that learn complex patterns, and their configuration varies with problem complexity. Activation functions like ReLU capture non-linear relationships. Techniques like Dropout prevent overfitting by promoting generalized learning. The output layer tailors results to the task, often using ‘softmax’ for classification, providing category probabilities [29]. In audio signal classification, ANNs stand out due to their inherent ability to manage high-dimensional, non-linear data. Their capacity to autonomously learn representative features from raw data significantly diminishes the need for extensive manual feature engineering, a frequent requirement in traditional machine learning models [30, 31]. This is particularly important given the intricate nature of the speech features in our datasets.

By employing both SVM and ANN, we aim to leverage the unique strengths of these two approaches. SVM was chosen for its ability to effectively handle high-dimensional data, especially when some feature extraction methods yield only one feature. The use of kernel functions in SVM allows for the expansion of the dimensionality of the feature space, making it particularly suitable for our task. Additionally, we included ANNs to leverage their strength in learning complex, non-linear patterns within the data. This allows us to compare the performance of a classical machine learning model (SVM) with a deep learning model (ANN), providing a comprehensive evaluation of the classification task. The combination of SVM’s robustness and ANN’s pattern recognition capability ensures a thorough analysis of control and dysarthric speech classification. We applied these classifiers to both datasets, utilizing Mel-frequency cepstral coefficients (MFCCs) as the primary feature extraction methodology. Our comparative analysis, detailed in subsequent sections, demonstrates how each model performed on these datasets, highlighting the advantages and limitations observed in the classification results.
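As an illustration of this stage, the sketch below pairs an RBF-kernel SVM with a small fully connected ANN; the layer sizes, dropout rate, and optimizer are illustrative assumptions rather than the tuned values used in our experiments.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from tensorflow import keras

# X: (n_samples, n_features) pooled feature vectors (e.g., MFCC means);
# y: binary labels, 0 = control, 1 = dysarthria.

# Classical model: RBF-kernel SVM (the kernel trick handles non-linear boundaries).
svm_clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))

# Deep model: a small fully connected ANN with ReLU, Dropout, and a softmax output.
def build_ann(n_features, n_classes=2):
    model = keras.Sequential([
        keras.layers.Input(shape=(n_features,)),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dropout(0.3),                      # regularization against overfitting
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```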

3.3.2 Stage-2: Clustering-based multi-class classification

Clustering, at its core, is an unsupervised machine-learning technique aimed at grouping data points based on inherent similarities [32]. Clustering's advantage over classification lies in its exploratory nature. It uncovers categories rather than relying on predefined ones [33]. In complex medical data, clustering is essential for identifying patient groups, understanding disease subtypes, and tailoring treatments. It leads to personalized care, optimized treatments, and insights into disease patterns [34]. Numerous clustering techniques are available, each bringing unique strengths and applications to the forefront of data analysis, for instance, Hierarchical clustering [35], DBSCAN [36], Mean-shift clustering [37], Agglomerative clustering [38], and, lastly, partitioning techniques [39] such as the K-means clustering technique.

In this study, we opted for the k-means clustering approach [40]. At its core, k-means seeks to categorize a dataset into \(\textit{k}\) distinct and non-overlapping clusters, minimizing the distance between data points within the same cluster while maximizing the distance between separate clusters. We selected this approach for the second stage owing to its systematic nature and efficiency, which make it particularly effective for discerning patterns and relationships within large datasets. We adopted two approaches for determining the optimal number of clusters: one rooted in the FDA-based number of severity levels defined for the original dataset classes, and the other leveraging AI-driven methodologies. It should be noted that the features extracted for the two stages are the same.
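For concreteness, the clustering step amounts to fitting k-means on the feature vectors of the dysarthria cases identified in Stage 1; the scaling step and random seed below are illustrative assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# X_dys: feature vectors (e.g., pooled MFCCs) of the samples classified as dysarthric in Stage 1.
X_scaled = StandardScaler().fit_transform(X_dys)

# k = 3 follows the FDA-based severity levels; the AI-driven criteria later suggest k = 2.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_ids = kmeans.fit_predict(X_scaled)
```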

We experimented with two other clustering techniques in addition to k-means: Agglomerative clustering and DBSCAN. In our ablation studies using the TORGO sentence-based dataset with two clusters, as this experiment achieved the best results, Agglomerative clustering, although it produced 2 clusters similar to k-means, significantly underperformed compared to k-means. The results were unexpected, leading to a notable decline in model performance. DBSCAN, on the other hand, also proved to be unreliable. This technique identifies clusters based on the density of data points without requiring a preset number of clusters and labels some points as noise if they do not fit into any cluster. We encountered samples labeled as noise, which could not be assigned a ground truth objectively. Additionally, DBSCAN requires careful tuning of its parameters. Despite experimenting with various values, optimizing these parameters was time-consuming. Since k-means provided satisfactory results and was easier to implement, further experimentation with DBSCAN seemed unnecessary. We decided not to incorporate the results of Agglomerative clustering and DBSCAN due to their unsatisfactory performance, and we did not compare them to k-means because such comparisons would be unfair and unreasonable. Algorithm 1 presents the pseudocode of our proposed methodology.

Algorithm 1 Two-stage dysarthria severity level classification

4 Experimental results

This section presents our experiments conducted on the TORGO dataset and explains the procedure for preparing the dataset for classification, together with the preprocessing steps. Each step is explained in detail in the following subsections.

4.1 Data pre-processing

As we mentioned when describing the TORGO dataset, it includes four types of stimuli: non-restricted sentences, non-words, restricted sentences, and isolated words. The first step is to split the data based on their prompt types. For the sake of this study, we focused on both restricted sentences and isolated words. We applied our model to the full TORGO dataset recorded by the array microphone, where the restricted-sentence group consists of 761 dysarthria cases and the word-based group consists of 2387 dysarthria cases. The second step is to prepare metadata for each group containing the path to each audio signal, its duration, the sample rate of each audio, the folder name (containing the participant's code and the corresponding sessions), and the class (normal vs. dysarthria cases). This metadata helps to keep the extracted features organized and manageable. As the final preprocessing step, the dataset was split into 80% training and 20% testing, and 10-fold cross-validation was applied to the training part to avoid overfitting.
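A minimal sketch of this split, assuming a per-file feature matrix X and binary labels y; the stratification and the classifier used for the cross-validation scores are illustrative assumptions.

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.svm import SVC

# X: per-file feature matrix; y: binary labels (0 = control, 1 = dysarthria).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)

# 10-fold cross-validation on the training split to guard against overfitting.
clf = SVC(kernel="rbf")
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(clf, X_train, y_train, cv=cv, scoring="accuracy")
print(f"Mean CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```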

4.2 Feature extraction

After preparing each dataset group (sentence-based and word-based), eight types of feature extraction were utilized, each explained in Table 1. These features were not combined; instead, we evaluated the two classification models on each feature group separately in the first stage of our model to find the features that best differentiate normal versus dysarthria cases. We support our selection of the best feature, the one providing the highest accuracy of dysarthric case detection, by calculating the importance of the features. To calculate the importance of the features, we used a Random Forest classifier [41], which is an ensemble learning method known for its robustness and effectiveness in handling various types of data. The model provides an intrinsic measure of feature importance based on how much each feature decreases the impurity of the nodes in the trees.

Fig. 3 Feature importance by type using random forest for feature selection in audio classification (MFCC features are most significant)

We derived the overall importance for each feature type by summing the importance of individual features within each feature type. This technique was chosen because Random Forests are non-parametric, handle high-dimensional data well, and clearly indicate feature importance, making them suitable for this multifaceted dataset. Figure 3 visualizes the results of this calculation.
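The per-type importances in Fig. 3 can be obtained along the following lines; the column-to-type bookkeeping and the number of trees are illustrative assumptions.

```python
from collections import defaultdict
from sklearn.ensemble import RandomForestClassifier

# X: concatenated feature matrix; y: binary labels; feature_types: the feature group of
# each column, e.g. ["mfcc"] * 13 + ["cqt"] * 84 + ...  (counts here are illustrative).
rf = RandomForestClassifier(n_estimators=500, random_state=42)
rf.fit(X, y)

# Impurity-based importance of each column, summed per feature type (as in Fig. 3).
importance_by_type = defaultdict(float)
for col_type, imp in zip(feature_types, rf.feature_importances_):
    importance_by_type[col_type] += imp

for name, total in sorted(importance_by_type.items(), key=lambda kv: -kv[1]):
    print(f"{name:10s} {total:.3f}")
```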

4.3 Results of stage 1

We extracted the previously mentioned eight feature types for each dataset group, and for each group of features, we applied the two models for binary classification. The results of this stage are shown in Table 2 for the sentence-based group and Table 3 for the word-based group, using four evaluation metrics: Accuracy, Precision, Recall, and F1-Score.

Table 2 The performance results of SVM and ANN models for a sentence-based dataset group
Table 3 The performance results of SVM and ANN models for a word-based dataset group

4.4 Results of stage 2

After applying the classification models to the different feature extraction techniques, it can be noticed from Tables 2 and 3 that the ANN model with MFCC features achieved the best classification accuracy for both the sentence-based and word-based groups. It is crucial to mention that after classifying the test part of the dataset, the dysarthria cases are isolated for clustering. Then, K-means clustering is applied under two scenarios for selecting the number of clusters:

4.4.1 Case 1: SLP-based clusters levels

According to the Frenchay dysarthria assessment (FDA) guidelines, a speech-language pathologist (SLP) assigns the TORGO dataset speakers to several intelligibility ratings based on clinical intelligibility and articulatory functionality. These ratings are commonly converted to three severity levels: very low, low, and medium. This assignment cannot be free of subjectivity [15], although many researchers in the literature rely on these levels, e.g., [14, 15, 19, 42]. In this scenario, we follow this division (three clusters), and for each data group, we apply the K-means clustering technique. The resulting clusters for each group are shown in Table 4. We can differentiate clusters as levels of severity based on the majority of participants of each severity level (as per the FDA).

Table 4 Clusters’ samples for sentence and word-based datasets when k = 3
Fig. 4 The visualization of clustering using t-SNE with only MFCC features when k = 3 for the a sentence-based group and b word-based group

To correlate the clusters with specific speech impairments, we focused on severity-classification-based clustering. In our analysis, when determining severity levels, we considered the clusters where participants' samples appeared most frequently. If samples from a participant with a known severity level appeared in multiple clusters, we assigned the severity based on the cluster with the highest representation. Misclustered samples, appearing in lower quantities, were taken into account to ensure accurate severity level designation.
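A minimal sketch of this majority-vote assignment, assuming per-sample cluster labels and FDA-based severity labels are available; the label names are illustrative.

```python
from collections import Counter

# cluster_ids: k-means label per sample; severities: FDA-based severity label per sample
# (e.g., "very_low", "low", "medium"), propagated from each speaker's rating in the literature.
def map_clusters_to_severity(cluster_ids, severities):
    """Assign each cluster the severity level most represented among its samples."""
    mapping = {}
    for c in set(cluster_ids):
        members = [s for cid, s in zip(cluster_ids, severities) if cid == c]
        mapping[c] = Counter(members).most_common(1)[0][0]
    return mapping

# Example outcome for the sentence-based group: {0: "very_low", 1: "medium", 2: "low"}.
```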

According to Table 4, comparing the samples that appeared in each cluster with the ones from the literature (based on FDA), we can differentiate among several levels as below:

  • Cluster 0 can be represented as a very low severity level due to the presence of F03 and F04.

  • Cluster 1 can be represented as a medium severity level due to the presence of M01, M02, and M04 (all the members of the medium level based on the literature).

  • Cluster 2 can be represented as a low severity level because it consists entirely of M05 (a member of the low severity level based on the literature).

The same procedure was repeated for the word-based group, and the results are shown in the same Table 4. By comparing the samples that appeared in each cluster with the ones from the literature, we can differentiate among several levels as below:

  • Cluster 0 can be represented as a “low” severity level due to the presence of M05.

  • Cluster 1 can be represented as a “very low” severity level due to the presence of F04 and M03 (all the members of the very low level based on the literature).

  • Cluster 2 can be represented as a “medium” severity level because it consists entirely of M01, M02, and M04 (members of the medium severity level based on the literature).

By labeling each audio signal with its cluster type, we calculated the accuracy, precision, recall, and F1-score for the clusters, taking the SLP-based classification levels (which are utilized by many works in the literature) as our ground truth, as given by the equations below:

$$\begin{aligned} {\text {Accuracy}}&= \frac{{\text {No. of Correctly Clustered Audio Signals}}}{{\text {Total No. of Audio Signals}}} \times 100 \end{aligned}$$
(1)
$$\begin{aligned} {\text {Precision}}&= \frac{{\text {No. of Correctly Clustered Positive Samples}}}{{\text {Total No. of Samples Clustered as Positive (True and False Positives)}}}\times 100 \end{aligned}$$
(2)
$$\begin{aligned} {\text {Recall}}&= \frac{{\text {Correctly clustered samples}}}{{\text {Correctly clustered samples}} + {\text {Incorrectly excluded samples}}} \times 100 \end{aligned}$$
(3)
$$\begin{aligned} F1&= 2 \cdot \frac{{\text {Precision}} \cdot {\text {Recall}}}{{\text {Precision}} + {\text {Recall}}} \end{aligned}$$
(4)

Figure 4 visualizes the dataset's clustering for both the sentence-based and word-based groups. The performance of our clustering model on the sentence-based group, classifying the dysarthria cases into three levels of severity, achieved an average accuracy of 87%, indicating that the MFCC features effectively capture differences in speech patterns at the sentence level. The clear separation of colors in Fig. 4 suggests well-defined clusters, validating the high accuracy. By applying the same Eq. 1, the performance of our clustering model on the word-based group, classifying the dysarthria cases into three levels of severity, achieved an accuracy of 84%. While the clusters are distinct, the slightly lower accuracy compared to the sentence-based data may be due to reduced contextual information in word-based data. Despite this, the visualization demonstrates that MFCC features can still effectively differentiate word-level speech patterns into three distinct clusters. Table 5 shows the results for this case using the four evaluation metrics given by Eqs. 1–4.
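The two-dimensional embeddings in Fig. 4 (and later Figs. 6 and 7) can be reproduced along these lines; the perplexity and color map are illustrative assumptions.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# X_scaled: standardized MFCC vectors of the dysarthria samples; cluster_ids: k-means labels.
embedding = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X_scaled)

plt.figure(figsize=(6, 5))
scatter = plt.scatter(embedding[:, 0], embedding[:, 1], c=cluster_ids, cmap="viridis", s=10)
plt.legend(*scatter.legend_elements(), title="Cluster")
plt.xlabel("t-SNE dimension 1")
plt.ylabel("t-SNE dimension 2")
plt.title("K-means clusters of dysarthric samples (MFCC features)")
plt.tight_layout()
plt.show()
```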

Table 5 The evaluation metrics for Case-1-SLP-based clusters levels

4.4.2 Case 2: AI-based clusters levels

The high degree of overlap among clusters leads to the hypothesis that these audio signals may not correspond precisely to three different levels; this overlap is a strong indicator of the closeness of the MFCC values for at least two clusters. For this reason, we applied three different techniques to the dysarthria cases to find the optimal number of clusters (a sketch of these criteria follows the list below); they are:

  • Elbow method [43]: Plot the variance explained (or the sum of squared distances from each point to its assigned center) against the number of clusters “k”. As the number of clusters increases, the variance explained by the clusters also increases, but at some point, the gain in variance explained begins to diminish. The “elbow” point on the plot represents an optimal balance between precision and computational cost. Its procedure is represented by calculating the sum of squared distances (inertia) for different values of k (typically ranging from 1 to 10 or more). We then plot these values on a graph where the x-axis represents the number of clusters and the y-axis represents the sum of squared distances. The optimal number of clusters is identified at the point where the curve starts to bend or form an elbow, indicating that adding more clusters does not significantly improve the variance explained.

  • Silhouette analysis [44]: Measures the quality of clusters by calculating the silhouette coefficient for each sample. The silhouette value ranges from -1 to 1, where a high value indicates that the object is well-matched to its cluster and poorly matched to neighboring clusters. A silhouette score close to 1 suggests that the sample is appropriately clustered, while a score close to -1 indicates that the sample might have been assigned to the wrong cluster. Its procedure is represented by calculating the silhouette coefficient for each sample and for each k value, considering the mean intra-cluster distance (a) and the mean nearest-cluster distance (b). The silhouette coefficient is then calculated as \((\hbox {b} -\hbox {a}) / \hbox {max}(\hbox {a}, \hbox {b})\). We plot the average silhouette scores for different k values, and the optimal number of clusters is where the average silhouette score is maximized.

  • Davies–Bouldin index [45]: A measure of the average similarity ratio of each cluster with its most similar cluster. It considers the dispersion within clusters and the separation between clusters. A lower Davies–Bouldin Index indicates better clustering, which signifies that clusters are far apart and less dispersed. Its procedure is represented by calculating the Davies–Bouldin Index for each k value, which involves computing the average distance between each cluster center and its points, and the distance between cluster centers. The index is computed as the average ratio of within-cluster distances to between-cluster distances for the most similar cluster pairs. We plot the Davies–Bouldin Index values for different k values, and the optimal number of clusters is identified where the index is minimized.
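A minimal sketch of these three criteria applied over a range of k values, assuming standardized feature vectors X_scaled; the range of k is illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, silhouette_score

# X_scaled: standardized feature vectors of the dysarthria cases.
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    labels = km.labels_
    print(f"k={k:2d}"
          f"  inertia={km.inertia_:10.1f}"                                   # elbow: look for the bend
          f"  silhouette={silhouette_score(X_scaled, labels):.3f}"           # maximize
          f"  davies_bouldin={davies_bouldin_score(X_scaled, labels):.3f}")  # minimize
```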

Figure 5 shows the results of the optimal number of clusters after applying these techniques to the dysarthria cases.

Fig. 5 Techniques to find the optimal number of clusters: a Elbow method, b Silhouette analysis, c Davies–Bouldin index

All three techniques state that the optimal number of clusters is 2. So, we repeated the clustering procedure based on only two clusters. Table 6 shows the cluster samples for sentence- and word-based groups.

Table 6 clusters’ samples for sentence and word-based dataset when k = 2
Fig. 6 The visualization of clustering using t-SNE with only MFCC features when k = 2 for the a sentence-based group and b word-based group

Based on the AI-optimal cluster number, we obtained a very clear differentiation among the severity levels, indicating that the low and very low levels can be merged for both the sentence-based and word-based groups. Figure 6 shows the clustering for each group. By calculating the accuracy of clustering after merging the low and very low clusters using Eq. 1, the results improved to 91% for the sentence-based group, indicating a clear separation of the data into two clusters. The distinct division between the colors in Fig. 6 suggests that MFCC features effectively capture and differentiate primary speech patterns at the sentence level, resulting in high clustering performance. While the clusters are distinguishable, the slightly lower accuracy for the word-based group (85%) compared to the sentence-based data may be due to shorter durations and less contextual information in word-based data. Despite this, the visualization demonstrates that MFCC features can still effectively distinguish word-level speech patterns into two distinct clusters. Table 7 shows the rest of the evaluation metrics for this case.

Table 7 The evaluation metrics for Case-2-AI-based clusters levels

4.5 Evaluation on UA speech dataset

Notably, while the UA Speech dataset exclusively comprises isolated-word utterances, it exhibits distinct diversity in its vocabulary, which includes a range of common, uncommon, and specialized terms; this introduces a unique type of lexical diversity and encompasses varied usage contexts. It also offers a larger dataset size, providing more training data than the TORGO dataset. Our research commenced with the application of feature extraction through MFCC, yielding promising results. Consequently, assessing this approach using an alternative dataset became imperative, with the UA Speech dataset as a suitable candidate. It is important to highlight that the UA Speech dataset surpasses the TORGO dataset regarding the variety of word prompts and the number of utterances. Given that multiple experiments on the TORGO dataset, both sentence-based and word-based, demonstrated that MFCC achieved the best results, we decided to implement the clustering stage model by extracting the MFCC features from the UA Speech dataset and applying k-means clustering. We conducted several experiments to ensure a comprehensive and equitable comparison with other classification techniques, primarily adhering to the speaker-independent approach. The inapplicability of the speaker-dependent procedure necessitated this choice in conjunction with our clustering-based classification methodology. Table 8 presents the outcomes of these experiments in terms of Accuracy, Precision, Recall, and F1-Score, which will be subjected to a detailed analysis in the subsequent section.

Figure 7 shows the clustering results for two different experiments on the UA Speech dataset. In subfigure (a), the data is partitioned into two clusters, revealing a clear separation that indicates the efficacy of MFCC features in distinguishing between two broad categories of dysarthric speech. In subfigure (b), the data is divided into four clusters, which uncovers more nuanced distinctions within the dataset. This finer segmentation could be essential for identifying more specific subtypes of dysarthric speech, allowing for a more detailed understanding of the variations within the disorder. However, the increased number of clusters may introduce complexity that impacts the overall classification performance.

Table 8 Performance results for several experiments for the UA speech dataset (cluster 0: VL (very low), cluster 1: L (low), cluster 2: M (medium), cluster 3: H: high, B1: block 1, M2 and M3: microphone, CW: common words)
Fig. 7 The visualization of clustering using t-SNE with only MFCC features for a all the data with k = 2 and b all the data with k = 4

5 Discussion and comparison

In this investigation, the focus was placed on discerning various degrees of dysarthria severity via a two-tiered approach.

5.1 Stage 1 Analysis (classifiers and data types)

In the initial stage, two classification mechanisms, Support Vector Machine (SVM) and Artificial Neural Network (ANN), were tested on two distinct data subsets: sentence-based and word-based. Upon examination of Tables 2 and 3, it is evident that among the diverse feature sets, the Mel-frequency cepstral coefficient (MFCC) features displayed superior performance in differentiating control and dysarthria instances for both data subsets. This is corroborated by Fig. 3, which aligns with our binary classification results, showing that the best performance was achieved using MFCC features.

CQT features ranked second concerning evaluation metrics (despite coming in third place in the feature importance plot), whereas other feature types demonstrated comparable performance. Interestingly, ANNs paired with MFCC yielded better accuracy rates than SVMs for both data subsets. Such results led to the immediate selection of the ANN model in conjunction with MFCC to maximize the identification of dysarthria instances, thereby feeding the subsequent stage of our analytical model. Broadly, the results derived from sentence-based data showcased superior performance compared to word-based data. One potential explanation for this could be the richer contextual information embedded in sentences, providing more distinctive features for the classifiers. The findings raise an interesting question about the synergy between feature types and classifiers. The differential performances may be attributed to the inherent characteristics of the features and how each classifier processes them. For instance, ANNs, whose multiple layers and techniques such as ReLU activation functions and Dropout enhance their ability to generalize and avoid overfitting, which is crucial for high-dimensional, non-linear data, might be better equipped to capture the nuances of MFCC features.

5.2 Stage 2 Analysis (clustering approach)

For the secondary stage, we incorporated a clustering methodology to assist speech therapists in categorizing dysarthria cases based on severity. The rationale was to expedite and enhance the accuracy of the labeling process by leveraging pivotal features that could distinguish various severity degrees. The k-means clustering algorithm was the method of choice, attributed to its efficiency and ease of implementation. The number of clusters was selected both based on SLP opinion, as reflected in many classification techniques in the literature, and based on AI techniques for selecting the best number of clusters. Tables 4 and 6 elucidate the clusters generated for each data subset. It is noteworthy from Table 4, focusing on sentence-based data, that specific clusters, particularly clusters 0 and 1, exhibited a degree of overlapping. Such overlaps can be traced back to the presence of outliers that may have compromised clustering quality. Remarkably, this level of clarity was achieved using MFCC features alone, suggesting that supplementing them with additional feature types could further optimize clustering results.

Furthermore, a distinct pattern emerges with clusters 1 and 2 for the sentence-based group: they exclusively encompass male participants. Drawing upon literature, cluster 1 is discerned as a moderate severity level, incorporating all male individuals at this severity tier. Likewise, cluster 2, identified as a low severity bracket, predominantly featured participant M05, with only a female participant being an exception. A plausible deduction is the clearer differentiation among male participants compared to their female counterparts, which warrants further exploration. For example, a misclustered female case, F01, should have been classified under the low severity level. Although M03 is also misclustered, the predominant clustering of males and the absence of a completely female cluster suggest that male voice samples exhibit more distinguishable features. This differential recognizability likely contributes to the observed clustering pattern.

This observation hints at a potential gender bias in the dataset. Specifically, this bias appears in the sentence-based data but not in the word-based data of the TORGO dataset and is most evident when the number of clusters is three, not two. We can infer that male and female voices' acoustic and phonetic characteristics differ, potentially leading the model to identify and differentiate male voices more easily. The features used in this analysis, the MFCCs, capture the power spectrum of a sound, which is influenced by pitch and formant frequencies. Male and female voices typically have different pitch ranges and formant structures, with male voices generally exhibiting lower fundamental frequencies and different formant frequencies compared to female voices. These differences may result in more distinct MFCC patterns for male voices, making them easier for the clustering algorithm to recognize and group. A parallel trend is evident for the word-based data in the same table, where the clusters are clearly defined with the same degree of overlapping, but the selection of the classes was in a different order, which is considered acceptable for differentiating classes. Table 5 shows that the sentence-based model outperforms the word-based model, with higher accuracy, precision, recall, and F1-score, indicating that sentence-based features provide richer contextual information for classification. This is also reflected in Table 7, where for k = 2, the sentence-based model again achieves higher performance metrics than the word-based model. The consistent performance enhancement in both tables demonstrates the effectiveness of using sentence-based features over word-based features.

5.3 UA speech results analysis

Table 8 presents the results of clustering the UA Speech dataset based on MFCC features. After conducting various experiments using either the entire dataset or subsets, it becomes evident that the achieved performance is not particularly high. Several factors contribute to this outcome, including the relatively low variability among samples for different utterances. This limited variability implies that relying solely on MFCC features is insufficient for creating distinct partitions among different levels of severity. This contrasts with the promising performance of MFCC in the word-based group of the TORGO dataset. It is worth noting that the selection of each experiment was not arbitrary; it was guided by previous studies in the literature that have worked on the UA Speech dataset under similar scenarios involving speaker-independent dysarthria classification. The table also presents the other evaluation metrics: precision, recall, and F1-score. In general, using all data from all microphones provides a balanced performance with moderate scores across all metrics. However, when focusing on common words and specific microphones, there are noticeable improvements in precision and recall, particularly when the number of clusters is 2. The highest performance is observed when k = 2 with all microphones, achieving an F1-score of 62.7%, indicating that more focused datasets and optimized microphone selection can enhance speech recognition performance for individuals with different levels of speech intelligibility. The last experiment, using a larger dataset of common words from Block 1 with eight participants, shows a perfect recall of 100% but a precision of 52%, resulting in an F1-score of 68%. This highlights a trade-off, where the model's recall improves significantly with a broader dataset, but at the expense of precision, leading to a higher number of false positives. These results underscore the importance of data selection and microphone choice in optimizing the balance between precision and recall in speech recognition tasks.

5.4 Comparative review

On comparing our methodology with extant approaches in the domain, as detailed in Sect. 2, we identified nine pertinent studies that utilized the TORGO dataset. These research endeavors focused solely on word- or sentence-based data types, with a handful examining the amalgamation of various stimuli. Of these, three studies concentrated exclusively on sentence-based data, and three also explored word-based data, as showcased in Table 9. From the comparative table, our methodology surpasses contemporary benchmarks by relying on MFCC for clustering while introducing a diverse feature set to enhance performance. Compared to existing approaches utilizing the TORGO dataset, such as Fisher vector encoding with ANN classifiers [18] and prosody-based measures [7], our method incorporates a comprehensive array of features including CQT, Chromagrams, and Tonnetz coefficients. This results in a more nuanced and accurate analysis, capturing a wider range of acoustic and temporal characteristics. Furthermore, our approach addresses the limitations of deep learning techniques [15, 41] that, despite high accuracy, often require large datasets and struggle with speaker-independent scenarios. By using robust clustering techniques supported by the Elbow method, Silhouette analysis, and Davies–Bouldin index, we ensure optimal clustering performance, identifying the ideal number of clusters for better accuracy and reliability.

Our study demonstrates superior performance on the UA Speech dataset, outperforming state-of-the-art techniques in speaker-independent scenarios [15, 45,46,47], as shown in Table 10. Despite the low variability and the challenge of using only MFCC features to differentiate the different classes, these results are promising and pave the way for exploring more precise feature extraction techniques that can uncover the hidden patterns in this type of dataset.

Unlike traditional classification-based methods that rely on predefined labels, our clustering-driven approach is practical for real-world applications, facilitating the identification of dysarthria severity levels without extensive manual annotation. The generalizability of our model, achieved by working on each speaker separately based on sample distribution within clusters, addresses a common limitation in deep learning models. This speaker-independent approach enhances the model’s applicability in diverse clinical settings. Advancements in clustering accuracy not only recognize pivotal features influencing dysarthria severity differentiation but also aid speech pathologists in efficiently labeling new cases. This practical and innovative approach bridges the gap between controlled datasets and real-world applications, making significant strides towards developing AI tools that assist therapists without the need for manual data annotation.

Table 9 Comparison results with the state-of-the-art for TORGO dataset
Table 10 Comparison results with the state-of-the-art for UA speech dataset

6 Conclusions

In conclusion, our two-stage methodology presents a novel approach to discerning the intricate severity levels of dysarthria. By integrating binary classification and k-means clustering, we have achieved significant accuracy on the TORGO and UA Speech datasets and underscored the utility of clustering in offering a more sophisticated understanding of the disorder. The promising outcomes achieved using sentence-based features pave the way for further research. We anticipate delving into feature augmentation, capitalizing on the potential of various feature sets combined with MFCC or used separately to refine classification further. Our preliminary clustering outcomes also hint at potential gender biases in dysarthria differentiation. Recognizing this, we aim to develop gender-specific models to explore the influence of distinct features across genders and better understand the gender dynamics in dysarthria severity classification, ensuring a more holistic and inclusive comprehension of the disorder.