
International Journal of Speech Technology (2024) 27:87–99

https://doi.org/10.1007/s10772-024-10082-z

An amalgamation of integrated features with DeepSpeech2 architecture and improved spell corrector for improving Gujarati language ASR system
Mohit Dua1 · Bhavesh Bhagat1 · Shelza Dua2

Received: 9 September 2023 / Accepted: 2 January 2024 / Published online: 13 February 2024
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024

Abstract
Automatic Speech Recognition (ASR) systems, which convert spoken language into written text, have greatly transformed human–machine interaction. Although these systems have achieved strong results in many languages, building accurate and reliable ASR models for low-resource languages like Gujarati comes with significant challenges. Gujarati lacks data and linguistic resources, making it difficult to develop high-performance ASR systems. In this paper, we propose an approach to enhance the effectiveness of a Gujarati ASR model despite limited resources. We achieve this by incorporating integrated features, namely Mel Frequency Cepstral Coefficients (MFCC) and Gammatone Frequency Cepstral Coefficients (GFCC), utilizing the DeepSpeech2 architecture, and implementing an improved spell correction technique based on the Bidirectional Encoder Representations from Transformers (BERT) algorithm. Testing and evaluation show that our approach outperforms previous state-of-the-art methodologies. The experimental results demonstrate that our proposed method consistently reduces the Word Error Rate (WER) by 10–12 percentage points compared to the existing work, surpassing the most significant improvement of 5.87%. Our findings demonstrate the viability of developing accurate and dependable ASR systems for languages with limited resources, such as Gujarati.

Keywords Automatic speech recognition · Gujarati language · Low-resource · MFCC + GFCC · DeepSpeech2 · Spell
corrector

1 Introduction

A single-speaker Automatic Speech Recognition (ASR) system is a flexible instrument with a wide range of applications. Its principal application is in transcription services, where it excels at turning spoken language into precise and efficient text, easing duties such as meeting documentation, interview transcriptions, and personalized dictation. Furthermore, the technology improves user interfaces dramatically, notably in voice-activated systems such as virtual assistants and smart home gadgets. This enables users to communicate in real time using natural language instructions, resulting in a more intuitive and accessible user experience. Single-speaker ASR is helpful in contact centres and customer care applications, automatically transcribing and analyzing conversations, assisting in quality assurance, compliance monitoring, and the extraction of important insights. Furthermore, the technology is critical in voice search functions.

Multi-speaker ASR represents a substantial leap in speech recognition, excelling in contexts with several speakers, overlapping speech, and a variety of accents. It is useful for transcribing dynamic meetings, conference calls, and improving voice assistants in busy environments. Its real-time transcribing enhances corporate settings, allowing for more efficient communication and thorough record-keeping. Multi-speaker ASR speeds transcription, boosting material accessibility in settings such as broadcast content and interviews when numerous voices participate.

* Shelza Dua, [email protected]
Mohit Dua, [email protected]
Bhavesh Bhagat, [email protected]
1 Department of Computer Engineering, National Institute of Technology, Kurukshetra, India
2 Department of Electronics Communication and Engineering, Punjab Engineering College, Chandigarh, India


Conventional automatic speech recognition (ASR) systems commonly employed Hidden Markov Models (HMMs), utilizing multiple components such as acoustic, pronunciation, and language models. These HMM-based models relied on feature extraction techniques like MFCCs to convert raw audio into suitable feature representations. While HMMs achieved acceptable accuracy, their performance was constrained by hand-crafted features and system complexity (Joshi et al., 2022). The advent of deep learning revolutionized ASR with the development of hybrid models that combined Deep Neural Networks (DNN) with HMMs, known as DNN-HMM models (Anoop & Ramakrishnan, 2021; Dubey & Shah, 2022). Despite the improvement, DNN-HMM models retained the modular nature of HMMs, training distinct components separately before combining them, limiting their ability to capture end-to-end speech relationships (Maji et al., 2022; Schuster & Paliwal, 1997). By optimizing the entire ASR pipeline, end-to-end (E2E) models eliminate error propagation between individual components, resulting in enhanced precision. The ability to learn directly from unprocessed audio and to capture end-to-end relationships has proven highly effective, establishing E2E models as the preferred choice for contemporary ASR systems (Amodei et al., 2016).

Developing accurate and dependable ASR models for low-resource languages, however, poses significant difficulties (Failed, 2018; Forsberg, 2003). These languages, including Gujarati, lack sufficient annotated data and linguistic resources, making the development of high-performance ASR systems challenging. Gujarati, which is primarily spoken in the Indian state of Gujarat, is one such language with limited resources. Low-resource languages often lack sufficient amounts of diverse and high-quality training data. ASR models rely heavily on large and varied datasets to learn the intricacies of language patterns and acoustic characteristics. In the case of low-resource languages like Gujarati, the scarcity of relevant data makes it challenging to train robust models. Gujarati ASR systems are not as developed as those for major languages, despite the language's rich cultural and linguistic history. The lack of annotated data and limited Gujarati-specific linguistic resources presents significant obstacles to the development of effective ASR models. Therefore, there is an urgent need to develop novel or improved approaches that can improve the efficacy of Gujarati ASR systems with limited resources.

In this paper, we present an exhaustive strategy for addressing the difficulties inherent in developing accurate and trustworthy ASR models for Gujarati. Our method combines the MFCC + GFCC features with the DeepSpeech2 architecture and a post-processing phase utilizing an enhanced spell corrector technique. By combining these techniques, we hope to considerably enhance the performance of the low-resource Gujarati ASR model and make it appropriate for use in practical applications.

The remainder of the paper is organized as follows: Sect. 2 gives a review of the existing work relevant to this paper. The proposed technique is introduced in Sect. 3. Section 4 describes the experiments and provides a discussion based on the outcomes. Finally, Sect. 5 summarises the important findings and closes the article.

2 Related work, motivation and contribution

Significant research was conducted in the 1970s in the field of voice recognition. The HMM was introduced as a speech recognition technique in the early 1980s. Initially, it was believed that the HMM was too simplistic to accurately recognize human languages, but it eventually supplanted the dynamic time warping technique. However, the development of HMM's state models required a substantial quantity of task-specific knowledge and sophisticated parameter adjustments (Deshmukh, 2020). To address these issues, researchers (Graves & Jaitly, 2014; Graves et al., 2006) created an end-to-end (E2E) ASR system employing RNN and the Connectionist Temporal Classification (CTC) technique. This novel technique enabled RNNs to directly recognize unsegmented sequences without explicit segmentation. Encoder and decoder-based E2E ASR systems, notably those incorporating an attention mechanism in the decoder, exhibited superior performance, particularly for prolonged inputs or outputs (Cho et al., 2014; Sutskever et al., 2014).

Joshi et al. (2022) presented DNN-based end-to-end systems, a novel approach for ASR. Using advanced deep learning architectures such as RNNs or transformers, these systems map the acoustic input directly to the corresponding text output. Such ASR systems are able to discover intricate patterns in the input data and produce more precise transcriptions. Maji et al. (2022) proposed a parallelized CNN-BiGRU network, which is used to capture and process the emotional information in the input utterances efficiently. Dua et al. (2019) introduced a robust ASR system designed specifically for the Hindi language in their study. The system addresses the issue of noise interference by employing a feature representation known as GFCC, which is a modified variant of the conventional MFCC feature extraction method and is known for its ability to capture spectral and temporal information effectively. To improve the noise-resilience of the ASR system, the researchers (Dua et al., 2018) proposed integrating MFCC and GFCC feature representations. These integrated features seek to capture vital acoustic characteristics while minimizing noise's detrimental effects.

Tailor and Shah (2018) proposed a lightweight speech recognition system for Gujarati that employs HMM on a limited corpus of Gujarati words. Their system obtained a Word Error Rate (WER) of 5.85 percent for Gujarati and other low-resource (LR) languages. In addition, they presented a second HMM-based ASR using a custom audio corpus of 650 frequent Gujarati words, which yielded a WER of 12.7%. Dave (2015) proposed a hybrid technique that combines HMM and ANN models for Gujarati, which enhanced word recognition accuracy and decreased the WER from 29.43 to 20.85%. Billa (2018) developed LSTM-CTC-based ASR systems for Gujarati, Tamil, and Telugu, achieving a significant reduction in WER in comparison to the baseline system. They utilized multilingual training techniques to increase the efficacy of their strategy. Krishna et al. (2018) designed a joint acoustic modelling experiment for Indian languages, evaluating the applicability of various acoustic models, such as HMM with Sub-space Gaussian Mixture Model (SGMM) and RNN-CTC, for creating a collaborative acoustic model using a common phone-set. To surmount resource limitations, Toshniwal et al. (2017) developed E2E multilingual ASR architectures, achieving WER values of 21.07% for RNN-CTC, 19.11% for LSTM-CTC, and 17.3% for the Listen, Attend and Spell (LAS) based acoustic model.

In addition to conventional statistical ASRs, researchers have focused on developing E2E ASRs using post-processing and spelling correction techniques. Amodei et al. (2016) presented the DeepSpeech2 architecture, an E2E deep learning approach for recognizing English and Mandarin Chinese speech that required minimal language expertise and was adaptable to new languages. Srivastava et al. (2019) proposed two methods for E2E ASR in a low-resource environment with Hindi-English code switching, addressing the problem of class imbalance and enhancing performance via oversampling. Raval et al. (2021) enhanced the performance of an end-to-end speech recognition system for Gujarati by training a CNN-LSTM hybrid model on the Microsoft Speech Corpus dataset and employing decoding and post-processing techniques. Researchers have also investigated probabilistic approaches and spelling correction techniques for language-specific tasks. Using models such as Naive Bayes and HMM, Patel et al. (2014) presented a word-level probabilistic approach for error correction in Gujarati language documents. Using the BERT algorithm, Zang and colleagues (2020) proposed a soft-mask misspelt word correction method for the Chinese language at the character level. Hu et al. (2021) presented a method for correcting misspellings using a pre-trained contextual language model. Their study employed BERT as a pre-trained model with Levenshtein distance for Optical Character Recognition (OCR)-generated Hindi text.

The work proposed in this paper integrates MFCC and GFCC features from the audio for building a Gujarati ASR system. Integrating MFCC + GFCC features (Dua et al., 2019) is motivated by the need to capture a broader spectrum of both clean and noisy acoustic characteristics in Gujarati speech. While MFCC features are effective in representing spectral information and widely used in ASR systems, they may lack essential temporal details crucial for speech recognition accuracy in noisy environments. MFCC features, due to their spectral focus, may not capture rapid changes in speech signals. To overcome this limitation, GFCC features are integrated to enhance temporal resolution and represent rapid speech signal changes effectively. The pointwise contributions of the proposed work can be summarized as:

1. The work integrates MFCC + GFCC features, which allows the ASR model to capture both spectral and temporal information. This leads to enhanced performance for low-resource Gujarati ASR settings, addressing the limitations of using either feature in isolation.
2. The work utilizes an RNN with LSTM units, inspired by the DeepSpeech2 model (Amodei et al., 2016), which enables the model to capture temporal relationships and extract contextual information from input voice data. DeepSpeech2 has already proven its effectiveness in ASR tasks across languages with resource constraints.
3. To address transcription mistakes inevitable in low-resource languages, an orthography corrector has been introduced as a post-processing step. This corrector employs contextual data and language models to identify and fix transcription problems, thereby enhancing the accuracy and readability of the transcribed Gujarati text.
4. Given the scarcity of annotated Gujarati speech data and the challenges in training, the orthography corrector approach aims to mitigate the impact of transcription mistakes on ASR output accuracy. It proves effective in improving the quality of transcribed Gujarati text.

3 Proposed methodology

Figure 1 depicts the proposed End-to-End (E2E) speech recognition system. The proposed system consists of four main phases: Speech feature extraction, DeepSpeech2 Architecture, Decoding, and Post-processing. Each phase contributes to improving the accuracy and performance of ASR.

Fig. 1  Proposed E2E ASR architecture for Gujarati

An additional phase is included in the methodology to check the robustness and effectiveness of the Gujarati ASR system: noise is introduced in the preprocessing step to create our own noisy Gujarati dataset.

Initially, the audio speech or signal is processed. To improve the accuracy of ASR, a method is proposed that incorporates MFCC, GFCC, and combined MFCC + GFCC feature extraction techniques. These features are extracted from the audio data and then passed to the DeepSpeech2 model architecture for training. The architecture of the DeepSpeech2 model comprises a number of essential components. It consists of CNN layers, GRU layers, Batch Normalisation layers, Dropout layers, and Dense layers, as well as the CTC component. Together, these components discover and represent the underlying patterns and structures present in the audio data, enabling the model to make accurate predictions.

After 22 h of training and testing, the predicted output of the model is decoded using a variety of techniques to generate the output text. It is essential to observe, however, that the output text may contain misspelt words or characters within sentences. This leads to the system's next step. Post-processing attempts to repair mistakes and replace inappropriate words with terms from the Gujarati corpus. To accomplish this, the BERT (Bidirectional Encoder Representations from Transformers) model is used in a modified typo correction approach. Contextual embeddings and language modelling are used in this updated BERT model for spelling correction to identify the most suitable word replacements for erroneous or misspelt words in phrases. The spell corrector technique and the BERT model are combined to form the Modified Spell Corrector BERT model. After the post-processing step, the output text from the earlier phase becomes the improved Output Text. Incorrect words are replaced with more suitable ones in this text to provide a more polished and accurate reproduction of the original audio discourse.

Fig. 2  Detected speech using VAD

3.1 Voice activity detection

Voice activity detection (VAD) holds paramount importance in the realm of speaker recognition by accurately discerning speech segments within audio recordings. In speaker recognition systems, VAD serves as a critical pre-processing step, allowing the system to focus specifically on the speech portions and disregard non-speech intervals. By identifying and isolating speech activity, VAD contributes to the improvement of feature extraction and model training, leading to more robust and accurate speaker recognition outcomes.

In our proposed approach we have used inbuilt MATLAB code to remove the silent parts of the audio before further processing. Figure 2 shows the output of the algorithm used to separate the silent and speech parts of the audio.

3.2 Speech feature extraction

For feature extraction we have used the output audio of the VAD step. In the VAD step we identified the speech and non-speech parts, and we used only the speech part for feature extraction.

In ASR systems, the MFCC approach is often utilized. It transforms the short-term power spectrum of an audio stream into coefficients that reflect the spectral envelope. A Mel filter bank, logarithmic compression, and a discrete cosine transform are used in this approach. The resulting MFCC coefficients give useful information about the spectral features of the signal (Gaudani & Patel, 2022). In contrast, the GFCC approach is inspired by the human hearing system. It utilizes a set of gammatone filters that mimic the frequency selectivity of the cochlea. These filters analyze the audio signal's power spectrum and generate coefficients that represent the signal's magnitude. In addition to spectral characteristics, GFCC captures the signal's temporal characteristics. Figure 3 shows the integration process of MFCC and GFCC features.

By integrating MFCC and GFCC, the objective is to capitalize on the strengths of both approaches and generate a more comprehensive audio signal representation. This combined representation has the potential to enhance the feature set's discrimination and classification capabilities. There are numerous ways to combine MFCC and GFCC. Concatenating the feature vectors obtained from both methodologies is a common practice. This generates an extended feature vector containing information from both the spectral and temporal domains. This combined feature vector can then be used as input to a DeepSpeech2 architecture or provided to subsequent ASR processing stages. Function 1 is used to implement the integrated MFCC and GFCC features. The MATLAB function mfcc() returns 39 features (13 coefficients + 13 delta + 13 double delta), whereas gtcc() returns 36 features (12 coefficients + 12 delta + 12 double delta). In our proposed work we use an integrated MFCC-GFCC feature matrix of size 75 × Number of audios.

Fig. 3  Process of integrating MFCC and GFCC features
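The concatenation step described in this section can be illustrated with a minimal Python sketch. It is not the paper's Function 1 (which is MATLAB and not reproduced here). It assumes one 39-dimensional MFCC vector and one 36-dimensional GFCC vector per frame, and it uses random numbers as stand-ins for real features. The function name `integrate_features` and the row-per-frame layout are illustrative assumptions.

```python
import random

N_MFCC = 39  # 13 coefficients + 13 delta + 13 double delta
N_GFCC = 36  # 12 coefficients + 12 delta + 12 double delta


def integrate_features(mfcc_frames, gfcc_frames):
    """Concatenate per-frame MFCC and GFCC vectors (hypothetical helper)."""
    if len(mfcc_frames) != len(gfcc_frames):
        raise ValueError("MFCC and GFCC must cover the same number of frames")
    # Each output row holds the 39 MFCC values followed by the 36 GFCC values.
    return [list(m) + list(g) for m, g in zip(mfcc_frames, gfcc_frames)]


# Stand-in data: random values instead of features extracted from audio.
random.seed(0)
n_frames = 5
mfcc = [[random.random() for _ in range(N_MFCC)] for _ in range(n_frames)]
gfcc = [[random.random() for _ in range(N_GFCC)] for _ in range(n_frames)]

combined = integrate_features(mfcc, gfcc)
print(len(combined), len(combined[0]))  # 5 75
```

Each combined vector has 39 + 36 = 75 entries, which matches the 75-dimensional feature size reported above.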



3.3 DeepSpeech2 architecture

DeepSpeech2's CNN layers extract high-level representations from input audio features, employing multiple convolutional filters with varying sizes and strides. Parameter values, such as the number of filters, filter sizes, and strides, are tailored to the ASR task and input data characteristics. Bidirectional GRU layers capture temporal dependencies and context in audio sequences by analyzing input in both forward and reverse directions. The number and size of GRU layers are adjustable hyperparameters.

Batch Normalization layers normalize preceding layer activations, reducing internal covariate shift, enhancing model stability, and accelerating training. Parameters like momentum and epsilon govern the normalization process. Dropout layers prevent overfitting by randomly setting a portion of input units to zero during training, reducing dependence on specific features and improving generalization. The dropout rate, a hyperparameter, specifies the proportion of input units to discard.

The CTC loss layer is essential for training the DeepSpeech2 model. CTC is a method for matching the input audio sequence to the associated transcription. It allows the model to handle variable length input sequences and create output sequences that are aligned with the input. The CTC loss layer computes the difference between the predicted and actual sequences, allowing the model to learn the right alignments during training.

3.4 Decoding

In this study, two major decoding strategies, prefix decoding and greedy decoding, were used, coupled with a post-processing methodology. These algorithms are used to create correct character sequences as outputs, which serve as hypotheses for transcribed speech.
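To make the link between CTC outputs and text concrete, the following is a minimal Python sketch of greedy decoding over a CTC output. It assumes the model emits one probability vector per frame and that index 0 is the CTC blank. The toy alphabet, the probabilities, and the function name `ctc_greedy_decode` are illustrative and not taken from the paper.

```python
def ctc_greedy_decode(frame_probs, symbols, blank=0):
    """Pick the most likely symbol per frame, merge repeats, drop blanks."""
    decoded = []
    previous = None
    for probs in frame_probs:
        # Locally optimal choice: index of the highest probability in this frame.
        best = max(range(len(probs)), key=lambda k: probs[k])
        # Emit only when the symbol is not blank and differs from the last frame.
        if best != blank and best != previous:
            decoded.append(symbols[best])
        previous = best
    return "".join(decoded)


# Toy example: "_" stands for the CTC blank (assumed to sit at index 0).
symbols = ["_", "a", "b"]
frames = [
    [0.10, 0.80, 0.10],  # a
    [0.20, 0.70, 0.10],  # a again (repeat, merged)
    [0.90, 0.05, 0.05],  # blank separates the two a's
    [0.10, 0.60, 0.30],  # a
    [0.10, 0.20, 0.70],  # b
    [0.20, 0.10, 0.70],  # b again (repeat, merged)
]
print(ctc_greedy_decode(frames, symbols))  # aab
```

Six input frames yield a three-character output, which shows how CTC lets the output be shorter than the input without an explicit alignment.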

3.4.1 Greedy decoding

Greedy decoding is a popular decoding method in voice recognition systems. It builds a character sequence by picking the most likely output at each time step. This strategy aims to determine the locally optimum option at each stage while ignoring the overall optimization of the sequence. The greedy decoding equation is written as follows:

$$
s = \arg\max_{c} \, p(c \mid x_t) \tag{1}
$$

where $s$ is the character sequence being constructed, $c$ is a character from the set of possible characters, and $x_t$ is the input at time $t$. At each timestamp, the algorithm selects the character that, given the input at that moment, has the highest conditional probability. This means it selects the character most likely to occur based on the information available at that time. By repeating this process for each timestamp, the algorithm generates a character sequence $s$ representing the most probable output given the input. The greedy decoding algorithm is computationally efficient and simple to implement. However, because it makes locally optimal decisions at each stage, it may not always produce the optimal solution globally. In some instances, the greedy strategy favours immediate high-probability options, which can result in suboptimal outcomes. Despite its limitations, greedy decoding is a quick and straightforward approximation that can be used as a starting point or baseline for more refined or advanced decoding techniques.

3.4.2 Prefix beam search

Prefix beam search is an advanced decoding technique that simultaneously investigates multiple potential hypotheses using a prefix search algorithm. It improves the efficacy and adaptability of character sequence generation. Prefix beam search can be represented by the following equation:

$$
s = \arg\max \sum_{s_k} p(s_k \mid x) \tag{2}
$$

where $s$ represents the constructed character sequence, $s_k$ represents a potential hypothesis or prefix, and $x$ represents the input character sequence. Given the input sequence $x$, the algorithm considers various character sequences or prefixes $s_k$ and calculates the probability of each hypothesis. The sum over $s_k$ represents the sum of all hypotheses' probabilities. The argmax function chooses the hypothesis $s$ with the highest probability.

Prefix decoding is frequently used in automatic speech recognition systems because it yields more comprehensive and precise results than greedy decoding. During the decoding process, it permits the exploration of multiple hypotheses and takes a broader context into account. Due to the investigation of multiple paths, prefix decoding typically requires more computational resources and can be slower than greedy decoding.

3.5 Improved spell corrector technique

To improve the output of voice recognition systems, an enhanced typo corrector technique employing BERT is proposed in this paper. The technique modifies the existing algorithm for spelling correction and leverages the power of BERT for more accurate word replacements. The modified spell checker introduces a new mechanism known as "splitter words," which performs operations such as insertion, deletion, replacement, and transposition of words. This improves the process of spelling correction by considering a broader variety of potential corrections for each word. In addition, bi-gram words from the Gujarati corpus are added to resolve instances in which two words are combined, resulting in a decrease in Word Error Rate (WER).

Figure 4 depicts the Modified Spell Corrector's BERT-based operation. Each possible replacement from the Gujarati corpus is generated for each word in a phrase, including zero matching (no replacement), one matching, and two matching edit words. This inventory is utilized to determine the exact spelling of the predicted word. Next, the sentence is constructed by substituting each replacement term from the replacements list for the [MASK] token. If the current term is not present in the list of replacements, the original word is kept. Following tokenization, the resultant sentences are input into the BERT model.

The BERT model examines tokenized sentences and provides a list of candidate replacement words, as well as their probability in relation to the phrase. Each phrase is assessed based on its suitability as a replacement. The K most likely word replacements are selected and added to the output list. A final output list is created after iterating over each word in the sentence. This list includes every word from the original statement, along with up to K potential replacements. The goal is to provide a list of alternative word replacements that have the highest likelihood of being correct, as assessed by the BERT model.

By combining the Modified Spell Corrector with BERT, the spelling correction process is enhanced, resulting in higher accuracy in the output of speech recognition systems. To generate context-appropriate word replacements, the approach takes advantage of BERT's contextual embeddings and language modelling capabilities (Figure 4).
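The candidate-generation stage of such a spell corrector (zero, one, and two edits against a corpus vocabulary) can be sketched as follows. This is only an illustration in the style of a classic edit-distance spell checker, not the authors' implementation. It omits the bi-gram handling and the BERT scoring of [MASK] positions, and it uses a tiny Latin-script vocabulary in place of the Gujarati corpus. The names `edits1` and `candidates` are hypothetical.

```python
def edits1(word, alphabet):
    """All strings one deletion, transposition, replacement, or insertion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + ch + right[1:]
                for left, right in splits if right for ch in alphabet]
    inserts = [left + ch + right for left, right in splits for ch in alphabet]
    return set(deletes + transposes + replaces + inserts)


def candidates(word, vocab):
    """Vocabulary words within two edits of `word`; keep `word` if none exist."""
    # Assumption: the alphabet is simply every character seen in the vocabulary.
    alphabet = sorted({ch for entry in vocab for ch in entry})
    zero_edit = {word} & vocab
    one_edit_all = edits1(word, alphabet)
    one_edit = one_edit_all & vocab
    two_edit = {e2 for e1 in one_edit_all for e2 in edits1(e1, alphabet)} & vocab
    found = zero_edit | one_edit | two_edit
    return found if found else {word}


toy_vocab = {"cat", "cart", "coat", "dog"}  # stand-in for the corpus word list
print(sorted(candidates("cta", toy_vocab)))  # ['cart', 'cat', 'coat']
print(sorted(candidates("xyz", toy_vocab)))  # ['xyz']
```

In a full pipeline, each candidate would be substituted into the sentence and ranked by a masked language model, with the top K kept. That ranking step is deliberately left out here.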

Fig. 4  Sentence correction post processing algorithm

4 Experimental results

Utilizing prior research, we have adapted an approach to match the demands of the low-resource Gujarati language and high-resource English languages. This effort was motivated by prior research on the Gujarati language, which has limited resources. To accomplish our suggested work, we employed a high-quality multi-speaker voice dataset generated through crowdsourcing. The dataset includes two types of speakers, resulting in a wide range of speech samples. We created two unique system models, one for single-speaker datasets and the other for multi-speaker datasets. The system was configured with 8 GB of RAM, a Tesla T4 GPU with 12 GB of memory, and the Windows 10 operating system. The two systems were trained for different durations. The multi-speaker model was trained for 40 epochs (32 h), whereas the single-speaker model was trained for 17 h. There were fifteen collections of training and validation data. The total number of parameters in the system was 26,675,630, which reflected the model's complexity and capacity. To optimize the training procedure, we utilized the Adam optimizer, which uses gradient descent to update model parameters. During training, the CTC loss function was used to compute the loss.

4.1 Dataset details

The Gujarati language dataset consists of 2219 sentences containing 23,199 total words, of which 8203 are unique. These sentences contain 58,735 syllables, of which 1883 are unique. The sentences also contain 128,199 total phonemes, of which 40 are unique. Table 1 describes the dataset used in this proposed work.

4.2 Evaluation metrics

The evaluation of the system's performance relies on the Word Error Rate (WER) metric. WER is a key metric for assessing the accuracy and effectiveness of ASR systems, providing valuable insights for system development, optimization, and real-world application. With each subsequent epoch, there is a consistent decrease in the WER of the model. The calculation of WER employs the Levenshtein distance concept, which identifies the minimum number of alteration operations necessary to transform one string into another (Scharenborg et al., 2017). The formula for computing the WER is presented below:

Table 1  Details of dataset used Language Female Male


(duration measured in hours)
Duration #Speakers Duration #Speakers
Total Average Total Average

Gujarati 4.30 6.97 18 3.59 6.30 18

Table 2  Analysis of single speaker ASR system performance with Table 6  Analysis of multi-speaker ASR system performance with
clean dataset clean dataset

Single speaker ASR system with greedy decoding WER (%) Multi speaker ASR system with greedy decoding WER (%)

MFCC + DeepSpeech2 Model 40.2 MFCC + DeepSpeech2 Model 51.74


GFCC + DeepSpeech2 Model 41.7 GFCC + DeepSpeech2 Model 54.3
MFCC + GFCC + DeepSpeech2 Model 36.65 MFCC + GFCC + DeepSpeech2 Model 48.1
MFCC + GFCC + DeepSpeech2 Model + Modified Spell 28.32 MFCC + GFCC + DeepSpeech2 Model + Modified Spell 40.63
Corrector BERT Corrector BERT

Table 3  Analysis of single speaker ASR system performance with Table 7  Analysis of multi-speaker ASR system performance with
clean dataset clean dataset
Single speaker ASR system with prefix beam search WER (%) Multi speaker ASR system with prefix beam search WER (%)

MFCC + DeepSpeech2 Model 38.34 MFCC + DeepSpeech2 Model 48.83


GFCC + DeepSpeech2 Model 38.67 GFCC + DeepSpeech2 Model 49.1
MFCC + GFCC + DeepSpeech2 Model 35.2 MFCC + GFCC + DeepSpeech2 Model 45.71
MFCC + GFCC + DeepSpeech2 Model + Modified Spell 27.66 MFCC + GFCC + DeepSpeech2 Model + Modified Spell 36.65
Corrector BERT Corrector BERT

(I + D + S) (I + D + S)
WER(%) = = (3)
N (D + S + H)
In Eq. (3), D is the number of deletions, I is the number of insertions, S is the number of substitutions, H is the total number of hits, and N is the total number of input words.

4.3 Analysis of single speaker ASR performance

Various decoding methods and a modified spelling corrector based on BERT were used to evaluate the ASR system's performance. Table 2 shows the performance of the DeepSpeech2 ASR system on the clean dataset using greedy decoding. With MFCC features alone, the Word Error Rate (WER) was 40.2%. The developed single-speaker ASR system (MFCC + GFCC features with the modified spell corrector) lowered the WER by 11.88 percentage points, to 28.32%, a significant improvement in transcription accuracy.

Similarly, Table 3 shows the results of the DeepSpeech2 ASR system on the clean dataset with prefix beam search decoding. With MFCC features alone the WER was 38.34%, and the developed single-speaker system reduced it by about 10.7 percentage points, to 27.66%. This demonstrates the improved precision and functionality of the system.

Table 4  Performance analysis of a single speaker automatic speech recognition system with noisy dataset

| Single speaker ASR system with greedy decoding | WER (%) |
| MFCC + DeepSpeech2 Model | 42 |
| GFCC + DeepSpeech2 Model | 41 |
| MFCC + GFCC + DeepSpeech2 Model | 37 |
| MFCC + GFCC + DeepSpeech2 Model + Modified Spell Corrector BERT | 29 |

Table 5  Performance analysis of a single speaker automatic speech recognition system with noisy dataset

| Single speaker ASR system with prefix beam search | WER (%) |
| MFCC + DeepSpeech2 Model | 43 |
| GFCC + DeepSpeech2 Model | 44 |
| MFCC + GFCC + DeepSpeech2 Model | 39 |
| MFCC + GFCC + DeepSpeech2 Model + Modified Spell Corrector BERT | 30 |
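The reductions discussed in this section compare the MFCC baseline with the full system (MFCC + GFCC features plus the modified spell corrector). The sketch below reproduces that arithmetic from the values listed in Tables 2–5 and reports both the absolute reduction in percentage points and the relative reduction with respect to the baseline. The dictionary and function names are illustrative only and are not part of the paper.

```python
# WER (%) values copied from Tables 2-5 as (MFCC baseline, full system).
SINGLE_SPEAKER_WER = {
    "Table 2: clean, greedy": (40.2, 28.32),
    "Table 3: clean, prefix beam search": (38.34, 27.66),
    "Table 4: noisy, greedy": (42.0, 29.0),
    "Table 5: noisy, prefix beam search": (43.0, 30.0),
}


def wer_reduction(baseline: float, improved: float):
    """Return (absolute reduction in points, relative reduction in %)."""
    absolute = baseline - improved
    relative = 100.0 * absolute / baseline
    return absolute, relative


for name, (baseline, improved) in SINGLE_SPEAKER_WER.items():
    absolute, relative = wer_reduction(baseline, improved)
    print(f"{name}: {absolute:.2f} points absolute, {relative:.1f}% relative")
```

Tables 4 and 5 list whole-number WERs, so reductions computed from them are correspondingly coarse.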

Moving on to the evaluation in a noisy environment, Table 4 shows the WER of the DeepSpeech2 ASR system using greedy decoding on the noisy dataset. With MFCC features alone, the WER was 42%. With the developed single-speaker ASR system, the WER was reduced by 12.1%, indicating enhanced performance and greater transcription accuracy in the presence of noise.

Likewise, Table 5 shows the results of the DeepSpeech2 ASR system on the noisy dataset with prefix beam search decoding. With MFCC features alone the WER was 43%; with the single-speaker ASR system, the WER was reduced by 12.6%, demonstrating the system's efficacy in dealing with noise and enhancing transcription accuracy.

These results demonstrate the significant advancements made by the devised ASR system with a single speaker. By reducing the WER across various decoding methods and environments, the system is able to accurately transcribe data from a single speaker.

Table 8  Analysis of multi-speaker ASR system performance with noisy dataset

| Multi speaker ASR system with greedy decoding | WER (%) |
| MFCC + DeepSpeech2 Model | 54 |
| GFCC + DeepSpeech2 Model | 53 |
| MFCC + GFCC + DeepSpeech2 Model | 50 |
| MFCC + GFCC + DeepSpeech2 Model + Modified Spell Corrector BERT | 43 |

Table 9  Analysis of multi-speaker ASR system performance with noisy dataset

| Multi speaker ASR system with prefix beam search | WER (%) |
| MFCC + DeepSpeech2 Model | 49 |
| GFCC + DeepSpeech2 Model | 50 |
| MFCC + GFCC + DeepSpeech2 Model | 43 |
| MFCC + GFCC + DeepSpeech2 Model + Modified Spell Corrector BERT | 39 |

4.4 Analysis of multi-speaker ASR performance

A model for multi-speaker scenarios has been constructed using a dataset of 4272 utterances (equivalent to 30 h of audio). Of these, 3418 utterances were used to train the model, 427 to validate it, and the remaining 427 to test it. Different decoding techniques and both clean and noisy environments were used to assess the effectiveness of the proposed ASR system. Table 6 shows that the WER of the DeepSpeech2 ASR system on the clean dataset was 51.74% when using greedy decoding with MFCC features. Using the devised multi-speaker

Fig. 5  System loss plot for multi-speaker system with prefix beam

Table 10  Comparison with existing works

| Work | Feature extraction | Classifier methods | Decoding or post processing | Dataset | Parameter WER (%) or ACC |
| Dubey and Shah (2022) | MFCC | Transfer learning approach, deepspeech-0.9.3 model | Greedy decoding | Indic TTS English dataset | 0.83 |
| Maji et al. (2022) | Spectral, prosodic, voice quality feature + GBDT feature selection | Parallelized CNN–BiGRU + Fusion Operation | NA | SITB-OSED | 84.02% |
| Billa (2018) | VTLN factor and feature frame rate | Monolingual LSTM-CTC training | NA | IITM-C (Gujarati) | 20.91 |
| | | | | IITM-CR (Gujarati) | 21.44 |
| | | Multilingual LSTM-CTC training | | IITM-C (Gujarati) | 19.33 |
| | | | | IITM-CR (Gujarati) | 19.30 |
| Raval et al. (2021) | MFCC | CNN, BiLSTM, CTC | Prefix-LMs' and Spell corrector BERT | Microsoft Speech Corpus (Gujarati) | 64.78 |
| | | | Prefix-LMs | | 69.94 |
| | | | Greedy decoding | | 70.65 |
| Proposed approach | MFCC + GFCC | DeepSpeech2 model and CTC layer, single speaker | Greedy and Modified spell corrector BERT | Crowd-sourced, high-quality, and multi-speaker Gujarati language dataset | 28.32 |
| | | | Prefix beam search and Modified spell corrector BERT | | 27.66 |
| | | | Greedy and Modified spell corrector BERT | Noisy Gujarati dataset created | 40.63 |
| | | | Prefix beam search and Modified spell corrector BERT | | 36.65 |
| | | DeepSpeech2 model and CTC layer, multi speaker | Greedy and Modified spell corrector BERT | Crowd-sourced, high-quality, and multi-speaker (male and female) Gujarati language dataset | 31.49 |
| | | | Prefix beam search and Modified spell corrector BERT | | 32.64 |
| | | | Greedy and Modified spell corrector BERT | Noisy Gujarati dataset created | 44.2 |
| | | | Prefix beam search and Modified spell corrector BERT | | 41.29 |

ASR system, however, the WER was reduced by 11.1 percentage points (to 40.63%), resulting in enhanced accuracy.

Table 7 shows that on the clean dataset, the DeepSpeech2 ASR system's WER, which was initially 48.83% with prefix beam search decoding and MFCC features, was reduced by 12.18 percentage points (to 36.65%) with the developed multi-speaker ASR system. This indicates an increase in the system's precision.

Moving on to the assessment in a noisy environment, Table 8 shows that the initial WER of the DeepSpeech2 ASR system on the noisy dataset with greedy decoding and MFCC features was 54%. In the presence of noise, the WER was lowered by 12% using the multi-speaker ASR system, exhibiting improved performance.

Similarly, Table 9 shows the performance of the DeepSpeech2 ASR system on the noisy dataset with prefix beam search decoding and MFCC features, beginning with a WER of 49%. The WER was reduced by 13% with the multi-speaker ASR system, indicating improved accuracy even in the presence of noise.

These results show that the proposed multi-speaker ASR system is effective at lowering the WER of the DeepSpeech2 ASR system across a range of decoding approaches and settings. Reduced WER indicates enhanced transcription accuracy, making the multi-speaker ASR system a useful instrument for speech recognition in a variety of real-world scenarios.
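Sections 4.3 and 4.4 compare greedy decoding with prefix beam search on the CTC output. The following is a generic textbook-style sketch of the two decoders, not the authors' implementation: it works on plain per-frame probabilities, uses no language model, and the function names, the toy alphabet, and the beam width are all assumptions made for illustration. Greedy decoding keeps only the single most likely symbol per frame, whereas prefix beam search sums the probability of every frame-level path that collapses to the same output prefix.

```python
from collections import defaultdict


def ctc_greedy_decode(probs, alphabet, blank=0):
    """Take the best symbol per frame, merge repeats, then drop blanks."""
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in probs]
    out, prev = [], blank
    for s in best:
        if s != blank and s != prev:
            out.append(alphabet[s])
        prev = s
    return "".join(out)


def ctc_prefix_beam_search(probs, alphabet, beam_width=8, blank=0):
    """CTC prefix beam search without a language model."""
    # Each prefix maps to [P(prefix, ends in blank), P(prefix, ends in symbol)]
    beam = {(): [1.0, 0.0]}
    for frame in probs:
        next_beam = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_b, p_nb) in beam.items():
            for s, p in enumerate(frame):
                if s == blank:
                    # A blank leaves the prefix unchanged.
                    next_beam[prefix][0] += (p_b + p_nb) * p
                elif prefix and prefix[-1] == s:
                    # Repeated symbol: a new character only after a blank,
                    # otherwise it merges into the existing last character.
                    next_beam[prefix + (s,)][1] += p_b * p
                    next_beam[prefix][1] += p_nb * p
                else:
                    next_beam[prefix + (s,)][1] += (p_b + p_nb) * p
        ranked = sorted(next_beam.items(), key=lambda kv: sum(kv[1]), reverse=True)
        beam = dict(ranked[:beam_width])
    best_prefix = max(beam.items(), key=lambda kv: sum(kv[1]))[0]
    return "".join(alphabet[s] for s in best_prefix)


# Toy example: index 0 is the CTC blank, index 1 is the symbol "a".
toy_alphabet = ["_", "a"]
toy_probs = [[0.6, 0.4], [0.6, 0.4]]
print(repr(ctc_greedy_decode(toy_probs, toy_alphabet)))
print(repr(ctc_prefix_beam_search(toy_probs, toy_alphabet)))
```

In the toy example the blank is the most likely symbol in both frames, so greedy decoding returns an empty string with path probability 0.36. The three paths that collapse to "a" together have probability 0.64, which is why prefix beam search returns "a".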

4.5 Training and validation loss graph for multi-speaker system

This section presents the training and validation loss values for DeepSpeech2 ASR systems in clean and noisy environments using greedy and prefix beam search decoding. Figure 5 shows that the loss value is reduced for the multi-speaker ASR system with MFCC + GFCC features and prefix beam search in a clean environment, which has the best decrease in WER when compared to other systems with GFCC or MFCC features, combined or alone.

4.6 Discussion

The proposed approach to enhancing Gujarati Automatic Speech Recognition (ASR) models holds significant practical implications across various real-world scenarios. In overcoming the challenges posed by limited linguistic resources, the integration of features such as Mel Frequency Cepstral Coefficients (MFCC) and Gammatone Frequency Cepstral Coefficients (GFCC) using the DeepSpeech2 architecture showcases its adaptability. The incorporation of an improved spell correction technique based on the Bidirectional Encoder Representations from Transformers (BERT) algorithm further refines accuracy. These advancements bear relevance in applications such as transcription services, voice search technologies, and educational platforms, where accurate and reliable ASR systems in Gujarati can significantly enhance user experiences. Additionally, the demonstrated reduction in Word Error Rate (WER) by 10–12 percentage points compared to existing methodologies underscores the practical feasibility of deploying this approach in diverse domains, ranging from customer service interactions to inclusive accessibility services for individuals with hearing impairments. Overall, the findings highlight the potential for widespread adoption of the proposed methodology, addressing the specific challenges associated with low-resource languages like Gujarati in real-world contexts.

4.7 Comparing with existing works

In our proposed work we have used MFCC and GFCC feature extraction techniques for Gujarati speech to enhance the efficacy of E2E ASR systems. This method builds upon prior state-of-the-art methods that have used various features, including MFCC, log frequency spectrogram, VTLN factor and feature frame rate, and spectral, prosodic, and voice quality features, together with models such as CNN-BiLSTM, DeepSpeech 0.9.3, parallelized CNN–BiGRU, and monolingual or multilingual models. Our proposed combination of MFCC + GFCC features with the DeepSpeech2 architecture and prefix beam search, as well as the Modified Spell Corrector BERT algorithm for post-processing, significantly outperformed previous combinations. The observed results demonstrate that the proposed method reduces Word Error Rate (WER) by 10–12 percentage points, whereas the largest improvement reported in previous work was 5.87%. Table 10 shows that MFCC + GFCC features with the DeepSpeech2 architecture and prefix decoding techniques, as well as the Modified Spell Corrector BERT algorithm, outperform other models.

Overall, the proposed method incorporating MFCC + GFCC features, the DeepSpeech2 architecture, prefix beam search decoding, and the Modified Spell Corrector BERT algorithm outperforms previous combinations by effectively reducing WER and enhancing the transcription accuracy of Gujarati ASR systems.

5 Conclusion

This research introduces an approach to enhance the effectiveness of end-to-end (E2E) Automatic Speech Recognition (ASR) systems for Gujarati speech. Leveraging MFCC and GFCC feature extraction techniques, alongside the DeepSpeech2 architecture and prefix beam search decoding, our system improves on existing state-of-the-art methods. Notably, our approach consistently achieves a reduction in Word Error Rate (WER) of 10–12 percentage points when compared to prior works, surpassing the previous highest improvement of 5.87%. These outcomes underscore the effectiveness of MFCC + GFCC features and our proposed methodology in enhancing the accuracy and performance of Gujarati ASR systems. The findings underscore the potential of our proposed method to deliver superior transcription results, thereby opening avenues for more impactful speech recognition applications in Gujarati language processing.

Declarations

Conflict of interest  I, Mohit Dua, on behalf of all the authors, declare that this study did not receive any funding from any resource, that all the authors and the submitted manuscript do not have any conflict of interest, and that this article does not contain any studies with human participants or animals performed by any of the authors.

References

Amodei, D., et al. (2016). Deep Speech 2: End-to-end speech recognition in English and Mandarin. In Proceedings of the 33rd international conference on machine learning, 2016 (Vol. 48, pp. 173–182). Retrieved from https://proceedings.mlr.press/v48/amodei16.html
Anoop, C. S., & Ramakrishnan, A. G. (2021, July). CTC-based end-to-end ASR for the low resource Sanskrit language with spectrogram augmentation. In 2021 National conference on communications (NCC) (pp. 1–6). IEEE.

Bhogale, K., Raman, A., Javed, T., Doddapaneni, S., Kunchukuttan, A., Kumar, P., & Khapra, M. M. (2023, June). Effectiveness of mining audio and text pairs from public data for improving ASR systems for low-resource languages. In ICASSP 2023–2023 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 1–5). IEEE.
Billa, J. (2018). ISI ASR system for the low resource speech recognition challenge for Indian languages. In INTERSPEECH, 2018.
Cho, K., et al. (2014). Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), Oct 2014 (pp. 1724–1734). https://doi.org/10.3115/v1/D14-1179
Dave, D. (2015). An approach to increase word recognition accuracy in Gujarati language. International Journal of Innovative Research in Computer and Communication Engineering, 3, 6442–6450. https://doi.org/10.15680/ijircce.2015.0307012
Deshmukh, A. M. (2020). Comparison of hidden Markov model and recurrent neural network in automatic speech recognition. European Journal of Engineering and Technology Research, 5(8), 958–965. https://doi.org/10.24018/ejeng.2020.5.8.2077
Dua, M., Aggarwal, R. K., & Biswas, M. (2018). Discriminative training using noise robust integrated features and refined HMM modeling. Journal of Intelligent Systems, 29(1), 327–344.
Dua, M., Aggarwal, R. K., & Biswas, M. (2019). GFCC based discriminatively trained noise robust continuous ASR system for Hindi language. Journal of Ambient Intelligence and Humanized Computing, 10, 2301–2314.
Dubey, P., & Shah, B. (2022). Deep speech based end-to-end automated speech recognition (ASR) for Indian-English accents. arXiv preprint arXiv:2204.00977.
Forsberg, M. (2003). Why is speech recognition difficult.
Gaudani, H., & Patel, N. M. (2022). Comparative study of robust feature extraction techniques for ASR for limited resource Hindi language. In Proceedings of second international conference on sustainable expert systems: ICSES 2021 (pp. 763–775). Springer Nature.
Graves, A., & Jaitly, N. (2014). Towards end-to-end speech recognition with recurrent neural networks. In Proceedings of the 31st international conference on machine learning, 2014 (Vol. 32, no. 2, pp. 1764–1772). Retrieved from https://proceedings.mlr.press/v32/graves14.html
Graves, A., et al. (2006). Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on machine learning, 2006.
Hu, Y., Jing, X., Ko, Y. L., & Rayz, J. (2021). Misspelling correction with pre-trained contextual language model.
Joshi, B., Bhatta, B., Panday, S. P., & Maharjan, R. K. (2022). A novel deep learning based Nepali speech recognition. In Innovations in electrical and electronic engineering: Proceedings of ICEEE 2022 (Vol. 2, pp. 433–443). Springer.
Krishna, H., Gurugubelli, K., Vegesna, V., & Vuppala, A. (2018). An exploration towards joint acoustic modeling for Indian languages: IIIT-H submission for low resource speech recognition challenge for Indian languages. INTERSPEECH 2018 (pp. 3192–3196). https://doi.org/10.21437/Interspeech.2018-1584
Lakshminarayanan, V. (2022). Impact of noise in automatic speech recognition for low-resourced languages. Doctoral dissertation, Rochester Institute of Technology.
Maji, B., Swain, M., & Panda, R. (2022). A feature selection based parallelized CNN-BiGRU network for speech emotion recognition in Odia language.
Patel, D., & Goswami, M. (2014). Word level correction in Gujarati document using probabilistic approach. https://doi.org/10.1109/ICGCCEE.2014.6921395
Raval, D., Pathak, V., Patel, M., & Bhatt, B. (2021). Improving deep learning based automatic speech recognition for Gujarati. ACM Transactions on Asian and Low-Resource Language and Information Processing. https://doi.org/10.1145/3483446
Scharenborg, O., Ciannella, F., Palaskar, S., Black, A., Metze, F., Ondel, L., & Hasegawa-Johnson, M. (2017). Building an ASR system for a low-resource language through the adaptation of a high-resource language ASR system: Preliminary results. In Proceedings of international conference on natural language, signal and speech processing (ICNLSSP) (pp. 26–30).
Schuster, M., & Paliwal, K. K. (1997). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11), 2673–2681. https://doi.org/10.1109/78.650093
Srivastava, B., Abraham, B., Sitaram, S., Mehta, R., & Jyothi, P. (2019). End-to-end ASR for code-switched Hindi-English speech.
Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in neural information processing systems, 2014 (Vol. 27). https://proceedings.neurips.cc/paper/2014/file/a14ac55a4f27472c5d894ec1c3c743d2Paper.pdf
Tailor, J. H., & Shah, D. B. (2018). HMM-based lightweight speech recognition system for Gujarati language.
Toshniwal, S., et al. (2017). Multilingual speech recognition with a single end-to-end model.
Zhang, S., Huang, H., Liu, J., & Li, H. (2020). Spelling error correction with soft-masked BERT.

Publisher's Note  Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
