0% found this document useful (0 votes)
65 views75 pages

A Review Deep Learning Techiques For Speech Processing2023

This review paper discusses the transformative impact of deep learning techniques on speech processing, highlighting advancements in areas such as speech recognition, synthesis, and emotion recognition. It provides a comprehensive overview of various deep learning models, their applications, and the evolution of speech processing methodologies, from traditional models to modern architectures like CNNs and transformers. The paper also addresses challenges in the field and suggests future research directions, emphasizing the need for more efficient and interpretable models.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
65 views75 pages

A Review Deep Learning Techiques For Speech Processing2023

This review paper discusses the transformative impact of deep learning techniques on speech processing, highlighting advancements in areas such as speech recognition, synthesis, and emotion recognition. It provides a comprehensive overview of various deep learning models, their applications, and the evolution of speech processing methodologies, from traditional models to modern architectures like CNNs and transformers. The paper also addresses challenges in the field and suggests future research directions, emphasizing the need for more efficient and interpretable models.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 75

Version of Record: https://www.sciencedirect.

com/science/article/pii/S1566253523001859
Manuscript_32cd6dcf8f7e351977caa62c923a0185

A Review of Deep Learning Techniques for Speech Processing


Ambuj Mehrisha , Navonil Majumdera , Rishabh Bharadwaja , Rada Mihalceab and Soujanya Poriaa,∗
a ISTD, Singapore University of Technology and Design, Singapore
b University of Michigan, USA

ARTICLE INFO ABSTRACT


Keywords: The field of speech processing has undergone a transformative shift with the advent of deep learning.
Deep Learning, The use of multiple processing layers has enabled the creation of models capable of extracting intricate
Speech Processing, features from speech data. This development has paved the way for unparalleled advancements in
Transformers, speech recognition, text-to-speech synthesis, automatic speech recognition, and emotion recognition,
Survey, Trends. propelling the performance of these tasks to unprecedented heights. The power of deep learning
techniques has opened up new avenues for research and innovation in the field of speech processing,
with far-reaching implications for a range of industries and applications. This review paper provides a
comprehensive overview of the key deep learning models and their applications in speech-processing
tasks. We begin by tracing the evolution of speech processing research, from early approaches, such
as MFCC and HMM, to more recent advances in deep learning architectures, such as CNNs, RNNs,
transformers, conformers, and diffusion models. We categorize the approaches and compare their
strengths and weaknesses for solving speech-processing tasks. Furthermore, we extensively cover
various speech-processing tasks, datasets, and benchmarks used in the literature and describe how
different deep-learning networks have been utilized to tackle these tasks. Additionally, we discuss the
challenges and future directions of deep learning in speech processing, including the need for more
parameter-efficient, interpretable models and the potential of deep learning for multimodal speech
processing. By examining the field’s evolution, comparing and contrasting different approaches, and
highlighting future directions and challenges, we hope to inspire further research in this exciting and
rapidly advancing field.

1. Introduction icant advancements and breakthroughs in speech processing


research and development.
Humans employ language as a means to effectively con- Over the past few years, the field of speech processing has
vey their emotions and sentiments. Language encompasses
been transformed by introducing powerful tools, including
a collection of words forming a vocabulary, accompanied deep learning. Figure 1 illustrates the evolution of speech pro-
by grammar, which dictates the appropriate usage of these cessing models over the years, the rapid development of deep
words. It manifests in various forms, including written text, learning architecture for speech processing reflects the grow-
sign language, and spoken communication. Speech, specif- ing complexity and diversity of the field. This technology has
ically, entails the utilization of phonetic combinations of revolutionized the analysis and processing of speech signals
consonant and vowel sounds to articulate words from the
using deep neural networks (DNNs), convolutional neural
vocabulary. Phonetics, in turn, pertains to the production and networks (CNNs), and recurrent neural networks (RNNs).
perception of sounds by individuals. Through speech, indi- These architectures have proven highly effective in various
viduals are able to express themselves and convey meaning speech-processing applications, such as speech recognition,
in their chosen language. speaker recognition, and speech synthesis. This study com-
Speech processing is a field dedicated to the study and ap-
prehensively overviews the most critical and emerging deep-
plication of methods for analyzing and manipulating speech
learning techniques and their potential applications in various
signals. It encompasses a range of tasks, including auto- speech-processing tasks.
matic speech recognition (ASR) [628, 390], speaker recogni- Deep learning has revolutionized speech processing by
tion (SR) [31], and speech synthesis or text-to-speech [397]. its ability to automatically learn meaningful features from
In recent years, speech processing has garnered increasing raw speech signals, eliminating the need for manual feature
significance due to its diverse applications in areas such as
engineering. This breakthrough has led to significant ad-
telecommunications, healthcare, and entertainment. Notably,
vancements in speech processing performance, particularly
statistical modeling techniques, particularly Hidden Markov in challenging scenarios involving noise, as well as diverse
Models (HMMs), have played a pivotal role in advancing the accents and dialects. By leveraging the power of deep neural
field [149, 444]. These models have paved the way for signif- networks, speech processing systems can now adapt and gen-
∗ Corresponding author eralize more effectively, resulting in improved accuracy and
[email protected] (A. Mehrish); robustness in various applications. The inherent capability of
[email protected] (N. Majumder); deep learning to extract intricate patterns and representations
[email protected] (R. Bharadwaj); [email protected] (R.
Mihalcea); [email protected] (S. Poria)
from speech data has opened up new possibilities for tackling
ORCID (s): 0000-0003-4240-9915 (A. Mehrish); 0000-0002-1449-617X (N. real-world speech processing challenges.
Majumder); 0000-0003-1561-4286 (R. Bharadwaj); 0000-0002-0767-6703 (R. Deep learning architectures have emerged as powerful
Mihalcea); 0000-0003-3167-2208 (S. Poria)

Mehrish et al.: Preprint submitted to Elsevier Page 1 of 72

© 2023 published by Elsevier. This manuscript is made available under the Elsevier user license
https://www.elsevier.com/open-access/userlicense/1.0/
A Review of Deep Learning Techniques for Speech Processing

Vall-E
Whisper
• HuBERT
• Speechstew
• Wav2Vec 2.0
• FastSpeech2
• Conformer
Performance

• ContextNet

• LSTM
• GRU

HMM + GMM

2000s 2010s 2020 2021 2022 2023

Time

Figure 1: Evolution of speech processing models over the years.

tools in speech processing, offering remarkable improve- analysis, synthesis, and recognition of speech signals, and the
ments in various tasks. Pioneering studies, such as [185], integration of deep learning techniques has led to significant
have demonstrated the substantial gains achieved by deep advancements in these areas. By examining the current state-
neural networks (DNNs) in speech recognition accuracy com- of-the-art approaches, this paper aims to shed light on the
pared to traditional HMM-based systems. Complementing potential of deep learning for tackling the existing challenges
this, research in [3] showcased the effectiveness of convolu- and further advancing speech processing research.
tional neural networks (CNNs) for speech recognition. More- The paper provides a comprehensive exploration of deep-
over, recurrent neural networks (RNNs) have proven their learning architectures in the field of speech processing. It
efficacy in both speech recognition and synthesis, as high- begins by establishing the background, encompassing the
lighted in [161]. Recent advancements in deep learning have definition of speech signals, speech features, and traditional
further enhanced speech processing systems, with attention non-neural models. Subsequently, the focus shifts towards an
mechanisms [85] and transformers [555] playing significant in-depth examination of various deep-learning architectures
roles. Attention mechanisms enable the model to focus on specifically tailored for speech processing, including RNNs,
salient sections of the input signal, while transformers facil- CNNs, Transformers, GNNs, and diffusion models. Recog-
itate modeling long-range dependencies within the signal. nizing the significance of representation learning techniques
These developments have led to substantial improvements in this domain, the survey paper dedicates a dedicated section
in the performance and versatility of speech processing sys- to their exploration.
tems, unlocking new possibilities for applications in diverse Moving forward, the paper delves into an extensive range
domains. of speech processing tasks where deep learning has demon-
Although deep learning has made remarkable progress in strated substantial advancements. These tasks encompass
speech processing, it still faces certain challenges that need critical areas such as speech recognition, speech synthesis,
to be addressed. These challenges include the requirement speaker recognition, speech-to-speech translation, and speech
for substantial amounts of labeled data, the interpretability of synthesis. By thoroughly analyzing the fundamentals, model
the models, and their robustness to different environmental architectures, and specific tasks within the field, the paper
conditions. To provide a comprehensive understanding of the then progresses to discuss advanced transfer learning tech-
advancements in this domain, this paper presents an extensive niques, including domain adaptation, meta-learning, and
overview of deep learning architectures employed in speech- parameter-efficient transfer learning.
processing applications. Speech processing encompasses the Finally, in the conclusion, the paper reflects on the current

Mehrish et al.: Preprint submitted to Elsevier Page 2 of 72


A Review of Deep Learning Techniques for Speech Processing

state of the field and identifies potential future directions. By such as air pressure, to another form, typically an electrical
considering emerging trends and novel approaches, the paper signal.
aims to shed light on the evolving landscape of deep learning In signal processing, a signal that repetitively manifests
in speech processing and provide insights into promising after a fixed duration, known as a period, is classified as peri-
avenues for further research and development. odic. The reciprocal of this period represents the frequency
of the signal. The waveform of a periodic signal defines its
Why this paper? Deep learning has become a powerful tool shape and concurrently determines its timbre, which pertains
in speech processing because it automatically learns high- to the subjective perception of sound quality by humans. To
level representations of speech signals from raw audio data. facilitate the processing of speech, speech signals are com-
As a result, significant advancements have been made in var- monly digitized. This entails converting them into a series
ious speech-processing tasks, including speech recognition, of numerical values by measuring the signal’s amplitude at
speaker identification, speech synthesis, and more. These consistent time intervals. The sampling rate, defined by the
tasks are essential in various applications, such as human- number of samples collected per second, determines the gran-
computer interaction, speech-based search, and assistive tech- ularity of this digitization process.
nology for people with speech impairments. For example,
virtual assistants like Siri and Alexa use speech recognition 2.2. Speech Features
technology, while audiobooks and in-car navigation systems Speech features are numerical representations of speech
rely on text-to-speech systems. signals that are used for analysis, recognition, and synthesis.
Given the wide range of applications and the rapidly evolv- Broadly, speech signals can be classified into two categories:
ing nature of deep learning, a comprehensive review paper time-domain features and frequency-domain features.
that surveys the current state-of-the-art techniques and their Time-domain features are derived directly from the am-
applications in speech processing is necessary. Such a paper plitude of the speech signal over time. These are simple
can help researchers and practitioners stay up-to-date with to compute and often used in real-time speech-processing
the latest developments and trends and provide insights into applications. Some common time-domain features include:
potential areas for future research. However, to the best of
our knowledge, no current work covers a broad spectrum of • Energy: Energy is a quantitative measure of the ampli-
speech-processing tasks. tude characteristics of a speech signal over time. It is
A review paper on deep learning for speech processing computed by squaring each sample in the signal and
can also be a valuable resource for beginners interested in summing them within a specific time window. This
learning about the field. It can provide an overview of the captures the overall strength and dynamics of the sig-
fundamental concepts and techniques used in deep learning nal, revealing temporal variations in intensity. The
for speech processing and help them gain a deeper under- energy measure provides insights into segments with
standing of the field. While some survey papers focus on higher or lower amplitudes, aiding in speech recogni-
specific speech-processing tasks such as speech recognition, tion, audio segmentation, and speaker diarization. It
a broad survey would cover a wide range of other tasks such also helps identify events and transitions indicative of
as speaker recognition speech synthesis, and more. A broad changes in vocal activity. By quantifying amplitude
survey would highlight the commonalities and differences variations, energy analysis contributes to a comprehen-
between these tasks and provide a comprehensive view of sive understanding of speech signals and their acoustic
the advancements made in the field. properties.
• Zero-crossing rate: The zero-crossing rate indicates
2. Background how frequently the speech signal crosses the zero-axis
Before moving on to deep neural architectures, we discuss within a defined time frame. It is computed by counting
basic terms used in speech processing, low-level representa- the number of polarity changes in the signal during a
tions of speech signals, and traditional models used in the specific window.
field. • Pitch: Pitch refers to the perceived tonal quality in
a speaker’s voice, which is determined by analyzing
2.1. Speech Signals the fundamental frequency of the speech signal. The
Signal processing is a fundamental discipline that encom- fundamental frequency can be estimated through the
passes the study of quantities that exhibit variations in space application of pitch detection algorithms [443] or by
or time. In the realm of signal processing, a quantity exhibit- utilizing autocorrelation techniques [529].
ing spatial or temporal variations is commonly referred to as
a signal. Specifically, sound signals are defined as variations • Linear predictive coding (LPC):Linear Predictive Cod-
in air pressure. Consequently, a speech signal is identified as ing (LPC) is a powerful technique that represents the
a type of sound signal, namely pressure variations, generated speech signal as a linear combination of past samples,
by humans to facilitate spoken communication. Transducers employing an autoregressive model. The estimation of
play a vital role in converting these signals from one form, model parameters is accomplished through methods
like the Levinson-Durbin algorithm [54]. The obtained

Mehrish et al.: Preprint submitted to Elsevier Page 3 of 72


A Review of Deep Learning Techniques for Speech Processing

coefficients serve as a valuable feature representation on a non-linear mel frequency scale. The MFCCs con-
for various speech-processing tasks. sist of a set of coefficients that collectively form a Mel-
Frequency-domain features are derived from the signal frequency cepstrum 1 . With just 12 parameters related
represented in the frequency domain also known as its spec- to the amplitude of frequencies, MFCCs provide an ad-
trum. A spectrum captures the distribution of energy as a equate number of frequency channels to analyze audio,
function of frequency. Spectrograms are two-dimensional while still maintaining a compact representation. The
visual representations capturing the variations in a signal’s main objectives of MFCC extraction are to eliminate
spectrum over time. When compared against time-domain vocal fold excitation (F0) information related to pitch,
features, it is generally more complex to compute frequency- ensure the independence of the extracted features, align
domain features as they tend to involve time-frequency trans- with human perception of loudness and frequency, and
form operations such as Fourier transform. capture the contextual dynamics of phones. The pro-
cess of extracting MFCC features involves A/D con-
• Mel-spectrogram: A Mel spectrogram, also known as version, pre-emphasis filtering, framing, windowing,
a Mel-frequency spectrogram or Melspectrogram, is Fourier transform, Mel filter bank application, logarith-
a representation of the short-term power spectrum of mic operation, discrete cosine transform (DCT), and
a sound signal. It is widely used in audio signal pro- liftering. By following these steps, MFCCs enable the
cessing and speech recognition tasks. It is obtained by extraction of informative audio features while avoiding
converting the power spectrum of a speech signal into a redundancy and preserving the relevant characteristics
mel-scale, which is a perceptual scale of pitches based of the sound signal.
on the human auditory system’s response to different
frequencies. The mel-scale divides the frequency range Other types of speech features include formant frequen-
into a set of mel-frequency bands, with higher resolu- cies, pitch contour, cepstral coefficients, wavelet coefficients,
tion in the lower frequencies and coarser resolution in and spectral envelope. These features can be used for vari-
the higher frequencies. This scale is designed to mimic ous speech-processing tasks, including speech recognition,
the non-linear frequency perception of human hearing. speaker identification, emotion recognition, and speech syn-
To compute the Melspectrogram, the speech signal is thesis.
typically divided into short overlapping frames. For In the field of speech processing, frequency-based repre-
each frame, the Fast Fourier Transform (FFT) is ap- sentations such as Mel spectrogram and MFCC are widely
plied to obtain the power spectrum. The power spec- used since they are more robust to noise as compared to tem-
trum is then transformed into the mel-scale using a poral variations of the sound [7]. Time-domain features can
filterbank that converts the power values at different fre- be useful when the task warrants this information (such as
quencies to their corresponding mel-frequency bands. pauses, emotions, phoneme duration, and speech segments).
Finally, the logarithm of the mel-scale power values is It is noteworthy that the time-domain and frequency-domain
computed, resulting in the Melspectrogram. features tend to capture different sets of information and thus
Melspectrogram provides a time-frequency represen- can be used in conjunction to solve a task [514, 568, 531].
tation of the audio signal, where the time dimension
2.3. Traditional models for speech processing
corresponds to the frame index, and the frequency di-
Traditional speech representation learning algorithms
mension represents the mel-frequency bands. It cap-
based on shallow models utilize basic non-parametric models
tures both the spectral content and temporal dynamics
for extracting features from speech signals. The primary ob-
of the signal, making it useful for tasks such as speech
jective of these models is to extract significant features from
recognition, music analysis, and sound classification.
the speech signal through mathematical operations, such as
By using the Melspectrogram, the representation of
Fourier transforms, wavelet transforms, and linear predic-
the audio signal is transformed to a more perceptually
tive coding (LPC). The extracted features serve as inputs
meaningful domain, which can enhance the perfor-
to classification or regression models. The shallow models
mance of various audio processing algorithms. It is
aim to extract meaningful features from the speech signal,
particularly beneficial in scenarios where capturing the
enabling the classification or regression model to learn and
spectral patterns and frequency content of the signal
make accurate predictions.
is important for the analysis or classification task at
hand. • Gaussian Mixture Models (GMMs): Gaussian Mix-
• Mel-frequency cepstral coefficients (MFCCs): Mel- ture Models (GMMs) are powerful generative models
frequency cepstral coefficients (MFCCs) are a feature employed to represent the probability distribution of a
representation widely utilized in various applications speech feature vector. They achieve this by combining
such as speech recognition, gesture recognition, speaker multiple Gaussian distributions with different weights.
identification, and cetacean auditory perception sys- GMMs have found widespread applications in speaker
tems. MFCCs capture the power spectrum of a sound identification [259] and speech recognition tasks [463].
over a short duration by utilizing a linear cosine trans- 1 https://en.wikipedia.org/wiki/Mel-frequency_cepstrum

formation of a logarithmically-scaled power spectrum

Mehrish et al.: Preprint submitted to Elsevier Page 4 of 72


A Review of Deep Learning Techniques for Speech Processing

Specifically, in speaker identification, GMMs are uti- This algorithm has gained significant popularity due
lized to capture the distribution of speaker-specific to its practicality and intuitive nature, making it a re-
features, enabling the recognition of individuals based liable choice for classifying speech data in numerous
on their unique characteristics. Conversely, in speech real-world scenarios. By leveraging the proximity-
recognition, GMMs are employed to model the acous- based classification, KNN provides a straightforward
tic properties of speech sounds, facilitating accurate yet powerful method for accurately categorizing speech
recognition of spoken words and phrases. GMMs play samples based on their similarities to the training data.
a crucial role in these domains, enabling robust and Its versatility and ease of implementation contribute
efficient analysis of speech-related data. to its widespread adoption in various speech-related
domains, facilitating advancements in speaker recog-
• Support Vector Machines (SVMs): Support Vector nition, language identification, and other applications
Machines (SVMs) are a widely adopted class of su- in the field of speech processing.
pervised learning algorithms extensively utilized for
various speech classification tasks [506]. They are par- • Decision trees: Decision trees are widely employed
ticularly effective in domains like speaker recognition in speech classification tasks as a class of supervised
[174, 512, 511] and phoneme recognition [52]. SVMs learning algorithms. Their operation involves recur-
excel in their ability to identify optimal hyperplanes sively partitioning the feature space into smaller re-
that effectively separate different classes in the feature gions, guided by the values of the features. Within each
space. By leveraging this optimal separation, SVMs en- partition, a decision rule is established to assign the
able accurate classification and recognition of speech input feature vector to a specific class. The strength of
patterns. As a result, SVMs have become a fundamen- decision trees lies in their ability to capture complex de-
tal tool in the field of speech analysis and play a vital cision boundaries by hierarchically dividing the feature
role in enhancing the performance of speech-related space. By analyzing the values of the input features
classification tasks. at each node, decision trees efficiently navigate the
classification process. This approach not only provides
• Hidden Markov Models (HMMs): Hidden Markov interpretability, but also facilitates the identification of
Models (HMMs) have gained significant popularity as key features contributing to the classification outcome.
a powerful tool for performing various speech recog- Through their recursive partitioning mechanism, deci-
nition tasks, particularly ASR [149, 444]. In ASR, sion trees offer a flexible and versatile framework for
HMMs are employed to model the probability distri- speech classification. They excel in scenarios where
bution of speech sounds by incorporating a sequential the decision rules are based on discernible thresholds
arrangement of hidden states along with correspond- or ranges of feature values. The simplicity and trans-
ing observations. The training of HMMs is commonly parency of decision trees make them a valuable tool for
carried out using the Baum-Welch algorithm, a vari- understanding and solving speech-related classification
ant of the Expectation Maximization algorithm, which tasks.
enables effective parameter estimation and model opti-
mization2 . To summarize, conventional speech representation learn-
By leveraging HMMs in speech recognition, it be- ing algorithms based on shallow models entail feature ex-
comes possible to predict the most likely sequence traction from the speech signal, which is subsequently used
of speech sounds given an input speech signal. This as input for classification or regression models. These al-
enables accurate and efficient recognition of spoken gorithms have found extensive applications in speech pro-
language, making HMMs a crucial component in ad- cessing tasks like speech recognition, speaker identification,
vancing speech recognition technology. Their flexibil- and speech synthesis. However, they have been progres-
ity and ability to model temporal dependencies con- sively superseded by more advanced representation learning
tribute to their widespread use in ASR and various algorithms, particularly deep neural networks, due to their
other speech-related applications, further enhancing enhanced capabilities.
our understanding and utilization of spoken language.
• The K-nearest neighbors (KNN) algorithm is a sim- 3. Deep Learning Architectures and Their
ple yet effective classification approach utilized in a Applications in Speech Processing Tasks
wide range of speech-related applications, including Deep learning architectures have revolutionized the field
speaker recognition [477] and language recognition. of speech processing by demonstrating remarkable perfor-
The core principle of KNN involves identifying the mance across various tasks. With their ability to automati-
K-nearest neighbors of a given input feature vector cally learn hierarchical representations from raw speech data,
within the training data and assigning it to the class deep learning models have surpassed traditional approaches
that appears most frequently among those neighbors. in areas such as speech recognition, speaker identification,
2Wikipedia: Baum-Welch algorithm: http://en.wikipedia.org/wiki/
and speech synthesis. These architectures have been instru-
Baum%e2%80%93Welch_algorithm mental in capturing intricate patterns, uncovering latent fea-

Mehrish et al.: Preprint submitted to Elsevier Page 5 of 72


A Review of Deep Learning Techniques for Speech Processing

tures, and extracting valuable information from vast amounts RNNs [487]. BRRNs encode both future and past (input) con-
of speech data. In this section, we delve into the applications text in separate hidden layers. The outputs of the two RNNs
of deep learning architectures in speech processing tasks, are then combined at each time step, typically by concatenat-
exploring their potential, advancements, and the impact they ing them together, to create a new, richer representation that
have had on the field. By examining the key components and includes both past and future context.
techniques employed in these architectures, we aim to provide
insights into the current state-of-the-art in deep learning for ℎ⃖⃖⃗𝑡 = (𝑊ℎℎ ⃖⃗
⃖⃖⃗ ℎ𝑡−1 + 𝑊𝑥ℎ
⃖⃖⃗ 𝑥𝑡 + 𝑏⃗ℎ⃖ ) (3)
ℎ𝑡 = (𝑊⃖⃖⃖
speech processing and shed light on the exciting prospects it ⃖⃖⃖ ⃖⃖
ℎ + 𝑊⃖⃖⃖ 𝑥 + 𝑏ℎ⃖⃖ ) (4)
holds for future advancements in the field. ℎℎ 𝑡+1 𝑥ℎ 𝑡
𝑦𝑡 = 𝑊ℎ𝑦 ⃖⃗
⃖⃖⃗ ℎ𝑡 + 𝑊⃖⃖⃖
ℎ𝑦 𝑡
⃖⃖
ℎ + 𝑏𝑦 (5)
3.1. Recurrent Neural Networks (RNNs)
It is natural to consider Recurrent Neural Networks for where high dimensional hidden states ℎ ⃖⃗𝑡−1 and ⃖⃖
ℎ𝑡+1 are
various speech processing tasks since the input speech signal hidden states modeling the forward context from 1, 2, … , 𝑡−1
is inherently a dynamic process [479]. RNNs can model a and backward context from 𝑇 , 𝑇 − 1, … , 𝑡 + 1, respectively.
given time-varying (sequential) patterns that were otherwise
hard to capture by standard feedforward neural architectures. Long Short-Term Memory Vanilla RNNs are observed to
Initially, RNNs were used in conjunction with HMMs where face another limitation, that is, vanishing gradients that do
the sequential data is first modeled by HMMs while localized not allow them to learn from long-range context information.
classification is done by the neural network. However, such To overcome this, a variant of RNN, named as LSTM, was
a hybrid model tends to inherit limitations of HMMs, for specifically designed to address the vanishing gradient prob-
instance, HMM requires task-specific knowledge and inde- lem and enable the network to selectively retain (or forget)
pendence constraints for observed states [43]. To overcome information over longer periods of time [187]. This attribute
the limitations inherited by the hybrid approach, end-to-end is achieved by maintaining separate purpose-built memory
systems completely based on RNNs became popular for se- cells in the network: the long-term memory cell 𝑐𝑡 and the
quence transduction tasks such as speech recognition and short-term memory cell ℎ𝑡 . In Equation (2), LSTM redefines
text[158, 246]. Next, we discuss RNN and it’s variants: the operator  in terms of forget gate 𝑓𝑡 , input gate 𝑖𝑡 , and
output gate 𝑜𝑡 ,
3.1.1. RNN Models
Vanilla RNN Give input sequence of T states (𝑥1 , … , 𝑥𝑇 ) 𝑖𝑡 = 𝜎(𝑊𝑥𝑖 𝑥𝑡 + 𝑊ℎ𝑖 ℎ𝑡−1 + 𝑊𝑐𝑖 𝑐𝑡−1 + 𝑏𝑖 ), (6)
with 𝑥𝑖 ∈ ℝ𝑑 , the output state at time 𝑡 can be computed as 𝑓𝑡 = 𝜎(𝑊𝑥𝑓 𝑥𝑡 + 𝑊ℎ𝑓 ℎ𝑡−1 + 𝑊𝑐𝑓 𝑐𝑡−1 + 𝑏𝑓 ), (7)
ℎ𝑡 = (𝑊ℎℎ ℎ𝑡−1 + 𝑊𝑥ℎ 𝑥𝑡 + 𝑏ℎ ) (1) 𝑐𝑡 = 𝑓𝑡 ⊙ 𝑐𝑡−1 + 𝑖𝑡 ⊙ tanh (𝑊𝑥𝑐 𝑥𝑡 + 𝑊ℎ𝑐 ℎ𝑡−1 + 𝑏𝑐 ),
(8)
𝑜𝑡 = 𝜎(𝑊𝑥𝑜 𝑥𝑡 + 𝑊ℎ𝑜 ℎ𝑡−1 + 𝑊𝑐𝑜 𝑐𝑡 + 𝑏𝑜 ), (9)
𝑦𝑡 = 𝑊ℎ𝑦 ℎ𝑡 + 𝑏𝑦 (2) ℎ𝑡 = 𝑜𝑡 ⊙ tanh (𝑐𝑡 ), (10)

where 𝜎(𝑥) = 1∕(1 + 𝑒−𝑥 ) is a logistic sigmoid activation


where 𝑊ℎℎ , 𝑊ℎ𝑥 , 𝑊𝑦ℎ are weight matrices and 𝑏ℎ , 𝑏𝑦 are
bias vectors.  is non-linear activation functions such as
function. 𝑐𝑡 is a fusion of the information from the previous
state of the long-term memory 𝑐𝑡−1 , the previous state of
Tanh, ReLU, and Sigmoid. RNNs are made of high dimen-
short-term memory ℎ𝑡−1 , and current input 𝑥𝑡 . 𝑊 and 𝑏 are
sional hidden states, notice ℎ𝑡 in the above equation, which
weight matrices and biases. ⊙ is the element-wise vector
makes it possible for them to model sequences and help over-
multiplication or Hadamard operator. Bidirectional LSTMs
come the limitation of feedforward neural networks. The
(BLSTMs) can capture longer contexts in both forward and
state of the hidden layer is conditioned on the current input
backward directions [158].
and the previous state, which makes the underlying operation
recursive. Essentially, the hidden state ℎ𝑡−1 works as a mem- Gated Recurrent Units Gated Recurrent Units (GRU) aim
ory of past inputs {𝑥𝑘 }𝑡−1 that influence the current output to be a computationally-efficient approximate of LSTM by
𝑦𝑡 .
𝑘=1
using only two gates (vs three in LSTM) and a single memory
cell (vs two in LSTM). To control the flow of information
Bidirectional RNNs For numerous tasks in speech process-
over time, a GRU uses an update gate 𝑧𝑡 to decide how much
ing, it is more effective to process the whole utterance at once.
of the new input to be added to the previous hidden state and
For instance, in speech recognition, one-shot input transcrip-
a reset gate 𝑟𝑡 to decide how much of previous hidden state
tion can be more robust than transcribing based on the partial
information to be forgotten.
(i.e. previous) context information [161]. The vanilla RNN
has a limitation in such cases as they are unidirectional in 𝑧𝑡 = 𝜎(𝑊𝑥𝑧 𝑥𝑡 + 𝑊ℎ𝑧 ℎ𝑡−1 ), (11)
nature, that is, output 𝑦𝑡 is obtained from {𝑥𝑘 }𝑡𝑘=1 , and thus,
𝑟𝑡 = 𝜎(𝑊𝑥𝑟 𝑥𝑡 + 𝑊ℎ𝑟 ℎ𝑡−1 ), (12)
agnostic of what comes after time 𝑡. Bidirectional RNNs
(BRNNs) were proposed to overcome such shortcomings of ℎ𝑡 = (1 − 𝑧𝑡 ) ⊙ ℎ𝑡−1 + 𝑧𝑡 ⊙ tanh(𝑊𝑥ℎ 𝑥𝑡 + 𝑊𝑟ℎ (𝑟𝑡 ⊙ ℎ𝑡−1 )),
(13)

Mehrish et al.: Preprint submitted to Elsevier Page 6 of 72


A Review of Deep Learning Techniques for Speech Processing

The development of Bimodal Recurrent Neural Networks


(BRNN) has led to significant advancements in the field
where ⊙ is element-wise multiplication between the two of Audiovisual Speech Activity Detection (AV-SAD) [533].
vectors (Hadamard product). BRNNs have demonstrated immense potential in improving
RNNs and their variants are widely used in various deep the performance of speech recognition systems, particularly
learning applications like speech recognition, synthesis, and in noisy environments, by combining information from vari-
natural language understanding. Although seq2seq based on ous sources. By integrating separate RNNs for each modality,
recurrent architectures such as LSTM/GRU has made great BRNNs can capture temporal dependencies within and across
strides in speech processing, they suffer from the drawback modalities. This leads to successful outcomes in speech-
of slow training speed due to internal recurrence. Another based systems, where integrating audio and visual modal-
drawback of the RNN family is their inability to leverage ities is crucial for accurate speech recognition. Compared
information from long distant time steps accurately. to conventional audio-only systems, BRNN-based AV-SAD
systems display superior performance, particularly in chal-
Connectionist Temporal Classification Connectionist Tem- lenging acoustic conditions where audio-only systems might
poral Classification (CTC) [159] is a scoring and output func- struggle.
tion commonly used to train LSTM networks for sequence- To enhance the performance of continuous speech recog-
based problems with variable timing. CTC has been applied nition, LSTM networks have been utilized in hybrid architec-
to several tasks, including phoneme recognition, ASR, and tures alongside CNNs [419]. The CNNs extract local features
other sequence-based problems. One of the major benefits of from speech frames that are then processed by LSTMs over
CTC is its ability to handle unknown alignment between in- time [419]. LSTMs have also been employed for speech syn-
put and output, simplifying the training process. When used thesis, where they have been shown to enhance the quality of
in ASR [104, 105, 378], CTC eliminates the need for manual statistical parametric speech synthesis [419].
data labeling by assigning probability scores to the output Aside from their ASR and speech synthesis applications,
given any input signal. This is particularly advantageous for LSTM networks have been utilized for speech post-filtering.
tasks such as speech recognition and handwriting recognition, To improve the quality of synthesized speech, researchers
where the input and output can vary in size. CTC also solves have proposed deep learning-based post-filters, with LSTMs
the problem of having to specify the position of a character demonstrating superior performance over other post-filter
in the output, allowing for more efficient training of the neu- types [99]. Bidirectional LSTM (Bi-LSTM) is another vari-
ral network without post-processing the output. Finally, the ant of RNN that has been widely used for speech synthesis
CTC decoder can transform the neural network output into [136]. Several RNN-based analysis/synthesis models such as
the final text without post-processing. WaveNet [404], SampleRNN [373], and Tacotron have been
developed. These neural vocoder models can generate high-
3.1.2. Application
quality synthesized speech from acoustic features without
The utilization of RNNs in popular products such as
requiring intermediate vocoding steps.
Google’s voice search and Apple’s Siri to process user input
and predict the output has been well-documented [177, 304].
3.2. Convolutional Neural Networks
RNNs are frequently utilized in speech recognition tasks,
Convolutional neural networks (CNNs) are a specialized
such as the prediction of phonetic segments from audio sig-
class of deep neural architecture consisting of one or more
nals [414]. They excel in use cases where context plays a
pairs of alternating convolutional and pooling layers. A con-
vital role in outcome prediction and are distinct from CNNs
volution layer applies filters that process small local parts of
as they utilize feedback loops to process a data sequence that
the input, where these filters are replicated along the whole
informs the final output [414].
input space. A pooling layer converts convolution layer ac-
In recent times, there have been advancements in the ar-
tivations to low resolution by taking the maximum filter ac-
chitecture of RNNs, which have been primarily focused on
tivation within a specified window and shifting across the
developing end-to-end (E2E) models [302, 411] for ASR.
activation map. CNNs are variants of fully connected neural
These E2E models have replaced conventional hybrid models
networks widely used for processing data with grid-like topol-
and have displayed substantial enhancements in speech recog-
ogy. For example, time-series data (1D grid) with samples at
nition [302, 303]. However, a significant challenge faced by
regular intervals or images (2D grid) with pixels constitute a
E2E RNN models is the synchronization of the input speech
grid-like structure.
sequence with the output label sequence [158]. To tackle this
As discussed in Section 2, the speech spectrogram re-
issue, a loss function called CTC [159] is utilized for training
tains more information than hand-crafted features, including
RNN models, allowing for the repetition of labels to construct
speaker characteristics such as vocal tract length differences
paths of the same length as the input speech sequence. An
across speakers, distinct speaking styles causing formant to
alternative method is to employ an Attention-based Encoder-
undershoot or overshoot, etc. Also, explicitly expressed these
Decoder (AED) model based on RNN architecture, which
characteristics in the frequency domain. The spectrogram
utilizes an attention mechanism to align the input speech
representation shows very strong correlations in time and
sequence with the output label sequence. However, AED
frequency. Due to these characteristics of the spectrogram,
models tend to perform poorly on lengthy utterances.

Mehrish et al.: Preprint submitted to Elsevier Page 7 of 72


A Review of Deep Learning Techniques for Speech Processing

it is a suitable input for a CNN processing pipeline that re- to their 2D counterparts in certain applications. For exam-
quires preserving locality in both frequency and time axis. ple, Alsabhan [12] found that the performance of predicting
For speech signals, modeling local correlations with CNNs emotions with a 2D CNN model was lower compared to a
will be beneficial. The CNNs can also effectively extract the 1D CNN model.
structural features from the spectrogram and reduce the com- 1D convolution is useful in speech processing for several
plexity of the model through weight sharing. This section reasons:
will discuss the architecture of 1D and 2D CNNs used in
• Since, speech signals are sequences of amplitudes sam-
various speech-processing tasks.
pled over time, 1D convolution can be applied along
3.2.1. CNN Model Variants temporal dimension to capture temporal variations in
2D CNN Since spectrograms are two-dimensional visual the signal.
representations, one can leverage CNN architectures widely • Robustness to distortion and noise: Since, 1D con-
used for visual data processing (images and videos) by per- volution allows local feature extraction, the resultant
forming convolutions in two dimensions. The mathematical features are often resilient to global distortions of the
equation for a 2D convolutional layer can be represented as: signal. For instance, a speaker might be interrupted in
the middle of an utterance. Local features would still
(∑
𝐿 ∑
𝑀 )
𝑦(𝑘) 𝑥(𝑙) 𝑤(𝑘) + 𝑏(𝑘) (14) produce robust representations for those relevant spans,
𝑖,𝑗 = 𝜎 𝑖+𝑙−1,𝑗+𝑚−1 𝑙,𝑚 which is key to ASR, among many speech process-
ing task. On the other hand, speech signals are often
𝑙=1 𝑚=1

Here, 𝑥(𝑙) contaminated with noise, making extracting meaning-


𝑖,𝑗 is the pixel value of the 𝑙 input channel at the
𝑡ℎ
ful information difficult. 1D convolution followed by
spatial location (𝑖, 𝑗), 𝑤(𝑘) is the weight of the 𝑚𝑡ℎ filter at
𝑙,𝑚 pooling layers can mitigate the impact of noise [180],
the 𝑙 channel producing the 𝑘𝑡ℎ feature map, and 𝑏(𝑘) is the
𝑡ℎ
improving speech recognition systems’ accuracy.
bias term for the 𝑘𝑡ℎ feature map.
The output feature map 𝑦(𝑘) The basic building block of a 1D CNN is the convolu-
𝑖,𝑗 is obtained by convolving the
input image with the filters and then applying an activation tional layer, which applies a set of filters to the input data. A
function 𝜎 to introduce non-linearity. The convolution opera- convolutional layer employs a collection of adjustable param-
tion involves sliding the filter window over the input image, eters called filters to carry out convolution operations on the
computing the dot product between the filter and the input input data, resulting in a set of feature maps as the output,
pixels at each location, and producing a single output pixel. which represent the activation of each filter at each position
However, there are some drawbacks to using a 2D CNN in the input data. The size of the feature maps depends on the
for speech processing. One of the main issues is that 2D size of the input data, the size of the filters, and the number
convolutions are computationally expensive, especially for of filters used. The activation function used in a 1D CNN
large inputs. This is because 2D convolutions involve many is typically a non-linear function, such as the rectified linear
multiplications and additions, and the computational cost unit (ReLU) function.
grows quickly with the input size. Given an input sequence 𝑥 of length 𝑁, a set of 𝐾 filters
To address this issue, a 1D CNN can be designed to 𝑊𝑘 of length 𝑀, and a bias term 𝑏𝑘 , the output feature map
operate directly on the speech signal without needing a spec- 𝑦𝑘 of the 𝑘𝑡ℎ filter is given by
trogram. 1D convolutions are much less computationally 𝑀−1

expensive than 2D convolutions because they only operate 𝑦𝑘 [𝑛] = ReLU(𝑏𝑘 + 𝑊𝑘 [𝑚] ∗ 𝑥[𝑛 − 𝑚]) (15)
on one dimension of the input. This reduces the multipli- 𝑚=0
cations and additions required, making the network faster
and more efficient. In addition, 1D feature maps require less where 𝑛 ranges from 𝑀 − 1 to 𝑁 − 1, and ∗ denotes the
memory during processing, which is especially important for convolution operation. After the convolutional layer, the
real-time applications. A neural network’s memory require- output tensor is typically passed through a pooling layer,
ments are proportional to its feature maps’ size. By using 1D reducing the feature maps’ size by down-sampling. The most
convolutions, the size of the feature maps can be significantly commonly used pooling operation is the max-pooling, which
reduced, which can improve the efficiency of the network and keeps the maximum value from a sliding window across each
reduce the memory requirements. feature map.
CNNs often replace previously popular methods like
1D CNN 1D CNN is essentially a special case of 2D CNN HMMs and GMM-UBM in various cases. Moreover, CNNs
where the height of the filter is equal to the height the spec- possess the ability to acquire features that remain robust de-
togram. Thus, the filter only slides along the temporal dimen- spite variations in speech signals resulting from diverse speak-
sion and the height of the resultant feature maps is one. As ers, accents, and background noise. This is made possible due
such, 1D convolutions are computationally less expensive and to three key properties of CNNs: locality, weight sharing, and
memory efficient [261], as compared to 2D CNNs. Several pooling. The locality property enhances resilience against
studies [262, 245, 6] have shown that 1D CNNs are preferable non-white noise by enabling the computation of effective fea-
tures from cleaner portions of the spectrum. Consequently,

Mehrish et al.: Preprint submitted to Elsevier Page 8 of 72


A Review of Deep Learning Techniques for Speech Processing

only a smaller subset of features is affected by the noise, al-


lowing higher network layers a better opportunity to handle Output
the noise by combining higher-level features computed for
each frequency band. This improvement over standard fully
dilation=4

connected neural networks, which process all input features


hidden

in the lower layers, highlights the significance of locality. As dilation=2


a result, locality reduces the number of network weights that
must be learned.
hidden

dilation=1
3.2.2. Application
CNNs have proven to be versatile tools for a range of Input
speech-processing tasks. They have been successfully applied
to speech recognition [390, 4], including in hybrid NN-HMM
models for speech recognition, and can be used for multi-class
classification of words [5]. In addition, CNNs have been Figure 2: TCNNs leverage causal and dilated convolutions
proposed for speaker recognition in an emotional speech, to model temporal dependencies in sequential data. Causal
with a constrained CNN model presented in [498].
convolutions ensure that future information is not used dur-
ing training, while dilated convolutions increase the receptive
CNNs, both 1D and 2D, have emerged as the core building field without increasing computational complexity. This makes
block for various speech processing models, including acous- TCNNs an effective and efficient solution for a wide range of
tic models [485, 162, 273] in ASR systems. For instance, in tasks, including speech recognition, action recognition, and
2021, researchers from Facebook AI proposed wav2vec2.0 music analysis.
[485], a hybrid ASR system based on CNNs for learning
representations of raw speech signals that were then fed into
a transformer-based language model. The system achieved architecture combines the best practices of modern CNNs and
state-of-the-art results on several benchmark datasets. has demonstrated comparable performance to recurrent archi-
Similarly, Google’s VGGVox [92] used a CNN with VGG tectures such as LSTMs and GRUs. The TCN approach could
architecture to learn speaker embeddings from Mel spectro- revolutionize speech processing by providing an alternative
grams, achieving state-of-the-art results in speaker recogni- to the widely used recurrent neural network models.
tion. CNNs have also been widely used in developing state-
of-the-art speech enhancement and text-to-speech architec- 3.2.4. TCNN Model Variants
tures. For instance, the architecture proposed in [311, 543] for The architecture of TCNN is based upon two princi-
Deep Noise Suppression (DNS) [459] challenge and Google’s ples:(1) There is no information “leakage” from future to
Tacotron2 [493] are examples of models that use CNNs as past;(2) the architecture can map an input sequence of any
their core building blocks. In addition to traditional tasks length to an output sequence of the same length, similar to
like ASR and speaker identification, CNNs have also been RNN. TCN consists of dilated, causal 1D fully-convolutional
applied to non-traditional speech processing tasks like emo- layers with the same input and output lengths to satisfy the
tion recognition [230], Parkinson’s disease detection [224], above conditions. In other words, TCNN is simply a 1D
language identification [500] and sleep apnea detection [499]. fully-convolutional network (FCN) with casual convolutions
In all these tasks, CNN extracted features from speech signals as shown in Figure 2.
and fed them into the task classification model.
• Causal Convolution [404]: Causal convolution con-
3.2.3. Temporal Convolution Neural Networks volves the input at a specific time point 𝑡 solely with
Recurrent neural networks, including RNNs, LSTMs, and the temporally-prior elements.
GRUs, have long been popular for deep-learning sequence
modeling tasks. They are especially favored in the speech- • Dilated Convolution [629]: By itself, causal convolu-
processing domain. However, recent studies have revealed tion filters have a limited range of perception, meaning
that certain CNN architectures can achieve state-of-the-art they can only consider a fixed number of elements
accuracy in tasks such as audio synthesis, word-level lan- in the past. Therefore, it is challenging to learn any
guage modelling, and machine translation, as reported in dependency between temporally distant elements for
[233, 234, 102]. The advantage of convolutional neural net- longer sequences. Dilated convolution ameliorates this
works is that they enable faster training by allowing parallel limitation by repeatedly applying dilating filters to ex-
computation. They can avoid common issues associated with pand the range of perception, as shown in Figure 2.
recurrent models, such as the vanishing or exploding gradient The dilation is achieved by uniformly inserting zeros
problem or the inability to retain long-term memory. between the filter weights.
In a recent study by Bai et al. [30], they proposed a generic Consider a 1-D sequence 𝑥 ∈ 𝐑𝑛 and a filter: 𝑓 ∶
Temporal Convolutional Neural Network (TCNN) architec- {0, ..., 𝑘 − 1} → 𝐑, the dilated convolution operation
ture that can be applied to various speech-related tasks. This

Mehrish et al.: Preprint submitted to Elsevier Page 9 of 72


A Review of Deep Learning Techniques for Speech Processing

𝐹𝑑 on an element 𝑦 of the sequence is defined as precludes parallelization. Additionally, although dedicated


gated RNNs such as LSTM and GRU have helped to mitigate
the vanishing gradient problem to some extent, it can still be
𝑘−1

𝐹𝑑 (𝑦) = (𝑥 ∗𝑑 𝑓 )(𝑠) = 𝑓 (𝑖).𝑥𝑦−𝑑.𝑖 , (16) a challenge to maintain long-term dependencies in RNNs.
𝑖=0
Proposed by Vaswani et al. [555], Transformer solved
where 𝑘 is filter size, 𝑑 is dilation factor, and 𝑦 − 𝑑.𝑖 is a critical shortcoming of RNNs by allowing parallelization
the span along the past. The dilation step introduces a within the training sample, that is, facilitating the processing
fixed step between every two adjacent filter taps. When of the entire input sequence at once. Since then, the primary
𝑑 = 1, a dilated convolution acts as a normal convolu- idea of using only the attention mechanism to construct an
tion. Whereas, for larger dilation, the filter acts on a encoder and decoder has served as the basic recipe for many
wide but non-contiguous range of inputs. Therefore, state-of-the-art architectures across the domains of machine
dilation effectively expands the receptive field of the learning. In this survey, we use transformer to denote archi-
convolutional networks. tectures that are inspired by Transformer [109, 46, 447, 446,
167]. This section overviews the transformer’s fundamental
3.2.5. Application design proposed by Vaswani et al. [555] and its adaptations
Recent studies have shown that the TCNN architecture not for different speech-related applications.
only outperforms traditional recurrent networks like LSTMs
and GRUs in terms of accuracy but also possesses a set of 3.3.1. Basic Architecture
advantageous properties, including: Transformer architecture [555] comprises an attention-
based encoder and decoder, with each module consisting
• Parallelism is a key advantage of TCNN over RNNs. of a stack of identical blocks. Each block in the encoder
In RNNs, time-step predictions depend on their prede- and decoder consists of two sub-layers: a multi-head atten-
cessors’ completion, which limits parallel computation. tion (MHA) mechanism and a position-wise fully connected
In contrast, TCNNs apply the same filter to each span feedforward network as described in Figure 3. The MHA
in the input, allowing parallel application thereof. This mechanism in the encoder allows each input element to attend
feature enables more efficient processing of long input to every other element in the sequence, enabling the model to
sequences compared to RNNs that process sequentially. capture long-range dependencies in the input sequence. The
decoder typically uses a combination of MHA and encoder-
• The receptive field size can be modified in various ways
decoder attention to attend to both the input sequence and
to enhance the performance of TCNNs. For example,
the previously generated output elements. The feedforward
incorporating additional dilated convolutional layers,
network in each block of the Transformer provides non-linear
employing larger dilation factors, or augmenting the
transformations to the output of the attention mechanism.
filter size are all effective methods. Consequently, TC-
Next, we discuss operations involved in transformer layers,
NNs offer superior management of the model’s mem-
that is, multi-head attention and position-wise feedforward
ory size and are highly adaptable to diverse domains.
network:
• When dealing with lengthy input sequences, LSTM
and GRU models tend to consume a significant amount Attention in Transformers Attention mechanism, first pro-
of memory to retain the intermediate outcomes for posed by Bahdanau et al. [28], has revolutionized sequence
their numerous cell gates. On the other hand, TCNNs modeling and transduction models in various tasks of NLP,
utilize shared filters throughout a layer, and the back- speech, and computer vision [148, 80, 566, 60]. Broadly, it
propagation route depends solely on the depth of the allows the model to focus on specific parts of the input or out-
network. This makes TCNNs a more memory-efficient put sequence, without being limited by the distance between
alternative to LSTMs and GRUs, especially in scenar- the elements. We can describe the attention mechanism as
ios where memory constraints are a concern. the mapping of a query vector and set of key-value vector
pairs to an output. Precisely, the output vector is computed
TCNNs can perform real-time speech enhancement in as a weighted summation of value vectors where the weight
the time domain [413]. They have much fewer trainable of a value vector is obtained by computing the compatibility
parameters than earlier models, making them more efficient. between the query vector and key vector. Let, each query 𝑄
TCNs have also been used for speech and music detection in and key 𝐾 are 𝑑𝑘 dimensional and value 𝑉 is 𝑑𝑣 dimensional.
radio broadcasts [212, 297]. They have been used for single Specific to the Transformer, the compatibility function be-
channel speech enhancement [322, 466] and are trained as tween a query and each
√ key is computed as their dot product
filter banks to extract features from waveform to improve the between scaled by 𝑑𝑘 . To obtain the weights on values,
performance of ASR [306]. the scaled dot product values are passed through a softmax
function:
3.3. Transformers
While recurrence in RNNs (Section 3.1) is a boon for ( )
neural networks to model sequential data, it is also a bane as Attention(Q, K, V) = sof tmax
QK𝑇
V (17)

the recurrence in time to update the hidden state intrinsically 𝑑𝑘

Mehrish et al.: Preprint submitted to Elsevier Page 10 of 72


A Review of Deep Learning Techniques for Speech Processing

Attention(Q,K,V)
module for building a deep model. For example, each encoder
Linear block output can be defined as follows:
MatMul
MultiHeadAttn(Q,K,V)

(20)

Softmax
Concatenate 𝐻 = LayerNorm(Self Attention(𝑋) + 𝑋)
head M
Mask (opt) Scaled Dot Product Attention head 2
head 1

(21)
′ ′
Scale 𝐻 = LayerNorm(FFN(𝐻 ) + 𝐻 )
Linear Linear Linear

Self Attention(.) denotes attention module with Q = K =


MatMul

Q K V V = X, where X is the output of the previous layer.


Transformer-based architecture turned out to be better
Q K V

than many other architectures such as RNN, LSTM/GRU, etc.


Figure 3: Illustrations of attention (left) and multi-headed One of the major difficulties when applying a Transformer to
attention (right). speech applications that it requires more complex configura-
tions (e.g., optimizer, network structure, data augmentation)
than the conventional RNN-based models. Speech signals
Here multiple queries, keys, and value vectors, are packed are continuous-time signals with much higher dimensionality
together in matrix form respectively denoted by Q ∈ ℝ𝑁×𝑑𝑘 , than text data. This high dimensionality poses significant
K ∈ ℝ𝑀×𝑑𝑘 , and V ∈ ℝ𝑀×𝑑𝑣 . N and M represent the lengths computational challenges for the Transformer architecture,
of queries and keys (or values). Scaling of dot product atten- originally designed for sequential text data. Speech signals
tion becomes critical to tackling the issue of small gradients also have temporal dependencies, which means that the model
with the increase in 𝑑𝑘 [555]. needs to be able to process and learn from the entire signal
Instead of performing single attention in each transformer rather than just a sequence of inputs. Also, speech signals are
block, multiple attentions in lower-dimensional space have inherently variable and complex. The same sentence can be
been observed to work better [555]. This observation gave spoken differently and even by the same person at different
rise to Multi-Head Attention: For ℎ heads and dimension times. This variability requires the model to be robust to
of tokens in the model 𝑑𝑚 , the 𝑑𝑚 -dimensional query, key, differences in pitch, accent, and speed of speech.
and values are projected ℎ times to 𝑑𝑘 , 𝑑𝑘 , and 𝑑𝑣 dimensions
using learnable linear projections3 . Each head performs at- 3.3.2. Application
tention operation as per Equation (17). The ℎ 𝑑𝑣 -dimensional Recent advancements in NLP which lead to a paradigm
are concatenated and projected back to 𝑑𝑚 using another pro- shift in the field are highly attributed to the foundation models
jection matrix: that are primarily a part of the transformers category, with
self-attention being a key ingredient [42]. The recent mod-
els have demonstrated human-level performance in several
MultiHeadAttn(Q, K, V) = Concat(head 1 , .... head ℎ )W𝑂 ,professional and academic benchmarks. For instance, GPT4
(18) scored within the top 10% of test takers on a simulated ver-
𝑉sion of the Uniform Bar Examination [407]. While speech
with head 𝑖 = Attention(QW𝑖 , KW𝑖 , VW𝑖 )
𝑄 𝐾
processing has not yet seen a shift in paradigm as in NLP
(19) owing to the capabilities of foundational models, even so,
transformers have significantly contributed to advancement
Where W , W ∈ ℝ
𝑄 𝐾 𝑑 ×𝑑 ,W ∈ ℝ
𝑉 𝑑 ×𝑑 ,W ∈𝑂
in the field including but not limited to the following tasks:
𝑚𝑜𝑑𝑒𝑙 𝑘 𝑚𝑜𝑑𝑒𝑙 𝑣

ℝℎ𝑑𝑣 ×𝑑𝑚𝑜𝑑𝑒𝑙 are learnable projection matrices. Intuitively,


automatic speech recognition, speech translation, speech syn-
multiple attention heads allow for attending to parts of the
thesis, and speech enhancement, most of which we discuss
sequence differently (e.g., longer-term dependencies versus
in detail in Section 5.
shorter-term dependencies). Intuitively, multiple attention
RNNs and Transformers are two widely adopted neural
heads allow for attending in different representational spaces
network architectures employed in the domain of Natural Lan-
jointly.
guage Processing (NLP) and speech processing. While RNNs
Position-wise FFN The position-wise FNN consists of two process input words sequentially and preserve a hidden state
dense layers. It is referred to position-wise since the same two vector over time, Transformers analyze the entire sentence
dense layers are used for each positioned item in the sequence in parallel and incorporate an internal attention mechanism.
and are equivalent to applying two 1 × 1 convolution layers. This unique feature makes Transformers more efficient than
RNNs [244]. Moreover, Transformers employ an attention
Residual Connection and Normalization Residual con- mechanism that evaluates the relevance of other input tokens
nection and Layer Normalization are employed around each in encoding a specific token. This is particularly advanta-
geous in machine translation, as it allows the Transformer
3 Projection weights are neither shared across heads nor query, key, and
to incorporate contextual information, thereby enhancing
values.
translation accuracy [244]. To achieve this, Transformers

Mehrish et al.: Preprint submitted to Elsevier Page 11 of 72


A Review of Deep Learning Techniques for Speech Processing

combine word vector embeddings and positional encodings, discrete tokens as input, necessitating using a tokenizer or a
which are subsequently subjected to a sequence of encoders speech recognition system, introducing errors and noise. Fur-
and decoders. These fundamental differences between RNNs thermore, pre-training on large-scale text corpora can lead to
and Transformers establish the latter as a promising option domain mismatch problems when processing speech data. To
for various natural language processing tasks [244]. address these limitations, dedicated frameworks have been
A comparative study on transformer vs. RNN [244] in developed for learning speech representations using trans-
speech applications found that transformer neural networks formers, including wav2vec [485], data2vec [24], Whisper
achieve state-of-the-art performance in neural machine trans- [445], VALL-E [562], Unispeech [565], SpeechT5 [16] etc.
lation and other natural language processing applications We discuss some of them as follows.
[244]. The study compared and analysed transformer and
conventional RNNs in a total of 15 ASR, one multilingual • Speech representation learning frameworks, such as
ASR, one ST, and two TTS applications. The study found wav2vec, have enabled significant advancements in
that transformer neural networks outperformed RNNs in most speech processing tasks. One recent framework, w2v-
applications tested. Another survey of transformer-based BERT [585], combines contrastive learning and MLM
models in speech processing found that transformers have to achieve self-supervised speech pre-training on dis-
an advantage in comprehending speech, as they analyse the crete tokens. Fine-tuning wav2vec models with lim-
entire sentence simultaneously, whereas RNNs process input ited labeled data has also been demonstrated to achieve
words one by one. state-of-the-art results in speech recognition tasks [25].
Transformers have been successfully applied in end-to- Moreover, XLS-R [20], another model based on wav2vec
end speech processing, including automatic speech recogni- 2.0, has shown state-of-the-art results in various tasks,
tion (ASR), speech translation (ST), and text-to-speech (TTS) domains, data regimes, and languages, by leveraging
[309]. In 2018, the Speech-Transformer was introduced as a multilingual data augmentation and contrastive learn-
no-recurrence sequence-to-sequence model for speech recog- ing techniques on a large scale. These models learn
nition. To reduce the dimension difference between input universal speech representations that can be transferred
and output sequences, the model’s architecture was modified across languages and domains, thus representing a sig-
by adding convolutional neural network (CNN) layers before nificant advancement in speech representation learn-
feeding the features to the transformer. In a later study [388], ing.
the authors proposed a method to improve the performance • Transformers have been increasingly popular in the
of end-to-end speech recognition models based on transform- development of frameworks for learning representa-
ers. They integrated the connectionist temporal classification tions from multi-modal data, such as speech, images,
(CTC) with the transformer-based model to achieve better and text. Among these frameworks, Data2vec [24] is
accuracy and used language models to incorporate additional a self-supervised training approach that aims to learn
context and mitigate recognition errors. joint representations to capture cross-modal correla-
In addition to speech recognition, the transformer model tions and transfer knowledge across modalities. It has
has shown promising results in TTS applications. The trans- outperformed other unsupervised methods for learn-
former based TTS model generates mel-spectrograms, fol- ing multi-modal representations in benchmark datasets.
lowed by a WaveNet vocoder to output the final audio results However, for tasks that require domain-specific models,
[309]. Several neural network-based TTS models, such as such as speech recognition or speaker identification,
Tacotron 2, DeepVoice 3, and transformer TTS, have outper- domain-specific models may be more effective, partic-
formed traditional concatenative and statistical parametric ularly when dealing with data in specific domains or
approaches in terms of speech quality [309, 493, 428]. languages. The self-supervised training approach of
One of the strengths of Transformer-based architectures Data2vec enables cost-effective and scalable learning
for neural speech synthesis is their high efficiency while con- of representations without requiring labeled data, mak-
sidering the global context [162, 494]. The Transformer TTS ing it a promising framework for various multi-modal
model has shown advantages in training and inference effi- learning applications.
ciency over RNN-based models such as Tacotron 2 [493].
The efficiency of the Transformer TTS network can speed • The field of speech recognition has undergone a revolu-
up the training about 4.25 times [309]. Moreover, Multi- tionary change with the advent of the Whisper model
Speech, a multi-speaker TTS model based on the Transformer [445]. This innovative solution has proven to be highly
[309], has demonstrated the effectiveness of synthesizing a versatile, providing exceptional accuracy for various
more robust and better quality multi-speaker voice than naive speech-related tasks, even in challenging environments.
Transformer-based TTS. The Whisper model achieves its outstanding perfor-
In contrast to the strengths of Transformer-based architec- mance through a minimalist approach to data pre-processing
tures in neural speech synthesis, large language models based and weak supervision, which allows it to deliver state-
on Transformers such as BERT [109], GPT [446], XLNet of-the-art results in speech processing. The model is
[618], and T5 [450] have limitations when it comes to speech capable of performing multilingual speech recogni-
processing. One of the issues is that these models require tion, translation, and language identification, thanks to

Mehrish et al.: Preprint submitted to Elsevier Page 12 of 72


A Review of Deep Learning Techniques for Speech Processing

VQ-Wav2Vec Wav2Vec 2.0/XLSR-53 WavLM/UniSpeech-SAT

UniSpeech/DeCoAR 2.0 BigSSL


Mockingjay W2V-BERT
XLS-R
DiscreteBERT Conformer Wav2Vec-Conformer HuBERT

2019 2020 2021 2022

8B
34M 118M 317M 317M

85M 317M 1B 2B

110M 1B 317M

Figure 4: Timeline highlighting notable large Transformer models developed for speech
processing, along with their corresponding parameter sizes.

its training on a diverse audio dataset. Its multitask- ing a significant advancement in TTS technology.
ing model can cater to various speech-related tasks,
The timeline highlights the development of large trans-
such as transcription, voice assistants, education, en-
former based models for speech processing is shown in Fig-
tertainment, and accessibility. One of the unique fea-
ure 4. The size of the models has grown exponentially, with
tures of Whisper is its minimalist approach to data
significant breakthroughs achieved in speech recognition,
pre-processing, which eliminates the need for signifi-
synthesis, and translation. These large models have set new
cant standardization and simplifies the speech recogni-
performance benchmarks in the field of speech processing,
tion pipeline. The resulting models generalize well to
but also pose significant computational and data requirements
standard benchmarks and deliver competitive perfor-
for training and inference.
mance without fine-tuning, demonstrating the potential
of advanced machine learning techniques in speech 3.4. Conformer
processing.
The Transformer architecture, which utilizes a self-attention
• Text-to-speech synthesis has been a topic of interest for mechanism, has successfully replaced recurrent operations in
many years, and recent advancements have led to the previous architectures. Over the past few years, various Trans-
development of new models such as VALL-E [562]. former variants have been proposed [162]. Architectures
VALL-E is a novel text-to-speech synthesis model that combining Transformers and CNNs have recently shown
has gained significant attention due to its unique ap- promising results on speech-processing tasks [582]. To ef-
proach to the task. Unlike traditional TTS systems, ficiently model both local and global dependencies of an
VALL-E treats the task as a conditional language mod- audio sequence, several attempts have been made to com-
elling problem and leverages a large amount of semi- bine CNNs and Transformers. One such architecture pro-
supervised data to train a generalized TTS system. It posed by the authors is the Conformer [162], a convolution-
can generate high-quality personalized speech with augmented transformer for speech recognition. Conformer
a 3-second acoustic prompt from an unseen speaker outperforms RNNs, previous Transformers, and CNN-based
and provides diverse outputs with the same input text. models, achieving state-of-the-art performance in speech
VALL-E also preserves the acoustic environment and recognition. The Conformer model consists of several build-
the speaker’s emotions about the acoustic prompt, with- ing blocks, including convolutional layers, self-attention lay-
out requiring additional structure engineering, pre- ers, and feedforward layers. The architecture of the Con-
designed acoustic features, or fine-tuning. Further- former model can be summarized as follows:
more, VALL-E X [659] is an extension of VALL-E • Input Layer: The Conformer model inputs a sequence
that enables cross-lingual speech synthesis, represent- of audio features, such as MFCCs or Mel spectrograms.

Mehrish et al.: Preprint submitted to Elsevier Page 13 of 72


A Review of Deep Learning Techniques for Speech Processing

• Convolutional Layers: Local features are extracted has demonstrated exceptional performance in elderly speech
from the audio signal by processing the input sequence recognition and shows promise for the clinical diagnosis and
through convolutional layers. treatment of Alzheimer’s disease.
Several enhancements have been made to the Conformer-
• Self-Attention Layers: The Conformer model incorpo- based model to address high word error rates without a lan-
rates self-attention layers following the convolutional guage model, as documented in [336]. Wu [598] proposed
layers. Self-attention is a mechanism that enables the a deep sparse Conformer to improve its long-sequence rep-
model to focus on various sections of the input se- resentation capabilities. Furthermore, Burchi and Timofte
quence while making predictions. This is especially [49] have recently enhanced the noise robustness of the Effi-
advantageous for speech recognition because it facil- cient Conformer architecture by processing both audio and
itates capturing long-term dependencies in the audio visual modalities. In addition, models based on Conformer,
signal. such as Transducers [252], have been adopted for real-time
• Feedforward Layers: After the self-attention layers, speech recognition [414] due to their ability to process audio
the Conformer model applies a sequence of feedfor- data much more quickly than conventional recurrent neural
ward layers intended to process the output of the self- network (RNN) models.
attention layers further and ready it for the ultimate
3.5. Sequence to Sequence Models
prediction.
The sequence-to-sequence (seq2seq) model in speech
• Output Layer: Finally, the output from the feedfor- processing is popularly used for ASR, ST, and TTS tasks.
ward layers undergoes a softmax activation function to The general architecture of the seq2seq model involves an
generate the final prediction, typically representing a encoder-decoder network that learns to map an input sequence
sequence of character labels or phonemes. to an output sequence of varying lengths. In the case of ASR,
the input sequence is the speech signal, which is processed
The conformer model has emerged as a promising neu- by the encoder network to produce a fixed-length feature
ral network architecture for various speech-related research vector representation of the input signal. The decoder network
tasks, including but not limited to speech recognition, speaker inputs this feature vector and produces the corresponding text
recognition, and language identification. In a recent study by sequence. This can be achieved through a stack of RNNs
Gulati et al. [162], the conformer model was demonstrated [436], Transformer [116] or Conformer [162] in the encoder
to outperform previous state-of-the-art models, particularly and decoder networks.
in speech recognition significantly. This highlights the po- The sequence-to-sequence model has emerged as a potent
tential of the conformer model as a key tool for advancing tool in speech translation. It can train end-to-end to efficiently
speech-related research. map speech spectrograms in one language to their correspond-
ing spectrograms in another. The notable advantage of this
3.4.1. Application approach is eliminating the need for an intermediate text rep-
The Conformer model stands out among other speech resentation, resulting in improved efficiency. Additionally,
recognition models due to its ability to efficiently model both the Seq2seq models have been successfully implemented
local and global dependencies of an audio sequence. This in speech generation tasks, where they reverse the ASR ap-
is crucial for speech recognition, language translation, and proach. In such applications, the input text sequence serves as
audio classification [1, 162, 2]. The model achieves this the input, with the encoder network creating a feature vector
through self-attention and convolution modules, combining representation of the input text. The decoder network then
the strengths of CNNs and Transformers. While CNNs cap- leverages this representation to generate the desired speech
ture local information in audio sequences, the self-attention signal.
mechanism captures global dependencies [2]. The Conformer Karita et al. [244] conducted an extensive study com-
model has achieved remarkable performance in speech recog- paring the performance of transformer and traditional RNN
nition tasks, setting benchmarks on datasets such as Lib- models on 15 different benchmarks for Automatic Speech
riSpeech and AISHELL-1. Recognition (ASR), including a multilingual ASR bench-
Despite these successes, speech synthesis and recognition mark, a Speech Translation (ST) benchmark, and two Text-
challenges persist, including difficulties generating natural- to-Speech (TTS) benchmarks. In addition, they proposed
sounding speech in non-English languages and real-time a shared Sequence-to-Sequence (S2S) architecture for AST,
speech generation. To address these limitations, Wang et TTS, and ST tasks, which is depicted in Figure 5.
al. [658] proposed a novel approach that combines noisy stu-
dent training with SpecAugment and large Conformer models • Encoder
pre-trained on the Libri-Light dataset using the wav2vec 2.0 𝑋0 = Encoder−PreNet(𝑋),
pre-training method. This approach achieved state-of-the-art (22)
𝑋𝑒 = Encoder−Main(𝑋0 )
word error rates on the LibriSpeech dataset. Recently, Wang
et al. [575] developed Conformer-LHUC, an extension of the where 𝑋 is the sequence of speech features (e.g. Mel
Conformer model that employs learning hidden unit contri- spectrogram) for AST and ST and phoneme or charac-
bution (LHUC) for speaker adaptation. Conformer-LHUC ter sequence for TTS.

Mehrish et al.: Preprint submitted to Elsevier Page 14 of 72


A Review of Deep Learning Techniques for Speech Processing

ASR: CE,CTC of ASR has seen significant progress, with several advanced
ST: CE
techniques emerging as popular options. These include the
CTC approach, which has been further developed and im-
TTs: L1, L2, BCE

proved upon through recent advancements [160], as well as


Decoder-PostNet attention-based approaches that have also gained traction [85].
The growing interest in these techniques has increased the
use of seq2seq models in the speech community.
ASR/ST: Linear (CE)
ASR: Linear (CTC)
TTS: Post-net

• Attention-based Approaches: The attention mecha-


Encoder-Main Decoder-Main nism is a crucial component of sequence-to-sequence
Source Attention models, allowing them to effectively weigh input acous-
tic features during decoding [28, 355]. Attention-based
Bi-directional +
RNN / Self Attention Uni-directional
RNN/ Self Attention Seq2seq models utilize previously generated output to-
Encoder-PreNet Decoder-PreNet kens and the complete input sequence to factorize the
joint probability of the target sequence into individual
time steps. The attention mechanism is conditioned
ASR/ST: Subsample ASR/ST: Embed
TTS: Pre-net
on the current decoder states and runs over the en-
TTS: PreNet

coder output representations to incorporate information


from the input sequence into the decoder output. In-
Source Target corporating attention mechanisms in Seq2Seq models
Sequence Sequence
has resulted in an impressive performance in various
speech processing tasks, such as speech recognition
Figure 5: Unified formulation for Sequence-to-Sequence archi- [389, 436, 541, 591], text-to-speech [493, 620, 401],
tecture in speech applications [244]. 𝑋 and 𝑌 are source and and voice conversion [530, 210]. These models have
target sequences respectively. demonstrated competitiveness with traditional state-
of-the-art approaches. Additionally, attention-based
Seq2Seq models have been used for confidence esti-
• Decoder mation tasks in speech recognition, where confidence
scores generated by a speech recognizer can assess
𝑌0 [1 ∶ 𝑡 − 1] = Decoder−PreNet(𝑌 [1 ∶ 𝑡 − 1]), transcription quality [312]. Furthermore, these models
𝑌𝑑 [𝑡] = Decoder−Main(𝑋𝑒 , 𝑌0 [1 ∶ 𝑡 − 1]), have been explored for few-shot learning, which has
the potential to simplify the training and deployment
𝑌𝑝𝑜𝑠𝑡 [1 ∶ 𝑡] = Decoder−PostNet(𝑌𝑑 [1 ∶ 𝑡]),
of speech recognition systems [183].
(23)
• Connectionist Temporal Classification: While attention-
During the training stage, input to the decoder is ground based methods create a soft alignment between input
truth target sequence 𝑌 [1 ∶ 𝑡 − 1]. The Decoder-Main and target sequences, approaches that utilize CTC loss
module is utilized to produce a subsequent target frame. aim to maximize log conditional likelihood by con-
This is accomplished by utilizing the encoded sequence sidering all possible monotonic alignments between
𝑋𝑒 and the prefix of the target prefix 𝑌0 [1 ∶ 𝑡 − 1]. The them. These CTC-based Seq2Seq models have deliv-
decoder is mostly unidirectional for sequence gener- ered competitive results across various ASR bench-
ation and often uses an attention mechanism [28] to marks [182, 365, 526, 162] and have been extended
produce the output. to other speech-processing tasks such as voice con-
version [648, 655, 339], speech synthesis [648] etc.
Seq2seq models have been widely used in speech pro- Recent studies have concentrated on enhancing the
cessing, initially based on RNNs. However, RNNs face the performance of Seq2Seq models by combining CTC
challenge of processing long sequences, which can lead to with attention-based mechanisms, resulting in promis-
the loss of the initial context by the end of the sequence ing outcomes. This combination remains a subject of
[244]. To overcome this limitation, the transformer archi- active investigation in the speech-processing domain.
tecture has emerged, leveraging self-attention mechanisms
to handle sequential data. The transformer has shown re- 3.6. Reinforcement Learning
markable performance in tasks such as ASR, ST, and speech Reinforcement learning (RL) is a machine learning paradigm
synthesis. As a result, the use of RNN-based seq2seq models that trains an agent to perform discrete actions in an envi-
has declined in favour of the transformer-based approach. ronment and receive rewards or punishments based on its
interactions. The agent aims to learn a policy that maximizes
3.5.1. Application its long-term reward. In recent years, RL has become in-
Seq2seq models have been used for speech processing creasingly popular and has been applied to various domains,
tasks such as voice conversion [530, 210], speech synthesis including robotics, game playing, and natural language pro-
[583, 567, 399, 400, 210], and speech recognition. The field cessing. RL has been utilized in speech recognition, speaker

Mehrish et al.: Preprint submitted to Elsevier Page 15 of 72


A Review of Deep Learning Techniques for Speech Processing

diarization, and speech enhancement tasks in the speech field. accurately. In this case, the system receives an audio input and
One of the significant benefits of using RL for speech tasks outputs a text sequence corresponding to the spoken words.
is its ability to learn directly from raw audio data, eliminat- The environmental states might be learned from the input
ing the need for hand-engineered features. This can result audio features. The actions might be the generated phonemes.
in better performance compared to traditional methods that The reward could be the similarity between the generated and
rely on feature extraction. By capturing intricate patterns gold phonemes, quantified in edit distance. Several works
and relationships in the audio data, RL-based speech systems have also achieved promising results for non-native speech
have the potential to enhance accuracy and robustness. recognition [448]
DRL pre-training has shown promise in reducing training
3.6.1. Basic Models time and enhancing performance in various Human-Computer
The utilization of deep reinforcement learning (DRL) in Interaction (HCI) applications, including speech recognition
speech processing involves the environment (a set of states [453]. Recently, researchers have suggested using a reinforce-
𝑆), agent, actions (𝐴), and reward (𝑟). The semantics of ment learning algorithm to develop a Speech Enhancement
these components depends on the task at hand. For instance, (SE) system that effectively improves ASR systems. However,
in ASR tasks, the environment can be composed of speech ASR systems are often complicated and composed of non-
features, the action can be the choices of phonemes, and the differentiable units, such as acoustic and language models.
reward could be the correctness of those phonemes given the Therefore, the ASR system’s recognition outcomes should be
input. Audio signals are one-dimensional time-series signals employed to establish the objective function for optimizing
that undergo pre-processing and feature extraction procedures. the SE model. Other than ASR, SE, some studies have also
Pre-processing steps include noise suppression, silence re- focused on SER using DRL algorithms [282, 454, 243]
moval, and channel equalization, improving audio signal
quality and creating robust and efficient audio-based systems. Speaker identification Similarly, for speaker identification
Previous research has demonstrated that pre-processing im- tasks, the actions can be the speaker’s choices, and a binary
proves the performance of deep learning-based audio systems reward can be the correctness of choice.
[288].
Feature extraction is typically performed after pre-processing Speech synthesis and coding Likewise, the states can be
to convert the audio signal into meaningful and informative the input text, the actions can be the generated audio, and the
features while reducing their number. MFCCs and spectro- reward could be the similarity between the gold and generated
grams are popular feature extraction choices in speech-based mel-spectrogram.
systems [288]. These features are then given to the DRL Deep reinforcement learning has several advantages over
agent to perform various tasks depending on the application. traditional machine learning techniques. It can learn from
For instance, consider the scenario where a human speaks to raw data without needing hand-engineered features, making it
a DRL-trained machine, where the machine must act based more flexible and adaptable. It can also learn from feedback,
on features derived from audio signals. making it more robust and able to handle noisy environments.
However, deep reinforcement learning also has some chal-
• Value-based DRL: Given the state of the environment lenges that must be addressed. It requires a lot of data to train
(𝑠), a value function 𝑄 ∶ 𝑆 × 𝐴 → ℝ is learned to and can be computationally expensive. It also requires care-
estimate overall future reward 𝑄(𝑠, 𝑎) should an action ful selection of the reward function to ensure that the system
𝑎 be taken. This value function is parameterized with learns the desired behavior.
deep networks like CNN, Transformers, etc.
3.7. Graph Neural Network
• Policy-based DRL: As opposed to value-based RL, Over the past few years, the field of Graph Neural Net-
policy-based RL methods learns a policy function 𝜋 ∶ works (GNNs) has witnessed a remarkable expansion as a
𝑆 → 𝐴 that chooses the best possible action (𝑎) based widely adopted approach for analysing and learning from data
on reward. on graphs. GNNs have demonstrated their potential in various
• Model-based DRL: Unlike the previous two approaches, domains, including computer science, physics, mathematics,
model-based RL learns the dynamics of the environ- chemistry, and biology, by delivering successful outcomes.
ment in terms of the state transition probabilities, i.e., Furthermore, in recent times, the speech-processing domain
a function 𝑀 ∶ 𝑆 × 𝐴 × 𝑆 → ℝ. Given such a model, has also witnessed the growth of GNNs.
policy, or value functions are optimized.
3.7.1. Basic Models
3.6.2. Application Speech processing involves analysing and processing au-
In speech-related research, deep reinforcement learning dio signals, and GNNs can be useful in this context when we
can be used for several purposes, including: represent the audio data as a graph. In this answer, we will
explain the architecture of GNNs for speech processing. The
Speech recognition and Emotion modeling Deep rein- standard GNN pipeline is shown in Figure 6, according to
forcement learning (DRL) can be used to train speech recog- the application the GNN layer can consist of Graph Convolu-
nition systems [231, 453, 536, 88, 89] to transcribe speech tional Layers [652], Graph Attention Layers [556], or Graph

Mehrish et al.: Preprint submitted to Elsevier Page 16 of 72


A Review of Deep Learning Techniques for Speech Processing

h1 h1
h2 h2
Graph G h0 MLP
h0 Graph Predcition
GNNl

h3 h3 MLP
h4 Concat(hiL,hjL) Edge Prediction
Embed. h4
Edge Features {e0ij}
Layer l: {hli}, {elij} Layer l+1: {hl+1i}, {el+1ij} MLP
Embed. {hLi} Node Prediction
Node Features {h0i}
Input Layer L x GNN Layer Prediction Layer

Figure 6: A standard experimental pipeline for GCNs, which embeds the graph node and embeds the graph node edge features,
performs several GNN layers to compute convolutional features, and finally predicts a task-specific MLP layer.

Transformer [632]. interrelationships between various entities, which is suit-


able for speech processing tasks such as speaker diariza-
Graph Representation of Speech Data The first step in tion [502, 571, 501], speaker verification [228, 496], speech
using GNNs for speech processing is representing the speech synthesis [523, 338, 522], or speech separation [576, 391],
data as a graph. One way to do this is to represent the speech which require the analysis of complex data representations.
signal as a sequence of frames, each representing a short GNNs retain a state representing information from their neigh-
audio signal segment. We can then represent each frame as a borhood with arbitrary depth, unlike standard neural net-
node in the graph, with edges connecting adjacent frames. works. GNNs can be used to model the relationship between
phonemes and words. GNNs can learn to recognize words
Graph Convolutional Layers Once the speech data is rep- in spoken language by treating the phoneme sequence as a
resented as a graph, we can use graph convolutional layers to graph. GNNs can also be used to model the relationship be-
learn representations of the graph nodes. Graph convolutional tween different acoustic features, such as pitch, duration, and
layers are similar to traditional ones, but instead of operating amplitude, in speech signals, improving speech recognition
on a grid-like structure, they operate on graphs. These layers accuracy.
learn to aggregate information from neighboring nodes to GNNs have shown promising results in multichannel
update the features of each node. speech enhancement, where they are used for extracting
clean speech from noisy mixtures captured by multiple mi-
Graph Attention Layers Graph attention layers can be
crophones [544]. The authors of a recent study [392] pro-
combined with graph convolutional layers to give more im-
pose a novel approach to multichannel speech enhancement
portance to certain nodes in the graph. Graph attention layers
by combining Graph Convolutional Networks (GCNs) with
learn to assign weights to neighbor nodes based on their fea-
spatial filtering techniques such as the Minimum Variance
tures, which can help capture important patterns in speech
Distortionless Response (MVDR) beamformer. The algo-
data. Several works have used graph attention layers for neu-
rithm aims to extract speech and noise from noisy signals
ral speech synthesis [338] or speaker verification [227] and
by computing the Power Spectral Density (PSD) matrices of
diarization [277].
the noise and the speech signal of interest and then obtaining
Recurrent Layers Recurrent layers can be used in GNNs optimal weights for the beam former using a frequency-time
for speech processing to capture temporal dependencies be- mask. The proposed method combines the MVDR beam
tween adjacent frames in the audio signal. Recurrent layers former with a super-Gaussian joint maximum a posteriori
allow the network to maintain an internal state that carries (SGJMAP) based SE gain function and a GCN-based sep-
information from previous time steps, which can be useful aration network. The SGJMAP-based SE gain function is
for modeling the dynamics of speech signals. used to enhance the speech signals, while the GCN-based
separation network is used to separate the speech from the
Output Layers The output layer of a GNN for speech pro- noise further.
cessing can be a classification layer that predicts a label for
the speech data (e.g., phoneme or word) or a regression layer 3.8. Diffusion Probabilistic Model
that predicts a continuous value (e.g., pitch or loudness). The Diffusion probabilistic models, inspired by non-equilibrium
output layer can be a traditional fully connected layer or a thermodynamics [186, 510], have proven to be highly effec-
graph pooling layer that aggregates information from all the tive for generating high-quality images and audio. These mod-
nodes in the graph. els create a Markov chain of diffusion steps (𝑥𝑡 ∼ 𝑞(𝑥𝑡 |𝑥𝑡−1 ))
from the original data (𝑥0 ) to the latent variable 𝑥𝑇 ∼  (𝟎, 𝐈)
3.7.2. Application by gradually adding pre-scheduled noise to the data. The re-
The advantages of using GNNs for speech processing verse diffusion process then reconstructs the desired data
tasks include their ability to represent the dependencies and samples (𝑥0 ) from the noise 𝑥𝑇 , as shown in Figure 7. Unlike
VAE or flow models, diffusion models keep the dimension-

Mehrish et al.: Preprint submitted to Elsevier Page 17 of 72


A Review of Deep Learning Techniques for Speech Processing

diffusion process

reverse process

Figure 7: The Diffusion Probabilistic Model is a generative model that progressively


transforms a noise distribution into the target data distribution through a series of diffusion
steps, where the noise level decreases as the process continues. The model is trained by
maximizing the likelihood of the data distribution and can be used for tasks such as speech
synthesis, enhancement, and denoising.

ality of the latent variables fixed. While mostly used for It is non-autoregressive and generates high-fidelity audio for
image and audio synthesis, diffusion models have potential different waveform generation tasks, such as neural vocoding
applications in speech-processing tasks, such as speech syn- conditioned on mel spectrogram, class-conditional genera-
thesis and enhancement. This section offers a comprehensive tion, and unconditional generation. DiffWave delivers speech
overview of the fundamental principles of diffusion models quality on par with the strong WaveNet vocoder [404] while
and explores their potential uses in the speech domain. synthesizing audio much faster.
Diffusion models have shown great promise in speech
Forward diffusion process Given a clean speech data 𝑥0 ∼ processing, particularly in speech enhancement [347, 489,
𝑞𝑑𝑎𝑡𝑎 (𝑥0 ), 442, 348]. Recent advances in diffusion probabilistic models
have led to the development of a new speech enhancement
algorithm that incorporates the characteristics of the noisy
𝑇

𝑞(𝑥1 , ..., 𝑥𝑇 |𝑥0 ) = 𝑞(𝑥𝑡 |𝑥𝑡−1 ). (24)
speech signal into the diffusion and reverses processes [349].
This new algorithm is a generalized form of the probabilistic
𝑡=1

At every time step 𝑡, 𝑞(𝑥𝑡 |𝑥𝑡−1 ) ∶=  (𝑥𝑡 ; 1 − 𝛽𝑡 𝑥𝑡−1 , 𝛽𝑡 𝐈)



diffusion model, known as the conditional diffusion proba-
where {𝛽𝑡 ∈ (0, 1)}𝑇𝑡=1 . As the forward process progresses, bilistic model. During its reverse process, it can adapt to
the data sample 𝑥0 losses its distinguishable features, and as non-Gaussian real noises in the estimated speech signal. In
𝑇 → ∞, 𝑥𝑇 approaches a standard Gaussian distribution. addition, Qiu et al. [442] propose SRTNet, a novel method
for speech enhancement that uses the diffusion model as
Reverse diffusion process The reverse diffusion process a module for stochastic refinement. The proposed method
is defined by a Markov chain from 𝑥𝑇 ∼  (𝟎, 𝐈) to 𝑥0 and comprises a joint network of deterministic and stochastic
parameterized by 𝜃: modules, forming the “enhance-and-refine” paradigm. The
paper also includes a theoretical demonstration of the pro-
posed method’s feasibility and presents experimental results
𝑇

𝑝𝜃 (𝑥0 , ..., 𝑥𝑇 −1 |𝑥𝑇 ) = 𝑝𝜃 (𝑥𝑡−1 |𝑥𝑡 ) (25)
𝑡=1
to support its effectiveness.

where 𝑥𝑇 ∼  (0, 𝐼) and the transition probability 𝑝𝜃 (𝑥𝑡−1 |𝑥𝑡 )


4. Speech Representation Learning
is learnt through noise-estimation. This process eliminates
the Gaussian noise added in the forward diffusion process. The process of speech representation learning is essen-
tial for extracting pertinent and practical characteristics from
3.8.1. Application speech signals, which can be utilized for various downstream
Diffusion models have emerged as a leading approach for tasks such as speaker identification, speech recognition, and
generating high-quality speech in recent years [67, 269, 433, emotion recognition. While traditional methods for engineer-
434, 218, 204]. These non-autoregressive models transform ing features have been extensively used, recent advancements
white noise signals into structured waveforms via a Markov in deep-learning-based techniques utilizing supervised or
chain with a fixed number of steps. One such model, FastDiff, unsupervised learning have shown remarkable potential in
has achieved impressive results in high-quality speech syn- this field. Nonetheless, a novel approach founded on self-
thesis [204]. By leveraging a stack of time-aware diffusion supervised representation learning has surfaced, aiming to
processes, FastDiff can generate high-quality speech samples unveil the inherent structure of speech data and acquire rep-
58 times faster than real-time on a V100 GPU, making it resentations that capture the underlying structure of the data.
practical for speech synthesis deployment for the first time. This approach surpasses traditional feature engineering meth-
It also outperforms other competing methods in end-to-end ods and can significantly increase the accuracy to a consid-
text-to-speech synthesis. Another powerful diffusion proba- erable extent and effectiveness of downstream tasks. The
bilistic model proposed for audio synthesis is DiffWave [269]. primary objective of this new paradigm is to uncover in-

Mehrish et al.: Preprint submitted to Elsevier Page 18 of 72


A Review of Deep Learning Techniques for Speech Processing

formative and meaningful features from speech signals and Stacked Cross Entropy
outperform existing approaches. Therefore, this approach is Filter Bank -vectors Loss
considered a promising direction for future research in speech
representation learning.
This section provides a comprehensive overview of the
evolution of speech representation learning with neural net-
works. We will examine various techniques and architectures
developed over the years, including the emergence of unsu-
pervised representation learning methods like autoencoders,
generative adversarial networks (GANs), and self-supervised Fully Connected
representation learning frameworks. We will also examine
Hidden Layers

the difficulties and constraints associated with these tech-


niques, such as data scarcity, domain adaptation, and the
Figure 8: 𝑑-vector model architecture.
interpretability of learned representations. Through a com-
prehensive analysis of the advantages and limitations of dif-
ferent representation learning approaches, we aim to provide
insights into how to harness their power to improve the accu- These deep speaker representations can be applied to a
racy and robustness of speech processing systems. range of speaker-recognition tasks beyond verification and
identification, including diarization [572, 637, 287], voice
4.1. Supervised Learning conversion [594, 323, 86], multi-speaker TTS [478, 420,
In supervised representation learning, the model is trained 607], speaker adaptation [84] etc. To provide a comprehen-
using annotated datasets to learn a mapping between input sive overview, we analyzed deep embeddings from the per-
data and output labels. The set of parameters that define the spectives of input raw [226, 456] or mel-spectogram [509],
mapping function is optimized during training to minimize network architecture [325, 108], temporal pooling strategies
the difference between the predicted and true output labels [384], and loss functions [507, 91, 569]. In the following sub-
in the training data. The goal of supervised representation section, we introduce two representative deep embeddings: 𝑑-
learning is to enable the model to learn a useful representa- vector [553] and 𝑥-vector [508, 509]. These embeddings have
tion or features of the input data that can be used to accurately been widely adopted recently and have demonstrated state-
predict the output label for new, unseen data. For instance, of-the-art performance in various speaker-recognition tasks.
supervised representation learning in speech processing using By understanding the strengths and weaknesses of different
CNNs learn speech features from spectrograms. CNNs can deep learning-based techniques for speaker-representation
identify patterns in spectrograms relevant to speech recogni- learning, we can better leverage their power to improve the
tion, such as those corresponding to different phonemes or accuracy and robustness of speaker-recognition systems.
words. Unlike CNNs, which typically require spectrogram
• d-vector technique, proposed by Variani et al. (2014)
input, RNNs can directly take in the raw speech signals as
[553], serves as a frame-level speaker embedding method,
input and learn to extract features or representations that are
as illustrated in Figure 8. In this approach, during the
relevant for speech recognition or other speech-processing
training phase, each frame within a training utterance
tasks. Learning speaker representations typically involves
is labeled with the speaker’s true identity. This trans-
minimizing a loss function. Chung et al. [91] compares their
forms the training process into a classification task,
effectiveness for speaker recognition tasks, we distill it in Ta-
where a maxout Deep Neural Network (DNN) classi-
ble 1 to present an overview of commonly used loss functions.
fies the frames based on the speaker’s identity. The
Additionally, a new angular variant of the prototypical loss
DNN employs softmax as the output layer to minimize
is introduced in their work. Results from extensive experi-
the cross-entropy loss between the ground-truth frame
mental validation on the VoxCeleb1 test set indicate that the
labels and the network’s output. During the testing
GE2E and prototypical networks outperform other models in
phase, the 𝑑-vector technique extracts the output acti-
terms of performance.
vation of each frame from the last hidden layer of the
4.1.1. Deep speaker representations DNN, serving as the deep embedding feature for that
Speaker representation is a critical aspect of speech pro- frame. To generate a compact representation called the
cessing, allowing machines to analyze and process various 𝑑-vector, the technique computes the average of the
parts of a speaker’s voice, including pitch, intonation, accent, deep embedding features from all frames within an ut-
and speaking style. In recent years, deep neural networks terance. The underlying hypothesis is that the compact
(DNNs) have shown great promise in learning robust features representation space developed using a development
for speaker recognition. This section reviews deep learning- set can effectively generalize to unseen speakers during
based techniques for speaker representation learning that the testing phase [553].
have demonstrated significant improvements over traditional • x-vector [508, 509] is a segment-level speaker embed-
methods. ding and an advancement over the 𝑑-vector method

Mehrish et al.: Preprint submitted to Elsevier Page 19 of 72


A Review of Deep Learning Techniques for Speech Processing

Table 1
The table summarizes various loss functions used in training the speaker recognition models
including their formulation [91].

Loss Function Objective Type Description


∑𝑁 exp 𝑊𝑦𝑇 𝑥𝑖 +𝑏𝑦𝑖
Softmax Classification 𝐿𝑆 = − 𝑁1 𝑖=1 ∑𝐶log 𝑖
𝑇
𝑗=1 exp 𝑊𝑗 𝑥𝑖 +𝑏𝑦𝑖
1 ∑𝑁 exp 𝑠(cos 𝜃 −𝑚)
AM-Softmax (CosFace) [569] Classification 𝐿𝐶 = − 𝑁 𝑖=1 log exp 𝑠(cos 𝜃 −𝑚)+∑𝑦𝑖 ,𝑖 exp 𝑠(cos 𝜃 )
𝑦𝑖 ,𝑖 𝑗≠𝑦𝑖 𝑗,𝑖
∑𝑁 exp 𝑠(cos 𝜃 −𝑚)
AAM-Softmax (ArcFace) [103] Classification 𝐿𝐴 = − 𝑁1 𝑖=1 log exp 𝑠(cos 𝜃 +𝑚)+∑𝑦𝑖 ,𝑖 exp 𝑠(cos 𝜃 )
∑𝑁 𝑦𝑖 ,𝑖 𝑗≠𝑦𝑖 𝑗,𝑖

Triplet [486] Metric learning [640] 𝐿𝑇 = 𝑁1 𝑗=1 max(0, ||𝑥𝑗,0 − 𝑥𝑗,1 ||22 , ||𝑥𝑗,0 − 𝑥𝑘≠𝑗,1 ||22 + 𝑚)
∑ exp S
𝐿𝑃 = − 𝑁1 𝑗=1 log ∑𝑁 𝑗,𝑗
𝑁
Prototypical [507] Metric learning [507]
𝑘=1 exp S𝑗,𝑘
1 ∑ exp S𝑗,𝑖,𝑗
Generalized end-to-end (GE2E) [561] Metric learning [573] 𝐿𝐺 = − 𝑁 𝑗,𝑖 log ∑𝑁
𝑘=1 exp S𝑗,𝑖,𝑘
∑ exp S
Angular Prototypical Metric learning 𝐿𝐴𝑃 = − 𝑁1 𝑗,𝑖 log ∑𝑁 𝑗,𝑖,𝑗
exp S
𝑘=1 𝑗,𝑖,𝑘

performance of the 𝑥-vector.

4.2. Unsupervised learning


Unsupervised representation learning for speech process-
ing has gained significant emphasis over the past few years.
segment-level
-vectors
Similar to visual modality in CV and text modality in NLP,
speech i.e. audio modality introduces unique challenges. Un-
Statistics Pooling supervised speech representation learning is concerned with
learning useful speech representations without using anno-
tated data. Usually, the model is first pre-trained on the task
where plenty of data is available. The model is then fined
tuned or used to extract input representations for a small
frame-level

model, specifically targeting tasks with limited data.


One approach to addressing the unique challenges of
unsupervised speech representation learning is to use prob-
abilistic latent variable models (PLVM), which assume an
unknown generative process produces the data and enables
Figure 9: 𝑥-vector model architecture. 𝑥1 ,𝑥2 ,....,𝑥𝑇 are the learning of rich structural representations and reasoning
the spectral features such as Mel spectrograms of the about observed and unobserved factors of variation in com-
speech utterance. plex datasets such as speech within a probabilistic framework.
PLVM specified a joint distribution 𝑝(𝑥, 𝑧) over unobserved
stochastic latent variable z and observed variables x. By fac-
as it incorporates additional modeling of temporal torizing the joint distribution into modular components, it
information and phonetic information in speech sig- becomes possible to learn rich structural representations and
nals, resulting in improved performance compared to reason about observed and unobserved factors of variation
the 𝑑-vector. 𝑥-vector employs an aggregation pro- in complex datasets such as speech within a probabilistic
cess to move from frame-by-frame speaker labeling framework. The likelihood of a PLVM given a data x can be
to utterance-level speaker labeling as highlighted in written as
Figure 9. The network structure of the 𝑥-vector is de-

picted in a figure, which consists of time-delay layers 𝑝(𝑥) = 𝑝(𝑥|𝑧)𝑝(𝑧)𝑑𝑧. (26)
for extracting frame-level speech embeddings, a statis-
tical pooling layer for concatenating mean and standard Probabilistic latent variable models provide a powerful
deviation of embeddings as a segment-level feature, way to learn a representation that captures the underlying
and a standard feedforward network for classifying relationships between observed and unobserved variables,
the segment-level feature to its speaker. 𝑥-vector is without requiring explicit supervision or labels. These mod-
the segment-level speaker embedding generated from els involve unobserved latent variables that must be inferred
the feedforward network’s second-to-last hidden layer. from the observed data, typically using probabilistic infer-
The authors in [616, 472] have also discovered the ence techniques such as Markov Chain Monte Carlo (MCMC)
significance of data augmentation in enhancing the methods. In the context of representation learning, Varia-
tional autoencoders (VAE) are commonly used with latent

Mehrish et al.: Preprint submitted to Elsevier Page 20 of 72


A Review of Deep Learning Techniques for Speech Processing
Probabilistic Latent Variable
which can be expressed as:

𝑢𝑛𝑠𝑢𝑝 =
|𝑦|
∑ ∑
1
𝑝𝜃 (𝑦𝑗 , 𝑥𝑖 ) log 𝑝𝜃 (𝑦𝑗 , 𝑥𝑖 ) (29)
𝑁𝑈 (𝑥𝑖 )∈𝑋𝑈 𝑗=1

where 𝑝𝜃 (𝑦𝑗 |𝑥𝑖 ) is the predicted probability of the 𝑗-th label


Self-Supervised

for the unlabelled data point 𝑥𝑖 . Finally the overall objective


function for semi-supervised learning can be expressed as
 = 𝑠𝑢𝑝 + 𝛼𝑢𝑛𝑠𝑢𝑝 , 𝛼 is a hyperparameter that controls the
weight of the unsupervised loss term. The goal is to find the
optimal parameters 𝜃 that minimize this objective function.
Figure 10: Overview of difference between probabilistic latent
Semi-supervised learning involves learning a model from
variable models and self-supervised learning. In latent variable
models learn the functions 𝑓 (.) and 𝑔(.) learn the parameters
of distribution 𝑝 and 𝑞. The latent variable 𝑧 is used for both labelled and unlabelled data by minimizing a combina-
representing learning. tion of supervised and unsupervised loss terms. By leverag-
ing the additional unlabelled data, semi-supervised learning
can improve the generalization and performance of the model
variable models for various speech processing tasks, leverag- in downstream tasks.
ing the power of probabilistic modeling to capture complex Semi-supervised learning techniques are increasingly be-
patterns in speech data. ing employed to enhance the performance of DNNs across a
range of downstream tasks in speech processing, including
4.3. Semi-supervised Learning ASR, TTS, etc. The primary objective of such approaches
Semi-supervised learning can be viewed as a process of is to leverage large unlabelled datasets to augment the per-
optimizing a model using both labeled and unlabeled data. formance of supervised tasks that rely on labelled datasets.
The set of labeled data points, denoted by 𝑋𝐿 , contains 𝑁𝐿 The recent advancements in speech recognition have led to a
items, where each item is represented as (𝑥𝑖 , 𝑦𝑖 ) with 𝑦𝑖 being growing interest in the integration of semi-supervised learn-
the label of 𝑥𝑖 . On the other hand, the set of unlabeled data ing methods to improve the performance of ASR and TTS
points, denoted by 𝑋𝑈 , consists of 𝑁𝑈 items, represented as systems [658, 34, 657, 229, 605, 89]. This approach is partic-
𝑥𝑁𝐿 +1 , 𝑥𝑁𝐿 +2 , ..., 𝑥𝑁𝐿 +𝑁𝑈 . ularly beneficial in scenarios where labelled data is scarce or
In semi-supervised learning, the objective is to train a expensive to acquire. In fact, for many languages around the
model 𝑓𝜃 with parameters 𝜃 that can minimize the expected globe, labelled data for training ASR models are often inade-
loss over the entire dataset. The loss function 𝐿(𝑦, 𝑓𝜃 (𝑥)) is quate, making it challenging to achieve optimal results. Thus,
used to quantify the deviation between the model’s prediction using a semi-supervised learning model trained on abundant
𝑓𝜃 (𝑥) and the ground truth label 𝑦. The expected loss can be resource data can offer a viable solution that can be readily
mathematically expressed as: extended to low-resource languages.
Semi-supervised learning has emerged as a valuable tool
for addressing the challenges of insufficient annotations and
𝐿(𝑦, 𝑓𝜃 (𝑥)) = 𝐸(𝑥,𝑦)∼𝑝𝑑𝑎𝑡𝑎 (𝑥,𝑦) [𝐿(𝑦, 𝑓𝜃 (𝑥))] (27) poor generalization [165]. Research in various domains, in-
cluding image quality assessment [341], has demonstrated
where 𝑝𝑑𝑎𝑡𝑎 (𝑥, 𝑦) is the underlying data distribution.In semi- that leveraging both labelled and unlabelled data through
supervised learning, the loss function is typically decom- semi-supervised learning can lead to improved performance
posed into two parts: a supervised loss term that is only and generalization. In the domain of speech quality assess-
defined on the labeled data, and an unsupervised loss term ment, several studies [490] have exploited the generalization
that is defined on both labeled and unlabelled data. The capabilities of semi-supervised learning to enhance perfor-
supervised loss term is calculated as follows: mance.
Moreover, semi-supervised learning has gained signifi-
cant attention in other areas of speech processing, such as
𝑠𝑢𝑝 =
1 ∑
𝐿(𝑦, 𝑓𝜃 (𝑥)) (28) end-to-end speech translation [430]. By leveraging large
amounts of unlabelled data, semi-supervised learning ap-
𝑁𝐿 (𝑥,𝑦)∈𝑋𝐿
proaches have demonstrated promising results in improving
The unsupervised loss term leverages the unlabelled data the performance and robustness of speech translation mod-
to encourage the model to learn meaningful representations els. This highlights the potential of semi-supervised learning
that capture the underlying structure of the data. One com- to address the limitations of traditional supervised learning
mon approach is to use a regularization term that encourages approaches in a variety of speech processing tasks.
the model to produce similar outputs for similar input data.
This can be achieved by minimizing the distance between the 4.4. Self-supervised representation learning
output of the model for two similar input data points. One (SSRL)
such regularization term is the entropy minimization term, Self-supervised representation learning (SSRL) is a ma-
chine learning approach that focuses on achieving robust and

Mehrish et al.: Preprint submitted to Elsevier Page 21 of 72


A Review of Deep Learning Techniques for Speech Processing
Reconstructed Input Recover the Masked Input Predict the Future
resembling ELMo and BERT’s masked language modeling
(MLM). These non-autoregressive generative approaches dif-
fer in their use of advanced structures, such as bidirectional
LSTM (for ELMo) and transformer (for BERT), with recent
models producing contextual embeddings. In the context of
the speech, Mockingjay [330] applied masking to all feature
dimensions in the speech domain, whereas TERA [329] ap-
plied to mask only to a particular subset of feature dimensions.
The summary of generative self-supervised approaches along
Masked Input

with the data used for training the models are outlined in
Table 2. We further discuss different generative approaches
input input input
sequence sequence sequence

(a) (b) (c) as highlighted in Figure 11 as follows:

• Auto-encoding Models: Auto-encoding Models have


Figure 11: Generative approaches to self-supervised learning. garnered significant attention in the domain of self-
supervised learning, particularly Autoencoders (AEs)
and Variational Autoencoders (VAEs). AEs consist
in-depth feature learning while minimizing reliance on exten- of an encoder and a decoder that work together to
sively annotated datasets, thus reducing the annotation bot- reconstruct input while disregarding less important
tleneck [132, 289]. SSRL comprises various techniques that details, prioritizing the extraction of meaningful fea-
allow models to be trained without needing human-annotated tures. VAEs, a probabilistic variant of AEs, have found
labels [132, 289]. One of the key advantages of SSRL is its wide-ranging applications in the field of speech mod-
ability to operate on unlabelled datasets, which reduces the eling. Furthermore, the vector-quantized variational
need for large annotated datasets [132, 289]. In recent years, autoencoder (VQ-VAE) [551] has been developed as
self-supervised learning has progressed rapidly, with some an extended generative model. The VQ-VAE intro-
methods approaching or surpassing the efficacy of fully super- duces parameterization of the posterior distribution to
vised learning methods. Self-supervised learning methods represent discrete latent representations. Remarkably,
typically involve pretext tasks that generate pseudo labels for the VQ-VAE has demonstrated notable success in gen-
discriminative model training without actual labeling. The erative spoken language modeling. By combining a
difference between self-supervised representation learning discrete latent space with self-supervised learning, its
and unsupervised representation is highlighted in Figure 10. performance is further improved.
In contrast to unsupervised representation learning, SSRL
techniques are designed to generate these pseudo labels for • Autoregressive models: Autoregressive generative self-
model training. The ability of SSRL to achieve robust and in- supervised learning uses autoregressive prediction cod-
depth feature learning without relying heavily on annotated ing technique [95] to model the probability distribu-
datasets holds great promise for the continued development tion of a sequence of data points. This approach aims
of machine learning techniques. to predict the next data point in a sequence based on
SSRL differs from supervised learning mainly in terms the previous data points. Autoregressive models typi-
of its data requirements. While supervised learning relies on cally use RNNs or a transformer architecture as a basic
labeled data, where the model learns from input-output pairs, model.
SSL generates its own labels from the input data, eliminating The authors in paper [404] introduce a generative model
the need for labeled data [289]. The SSL approach trains for raw audio called WaveNet, based on PixelCNN
the model to predict a portion of the input data, which is [402]. To enhance the model’s ability to handle long-
then utilized as a label for the task at hand [289]. Although range temporal dependencies, the authors incorporate
SSRL is an unsupervised learning technique, it seeks to tackle dilated causal convolutions [404]. They also utilize
tasks commonly associated with supervised learning without Gated Residual blocks and skip connections to improve
relying on labeled data [289]. the model’s expressivity.

4.4.1. Generative Models • Masked Reconstruction: The concept of masked recon-


This method involves instructing a model to produce sam- struction is influenced by the masked language model
ples resembling the input data without explicitly learning the (MLM) task proposed in BERT [109]. This task in-
labels, creating valuable representations applicable to other volves masking specific tokens in input sentences with
tasks. The detailed architecture for generative models with learned masking tokens or other input tokens, and train-
three different variants is shown in Figure 11. The earliest ing the model to reconstruct these masked tokens from
self-supervised method, predicting masked inputs using sur- the non-masked ones. Recent research has explored
rounding data, originated from the text field in 2013 with similar pretext tasks for speech representation learning
word2vec. The continuous bag of words (CBOW) concept that help models develop contextualized representa-
of word2vec predicts a central word based on its neighbors, tions capturing information from the entire input, like

Mehrish et al.: Preprint submitted to Elsevier Page 22 of 72


A Review of Deep Learning Techniques for Speech Processing

Table 2
Summary of generative self-supervised approaches and proposed models for speech pro-
cessing with associated metrics and training Data. ASR: Automatic Speech Recognition,
PR: Phoneme Recognition. PC: Phoneme Classification, SR: Speaker Recognition, LS:
LibriSpeech.
Pre-Training Dataset
Model Reference Task (Metric)
Dataset (hours) Training Test
PC LS (360h) LS (360h) LS (test-clean)
Mockingjay [330]
SR LS (360h) LS (100h) LS (100h)
PASE [418] ASR LS (50 hr) DIRHA DIRHA
DIRHA DIRHA
PASE+ [458] ASR LS (50 hr)
CHiME-5 CHiME-5
LS (100h, 360h, 460 h, 960h) LS (100h, 360h, 460 h, 960h) LS (test-clean)
DeCoAR [326] ASR
WSJ si284 WSJ si284 LS (test-other)

the DeCoAR model [326]. This approach assists the


model in comprehending input data better, leading to
more precise and informative representations.
Discriminator

4.4.2. Contrastive Models


The technique involves training a model to differentiate Bilinear Bilinear

between similar and dissimilar pairs of data samples, which


helps the model acquire valuable representations that can
be utilized for various tasks, as shown on Figure 12. The
fundamental principle of contrastive learning is to generate
positive and negative pairs of training samples based on the
comprehension of the data. The model must learn a function
that assigns high similarity scores to two positive samples
and low similarity scores to two negative samples. There-
fore, generating appropriate samples is crucial for ensuring
that the model comprehends the fundamental features and
structures of the data. Table 3 outlines popular contrastive
self-supervised models used for different speech-processing
tasks. We discuss Wav2Vec 2.0 since it has achieved state-
Anchor Positive Negative

of-the-art results in different downstream tasks.


Figure 12: Contrastive Self-supervised learning: Contrastive
• Wav2Vec 2.0 [26] is a framework for self-supervised Predictive Coding.
learning of speech representations that is one of the
current state-of-the-art models for ASR [26]. The train-
ing of the model occurs in two stages. Initially, the global structures while minimizing mutual information be-
model operates in a self-supervised mode during the tween patches of corrupted graphs and the original graph’s
first phase, where it uses unlabelled data and aims to global representation.
achieve the best speech representation possible. The
second phase is fine-tuning a particular dataset for a 4.4.3. Predictive Models
specific purpose. Wav2Vec 2.0 takes advantage of self- In training predictive models, the primary concept in-
supervised training and uses convolutional layers to volves creating simpler objectives or targets to minimize the
extract features from raw audio. need for data generation. However, the most critical and
difficult aspect is ensuring that the task’s difficulty level
In the speech field, researchers have explored different is appropriate for the model to learn effectively. Predic-
approaches to avoid overfitting, including augmentation tech- tive SSRL methods have been leveraged in ASR through
niques like Speech SimCLR [220] and the use of positive and transformer-based models to acquire meaningful representa-
negative pairs through methods like Contrastive Predictive tions [23, 196, 329] and have proven transformative in exploit-
Coding (CPC) (Ooster and Meyer [406]), Wav2vec (v1, v2.0) ing the growing abundance of data [150]. Table 4 highlight
(Schneider et al. [485]), VQ-wav2vec (Baevski et al. [25]), popularly used SSRL methods along with the data used for
and Discrete BERT [23]." "In the graph field, researchers training these models. In the following section we breifly dis-
have developed approaches like Deep Graph Infomax (DGI) cuss three popular predictive SSRL approaches used widely
(Velickovic et al., 2019 [557]) to learn representations that in various downstream tasks.
maximize the mutual information between local patches and

Mehrish et al.: Preprint submitted to Elsevier Page 23 of 72


A Review of Deep Learning Techniques for Speech Processing

Table 3
Summary of contrastive self-supervised approaches and proposed models for speech pro-
cessing with associated metrics and training Data. ASR: Automatic Speech Recognition,
PR: Phoneme Recognition. PC: Phoneme Classification, SR: Speaker Recognition, LS:
LibriSpeech, LL: LibriLight, WSJ: Wall Street Journal.
Pre-Training Dataset
Model Reference Task
Dataset (hours) Training Test
PC LS (100h) LS (100h) LS (100h)
CPC [405]
SR LS (100h) LS (100h) LS (100h)
LS (100h, 360h)
Modified CPC [467] PC CV-Dataset CV-Dataset
Zerospeech2017(45h)
WSJ (80h) WSJ (80h)
LS (960h) LS (960h) WSJ (test92, test93)
TIMIT (5h) TIMIT (5h) LS (test-clean, test-other)
ASR
Bidirectional CPC [247] SSA (1h) SSA (1h) TED3 (dev, test)
TED3 (440h) TED3 (440h) SwithBoard (eval2000)
SwithBoard (310h) SwithBoard (310h)
Audio Set (2500h) Audio Set (2500h)
OpenSLR
ASR-Multi AVSpeech (3100h) AVSpeech (3100h)
ALFFA
CV-Dataset (430h CV-Dataset (430h)
LS 80/860h
ASR WSJ (si284) WSJ (eval92)
wav2vec [485] LS 960h + WSJ (si284)
PR TIMIT TIMIT TIMIT
LS (960h) LS (test-clean)
ASR LS (960h)
wav2vec 2.0 [26] LL (60000h) LS (test-other)
LS (960h)
PR TIMIT TIMIT
LL (60000h)
ASR LS (960h) WSJ (si284) WSJ (eval92)
vq-wav2vec 2.0 [25]
PR LS (960h) TIMIT TIMIT
wav2vec-C [476] ASR Alexa-10k Alexa-eval Alexa-eval
LS (test)
LS (test-other)
w2v-BERT [96] ASR LL (60000h) LS (960h)
LS (dev)
LS (dev-other)
LS (960h)
ASR WJS (si284) WJS (si284) WJS (si284)
Speech SimCLR [220]
TED2
LS (960h)
PR WJS (si284) TIMIT TIMIT
TED2
LL (60000h)
UnSpeech [381] ASR-Mult GigaSpeech (10000h) SUPERB SUPERB
VP (24000h)

• The direct application of BERT-type training to speech supervised speech representation learning. Even with
input presents challenges due to the unsegmented and a mere 10-minute fine-tuning set, it achieved a Word
unstructured nature of speech. To overcome this obsta- Error Rate (WER) of 25% on the standard test-other
cle, a pioneering model known as Discrete BERT [23] subset. This approach effectively tackles the challenge
has been developed. This model converts continuous of directly applying BERT-type training to continu-
speech input into a sequence of discrete codes, facili- ous speech input and holds substantial potential for
tating code representation learning. The discrete units significantly enhancing speech recognition accuracy
are obtained from a pre-trained vq-wav2vec model
[25], and they serve as both inputs and targets within • The HuBERT [196] and TERA [329] models are two
a standard BERT model. The architecture of Discrete self-supervised approaches for speech representation
BERT, illustrated in Figure 13 (a), incorporates a soft- learning. HuBERT uses an offline clustering step to
max normalized output layer. During training, cate- align target labels with a BERT-like prediction loss,
gorical cross-entropy loss is employed, with a masked with the prediction loss applied only over the masked
perspective of the original speech input utilized for pre- regions as outlined in Figure 13 (b). This encourages
dicting code representations. Remarkably, the Discrete the model to learn a combined acoustic and language
BERT model has exhibited impressive efficacy in self- model over the continuous inputs. On the other hand,
TERA is a self-supervised speech pre-training method

Mehrish et al.: Preprint submitted to Elsevier Page 24 of 72


A Review of Deep Learning Techniques for Speech Processing

Table 4
Summary of predictive self-supervised approaches and proposed models for speech pro-
cessing with associated metrics and training Data. ASR: Automatic Speech Recognition,
PR: Phoneme Recognition. PC: Phoneme Classification, SR: Speaker Recognition, LL:
LibriLight, LS: LibriSpeech.
Pre-Training Dataset
Model Reference Task (Metric)
Dataset (hours) Training Test
LS (test)
LS (test-other)
ASR LL (60000h) LS (960h)
BEST-RQ [78] LS (dev)
LS (dev-other)
LL (60000h)
ASR-Multi GigaSpeech (10000h) SUPERB SUPERB
VP (24000h)
data2vec [24] ASR LS (960h) LS (10m, 1h, 100h, 960h) LS (960h)
LS (test)
Discrete BERT [23] ASR LS (960h) LS (100h)
LS (test-other)
LS (960h) LS (test)
HuBERT [625] ASR LS (960h)
LL (60000h) LS (test-other)
WavLM [71] ASR LL (60000h) SUPERB SUPERB

speech representations’ performance in terms of reusability.


Self-supervised learning has emerged as a widely adopted
Acoustic Unit Discovery
K-means on MFCC

and effective technique for speech processing tasks due to its


softmax
output

ability to train models with large amounts of unlabeled data.


A comprehensive overview of self-supervised approaches,
evaluation metrics, and training data is provided in Table
masked timesteps masked timesteps
discrete tokens

4 for speech recognition, speaker recognition, and speech


enhancement. Researchers and practitioners can use this
resource to select appropriate self-supervised methods and
datasets to enhance their speech-processing systems. As
self-supervised learning techniques continue to advance and
input input
refine, we can expect significant progress and advancements
in speech processing.
sequence sequence

(a) (b)

Figure 13: Predictive Self-supervised learning: (a) Discrete 5. Speech Processing Tasks
BERT (b) HuBERT. In recent times, the field of speech processing has gained
significant attention due to its rapid evolution and its cru-
cial role in modern technological applications. This field in-
that reconstructs acoustic frames from their altered volves the use of diverse techniques and algorithms to analyse
counterparts using a stochastic policy to alter along var- and understand spoken language, ranging from basic speech
ious dimensions, including time, frequency, and tasks. recognition to more complex tasks such as spoken language
These alterations help extract feature-based speech rep- understanding and speaker identification. Since speech is one
resentations that can be fine-tuned as part of down- of the most natural forms of communication, speech process-
stream models. ing has become a critical component of many applications
such as virtual assistants, call centres, and speech-to-text
Microsoft has introduced UniSpeech-SAT [72] and WavLM
transcription. In this section, we provide a comprehensive
[71] models, which follow the HuBERT framework. These
overview of the various speech-processing tasks and the tech-
models have been designed to enhance speaker representation
niques used to achieve them, while also discussing the current
and improve various downstream tasks. The key focus of
challenges and limitations faced in this field and its potential
these models is data augmentation during the pre-training
for future development.
stage, resulting in superior performance. WavLM model has
The assessment of speech-processing models depends
exhibited outstanding effectiveness in diverse downstream
greatly on the calibre of datasets employed. By utilizing
tasks, such as automatic speech recognition, phoneme recog-
standardized datasets, researchers are enabled to objectively
nition, speaker identification, and emotion recognition. It is
gauge the efficacy of varying approaches and identify scopes
worth highlighting that this model currently holds the top
for advancement. The selection of evaluation metrics plays a
position on the SUPERB leaderboard [615], which evaluates

Mehrish et al.: Preprint submitted to Elsevier Page 25 of 72


A Review of Deep Learning Techniques for Speech Processing

Table 5
Comparative analysis of speech processing datasets: This table summarizes the essential
features of different speech-processing datasets, including their typical applications in various
speech-processing tasks. ASR: Automatic Speech Recognition, PR: Phoneme Recognition.
PC: Phoneme Classification, SR: Speaker Recognition, SV: Speaker Verification, SER:
Speech Emotion Recognition, IC: Intent Classification, TTS: Text-to-Speech, VC: Voice
Conversion, ST: Speech Translation, SS: Speech Separation
Dataset Language Lenght (hours) ASR PR PC SR SV SER IC TTS VC ST SS
TIMIT Acoustic-Phonetic Continuous Speech Corpus English 5.4
Lip Reading Sentences 2 (LRS2) English
LibriSpeech (LS) English 1000
GigaSpeech English 10000
Fleurs Multilingual 12
LibriTTS English 585
L2ARCTIC English 11.2
CMUARCTIC English 20
Wall Street Journal (WSJ) English
VoxPopuli (VP) Multilingual 1800
BABEL (BBL) Multilingual
Common Voice (CV-dataset) Multilingual 9283
CSTR VCTK English
HUB 5 English 2000
CHiME-5 English 50.12
TED-LIUM 3 (TED 3) English 452
TED-LIUM 2 (TED 2) English 118
AISHELL-1 Mandarin 520
AISHELL-3 Mandarin 85
AISHELL-4 Mandarin 120
Arabic Speech Corpus Arabic 3.7
Persian Consonant Vowel Combination Persian -
ALFFA Multilingual 5.2-18.3
OpenSLR-multi Multilingual 4.4-265.9
VCTK English 44
VoxCeleb1/2 English
Fluent Speech Commands (FSC) English 14.7
Emotional Speech Dataset (ESD) English 29
Interactive Emotional Dyadic Motion Capture (IEMOCAP) English 12
Multimodal EmotionLines Dataset ( MELD) English -
LibraSeepch En-Fr English/French -
CoVoST-2 Multilingual 2880
LibriLight (LL) English 60000

critical role in this process, hinging on the task at hand and tion and analysis of acoustic features, including spectral and
the desired outcome. Therefore, it is essential that researchers prosodic features, which are then employed to recognize spo-
conduct a meticulous appraisal of different metrics to make ken words. Next, an acoustic model matches the extracted
informed decisions. This paper offers a thorough summary features to phonetic units, while a language model predicts
of frequently utilized datasets and metrics across diverse the most probable sequence of words based on the recognized
downstream tasks, as presented in Table 5 and, Table 6. phonetic units. Ultimately, the acoustic and language model
outcomes are merged to produce the transcription of spoken
5.1. Automatic speech recognition (ASR) & words. Deep learning techniques have gained popularity in
conversational multi-speaker AST recent years, allowing for improved accuracy in ASR sys-
5.1.1. Task Description tems [26, 445]. This paper provides an overview of the key
Automatic speech recognition (ASR) technology enables components involved in ASR and highlights the role of deep
machines to convert spoken language into text or commands, learning techniques in enhancing the technology’s accuracy.
serving as a cornerstone of human-machine communication Most speech recognition systems that use deep learning
and facilitating a wide range of applications such as speech- aim to simplify the processing pipeline by training a single
to-speech translation and information retrieval [345]. ASR model to directly map speech signals to their correspond-
involves multiple intricate steps, starting with the extrac- ing text transcriptions. Unlike traditional ASR systems that

Mehrish et al.: Preprint submitted to Elsevier Page 26 of 72


A Review of Deep Learning Techniques for Speech Processing

Table 6
Comprehensive Evaluation Metrics for Speech Processing Tasks. This table provides a
comprehensive overview of the evaluation metrics used to assess the performance of speech-
based systems across various tasks such as ASR, speaker verification, and TTS. The table
highlights the specific metrics employed for each task, along with the score range and
commonly used datasets.
Tasks Metric Description Score range Evaluation dataset
Automatic speech recognition WER Word Error Rate 0-1 TIMIT
CER Character Error Rate 0-1 LibriSpeech
Phoneme recognition Accuracy Classification accuracy 0-1 TIMIT
Phoneme classification F1-score Harmonic mean of precision and recall 0-1 TIMIT
Speaker recognition EER Equal Error Rate 0-1 VoxCeleb1
Speaker verification FAR/FRR False Acceptance Rate / False Rejection Rate 0-1 VoxCeleb1
Speech emotion recognition Accuracy Classification accuracy 0-1 IEMOCAP, ESD
Intent classification F1-score Harmonic mean of precision and recall 0-1 ATIS, SNIPS
Text-to-speech MOS Mean Opinion Score 1-5 LJSpeech, LibriTTS
Voice conversion MOS Mean Opinion Score 1-5 VCC 2016
Speech translation BLEU Bilingual Evaluation Understudy 0-1 MuST-C
Speech separation SI-SDRi Signal to Distortion Ratio -20-30 WSJ0-2mix
Speech enhancement PESQ Perceptual Evaluation of Speech Quality -0.5-4.5 NOIZEUS
Voice activity detection F1-score Harmonic mean of precision and recall 0-1 QUT-NOISE

require multiple components to extract and model features, datasets used for this purpose. In this context, several popular
such as HMMs and GMMs, end-to-end models do not rely datasets have gained prominence for use in ASR systems.
on hand-designed components [19, 307]. Instead, end-to-
end ASR systems use DNNs to learn acoustic and linguistic • Common Voice: Mozilla’s Common Voice project [17]
representations directly from the input speech signals [307]. is dedicated to producing an accessible, unrestricted
One popular type of end-to-end model is the encoder-decoder collection of human speech for the purpose of train-
model with attention. This model uses an encoder network ing speech recognition systems. This ever-expanding
to map input audio signals to hidden representations, and dataset features contributions from more than 9, 000
a decoder network to generate text transcriptions from the speakers spanning 60 different languages.
hidden representations. During the decoding process, the • LibriSpeech: LibriSpeech [412] is a corpus of approx-
attention mechanism enables the decoder to selectively focus imately 1,000 hours of read English speech created
on different parts of the input signal [307]. from audiobooks in the public domain. It is widely
End-to-end ASR models can be trained using various used for speech recognition research and is notable for
techniques such as CTC [245], which is used to train models its high audio quality and clean transcription.
without explicit alignment between the input and output se-
quences, and RNNs, which are commonly used to model tem- • VoxCeleb: VoxCeleb [92] is a large-scale dataset con-
poral dependencies in sequential data such as speech signals. taining over 1 million short audio clips of celebrities
Transfer learning-based approaches can also improve end-to- speaking, which can be used for speech recognition
end ASR performance by leveraging pre-trained models or and recognition research. It includes a diverse range of
features [327, 106, 491]. While end-to-end ASR models have speakers from different backgrounds and professions.
shown promising results in various applications, there is still
room for improvement to achieve human-level performance • TIMIT: The TIMIT corpus [153] is a widely used
[327, 625, 236, 237, 106, 137]. Nonetheless, deep learning- speech dataset consisting of recordings consisting of
based end-to-end ASR architecture offers a promising and 630 speakers representing eight major dialects of Amer-
efficient approach to speech recognition that can simplify the ican English, each reading ten phonetically rich sen-
processing pipeline and improve recognition accuracy. tences. It has been used as a benchmark for speech
recognition research since its creation in 1986.
5.1.2. Dataset
• CHiME-5: The CHiME-5 dataset [33] is a collection of
The development and evaluation of ASR systems are
recordings made in a domestic environment to simulate
heavily dependent on the availability of large datasets. As
a real-world speech recognition scenario. It includes
a result, ASR is an active area of research, with numerous
6.5 hours of audio from multiple microphone arrays

Mehrish et al.: Preprint submitted to Elsevier Page 27 of 72


A Review of Deep Learning Techniques for Speech Processing

Table 7
Table summarizing the performance of different ASR models in terms of WER% on five
different datasets (LibriSpeech test, LibriSpeech clean, TIMIT, Common Voice, WSJ eval92,
and GigaSpeech) also highlighting the use of extra data during training. ZS stands for
Zero-Shot Performance.
Extra Extra
Model Architecture WER% ↓ WER% ↓ Model Architecture WER% ↓
Training Data Training Data
LibriSpeech test clean others TIMIT
Conformer + Wav2vec 2.0 [658] Conformer + wav2vec2.0 Y 1.4 2.6 wav2vec 2.0 [26] Transformer + CNN Y 8.3
w2v-BERT XXL [96] CNN+Transformer Y 1.4 2.5 vq-wav2vec [25] Transformer + CNN Y 11.6
SpeechStew (1B)[58] Conformer Y 1.7 3.3 LSTM + Monophone Reg [457] LSTM N 14.5
SpeechStew (100M) [58] Conformer N 2.0 4.0 Common Voice
ContextNet + SpecAugment [415] LSTM+CNN Y 1.7 3.4 SpeechStew (1B) [58] Conformer N 10.8
Conformer (L) [162] Conformer N 1.9 4.1 Whisper [445] N 9.5
ContextNet [169] Conformer + wav2vec2.0 N 1.9 3.4 WSJ eval92
Squeezeformer [255] Conformer N 2.47 5.97 SpeechStew (100M) [58] Conformer N 1.3
LSTM Transducer [636] LSTM N 2.23 5.6 tdnn+chain [435] TDNN N 2.32
Transformer Transducer [331] Transformer N 2.0 4.2 GigaSpeech
Whisper [445] N 2.7 (ZS) 5.6 (ZS) Conformer/Transformer-AED [61] Conformer N 10.80

and is designed to test the performance of ASR systems where authors include CNN layers before submitting prepro-
in noisy and reverberant environments. cessed speech features to the input. By incorporating more
CNN layers, it becomes feasible to diminish the gap between
Other notable datasets include Google’s Speech Com- the sizes of the input and output sequences, given that the
mands Dataset [589], the Wall Street Journal dataset4 , and number of frames in audio exceeds the number of tokens in
TED-LIUM [470]. text. This results in a favorable impact on the training pro-
cess. The change in the original architecture is minimal, and
5.1.3. Models
the model achieves a competitive word error rate (WER) of
The use of RNN-based architecture in speech recognition
10.9% on the Wall Street Journal (WSK) speech recognition
has many advantages over traditional acoustic models. One
dataset (Table 7). Despite its numerous advantages, Trans-
of the most significant benefits is their ability to capture long-
formers in its pristine state has several issues when applied
term temporal dependencies [244] in speech data, enabling
to ASR. RNN, with its overall training speed (i.e., conver-
them to model the dynamic nature of speech signals. Addi-
gence) and better WER because of effective joint training
tionally, RNNs can effectively process variable-length audio
and decoding methods, is still the best option.
sequences, which is essential in speech recognition tasks
The authors in [116] propose the Speech Transformer,
where the duration of spoken words and phrases can vary
which has the advantage of faster iteration time, but slower
widely. RNN-based models can efficiently identify and seg-
convergence compared to RNN-based ASR. However, in-
ment phonemes, detect and transcribe spoken words, and can
tegrating the Speech Transformer with the naive language
be trained end-to-end, eliminating the need for intermediate
model (LM) is challenging. To address this issue, various im-
steps. These features make RNN-based models particularly
provements in the Speech Transformer architecture have been
useful in real-time applications, such as speech recognition in
proposed in recent years. For example, [245] suggests incor-
mobile devices or smart homes [117, 178], where low latency
porating the Connectionist Temporal Classification (CTC)
and high accuracy are crucial.
loss into the Speech Transformer. CTC is a popular tech-
In the past, RNNs were the go-to model for ASR. How-
nique used in speech recognition to align input and output
ever, their limited ability to handle long-range dependencies
sequences of varying lengths and one-to-many or many-to-
prompted the adoption of the Transformer architecture. For
one mappings. It introduces a blank symbol representing
example, in 2019, Google’s Speech-to-Text API transitioned
gaps between output symbols and computes the loss function
to a Transformer-based architecture that surpassed the previ-
by summing probabilities across all possible paths. The loss
ous RNN-based model, especially in noisy environments and
function encourages the model to assign high probabilities
for longer sentences, as reported in [651]. Additionally, Face-
to correct output symbols and low probabilities to incorrect
book AI Research introduced wav2vec 2.0, a self-supervised
output symbols and the blank symbol, allowing the model to
learning approach that leverages a Transformer-based archi-
predict sequences of varying lengths. The CTC loss is com-
tecture to perform unsupervised speech recognition. wav2vec
monly used with RNNs such as LSTM and GRU, which are
2.0 has significantly outperformed the previous RNN-based
well-suited for sequential data. CTC loss is a powerful tool for
model and achieved state-of-the-art results on several bench-
training neural networks to perform sequence-to-sequence
mark datasets.
tasks where the input and output sequences have varying
Transformer for the ASR task is first proposed in [116],
lengths and mappings between them are not one-to-one.
4 https://www.ldc.upenn.edu/ Various other improvements have also been proposed to

Mehrish et al.: Preprint submitted to Elsevier Page 28 of 72


A Review of Deep Learning Techniques for Speech Processing

Table 8 Model (USM) [656] developed by Google, which has been


Comparison of performance between wav2vec2.0 Large and trained on over 12 million hours of speech and 28 billion
Whisper on different datasets. The zero-shot Whisper model sentences of text in more than 300 languages. With its 2
consistently outperforms wav2vec2.0 Large on several datasets, billion parameters, USM can recognize speech in both com-
indicating significant performance differences. mon languages like English and Mandarin and less-common
languages. Other popular acoustic models for speech recogni-
Dataset wav2vec2.0 Large Whisper Large
tion include Quartznet [273], Citrinet [365], and Conformer
Common Voice 29.9 9.0 [162]. These models can be chosen and switched based on the
Fleurs En 14.6 4.4 specific use case and performance requirements of the speech
recognition pipeline. For example, Conformer-based acous-
Tedlium 10.5 4.0 tic models are preferred for addressing robust ASR, as shown
CHiME6 65.8 25.5 in a recent study. Another study found that Conformer-15 is
more effective in handling real-world data and can produce
VoxPopuli En 17.9 7.3 up to 43% fewer errors on noisy data than other popular ASR
Switchboard 28.3 13.8 models. Additionally, fine-tuning pre-trained models such
as BERT [109] and GPT [446] has been explored for ASR
CallHome 34.8 17.6
tasks, leading to state-of-the-art performance on benchmark
LibriSpeech Clean 2.7 2.7 datasets like LibriSpeech (refer to Table 7). An open-source
LibriSpeech Other 6.2 5.2 toolkit called Vosk6 provides pre-trained models for multiple
languages optimized for real-time and efficient performance,
making it suitable for applications that require such perfor-
enhance the performance of Speech Transformer architecture mance.
and integrate it with the naive language model, as the use The field of speech recognition has made significant
of the transformer directly for ASR has not been effective progress by adopting unsupervised pre-training techniques,
in exploiting the correlation among the speech frames. The such as those utilized by Wav2Vec 2.0 [26]. Another recent
sequence order of speech, which the recurrent processing of advancement in automatic speech recognition (ASR) is the
input features can represent, is an important distinction. The whisper model, which has achieved human-level accuracy
degradation in performance for long sentences is reported when transcribing the LibriSpeech dataset. These two cutting-
using absolute positional embedding (AED) [85]. The prob- edge frameworks, Wav2Vec 2.0 and whisper, currently repre-
lems associated with long sequences can become more acute sent the state-of-the-art in ASR. The whisper model is trained
for transformer [672]. To address this issue, a transition was on an extensive supervised dataset, including over 680,000
made from absolute positional encoding to relative positional hours of audio data collected from the web, which has made
embeddings [672]. Whereas authors in [539] replace posi- it more resilient to various accents, background noise, and
tional embeddings with pooling layers. In a considerably technical jargon. The whisper model is also capable of tran-
different approach, the authors in [383] propose a novel way scribing and translating audio in multiple languages, making
of combining positional embedding with speech features by it a versatile tool. OpenAI has released inference models and
replacing positional encoding with trainable convolution lay- code, laying the groundwork for the development of practical
ers. This update further improves the stability of optimization applications based on the whisper model.
for large-scale learning of transformer networks. The above In contrast to its predecessor, Wav2Vec 2.0 is a self-
works confirmed the superiority of their techniques against supervised learning framework that trains models on unla-
sinusoidal positional encoding. beled audio data before fine-tuning them on specific datasets.
In 2016, Baidu introduced a hybrid ASR model called It uses a contrastive predictive coding (CPC) loss function to
Deep Speech 2 [13] that uses both RNNs and Transformers. learn speech representations directly from raw audio data, re-
The model also uses CNNs to extract features from the audio quiring less labeled data. The model’s performance has been
signal, followed by a stack of RNNs to model the temporal impressive, achieving state-of-the-art results on several ASR
dependencies and a Transformer-based decoder to generate benchmarks. These advances in unsupervised pre-training
the output sequence. This approach achieved state-of-the-art techniques and the development of novel ASR frameworks
results on several benchmark datasets such as LibriSpeech, like Whisper and Wav2Vec 2.0 have greatly improved the
VoxForge, WSJeval92 etc. The transition of ASR models field of speech recognition, paving the way for new real-world
from RNNs to Transformers has significantly improved per- applications. In summary, the Table 8 highlights the varying
formance, especially for long sentences and noisy environ- effectiveness of wav2vec2.0 large and whisper models across
ments. different datasets.
The Transformer architecture has been widely adopted 5 https://www.assemblyai.com/blog/conformer-1/
6 https://alphacephei.com/vosk/lm
by different companies and research groups for their ASR
models, and it is expected that more organizations will follow
this trend in the upcoming years. One of the advanced speech
models that leverage this architecture is the Universal Speech

Mehrish et al.: Preprint submitted to Elsevier Page 29 of 72


A Review of Deep Learning Techniques for Speech Processing

5.2. Neural Speech Synthesis tasks. Moreover, it has been used as a benchmark for nu-
5.2.1. Task Description merous neural speech synthesis models, including Tacotron
Neural speech synthesis is a technology that utilizes artifi- [583], WaveNet [404], and DeepVoice [18, 156].
cial intelligence and deep learning techniques to create speech Apart from the LJ Speech dataset, several other datasets
from text or other inputs. Its applications are widespread, are widely used in neural speech synthesis research. The
including in healthcare, where it can be used to develop assis- CMU Arctic [267] and L2 Arctic [661] datasets contain
tive technologies for those who are unable to communicate recordings of English speakers with diverse accents reading
due to neurological impairments. To generate speech, deep passages designed to capture various phonetic and prosodic
neural networks like CNNs, RNNs, transformers, and diffu- aspects of speech. The LibriSpeech [412], VoxCeleb [92],
sion models are trained using phonemes and the mel spectrum. TIMIT Acoustic-Phonetic Continuous Speech Corpus [153],
The process involves several components, such as text anal- and Common Voice Dataset [17] are other valuable datasets
ysis, acoustic models, and vocoders, as shown in Figure 14. that offer ample opportunities for training and evaluating
Acoustic models convert linguistic features into acoustic fea- text-to-speech synthesis models.
tures, which are then used by the vocoder to synthesize the
final speech signal. Various architectures, including neural 5.2.3. Models
vocoders based on GANs like HiFi-GAN [268], are used Neural network-based text-to-speech (TTS) systems have
by the vocoder to generate speech. Neural speech synthe- been proposed using neural networks as the basis for speech
sis also enables manipulation of voice, pitch, and speed of synthesis, particularly with the emergence of deep learn-
speech signals using frameworks such as Fastspeech2 [460] ing. In Statistical Parametric Speech Synthesis (SPSS), early
and NANSY/NANSY++ [82, 83]. These frameworks use neural models replaced HMMs for acoustic modeling. The
information bottleneck to disentangle analysis features for first modern neural TTS model, WaveNet [404], generated
controllable synthesis. The research in neural speech syn- waveforms directly from linguistic features. Other models,
thesis can be classified into two prominent approaches: au- such as DeepVoice 1/2 [18, 156], used neural network-based
toregressive and non-autoregressive models. Autoregressive models to follow the three components of statistical para-
models generate speech one element at a time, sequentially, metric synthesis. End-to-end models, including Tacotron
while non-autoregressive models generate all the elements 1 & 2 [583, 493], Deep Voice 3, and FastSpeech 1 & 2
simultaneously, in parallel. Table 9 outlines the different [462, 460], simplified text analysis modules and utilized
architecture proposed under each category. mel-spectrograms to simplify acoustic features with char-
The evaluation of synthesized speech is of paramount acter/phoneme sequences as input. Fully end-to-end TTS
importance for assessing its quality and fidelity. It serves systems, such as ClariNet [427], FastSpeech 2 [460], and
as a means to gauge the effectiveness of different speech EATS [114], are capable of directly generating waveforms
synthesis techniques, algorithms, and parameterization meth- from text inputs. Compared to concatenative synthesis 7 and
ods. In this regard, the application of statistical tests has statistical parametric synthesis, neural network-based speech
emerged as a valuable approach to objectively measure the synthesis offers several advantages including superior voice
similarity between synthesized speech and natural speech quality, naturalness, intelligibility, and reduced reliance on
[139]. These tests complement the traditional Mean Opinion human preprocessing and feature development. Therefore,
Score (MOS) evaluations and provide quantitative insights end-to-end TTS systems represent a promising direction for
into the performance of speech synthesis systems. Addition- advancing the field of speech synthesis.
ally, widely used objective metrics such as Mel Cepstral Dis- Transformer models have become increasingly popular
tortion (MCD) and Word Error Rate (WER) contribute to the for generating mel-spectrograms in TTS systems [460, 309].
comprehensive evaluation of synthesized speech, enabling These models are preferred over RNN structures in end-to-
researchers and practitioners to identify areas for improve- end TTS systems because they improve training and inference
ment and refine the synthesis process. By employing these efficiency [462, 309]. In a study conducted by Li et al. [309],
objective metrics and statistical tests, the evaluation of syn- a multi-head attention mechanism replaced both RNN struc-
thesized speech becomes a rigorous and systematic process, tures and the vanilla attention mechanism in Tacotron 2 [493].
enhancing the overall quality and fidelity of speech synthesis This approach addressed the long-distance dependency prob-
techniques. lem and improved pluralization. Phoneme sequences were
used as input to generate the mel-spectrogram, and speech
5.2.2. Datasets samples were synthesized using WaveNet as a vocoder. Re-
The field of neural speech synthesis is rapidly advancing sults showed that the transformer-based TTS approach was
and relies heavily on high-quality datasets for effective train- 4.25 times faster than Tacotron 2 and achieved similar MOS
ing and evaluation of models. One of the most frequently (Mean Opinion Score) performance.
utilized datasets in this field is the LJ Speech [217], which fea- Aside from the work mentioned above, there are other
tures about 24 hours of recorded speech from a single female studies that are based on the Tacotron architecture. For exam-
speaker reading passages from the public domain LJ Speech ple, Skerry-Ryan et al. [505] and Wang et al. [584] proposed
Corpus. This dataset is free and has corresponding transcripts, Tacotron-based models for prosody control. These models
making it an excellent choice for text-to-speech synthesis 7 https://en.wikipedia.org/wiki/Concatenative_synthesis

Mehrish et al.: Preprint submitted to Elsevier Page 30 of 72


A Review of Deep Learning Techniques for Speech Processing

Table 9
Exploring the Landscape of TTS and Vocoder Architectures: Autoregressive and Non-
Autoregressive Models.
Method Text-To-Speech Vocoder
Tacotron [583] ,Tacotron2 [493], Deep Voice 1,2,3
WaveNet [404], WaveRNN [232], WaveGAN [422]
Transformer-TTS [309], DurIAN [627], Flowtron [550]
Autoregressive Model LPCNet [548], GAN-TTS [38], MultiBand-WaveRNN [627]
RobuTrans [310], DeviceTTS [211],Wave-Tacotron [590]
ImporvedLPCNet [547], Bunched LPCNet2 [416]
Apple TTS [9]
ParaNet [423], FastSpeech [462], JDI-T [317], EATS [115]
FastSpeech2 [460], FastPitch [284], Glow-TTS [250]
Flow-TTS [376], SpeedySpeech [546] Parallel-WaveNet [403], WaveGlow [437], Parallel-WaveGAN [608]
Parallel Tacotron [126], BVAE-TTS [296] MelGAN [275], MultiBand-MelGAN [612], VocGAN [614], WaveGrad [67]
Non-Autoregressive Model
Parallel Tacotron2 [126], Grad-TTS [433], VITS [251] DiffWave [269], HiFi-GAN [268], StyleMelGAN [386], Fre-GAN [254]
RAD-TTS [495], WaveGrad2 [69], DelightfulTTS [342] iSTFTNet [241], Avocodo [32]
PortaSpeech [461], DiffGAN-TTS [340], JETS [318]
WavThruVec [504], FastDiff [204], CLONE [343]

use a separate encoder to compute style information from systems like FastSpeech 2 and FastSpeech 2s, researchers
reference audio that is not provided in the text. Another note- have also been exploring the potential of Variational Autoen-
worthy work is the Global-style-Token (GST) [584] which coder (VAE) based TTS models [296, 195, 163, 251]. These
improves on style embeddings by adding an attention layer models can learn a latent representation of speech signals
to capture a wider range of acoustic styles. from textual input and may be able to produce high-quality
The FastSpeech [462] algorithm aims to improve the in- speech with less training data and greater control over the gen-
ference speed of TTS systems. To achieve this, it utilizes erated speech characteristics. For example, authors in [251]
a feedforward network based on 1D convolution and the used a conditional variational autoencoder (CVAE) to model
self-attention mechanism in transformers to generate Mel- the acoustic features of speech and an adversarial loss to im-
spectrograms in parallel. Additionally, it solves the issue prove the naturalness of the generated speech. This approach
of sequence length mismatch between the Mel-spectrogram involved conditioning the CVAE on the linguistic features
sequence and its corresponding phoneme sequence by em- of the input text and using an adversarial loss to match the
ploying a length regulator based on a duration predictor. The distribution of the generated speech to that of natural speech.
FastSpeech model was evaluated on the LJSpeech dataset and Results from this method have shown promise in generating
demonstrated significantly faster Mel-spectrogram generation speech that exhibits natural prosody and intonation.
than the autoregressive transformer model while maintaining WaveGrad [67] and DiffWave [269] have emerged as sig-
comparable performance. FastPitch builds on FastSpeech by nificant contributions in the field, employing diffusion models
conditioning the TTS model on fundamental frequency or to generate raw waveforms with exceptional performance. In
pitch contour, which improves convergence and eliminates contrast, GradTTS [433] and DiffTTS [218] utilize diffusion
the need for knowledge distillation of Mel-spectrogram tar- models to generate mel features rather than raw waveforms.
gets in FastSpeech. Addressing the intricate challenge of one-shot many-to-many
FastSpeech 2 [460] represents a transformer-based Text- voice conversion, DiffVC [434] introduces a novel solver
to-Speech (TTS) system that addresses the limitations of based on stochastic differential equations. Expanding the
its predecessor, FastSpeech, while effectively handling the scope of sound generation to include singing voice synthesis,
challenging one-to-many mapping problem in TTS. It intro- DiffSinger [334] introduces a shallow diffusion mechanism.
duces the utilization of a broader range of speech information, Additionally, Diffsound [611] proposes a sound generation
including energy, pitch, and more accurate duration, as con- framework that incorporates text conditioning and employs
ditional inputs. Furthermore, FastSpeech 2 trains the system a discrete diffusion model, effectively resolving concerns
directly on a ground-truth target, enhancing the quality of the related to unidirectional bias and accumulated errors.
synthesized speech. Additionally, a simplified variant called EdiTTS [527] introduces a diffusion-based audio model
FastSpeech 2s has been proposed in [61], eliminating the that is specifically tailored for the text-to-speech task. Its
requirement for intermediate Mel-spectrograms and enabling innovative approach involves the utilization of the denoising
the direct generation of speech from text during inference. reversal process to incorporate desired edits through coarse
Experimental evaluations conducted on the LJSpeech dataset perturbations in the prior space. Similarly, Guided-TTS [249]
demonstrated that both FastSpeech 2 and FastSpeech 2s offer and Guided-TTS2 [257] stand as early text-to-speech models
a streamlined training pipeline, resulting in fast, robust, and that have effectively harnessed diffusion models for sound
controllable speech synthesis compared to FastSpeech. generation. Furthermore, Levkovitch et al. [301] have made
Furthermore, in addition to the transformer-based TTS notable contributions by combining a voice diffusion model

Mehrish et al.: Preprint submitted to Elsevier Page 31 of 72


A Review of Deep Learning Techniques for Speech Processing

Linguistic Acoustic
Text Features Acoustic Features Waveform
Text Waveform
analysis Model Generation

Figure 14: Neural Text-to-speech (TTS) pipeline: a diagram showing the main modules of a typical TTS system. The system
takes text input and processes it through various stages to generate speech output. The text analysis module tokenizes the input
text and generates linguistic features such as phonemes and prosody. The acoustic model module then converts these linguistic
features into acoustic features, such as mel spectrograms, using a neural network. Finally, the waveform generation module
synthesizes the speech waveform from the acoustic features using another neural network.

with a spectrogram domain conditioning technique. This com-


bined approach facilitates text-to-speech synthesis, even with uLM
previously unseen voices during the training phase, thereby
enhancing the model’s versatility and capabilities. Speech
InferGrad [74] enhances the diffusion-based text-to-speech Discrete Generation
model by incorporating the inference process during train-
Resynthesis

ing, particularly when a limited number of inference steps Quantizer Decoder


(u2S)
are available. This improvement results in faster and higher- Encoder
quality sampling. SpecGrad [264] introduces adaptations to (S2u)
the time-varying spectral envelope of diffusion noise based on
conditioning log-mel spectrograms, drawing inspiration from
signal processing techniques. ItoTTS [597] presents a unified
framework that combines text-to-speech and vocoder mod-
els, utilizing linear SDE (Stochastic Differential Equation)
as its fundamental principle. ProDiff [206] proposes a pro-
gressive and efficient diffusion model specifically designed Figure 15: The architecture of the Generative Spoken Language
for generating high-quality text-to-speech synthesis. Unlike Model GSLM introduced by Meta in [281]. GSLM model
traditional diffusion models that require a large number of it- operates through a three-part architecture. Firstly, the encoder
erations, ProDiff parameterizes the model by predicting clean takes the speech waveform and transforms it into distinct
data and incorporates a teacher-synthesized mel-spectrogram units represented as S2u. Secondly, the decoder reverses this
as a target to minimize data discrepancies and improve the
mapping by converting the units back to the original waveform,
sharpness of predictions. Finally, Binaural Grad [299] ex-
represented as u2S. Finally, the language model is unit-based
and captures the distribution of unit sequences, which can be
plores the application of diffusion models in binaural audio viewed as a form of pseudo-text.
synthesis, aiming to generate binaural audio from monau-
ral audio sources. It accomplishes this through a two-stage
diffusion-based framework.
same subword segmentation for the input text as for the ASR
5.2.4. Alignment output targets. While RNN models with soft attention mech-
Improving the alignment of text and speech in TTS ar- anisms have been proven to be highly effective in various
chitecture has been the focus of recent research [250, 433, tasks, including speech synthesis, their use in online settings
225, 377, 316, 495, 375, 22, 64, 461, 29, 646, 35, 492]. Tra- results in quadratic time complexity due to the pass over the
ditional TTS models require external aligners to provide at- entire input sequence for generating each element in the out-
tention alignments of phoneme-to-frame sequences, which put sequence. In [449], the authors proposed an end-to-end
can be complex and inefficient. Although autoregressive TTS differentiable method for learning monotonic alignments, en-
models use an attention mechanism to learn these alignments abling the computation of attention in linear time. Several
online, these alignments tend to be brittle and often fail to gen- enhancements, such as those proposed in [79], have been
eralize to long utterances and out-of-domain text, resulting proposed in recent years to improve alignment in TTS mod-
in missing or repeating words. els. Additionally, in [21], the authors introduced a generic
In their study [121], the authors presented a novel text alignment learning framework that can be easily extended to
encoder network that includes an additional objective func- various neural TTS models.
tion to explicitly align text and speech encodings. The text The use of normalizing flow has been introduced to ad-
encoder architecture is straightforward, consisting of an em- dress output diversity issues in parallel TTS architectures.
bedding layer, followed by two bidirectional LSTM layers This technique is utilized to model the duration of speech, as
that maintain the input’s resolution. The study utilized the evidenced by studies conducted in [250, 495, 377]. One such

Mehrish et al.: Preprint submitted to Elsevier Page 32 of 72


A Review of Deep Learning Techniques for Speech Processing

One such flow-based generative model is Glow-TTS [250], developed specifically for parallel TTS without the need for an external aligner. The model employs the generic Glow architecture previously used in computer vision and vocoder models to produce mel-spectrograms from text inputs, which are then converted to speech audio. Glow-TTS has demonstrated an order-of-magnitude speed-up at synthesis over the autoregressive model Tacotron 2 [493], while maintaining comparable speech quality.

Recently, a new TTS model called EfficientTTS [377] has been introduced. This model outperforms previous models such as Tacotron 2 and Glow-TTS in terms of speech quality, training efficiency, and synthesis speed. EfficientTTS uses a multi-head attention mechanism to align the input text and speech encodings, enabling it to generate high-quality speech with fewer parameters and faster synthesis. Overall, the introduction of normalizing flow and the development of models such as Glow-TTS and EfficientTTS have significantly improved the quality and efficiency of TTS systems.

5.2.5. Speech Resynthesis

Speech resynthesis is the process of generating speech from a given input signal. The input signal can be in various forms, such as a digital recording, text, or other types of data. The aim of speech resynthesis is to create an output that closely resembles the original signal in terms of sound quality, prosody, and other acoustic characteristics. Speech resynthesis is an important research area with applications in speech enhancement [528, 193, 363] and voice conversion [362]. Recent advancements have revolutionized the field by incorporating self-supervised discrete representations that disentangle speech content, prosodic information, and speaker identity, enabling speech to be generated in a controlled and precise manner [281, 431, 439, 497]. The objective is to generate high-quality speech that maintains or degrades acoustic cues, such as phonotactics, syllabic rhythm, or intonation, from natural speech recordings. Such discrete representations are used in the GSLM [281] architecture for acoustic modeling, speech recognition, and synthesis, as outlined in Figure 15. GSLM comprises a discrete speech encoder, a generative language model, and a speech decoder, all trained without supervision, and it is the only prior work addressing the generative aspect of speech pre-training, building a text-free language model from discovered units.

5.2.6. Voice Conversion

Modifying a speaker's voice in a provided audio sample to that of another individual, while preserving the linguistic content, is called voice conversion. TTS and voice conversion share a common objective of generating natural speech. While models based on RNNs and CNNs have been successfully applied to voice conversion, the use of the transformer has shown promising results. The Voice Transformer Network (VTN) [210] is a seq2seq voice conversion (VC) model based on the transformer architecture with TTS pre-training. Seq2seq VC models are attractive as they can convert prosody, and the VTN has proven effective in converting speech from a source to a target speaker without changing the linguistic content.

ASR and TTS-based voice conversion is another promising approach [534]. It involves using an ASR model to transcribe the source speech into a linguistic representation and then using a TTS model to synthesize the target speech with the desired voice characteristics [432]. However, this approach overlooks the modeling of prosody, which plays an important role in speech naturalness and conversion similarity. To address this issue, researchers have proposed to directly predict prosody from the linguistic representation in a target-speaker-dependent manner [649]. Other researchers have explored using a mix of ASR and TTS features to improve the quality of voice conversion [209, 665, 86, 647].

CycleGAN [238, 239, 240], VAE [82, 595, 235], and VAE combined with a generative adversarial network [191] are other popular approaches for non-parallel voice conversion. CycleGAN-VC [238] uses a cycle-consistent adversarial network to convert the source voice to the target voice and can generate high-quality speech without any extra data, modules, or alignment procedure; several improvements and modifications have been proposed in recent years [239, 240, 191]. VAE-based voice conversion is a promising approach that can generate high-quality speech with a small amount of training data [82, 595, 235].
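To make the CycleGAN-VC idea concrete, the sketch below shows only its cycle-consistency term in PyTorch; G_xy and G_yx are hypothetical mel-to-mel generators for the two conversion directions, and the full method of [238] additionally combines this term with adversarial and identity-mapping losses from per-domain discriminators.

    import torch
    import torch.nn.functional as F

    def cycle_consistency_loss(G_xy, G_yx, mel_x, mel_y):
        # Forward cycle: source -> target -> source should reconstruct the source.
        x_cycled = G_yx(G_xy(mel_x))
        # Backward cycle: target -> source -> target should reconstruct the target.
        y_cycled = G_xy(G_yx(mel_y))
        return F.l1_loss(x_cycled, mel_x) + F.l1_loss(y_cycled, mel_y)

Because the two cycles are trained on unpaired batches of source- and target-speaker features, no parallel utterances or frame alignment are required, which is what the "no extra data, modules, or alignment procedure" property above refers to.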
5.2.7. Vocoders

The field of audio synthesis has undergone significant advancements in recent years, with various approaches proposed to enhance the quality of synthesized audio.


Prior studies have concentrated on improving discriminator architectures or incorporating auxiliary training losses. For instance, MelGAN introduced a multiscale discriminator that uses window-based discriminators at different scales and applies average pooling to downsample the raw waveform. It enforces the correspondence between the input mel spectrogram and the synthesized waveform using an L1 feature matching loss from the discriminator. In contrast, GAN-TTS [38] utilizes an ensemble of discriminators that operate on random windows of different sizes and enforce the mapping between the conditioner and the waveform adversarially using conditional discriminators. Another approach, Parallel WaveGAN [608], extends the single short-time Fourier transform loss to multi-resolution and employs it as an auxiliary loss for GAN training (a sketch of this loss follows the list below). Recently, some researchers have improved MelGAN by integrating the multi-resolution short-time Fourier transform loss. HiFi-GAN reuses the multi-scale discriminator from MelGAN and introduces the multi-period discriminator for high-fidelity synthesis. UnivNet employs a multi-resolution discriminator that takes multi-resolution spectrograms as input and can enhance the spectral structure of a synthesized waveform. In contrast, CARGAN integrates partial autoregression into the generator to enhance pitch and periodicity accuracy. The recent generative models for modeling raw audio can be categorized into the following groups.

• Autoregressive models: Although WaveNet is renowned for its exceptional ability to generate high-quality speech, including natural-sounding intonation and prosody, other neural vocoders have emerged as potential alternatives in recent years. For instance, LPCNet [548] employs a combination of linear predictive coding (LPC) and deep neural networks (DNNs) to generate speech of similar quality while being computationally efficient and capable of producing low-bitrate speech. Similarly, SampleRNN [373], an unconditional end-to-end model, has demonstrated potential as it leverages a hierarchical RNN architecture and is trained end-to-end to generate raw speech of high quality.

• Generative Adversarial Network (GAN) vocoders: Numerous vocoders have been created that employ Generative Adversarial Networks (GANs) to generate speech of exceptional quality. These GAN-based vocoders, which include MelGAN [275] and HiFi-GAN [268], are capable of producing high-fidelity raw audio by conditioning on mel spectrograms. Furthermore, they can synthesize audio at speeds several hundred times faster than real-time on a single GPU, as evidenced by research conducted in [113, 39, 608, 268, 275].

• Diffusion-based models: In recent years, several novel architectures based on diffusion have been proposed. Two prominent examples are WaveGrad [68] and DiffWave [269]. The WaveGrad model architecture builds upon prior work on score matching and diffusion probabilistic models, while the DiffWave model uses adaptive noise spectral shaping to adapt the diffusion noise. This adaptation, achieved through time-varying filtering, improves sound quality, particularly in high-frequency bands. Other examples of diffusion-based vocoders include InferGrad [74], SpecGrad [264], and PriorGrad [293]. InferGrad incorporates the inference process into training to reduce inference iterations while maintaining high quality. SpecGrad adapts the diffusion noise distribution to a given acoustic feature and uses adaptive noise spectral shaping to generate high-fidelity speech waveforms.

• Flow-based models: Parallel WaveNet, WaveGlow, and related models [354, 437, 258, 429, 294] are based on normalizing flows and are capable of generating high-fidelity speech in real time. While flow-based vocoders generally perform worse than autoregressive vocoders with regard to modeling the density of speech signals, recent research [354] has proposed new techniques to improve their performance.
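The multi-resolution STFT auxiliary loss mentioned above for Parallel WaveGAN-style training can be sketched in a few lines of PyTorch; the particular FFT sizes, hop lengths, and window lengths below are illustrative defaults rather than the exact settings of [608].

    import torch
    import torch.nn.functional as F

    def stft_magnitude(x, n_fft, hop, win):
        window = torch.hann_window(win, device=x.device)
        spec = torch.stft(x, n_fft, hop_length=hop, win_length=win,
                          window=window, return_complex=True)
        return spec.abs().clamp(min=1e-7)

    def multi_resolution_stft_loss(generated, reference,
                                   resolutions=((1024, 256, 1024),
                                                (2048, 512, 2048),
                                                (512, 128, 512))):
        # Sum of spectral-convergence and log-magnitude L1 terms over several resolutions.
        loss = 0.0
        for n_fft, hop, win in resolutions:
            g = stft_magnitude(generated, n_fft, hop, win)
            r = stft_magnitude(reference, n_fft, hop, win)
            spectral_convergence = torch.norm(r - g, p="fro") / torch.norm(r, p="fro")
            log_magnitude = F.l1_loss(torch.log(g), torch.log(r))
            loss = loss + spectral_convergence + log_magnitude
        return loss / len(resolutions)

Comparing magnitudes at several time-frequency resolutions discourages the generator from over-fitting a single analysis window, which is why the same loss has since been folded into MelGAN variants and other GAN vocoders.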
Universal neural vocoding is a challenging task that has achieved limited success to date. However, recent advances in speech synthesis have shown a promising trend toward improving zero-shot performance by scaling up model sizes. Despite its potential, this approach has yet to be extensively explored. Nonetheless, several approaches have been proposed to address the challenges of universal vocoding. For example, WaveRNN has been utilized in previous studies to achieve universal vocoding (Lorenzo-Trueba et al. [344]; Paul et al. [421]). Another approach, developed by Jiao et al. [221], constructs a universal vocoder using a flow-based model. Additionally, the GAN vocoder has emerged as a promising candidate for this task, as suggested by You et al. [626].

5.2.8. Controllable Speech Synthesis

Controllable Speech Synthesis [549, 122, 676, 545, 462, 584, 276] is a rapidly evolving research area that focuses on generating natural-sounding speech with the ability to control various aspects of speech, including pitch, speed, and emotion. Controllable Speech Synthesis is positioned in the emerging field of affective computing, at the intersection of three disciplines: expressive speech analysis [535], natural language processing, and machine learning. This field aims to develop systems capable of recognizing, interpreting, and generating human-like emotional responses in interactions between humans and machines.

Expressive speech analysis is a critical component of this field. It provides mathematical tools to analyse speech signals and extract various acoustic features, including pitch, loudness, and duration, that convey emotions in speech. Natural language processing is also crucial, as it helps to process the text input and extract the meaning and sentiment of the words. Finally, machine learning techniques are used to model and control the expressive features of the synthesized speech, enabling the systems to produce more expressive and controllable speech [11, 337, 550, 274, 517, 666, 410, 205, 295].


In the last few years, notable advancements have been achieved in this field [452, 248, 164], and several approaches have been proposed to enhance the quality of synthesized speech. For example, some studies propose using deep learning techniques to synthesize expressive speech and conditional generation models to control the prosodic features of speech [452, 248]. Others propose using motion matching-based algorithms to synthesize gestures from speech [164].

5.2.9. Disentangling and Transferring

The importance of disentangled representations for neural speech synthesis cannot be overstated, as it has been widely recognized in the literature that this approach can greatly improve the interpretability and expressiveness of speech synthesis models [360, 194, 438]. Disentangling multiple styles or prosody information during training is crucial to enhance the quality of expressive speech synthesis and control. Various disentangling techniques have been developed using adversarial and collaborative games, the VAE framework, bottleneck reconstructions, and frame-level noise modeling combined with adversarial training.

For instance, Ma et al. [360] have employed adversarial and collaborative games to enhance the disentanglement of content and style, resulting in improved controllability. Hsu et al. [194] have utilized the VAE framework with adversarial training to separate speaker information from noise. Qian et al. [438] have introduced speech flow, which can disentangle rhythm, pitch, content, and timbre through three bottleneck reconstructions. In another work based on adversarial training, Zhang et al. [642] have proposed a method that disentangles noise from the speaker by modeling the noise at the frame level.

Developing high-quality speech synthesis models that can handle noisy data and generate accurate representations of speech is a challenging task. To tackle this issue, Zhang et al. [650] propose a novel approach involving multi-length adversarial training. This method allows for modeling different noise conditions and improves the accuracy of pitch prediction by incorporating discriminators on the mel-spectrogram. By replacing the traditional pitch predictor model with this approach, the authors demonstrate significant improvements in the fidelity of synthesized speech.
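Several of the disentangling approaches above build on the variational autoencoder objective; the following is a minimal sketch of such a two-branch VAE loss, assuming hypothetical content and speaker encoders (each returning a mean and log-variance) and a decoder over mel-spectrograms. The adversarial and bottleneck terms of the cited works are deliberately omitted.

    import torch
    import torch.nn.functional as F

    def two_branch_vae_loss(enc_content, enc_speaker, decoder, mel, beta=1e-3):
        mu_c, logvar_c = enc_content(mel)          # content latent parameters
        mu_s, logvar_s = enc_speaker(mel)          # speaker/style latent parameters
        # Reparameterization trick for both branches.
        z_c = mu_c + torch.exp(0.5 * logvar_c) * torch.randn_like(mu_c)
        z_s = mu_s + torch.exp(0.5 * logvar_s) * torch.randn_like(mu_s)
        recon = decoder(z_c, z_s)
        kl = -0.5 * torch.sum(1 + logvar_c - mu_c.pow(2) - logvar_c.exp()) \
             - 0.5 * torch.sum(1 + logvar_s - mu_s.pow(2) - logvar_s.exp())
        return F.l1_loss(recon, mel) + beta * kl

On its own this objective only encourages compact latents; the disentanglement itself comes from the additional constraints discussed above, such as adversarial classifiers on the content branch or deliberately narrow bottlenecks.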
5.2.10. Robustness

Using neural TTS models can present issues with robustness, leading to low-quality audio samples for unseen or atypical text. In response, Li et al. [310] proposed RobuTrans, a robust transformer that converts input text to linguistic features before feeding it to the encoder. This model also includes modifications to the attention mechanism and position embedding, resulting in improved MOS scores compared to other TTS models. Another approach to enhancing robustness is the s-Transformer, introduced by Wang et al. [579], which models speech at the segment level, allowing it to capture long-term dependencies and use segment-level encoder-decoder attention. This technique performs similarly to the standard transformer while exhibiting robustness for extra-long sentences. Lastly, Zheng et al. [670] proposed an approach that combines a local recurrent neural network with the transformer to capture sequential and local information in sequences. Evaluation on a 20-hour Mandarin speech corpus demonstrated that this model outperforms the transformer alone.

In their recent paper [610], the authors proposed a novel method for extracting dynamic prosody information from audio recordings, even in noisy environments. Their approach employs probabilistic denoising diffusion models and knowledge distillation to learn speaking style features from a teacher model, resulting in a highly accurate reproduction of prosody and timbre. This model shows great potential in applications such as speech synthesis and recognition, where noise-robust prosody information is crucial. Other noteworthy advances in the development of robust TTS systems include the work in [495], which focuses on a robust speech-text alignment module, as well as the use of normalizing flows for diverse speech synthesis.

5.2.11. Low-Resource Neural Speech Synthesis

High-quality paired text and speech data are crucial for building high-quality Text-to-Speech (TTS) systems [147]. Unfortunately, most languages are not supported by popular commercialized speech services due to the lack of sufficient training data [604]. To overcome this challenge, researchers have developed TTS systems under low-data-resource scenarios using various techniques [147, 604, 127, 540].

Several techniques have been proposed to enhance the efficiency of low-resource and zero-shot TTS systems. One of these is the use of semi-supervised speech synthesis methods that utilize unpaired training data to improve data efficiency, as suggested in a study by Liu et al. [328]. Another method involves cascading pre-trained models for ASR, MT, and TTS to increase data size from unlabelled speech, as proposed by Nguyen et al. [395]. In addition, researchers have employed crowdsourced acoustic data collection to develop TTS systems for low-resource languages, as shown in a study by Butryna et al. [50]. Huang et al. [205] introduced a zero-shot style transfer approach for out-of-domain speech synthesis that generates speech samples exhibiting a new and distinctive style, such as speaker identity, emotion, and prosody.

5.3. Speaker recognition

5.3.1. Task Description

The speech signal carries information on various characteristics of a speaker, such as origin, identity, gender, emotion, etc. This property of speech allows speech-based speaker profiling with a wide range of applications in forensics, recommendation systems, etc. The research on recognizing speakers is extensive and aims to solve two major tasks: speaker identification (what is the identity?) and speaker verification (is the speaker he/she claims to be?). Speaker recognition/verification tasks require extracting a fixed-length vector, called a speaker embedding, from unconstrained utterances.


These embeddings represent the speakers and can be used for identification or verification tasks. Recent state-of-the-art speaker-embedding-extractor models are based on DNNs and have shown superior performance on both speaker identification and verification tasks.

• Speaker Recognition (SR) relies on speaker identification as a key aspect, where an unknown speaker's speech sample is compared to speech models of known speakers to determine their identity. The primary aim of speaker identification is to distinguish an individual's identity from a group of known speakers. This process involves a detailed analysis of the speaker's voice characteristics, such as pitch, tone, accent, and other pertinent features, to establish their identity. Recent advancements in deep learning techniques have significantly enhanced speaker identification, leading to the creation of accurate, efficient, and end-to-end models. Various deep learning-based models such as CNNs, RNNs, and their combinations have demonstrated exceptional performance in several subtasks of speaker identification, including verification, identification, diarization, and robust recognition [458, 247, 260].

• Speaker Verification (SV) is a process that involves confirming the identity of a speaker through their speech. It differs from speaker identification, which aims to identify unknown speakers by comparing their voices with those of registered speakers in a database. Speaker verification verifies whether a speaker is who they claim to be by comparing their voice with an available speaker template. Deep learning-based speaker verification relies on speaker representations based on embeddings, which involves learning low-dimensional vector representations from speech signals that capture speaker characteristics, such as pitch and speaking style, and can be used to compare different speech signals and determine their similarity.

5.3.2. Dataset

The VoxCeleb dataset (VoxCeleb 1 & 2) is widely used in speaker recognition research, as mentioned in [92]. This dataset consists of speech data collected from publicly available media, employing a fully automated pipeline that incorporates computer vision techniques. The pipeline retrieves videos from YouTube and applies active speaker verification using a two-stream synchronization CNN. Speaker identity is further confirmed through CNN-based facial recognition. Another commonly employed dataset is TIMIT, which comprises recordings of phonetically balanced English sentences spoken by a diverse set of speakers. TIMIT is commonly used for evaluating speech recognition and speaker identification systems, as referenced in [153].

Other noteworthy datasets in the field include the SITW database [371], which provides hand-annotated speech samples for benchmarking text-independent speaker recognition technology, and the RSR2015 database [286], which contains speech recordings acquired in a typical office environment using multiple mobile devices. Additionally, the RedDots project [291] and VOICES corpus [465] offer unique collections of offline voice recordings in furnished rooms with background noise, while the CN-CELEB database [135] focuses on a specific person of interest extracted from bilibili.com using an automated pipeline followed by human verification. The BookTubeSpeech dataset [426] was also collected using an automated pipeline from BookTube videos, and the Hi-MIA database [440] was designed specifically for far-field scenarios using multiple microphone arrays. The FFSVC20 challenge [441] and DIHARD challenge [473] are speaker verification and diarization research initiatives focusing on far-field and robustness challenges, respectively. Finally, the LibriSpeech dataset [412], originally intended for speech recognition, is also useful for speaker recognition tasks due to its included speaker identity labels.

5.3.3. Models

Speaker identification (SI) and verification (SV) are crucial research topics in the field of speech technology due to their significant importance in various applications such as security [125], forensics [270], biometric authentication [170], and speaker diarization [601]. Speaker recognition has become more popular with technological advancements, including the Internet of Things (IoT), smart devices, voice assistants, smart homes, and humanoids. Therefore, a significant quantity of research has been conducted in this field, and many methods have been developed, making the state of the art quite mature and versatile. However, it has become increasingly challenging to provide an overview of the various methods due to the high number of studies in the field.

A neural network approach for speaker verification was first attempted by Variani et al. [554] in 2014, utilizing four fully connected layers for speaker classification. Their approach successfully verified speakers with short-duration utterances by obtaining the d-vector, computed by averaging the output of the last hidden layer across frames. Although various attempts have been made to directly learn speaker representations from raw waveforms (Ravanelli and Bengio [456], Jung et al. [226]), other well-designed neural networks like CNNs and RNNs have also been proposed for speaker verification tasks by Ye and Yang [621]. Nevertheless, the field still requires more powerful deep neural networks for superior extraction of speaker features.

Speaker verification has seen notable advancements with the advent of more powerful deep neural networks. One such model is the x-vector-based system proposed by Snyder et al. [509], which has gained widespread popularity due to its remarkable performance. Since its introduction, the x-vector system has undergone significant architectural enhancements and optimized training procedures [103]. The widely-used ResNet [176] architecture has been incorporated into the system to improve its performance further. Adding residual connections between frame-level layers has been found to improve the embeddings [152, 634].


This technique has also aided in faster convergence of the back-propagation algorithm and mitigated the vanishing gradient problem [176]. Tang et al. [532] proposed further improvements to the x-vector system. They introduced a hybrid structure based on TDNN and LSTM to generate complementary speaker information at different levels. They also suggested a multi-level pooling strategy to collect the speaker information from global and local perspectives. These advancements have significantly improved speaker verification systems' performance and paved the way for further developments in the field.
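At the core of these x-vector-style systems is a stack of TDNN (dilated 1-D convolution) layers followed by statistics pooling, which turns a variable-length sequence of frame-level features into a fixed-length embedding. The sketch below is a deliberately small stand-in for the architecture of Snyder et al. [509], with illustrative layer sizes.

    import torch
    import torch.nn as nn

    class TinyXVector(nn.Module):
        # Input: (batch, feat_dim, frames); output: (batch, emb_dim) utterance embedding.
        def __init__(self, feat_dim=24, emb_dim=192):
            super().__init__()
            self.frame_layers = nn.Sequential(
                nn.Conv1d(feat_dim, 256, kernel_size=5, dilation=1), nn.ReLU(),
                nn.Conv1d(256, 256, kernel_size=3, dilation=2), nn.ReLU(),
                nn.Conv1d(256, 256, kernel_size=3, dilation=3), nn.ReLU(),
            )
            self.embedding = nn.Linear(2 * 256, emb_dim)

        def forward(self, feats):
            h = self.frame_layers(feats)                          # (B, 256, T')
            stats = torch.cat([h.mean(dim=2), h.std(dim=2)], 1)   # statistics pooling
            return self.embedding(stats)

    extractor = TinyXVector()
    enroll = extractor(torch.randn(1, 24, 300))   # enrollment utterance features
    test = extractor(torch.randn(1, 24, 220))     # test utterance features
    score = torch.nn.functional.cosine_similarity(enroll, test)

In practice the network is trained with a speaker-classification (or margin-based) loss, and verification decisions are made by thresholding a cosine or PLDA score between the enrollment and test embeddings.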
Desplanques et al. [108] propose a state-of-the-art architecture for speaker verification utilizing a Time Delay Neural Network (TDNN), called ECAPA-TDNN. The paper presents a range of enhancements to the existing x-vector architecture that leverage recent developments in face verification and computer vision. Specifically, the authors suggest three major improvements. Firstly, they propose restructuring the initial frame layers into 1-dimensional Res2Net modules with impactful skip connections, which can better capture the relationships between different time frames. Secondly, they introduce Squeeze-and-Excitation blocks to the TDNN layers, which help highlight the most informative channels and improve feature discrimination. Lastly, the paper proposes channel attention propagation and aggregation to efficiently propagate attention weights through multiple TDNN layers, further enhancing the model's ability to discriminate between speakers.

Additionally, a related approach utilizes ECAPA-TDNN from the speaker recognition domain as the backbone network for a multiscale channel adaptive module and achieves promising results, demonstrating the effectiveness of the architecture in speaker verification. Overall, ECAPA-TDNN offers a comprehensive solution to speaker verification by introducing several novel contributions that improve the x-vector architecture, which had been state-of-the-art in speaker verification for several years, and the reported results suggest that the proposed architecture can effectively tackle the challenges of speaker verification.
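The Squeeze-and-Excitation blocks that ECAPA-TDNN inserts into its TDNN layers are simple channel-gating modules; a generic 1-D version (with an illustrative reduction factor, not necessarily the setting used in [108]) looks as follows.

    import torch
    import torch.nn as nn

    class SEBlock1d(nn.Module):
        # Channel-wise gating for a (batch, channels, time) feature map.
        def __init__(self, channels, reduction=8):
            super().__init__()
            self.gate = nn.Sequential(
                nn.Linear(channels, channels // reduction), nn.ReLU(),
                nn.Linear(channels // reduction, channels), nn.Sigmoid(),
            )

        def forward(self, x):
            squeezed = x.mean(dim=2)                   # squeeze: global average over time
            weights = self.gate(squeezed).unsqueeze(-1)
            return x * weights                         # excitation: re-scale each channel

Because the gate is computed from a global temporal average, each channel is re-weighted using utterance-level context, which is what lets the network highlight the most informative channels as described above.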
The attention mechanism is a powerful method for obtaining a more discriminative utterance-level feature by explicitly selecting frame-level representations that better represent speaker characteristics. Recently, the Transformer model with a self-attention mechanism has become effective in various application fields, including speaker verification, and the Transformer architecture has been extensively explored for this task. TESA [370] is an architecture based on the Transformer's encoder, proposed as a replacement for conventional PLDA-based speaker verification to capture speaker characteristics better. TESA outperforms PLDA on the same dataset by utilizing the next-sentence-prediction task of BERT [109]. Zhu et al. [675] proposed a method to create fixed-dimensional speaker verification representations using a serialized multi-layer multi-head attention mechanism. Unlike other studies that redesign the inner structure of the attention module, their approach strictly follows the original Transformer, providing simple but effective modifications.

5.4. Speaker Diarization

5.4.1. Task Description

Speaker diarization is a critical component in the analysis of multi-speaker audio data, and it addresses the question of "who spoke when." The term "diarize" refers to the process of making a note or keeping a record of events. A traditional speaker diarization system comprises several crucial components that work together to achieve accurate and efficient speaker diarization. In this section, we discuss the different components of a speaker diarization system (Figure 16) and their role in achieving accurate speaker diarization.

• Acoustic Feature Extraction: In the analysis of multi-speaker speech data, one critical component is the extraction of acoustic features [14, 538]. This process involves extracting features such as pitch, energy, and MFCCs from the audio signal. These acoustic features play a crucial role in identifying different speakers by analyzing their unique characteristics.

• Segmentation: Segmentation is a crucial component in the analysis of multi-speaker audio data, where the audio signal is divided into smaller segments based on the silence periods between speakers [14, 538]. This process helps in reducing the complexity of the problem and makes it easier to identify different speakers in smaller segments.

• Speaker Embedding Extraction: This process involves obtaining a low-dimensional representation of each speaker's voice, commonly referred to as a speaker embedding. This is achieved by passing the acoustic features extracted from the speech signal through a deep neural network, such as a CNN or RNN [508].

• Clustering: In this component, the extracted speaker embeddings are clustered based on similarity, and each cluster represents a different speaker [14, 538]. This process commonly uses unsupervised clustering algorithms, such as k-means clustering.

• Speaker Classification: In this component, the speaker embeddings are classified into different speaker identities using a supervised classification algorithm, such as an SVM or MLP [14, 538].

• Re-segmentation: This component is responsible for refining the initial segmentation by adjusting the segment boundaries based on the classification results. It helps in improving the accuracy of speaker diarization by reducing the errors made during the initial segmentation.

Various studies focus on traditional speaker diarization systems [14, 538]. This paper reviews the recent efforts toward deep learning-based speaker diarization techniques.
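Putting the components of Figure 16 together, a bare-bones clustering-based diarization pipeline can be sketched as below; vad and embed stand in for the voice-activity-detection model and speaker-embedding extractor discussed in this survey, and the window length, hop, and clustering threshold are illustrative.

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    def diarize(wave, sr, vad, embed, win=1.5, hop=0.75, threshold=1.0):
        # 1) Voice activity detection -> speech regions (start, end) in seconds.
        segments = []
        for start, end in vad(wave, sr):
            t = start
            while t < end:                                  # 2) uniform sub-segmentation
                segments.append((t, min(t + win, end)))
                t += hop
        # 3) One speaker embedding per segment, length-normalized.
        X = np.stack([embed(wave[int(s * sr):int(e * sr)], sr) for s, e in segments])
        X = X / np.linalg.norm(X, axis=1, keepdims=True)
        # 4) Unsupervised clustering; each cluster id becomes a speaker label.
        labels = AgglomerativeClustering(
            n_clusters=None, distance_threshold=threshold).fit_predict(X)
        return [(s, e, int(lab)) for (s, e), lab in zip(segments, labels)]

A re-segmentation pass (for example the VB-HMM or TS-VAD approaches discussed below) would then refine the boundaries produced by this first clustering stage.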

Mehrish et al.: Preprint submitted to Elsevier Page 37 of 72


A Review of Deep Learning Techniques for Speech Processing

Figure 16: Speaker diarization system diagram showcasing the process of identifying and differentiating multiple speakers in an audio recording using various techniques such as VAD, segmentation, clustering and re-segmentation.

5.4.2. Dataset

• NIST SRE 2000 (Disk-8) or CALLHOME dataset: The NIST SRE 2000 (Disk-8) corpus, also referred to as the CALLHOME dataset, is a frequently utilized resource for speaker diarization in contemporary research papers. Originally released in 2000, this dataset comprises conversational telephone speech (CTS) collected from diverse speakers representing a wide range of ages, genders, and dialects. It includes 500 sessions of multilingual telephonic speech, each containing two to seven speakers, with two primary speakers in each conversation. The dataset covers various topics, including personal and familial relationships, work, education, and leisure activities. The audio recordings were obtained using a single microphone and had a sampling rate of 8 kHz, with 16-bit linear quantization.

• Directions into Heterogeneous Audio Research (DIHARD) Challenge and dataset: The DIHARD Challenge, organized by the National Institute of Standards and Technology (NIST), aims to enhance the accuracy of speech recognition and diarization in challenging acoustic environments, such as crowded spaces, distant microphones, and reverberant rooms. The challenge comprises tasks requiring advanced machine-learning techniques, including speaker diarization, recognition, and speech activity detection. The DIHARD dataset used in the challenge comprises over 50 hours of speech from more than 500 speakers, gathered from diverse sources like meetings, broadcast news, and telephone conversations. These recordings feature various acoustic challenges, such as overlapping speech, background noise, and distant or reverberant speech, captured through different microphone setups. To aid in the evaluation process, the dataset has been divided into separate development and evaluation sets. The assessment metrics used to gauge performance include diarization error rate (DER), as well as accuracy in speaker verification, identification, and speech activity detection.

• Augmented Multi-party Interaction (AMI) database: The AMI database is a collection of audio and video recordings that capture real-world multi-party conversations in office environments. The database was developed as part of the AMI project, which aimed to develop technology for automatically analyzing multi-party meetings. The database contains over 100 hours of audio and video recordings of meetings involving four to seven participants, totaling 112 meetings. The meetings were held in multiple offices and were designed to reflect the kinds of discussions that take place in typical business meetings. The audio recordings were captured using close-talk microphones placed on each participant and additional microphones placed in the room to capture ambient sound. The video recordings were captured using multiple cameras placed around the room. In addition to the audio and video recordings, the database also includes annotations that provide additional information about the meetings, including speaker identities, speech transcriptions, and information about the meeting structure (e.g., turn-taking patterns). The AMI database has been used extensively in research on automatic speech recognition, speaker diarization, and other related speech and language processing topics.

• VoxSRC Challenge and VoxConverse corpus: The VoxCeleb Speaker Recognition Challenge (VoxSRC) is an annual competition designed to assess the capabilities of speaker recognition systems in identifying speakers from speech recorded in real-world environments. The challenge provides participants with a dataset of audio and visual recordings of interviews, news shows, and talk shows featuring famous individuals. The VoxSRC encompasses several tracks, including speaker diarization, and comprises a development set (20.3 hours, 216 recordings) and a test set (53.5 hours, 310 recordings). Recordings in the dataset may feature between one and 21 speakers, with a diverse range of ambient noises, such as background music and laughter.


To facilitate the speaker diarization track of the VoxSRC-21 and VoxSRC-22 competitions, VoxConverse, an audio-visual diarization dataset containing multi-speaker clips of human speech sourced from YouTube videos, is available, with additional details provided on the project website (https://www.robots.ox.ac.uk/~vgg/data/voxconverse/).

• LibriCSS: The LibriCSS corpus is a valuable resource for researchers studying speech separation, recognition, and speaker diarization. The corpus comprises 10 hours of multichannel recordings captured using a 7-channel microphone array in a real meeting room. The audio was played from the LibriSpeech corpus, and each of the ten sessions was subdivided into six 10-minute mini-sessions. Each mini-session contained audio from eight speakers and was designed to have different overlap ratios ranging from 0% to 40%. To make research easier, the corpus includes baseline systems for speech separation and Automatic Speech Recognition (ASR), as well as a baseline system that integrates speech separation, speaker diarization, and ASR. These baseline systems have already been developed and made available to researchers.

• Rich Transcription Evaluation Series: The Rich Transcription Evaluation Series dataset is a collection of speech data used for speaker diarization evaluation. The Rich Transcription Fall 2003 Evaluation (RT-03F) was the first evaluation in the series focused on "Who Said What" tasks. The dataset has been used in subsequent evaluations, including the Second DIHARD Diarization Challenge, which used the Jaccard index to compute the JER (Jaccard Error Rate) for each pair of segmentations. The dataset is essential for data-driven spoken language processing methods and calculates speaker diarization accuracy at the utterance level. It includes rules, evaluation methods, and baseline systems to promote reproducible research in the field, and it has been used in various speaker diarization systems and their subtasks in the context of broadcast news and CTS data.

• CHiME-5/6 challenge and dataset: The CHiME-5/6 challenge is a speech processing challenge focusing on distant multi-microphone conversational speech diarization and recognition in everyday home environments. The challenge provides a dataset of recordings from everyday home environments, including dinner recordings originally collected for and exposed during the CHiME-5 challenge. The dataset is designed to be representative of natural conversational speech. The challenge features two audio input conditions: single-channel and multichannel. Participants are provided with baseline systems for speech enhancement, speech activity detection (SAD), and diarization, as well as results obtained with these systems for all tracks. The challenge aims to improve the robustness of diarization systems to variations in recording equipment, noise conditions, and conversational domains.

• AMI dataset: The AMI database is a comprehensive collection of 100 hours of recordings sourced from 171 meeting sessions held across various locations. It features two distinct audio sources – one recorded using lapel microphones for individual speakers and the other using omnidirectional microphone arrays placed on the table. It is an ideal dataset for evaluating speaker diarization systems integrated with the ASR module. AMI's value proposition is further enhanced by providing forced alignment data, which captures timings at the word and phoneme levels, along with speaker labeling. Finally, it is worth noting that each meeting session involves a small group of three to five speakers.

5.4.3. Models

Speaker diarization has been a subject of research in the field of audio processing, with the goal of separating speakers in an audio recording. In recent years, deep learning has emerged as a powerful technique for speaker diarization, leading to significant advancements in this field. In this article, we explore some of the recent developments in deep learning architectures for speaker diarization, focusing on the different modules of speaker diarization as outlined in Figure 16. Through this discussion, we highlight major advancements in each module.

• Segmentation and clustering: Speaker diarization systems typically use a range of techniques for segmenting speech, such as identifying speaker change, uniform speaker segmentation, ASR-based word segmentation, and supervised speaker turn detection. However, each approach has its own benefits and drawbacks. Uniform speaker segmentation involves dividing speech into segments of equal length, which can be difficult to optimize to capture speaker turn boundaries and include enough speaker information. ASR-based word segmentation identifies word boundaries using automatic speech recognition, but the resulting segments may be too brief to provide adequate speaker information. Supervised speaker turn detection, on the other hand, involves a specialized model that can accurately identify speaker turn timestamps. While this method can achieve high accuracy, it requires labeled data for training. These techniques have been widely discussed in previous research, and choosing the appropriate one depends on the specific requirements of the application.

– The authors in [98] propose a real-time speaker diarization system that combines incremental clustering and local diarization applied to a rolling window of speech data and is designed to handle overlapping speech segments. The proposed pipeline utilizes end-to-end overlap-aware segmentation to detect and separate overlapping speakers.

– In another related work, the authors in [643] introduce a novel speaker diarization system with a generalized neural speaker clustering module as the backbone.


– In a recent study conducted by Park et al. [417], a new framework for spectral clustering is proposed that allows for automatic parameter tuning of the clustering algorithm in the context of speaker diarization. The proposed technique utilizes normalized maximum eigengap (NME) values to determine the number of clusters and the threshold parameters for each row in an affinity matrix during spectral clustering (a simplified sketch of the eigengap idea follows this list). The authors demonstrated that their method outperformed existing state-of-the-art methods on two different datasets for speaker diarization.

– The Bayesian HMM clustering of x-vector sequences (VBx) diarization approach, which clusters x-vectors using a Bayesian hidden Markov model (BHMM) [285], combined with a ResNet101 (He et al. [176]) x-vector extractor, achieves superior results on the CALLHOME [111], AMI [53], and DIHARD II [474] datasets.

• Speaker Embedding Extraction and Classification:

– Attentive Aggregation for Speaker Diarization [278]: This approach uses an attention mechanism to aggregate embeddings from multiple frames and generate speaker embeddings. The speaker embeddings are then used for clustering to identify speaker segments.

– End-to-End Speaker Diarization with Self-Attention [145]: This method uses a self-attention mechanism to capture the correlations between the input frames and generates embeddings for each frame. The embeddings are then used for clustering to identify speaker segments.

– Wang et al. [577] present an innovative method for measuring similarity between speaker embeddings in speaker diarization using neural networks. The approach incorporates past and future contexts and uses a segmental pooling strategy. Furthermore, the speaker embedding network and the similarity measurement model are jointly trained. The paper extends this framework to target-speaker voice activity detection (TS-VAD) [372]. The proposed method effectively learns the similarity between speaker embeddings by considering both past and future contexts.

– Time-Depth Separable Convolutions for Speaker Diarization [266]: This approach uses time-depth separable convolutions to generate embeddings for each frame, which are then used for clustering to identify speaker segments. The method is computationally efficient and achieves state-of-the-art performance on several benchmark datasets.

• Re-segmentation:

– Numerous studies in this field centre around developing a re-segmentation strategy for diarization systems that can effectively handle both voice activity and overlapped speech detection. This approach can also serve as a post-processing step to identify and assign overlapped speech regions accurately. Notable examples of such works include those by Bullock et al. [47] and Bredin and Laurent [45].

• End-to-End Neural Diarization: In addition to the above work, end-to-end speaker diarization systems have gained the attention of the research community due to their ability to handle speaker overlaps and their optimization to minimize diarization errors directly. In one such work, the authors propose end-to-end neural speaker diarization that does not rely on clustering and instead uses a self-attention-based neural network to directly output the joint speech activities of all speakers for each segment [145]. Following this trend, several other works propose enhanced architectures based on self-attention [324, 630].
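The eigengap idea behind the auto-tuning spectral clustering of Park et al. [417] can be illustrated with the simplified sketch below, which estimates the number of speakers from the eigenvalues of the normalized graph Laplacian of a segment-affinity matrix; the full NME criterion, which also tunes the affinity-pruning threshold, is omitted here.

    import numpy as np

    def estimate_num_speakers(affinity, max_speakers=8):
        # affinity: symmetric matrix of similarities between segment embeddings.
        A = (affinity + affinity.T) / 2.0
        np.fill_diagonal(A, 0.0)
        d = A.sum(axis=1)
        d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-10))
        # Symmetric normalized Laplacian: I - D^(-1/2) A D^(-1/2)
        L = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
        eigvals = np.sort(np.linalg.eigvalsh(L))
        gaps = np.diff(eigvals[: max_speakers + 1])
        # Number of small eigenvalues before the largest gap = estimated speaker count.
        return int(np.argmax(gaps)) + 1

The corresponding eigenvectors can then be clustered with k-means (standard spectral clustering) using the estimated speaker count.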
5.5. Speech-to-speech translation

5.5.1. Task Description

Speech-to-text translation (ST) is the process of converting spoken language from one language to another in text form. Traditionally, this has been achieved using a cascaded structure that incorporates automatic speech recognition (ASR) and machine translation (MT) components. However, a more recent end-to-end (E2E) method [524, 480, 639, 62, 166, 669, 15] has gained popularity due to its ability to eliminate issues with error propagation and high latency associated with cascaded methods [518, 63]. The E2E method uses an audio encoder to analyze audio signals and a text decoder to generate translated text.

One notable advantage of ST systems is that they allow for more natural and fluent communication than other language translation methods. By translating speech in real time, ST systems can capture the subtleties of speech, including tone, intonation, and rhythm, which are essential for effective communication. Developing ST systems is a highly intricate process that involves integrating various technologies such as speech recognition, natural language processing, and machine translation. One significant obstacle in ST is the variation in accents and dialects across different languages, which can significantly impact the accuracy of the translation.

5.5.2. Dataset

There are numerous datasets available for the end-to-end speech translation task, with some of the most widely used ones being MuST-C [56], IWSLT [483], and CoVoST 2 [564]. These datasets cover a variety of languages, including English, German, Spanish, French, Italian, Dutch, Portuguese, Romanian, Arabic, Chinese, Japanese, Korean, and Russian. For instance, TED-LIUM [470] is a suitable dataset for speech-to-text, text-to-speech, and speech-to-speech translation tasks, as it contains transcriptions and audio recordings of TED talks in English, French, German, Italian, and Spanish. Another open-source dataset is Common Voice, which covers several languages, including English, French, German, Italian, and Spanish.


Additionally, VoxForge (http://www.voxforge.org/) is designed for acoustic model training and includes speech recordings and transcriptions in several languages, including English, French, German, Italian, and Spanish. LibriSpeech [412] is a dataset of spoken English specifically designed for speech recognition and speech-to-text translation tasks. Lastly, How2 [124] is a multimodal machine translation dataset that includes speech recordings, text transcriptions, and video and image data, covering English, German, Italian, and Spanish. These datasets have been instrumental in training state-of-the-art speech-to-speech translation models and will continue to play a crucial role in further advancing the field.

5.5.3. Models

End-to-end speech translation models are a promising approach for direct speech translation. These models use a single sequence-to-sequence model for speech-to-text translation and then text-to-speech translation. In 2017, researchers demonstrated that end-to-end models outperform cascade models [3]. One study published in 2019 provides an overview of different end-to-end architectures and the usage of an additional connectionist temporal classification (CTC) loss for better convergence [27], and compares different end-to-end architectures for speech-to-text translation. In 2019, Google introduced Translatotron [219], an end-to-end speech-to-speech translation system. Translatotron uses a single sequence-to-sequence model for speech-to-text translation and then text-to-speech translation. No transcripts or other intermediate text representations are used during inference. The system was validated by measuring the BLEU score, computed with text transcribed by a speech recognition system. Though the results lag behind a conventional cascade system, the feasibility of end-to-end direct speech-to-speech translation was demonstrated [219].

In a recent publication from 2020, researchers presented a study on an end-to-end speech translation system that incorporates pre-trained models such as Wav2Vec 2.0 and mBART, along with coupling modules between the encoder and decoder, and introduces an efficient fine-tuning technique that selectively trains only 20% of the total parameters [622]. The system developed by the UPC Machine Translation group participated in the IWSLT 2021 offline speech translation task, which aimed to develop a system capable of translating English audio recordings from TED talks into German text.

E2E ST is often improved by pretraining the encoder and/or decoder with transcripts from speech recognition or text translation tasks [110, 563, 639, 603]. Consequently, this has become the standard approach used in various toolkits [214, 563, 660, 669]. However, transcripts are not always available, and the significance of pretraining for E2E ST is rarely studied. Zhang et al. [638] explored the effectiveness of E2E ST trained solely on speech-translation pairs and proposed an algorithm for training from scratch. The proposed system outperforms previous studies on four benchmarks covering 23 languages without pretraining. The paper also discusses neural acoustic feature modeling, which extracts acoustic features directly from raw speech signals to simplify inductive biases and enhance speech description.
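The auxiliary CTC loss mentioned above is available off the shelf in PyTorch; the snippet below shows the shapes involved when attaching it to encoder outputs alongside an attention-based decoder. All sizes are illustrative, and the 0.3 weight is only a commonly used setting, not one prescribed by the cited works.

    import torch

    T, N, C, S = 120, 4, 32, 20     # encoder frames, batch, vocab (incl. blank), target length
    log_probs = torch.randn(T, N, C).log_softmax(dim=-1)   # frame-level encoder predictions
    targets = torch.randint(1, C, (N, S))                  # index 0 is reserved for the blank
    input_lengths = torch.full((N,), T, dtype=torch.long)
    target_lengths = torch.full((N,), S, dtype=torch.long)

    ctc = torch.nn.CTCLoss(blank=0, zero_infinity=True)
    ctc_loss = ctc(log_probs, targets, input_lengths, target_lengths)
    # total_loss = attention_loss + 0.3 * ctc_loss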
5.6. Speech enhancement

5.6.1. Task Description

In situations where ambient noise is present, speech recognition systems can encounter difficulty in correctly interpreting spoken language signals, resulting in reduced performance [123]. One possible solution to address this issue is the development of speech enhancement systems that can eliminate noise and other types of signal distortion from spoken language, thereby improving signal quality. These systems are frequently implemented as a preprocessing step to enhance the accuracy of speech recognition and can serve as an effective approach for enhancing the performance of ASR systems in noisy environments. This section will delve into the significance of speech enhancement technology in boosting the accuracy of speech recognition.

5.6.2. Dataset

One popular dataset for speech enhancement tasks is AISHELL-4, which comprises authentic Mandarin speech recordings captured during conferences using an 8-channel circular microphone array. In accordance with [144], AISHELL-4 is composed of 211 meeting sessions, each featuring 4 to 8 speakers, for a total of 120 hours of content. This dataset is of great value for research into multi-speaker processing owing to its realistic acoustics and various speech qualities, including speaker diarization and speech recognition.

Another popular dataset used for speech enhancement is the dataset from the Deep Noise Suppression (DNS) challenge [459], a large-scale collection of noisy speech signals and their corresponding clean speech signals. The DNS dataset contains over 10,000 hours of noisy speech signals and over 1,000 hours of clean speech signals, making it useful for training deep learning models for speech enhancement. The Voice Bank Corpus (VCTK) is another dataset containing speech recordings from 109 speakers, each recording approximately 400 sentences. The dataset contains clean and noisy speech recordings, making it useful for training speech enhancement models. These datasets provide realistic acoustics, rich natural speech characteristics, and large-scale noisy and clean speech signals, making them useful for training deep learning models.

5.6.3. Models

Several classical algorithms have been reported in the literature for speech enhancement, including spectral subtraction [41], Wiener and Kalman filtering [319, 482], MMSE estimation [128], comb filtering [222], subspace methods [171], and phase spectrum compensation [409]. However, classical algorithms such as spectral subtraction and Wiener filtering approach the problem in the spectral domain and are restricted to stationary or quasi-stationary noise.
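As a point of reference for these classical methods, a bare-bones magnitude spectral subtraction can be written as follows; it assumes the first few frames contain only noise, and the frame sizes and spectral floor are illustrative (practical systems use smoothed noise trackers and more careful flooring).

    import numpy as np
    from scipy.signal import stft, istft

    def spectral_subtraction(noisy, sr, noise_frames=10, floor=0.02):
        f, t, X = stft(noisy, fs=sr, nperseg=512, noverlap=384)
        mag, phase = np.abs(X), np.angle(X)
        # Estimate the noise magnitude from the leading (assumed non-speech) frames.
        noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)
        # Subtract and apply a spectral floor to limit musical noise.
        clean_mag = np.maximum(mag - noise_mag, floor * mag)
        _, enhanced = istft(clean_mag * np.exp(1j * phase), fs=sr,
                            nperseg=512, noverlap=384)
        return enhanced

Because the noise estimate is fixed, the method degrades as soon as the noise becomes non-stationary, which is precisely the limitation that the neural approaches below address.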


Table 10
Performance of different speech enhancement algorithms on the Deep Noise Suppression
(DNS) Challenge dataset. The table showcases improvements in PESQ-WB, PESQ-NB,
SI-SDR-WB, and SI-SDR-NB metrics, and identifies the top-performing methods in each
category.
Model PESQ-WB PESQ-NB SI-SDR-WB SI-SDR-NB Architecture
FRCRN [664] 3.23 - - - U-Net + CRN
Sudo rm -rf [543] 2.95 - 19.7 - UConvBlock + CNN
DCTCRN-P [311] 2.82 - - - CNN
PoCoNet [216] 2.7885 - - - -
FullSubNet [172] 2.777 3.305 17.29 - LSTM
RNN-Modulation [559] 2.75 - - - GRU
Conv-TasNet-SNR [271] 2.73 - - - CNN
Sudo rm-rf [542] 2.69 - 18.6 - UConvBlock + CNN
RemixIT [543] 2.34 - 16.0 - UConvBlock
SN-Net [668] - 3.39 - 19.52 CNN
DCCRN-E-Aug [202] - 3.214 - - CNN + LSTM
DTLN [592] - 3.04 16.34 - LSTM
DCCRN-E [202] - 3.04 - - CNN + LSTM

Neural network-based approaches, inspired by other areas such as computer vision [188, 146, 10] and generative adversarial networks [596, 321, 471, 142], or developed for general audio processing tasks [588, 157], have outperformed the classical approaches. Various neural network models based on different architectures, including fully connected neural networks [606], deep denoising autoencoders [346], CNNs [143], LSTMs [77], and Transformers [263], have effectively handled diverse noisy conditions.

Diffusion-based models have also shown promising results for speech enhancement [298, 623, 349] and have led to the development of novel speech enhancement algorithms such as the Conditional Diffusion Probabilistic Model (CDiffuSE), which incorporates characteristics of the observed noisy speech signal into the diffusion and reverse processes [349]. CDiffuSE is a generalized formulation of the diffusion probabilistic model that can adapt to non-Gaussian real noises in the estimated speech signal. Another diffusion-based model for speech enhancement is StoRM [298], which stands for Stochastic Regeneration Model. It uses a predictive model to remove vocalizing and breathing artifacts while producing high-quality samples using a diffusion process, even in adverse conditions. StoRM has shown great ability at bridging the performance gap between predictive and generative approaches for speech enhancement. Furthermore, the authors in [623] propose a cold diffusion process, an advanced iterative variant of the diffusion process, to recover clean speech from noisy speech; according to the authors, it can be utilized to restore high-quality samples from arbitrary degradations. Table 10 summarizes the performance of different speech enhancement algorithms on the Deep Noise Suppression (DNS) Challenge dataset using different metrics.
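Of the metrics reported in Table 10, SI-SDR is simple enough to compute directly from its definition; the sketch below follows the usual formulation (mean removal, projection of the estimate onto the reference, ratio of target to residual energy in dB).

    import numpy as np

    def si_sdr(estimate, reference, eps=1e-8):
        estimate = estimate - estimate.mean()
        reference = reference - reference.mean()
        # Optimal scaling of the reference; this is what makes the measure scale-invariant.
        alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
        target = alpha * reference
        residual = estimate - target
        return 10.0 * np.log10((np.sum(target ** 2) + eps) / (np.sum(residual ** 2) + eps))

PESQ, by contrast, is defined by ITU-T P.862 and is normally obtained from a reference implementation rather than re-derived.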
5.7. Audio Super Resolution

5.7.1. Task Description

Audio super-resolution is a technique that involves predicting the missing high-resolution components of low-resolution audio signals. Achieving this task can be difficult due to the continuous nature of audio signals. Current methods typically approach super-resolution by treating audio as discrete data and focusing on fixed scale factors. In order to accomplish audio super-resolution, deep neural networks are trained using pairs of low- and high-quality audio examples. During testing, the model predicts missing samples within a low-resolution signal. Some recent deep network approaches have shown promise by framing the problem as a regression issue in either the time or frequency domain [320]. These methods have been able to achieve impressive results.

5.7.2. Datasets

This section provides an overview of the diverse datasets utilized in the Audio Super Resolution literature. One of the most frequently used datasets is MUSDB18, specifically designed for music source separation and enhancement. This dataset encompasses more than 150 songs with distinct tracks for individual instruments. Another prominent dataset is UrbanSound8K, which comprises over 8,000 environmental sound files collected from 10 different categories, making it ideal for evaluating Audio Super Resolution algorithms in noisy environments. Furthermore, the VoiceBank dataset is another essential resource for evaluating Audio Super Resolution systems, comprising over 10,000 speech recordings from five distinct speakers. This dataset offers a rich source of information for assessing speech processing systems, including Audio Super Resolution.


Another dataset, LibriSpeech, features more than 1,000 hours of spoken words from several books and speakers, making it valuable for evaluating Audio Super Resolution algorithms that enhance the quality of spoken words. Finally, the TED-LIUM dataset, which includes over 140 hours of speech recordings from various speakers giving TED talks, provides a real-world setting for evaluating Audio Super Resolution algorithms for speech enhancement. By using these datasets, researchers can evaluate Audio Super Resolution systems for a wide range of audio signals and improve the generalizability of these algorithms for real-world scenarios.

5.7.3. Models

Audio super-resolution has been extensively explored using deep learning architectures [455, 624, 320, 290, 168, 40, 8, 393, 253, 333]. One notable paper by Rakotonirina [455] proposes a novel network architecture that integrates convolution and self-attention mechanisms for audio super-resolution. Specifically, they use Attention-based Feature-Wise Linear Modulation (AFiLM) [455] to modulate the activations of the convolutional model. In another recent work by Yoneyama et al. [624], the super-resolution task is decomposed into domain adaptation and resampling processes to handle acoustic mismatch in unpaired low- and high-resolution signals; to address this, they jointly optimize the two processes within the CycleGAN framework.

Moreover, the Time-Frequency Network (TFNet) [320] is a deep network that achieves promising results by modeling the task as a regression problem in either the time or frequency domain; to further enhance audio super-resolution, the paper proposes a time-frequency network that combines time- and frequency-domain information. Finally, recent advancements in diffusion models have introduced new approaches to neural audio upsampling. Specifically, Lee and Han [290] and Han and Lee [168] propose the NU-Wave 1 and 2 diffusion probabilistic models, respectively, which can produce high-quality waveforms with a sampling rate of 48 kHz from coarse 16 kHz or 24 kHz inputs. These models are a promising direction for improving audio super-resolution.
convolutional model. In another recent work by Yoneyama Recent advances in deep learning have greatly improved
et al. [624], the super-resolution task is decomposed into do- the performance of voice activity detection (VAD), partic-
main adaptation and resampling processes to handle acoustic ularly in noisy environments [464, 380]. To further im-
mismatch in unpaired low- and high-resolution signals. To prove VAD accuracy, researchers have explored various deep
address this, they jointly optimize the two processes within learning architectures, including NAS-VAD [464] and self-
the CycleGAN framework. attentive VAD [223]. NAS-VAD employs neural architecture
Moreover, the Time-Frequency Network (TFNet) [320] search to reduce the need for human effort in network de-
proposed a deep network that achieves promising results by sign and has demonstrated superior performance in terms of
modeling the task as a regression problem in either time or fre- AUC and F1-score compared to other models. Similarly, self-
quency domain. To further enhance audio super-resolution, attentive VAD uses a self-attention mechanism to capture
the paper proposes a time-frequency network that combines long-term dependencies in input signals and has also outper-
time and frequency domain information. Finally, recent ad- formed other models on the TIMIT dataset. Additionally, a
vancements in diffusion models have introduced new ap- deep neural network (DNN) system has been proposed for au-
proaches to neural audio upsampling. Specifically, Lee and tomatic speech detection in audio signals [380]. This system
Han [290], and Han and Lee [168] propose NU-Wave 1 and uses MLPs, RNNs, and CNNs, with CNNs delivering the best
2 diffusion probabilistic models, respectively, which can pro- performance. Furthermore, a hybrid acoustic-lexical deep
duce high-quality waveforms with a sampling rate of 48kHz learning approach has been proposed for deception detection,
from coarse 16kHz or 24kHz inputs. These models are a combining both acoustic and lexical features.
promising direction for improving audio super-resolution.
5.9. Speech Quality Assessment
5.8. Voice Activity Detection (VAD) 5.9.1. Task Description
5.8.1. Task Description Speech quality assessment is a crucial process that in-
Due to the increasing sophistication of mobile devices like volves the objective evaluation of speech signals using various
smartphones, speech-controlled applications have become metrics and measures. The primary aim of this assessment is
incredibly popular. These apps offer a hands-free method for to determine the level of intelligibility and comprehensibility
controlling home devices, facilitating telephony, and allow- of speech to a human listener. Although human evaluation is
ing drivers to safely use their vehicle’s infotainment systems considered the gold standard for assessing speech quality, it
while on the go. However, accurately distinguishing between can be time-consuming, expensive, and not scalable. Mean
noise and human speech is critical for these applications to opinion score (MOS) is the most commonly used and reliable
work without interruption. To overcome this issue, Voice method of obtaining human judgments for speech quality es-
Activity Detection (VAD) systems have been created to rec- timation. Accurate speech quality assessment is essential in
ognize speech presence or absence, thus ensuring consistent the development and design of real-world applications such
and effective operation. as ASR, Speech Enhancement, and VoIP.

5.8.2. Datasets 5.9.2. Datasets


Voice activity detection models can be trained and evalu- The speech quality assessment algorithms are evaluated
ated using various datasets, each with unique features. The using several datasets, each with unique characteristics. The
TIMIT Acoustic-Phonetic Continuous Speech Corpus [153]

Mehrish et al.: Preprint submitted to Elsevier Page 43 of 72


A Review of Deep Learning Techniques for Speech Processing

has clean speech recordings and artificially generated de- as phone conversations, meetings, and live events, where var-
graded versions for speech synthesis and quality assessment ious extraneous sounds may contaminate speech. Tradition-
research. The NOIZEUS dataset [203] is designed for eval- ally, speech separation has been studied as a signal-processing
uating noise reduction and speech quality assessment algo- problem, where researchers have focused on developing al-
rithms, with clean speech and artificially degraded versions gorithms to separate sources based on their spectral charac-
containing various types of noise and distortion. The ETSI teristics [635, 558]. However, recent advances in machine
Aurora databases [361] are used for evaluating speech en- learning have led to a new approach that formulates speech
hancement techniques and quality assessment algorithms, separation as a supervised learning problem [181, 587, 352].
containing speech recordings with different types of distor- This approach has seen a significant improvement in per-
tions like acoustic echo and background noise. Furthermore, formance with the advent of deep neural networks, which
for training and validation, the clean speech recordings from can learn complex relationships between input features and
the DNS Challenge [459] can be used along with the noise output sources.
dataset such as FSDK50 [138] for additive noise degradation.
5.10.2. Datasets
5.9.3. Models The WSJ0-2mix dataset comprises mixtures of two Wall
Current objective methods such as Perceptual Evaluation Street Journal corpus (WSJ) speakers. It consists of a train-
of Speech Quality (PESQ) [468] and Perceptual Objective ing set of 30,000 mixtures and a test set of 5000 mixtures,
Listening Quality Assessment (POLQA) [36] for evaluating and it has been widely used to evaluate speech separation
the quality of speech mostly rely on the availability of the algorithms. CHiME-4 is a dataset that contains recordings
corresponding clean reference. These methods fail in real- of multiple speakers in real-world environments, such as a
world scenarios where the ground truth clean reference is living room, a kitchen, and a café and is designed to test algo-
unavailable. In recent years, several attempts to automati- rithms in challenging acoustic environments. TIMIT-2mix is
cally estimate the MOS using neural networks for performing a dataset based on the TIMIT corpus, consisting of mixtures
quality assessment and predicting ratings or scores have at- of two speakers, and includes a training set of 462 mixtures
tracted much attention [516, 406, 55, 118, 119, 57]. These and a test set of 400 mixtures. The dataset provides a more
approaches outperform traditional approaches without the controlled environment than CHiME-4 to test speech separa-
need for a clean reference. However, they lack robustness and tion algorithms. LibriMix is derived from the LibriSpeech
generalization capabilities, limiting their use in real-world corpus and includes mixtures of up to four speakers, with
applications. The authors in [406] explore Deep machine a training set of 100,000 mixtures and a test set of 1,000
listening for Estimating Speech Quality (DESQ) for predict- mixtures, providing a more realistic and challenging envi-
ing the perceived speech quality based on phoneme posterior ronment than WSJ0-2mix. Lastly, the MUSDB18 dataset
probabilities obtained using a deep neural network. contains mixtures of music tracks separated into individual
In recent years, there have been several quality assessment stems, including vocals, drums, bass, and other instruments.
frameworks developed to estimate speech quality, such as It consists of a training set of 100 songs and a test set of 50
NORESQA [369] based on non-matching reference (NMR). songs. Despite not being specifically designed for that pur-
NORESQA takes inspiration from the human ability to assess pose, it has been used as a benchmark for evaluating speech
speech quality even when the content is non-matching. Addi- separation algorithms.
tionally, NORESQA introduces two new metrics - NORESQA-
score, which is based on SI-SDR for speech, and NORESQA- 5.10.3. Models
MOS, which evaluates the Mean Opinion Score (MOS) of a Deep Clustering++ [181], first proposed in 2015, em-
speech recording using non-matching references. A recent ploys deep neural networks to extract features from the input
extension to NORESQA, known as NORESQA-MOS, has signal and cluster similar feature vectors in a latent space
been proposed in [368]. The primary difference between to separate different speakers. The model’s performance is
these frameworks is that while NORESQA estimates speech improved using spectral masking and a permutation invariant
quality using non-matching references through NORESQA- training method. The advantage of this model is its ability
score and NORESQA-MOS, NORESQA-MOS is specifically to handle multiple speakers, but it also has a high computa-
designed to assess the MOS of a given speech recording using tional cost. Chimera++ [587] is another effective model that
NMRs. combines deep clustering with mask-inference networks in a
multi-objective training scheme. The model is trained using a
5.10. Speech Separation multitask learning approach, optimizing speech enhancement
5.10.1. Task Description and speaker identification. Chimera++ can perform speech
Speech separation refers to separating a mixed audio sig- enhancement and speaker identification but has a relatively
nal into its sources, including speech, music, and background long training time.
noise. The problem is often referred to as the cocktail party TasNet v2 [352] employs a convolutional neural network
problem [175], as it mimics the difficulty of listening to a (CNN) to process the input signal and generate a time-frequency
conversation in a noisy room with multiple speakers. This mask for each source. The model is trained using an invariant
problem is particularly relevant in real-world scenarios such permutation training (PIT) method [265], which enables it

Mehrish et al.: Preprint submitted to Elsevier Page 44 of 72


A Review of Deep Learning Techniques for Speech Processing

Table 11
Table comparing the performance of different speech separation methods using SI-SDRi
metrics on various speech separation benchmarks.
Model Architecture WSJ0-2mix WSJ0-3mix WSJ0-5mix Libri2Mix Libri5Mix Libri10Mix Libri20Mix WHAM
Separate And Diffuse [357] Diffusion 23.9 20.9 - 21.5 14.2 9 5.2 -
MossFormer (L) [663] Transformer 22.8 21.2 - - - - - -
MossFormer (M) [663] Transformer 22.5 20.8 - - - - - 17.3
SepFormer [520] Transformer 22.3 19.5 - - - - - -
Sandglasset [283] Transformer + LSTM 21.0 19.5 - - - - - -
Hungarian PIT [120] RNN - - 13.22 - 12.72 7.78 4.26 -
TDANet (L) [308] Transformer + CNN - - - 17.4 - - - 15.2
TDANet [308] Transformer + CNN - - - 16.9 - - - 14.8
Sepit [356] CNN 22.4 20.1 - - 13.7 8.2 - -
Gated DualPathRNN [387] CNN + LSTM 20.12 16.85 10.56 - - - - -
Dual-path RNN [351] LSTM 18.8 - - - - - - -
Conv-Tasnet [353] CNN 15.3 - - - - - - -

to separate multiple sources accurately. TasNet v2 achieves convolutional neural network to extract features from the
state-of-the-art performance in various speech separation input signal and generate a time-frequency mask for each
tasks with high separation accuracy, but its disadvantage is source. Wavesplit achieves impressive performance in var-
its relatively high computational cost. The variant of TasNet ious speech separation tasks. The advantage of this model
based on CNNs is proposed in [353]. The model is called is its high separation accuracy and relatively fast processing
Conv-TasNet and can generate a time-frequency mask for time, but its disadvantage is its relatively high memory usage.
each source to obtain the separated source’s signal. Com- Numerous studies have investigated the application of
pared to previous models, Conv-TasNet has faster processing Transformer architecture in the context of speech separa-
time but lower accuracy. tion. One such study is SepFormer [520], which has yielded
In recent research, encoder-decoder architectures have encouraging outcomes on the WSJ0-2mix and WSJ0-3mix
been explored for effectively separating source signals. One datasets, as evidenced by the data presented in Table 11. Addi-
promising approach is the Hybrid Tasnet architecture [613], tionally, MossFormer [663] is another cutting-edge architec-
which utilizes an encoder to extract features from the input ture that has successfully pushed the boundaries of monaural
signal and a decoder to generate the independent sources. speech separation across multiple speech separation bench-
This hybrid architecture captures both short-term and long- marks. It is worth noting that although both models employ
term dependencies in the input signal, leading to improved attention mechanisms, MossFormer integrates a blend of
separation performance. However, it should be noted that convolutional modules to further amplify its performance.
this model’s higher computational cost should be considered Diffusion models have been proven to be highly effec-
when selecting an appropriate separation method. tive in various machine learning tasks related to computer
Dual-path RNN [351] uses RNN architecture to perform vision, as well as speech-processing tasks. The recent de-
speech separation. The model uses a dual-path structure velopment of DiffSep [484] for speech separation, which is
[351] to capture low-frequency and high-frequency informa- based on score-matching of a stochastic differential equa-
tion in the input signal. Dual-path RNN achieves impressive tion, has shown competitive performance on the VoiceBank-
performance in various speech separation tasks. The advan- DEMAND dataset. Additionally, Separate And Diffuse [357],
tage of this model is its ability to capture low-frequency and another diffusion-based model that utilizes a pretrained dif-
high-frequency information, but its disadvantage is its high fusion model, currently represents the state-of-the-art per-
computational cost. Gated DualPathRNN [387] is a variant formance in various speech separation benchmarks (refer
of Dual-path RNN that employs gated recurrent units (GRUs) to Table 11). These advancements demonstrate the signifi-
to improve the model’s performance. The model uses a gating cant potential of diffusion models in advancing the field of
mechanism to control the flow of information in the recur- machine learning and speech processing.
rent network, allowing it to capture long-term dependencies
in the input signal. Gated DualPathRNN achieves state-of- 5.11. Spoken Language Understanding
the-art performance in various speech separation tasks. The 5.11.1. Task Description
advantage of this model is its ability to capture long-term de- Spoken Language Understanding (SLU) is a rapidly devel-
pendencies, but its disadvantage is its higher computational oping field that brings together speech processing and natural
cost than other models. language processing to help machines comprehend human
Wavesplit [633] employs a Wave-U-Net [519] architec- speech and respond appropriately. The ultimate goal of SLU
ture to perform speech separation. The model uses a fully is to bridge the gap between human and machine understand-

Mehrish et al.: Preprint submitted to Elsevier Page 45 of 72


A Review of Deep Learning Techniques for Speech Processing

ing. Typically, SLU tasks involve identifying the domain or with a diverse set of speakers and background
topic of a spoken utterance, determining the speaker’s intent noise conditions.
or goal in making the utterance, and filling in any relevant – Leroy et al. [300]: This dataset is a federated
slots or variables associated with that intent. For example, learning-based keyword spotting dataset, it is
consider the spoken utterance, "What is the weather like in composed of data from multiple sources that are
San Francisco today?" An SLU system would need to identify trained together without sharing the raw data.
the domain (weather), the intent (obtaining current weather The dataset consists of audio recordings from
information), and the specific slot to be filled (location-San multiple devices and environments, with the goal
Francisco) to generate an appropriate response. By improving of improving the robustness of KS across differ-
SLU capabilities, we can enable more effective communi- ent devices and settings
cation between humans and machines, making interactions
more natural and efficient. – Auto-KWS [570]: This dataset is automatically
Data-driven methods are frequently utilized to achieve generated using TTS approach. The dataset con-
these tasks, employing large datasets to train models capable sists of 1000 keywords spoken by 100 different
of accurately recognizing and interpreting spoken language. synthetic voices, with variations in accent, gen-
Among these methods, machine learning techniques, such der, and age.
as deep neural networks, are widely employed, given their – Speech Commands [589]: This data is a large-
exceptional ability to handle complex and ambiguous speech scale dataset for KS task that consists of over
data. The SLU task may be subdivided into the following 100, 000 spoken commands in English, with each
categories for greater clarity. command belonging to 35 different keywords.
The dataset is specifically designed to be highly
• Keyword Spotting: Keyword Spotting (KS) is a tech- varied and challenging, with a diverse set of speak-
nique used in speech processing to identify specific ers and background noises. It is commonly used
words or phrases within spoken language. It involves as a benchmark dataset for KS research.
analysing audio recordings and detecting instances of
pre-defined keywords or phrases. This technique is • Intent Classification and Slot Filling
commonly used in applications such as voice assis-
tants, where the system needs to recognize specific – ATIS [179]: The Airline Travel Information Sys-
commands or questions from the user. tem (ATIS) dataset is a collection of spoken queries
and responses related to airline travel, such as
• Intent Classification: Intent Classification (IC) is a flight reservations, flight status, and airport in-
spoken language understanding task that involves iden- formation. The dataset is annotated with both
tifying the intent behind a spoken sentence. It is usu- intent labels (e.g. “flight booking”, “flight status
ally implemented as a pipeline process, with a speech inquiry") and slot labels (e.g. depart city, arrival
recognition module followed by text processing that city, date). The ATIS dataset has been used ex-
classifies the intents. However, end-to-end intent clas- tensively as a benchmark for natural language
sification using speech has numerous advantages com- understanding models.
pared to the conventional pipeline approach using AST – SNIPS [101]: SNIPS is a dataset of voice com-
followed by NLP modules. mands designed for building a natural language
• Slot Filling: Slot Filling (SF) is a widely used technique understanding system. It consists of thousands
in Speech Language Understanding (SLU) that enables of examples of spoken requests, each annotated
the extraction of important information, such as names, with the intent of the request (e.g. “play music”,
dates, and locations, from a user’s speech. The process “set an alarm”, etc.). The dataset is widely used
involves identifying the specific pieces of information for training IC and SF models.
that are relevant to the user’s request and placing them – Fluent Speech Commands [350]: It is a dataset
into pre-defined slots. For instance, if a user asks for of voice commands for controlling smart home
the weather in a particular city, the system will identify devices, such as lights, thermostats, and locks.
the city name and fill it into the appropriate slot, thereby The dataset consists of over 1,5000 spoken com-
providing an accurate and relevant response. mands, each labeled with the intended devices
and action (e.g. “turn on the living room lights”,
5.11.2. Dataset “set the thermostat to 72 degrees”). The dataset
• Keyword Spotting Datasets: is designed to have variations in speaker accent,
background noise, and device placement.
– Coucke et al. [100]: This dataset is a speech com-
mand recognition dataset that consists of 105,000 – MIT-Restaurant and MIT-Movie [335]: These
spoken commands in English, with each com- are two datasets created by researchers at MIT
mand being one of 35 keywords. The dataset for training natural language understanding mod-
is designed to be highly varied and challenging, els from restaurant and movie information re-

Mehrish et al.: Preprint submitted to Elsevier Page 46 of 72


A Review of Deep Learning Techniques for Speech Processing

Table 12
Comprehensive performance analysis of various models for Keyword Spotting (KS) and Slot
Filling (SF) tasks, evaluated on two benchmark datasets: Google Speech Commands for
KS and ATIS for SF.
Keyword Spotting on Google Speech Commands (Accuracy % ↑) Slot Filling on ATIS (F1 ↑)

Model Reference Google Speech Commands V1 12 Google Speech Commands V2 12 Google Speech Commands V2 35 Model Reference ATIS

TripletLoss-res15 [560] 98.56 98.37 97.0 CTRAN [451] 0.9846


Wav2KWS [488] 97.9 98.5 97.8 Bi-model with a decoder [581] 0.9689
KWT-3 [37] 97.49 ±0.15 98.56 ±0.07 97.69 ±0.09 Joint BERT [70] 0.961
KWT-1 [37] 97.27 ±0.08 98.43±0.08 97.74 ±0.03 Joint BERT + CRF [70] 0.96
KWT-2 [37] 97.26±0.18 98.08±0.10 96.95±0.14 SF-ID [398] 0.958
Attention RNN [475] 95.6 96.9 93.9 Capsule-NLU [641] 0.952

quests. The dataset contains spoken and text- Despite the remarkable progress made in the field of
based queries, each labeled with the intent of the SLU, accurately comprehending human speech in real-life
request (e.g. “find a nearby Italian restaurant”,” situations continues to pose significant challenges. These
get information about the movie Inception”) and challenges are amplified by the presence of diverse accents,
relevant slot information (e.g. restaurant type, dialects, and linguistic variations. In a notable study, Vanzo
movie name, etc). The datasets are widely used et al. [552] emphasize the significance of SLU in facilitat-
for benchmarking natural language understand- ing effective human-robot interaction, particularly within the
ing models. context of house service robots. The authors delve into the
specific obstacles encountered in this domain, which encom-
5.11.3. Models pass handling noisy and unstructured speech, accommodating
• Keyword Spotting: The state-of-the-art techniques various accents and speech variations, and deciphering com-
for keyword spotting in speech involve deep learn- plex commands involving multiple actions. To overcome
ing models, such as CNNs [469] and transformers these obstacles, ongoing research endeavors are dedicated to
[37]. Wav2Keyword is one of the popular model based developing innovative solutions that enhance the precision
on Wav2Vec2.0 architecture [488] and have achieved and efficacy of SLU systems. By addressing these challenges,
SOTA results on Speech Commands data V1 and V21. the aim is to enable more robust and accurate speech compre-
Another model that achieves SOTA classification accu- hension in diverse real-life scenarios.
racy on the Google Speech commands dataset is Key- Recent studies, including the comprehensive analysis of
word Transformer (KWT) [488]. KWT uses a trans- the performance of different models and techniques for Key-
former model and achieves 98.6% and 97.7% accuracy word Spotting (KS) and Slot Filling (SF) tasks on Google
on the 12 and 35-word tasks, respectively. KWT also Speech Commands and ATIS benchmark datasets (Table 12),
has low latency and can be used on mobile devices. have furnished valuable insights into the strengths and lim-
itations of such approaches in SLU. Capitalizing on these
• The DIET architecture, as introduced in [48], is a
findings and leveraging the latest advances in deep learning
transformer-based multitask model that addresses in-
and speech recognition could help us continue to expand the
tent classification and entity recognition simultane-
frontiers of spoken language understanding and drive further
ously. DIET allows for the seamless integration of
innovation in this domain.
various pre-trained embeddings such as BERT, GloVe,
and ConveRT. Results from experiments show that 5.12. Audio/visual multimodal speech processing
DIET outperforms fine-tuned BERT and has the added The process of speech perception in humans is intricate
benefit of being six times faster to train. and involves multiple sensory modalities, including auditory
• Chang et al. [59] investigated the effectiveness of prompt and visual cues. The generation of speech sounds involves
tuning on the GSLM architecture and showcased its articulators such as the tongue, lips, and teeth, whose move-
competitiveness on various SLU tasks, such as KS, ments are critical for producing different speech sounds and
IC, and SF. Impressively, this approach achieves com- visible to others. The importance of visual cues becomes
parable results with fewer trainable parameters than more pronounced for individuals with hearing impairments
full fine-tuning. Despite being a popular and effective who depend on lip-reading to comprehend spoken language,
technique in numerous NLP tasks, prompt tuning has while individuals with normal hearing can also benefit from
not received much attention in the speech community. visual cues in noisy environments.
Additionally, other researchers have pursued a different When investigating language comprehension and commu-
path by utilizing pre-trained wav2vec2.0 and different nication, it is essential to consider both auditory and visual
adapters [315] to attain state-of-the-art outcomes. information, as studies have demonstrated that visual infor-
mation can assist in distinguishing between acoustically sim-

Mehrish et al.: Preprint submitted to Elsevier Page 47 of 72


A Review of Deep Learning Techniques for Speech Processing

ilar sounds that differ in articulatory characteristics. A com- generation holds immense potential for various appli-
prehensive understanding of the interaction between these cations, including teleconferencing, creating virtual
sensory modalities can lead to the development of assistive characters with specific facial expressions, and enhanc-
technologies for individuals with hearing impairments and ing speech comprehension. In recent years, signifi-
enhance communication strategies in challenging listening cant advancements have been made in the field of talk-
environments. ing face generation, as evidenced by notable studies
[515, 671, 65, 133, 134].
5.12.1. Task Description
The tasks under audiovisual multimodal processing can 5.12.2. Datasets
be subdivided into the following categories. Several datasets are widely used for audiovisual multi-
modal research, including VoxCeleb, TCD-TIMID [173] , etc.
• Lip-reading: Lip-reading is a remarkable ability that
We briefly discuss some of them in the following section.
allows us to comprehend spoken language from silent
videos. However, it is a challenging task even for hu- • TCD-TIMID [173]: This is an extensive and diverse
mans. Recent advancements in deep learning technol- audiovisual dataset that encompasses both audio and
ogy have enabled the development of neural network- video recordings of 600 distinct sentences spoken by 60
based lip-reading models to accomplish this task with participants. The dataset features a wide range of speak-
high accuracy. These models take silent facial videos ers with different genders, accents, and backgrounds,
as input and produce the corresponding speech audio or making it highly suitable for talker-independent speech
characters as output. The potential applications of au- recognition research. The audio recordings are of ex-
tomatic lip-reading models are vast and diverse, includ- ceptional quality, captured using high-fidelity micro-
ing enabling videoconferencing in noisy environments, phones with a sampling rate of 48kHz. Meanwhile, the
using surveillance videos as long-range listening de- video footage is of 720p resolution and includes depth
vices, and facilitating conversations in noisy social information for every frame
settings. Developing these models could significantly
improve our daily lives. • LipReading in the Wild (LRW) [93]: The LRW is a
comprehensive audiovisual dataset that encompasses
• Audiovisual speech separation: Recent years have wit- 500 distinct words spoken by more than 1000 speakers.
nessed a growing interest in audiovisual speech sep- This dataset has been segmented into distinct training,
aration, driven by the remarkable human capacity to evaluation, and test sets to facilitate efficient research.
selectively focus on a specific sound source amidst Additionally, the LRW-1000 dataset [617] represents
background noise, commonly known as the "cocktail a subset of LRW, featuring a 1000-word vocabulary.
party effect." This phenomenon poses a significant Researchers can benefit from pre-trained weights in-
challenge in computer speech recognition, prompting cluded with this dataset, simplifying the evaluation
the development of automatic speech separation tech- process. Overall, these datasets are highly regarded in
niques aimed at isolating individual speech sources the scientific community for their size and versatility
from complex audio signals. In a noteworthy study by in supporting research related to speech recognition
Ephrat et al. (2018) Ephrat et al. [130], the authors and natural language processing
proposed that audiovisual speech separation surpasses
audio-only approaches by leveraging visual cues from • LRS2 and LRS3 10 : The LRS2 and LRS3 datasets are
a speaker’s face to resolve ambiguity in speech signals. additional examples of audiovisual speech recognition
By integrating visual information, the model’s ability datasets that have been gathered from videos captured
to disentangle overlapping speech signals is enhanced. in real-world settings. Each of these datasets has its
The implications of automatic speech separation extend own distinct train/test split and includes cropped face
across diverse applications, including assistive tech- tracks as well as corresponding audio clips sourced
nologies for individuals with hearing impairments and from British television. Both datasets are considered
head-mounted devices designed to facilitate effective to be of significant value to researchers in the field
communication in noisy meeting scenarios. of speech recognition, particularly those focused on
audiovisual analysis.
• Talking face generation: Generating a realistic talking
face of a target character, synchronized with a given • GRID [97]: This dataset comprises high-fidelity audio
speech and ensuring smooth transitions between fa- and video recordings of more than 1000 sentences
cial images, is the objective of talking face generation. spoken by 34 distinct speakers, including 18 males
This task has garnered substantial interest and poses and 16 females. The sentences were gathered using the
a significant challenge due to the dynamic nature of prompt "put red at G9 now" and are widely employed
facial movements, which depend on both visual infor- in research related to audio-visual speech separation
mation (input face image) and acoustic information and talking face synthesis. The dataset is considered
(input speech audio) to achieve accurate lip-speech 10 https://www.robots.ox.ac.uk/ṽgg/data/lip_reading/lrs2.html
synchronization. Despite its challenges, talking face

Mehrish et al.: Preprint submitted to Elsevier Page 48 of 72


A Review of Deep Learning Techniques for Speech Processing

to be of exceptional quality and is highly sought after are proposed in [10, 379, 146].
in the scientific community. The rise of Deepfake videos on the internet has led to a
surge in demand for creating realistic talking faces for various
5.12.3. Models applications, such as video production, marketing, and en-
In recent years, there has been a remarkable surge in tertainment. Previously, the conventional approach involved
the development of algorithms tailored for multimodal tasks. manipulating 3D meshes to create specific faces, which was
Specifically, significant attention has been devoted to the time-consuming and limited to certain identities. However,
advancement of neural networks for Text-to-Speech (TTS) recent advancements in deep generative models have made
applications [462, 460, 461, 251]. The integration of visual significant progress. For example, DAVS [671] introduced
and auditory modalities through multimodal processing has an end-to-end trainable deep neural network capable of learn-
played a pivotal role in enhancing various tasks relevant to our ing a joint audiovisual representation, which uses adversarial
daily lives. Lip-reading, for instance, has witnessed notable training to disentangle the latent space. Another architecture
progress in recent years, whether accompanied by audio or proposed by ATVGnet [65] consists of an audio transfor-
not. Son et al. have made a significant contribution to this mation network (AT-net) and a visual generation network
field with their hybrid model [513]. Combining convolutional (VG-net) for processing acoustic and visual information, re-
neural networks (CNN), long short-term memory (LSTM) spectively. This method introduced a regression-based dis-
networks, and an attention mechanism, their model captures criminator, a dynamically adjustable pixel-wise loss, and an
correlations between lip videos and audio, enabling accurate attention mechanism. In [674], a novel framework for talking
character generation. Additionally, the authors introduce a face generation was presented, which discovers audiovisual
new dataset called LRS, which facilitates the development of coherence through an asymmetrical mutual information esti-
lip-reading models. mator. Furthermore, the authors in [133] proposed an end-
Another noteworthy model, LiRA [359], focuses on self- to-end approach based on generative adversarial networks
supervised learning for lip-reading. It leverages lip image that use noisy speech for talking face generation. In addition,
sequences and audio waveforms to derive high-level repre- alternative methods based on conditional recurrent adversar-
sentations during the pre-training stage, achieving word-level ial networks and speech-driven talking face generation were
and sentence-level lip-reading capabilities. In the realm of introduced in [515, 134].
capturing human emotions expressed through acoustic sig-
nals, Ephrat et al. [129] propose an innovative model that
frames the task as an acoustic regression problem instead of 6. Advanced Transfer Learning Techniques for
a visual-to-text modeling approach. Their work emphasizes Speech Processing
the advantages of this perspective. Furthermore, Vid2Speech 6.1. Domain Adaptation
[131], a CNN-based model, takes facial image sequences 6.1.1. Task Description
as input and generates corresponding speech audio wave- Domain adaptation is a field that deals with adapting a
forms. It employs a two-tower CNN model that processes fa- model trained on a labeled dataset from a source domain to
cial grayscale images while calculating optical flow between a target domain, where the source domain differs from the
frames. Additionally, other models such as those based on target domain. The goal of domain adaptation is to reduce
mutual information maximization [667] and spatiotemporal the performance gap between the source and target domains
fusion [653] have been proposed for the lip-reading task, fur- by minimizing the difference between their distributions. In
ther expanding the methodologies explored in this domain. speech processing, domain adaptation has various applica-
In an early attempt to develop algorithms for audiovisual tions such as speech recognition [44, 396, 292, 87, 200],
speech separation, the authors of [130] proposed a CNN- speaker verification [600, 76, 578, 645, 184], and speech
based architecture that encodes facial images and speech synthesis [602, 631]. This section explores the use of domain
spectrograms to compute a complex mask for speech sepa- adaptation in these tasks by reviewing recent literature on
ration. Additionally, they introduced the AVspeech dataset the subject. Specifically, we discuss the techniques used in
in this work. AV-CVAE [394] utilizes a conditional VAE domain adaptation, their effectiveness, and the challenges
to detect the lip movements of the speaker and predict sep- that arise when applying them to speech processing.
arated speech. In a deviation from speech signals, [385]
focuses on audiovisual singing separation and employs a two- 6.1.2. Models
stream CNN architecture, Y-Net [374], to process audio and Various techniques have been proposed to adapt a deep
video separately. This work introduces a large dataset of solo learning model for speech processing tasks. An example
singing videos for audiovisual singing separation. The Visu- of a technique is reconstruction-based domain adaptation,
alSpeech [151] architecture takes a face image sequence and which leverages an additional reconstruction task to generate
mixed audio of lip movement as input and predicts a complex a communal representation for all the domains. The Deep
mask. It also proposes a cross-modal embedding space to Reconstruction Classification Network (DRCN) [154] is an
facilitate the correlation of audio and visual modalities. Fi- illustration of such an approach, as it endeavors to address
nally, FaceFilter [94] uses still images as visual information, both tasks concurrently: (i) classification of the source data
and other methods for the audiovisual speech separation task and (ii) reconstruction of the input data. Another technique

Mehrish et al.: Preprint submitted to Elsevier Page 49 of 72


A Review of Deep Learning Techniques for Speech Processing

used in domain adaptation is the domain-adversarial neural 6.2.2. Models


network architecture, which aims to learn domain-invariant In low-resource ASR, meta-learning is used to quickly
features using a gradient reversal layer [51, 654, 574]. adapt unseen target languages by formulating ASR for dif-
Different domain adaptation techniques are successfully ferent languages as different tasks and meta-learning the
applied to different speech processing tasks, such as speaker initialization parameters from many pretraining languages
recognition [313, 200, 44, 396] and verification [76, 75, 645, [192, 503]. The proposed approach, MetaASR [192], sig-
305, 673], where the goal is to verify the identity of a speaker nificantly outperforms the state-of-the-art multitask pretrain-
using their voice. One approach for domain adaptation in ing approach on all target languages with different combi-
speaker verification is to use adversarial domain training to nations of pretraining languages. In speaker verification,
learn speaker-independent features insensitive to variations meta-learning is used to improve the meta-learning training
in the recording environment [75]. for SV by introducing two methods to improve the backbone
Domain adaptation has also been applied to speech recog- embedding network [73]. The proposed methods can ob-
nition [367, 213, 631, 521] to improve speech recognition tain consistent improvements over the existing meta-learning
accuracy in a target domain. One recent approach for domain training framework [279].
adaptation in ASR is prompt-tuning [112], which involves Meta-learning has proven to be a promising approach in
fine-tuning the ASR system on a small amount of data from various speech-related tasks, including low-resource ASR
the new domain. Another approach is to use adapter modules and speaker verification. In addition to these tasks, meta-
for transducer-based speech recognition systems [364, 481], learning has also been applied to few-shot speaker adaptive
which can balance the recognition accuracy of general speech TTS and language-agnostic TTS, demonstrating its potential
and improve recognition on adaptation domains. The Ma- to improve performance across different speech technologies.
chine Speech Chain integrates both end-to-end (E2E) ASR Meta-TTS [208] is an example of a meta-learning model
and neural text-to-speech (TTS) into one circle [631]. This used for a few-shot speaker adaptive TTS. It can synthesize
integration can be used for domain adaptation by fine-tuning high-speaker-similarity speech from a few enrolment samples
the E2E ASR on a small amount of data from the new domain with fewer adaptation steps. Similarly, a language-agnostic
and then using the TTS to generate synthetic speech in the meta-learning approach is proposed in [358] for low-resource
new domain for further training. TTS.
In addition to domain adaptation techniques used in speech
recognition, there has been growing interest in adapting text- 6.3. Parameter-Efficient Transfer Learning
to-speech (TTS) models to specific speakers or domains. This Transfer learning has played a significant role in the re-
research direction is critical, especially in low-resource set- cent progress of speech processing. Fine-tuning pre-trained
tings where collecting sufficient training data can be challeng- large models, such as those trained on LibriSpeech [412] or
ing. Several recent works have proposed different approaches Common Voice [17], has been widely used for transfer learn-
for speaker and domain adaptation in TTS, such as AdaSpeech ing in speech processing. However, fine-tuning all parameters
[66, 609, 599]. for each downstream task can be computationally expensive.
To overcome this challenge, researchers have been exploring
6.2. Meta Learning parameter-efficient transfer learning techniques that optimize
6.2.1. Task Description only a fraction of the model parameters, aiming to improve
Meta-learning is a branch of machine learning that fo- training efficiency. This article investigates these parameter-
cuses on improving the learning algorithms used for tasks efficient transfer learning techniques in speech processing,
such as parameter initialization, optimization strategies, net- evaluates their effectiveness in improving training efficiency
work architecture, and distance metrics. This approach has without sacrificing performance, and discusses the challenges
been demonstrated to facilitate faster fine-tuning, better per- and opportunities associated with these techniques, highlight-
formance convergence, and the ability to train models from ing their potential to advance the field of speech processing.
scratch, which is especially advantageous for speech-processing
tasks. Meta-learning techniques have been employed in var- 6.3.1. Adapters
ious speech-processing tasks, such as low-resource ASR In recent years, retrofitting adapter modules with a few
[192, 215], SV [644], TTS [208] and domain generalization parameters to pre-trained models has emerged as an effective
for speaker recognition [242]. approach in speech processing. This involves optimizing the
Meta-learning has the potential to improve speech pro- adapter modules while keeping the pre-trained parameters
cessing tasks by learning better learning algorithms that can frozen for downstream tasks. Recent studies (Li et al., 2023;
adapt to new tasks and data more efficiently. Meta-learning Liu et al., 2021) [315, 615] have shown that adapters often
can also reduce the cost of model training and fine-tuning, outperform fine-tuning while using only a fraction of the
which is particularly useful for low-resource speech process- total parameters. Different adapter architectures are available,
ing tasks. Further investigation is required to delve into the such as bottleneck adapters (Houlsby et al., 2019)[190], tiny
full potential of meta-learning in speech processing and to attention adapters (Zhao et al., 2022)[662], prefix-tuning
develop more effective meta-learning algorithms for different adapters (Li and Liang, 2021)[314], and LoRA adapters (Hu
speech-processing tasks. et al., 2022)[198], among others Next, we will review the
different approaches for parameter-efficient transfer learning.

Mehrish et al.: Preprint submitted to Elsevier Page 50 of 72


A Review of Deep Learning Techniques for Speech Processing

Add and
LayerNorm Attention
LoRA

Prefix
Adapter Tuning Attention LoRA

Feed Forward FF Up
FF FF
Down UP
Nonlinear LoRA LoRA LoRA
Add and
LayerNorm
FF Down Hidden States
Multi-Head Hidden States
Attention
Bottleneck Multi-Head Attention Multi-Head Attention
Adapter

Figure 17: Transformer architecture and Adapter, Prefix Tuning, and LoRA.

The different approaches are illustrated in Figure 17 and matrices 𝑲 and 𝑽 , while the query matrix 𝑸 remains un-
Figure 18 changed. The resulting matrices are then used for multi-head
attention, where each head of the attention mechanism is
Residual computed as follows:

head𝑖 = Attn(𝑸𝑾𝑄(𝑖) , [𝑷𝐾(𝑖) , 𝑲𝑾𝑄(𝑖) ], [𝑷𝑉(𝑖) , 𝑽 𝑾𝑄(𝑖) ]) (31)


Layer 1D Conv* 1D Conv* 1D Conv*
SE
Norm k=3 k=5 k=3

Convolution Adapter
where Attn(⋅) is scaled dot-product attention given by:

𝑸𝑲 𝑇
Figure 18: The architecture of 1D convolution layer-based Attn(𝑸, 𝑲, 𝑽 ) = softmax( √ )𝑽 (32)
lightweight adapter. 𝑘 is the kernel size of 1D convolution. ∗ 𝑑𝑘
denotes depth-wise convolution.
The attention heads in each layer are modified by prefix tun-
ing, with only the prefix vectors 𝑷 𝐾 and 𝑷 𝑉 being updated
Adapter Tuning. Adapters are a type of neural module that during training. This approach provides greater control over
can be retrofitted onto a pre-trained language model, with the transmission of acoustic information between layers and
significantly fewer parameters than the original model. One effectively activates the pre-trained model’s knowledge.
such type is the bottleneck or standard adapter (Houlsby et
LoRA. LoRA is a novel approach proposed by Hu et al.
al., 2019; Pfeiffer et al., 2020) [189, 425]. The adapter takes
(2021) [199], which aims to approximate weight updates
an input vector ℎ ∈ 𝐑𝑑 and down-projects it to a lower-
in the Transformer by injecting trainable low-rank matri-
dimensional space with dimensionality 𝑚 (where 𝑚 < 𝑑),
ces into its layers. In this method, a pre-trained weight ma-
applies a non-linear function 𝑔(⋅), and then up-projects the
trix 𝑊 ∈ ℝ𝑑×𝑘 is updated by a low-rank decomposition
result back to the original 𝑑-dimensional space. Finally, the
𝑾 + Δ𝑾 = 𝑾 + 𝑾 down𝑾 up, where 𝑾 down ∈ ℝ𝑑×𝑟 ,
output is obtained by adding a residual connection.
𝑾 up ∈ ℝ𝑟×𝑘 are tunable parameters and 𝑟 represents the
𝒉 ← 𝒉 + 𝑔(𝒉𝑾down )𝑾up (30) rank of the decomposition matrices, with 𝑟 < 𝑑. Specifically,
for a given input 𝒙 to the linear projection in the multi-headed
where matrices 𝑾down and 𝑾up are used as down and up attention layer, LoRA modifies the projection output 𝒉 as fol-
projection matrices, respectively, with 𝑾 down having di- lows:
mensions ℝ𝑑×𝑚 and 𝑾up having dimensions ℝ𝑚×𝑑 . Previous
studies have empirically shown that a two-layer feedforward 𝒉 ← 𝒉 + 𝑠 ⋅ 𝒙𝑾down 𝑾up (33)
neural network with a bottleneck is effective. In this work,
we follow the experimental settings outlined in [425] for the In this work, LoRA is integrated into four locations of the
adapter, which is inserted after the feedforward layer of every multi-head attention layer, as illustrated in Figure 17. Thanks
transformer module, as depicted in Figure 17. to its lightweight nature, the pre-trained model can accom-
modate many small modules for different tasks, allowing
Prefix tuning. Recent studies have suggested modifying the for efficient task switching by replacing the modules. Addi-
attention module of the Transformer model to improve its per- tionally, LoRA incurs no inference latency and achieves a
formance in natural language processing tasks. This approach convergence rate that is comparable to that of training the
involves adding learnable vectors to the pre-trained multi- original model, unlike fully fine-tuned models [199].
head attention keys and values at every layer, as depicted in Convolutional Adapter. CNNs have become increasingly
Figure 17. Specifically, two sets of learnable prefix vectors, popular in the field of speech processing due to their ability
𝑷𝑲 and 𝑷𝑽 , are concatenated with the original key and value to learn task-specific information and combine channel-wise

Mehrish et al.: Preprint submitted to Elsevier Page 51 of 72


A Review of Deep Learning Techniques for Speech Processing

Table 13
The study evaluated various parameter-efficient training methods on pre-trained Word2Vec
2.0, including full fine-tuning, on the SURE benchmark. The fraction of trainable parameters
were represented by percentages, with the number of KS task’s trainable parameters given.
Results are reported using weighted-f1 as the metric (w-f1) on MELD, with the best
performance in bold and the second best underlined. To avoid data imbalance, the
researchers opted for using weighted-f1 as the metric. The study cites Li et al. (2023)
[315] as a reference.

SER (acc % / w-f1) ↑ SR (acc %) ↑ ASR (wer) ↓ KS (acc %) ↑


Method #Parameters
ESD MELD ESD VCTK ESD FLEURS LS Speech Command

Fine Tuning 315,703,947 96.53 42.93 99.00 92.36 0.2295 0.135 0.0903 99.08
Adapter 25,467,915 (8.08%) 94.07 41.58 98.87 96.32 0.2290 0.214 0.2425 99.19
Prefix Tuning 1,739,787 (0.55%) 90.00 44.21 99.73 98.49 0.2255 0.166 0.1022 98.86
LoRA 3,804,171 (1.20%) 90.00 47.05 99.00 97.61 0.2428 0.149 0.1014 98.28
ConvAdapter 2,952,539 (0.94%) 91.87 46.30 99.60 97.61 0.2456 0.2062 0.2958 98.99

Table 14
Results on SURE benchmark for full fine-tuning and other parameter-efficient training
methods on pre-trained Wav2Vec 2.0 for IC and PR tasks on FS: Fluent Speech [350] and
LS: LibriSpeech [412] datasets, respectively.

IC PR SF
Method FS LS SNIPS
#Parameters #Parameters #Parameters
ACC% ↑ PER ↓ F1 % ↑ CER ↓
Fine-Tuning 315707288 99.60 311304394 0.0577 311375119 93.89 0.1411
Adapter 25471256 (8.06%) 99.39 25278538 (8.01%) 0.1571 25349263 (8.14%) 92.60 0.1666
Prefix Tuning 1743128 (0.55%) 93.43 1550410 (0.49%) 0.1598 1621135 (0.50%) 62.32 0.6041
LoRA 3807512 (1.20%) 99.68 3614794 (1.16%) 0.1053 3685519 (1.18%) 90.61 0.2016
ConvAdapter 3672344 (1.16%) 95.60 3479626 (1.11%) 0.1532 3550351 (1.14%) 59.27 0.6405

information within local receptive fields. To further improve Table 15


the efficiency of CNNs for speech processing tasks, Li et Results on the SURE benchmark for the TTS task. MCD and
al. (2023) [315] proposed a lightweight adapter, called the WER are the metrics used to compare fine-tuning and other
ConvAdapter, which uses three 1D convolutional layers, layer parameter-efficient approaches.
normalization, and a squeeze-and-excite module (Zhang et al., LTS L2ARCTIC
2017) [201], as shown in Figure 18. By utilizing depth-wise Method Parameters (%)

convolution, which requires fewer parameters and is more


MCD ↓ WER ↓ MCD ↓ WER ↓

computationally efficient, the authors were able to achieve


Fine-tuning 35802977 6.2038 0.2655 6.71469 0.2141

better performance while using fewer resources. In this ap- Adapter 659200 6.1634 0.3143 6.544 0.2504

proach, the ConvAdapter is added to the same location as the Prefix 153600 6.2523 0.3334 7.4264 0.3244

Bottleneck Adapter (Figure 17). LoRA 81920 6.8319 0.3786 7.0698 0.3291
Table 13, Table 14, and Table 15 present the results of Convadapter 108800 6.9202 0.3365 6.9712 0.3227
various speech processing tasks in the SURE benchmark.
The findings demonstrate that the adapter-based methods
perform comparably well in fine-tuning. However, there is outputs of the larger model or, by using, the larger model’s
no significant advantage of any particular adapter type over hidden representations as input to the smaller model. Knowl-
others for these benchmark tasks and datasets. edge distillation is effective in reducing the computational
cost of training and inference.
6.3.2. Knowledge Distillation (KD) Cho et al. [81] conducted knowledge distillation (KD)
Knowledge distillation involves training a smaller model by directly applying it to the downstream task. One way to
to mimic the behavior of a larger and more complex model. improve this approach is to use KD as pre-training for various
This can be done by training the smaller model to predict the downstream tasks, thus allowing for knowledge reuse. A

Mehrish et al.: Preprint submitted to Elsevier Page 52 of 72


A Review of Deep Learning Techniques for Speech Processing

noteworthy result achieved by Denisov and Vu [107] was larger and more comprehensive models, along with the
using KD in pretraining. However, they achieved this by utilization of larger datasets. By leveraging these re-
initializing an utterance encoder with a trained ASR model’s sources, it becomes possible to create TTS models that
backbone, followed by a trained NLU backbone. Knowledge exhibit enhanced naturalness and human-like prosody.
distillation can be applied directly into a wav2vec 2.0 encoder One promising approach to achieve this is through the
without ASR training and a trained NLU module to enhance application of adversarial training, where a discrim-
this method. Kim et al. [256] implemented a more complex inator is employed to distinguish between machine-
architecture, utilizing KD in both the pretraining and fine- generated speech and reference speech. This adversar-
tuning stages. ial framework facilitates the generation of TTS models
that closely resemble human speech, providing a sig-
6.3.3. Model Compression nificant step forward in achieving more realistic and
Researchers have also explored various architectural mod- high-quality synthesized speech. By exploring these
ifications to existing models to make them more parameter- avenues, researchers aim to push the boundaries of
efficient. One such approach is pruning [141, 586], where speech synthesis technology, ultimately enhancing the
motivated by lottery-ticket hypothesis (LTH) [140], the task- overall performance and realism of TTS systems.
irrelevant parameters are masked based on some threshold
defined by importance score, such as some parameter norm. 2. Multilingual Models: Self-supervised learning has
Another form of compression could be low-rank factoriza- emerged as a transformative approach in the field of
tion [197], where the parameter matrices are factorized into speech recognition, particularly for low-resource lan-
lower-rank matrices with much fewer parameters. Finally, guages characterized by scarce or unavailable labeled
quantization is a popular approach to reduce the model size datasets. The recent development of the XLS-R model,
and improve energy efficiency with a minimal performance a state-of-the-art self-supervised speech recognition
penalty. It involves transforming 32-bit floating point model model, represents a significant milestone in this do-
weights into integers with fewer bit-counts [619]—8-bit, 4- main. With a remarkable scale of over 2 billion param-
bit, 2-bit, and even 1-bit—through scaling and shifting. At the eters, the XLS-R model has been trained on a diverse
same time, the quantization of the activation is also handled dataset spanning 128 languages, surpassing its prede-
based on the input. cessor in terms of language coverage. The notable
Lai et al. [280] iteratively prune and subsequently fine-tune wav2vec2.0 on downstream tasks to obtain improved results over fine-tuned wav2vec2.0. Winata et al. [593] employ low-rank transformers to cut the model size in half and increase the inference speed by 1.35 times. Peng et al. [424] employ KD and quantization to make wav2vec2.0 twice as fast, twice as energy-efficient, and 4.8 times smaller at the cost of a 7% increase in WER. Without the KD step, the model is 3.6 times smaller with a mere 0.1% WER degradation.
7. Conclusion and Future Research Directions

The rapid advancements in deep learning techniques have revolutionized speech processing tasks, enabling significant progress in speech recognition, speaker recognition, and speech synthesis. This paper provides a comprehensive review of the latest developments in deep learning techniques for speech-processing tasks. We begin by examining the early developments in speech processing, including representation learning and HMM-based modeling, before presenting a concise summary of fundamental deep learning techniques and their applications in speech processing. Furthermore, we discuss key speech-processing tasks, highlight the datasets used in these tasks, and present the latest and most relevant research works utilizing deep learning techniques.

We envisage several lines of development in speech processing:

1. Large Speech Models: In addition to the advancements made with wav2vec2.0, further progress in the field of ASR and TTS models involves the development and utilization of larger datasets. By leveraging these resources, it becomes possible to create TTS models that exhibit enhanced naturalness and human-like prosody. One promising approach to achieve this is through the application of adversarial training, where a discriminator is employed to distinguish between machine-generated speech and reference speech. This adversarial framework facilitates the generation of TTS models that closely resemble human speech, providing a significant step forward in achieving more realistic and high-quality synthesized speech. By exploring these avenues, researchers aim to push the boundaries of speech synthesis technology, ultimately enhancing the overall performance and realism of TTS systems.

2. Multilingual Models: Self-supervised learning has emerged as a transformative approach in the field of speech recognition, particularly for low-resource languages characterized by scarce or unavailable labeled datasets. The recent development of the XLS-R model, a state-of-the-art self-supervised speech recognition model, represents a significant milestone in this domain. With a remarkable scale of over 2 billion parameters, the XLS-R model has been trained on a diverse dataset spanning 128 languages, surpassing its predecessor in terms of language coverage. The notable advantage of scaling up larger multilingual models like XLS-R lies in the substantial performance improvements they offer. As a result, these models are poised to outperform single-language models and hold immense promise for the future of speech recognition. By harnessing the power of self-supervised learning and leveraging multilingual datasets, the XLS-R model showcases the potential for addressing the challenges posed by low-resource languages and advancing the field of speech recognition to new heights.

3. Multimodal Speech Models: Traditional speech and text models have typically operated within a single modality, focusing solely on either speech or text inputs and outputs. However, as the scale of generative models continues to grow exponentially, the integration of multiple modalities becomes a natural progression. This trend is evident in the latest developments, such as the unveiling of groundbreaking language models like GPT-4 [407] and Kosmos-1 [207], which demonstrate the ability to process both images and text jointly. These pioneering multimodal models pave the way for the emergence of large-scale architectures that can seamlessly handle speech and other modalities in a unified manner. The convergence of multiple modalities within a single model opens up new avenues for comprehensive understanding and generation of multimodal content, and it is highly anticipated that we will witness the rapid development of large multimodal models tailored for speech and beyond in the near future.


4. In-Context Learning: Utilizing mixed-modality models opens up possibilities for the development of in-context learning approaches for a wide range of speech-related tasks. This paradigm allows the tasks to be explicitly defined within the input, along with accompanying examples. Remarkable progress has already been demonstrated in large language models (LLMs), including notable works such as InstructGPT [408], FLAN-T5 [90], and LLaMA [537]. These models showcase the efficacy of in-context learning, where the integration of context-driven information empowers the models to excel in various speech tasks. By leveraging mixed-modality models and incorporating contextual cues, researchers are advancing the boundaries of speech processing capabilities, paving the way for more versatile and context-aware speech systems.

5. Controllable Speech Generation: An intriguing application stemming from the aforementioned concept is controllable text-to-speech (TTS), which allows for fine-grained control over various attributes of the synthesized speech. Attributes such as tone, accent, age, gender, and more can be precisely controlled through in-context text guidance. This controllability in TTS opens up exciting possibilities for personalization and customization, enabling users to tailor the synthesized speech to their specific requirements. By leveraging advanced models and techniques, researchers are making significant strides in developing controllable TTS systems that provide users with a powerful and flexible speech synthesis experience.

6. Parameter-efficient Learning: With the increasing scale of LLMs and speech models, it becomes imperative to adapt these models with minimal parameter updates. This necessitates the development of specialized adapters that can efficiently update these emerging mixed-modality large models (a minimal adapter sketch is given after this list). Additionally, model compression techniques have proven to be practical solutions in addressing the challenges posed by these large models. Recent research [280, 593, 424] has demonstrated the effectiveness of model compression, highlighting the sparsity that exists within these models, particularly for specific tasks. By employing model compression techniques, researchers can reduce the computational requirements and memory footprint of these models while preserving their performance, making them more practical and accessible for real-world applications.
7. Explainability: Explainability remains elusive for these large networks as they grow. Researchers are steadfast in explaining these networks' functioning and learning dynamics. Recently, much work has been done to study the fine-tuning and in-context learning dynamics of these large models for text under the neural-tangent-kernel (NTK) asymptotic framework [366]. Such exploration is yet to be done in the speech domain. Moreover, explainability could be built in as an inductive bias in the architecture. To this end, brain-inspired architectures [382] are being developed, which may shed more light on this aspect of large models.

8. Neuroscience-inspired Architectures: In recent years, there has been significant research exploring the parallels between speech-processing architectures and the intricate workings of the human brain [382]. These studies have unveiled compelling evidence of a strong correlation between the layers of speech models and the functional hierarchy observed in the human brain. This intriguing finding has served as a catalyst for the development of neuroscience-inspired speech models that demonstrate comparable performance to state-of-the-art (SOTA) models [382]. By drawing inspiration from the underlying principles of neural processing in the human brain, these innovative speech models aim to enhance our understanding of speech perception and production while pushing the boundaries of performance in the field of speech processing.

9. Text-to-Audio Models for Text-to-Speech: Lately, transformer- and diffusion-based text-to-audio (TTA) model development is turning into an exciting area of research. Until recently, most of these models [332, 272, 611, 155, 580] overlooked speech in favour of general audio. In the future, however, these models will likely strive to be equally performant on both general audio and speech. To that end, current TTS methods will likely be an integral part of those models. Recently, Suno-AI [525] has aimed at striking a good balance between general audio and speech, although their implementation is not public, nor have they provided a detailed paper.
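As mentioned in item 6 above, adapters allow a large frozen backbone to be specialized with only a small number of trainable parameters. The sketch below is a minimal bottleneck adapter in the spirit of parameter-efficient transfer learning [190]; the layer sizes, activation, and the suggestion of a wav2vec 2.0 backbone are illustrative assumptions rather than the design of any specific system surveyed here.

```python
# Minimal sketch of a residual bottleneck adapter; dimensions are assumptions.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)  # project down
        self.up = nn.Linear(bottleneck, d_model)    # project back up
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual bottleneck: only these few parameters are trained,
        # while the pretrained backbone (e.g., a speech encoder) stays frozen.
        return x + self.up(self.act(self.down(x)))

# Usage sketch: freeze the backbone, train only the adapters and a task head.
# for p in backbone.parameters():
#     p.requires_grad = False
```

In practice the backbone is frozen and only the adapters (plus a small task head) are updated, so the trainable parameter count stays at a few percent of the full model.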
CRediT authorship contribution statement

Ambuj Mehrish: Conceptualization, Writing - Original Draft. Navonil Majumder: Writing - Original Draft. Rishabh Bhardwaj: Review, Editing. Rada Mihalcea: Review, Editing. Soujanya Poria: Review, Editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This research is supported by the Ministry of Education, Singapore, under its AcRF Tier-2 grant (Project no. T2MOE2008, and Grantor reference no. MOE-T2EP20220-0017), and A*STAR under its RIE 2020 AME programmatic grant (project reference no. RGAST2003). Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the Ministry of Education, Singapore.


References

[18] Arık, S.Ö., Chrzanowski, M., Coates, A., Diamos, G., Gibiansky, A.,
Kang, Y., Li, X., Miller, J., Ng, A., Raiman, J., et al., 2017. Deep
[1] , 2022. Conformer-1. AssemblyAI URL: https://www.assemblyai. voice: Real-time neural text-to-speech, in: International conference
com/blog/conformer-1/. on machine learning, PMLR. pp. 195–204.
[2] , 2022. Speech recognition with conformer. Nvidia URL: [19] Audhkhasi, K., Saon, G., Tüske, Z., Kingsbury, B., Picheny, M., 2019.
https://docs.nvidia.com/tao/tao-toolkit/text/asr/speech_ Forget a bit to learn better: Soft forgetting for ctc-based automatic
recognition_with_conformer.html. speech recognition., in: Interspeech, pp. 2618–2622.
[3] Abdel-Hamid, O., Mohamed, A.r., Jiang, H., Deng, L., Penn, G., Yu, [20] Babu, A., Wang, C., Tjandra, A., Lakhotia, K., Xu, Q., Goyal, N.,
D., 2014a. Convolutional neural networks for speech recognition. Singh, K., von Platen, P., Saraf, Y., Pino, J., et al., 2021. Xls-r:
IEEE/ACM Transactions on audio, speech, and language processing Self-supervised cross-lingual speech representation learning at scale.
22, 1533–1545. arXiv preprint arXiv:2111.09296 .
[4] Abdel-Hamid, O., Mohamed, A.r., Jiang, H., Deng, L., Penn, G., [21] Badlani, R., Łańcucki, A., Shih, K.J., Valle, R., Ping, W., Catanzaro,
Yu, D., 2014b. Convolutional neural networks for speech recogni- B., 2022a. One tts alignment to rule them all, in: ICASSP 2022-
tion. IEEE/ACM Transactions on Audio, Speech, and Language 2022 IEEE International Conference on Acoustics, Speech and Signal
Processing 22, 1533–1545. doi:10.1109/TASLP.2014.2339736. Processing (ICASSP), IEEE. pp. 6092–6096.
[5] Abdel-Hamid, O., Mohamed, A.r., Jiang, H., Penn, G., 2012. Apply- [22] Badlani, R., Łańcucki, A., Shih, K.J., Valle, R., Ping, W., Catanzaro,
ing convolutional neural networks concepts to hybrid nn-hmm model B., 2022b. One tts alignment to rule them all, in: ICASSP 2022 -
for speech recognition, in: 2012 IEEE International Conference on 2022 IEEE International Conference on Acoustics, Speech and Signal
Acoustics, Speech and Signal Processing (ICASSP), pp. 4277–4280. Processing (ICASSP), pp. 6092–6096. doi:10.1109/ICASSP43922.2022.
doi:10.1109/ICASSP.2012.6288864. 9747707.
[6] Abdeljaber, O., Avci, O., Kiranyaz, S., Gabbouj, M., Inman, D.J., [23] Baevski, A., Auli, M., Mohamed, A., 2019a. Effectiveness of
2017. Real-time vibration-based structural damage detection using self-supervised pre-training for speech recognition. arXiv preprint
one-dimensional convolutional neural networks. Journal of Sound arXiv:1911.03912 .
and Vibration 388, 154–170. [24] Baevski, A., Hsu, W.N., Xu, Q., Babu, A., Gu, J., Auli, M., 2022.
[7] Abdul, Z.K., Al-Talabani, A.K., 2022. Mel frequency cepstral co- Data2vec: A general framework for self-supervised learning in
efficient and its applications: A review. IEEE Access 10, 122136– speech, vision and language, in: International Conference on Machine
122158. doi:10.1109/ACCESS.2022.3223444. Learning, PMLR. pp. 1298–1312.
[8] Abdulatif, S., Cao, R., Yang, B., 2022. Cmgan: Conformer- [25] Baevski, A., Schneider, S., Auli, M., 2019b. vq-wav2vec: Self-
based metric-gan for monaural speech enhancement. arXiv preprint supervised learning of discrete speech representations. arXiv preprint
arXiv:2209.11112 . arXiv:1910.05453 .
[9] Achanta, S., Antony, A., Golipour, L., Li, J., Raitio, T., Rasipuram, [26] Baevski, A., Zhou, Y., Mohamed, A., Auli, M., 2020. wav2vec 2.0:
R., Rossi, F., Shi, J., Upadhyay, J., Winarsky, D., et al., 2021. On- A framework for self-supervised learning of speech representations.
device neural speech synthesis, in: 2021 IEEE Automatic Speech Advances in Neural Information Processing Systems 33.
Recognition and Understanding Workshop (ASRU), IEEE. pp. 1155– [27] Bahar, P., Bieschke, T., Ney, H., 2019. A comparative study on end-
1161. to-end speech to text translation, in: 2019 IEEE Automatic Speech
[10] Afouras, T., Chung, J.S., Zisserman, A., 2018. The conversa- Recognition and Understanding Workshop (ASRU), IEEE. pp. 792–
tion: Deep audio-visual speech enhancement. arXiv preprint 799.
arXiv:1804.04121 . [28] Bahdanau, D., Cho, K., Bengio, Y., 2014. Neural machine trans-
[11] Aggarwal, V., Cotescu, M., Prateek, N., Lorenzo-Trueba, J., Barra- lation by jointly learning to align and translate. arXiv preprint
Chicote, R., 2020. Using vaes and normalizing flows for one-shot arXiv:1409.0473 .
text-to-speech synthesis of expressive speech, in: ICASSP 2020 - [29] Bai, H., Zheng, R., Chen, J., Ma, M., Li, X., Huang, L., 2022. A3t:
2020 IEEE International Conference on Acoustics, Speech and Signal Alignment-aware acoustic and text pretraining for speech synthe-
Processing (ICASSP), pp. 6179–6183. doi:10.1109/ICASSP40776.2020. sis and editing, in: International Conference on Machine Learning,
9053678. PMLR. pp. 1399–1411.
[12] Alsabhan, W., 2023. Human–computer interaction with a real-time [30] Bai, S., Kolter, J.Z., Koltun, V., 2018. An empirical evaluation of
speech emotion recognition with ensembling techniques 1d convolu- generic convolutional and recurrent networks for sequence modeling.
tion neural network and attention. Sensors 23, 1386. arXiv preprint arXiv:1803.01271 .
[13] Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J., Battenberg, [31] Bai, Z., Zhang, X.L., 2021. Speaker recognition based on deep
E., Case, C., Casper, J., Catanzaro, B., Cheng, Q., Chen, G., et al., learning: An overview. Neural Networks 140, 65–99.
2016. Deep speech 2: End-to-end speech recognition in english and [32] Bak, T., Lee, J., Bae, H., Yang, J., Bae, J.S., Joo, Y.S., 2022. Av-
mandarin, in: International conference on machine learning, PMLR. ocodo: Generative adversarial network for artifact-free vocoder.
pp. 173–182. arXiv preprint arXiv:2206.13404 .
[14] Anguera, X., Bozonnet, S., Evans, N., Fredouille, C., Friedland, G., [33] Barker, J., Watanabe, S., Vincent, E., Trmal, J., 2018. The
Vinyals, O., 2012. Speaker diarization: A review of recent research. fifth’chime’speech separation and recognition challenge: dataset,
IEEE Transactions on audio, speech, and language processing 20, task and baselines. arXiv preprint arXiv:1803.10609 .
356–370. [34] Baskar, M.K., Watanabe, S., Astudillo, R., Hori, T., Burget, L., Čer-
[15] Ansari, E., Axelrod, A., Bach, N., Bojar, O., Cattoni, R., Dalvi, nockỳ, J., 2019. Semi-supervised sequence-to-sequence asr using
F., Durrani, N., Federico, M., Federmann, C., Gu, J., et al., 2020. unpaired speech and text. arXiv preprint arXiv:1905.01152 .
Findings of the iwslt 2020 evaluation campaign, in: Proceedings of [35] Battenberg, E., Skerry-Ryan, R., Mariooryad, S., Stanton, D., Kao,
the 17th International Conference on Spoken Language Translation, D., Shannon, M., Bagby, T., 2020. Location-relative attention mecha-
pp. 1–34. nisms for robust long-form speech synthesis, in: ICASSP 2020-2020
[16] Ao, J., Wang, R., Zhou, L., Wang, C., Ren, S., Wu, Y., Liu, S., Ko, IEEE International Conference on Acoustics, Speech and Signal Pro-
T., Li, Q., Zhang, Y., et al., 2021. Speecht5: Unified-modal encoder- cessing (ICASSP), IEEE. pp. 6194–6198.
decoder pre-training for spoken language processing. arXiv preprint [36] Beerends, J.G., Schmidmer, C., Berger, J., Obermann, M., Ullmann,
arXiv:2110.07205 . R., Pomy, J., Keyhl, M., 2013. Perceptual objective listening quality
[17] Ardila, R., Branson, M., Davis, K., Henretty, M., Kohler, M., Meyer, assessment (polqa), the third generation itu-t standard for end-to-end
J., Morais, R., Saunders, L., Tyers, F.M., Weber, G., 2019. Com- speech quality measurement part i—temporal alignment. Journal of
mon voice: A massively-multilingual speech corpus. arXiv preprint the Audio Engineering Society 61, 366–384.
arXiv:1912.06670 .


[37] Berg, A., O’Connor, M., Cruz, M.T., 2021. Keyword trans- [56] Cattoni, R., Di Gangi, M.A., Bentivogli, L., Negri, M., Turchi, M.,
former: A self-attention model for keyword spotting. arXiv preprint 2021. Must-c: A multilingual corpus for end-to-end speech transla-
arXiv:2104.00769 . tion. Computer Speech & Language 66, 101155.
[38] Bińkowski, M., Donahue, J., Dieleman, S., Clark, A., Elsen, E., [57] Cauchi, B., Siedenburg, K., Santos, J.F., Falk, T.H., Doclo, S., Goetze,
Casagrande, N., Cobo, L.C., Simonyan, K., 2019a. High fi- S., 2019. Non-intrusive speech quality prediction using modula-
delity speech synthesis with adversarial networks. arXiv preprint tion energies and lstm-network. IEEE/ACM Transactions on Audio,
arXiv:1909.11646 . Speech, and Language Processing 27, 1151–1163.
[39] Bińkowski, M., Donahue, J., Dieleman, S., Clark, A., Elsen, E., [58] Chan, W., Park, D., Lee, C., Zhang, Y., Le, Q., Norouzi, M., 2021.
Casagrande, N., Cobo, L.C., Simonyan, K., 2019b. High fidelity Speechstew: Simply mix all available speech recognition data to train
speech synthesis with adversarial networks, in: International Confer- one large neural network. arXiv preprint arXiv:2104.02133 .
ence on Learning Representations. [59] Chang, K.W., et al., 2022. An exploration of prompt tuning on
[40] Birnbaum, S., Kuleshov, V., Enam, Z., Koh, P.W.W., Ermon, S., generative spoken language model for speech processing tasks. arXiv
2019. Temporal film: Capturing long-range sequence dependencies preprint arXiv:2203.16773 .
with feature-wise modulations. Advances in Neural Information [60] Chaudhari, S., Mithal, V., Polatkan, G., Ramanath, R., 2021. An
Processing Systems 32. attentive survey of attention models. ACM Transactions on Intelligent
[41] Boll, S., 1979. Suppression of acoustic noise in speech using spectral Systems and Technology (TIST) 12, 1–32.
subtraction. IEEE Transactions on acoustics, speech, and signal [61] Chen, G., Chai, S., Wang, G., Du, J., Zhang, W.Q., Weng, C., Su,
processing 27, 113–120. D., Povey, D., Trmal, J., Zhang, J., et al., 2021a. Gigaspeech: An
[42] Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., von evolving, multi-domain asr corpus with 10,000 hours of transcribed
Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., et al., audio. arXiv preprint arXiv:2106.06909 .
2021. On the opportunities and risks of foundation models. arXiv [62] Chen, J., Ma, M., Zheng, R., Huang, L., 2020a. Mam: Masked
preprint arXiv:2108.07258 . acoustic modeling for end-to-end speech-to-text translation. arXiv
[43] Bourlard, H.A., Morgan, N., 1994. Connectionist speech recognition: preprint arXiv:2010.11445 .
a hybrid approach. volume 247. Springer Science & Business Media. [63] Chen, J., Ma, M., Zheng, R., Huang, L., 2021b. Specrec: An alterna-
[44] Bousquet, P.M., Rouvier, M., 2019. On robustness of unsupervised tive solution for improving end-to-end speech-to-text translation via
domain adaptation for speaker recognition, in: Interspeech. spectrogram reconstruction., in: Interspeech, pp. 2232–2236.
[45] Bredin, H., Laurent, A., 2021. End-to-end speaker segmentation for [64] Chen, J., Tan, X., Leng, Y., Xu, J., Wen, G., Qin, T., Liu, T.Y., 2021c.
overlap-aware resegmentation. arXiv preprint arXiv:2104.04045 . Speech-t: Transducer for text to speech and beyond. Advances in
[46] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhari- Neural Information Processing Systems 34, 6621–6633.
wal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al., [65] Chen, L., Maddox, R.K., Duan, Z., Xu, C., 2019a. Hierarchical
2020. Language models are few-shot learners. Advances in neural cross-modal talking face generation with dynamic pixel-wise loss, in:
information processing systems 33, 1877–1901. Proceedings of the IEEE/CVF conference on computer vision and
[47] Bullock, L., Bredin, H., Garcia-Perera, L.P., 2020. Overlap-aware di- pattern recognition, pp. 7832–7841.
arization: Resegmentation using neural end-to-end overlapped speech [66] Chen, M., Tan, X., Li, B., Liu, Y., Qin, T., Zhao, S., Liu, T.Y., 2021d.
detection, in: ICASSP 2020-2020 IEEE International Conference Adaspeech: Adaptive text to speech for custom voice. arXiv preprint
on Acoustics, Speech and Signal Processing (ICASSP), IEEE. pp. arXiv:2103.00993 .
7114–7118. [67] Chen, N., Zhang, Y., Zen, H., Weiss, R.J., Norouzi, M., Chan, W.,
[48] Bunk, T., Varshneya, D., Vlasov, V., Nichol, A., 2020. Diet: 2020b. Wavegrad: Estimating gradients for waveform generation.
Lightweight language understanding for dialogue systems. arXiv arXiv preprint arXiv:2009.00713 .
preprint arXiv:2004.09936 . [68] Chen, N., Zhang, Y., Zen, H., Weiss, R.J., Norouzi, M., Chan, W.,
[49] Burchi, M., Timofte, R., 2023. Audio-visual efficient conformer for 2020c. Wavegrad: Estimating gradients for waveform generation, in:
robust speech recognition, in: Proceedings of the IEEE/CVF Winter International Conference on Learning Representations.
Conference on Applications of Computer Vision, pp. 2258–2267. [69] Chen, N., Zhang, Y., Zen, H., Weiss, R.J., Norouzi, M., Dehak, N.,
[50] Butryna, A., Chu, S.H.C., Demirsahin, I., Gutkin, A., Ha, L., He, Chan, W., 2021e. Wavegrad 2: Iterative refinement for text-to-speech
F., Jansche, M., Johny, C., Katanova, A., Kjartansson, O., et al., synthesis. arXiv preprint arXiv:2106.09660 .
2020. Google crowdsourced speech corpora and related open-source [70] Chen, Q., Zhuo, Z., Wang, W., 2019b. Bert for joint intent classifica-
resources for low-resource languages and dialects: an overview. arXiv tion and slot filling. arXiv preprint arXiv:1902.10909 .
preprint arXiv:2010.06778 . [71] Chen, S., Wang, C., Chen, Z., Wu, Y., Liu, S., Chen, Z., Li, J., Kanda,
[51] C S, A., A P, P., Ramakrishnan, A.G., 2021. Unsupervised domain N., Yoshioka, T., Xiao, X., et al., 2022a. Wavlm: Large-scale self-
adaptation schemes for building asr in low-resource languages, in: supervised pre-training for full stack speech processing. IEEE Journal
2021 IEEE Automatic Speech Recognition and Understanding Work- of Selected Topics in Signal Processing 16, 1505–1518.
shop (ASRU), pp. 342–349. doi:10.1109/ASRU51503.2021.9688269. [72] Chen, S., Wu, Y., Wang, C., Chen, Z., Chen, Z., Liu, S., Wu, J., Qian,
[52] Campbell, W., Campbell, J., Reynolds, D., Jones, D., Leek, T., 2003. Y., Wei, F., Li, J., et al., 2022b. Unispeech-sat: Universal speech
Phonetic speaker recognition with support vector machines. Advances representation learning with speaker aware pre-training, in: ICASSP
in neural information processing systems 16. 2022-2022 IEEE International Conference on Acoustics, Speech and
[53] Carletta, J., Ashby, S., Bourban, S., Flynn, M., Guillemot, M., Hain, Signal Processing (ICASSP), IEEE. pp. 6152–6156.
T., Kadlec, J., Karaiskos, V., Kraaij, W., Kronenthal, M., et al., 2006. [73] Chen, Y., Guo, W., Gu, B., 2021f. Improved meta-learning training
The ami meeting corpus: A pre-announcement, in: Machine Learning for speaker verification. arXiv preprint arXiv:2103.15421 .
for Multimodal Interaction: Second International Workshop, MLMI [74] Chen, Z., Tan, X., Wang, K., Pan, S., Mandic, D., He, L., Zhao,
2005, Edinburgh, UK, July 11-13, 2005, Revised Selected Papers 2, S., 2022c. Infergrad: Improving diffusion models for vocoder by
Springer. pp. 28–39. considering inference in training, in: ICASSP 2022-2022 IEEE In-
[54] Castiglioni, P., 2005. Levinson-durbin algorithm. Encyclopedia of ternational Conference on Acoustics, Speech and Signal Processing
Biostatistics 4. (ICASSP), IEEE. pp. 8432–8436.
[55] Catellier, A.A., Voran, S.D., 2020. Wawenets: A no-reference con- [75] Chen, Z., Wang, S., Qian, Y., 2020d. Adversarial domain adap-
volutional waveform-based approach to estimating narrowband and tation for speaker verification using partially shared network., in:
wideband speech quality, in: ICASSP 2020-2020 IEEE International Interspeech, pp. 3017–3021.
Conference on Acoustics, Speech and Signal Processing (ICASSP), [76] Chen, Z., Wang, S., Qian, Y., 2021g. Self-supervised learning based
IEEE. pp. 331–335. domain adaptation for robust speaker verification, in: ICASSP 2021-


2021 IEEE International Conference on Acoustics, Speech and Signal [96] Chung, Y.A., Zhang, Y., Han, W., Chiu, C.C., Qin, J., Pang, R., Wu,
Processing (ICASSP), IEEE. pp. 5834–5838. Y., 2021. W2v-bert: Combining contrastive learning and masked
[77] Chen, Z., Watanabe, S., Erdogan, H., Hershey, J.R., 2015. Speech language modeling for self-supervised speech pre-training, in: 2021
enhancement and recognition using multi-task learning of long short- IEEE Automatic Speech Recognition and Understanding Workshop
term memory recurrent neural networks, in: Sixteenth Annual Con- (ASRU), IEEE. pp. 244–250.
ference of the International Speech Communication Association. [97] Cooke, M., Barker, J., Cunningham, S., Shao, X., 2006. An audio-
[78] Chiu, C.C., Qin, J., Zhang, Y., Yu, J., Wu, Y., 2022. Self-supervised visual corpus for speech perception and automatic speech recognition.
learning with random-projection quantizer for speech recognition, in: The Journal of the Acoustical Society of America 120, 2421–2424.
International Conference on Machine Learning, PMLR. pp. 3915– [98] Coria, J.M., Bredin, H., Ghannay, S., Rosset, S., 2021. Overlap-
3924. aware low-latency online speaker diarization based on end-to-end
[79] Chiu, C.C., Raffel, C., 2017. Monotonic chunkwise attention. arXiv local segmentation, in: 2021 IEEE Automatic Speech Recognition
preprint arXiv:1712.05382 . and Understanding Workshop (ASRU), IEEE. pp. 1139–1146.
[80] Cho, K., Courville, A., Bengio, Y., 2015. Describing multimedia [99] Coto-Jiménez, M., 2019. Improving post-filtering of artificial speech
content using attention-based encoder-decoder networks. IEEE Trans- using pre-trained lstm neural networks. Biomimetics 4, 39.
actions on Multimedia 17, 1875–1886. [100] Coucke, A., Chlieh, M., Gisselbrecht, T., Leroy, D., Poumeyrol, M.,
[81] Cho, W.I., Kwak, D., Yoon, J., Kim, N.S., 2020. Speech to text adap- Lavril, T., 2019. Efficient keyword spotting using dilated convolutions
tation: Towards an efficient cross-modal distillation, in: Interspeech. and gating, in: ICASSP 2019-2019 IEEE International Conference
[82] Choi, H.S., Lee, J., Kim, W., Lee, J., Heo, H., Lee, K., 2021. Neural on Acoustics, Speech and Signal Processing (ICASSP), IEEE. pp.
analysis and synthesis: Reconstructing speech from self-supervised 6351–6355.
representations. Advances in Neural Information Processing Systems [101] Coucke, A., Saade, A., Ball, A., Bluche, T., Caulier, A., Leroy, D.,
34, 16251–16265. Doumouro, C., Gisselbrecht, T., Caltagirone, F., Lavril, T., et al.,
[83] Choi, H.S., Yang, J., Lee, J., Kim, H., 2022. Nansy++: Unified 2018. Snips voice platform: an embedded spoken language under-
voice synthesis with neural analysis and synthesis. arXiv preprint standing system for private-by-design voice interfaces. arXiv preprint
arXiv:2211.09407 . arXiv:1805.10190 .
[84] Chorowski, J., Weiss, R.J., Bengio, S., Van Den Oord, A., 2019. [102] Dauphin, Y.N., Fan, A., Auli, M., Grangier, D., 2017. Language mod-
Unsupervised speech representation learning using wavenet autoen- eling with gated convolutional networks, in: International conference
coders. IEEE/ACM transactions on audio, speech, and language on machine learning, PMLR. pp. 933–941.
processing 27, 2041–2053. [103] Deng, J., Guo, J., Xue, N., Zafeiriou, S., 2019. Arcface: Additive
[85] Chorowski, J.K., Bahdanau, D., Serdyuk, D., Cho, K., Bengio, Y., angular margin loss for deep face recognition, in: Proceedings of the
2015. Attention-based models for speech recognition. Advances in IEEE/CVF conference on computer vision and pattern recognition,
neural information processing systems 28. pp. 4690–4699.
[86] Chou, J.c., Yeh, C.c., Lee, H.y., 2019. One-shot voice conversion by [104] Deng, K., Cao, S., Zhang, Y., Ma, L., 2021. Improving hybrid
separating speaker and content representations with instance normal- ctc/attention end-to-end speech recognition with pretrained acoustic
ization. arXiv preprint arXiv:1904.05742 . and language models, in: 2021 IEEE Automatic Speech Recogni-
[87] Chowdhury, A., Cozzo, A., Ross, A., 2022. Domain adaptation for tion and Understanding Workshop (ASRU), pp. 76–82. doi:10.1109/
speaker recognition in singing and spoken voice, in: ICASSP 2022- ASRU51503.2021.9688009.
2022 IEEE International Conference on Acoustics, Speech and Signal [105] Deng, K., Cao, S., Zhang, Y., Ma, L., Cheng, G., Xu, J., Zhang,
Processing (ICASSP), IEEE. pp. 7192–7196. P., 2022a. Improving ctc-based speech recognition via knowledge
[88] Chung, H., Jeon, H.B., Park, J.G., 2020a. Semi-supervised training transferring from pre-trained language models, in: ICASSP 2022 -
for sequence-to-sequence speech recognition using reinforcement 2022 IEEE International Conference on Acoustics, Speech and Signal
learning, in: 2020 International Joint Conference on Neural Networks Processing (ICASSP), pp. 8517–8521. doi:10.1109/ICASSP43922.2022.
(IJCNN), IEEE. pp. 1–6. 9747887.
[89] Chung, H., Jeon, H.B., Park, J.G., 2020b. Semi-supervised training [106] Deng, K., Cao, S., Zhang, Y., Ma, L., Cheng, G., Xu, J., Zhang,
for sequence-to-sequence speech recognition using reinforcement P., 2022b. Improving ctc-based speech recognition via knowledge
learning, in: 2020 International Joint Conference on Neural Networks transferring from pre-trained language models, in: ICASSP 2022-
(IJCNN), pp. 1–6. doi:10.1109/IJCNN48605.2020.9207023. 2022 IEEE International Conference on Acoustics, Speech and Signal
[90] Chung, H.W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Processing (ICASSP), IEEE. pp. 8517–8521.
Li, E., Wang, X., Dehghani, M., Brahma, S., Webson, A., Gu, S.S., [107] Denisov, P., Vu, N.T., 2020. Pretrained semantic speech embed-
Dai, Z., Suzgun, M., Chen, X., Chowdhery, A., Valter, D., Narang, dings for end-to-end spoken language understanding via cross-modal
S., Mishra, G., Yu, A.W., Zhao, V., Huang, Y., Dai, A.M., Yu, H., teacher-student learning, in: Interspeech.
Petrov, S., hsin Chi, E.H., Dean, J., Devlin, J., Roberts, A., Zhou, [108] Desplanques, B., Thienpondt, J., Demuynck, K., 2020. Ecapa-tdnn:
D., Le, Q.V., Wei, J., 2022. Scaling instruction-finetuned language Emphasized channel attention, propagation and aggregation in tdnn
models. ArXiv abs/2210.11416. based speaker verification. arXiv preprint arXiv:2005.07143 .
[91] Chung, J.S., Huh, J., Mun, S., Lee, M., Heo, H.S., Choe, S., Ham, C., [109] Devlin, J., Chang, M.W., Lee, K., Toutanova, K., 2018. Bert: Pre-
Jung, S., Lee, B.J., Han, I., 2020c. In defence of metric learning for training of deep bidirectional transformers for language understand-
speaker recognition. arXiv preprint arXiv:2003.11982 . ing. arXiv preprint arXiv:1810.04805 .
[92] Chung, J.S., Nagrani, A., Zisserman, A., 2018. Voxceleb2: Deep [110] Di Gangi, M.A., Negri, M., Turchi, M., 2019. Adapting transformer
speaker recognition, in: INTERSPEECH. to end-to-end spoken language translation, in: Proceedings of INTER-
[93] Chung, J.S., Zisserman, A., 2017. Lip reading in the wild, in: Com- SPEECH 2019. International Speech Communication Association
puter Vision–ACCV 2016: 13th Asian Conference on Computer (ISCA), pp. 1133–1137.
Vision, Taipei, Taiwan, November 20-24, 2016, Revised Selected [111] Diez, M., Burget, L., Landini, F., Wang, S., Černockỳ, H., 2020.
Papers, Part II 13, Springer. pp. 87–103. Optimizing bayesian hmm based x-vector clustering for the second
[94] Chung, S.W., Choe, S., Chung, J.S., Kang, H.G., 2020d. Facefilter: dihard speech diarization challenge, in: ICASSP 2020-2020 IEEE
Audio-visual speech separation using still images. arXiv preprint International Conference on Acoustics, Speech and Signal Processing
arXiv:2005.07074 . (ICASSP), IEEE. pp. 6519–6523.
[95] Chung, Y.A., Hsu, W.N., Tang, H., Glass, J., 2019. An unsuper- [112] Dingliwal, S., Shenoy, A., Bodapati, S., Gandhe, A., Gadde, R.T.,
vised autoregressive model for speech representation learning. arXiv Kirchhoff, K., 2021. Prompt-tuning in asr systems for efficient
preprint arXiv:1904.03240 . domain-adaptation. arXiv preprint arXiv:2110.06502 .


[113] Donahue, C., McAuley, J., Puckette, M., 2018. Adversarial audio lenges. IEEE Signal Processing Magazine 39, 42–62.
synthesis. arXiv preprint arXiv:1802.04208 . [133] Eskimez, S.E., Maddox, R.K., Xu, C., Duan, Z., 2020. End-to-end
[114] Donahue, J., Dieleman, S., Binkowski, M., Elsen, E., Simonyan, generation of talking faces from noisy speech, in: ICASSP 2020-
K., 2020a. End-to-end adversarial text-to-speech, in: International 2020 IEEE international conference on acoustics, speech and signal
Conference on Learning Representations. processing (ICASSP), IEEE. pp. 1948–1952.
[115] Donahue, J., Dieleman, S., Bińkowski, M., Elsen, E., Simonyan, [134] Eskimez, S.E., Zhang, Y., Duan, Z., 2021. Speech driven talking
K., 2020b. End-to-end adversarial text-to-speech. arXiv preprint face generation from a single image and an emotion condition. IEEE
arXiv:2006.03575 . Transactions on Multimedia 24, 3480–3490.
[116] Dong, L., Xu, S., Xu, B., 2018. Speech-transformer: A no-recurrence [135] Fan, Y., Kang, J., Li, L., Li, K., Chen, H., Cheng, S., Zhang, P., Zhou,
sequence-to-sequence model for speech recognition, in: 2018 IEEE Z., Cai, Y., Wang, D., 2020. Cn-celeb: a challenging chinese speaker
International Conference on Acoustics, Speech and Signal Processing recognition dataset, in: ICASSP 2020-2020 IEEE International Con-
(ICASSP), pp. 5884–5888. doi:10.1109/ICASSP.2018.8462506. ference on Acoustics, Speech and Signal Processing (ICASSP), IEEE.
[117] Dong, P., Wang, S., Niu, W., Zhang, C., Lin, S., Li, Z., Gong, Y., pp. 7604–7608.
Ren, B., Lin, X., Tao, D., 2020. Rtmobile: Beyond real-time mobile [136] Fan, Y., Qian, Y., Xie, F.L., Soong, F.K., 2014. Tts synthesis with bidi-
acceleration of rnns for speech recognition, in: 2020 57th ACM/IEEE rectional lstm based recurrent neural networks, in: Fifteenth annual
Design Automation Conference (DAC), IEEE. pp. 1–6. conference of the international speech communication association.
[118] Dong, X., Williamson, D.S., 2020a. An attention enhanced multi-task [137] Fazel, A., Yang, W., Liu, Y., Barra-Chicote, R., Meng, Y., Maas,
model for objective speech assessment in real-world environments, R., Droppo, J., 2021. Synthasr: Unlocking synthetic data for speech
in: ICASSP 2020-2020 IEEE International Conference on Acoustics, recognition. arXiv preprint arXiv:2106.07803 .
Speech and Signal Processing (ICASSP), IEEE. pp. 911–915. [138] Fonseca, E., Favory, X., Pons, J., Font, F., Serra, X., 2021. Fsd50k:
[119] Dong, X., Williamson, D.S., 2020b. A pyramid recurrent network for an open dataset of human-labeled sound events. IEEE/ACM Trans-
predicting crowdsourced speech-quality ratings of real-world signals. actions on Audio, Speech, and Language Processing 30, 829–852.
arXiv preprint arXiv:2007.15797 . [139] Franco-Galván, C., Herrera-Camacho, A., Escalante-Ramírez, B.,
[120] Dovrat, S., Nachmani, E., Wolf, L., 2021. Many-speakers single 2019. Application of different statistical tests for validation of syn-
channel speech separation with optimal permutation training. arXiv thesized speech parameterized by cepstral coefficients and lsp. Com-
preprint arXiv:2104.08955 . putación y Sistemas 23, 461–467.
[121] Drexler, J., Glass, J., 2019. Explicit alignment of text and speech [140] Frankle, J., Carbin, M., 2018. The lottery ticket hypothesis: Training
encodings for attention-based end-to-end speech recognition, in: 2019 pruned neural networks. CoRR abs/1803.03635. URL: http://arxiv.
IEEE Automatic Speech Recognition and Understanding Workshop org/abs/1803.03635, arXiv:1803.03635.
(ASRU), IEEE. pp. 913–919. [141] Frantar, E., Alistarh, D., 2023. Sparsegpt: Massive language models
[122] Du, C., Yu, K., 2022. Phone-level prosody modelling with gmm- can be accurately pruned in one-shot. ArXiv abs/2301.00774.
based mdn for diverse and controllable speech synthesis. IEEE/ACM [142] Fu, S.W., Liao, C.F., Tsao, Y., Lin, S.D., 2019. Metricgan: Generative
Transactions on Audio, Speech, and Language Processing 30, 190– adversarial networks based black-box metric scores optimization
201. doi:10.1109/TASLP.2021.3133205. for speech enhancement, in: International Conference on Machine
[123] Du, J., Wang, Q., Gao, T., Xu, Y., Dai, L.R., Lee, C.H., 2014. Robust Learning, PMLR. pp. 2031–2041.
speech recognition with speech enhanced deep neural networks, in: [143] Fu, S.W., Tsao, Y., Lu, X., 2016. Snr-aware convolutional neural
Fifteenth annual conference of the international speech communica- network modeling for speech enhancement., in: Interspeech, pp. 3768–
tion association. 3772.
[124] Duarte, A., Palaskar, S., Ventura, L., Ghadiyaram, D., DeHaan, K., [144] Fu, Y., Cheng, L., Lv, S., Jv, Y., Kong, Y., Chen, Z., Hu, Y., Xie, L.,
Metze, F., Torres, J., Giro-i Nieto, X., 2021. How2sign: a large- Wu, J., Bu, H., et al., 2021. Aishell-4: An open source dataset for
scale multimodal dataset for continuous american sign language, in: speech enhancement, separation, recognition and speaker diarization
Proceedings of the IEEE/CVF conference on computer vision and in conference scenario. arXiv preprint arXiv:2104.03603 .
pattern recognition, pp. 2735–2744. [145] Fujita, Y., Kanda, N., Horiguchi, S., Xue, Y., Nagamatsu, K., Watan-
[125] Edu, J.S., Such, J.M., Suarez-Tangil, G., 2020. Smart home personal abe, S., 2019. End-to-end neural speaker diarization with self-
assistants: a security and privacy review. ACM Computing Surveys attention, in: 2019 IEEE Automatic Speech Recognition and Un-
(CSUR) 53, 1–36. derstanding Workshop (ASRU), IEEE. pp. 296–303.
[126] Elias, I., Zen, H., Shen, J., Zhang, Y., Jia, Y., Weiss, R.J., Wu, Y., [146] Gabbay, A., Shamir, A., Peleg, S., 2017. Visual speech enhancement.
2021. Parallel tacotron: Non-autoregressive and controllable tts, in: arXiv preprint arXiv:1711.08789 .
ICASSP 2021-2021 IEEE International Conference on Acoustics, [147] Gabryś, A., Huybrechts, G., Ribeiro, M.S., Chien, C.M., Roth, J.,
Speech and Signal Processing (ICASSP), IEEE. pp. 5709–5713. Comini, G., Barra-Chicote, R., Perz, B., Lorenzo-Trueba, J., 2022.
[127] Elneima, A., Bińkowski, M., 2022. Adversarial text-to-speech for Voice filter: Few-shot text-to-speech speaker adaptation using voice
low-resource languages, in: Proceedings of the The Seventh Arabic conversion as a post-processing module, in: ICASSP 2022-2022
Natural Language Processing Workshop (WANLP), pp. 76–84. IEEE International Conference on Acoustics, Speech and Signal
[128] Ephraim, Y., 1992. A bayesian estimation approach for speech en- Processing (ICASSP), IEEE. pp. 7902–7906.
hancement using hidden markov models. IEEE Transactions on [148] Galassi, A., Lippi, M., Torroni, P., 2020. Attention in natural lan-
Signal Processing 40, 725–735. guage processing. IEEE transactions on neural networks and learning
[129] Ephrat, A., Halperin, T., Peleg, S., 2017. Improved speech recon- systems 32, 4291–4308.
struction from silent video, in: Proceedings of the IEEE International [149] Gales, M., Young, S., et al., 2008. The application of hidden markov
Conference on Computer Vision Workshops, pp. 455–462. models in speech recognition. Foundations and Trends® in Signal
[130] Ephrat, A., Mosseri, I., Lang, O., Dekel, T., Wilson, K., Hassidim, Processing 1, 195–304.
A., Freeman, W.T., Rubinstein, M., 2018. Looking to listen at the [150] Gao, C., Gu, Y., Caliva, F., Liu, Y., 2023. Self-supervised speech
cocktail party: A speaker-independent audio-visual model for speech representation learning for keyword-spotting with light-weight trans-
separation. arXiv preprint arXiv:1804.03619 . formers. arXiv preprint arXiv:2303.04255 .
[131] Ephrat, A., Peleg, S., 2017. Vid2speech: speech reconstruction from [151] Gao, R., Grauman, K., 2021. Visualvoice: Audio-visual speech sepa-
silent video, in: 2017 IEEE International Conference on Acoustics, ration with cross-modal consistency, in: 2021 IEEE/CVF Conference
Speech and Signal Processing (ICASSP), IEEE. pp. 5095–5099. on Computer Vision and Pattern Recognition (CVPR), IEEE. pp.
[132] Ericsson, L., Gouk, H., Loy, C.C., Hospedales, T.M., 2022. Self- 15490–15500.
supervised representation learning: Introduction, advances, and chal- [152] Garcia-Romero, D., McCree, A., Snyder, D., Sell, G., 2020. Jhu-


hltcoe system for the voxsrc speaker recognition challenge, in: 6633–6637.
ICASSP 2020-2020 IEEE International Conference on Acoustics, [173] Harte, N., Gillen, E., 2015. Tcd-timit: An audio-visual corpus of
Speech and Signal Processing (ICASSP), IEEE. pp. 7559–7563. continuous speech. IEEE Transactions on Multimedia 17, 603–615.
[153] Garofolo, J.S., 1993. Timit acoustic phonetic continuous speech [174] Hatch, A.O., Kajarekar, S., Stolcke, A., 2006. Within-class covari-
corpus. Linguistic Data Consortium, 1993 . ance normalization for svm-based speaker recognition, in: Ninth
[154] Ghifary, M., Kleijn, W.B., Zhang, M., Balduzzi, D., Li, W., 2016. international conference on spoken language processing.
Deep reconstruction-classification networks for unsupervised do- [175] Haykin, S., Chen, Z., 2005. The cocktail party problem. Neural
main adaptation, in: Computer Vision–ECCV 2016: 14th European computation 17, 1875–1902.
Conference, Amsterdam, The Netherlands, October 11–14, 2016, [176] He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning
Proceedings, Part IV 14, Springer. pp. 597–613. for image recognition, in: Proceedings of the IEEE conference on
[155] Ghosal, D., Majumder, N., Mehrish, A., Poria, S., 2023. Text-to- computer vision and pattern recognition, pp. 770–778.
audio generation using instruction-tuned llm and latent diffusion [177] He, Y., Prabhavalkar, R., Rao, K., Li, W., Bakhtin, A., McGraw, I.,
model. arXiv:2304.13731. 2017. Streaming small-footprint keyword spotting using sequence-
[156] Gibiansky, A., Arik, S., Diamos, G., Miller, J., Peng, K., Ping, W., to-sequence models, in: 2017 IEEE Automatic Speech Recognition
Raiman, J., Zhou, Y., 2017. Deep voice 2: Multi-speaker neural and Understanding Workshop (ASRU), IEEE. pp. 474–481.
text-to-speech. Advances in neural information processing systems [178] He, Y., Sainath, T.N., Prabhavalkar, R., McGraw, I., Alvarez, R.,
30. Zhao, D., Rybach, D., Kannan, A., Wu, Y., Pang, R., et al., 2019.
[157] Giri, R., Isik, U., Krishnaswamy, A., 2019. Attention wave-u-net Streaming end-to-end speech recognition for mobile devices, in:
for speech enhancement, in: 2019 IEEE Workshop on Applications ICASSP 2019-2019 IEEE International Conference on Acoustics,
of Signal Processing to Audio and Acoustics (WASPAA), IEEE. pp. Speech and Signal Processing (ICASSP), IEEE. pp. 6381–6385.
249–253. [179] Hemphill, C.T., Godfrey, J.J., Doddington, G.R., 1990. The atis spo-
[158] Graves, A., 2012. Sequence transduction with recurrent neural net- ken language systems pilot corpus, in: Speech and Natural Language:
works. arXiv preprint arXiv:1211.3711 . Proceedings of a Workshop Held at Hidden Valley, Pennsylvania,
[159] Graves, A., Graves, A., 2012. Connectionist temporal classification. June 24-27, 1990.
Supervised sequence labelling with recurrent neural networks , 61– [180] Hendrycks, D., Dietterich, T., 2019. Benchmarking neural net-
93. work robustness to common corruptions and perturbations, in: In-
[160] Graves, A., Jaitly, N., 2014. Towards end-to-end speech recogni- ternational Conference on Learning Representations. URL: https:
tion with recurrent neural networks, in: International conference on //openreview.net/forum?id=HJz6tiCqYm.
machine learning, PMLR. pp. 1764–1772. [181] Hershey, J.R., Chen, Z., Le Roux, J., Watanabe, S., 2016. Deep clus-
[161] Graves, A., Mohamed, A.r., Hinton, G., 2013. Speech recognition tering: Discriminative embeddings for segmentation and separation,
with deep recurrent neural networks, in: 2013 IEEE international in: 2016 IEEE international conference on acoustics, speech and
conference on acoustics, speech and signal processing, Ieee. pp. 6645– signal processing (ICASSP), IEEE. pp. 31–35.
6649. [182] Higuchi, Y., Yan, B., Arora, S., Ogawa, T., Kobayashi, T., Watanabe,
[162] Gulati, A., Qin, J., Chiu, C.C., Parmar, N., Zhang, Y., Yu, J., S., 2022. Bert meets ctc: New formulation of end-to-end speech
Han, W., Wang, S., Zhang, Z., Wu, Y., et al., 2020. Conformer: recognition with pre-trained masked language model. arXiv preprint
Convolution-augmented transformer for speech recognition. arXiv arXiv:2210.16663 .
preprint arXiv:2005.08100 . [183] Higy, B., Bell, P., 2018. Few-shot learning with attention-based
[163] Guo, H., Xie, F., Soong, F.K., Wu, X., Meng, H., 2022. A multi- sequence-to-sequence models. arXiv preprint arXiv:1811.03519 .
stage multi-codebook vq-vae approach to high-performance neural [184] Himawan, I., Villavicencio, F., Sridharan, S., Fookes, C., 2019. Deep
tts. arXiv preprint arXiv:2209.10887 . domain adaptation for anti-spoofing in speaker verification systems.
[164] Habibie, I., Elgharib, M., Sarkar, K., Abdullah, A., Nyatsanga, S., Computer Speech & Language 58, 377–402.
Neff, M., Theobalt, C., 2022. A motion matching-based framework [185] Hinton, G., Deng, L., Yu, D., Dahl, G.E., Mohamed, A.r., Jaitly, N.,
for controllable gesture synthesis from speech, in: ACM SIGGRAPH Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T.N., et al., 2012.
2022 Conference Proceedings, pp. 1–9. Deep neural networks for acoustic modeling in speech recognition:
[165] Hady, M.F.A., Schwenker, F., 2013. Semi-supervised learning. Hand- The shared views of four research groups. IEEE Signal processing
book on Neural Information Processing , 215–239. magazine 29, 82–97.
[166] Han, C., Wang, M., Ji, H., Li, L., 2021. Learning shared semantic [186] Ho, J., Jain, A., Abbeel, P., 2020. Denoising diffusion probabilistic
space for speech-to-text translation. arXiv preprint arXiv:2105.03095 models. Advances in Neural Information Processing Systems 33,
. 6840–6851.
[167] Han, K., Wang, Y., Chen, H., Chen, X., Guo, J., Liu, Z., Tang, Y., [187] Hochreiter, S., Schmidhuber, J., 1997. Long short-term memory.
Xiao, A., Xu, C., Xu, Y., et al., 2022. A survey on vision transformer. Neural computation 9, 1735–1780.
IEEE transactions on pattern analysis and machine intelligence 45, [188] Hou, J.C., Wang, S.S., Lai, Y.H., Tsao, Y., Chang, H.W., Wang,
87–110. H.M., 2018. Audio-visual speech enhancement using multimodal
[168] Han, S., Lee, J., 2022. Nu-wave 2: A general neural audio upsampling deep convolutional neural networks. IEEE Transactions on Emerging
model for various sampling rates. arXiv preprint arXiv:2206.08545 . Topics in Computational Intelligence 2, 117–128.
[169] Han, W., Zhang, Z., Zhang, Y., Yu, J., Chiu, C.C., Qin, J., Gulati, A., [189] Houlsby, N., , et al., 2019a. Parameter-efficient transfer learning for
Pang, R., Wu, Y., 2020. Contextnet: Improving convolutional neural nlp, in: International Conference on Machine Learning, pp. 2790–
networks for automatic speech recognition with global context. arXiv 2799.
preprint arXiv:2005.03191 . [190] Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Larous-
[170] Hanifa, R.M., Isa, K., Mohamad, S., 2021. A review on speaker silhe, Q., Gesmundo, A., Attariyan, M., Gelly, S., 2019b. Parameter-
recognition: Technology and challenges. Computers & Electrical efficient transfer learning for NLP, in: Chaudhuri, K., Salakhutdinov,
Engineering 90, 107005. R. (Eds.), Proceedings of the 36th International Conference on Ma-
[171] Hansen, P.S., 1997. Signal subspace methods for speech enhancement. chine Learning, PMLR. pp. 2790–2799. URL: https://proceedings.
Ph.D. thesis. Citeseer. mlr.press/v97/houlsby19a.html.
[172] Hao, X., Su, X., Horaud, R., Li, X., 2021. Fullsubnet: A full-band [191] Hsu, C.C., Hwang, H.T., Wu, Y.C., Tsao, Y., Wang, H.M., 2017.
and sub-band fusion model for real-time single-channel speech en- Voice conversion from unaligned corpora using variational autoen-
hancement, in: ICASSP 2021-2021 IEEE International Conference coding wasserstein generative adversarial networks. arXiv preprint
on Acoustics, Speech and Signal Processing (ICASSP), IEEE. pp. arXiv:1704.00849 .


[192] Hsu, J.Y., Chen, Y.J., Lee, H.y., 2020. Meta learning for end-to- Voice transformer network: Sequence-to-sequence voice conversion
end low-resource speech recognition, in: ICASSP 2020-2020 IEEE using transformer with text-to-speech pretraining. arXiv preprint
International Conference on Acoustics, Speech and Signal Processing arXiv:1912.06813 .
(ICASSP), IEEE. pp. 7844–7848. [211] Huang, Z., Li, H., Lei, M., 2020. Devicetts: A small-footprint,
[193] Hsu, W.N., Remez, T., Shi, B., Donley, J., Adi, Y., 2022a. Revise: fast, stable network for on-device text-to-speech. arXiv preprint
Self-supervised speech resynthesis with visual input for universal and arXiv:2010.15311 .
generalized speech enhancement. arXiv preprint arXiv:2212.11377 . [212] Hung, Y.N., Wu, C.W., Orife, I., Hipple, A., Wolcott, W., Lerch,
[194] Hsu, W.N., Zhang, Y., Weiss, R.J., Chung, Y.A., Wang, Y., Wu, A., 2022. A large tv dataset for speech and music activity detection.
Y., Glass, J., 2019. Disentangling correlated speaker and noise for EURASIP Journal on Audio, Speech, and Music Processing 2022,
speech synthesis via data augmentation and adversarial factorization, 21.
in: ICASSP 2019-2019 IEEE International Conference on Acoustics, [213] Hwang, D., Misra, A., Huo, Z., Siddhartha, N., Garg, S., Qiu, D.,
Speech and Signal Processing (ICASSP), IEEE. pp. 5901–5905. Sim, K.C., Strohman, T., Beaufays, F., He, Y., 2022. Large-scale
[195] Hsu, W.N., Zhang, Y., Weiss, R.J., Zen, H., Wu, Y., Wang, Y., Cao, Y., asr domain adaptation using self-and semi-supervised learning, in:
Jia, Y., Chen, Z., Shen, J., et al., 2018. Hierarchical generative mod- ICASSP 2022-2022 IEEE International Conference on Acoustics,
eling for controllable speech synthesis, in: International Conference Speech and Signal Processing (ICASSP), IEEE. pp. 6627–6631.
on Learning Representations. [214] Inaguma, H., Kiyono, S., Duh, K., Karita, S., Soplin, N.E.Y., Hayashi,
[196] Hsu, W.N., et al., 2021. Hubert: Self-supervised speech represen- T., Watanabe, S., 2020. Espnet-st: All-in-one speech translation
tation learning by masked prediction of hidden units. IEEE/ACM toolkit. arXiv preprint arXiv:2004.10234 .
Transactions on Audio, Speech, and Language Processing 29, 3451– [215] Indurthi, S., Han, H., Lakumarapu, N.K., Lee, B., Chung, I., Kim,
3460. S., Kim, C., 2020. End-end speech-to-text translation with modality
[197] Hsu, Y.C., Hua, T., Chang, S.E., Lou, Q., Shen, Y., Jin, H., 2022b. agnostic meta-learning, in: ICASSP 2020-2020 IEEE International
Language model compression with weighted low-rank factorization. Conference on Acoustics, Speech and Signal Processing (ICASSP),
ArXiv abs/2207.00112. IEEE. pp. 7904–7908.
[198] Hu, E.J., yelong shen, Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., [216] Isik, U., Giri, R., Phansalkar, N., Valin, J.M., Helwani, K., Krish-
Wang, L., Chen, W., 2022a. LoRA: Low-rank adaptation of large naswamy, A., 2020. Poconet: Better speech enhancement with
language models, in: International Conference on Learning Repre- frequency-positional embeddings, semi-supervised conversational
sentations. URL: https://openreview.net/forum?id=nZeVKeeFYf9. data, and biased loss. arXiv preprint arXiv:2008.04470 .
[199] Hu, E.J., et al., 2021. Lora: Low-rank adaptation of large language [217] Ito, K., Johnson, L., 2017. The lj speech dataset. https://keithito.
models, in: International Conference on Learning Representations. com/LJ-Speech-Dataset/.
[200] Hu, H.R., Song, Y., Liu, Y., Dai, L.R., McLoughlin, I., Liu, L., 2022b. [218] Jeong, M., Kim, H., Cheon, S.J., Choi, B.J., Kim, N.S., 2021. Diff-
Domain robust deep embedding learning for speaker recognition, in: tts: A denoising diffusion model for text-to-speech. arXiv preprint
ICASSP 2022-2022 IEEE International Conference on Acoustics, arXiv:2104.01409 .
Speech and Signal Processing (ICASSP), IEEE. pp. 7182–7186. [219] Jia, Y., Ramanovich, M.T., Remez, T., Pomerantz, R., 2022. Transla-
[201] Hu, J., Shen, L., Albanie, S., Sun, G., Wu, E., 2017. Squeeze- totron 2: High-quality direct speech-to-speech translation with voice
and-excitation networks. URL: https://arxiv.org/abs/1709.01507, preservation, in: International Conference on Machine Learning,
doi:10.48550/ARXIV.1709.01507. PMLR. pp. 10120–10134.
[202] Hu, Y., Liu, Y., Lv, S., Xing, M., Zhang, S., Fu, Y., Wu, J., [220] Jiang, D., Li, W., Cao, M., Zou, W., Li, X., 2020. Speech simclr: Com-
Zhang, B., Xie, L., 2020. Dccrn: Deep complex convolution re- bining contrastive and reconstruction objective for self-supervised
97. doi:10.1109/ASRU51503.2021.9688271. [522] Sun, A., Wang, J., Cheng, N., Peng, H., Zeng, Z., Kong, L., Xiao,
[502] Singh, P., Kaul, A., Ganapathy, S., 2023. Supervised hierarchical J., 2021. Graphpb: Graphical representations of prosody boundary
clustering using graph neural networks for speaker diarization. arXiv in speech synthesis, in: 2021 IEEE Spoken Language Technology
preprint arXiv:2302.12716 . Workshop (SLT), IEEE. pp. 438–445.
[503] Singh, S., Wang, R., Hou, F., 2022. Improved meta learning for [523] Sun, A., Wang, J., Cheng, N., Peng, H., Zeng, Z., Xiao, J., 2020.
low resource speech recognition, in: ICASSP 2022-2022 IEEE In- Graphtts: Graph-to-sequence modelling in neural text-to-speech, in:
ternational Conference on Acoustics, Speech and Signal Processing ICASSP 2020 - 2020 IEEE International Conference on Acoustics,
(ICASSP), IEEE. pp. 4798–4802. Speech and Signal Processing (ICASSP), pp. 6719–6723. doi:10.
[504] Siuzdak, H., Dura, P., van Rijn, P., Jacoby, N., 2022. Wavthruvec: La- 1109/ICASSP40776.2020.9053355.
tent speech representation as intermediate features for neural speech [524] Sung, T.W., Liu, J.Y., Lee, H.y., Lee, L.s., 2019. Towards end-to-
synthesis. arXiv preprint arXiv:2203.16930 . end speech-to-text translation with two-pass decoding, in: ICASSP
[505] Skerry-Ryan, R., Battenberg, E., Xiao, Y., Wang, Y., Stanton, D., 2019-2019 IEEE International Conference on Acoustics, Speech and
Shor, J., Weiss, R., Clark, R., Saurous, R.A., 2018. Towards end-to- Signal Processing (ICASSP), IEEE. pp. 7175–7179.
end prosody transfer for expressive speech synthesis with tacotron, [525] Suno-AI, 2023. Bark. URL: https://github.com/suno-ai/bark.
in: international conference on machine learning, PMLR. pp. 4693– [526] Synnaeve, G., Xu, Q., Kahn, J., Likhomanenko, T., Grave, E., Pratap,
4702. V., Sriram, A., Liptchinsky, V., Collobert, R., 2020. End-to-end asr:
[506] Smith, N., Gales, M., 2001. Speech recognition using svms. Advances from supervised to semi-supervised learning with modern architec-
in neural information processing systems 14. tures, in: ICML 2020 Workshop on Self-supervision in Audio and
[507] Snell, J., Swersky, K., Zemel, R., 2017. Prototypical networks for few- Speech.
shot learning. Advances in neural information processing systems [527] Tae, J., Kim, H., Kim, T., 2021. Editts: Score-based editing for
30. controllable text-to-speech. CoRR abs/2110.02584. URL: https:
[508] Snyder, D., Garcia-Romero, D., Povey, D., Khudanpur, S., 2017. //arxiv.org/abs/2110.02584, arXiv:2110.02584.
Deep neural network embeddings for text-independent speaker verifi- [528] Tan, K., Wang, D., 2019. Learning complex spectral mapping with
cation., in: Interspeech, pp. 999–1003. gated convolutional recurrent networks for monaural speech enhance-
[509] Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., Khudanpur, S., ment. IEEE/ACM Transactions on Audio, Speech, and Language
2018. X-vectors: Robust dnn embeddings for speaker recognition, in: Processing 28, 380–390.
2018 IEEE international conference on acoustics, speech and signal [529] Tan, L., Karnjanadecha, M., 2003. Pitch detection algorithm: auto-
processing (ICASSP), IEEE. pp. 5329–5333. correlation method and amdf, in: Proceedings of the 3rd international
[510] Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., Ganguli, S., symposium on communications and information technology, pp. 551–
2015. Deep unsupervised learning using nonequilibrium thermody- 556.
namics, in: International Conference on Machine Learning, PMLR. [530] Tanaka, K., Kameoka, H., Kaneko, T., Hojo, N., 2019. Atts2s-
pp. 2256–2265. vc: Sequence-to-sequence voice conversion with attention and con-
[511] Solomonoff, A., Campbell, W.M., Boardman, I., 2005. Advances in text preservation mechanisms, in: ICASSP 2019 - 2019 IEEE In-
channel compensation for svm speaker recognition, in: Proceed- ternational Conference on Acoustics, Speech and Signal Processing
ings.(ICASSP’05). IEEE International Conference on Acoustics, (ICASSP), pp. 6805–6809. doi:10.1109/ICASSP.2019.8683282.
Speech, and Signal Processing, 2005., IEEE. pp. I–629. [531] Tang, C., Luo, C., Zhao, Z., Xie, W., Zeng, W., 2021. Joint time-
[512] Solomonoff, A., Quillen, C., Campbell, W.M., 2004. Channel com- frequency and time domain learning for speech enhancement, in:
pensation for svm speaker recognition., in: Odyssey, pp. 219–226. Proceedings of the Twenty-Ninth International Conference on Inter-
[513] Son Chung, J., Senior, A., Vinyals, O., Zisserman, A., 2017. Lip national Joint Conferences on Artificial Intelligence, pp. 3816–3822.
reading sentences in the wild, in: Proceedings of the IEEE conference [532] Tang, Y., Ding, G., Huang, J., He, X., Zhou, B., 2019. Deep speaker
on computer vision and pattern recognition, pp. 6447–6456. embedding learning with multi-level pooling for text-independent
[514] Sondhi, M., Schroeter, J., 1987. A hybrid time-frequency domain speaker verification, in: ICASSP 2019-2019 IEEE International Con-
articulatory speech synthesizer. IEEE Transactions on Acoustics, ference on Acoustics, Speech and Signal Processing (ICASSP), IEEE.
Speech, and Signal Processing 35, 955–967. doi:10.1109/TASSP.1987. pp. 6116–6120.
1165240. [533] Tao, F., Busso, C., 2019. End-to-end audiovisual speech activity
[515] Song, Y., Zhu, J., Li, D., Wang, X., Qi, H., 2018. Talking face gener- detection with bimodal recurrent neural models. Speech Communi-

Mehrish et al.: Preprint submitted to Elsevier Page 68 of 72


A Review of Deep Learning Techniques for Speech Processing

cation 113, 25–35. [553] Variani, E., Lei, X., McDermott, E., Moreno, I.L., Gonzalez-
[534] Tian, X., Chng, E.S., Li, H., 2019. A vocoder-free wavenet voice Dominguez, J., 2014a. Deep neural networks for small footprint
conversion with non-parallel data. arXiv preprint arXiv:1902.03705 . text-dependent speaker verification, in: 2014 IEEE international con-
[535] Tits, N., Wang, F., El Haddad, K., Pagel, V., Dutoit, T., 2019. Visual- ference on acoustics, speech and signal processing (ICASSP), IEEE.
ization and interpretation of latent spaces for controlling expressive pp. 4052–4056.
speech synthesis through audio analysis. Proc. Interspeech 2019 , [554] Variani, E., Lei, X., McDermott, E., Moreno, I.L., Gonzalez-
4475–4479. Dominguez, J., 2014b. Deep neural networks for small footprint
[536] Tjandra, A., Sakti, S., Nakamura, S., 2018. Sequence-to-sequence asr text-dependent speaker verification, in: 2014 IEEE International
optimization via reinforcement learning, in: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),
Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4052–4056. doi:10.1109/ICASSP.2014.6854363.
IEEE. pp. 5829–5833. [555] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez,
[537] Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., A.N., Kaiser, Ł., Polosukhin, I., 2017. Attention is all you need.
Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, Advances in neural information processing systems 30.
A., Joulin, A., Grave, E., Lample, G., 2023. Llama: Open and efficient [556] Velickovic, P., Cucurull, G., Casanova, A., Romero, A., Lio, P.,
foundation language models. ArXiv abs/2302.13971. Bengio, Y., et al., 2017. Graph attention networks. stat 1050, 10–
[538] Tranter, S.E., Reynolds, D.A., 2006. An overview of automatic 48550.
speaker diarization systems. IEEE Transactions on audio, speech, [557] Veličković, P., Fedus, W., Hamilton, W.L., Liò, P., Bengio, Y., Hjelm,
and language processing 14, 1557–1565. R.D., 2018. Deep graph infomax. arXiv preprint arXiv:1809.10341 .
[539] Tsunoo, E., Kashiwagi, Y., Kumakura, T., Watanabe, S., 2019. Trans- [558] Vincent, E., Virtanen, T., Gannot, S., 2018. Audio source separation
former asr with contextual block processing, in: 2019 IEEE Auto- and speech enhancement. John Wiley & Sons.
matic Speech Recognition and Understanding Workshop (ASRU), [559] Vuong, T., Xia, Y., Stern, R.M., 2021. A modulation-domain loss
IEEE. pp. 427–433. for neural-network-based real-time speech enhancement, in: ICASSP
[540] Tu, T., Chen, Y.J., Yeh, C.c., Lee, H.Y., 2019. End-to-end text-to- 2021-2021 IEEE International Conference on Acoustics, Speech and
speech for low-resource languages by cross-lingual transfer learning. Signal Processing (ICASSP), IEEE. pp. 6643–6647.
arXiv preprint arXiv:1904.06508 . [560] Vygon, R., Mikhaylovskiy, N., 2021. Learning efficient representa-
[541] Tüske, Z., Audhkhasi, K., Saon, G., 2019. Advancing sequence-to- tions for keyword spotting with triplet loss, in: Speech and Computer:
sequence based speech recognition., in: Interspeech, pp. 3780–3784. 23rd International Conference, SPECOM 2021, St. Petersburg, Rus-
[542] Tzinis, E., Adi, Y., Ithapu, V.K., Xu, B., Kumar, A., 2022a. Continual sia, September 27–30, 2021, Proceedings 23, Springer. pp. 773–785.
self-training with bootstrapped remixing for speech enhancement, [561] Wan, L., Wang, Q., Papir, A., Moreno, I.L., 2018. Generalized
in: ICASSP 2022-2022 IEEE International Conference on Acoustics, end-to-end loss for speaker verification, in: 2018 IEEE International
Speech and Signal Processing (ICASSP), IEEE. pp. 6947–6951. Conference on Acoustics, Speech and Signal Processing (ICASSP),
[543] Tzinis, E., Adi, Y., Ithapu, V.K., Xu, B., Smaragdis, P., Kumar, A., IEEE. pp. 4879–4883.
2022b. Remixit: Continual self-training of speech enhancement [562] Wang, C., Chen, S., Wu, Y., Zhang, Z., Zhou, L., Liu, S., Chen,
models via bootstrapped remixing. IEEE Journal of Selected Topics Z., Liu, Y., Wang, H., Li, J., et al., 2023a. Neural codec language
in Signal Processing 16, 1329–1341. models are zero-shot text to speech synthesizers. arXiv preprint
[544] Tzirakis, P., Kumar, A., Donley, J., 2021. Multi-channel speech arXiv:2301.02111 .
enhancement using graph neural networks, in: ICASSP 2021-2021 [563] Wang, C., Tang, Y., Ma, X., Wu, A., Popuri, S., Okhonko, D., Pino, J.,
IEEE International Conference on Acoustics, Speech and Signal 2020a. fairseq s2t: Fast speech-to-text modeling with fairseq. arXiv
Processing (ICASSP), IEEE. pp. 3415–3419. preprint arXiv:2010.05171 .
[545] Um, S.Y., Oh, S., Byun, K., Jang, I., Ahn, C., Kang, H.G., 2020. [564] Wang, C., Wu, A., Pino, J., 2020b. Covost 2 and massively multi-
Emotional speech synthesis with rich and granularized control, in: lingual speech-to-text translation. arXiv preprint arXiv:2007.10310
ICASSP 2020 - 2020 IEEE International Conference on Acoustics, .
Speech and Signal Processing (ICASSP), pp. 7254–7258. doi:10. [565] Wang, C., Wu, Y., Qian, Y., Kumatani, K., Liu, S., Wei, F., Zeng,
1109/ICASSP40776.2020.9053732. M., Huang, X., 2021a. Unispeech: Unified speech representation
[546] Vainer, J., Dušek, O., 2020. Speedyspeech: Efficient neural speech learning with labeled and unlabeled data, in: International Conference
synthesis. arXiv preprint arXiv:2008.03802 . on Machine Learning, PMLR. pp. 10937–10947.
[547] Valin, J.M., Isik, U., Smaragdis, P., Krishnaswamy, A., 2022. Neural [566] Wang, F., Tax, D.M., 2016. Survey on the attention based rnn
speech synthesis on a shoestring: Improving the efficiency of lpcnet, model and its applications in computer vision. arXiv preprint
in: ICASSP 2022-2022 IEEE International Conference on Acoustics, arXiv:1601.06823 .
Speech and Signal Processing (ICASSP), IEEE. pp. 8437–8441. [567] Wang, G., 2019. Deep text-to-speech system with seq2seq model.
[548] Valin, J.M., Skoglund, J., 2019. Lpcnet: Improving neural speech arXiv preprint arXiv:1903.07398 .
synthesis through linear prediction, in: ICASSP 2019-2019 IEEE [568] Wang, H., Wang, D., 2020. Time-frequency loss for cnn based
International Conference on Acoustics, Speech and Signal Processing speech super-resolution, in: ICASSP 2020 - 2020 IEEE International
(ICASSP), IEEE. pp. 5891–5895. Conference on Acoustics, Speech and Signal Processing (ICASSP),
[549] Valle, R., Li, J., Prenger, R., Catanzaro, B., 2020a. Mellotron: Multi- pp. 861–865. doi:10.1109/ICASSP40776.2020.9053712.
speaker expressive voice synthesis by conditioning on rhythm, pitch [569] Wang, H., Wang, Y., Zhou, Z., Ji, X., Gong, D., Zhou, J., Li, Z.,
and global style tokens, in: ICASSP 2020 - 2020 IEEE International Liu, W., 2018a. Cosface: Large margin cosine loss for deep face
Conference on Acoustics, Speech and Signal Processing (ICASSP), recognition, in: Proceedings of the IEEE conference on computer
pp. 6189–6193. doi:10.1109/ICASSP40776.2020.9054556. vision and pattern recognition, pp. 5265–5274.
[550] Valle, R., Shih, K., Prenger, R., Catanzaro, B., 2020b. Flowtron: [570] Wang, J., He, Y., Zhao, C., Shao, Q., Tu, W.W., Ko, T., Lee, H.y., Xie,
an autoregressive flow-based generative network for text-to-speech L., 2021b. Auto-kws 2021 challenge: Task, datasets, and baselines.
synthesis. arXiv preprint arXiv:2005.05957 . arXiv preprint arXiv:2104.00513 .
[551] Van Den Oord, A., Vinyals, O., et al., 2017. Neural discrete represen- [571] Wang, J., Xiao, X., Wu, J., Ramamurthy, R., Rudzicz, F., Brudno, M.,
tation learning. Advances in neural information processing systems 2020c. Speaker diarization with session-level speaker embedding
30. refinement using graph neural networks, in: ICASSP 2020 - 2020
[552] Vanzo, A., Croce, D., Bastianelli, E., Basili, R., Nardi, D., 2016. Ro- IEEE International Conference on Acoustics, Speech and Signal
bust spoken language understanding for house service robots. Polibits Processing (ICASSP), pp. 7109–7113. doi:10.1109/ICASSP40776.2020.
, 11–16. 9054176.

Mehrish et al.: Preprint submitted to Elsevier Page 69 of 72


A Review of Deep Learning Techniques for Speech Processing

[572] Wang, Q., Downey, C., Wan, L., Mansfield, P.A., Moreno, I.L., 2018b. [590] Weiss, R.J., Skerry-Ryan, R., Battenberg, E., Mariooryad, S.,
Speaker diarization with lstm, in: 2018 IEEE International conference Kingma, D.P., 2021. Wave-tacotron: Spectrogram-free end-to-end
on acoustics, speech and signal processing (ICASSP), IEEE. pp. 5239– text-to-speech synthesis, in: ICASSP 2021-2021 IEEE International
5243. Conference on Acoustics, Speech and Signal Processing (ICASSP),
[573] Wang, Q., Guo, P., Sun, S., Xie, L., Hansen, J.H., 2019a. Adver- IEEE. pp. 5679–5683.
sarial regularization for end-to-end robust speaker verification., in: [591] Weng, C., Cui, J., Wang, G., Wang, J., Yu, C., Su, D., Yu, D., 2018.
Interspeech, pp. 4010–4014. Improving attention based sequence-to-sequence models for end-to-
[574] Wang, Q., Rao, W., Sun, S., Xie, L., Chng, E.S., Li, H., 2018c. end english conversational speech recognition., in: Interspeech, pp.
Unsupervised domain adaptation via domain adversarial training 761–765.
for speaker recognition, in: 2018 IEEE International Conference on [592] Westhausen, N.L., Meyer, B.T., 2020. Dual-signal transforma-
Acoustics, Speech and Signal Processing (ICASSP), pp. 4889–4893. tion lstm network for real-time noise suppression. arXiv preprint
doi:10.1109/ICASSP.2018.8461423. arXiv:2005.07551 .
[575] Wang, T., Deng, J., Geng, M., Ye, Z., Hu, S., Wang, Y., Cui, M., [593] Winata, G.I., Cahyawijaya, S., Lin, Z., Liu, Z., Fung, P., 2020.
Jin, Z., Liu, X., Meng, H., 2022a. Conformer based elderly speech Lightweight and efficient end-to-end speech recognition using low-
recognition system for alzheimer’s disease detection. arXiv preprint rank transformer, in: ICASSP 2020 - 2020 IEEE International Con-
arXiv:2206.13232 . ference on Acoustics, Speech and Signal Processing (ICASSP), pp.
[576] Wang, T., Pan, Z., Ge, M., Yang, Z., Li, H., 2023b. Time-domain 6144–6148. doi:10.1109/ICASSP40776.2020.9053878.
speech separation networks with graph encoding auxiliary. IEEE [594] Wu, D.Y., Lee, H.y., 2020a. One-shot voice conversion by vector
Signal Processing Letters 30, 110–114. quantization, in: ICASSP 2020-2020 IEEE International Conference
[577] Wang, W., Lin, Q., Cai, D., Li, M., 2022b. Similarity measure- on Acoustics, Speech and Signal Processing (ICASSP), IEEE. pp.
ment of segment-level speaker embeddings in speaker diarization. 7734–7738.
IEEE/ACM Transactions on Audio, Speech, and Language Process- [595] Wu, D.Y., Lee, H.y., 2020b. One-shot voice conversion by vector
ing 30, 2645–2658. quantization, in: ICASSP 2020 - 2020 IEEE International Conference
[578] Wang, X., Li, L., Wang, D., 2019b. Vae-based domain adaptation for on Acoustics, Speech and Signal Processing (ICASSP), pp. 7734–
speaker verification, in: 2019 Asia-Pacific Signal and Information 7738. doi:10.1109/ICASSP40776.2020.9053854.
Processing Association Annual Summit and Conference (APSIPA [596] Wu, J., Hua, Y., Yang, S., Qin, H., Qin, H., 2019. Speech enhance-
ASC), IEEE. pp. 535–539. ment using generative adversarial network by distilling knowledge
[579] Wang, X., Ming, H., He, L., Soong, F.K., 2020d. s-transformer: from statistical method. Applied Sciences 9, 3396.
Segment-transformer for robust neural speech synthesis. arXiv [597] Wu, S., Shi, Z., 2021. Itotts and itowave: Linear stochastic differ-
preprint arXiv:2011.08480 . ential equation is all you need for audio generation. arXiv preprint
[580] Wang, Y., Ju, Z., Tan, X., He, L., Wu, Z., Bian, J., Zhao, S., 2023c. arXiv:2105.07583 .
Audit: Audio editing by following instructions with latent diffusion [598] Wu, X., 2022. Deep sparse conformer for speech recognition. arXiv
models. arXiv:2304.00830. preprint arXiv:2209.00260 .
[581] Wang, Y., Shen, Y., Jin, H., 2018d. A bi-model based rnn seman- [599] Wu, Y., Tan, X., Li, B., He, L., Zhao, S., Song, R., Qin, T., Liu, T.Y.,
tic frame parsing model for intent detection and slot filling. arXiv 2022. Adaspeech 4: Adaptive text to speech in zero-shot scenarios.
preprint arXiv:1812.10235 . arXiv preprint arXiv:2204.00436 .
[582] Wang, Y., Shi, Y., Zhang, F., Wu, C., Chan, J., Yeh, C.F., Xiao, A., [600] Xia, W., Huang, J., Hansen, J.H., 2019. Cross-lingual text-
2021c. Transformer in action: A comparative study of transformer- independent speaker verification using unsupervised adversarial dis-
based acoustic models for large scale speech recognition applications, criminative domain adaptation, in: ICASSP 2019-2019 IEEE Inter-
in: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, national Conference on Acoustics, Speech and Signal Processing
Speech and Signal Processing (ICASSP), pp. 6778–6782. doi:10. (ICASSP), IEEE. pp. 5816–5820.
1109/ICASSP39728.2021.9414087. [601] Xiao, X., Kanda, N., Chen, Z., Zhou, T., Yoshioka, T., Chen, S.,
[583] Wang, Y., Skerry-Ryan, R., Stanton, D., Wu, Y., Weiss, R.J., Jaitly, N., Zhao, Y., Liu, G., Wu, Y., Wu, J., et al., 2021. Microsoft speaker
Yang, Z., Xiao, Y., Chen, Z., Bengio, S., et al., 2017. Tacotron: To- diarization system for the voxceleb speaker recognition challenge
wards end-to-end speech synthesis. arXiv preprint arXiv:1703.10135 2020, in: ICASSP 2021-2021 IEEE International Conference on
. Acoustics, Speech and Signal Processing (ICASSP), IEEE. pp. 5824–
[584] Wang, Y., Stanton, D., Zhang, Y., Ryan, R.S., Battenberg, E., Shor, 5828.
J., Xiao, Y., Jia, Y., Ren, F., Saurous, R.A., 2018e. Style tokens: Un- [602] Xin, D., Saito, Y., Takamichi, S., Koriyama, T., Saruwatari, H., 2020.
supervised style modeling, control and transfer in end-to-end speech Cross-lingual text-to-speech synthesis via domain adaptation and
synthesis, in: International Conference on Machine Learning, PMLR. perceptual similarity regression in speaker space., in: Interspeech,
pp. 5180–5189. pp. 2947–2951.
[585] Wang, Y., Zhang, S., Lee, J., 2019c. Bridging commonsense rea- [603] Xu, C., Hu, B., Li, Y., Zhang, Y., Ju, Q., Xiao, T., Zhu, J., et al.,
soning and probabilistic planning via a probabilistic action language. 2021a. Stacked acoustic-and-textual encoding: Integrating the pre-
Theory and Practice of Logic Programming 19, 1090–1106. trained models into speech translation encoders. arXiv preprint
[586] Wang, Z., Wohlwend, J., Lei, T., 2019d. Structured pruning of large arXiv:2105.05752 .
language models. CoRR abs/1910.04732. URL: http://arxiv.org/ [604] Xu, J., Tan, X., Ren, Y., Qin, T., Li, J., Zhao, S., Liu, T.Y., 2020.
abs/1910.04732, arXiv:1910.04732. Lrspeech: Extremely low-resource speech synthesis and recognition,
[587] Wang, Z.Q., Le Roux, J., Hershey, J.R., 2018f. Alternative objective in: Proceedings of the 26th ACM SIGKDD International Conference
functions for deep clustering, in: 2018 IEEE International Conference on Knowledge Discovery & Data Mining, pp. 2802–2812.
on Acoustics, Speech and Signal Processing (ICASSP), IEEE. pp. [605] Xu, Q., Baevski, A., Likhomanenko, T., Tomasello, P., Conneau,
686–690. A., Collobert, R., Synnaeve, G., Auli, M., 2021b. Self-training and
[588] Wang, Z.Q., Wang, P., Wang, D., 2020e. Complex spectral mapping pre-training are complementary for speech recognition, in: ICASSP
for single-and multi-channel speech enhancement and robust asr. 2021-2021 IEEE International Conference on Acoustics, Speech and
IEEE/ACM transactions on audio, speech, and language processing Signal Processing (ICASSP), IEEE. pp. 3030–3034.
28, 1778–1787. [606] Xu, Y., Du, J., Dai, L.R., Lee, C.H., 2014. A regression approach
[589] Warden, P., 2018. Speech commands: A dataset for limited- to speech enhancement based on deep neural networks. IEEE/ACM
vocabulary speech recognition. arXiv preprint arXiv:1804.03209 Transactions on Audio, Speech, and Language Processing 23, 7–19.
. [607] Xue, J., Deng, Y., Han, Y., Li, Y., Sun, J., Liang, J., 2022. Ecapa-tdnn

Mehrish et al.: Preprint submitted to Elsevier Page 70 of 72


A Review of Deep Learning Techniques for Speech Processing

for multi-speaker text-to-speech synthesis, in: 2022 13th International [626] You, J., Kim, D., Nam, G., Hwang, G., Chae, G., 2021. Gan
Symposium on Chinese Spoken Language Processing (ISCSLP), vocoder: Multi-resolution discriminator is all you need. arXiv
IEEE. pp. 230–234. preprint arXiv:2103.05236 .
[608] Yamamoto, R., Song, E., Kim, J.M., 2020. Parallel wavegan: A fast [627] Yu, C., Lu, H., Hu, N., Yu, M., Weng, C., Xu, K., Liu, P., Tuo, D.,
waveform generation model based on generative adversarial networks Kang, S., Lei, G., et al., 2019. Durian: Duration informed attention
with multi-resolution spectrogram, in: ICASSP 2020-2020 IEEE network for multimodal synthesis. arXiv preprint arXiv:1909.01700
International Conference on Acoustics, Speech and Signal Processing .
(ICASSP), IEEE. pp. 6199–6203. [628] Yu, D., Deng, L., 2016. Automatic speech recognition. volume 1.
[609] Yan, Y., Tan, X., Li, B., Qin, T., Zhao, S., Shen, Y., Liu, T.Y., 2021. Springer.
Adaspeech 2: Adaptive text to speech with untranscribed data, in: [629] Yu, F., Koltun, V., 2015. Multi-scale context aggregation by dilated
ICASSP 2021-2021 IEEE International Conference on Acoustics, convolutions. CoRR abs/1511.07122.
Speech and Signal Processing (ICASSP), IEEE. pp. 6613–6617. [630] Yu, Y., Park, D., Kim, H.K., 2022. Auxiliary loss of transformer with
[610] Yang, D., Liu, S., Yu, J., Wang, H., Weng, C., Zou, Y., 2022a. Nore- residual connection for end-to-end speaker diarization, in: ICASSP
speech: Knowledge distillation based conditional diffusion model for 2022-2022 IEEE International Conference on Acoustics, Speech and
noise-robust expressive tts. arXiv preprint arXiv:2211.02448 . Signal Processing (ICASSP), IEEE. pp. 8377–8381.
[611] Yang, D., Yu, J., Wang, H., Wang, W., Weng, C., Zou, Y., Yu, D., [631] Yue, F., Deng, Y., He, L., Ko, T., Zhang, Y., 2022. Exploring machine
2022b. Diffsound: Discrete diffusion model for text-to-sound genera- speech chain for domain adaptation, in: ICASSP 2022-2022 IEEE
tion. arXiv preprint arXiv:2207.09983 . International Conference on Acoustics, Speech and Signal Processing
[612] Yang, G., Yang, S., Liu, K., Fang, P., Chen, W., Xie, L., 2021a. (ICASSP), IEEE. pp. 6757–6761.
Multi-band melgan: Faster waveform generation for high-quality text- [632] Yun, S., Jeong, M., Kim, R., Kang, J., Kim, H.J., 2019. Graph
to-speech, in: 2021 IEEE Spoken Language Technology Workshop transformer networks. Advances in neural information processing
(SLT), IEEE. pp. 492–498. systems 32.
[613] Yang, G.P., Tuan, C.I., Lee, H.Y., Lee, L.s., 2019a. Improved speech [633] Zeghidour, N., Grangier, D., 2021. Wavesplit: End-to-end speech
separation with time-and-frequency cross-domain joint embedding separation by speaker clustering. IEEE/ACM Transactions on Audio,
and clustering. arXiv preprint arXiv:1904.07845 . Speech, and Language Processing 29, 2840–2849.
[614] Yang, J., Lee, J., Kim, Y., Cho, H., Kim, I., 2020. Vocgan: A high- [634] Zeinali, H., Wang, S., Silnova, A., Matějka, P., Plchot, O., 2019. But
fidelity real-time vocoder with a hierarchically-nested adversarial system description to voxceleb speaker recognition challenge 2019.
network. arXiv preprint arXiv:2007.15256 . arXiv preprint arXiv:1910.12592 .
[615] Yang, S., Chi, P., Chuang, Y., Lai, C.J., Lakhotia, K., Lin, Y.Y., Liu, [635] Zeremdini, J., Ben Messaoud, M.A., Bouzid, A., 2015. A comparison
A.T., Shi, J., Chang, X., Lin, G., Huang, T., Tseng, W., Lee, K., of several computational auditory scene analysis (casa) techniques
Liu, D., Huang, Z., Dong, S., Li, S., Watanabe, S., Mohamed, A., for monaural speech segregation. Brain informatics 2, 155–166.
Lee, H., 2021b. SUPERB: speech processing universal performance [636] Zeyer, A., Merboldt, A., Michel, W., Schlüter, R., Ney, H., 2021.
benchmark. CoRR abs/2105.01051. URL: https://arxiv.org/abs/ Librispeech transducer model with internal language model prior
2105.01051, arXiv:2105.01051. correction. arXiv preprint arXiv:2104.03006 .
[616] Yang, S., Liu, M., 2022. Data augmentation for speaker verifica- [637] Zhang, A., Wang, Q., Zhu, Z., Paisley, J., Wang, C., 2019a. Fully
tion, in: Proceedings of the 2022 6th International Conference on supervised speaker diarization, in: ICASSP 2019-2019 IEEE Inter-
Electronic Information Technology and Computer Engineering, pp. national Conference on Acoustics, Speech and Signal Processing
1247–1251. (ICASSP), IEEE. pp. 6301–6305.
[617] Yang, S., Zhang, Y., Feng, D., Yang, M., Wang, C., Xiao, J., Long, [638] Zhang, B., Haddow, B., Sennrich, R., 2022a. Revisiting end-to-end
K., Shan, S., Chen, X., 2019b. Lrw-1000: A naturally-distributed speech-to-text translation from scratch, in: International Conference
large-scale benchmark for lip reading in the wild, in: 2019 14th IEEE on Machine Learning, PMLR. pp. 26193–26205.
International Conference on Automatic Face and Gesture Recognition [639] Zhang, B., Titov, I., Haddow, B., Sennrich, R., 2020a. Adaptive
(FG 2019), pp. 1–8. doi:10.1109/FG.2019.8756582. feature selection for end-to-end speech translation. arXiv preprint
[618] Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R.R., Le, arXiv:2010.08518 .
Q.V., 2019c. Xlnet: Generalized autoregressive pretraining for lan- [640] Zhang, C., Koishida, K., 2017. End-to-end text-independent speaker
guage understanding. Advances in neural information processing verification with triplet loss on short utterances., in: Interspeech, pp.
systems 32. 1487–1491.
[619] Yao, Z., Dong, Z., Zheng, Z., Gholami, A., Yu, J., Tan, E., Wang, L., [641] Zhang, C., Li, Y., Du, N., Fan, W., Yu, P.S., 2018a. Joint slot filling
Huang, Q., Wang, Y., Mahoney, M.W., Keutzer, K., 2020. HAWQV3: and intent detection via capsule neural networks. arXiv preprint
dyadic neural network quantization. CoRR abs/2011.10680. URL: arXiv:1812.09471 .
https://arxiv.org/abs/2011.10680, arXiv:2011.10680. [642] Zhang, C., Ren, Y., Tan, X., Liu, J., Zhang, K., Qin, T., Zhao, S.,
[620] Yasuda, Y., Wang, X., Takaki, S., Yamagishi, J., 2019. Investiga- Liu, T.Y., 2021a. Denoispeech: Denoising text to speech with frame-
tion of enhanced tacotron text-to-speech synthesis systems with self- level noise modeling, in: ICASSP 2021-2021 IEEE International
attention for pitch accent language, in: ICASSP 2019 - 2019 IEEE Conference on Acoustics, Speech and Signal Processing (ICASSP),
International Conference on Acoustics, Speech and Signal Processing IEEE. pp. 7063–7067.
(ICASSP), pp. 6905–6909. doi:10.1109/ICASSP.2019.8682353. [643] Zhang, C., Shi, J., Weng, C., Yu, M., Yu, D., 2022b. Towards end-to-
[621] Ye, F., Yang, J., 2021. A deep neural network model for speaker end speaker diarization with generalized neural speaker clustering,
identification. Applied Sciences 11, 3603. in: ICASSP 2022-2022 IEEE International Conference on Acoustics,
[622] Ye, R., Wang, M., Li, L., 2021. End-to-end speech translation via Speech and Signal Processing (ICASSP), IEEE. pp. 8372–8376.
cross-modal progressive training. arXiv preprint arXiv:2104.10380 . [644] Zhang, H., Wang, L., Lee, K.A., Liu, M., Dang, J., Chen, H., 2021b.
[623] Yen, H., Germain, F.G., Wichern, G., Roux, J.L., 2022. Cold diffusion Meta-learning for cross-channel speaker verification, in: ICASSP
for speech enhancement. arXiv preprint arXiv:2211.02527 . 2021-2021 IEEE International Conference on Acoustics, Speech and
[624] Yoneyama, R., Yamamoto, R., Tachibana, K., 2022. Nonparallel high- Signal Processing (ICASSP), IEEE. pp. 5839–5843.
quality audio super resolution with domain adaptation and resampling [645] Zhang, H., Wang, L., Lee, K.A., Liu, M., Dang, J., Meng, H.,
cyclegans. arXiv preprint arXiv:2210.15887 . 2023a. Meta-generalization for domain-invariant speaker verifica-
[625] Yoon, J.W., Woo, B.J., Kim, N.S., 2022. Hubert-ee: Early exiting hu- tion. IEEE/ACM Transactions on Audio, Speech, and Language
bert for efficient speech recognition. arXiv preprint arXiv:2204.06328 Processing 31, 1024–1036.
. [646] Zhang, J.X., Ling, Z.H., Dai, L.R., 2018b. Forward attention in

Mehrish et al.: Preprint submitted to Elsevier Page 71 of 72


A Review of Deep Learning Techniques for Speech Processing

sequence-to-sequence acoustic modeling for speech synthesis, in: arXiv:2302.11824 .


2018 IEEE International conference on acoustics, speech and signal [664] Zhao, S., Nguyen, T.H., Ma, B., 2021a. Monaural speech enhance-
processing (ICASSP), IEEE. pp. 4789–4793. ment with complex convolutional block attention module and joint
[647] Zhang, J.X., Ling, Z.H., Dai, L.R., 2019b. Non-parallel sequence-to- time frequency losses, in: ICASSP 2021-2021 IEEE International
sequence voice conversion with disentangled linguistic and speaker Conference on Acoustics, Speech and Signal Processing (ICASSP),
representations. IEEE/ACM Transactions on Audio, Speech, and IEEE. pp. 6648–6652.
Language Processing 28, 540–552. [665] Zhao, S., Wang, H., Nguyen, T.H., Ma, B., 2021b. Towards natural
[648] Zhang, J.X., Ling, Z.H., Liu, L.J., Jiang, Y., Dai, L.R., 2019c. and controllable cross-lingual voice conversion based on neural tts
Sequence-to-sequence acoustic modeling for voice conversion. model and phonetic posteriorgram, in: ICASSP 2021-2021 IEEE
IEEE/ACM Transactions on Audio, Speech, and Language Process- International Conference on Acoustics, Speech and Signal Processing
ing 27, 631–644. (ICASSP), IEEE. pp. 5969–5973.
[649] Zhang, J.X., Liu, L.J., Chen, Y.N., Hu, Y.J., Jiang, Y., Ling, Z.H., [666] Zhao, W., Yang, Z., 2023. An emotion speech synthesis method
Dai, L.R., 2020b. Voice conversion by cascading automatic speech based on vits. Applied Sciences 13, 2225.
recognition and text-to-speech synthesis with prosody transfer. arXiv [667] Zhao, X., Yang, S., Shan, S., Chen, X., 2020b. Mutual information
preprint arXiv:2009.01475 . maximization for effective lip reading, in: 2020 15th IEEE Interna-
[650] Zhang, L., Ren, Y., Deng, L., Zhao, Z., 2022c. Hifidenoise: tional Conference on Automatic Face and Gesture Recognition (FG
High-fidelity denoising text to speech with adversarial networks, 2020), IEEE. pp. 420–427.
in: ICASSP 2022-2022 IEEE International Conference on Acoustics, [668] Zheng, C., Peng, X., Zhang, Y., Srinivasan, S., Lu, Y., 2021a. In-
Speech and Signal Processing (ICASSP), IEEE. pp. 7232–7236. teractive speech and noise modeling for speech enhancement, in:
[651] Zhang, Q., Lu, H., Sak, H., Tripathi, A., McDermott, E., Koo, S., Ku- Proceedings of the AAAI Conference on Artificial Intelligence, pp.
mar, S., 2020c. Transformer transducer: A streamable speech recog- 14549–14557.
nition model with transformer encoders and rnn-t loss, in: ICASSP [669] Zheng, R., Chen, J., Ma, M., Huang, L., 2021b. Fused acoustic and
2020-2020 IEEE International Conference on Acoustics, Speech and text encoding for multimodal bilingual pretraining and speech trans-
Signal Processing (ICASSP), IEEE. pp. 7829–7833. lation, in: International Conference on Machine Learning, PMLR.
[652] Zhang, S., Tong, H., Xu, J., Maciejewski, R., 2019d. Graph convo- pp. 12736–12746.
lutional networks: a comprehensive review. Computational Social [670] Zheng, Y., Li, X., Xie, F., Lu, L., 2020. Improving end-to-end speech
Networks 6, 1–23. synthesis with local recurrent neural network enhanced transformer,
[653] Zhang, X., Cheng, F., Wang, S., 2019e. Spatio-temporal fusion based in: ICASSP 2020-2020 IEEE International Conference on Acoustics,
convolutional sequence learning for lip reading, in: Proceedings of Speech and Signal Processing (ICASSP), IEEE. pp. 6734–6738.
the IEEE/CVF International Conference on Computer Vision, pp. [671] Zhou, H., Liu, Y., Liu, Z., Luo, P., Wang, X., 2019a. Talking face
713–722. generation by adversarially disentangled audio-visual representation,
[654] Zhang, X., Wang, J., Cheng, N., Xiao, J., 2022d. Tdass: Target in: Proceedings of the AAAI conference on artificial intelligence, pp.
domain adaptation speech synthesis framework for multi-speaker 9299–9306.
low-resource tts, in: 2022 International Joint Conference on Neural [672] Zhou, P., Fan, R., Chen, W., Jia, J., 2019b. Improving generalization
Networks (IJCNN), pp. 1–7. doi:10.1109/IJCNN55064.2022.9892596. of transformer for speech recognition with parallel schedule sampling
[655] Zhang, Y., Che, H., Wang, X., 2021c. Non-parallel sequence-to- and relative positional embedding. arXiv preprint arXiv:1911.00203
sequence voice conversion for arbitrary speakers, in: 2021 12th In- .
ternational Symposium on Chinese Spoken Language Processing [673] Zhu, D., Chen, N., 2022. Multi-source domain adaptation and fusion
(ISCSLP), pp. 1–5. doi:10.1109/ISCSLP49672.2021.9362095. for speaker verification. IEEE/ACM Transactions on Audio, Speech,
[656] Zhang, Y., Han, W., Qin, J., Wang, Y., Bapna, A., Chen, Z., Chen, and Language Processing 30, 2103–2116.
N., Li, B., Axelrod, V., Wang, G., et al., 2023b. Google usm: Scaling [674] Zhu, H., Huang, H., Li, Y., Zheng, A., He, R., 2018. Arbitrary talking
automatic speech recognition beyond 100 languages. arXiv preprint face generation via attentional audio-visual coherence learning. arXiv
arXiv:2303.01037 . preprint arXiv:1812.06589 .
[657] Zhang, Y., Park, D.S., Han, W., Qin, J., Gulati, A., Shor, J., Jansen, [675] Zhu, H., Lee, K.A., Li, H., 2021. Serialized multi-layer multi-
A., Xu, Y., Huang, Y., Wang, S., Zhou, Z., Li, B., Ma, M., Chan, W., head attention for neural speaker embedding. arXiv preprint
Yu, J., Wang, Y., Cao, L., Sim, K.C., Ramabhadran, B., Sainath, T.N., arXiv:2107.06493 .
Beaufays, F., Chen, Z., Le, Q.V., Chiu, C.C., Pang, R., Wu, Y., 2022e. [676] Zhu, X., Yang, S., Yang, G., Xie, L., 2019. Controlling emotion
Bigssl: Exploring the frontier of large-scale semi-supervised learning strength with relative attribute for end-to-end speech synthesis, in:
for automatic speech recognition. IEEE Journal of Selected Topics in 2019 IEEE Automatic Speech Recognition and Understanding Work-
Signal Processing 16, 1519–1532. doi:10.1109/JSTSP.2022.3182537. shop (ASRU), pp. 192–199. doi:10.1109/ASRU46091.2019.9003829.
[658] Zhang, Y., Qin, J., Park, D.S., Han, W., Chiu, C.C., Pang, R., Le,
Q.V., Wu, Y., 2020d. Pushing the limits of semi-supervised learning
for automatic speech recognition. arXiv preprint arXiv:2010.10504 .
[659] Zhang, Z., Zhou, L., Wang, C., Chen, S., Wu, Y., Liu, S., Chen, Z.,
Liu, Y., Wang, H., Li, J., et al., 2023c. Speak foreign languages
with your own voice: Cross-lingual neural codec language modeling.
arXiv preprint arXiv:2303.03926 .
[660] Zhao, C., Wang, M., Dong, Q., Ye, R., Li, L., 2020a. Neurst: Neural
speech translation toolkit. arXiv preprint arXiv:2012.10018 .
[661] Zhao, G., Sonsaat, S., Silpachai, A., Lucic, I., Chukharev-Hudilainen,
E., Levis, J., Gutierrez-Osuna, R., 2018. L2-arctic: A non-native
english speech corpus., in: Interspeech, pp. 2783–2787.
[662] Zhao, H., Tan, H., Mei, H., 2022. Tiny-attention adapter: Contexts
are more important than the number of parameters. arXiv preprint
arXiv:2211.01979 .
[663] Zhao, S., Ma, B., 2023. Mossformer: Pushing the performance limit
of monaural speech separation using gated single-head transformer
with convolution-augmented joint self-attentions. arXiv preprint

Mehrish et al.: Preprint submitted to Elsevier Page 72 of 72


Highlights

• The review highlights the state-of-the-art deep learning techniques employed in speech processing.
• This review offers a comprehensive overview, encompassing both bottom-up and top-down approaches.
• We trace the evolution from traditional techniques to modern models such as CNNs, RNNs, transformers, and Conformers.
CRediT authorship contribution statement

Ambuj Mehrish: Conceptualization, Writing - Original Draft. Navonil Majumder: Writing - Original Draft. Rishabh Bhardwaj: Review, Editing. Rada Mihalcea: Review, Editing. Soujanya Poria: Review, Editing.
Declaration of interests

☒ The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

☐ The authors declare the following financial interests/personal relationships which may be considered as potential competing interests:
