A Review of Deep Learning Techniques For Speech Processing
[Fig. 1: Evolution of speech processing models over time (performance vs. time), from HMM + GMM systems through LSTM, GRU, ContextNet, Conformer, FastSpeech2, Wav2Vec 2.0, SpeechStew, HuBERT, Whisper, and VALL-E.]
The field of speech processing has undergone a transformative shift with the advent of deep learning. The
use of multiple processing layers has enabled the creation of models capable of extracting intricate features
from speech data. This development has paved the way for unparalleled advancements in automatic speech
recognition, text-to-speech synthesis, and emotion recognition, propelling the performance
of these tasks to unprecedented heights. The power of deep learning techniques has opened up new avenues
for research and innovation in the field of speech processing, with far-reaching implications for a range of
industries and applications. This review paper provides a comprehensive overview of the key deep learning
models and their applications in speech-processing tasks. We begin by tracing the evolution of speech
processing research, from early approaches, such as MFCC and HMM, to more recent advances in deep
learning architectures, such as CNNs, RNNs, transformers, conformers, and diffusion models. We categorize
the approaches and compare their strengths and weaknesses for solving speech-processing tasks. Furthermore,
we extensively cover various speech-processing tasks, datasets, and benchmarks used in the literature and
describe how different deep-learning networks have been utilized to tackle these tasks. Additionally, we
discuss the challenges and future directions of deep learning in speech processing, including the need for more
parameter-efficient, interpretable models and the potential of deep learning for multimodal speech processing.
By examining the field’s evolution, comparing and contrasting different approaches, and highlighting future
directions and challenges, we hope to inspire further research in this exciting and rapidly advancing field.
Contents
Abstract
Contents
1 Introduction
2 Background
2.1 Speech Signals
2.2 Speech Features
2.3 Traditional models for speech processing
3 Deep Learning Architectures and Their Applications in Speech Processing Tasks
3.1 Recurrent Neural Networks (RNNs)
3.2 Convolutional Neural Networks
3.3 Transformers
3.4 Conformer
3.5 Sequence to Sequence Models
3.6 Reinforcement Learning
3.7 Graph Neural Network
3.8 Diffusion Probabilistic Model
4 Speech Representation Learning
4.1 Supervised Learning
4.2 Unsupervised learning
4.3 Semi-supervised Learning
4.4 Self-supervised representation learning (SSRL)
5 Speech Processing Tasks
5.1 Automatic speech recognition (ASR) & conversational multi-speaker AST
5.2 Neural Speech Synthesis
5.3 Speaker recognition
5.4 Speaker Diarization
5.5 Speech-to-speech translation
5.6 Speech enhancement
5.7 Audio Super Resolution
5.8 Voice Activity Detection (VAD)
5.9 Speech Quality Assessment
5.10 Speech Separation
5.11 Spoken Language Understanding
5.12 Audio/visual multimodal speech processing
6 Advanced Transfer Learning Techniques for Speech Processing
6.1 Domain Adaptation
6.2 Meta Learning
6.3 Parameter-Efficient Transfer Learning
7 Conclusion and Future Research Directions
References
1 Introduction
Humans employ language as a means to effectively convey their emotions and sentiments. Language
encompasses a collection of words forming a vocabulary, accompanied by grammar, which dictates
the appropriate usage of these words. It manifests in various forms, including written text, sign
language, and spoken communication. Speech, specifically, entails the utilization of phonetic
combinations of consonant and vowel sounds to articulate words from the vocabulary. Phonetics,
in turn, pertains to the production and perception of sounds by individuals. Through speech,
individuals are able to express themselves and convey meaning in their chosen language.
Speech processing is a field dedicated to the study and application of methods for analyzing
and manipulating speech signals. It encompasses a range of tasks, including automatic speech
recognition (ASR) [390, 628], speaker recognition (SR) [31], and speech synthesis or text-to-speech
[396]. In recent years, speech processing has garnered increasing significance due to its diverse
applications in areas such as telecommunications, healthcare, and entertainment. Notably, statistical
modeling techniques, particularly Hidden Markov Models (HMMs), have played a pivotal role in
advancing the field [149, 442]. These models have paved the way for significant advancements and
breakthroughs in speech processing research and development.
Over the past few years, the field of speech processing has been transformed by the introduction of
powerful tools such as deep learning. Figure 1 illustrates the evolution of speech processing
models over the years; the rapid development of deep learning architectures for speech processing
reflects the growing complexity and diversity of the field. This technology has revolutionized
the analysis and processing of speech signals using deep neural networks (DNNs), convolutional
neural networks (CNNs), and recurrent neural networks (RNNs). These architectures have proven
highly effective in various speech-processing applications, such as speech recognition, speaker
recognition, and speech synthesis. This study comprehensively overviews the most critical and
emerging deep-learning techniques and their potential applications in various speech-processing
tasks.
Deep learning has revolutionized speech processing by its ability to automatically learn mean-
ingful features from raw speech signals, eliminating the need for manual feature engineering. This
breakthrough has led to significant advancements in speech processing performance, particularly
in challenging scenarios involving noise, as well as diverse accents and dialects. By leveraging the
power of deep neural networks, speech processing systems can now adapt and generalize more
effectively, resulting in improved accuracy and robustness in various applications. The inherent
capability of deep learning to extract intricate patterns and representations from speech data has
opened up new possibilities for tackling real-world speech processing challenges.
Deep learning architectures have emerged as powerful tools in speech processing, offering
remarkable improvements in various tasks. Pioneering studies, such as [185], have demonstrated
the substantial gains achieved by deep neural networks (DNNs) in speech recognition accuracy
compared to traditional HMM-based systems. Complementing this, research in [3] showcased the
effectiveness of convolutional neural networks (CNNs) for speech recognition. Moreover, recurrent
neural networks (RNNs) have proven their efficacy in both speech recognition and synthesis,
as highlighted in [161]. Recent advancements in deep learning have further enhanced speech
processing systems, with attention mechanisms [85] and transformers [554] playing significant
roles. Attention mechanisms enable the model to focus on salient sections of the input signal, while
transformers facilitate modeling long-range dependencies within the signal. These developments
have led to substantial improvements in the performance and versatility of speech processing
systems, unlocking new possibilities for applications in diverse domains.
Although deep learning has made remarkable progress in speech processing, it still faces certain
challenges that need to be addressed. These challenges include the requirement for substantial
amounts of labeled data, the interpretability of the models, and their robustness to different
environmental conditions. To provide a comprehensive understanding of the advancements in
this domain, this paper presents an extensive overview of deep learning architectures employed
in speech-processing applications. Speech processing encompasses the analysis, synthesis, and
recognition of speech signals, and the integration of deep learning techniques has led to significant
advancements in these areas. By examining the current state-of-the-art approaches, this paper
aims to shed light on the potential of deep learning for tackling the existing challenges and further
advancing speech processing research.
The paper provides a comprehensive exploration of deep-learning architectures in the field of
speech processing. It begins by establishing the background, encompassing the definition of speech
signals, speech features, and traditional non-neural models. Subsequently, the focus shifts towards
an in-depth examination of various deep-learning architectures specifically tailored for speech
processing, including RNNs, CNNs, Transformers, GNNs, and diffusion models. Recognizing the
significance of representation learning techniques in this domain, the survey paper dedicates a
dedicated section to their exploration.
Moving forward, the paper delves into an extensive range of speech processing tasks where deep
learning has demonstrated substantial advancements. These tasks encompass critical areas such
as speech recognition, speech synthesis, speaker recognition, and speech-to-speech translation.
By thoroughly analyzing the fundamentals, model architectures, and specific
tasks within the field, the paper then progresses to discuss advanced transfer learning techniques,
including domain adaptation, meta-learning, and parameter-efficient transfer learning.
Finally, in the conclusion, the paper reflects on the current state of the field and identifies potential
future directions. By considering emerging trends and novel approaches, the paper aims to shed
light on the evolving landscape of deep learning in speech processing and provide insights into
promising avenues for further research and development.
Why this paper?. Deep learning has become a powerful tool in speech processing because it
automatically learns high-level representations of speech signals from raw audio data. As a result,
significant advancements have been made in various speech-processing tasks, including speech
recognition, speaker identification, speech synthesis, and more. These tasks are essential in various
applications, such as human-computer interaction, speech-based search, and assistive technology
for people with speech impairments. For example, virtual assistants like Siri and Alexa use speech
recognition technology, while audiobooks and in-car navigation systems rely on text-to-speech
systems.
Given the wide range of applications and the rapidly evolving nature of deep learning, a compre-
hensive review paper that surveys the current state-of-the-art techniques and their applications in
speech processing is necessary. Such a paper can help researchers and practitioners stay up-to-
date with the latest developments and trends and provide insights into potential areas for future
research. However, to the best of our knowledge, no current work covers a broad spectrum of
speech-processing tasks.
A review paper on deep learning for speech processing can also be a valuable resource for
beginners interested in learning about the field. It can provide an overview of the fundamental
concepts and techniques used in deep learning for speech processing and help them gain a deeper
understanding of the field. While some survey papers focus on specific speech-processing tasks
such as speech recognition, a broad survey would cover a wide range of other tasks such as speaker
recognition, speech synthesis, and more. A broad survey would highlight the commonalities and
differences between these tasks and provide a comprehensive view of the advancements made in
the field.
2 Background
Before moving on to deep neural architectures, we discuss basic terms used in speech processing,
low-level representations of speech signals, and traditional models used in the field.
The computation of MFCCs involves a series of steps: A/D conversion, pre-emphasis filtering, framing,
windowing, Fourier transform, Mel filter bank application, logarithmic compression, discrete cosine
transform (DCT), and liftering. By following these steps, MFCCs enable the extraction of informative
audio features while avoiding redundancy and preserving the relevant characteristics of the sound signal.
Other types of speech features include formant frequencies, pitch contour, cepstral coefficients,
wavelet coefficients, and spectral envelope. These features can be used for various speech-processing
tasks, including speech recognition, speaker identification, emotion recognition, and speech syn-
thesis.
In the field of speech processing, frequency-based representations such as Mel spectrogram and
MFCC are widely used since they are more robust to noise as compared to temporal variations
of the sound [7]. Time-domain features can be useful when the task warrants this information
(such as pauses, emotions, phoneme duration, and speech segments). It is noteworthy that the
time-domain and frequency-domain features tend to capture different sets of information and thus
can be used in conjunction to solve a task [512, 529, 568].
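As a concrete illustration of the pipeline above, the following minimal sketch computes a log-Mel spectrogram, MFCCs, and a simple frame-energy contour with the librosa library; the 16 kHz file speech.wav, the frame parameters, and the feature dimensions are illustrative assumptions rather than settings prescribed by any particular system discussed here.

```python
# Minimal feature-extraction sketch (assumes librosa is installed and a
# 16 kHz mono file "speech.wav" exists; both are illustrative assumptions).
import librosa
import numpy as np

# Load the waveform (time-domain signal).
y, sr = librosa.load("speech.wav", sr=16000)

# Frequency-domain features: log-Mel spectrogram.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160,
                                     n_mels=80)
log_mel = librosa.power_to_db(mel)          # shape: (80, num_frames)

# MFCCs: librosa internally applies the Mel filter bank, log compression,
# and the DCT described above.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)

# Simple time-domain cue: frame-level energy, useful for pause/segment cues.
energy = np.array([np.sum(np.abs(y[i:i + 400]) ** 2)
                   for i in range(0, len(y) - 400, 160)])
```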
3 Deep Learning Architectures and Their Applications in Speech Processing Tasks
Deep learning architectures have become powerful tools for extracting valuable information from vast amounts of speech data. In this section, we delve into the applications
of deep learning architectures in speech processing tasks, exploring their potential, advancements,
and the impact they have had on the field. By examining the key components and techniques
employed in these architectures, we aim to provide insights into the current state-of-the-art in
deep learning for speech processing and shed light on the exciting prospects it holds for future
advancements in the field.
A vanilla RNN maps an input sequence $\{x_t\}_{t=1}^{T}$ to an output sequence by maintaining a hidden state,
$$h_t = \mathcal{H}(W_{xh} x_t + W_{hh} h_{t-1} + b_h), \qquad y_t = W_{hy} h_t + b_y, \quad (2)$$
where $\mathcal{H}$ is the hidden-layer activation function and $y_t$ is the output at time $t$.
Bidirectional RNNs. For numerous tasks in speech processing, it is more effective to process
the whole utterance at once. For instance, in speech recognition, one-shot input transcription can
be more robust than transcribing based on the partial (i.e. previous) context information [161]. The
vanilla RNN has a limitation in such cases, as it is unidirectional in nature; that is, the output $y_t$ is
obtained from $\{x_k\}_{k=1}^{t}$ and is thus agnostic of what comes after time $t$. Bidirectional RNNs (BRNNs)
were proposed to overcome this shortcoming of RNNs [485]. BRNNs encode both future and past
(input) context in separate hidden layers. The outputs of the two RNNs are then combined at each
time step, typically by concatenating them together, to create a new, richer representation that
includes both past and future context.
$$\overrightarrow{h}_t = \mathcal{H}\big(W_{x\overrightarrow{h}} x_t + W_{\overrightarrow{h}\overrightarrow{h}} \overrightarrow{h}_{t-1} + b_{\overrightarrow{h}}\big) \quad (3)$$
$$\overleftarrow{h}_t = \mathcal{H}\big(W_{x\overleftarrow{h}} x_t + W_{\overleftarrow{h}\overleftarrow{h}} \overleftarrow{h}_{t+1} + b_{\overleftarrow{h}}\big) \quad (4)$$
$$y_t = W_{\overrightarrow{h}y} \overrightarrow{h}_t + W_{\overleftarrow{h}y} \overleftarrow{h}_t + b_y \quad (5)$$
where the high-dimensional hidden states $\overrightarrow{h}_{t-1}$ and $\overleftarrow{h}_{t+1}$ model the forward
context from $1, 2, \ldots, t-1$ and the backward context from $T, T-1, \ldots, t+1$, respectively.
Long Short-Term Memory. Vanilla RNNs face another limitation: vanishing gradients, which
prevent them from learning long-range context information. To
overcome this, a variant of the RNN, the LSTM, was specifically designed to address the vanishing
gradient problem and enable the network to selectively retain (or forget) information over longer
periods of time [187]. This attribute is achieved by maintaining separate purpose-built memory cells
in the network: the long-term memory cell 𝑐𝑡 and the short-term memory cell ℎ𝑡 . In Equation (2),
LSTM redefines the operator H in terms of forget gate 𝑓𝑡 , input gate 𝑖𝑡 , and output gate 𝑜𝑡 ,
𝑖𝑡 = 𝜎 (𝑊𝑥𝑖 𝑥𝑡 + 𝑊ℎ𝑖 ℎ𝑡 −1 + 𝑊𝑐𝑖 𝑐𝑡 −1 + 𝑏𝑖 ), (6)
𝑓𝑡 = 𝜎 (𝑊𝑥 𝑓 𝑥𝑡 + 𝑊ℎ𝑓 ℎ𝑡 −1 + 𝑊𝑐 𝑓 𝑐𝑡 −1 + 𝑏 𝑓 ), (7)
𝑐𝑡 = 𝑓𝑡 ⊙ 𝑐𝑡 −1 + 𝑖𝑡 ⊙ tanh (𝑊𝑥𝑐 𝑥𝑡 + 𝑊ℎ𝑐 ℎ𝑡 −1 + 𝑏𝑐 ), (8)
𝑜𝑡 = 𝜎 (𝑊𝑥𝑜 𝑥𝑡 + 𝑊ℎ𝑜 ℎ𝑡 −1 + 𝑊𝑐𝑜 𝑐𝑡 + 𝑏𝑜 ), (9)
ℎ𝑡 = 𝑜𝑡 ⊙ tanh (𝑐𝑡 ), (10)
where 𝜎 (𝑥) = 1/(1 + 𝑒 −𝑥 ) is a logistic sigmoid activation function. 𝑐𝑡 is a fusion of the information
from the previous state of the long-term memory 𝑐𝑡 −1 , the previous state of short-term memory
ℎ𝑡 −1 , and current input 𝑥𝑡 . 𝑊 and 𝑏 are weight matrices and biases. ⊙ is the element-wise vector
multiplication or Hadamard operator. Bidirectional LSTMs (BLSTMs) can capture longer contexts
in both forward and backward directions [158].
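To make the recurrence above concrete, the following minimal PyTorch sketch runs a bidirectional LSTM over a batch of acoustic feature frames, mirroring the forward/backward context of Equations (3)-(10); the batch size, sequence length, and feature and hidden dimensions are illustrative assumptions.

```python
# Minimal bidirectional LSTM sketch over acoustic feature frames.
import torch
import torch.nn as nn

batch, time_steps, feat_dim, hidden = 4, 100, 80, 256

blstm = nn.LSTM(input_size=feat_dim, hidden_size=hidden,
                num_layers=2, batch_first=True, bidirectional=True)

x = torch.randn(batch, time_steps, feat_dim)   # e.g. log-Mel frames
outputs, (h_n, c_n) = blstm(x)

# Forward and backward hidden states are concatenated at every time step,
# so each frame is represented with both past and future context.
print(outputs.shape)    # torch.Size([4, 100, 512]) = 2 * hidden
```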
The connectionist temporal classification (CTC) loss also solves the problem of having to specify the
position of each character in the output, allowing for more efficient training of the neural network. Finally, the CTC
decoder can transform the neural network output into the final text without post-processing.
3.1.2 Application
The utilization of RNNs in popular products such as Google’s voice search and Apple’s Siri to
process user input and predict the output has been well-documented [177, 304]. RNNs are frequently
utilized in speech recognition tasks, such as the prediction of phonetic segments from audio signals
[412]. They excel in use cases where context plays a vital role in outcome prediction and are distinct
from CNNs as they utilize feedback loops to process a data sequence that informs the final output
[412].
In recent times, there have been advancements in the architecture of RNNs, which have been
primarily focused on developing end-to-end (E2E) models [302, 409] for ASR. These E2E models
have replaced conventional hybrid models and have displayed substantial enhancements in speech
recognition [302, 303]. However, a significant challenge faced by E2E RNN models is the synchro-
nization of the input speech sequence with the output label sequence [158]. To tackle this issue, a
loss function called CTC [159] is utilized for training RNN models, allowing for the repetition of
labels to construct paths of the same length as the input speech sequence. An alternative method is
to employ an Attention-based Encoder-Decoder (AED) model based on RNN architecture, which
utilizes an attention mechanism to align the input speech sequence with the output label sequence.
However, AED models tend to perform poorly on lengthy utterances.
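The following minimal sketch shows how a CTC loss can be attached to an RNN acoustic model in PyTorch, marginalizing over all monotonic alignments between frames and labels; the vocabulary size, sequence lengths, and the BLSTM encoder are illustrative assumptions rather than a specific system from the literature.

```python
# Minimal sketch of training an acoustic model with CTC loss.
import torch
import torch.nn as nn

num_classes = 29            # e.g. 26 letters + space + apostrophe + blank (index 0)
encoder = nn.LSTM(input_size=80, hidden_size=256, batch_first=True,
                  bidirectional=True)
classifier = nn.Linear(2 * 256, num_classes)
ctc_loss = nn.CTCLoss(blank=0)

feats = torch.randn(4, 120, 80)                     # (batch, frames, features)
hidden, _ = encoder(feats)
log_probs = classifier(hidden).log_softmax(-1)      # (batch, frames, classes)

targets = torch.randint(1, num_classes, (4, 30))    # label sequences (no blanks)
input_lengths = torch.full((4,), 120, dtype=torch.long)
target_lengths = torch.full((4,), 30, dtype=torch.long)

# CTCLoss expects (frames, batch, classes); it sums over all monotonic
# alignments between the 120 input frames and the 30 output labels.
loss = ctc_loss(log_probs.transpose(0, 1), targets, input_lengths, target_lengths)
loss.backward()
```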
The development of Bimodal Recurrent Neural Networks (BRNN) has led to significant ad-
vancements in the field of Audiovisual Speech Activity Detection (AV-SAD) [531]. BRNNs have
demonstrated immense potential in improving the performance of speech recognition systems,
particularly in noisy environments, by combining information from various sources. By integrating
separate RNNs for each modality, BRNNs can capture temporal dependencies within and across
modalities. This leads to successful outcomes in speech-based systems, where integrating audio and
visual modalities is crucial for accurate speech recognition. Compared to conventional audio-only
systems, BRNN-based AV-SAD systems display superior performance, particularly in challenging
acoustic conditions where audio-only systems might struggle.
To enhance the performance of continuous speech recognition, LSTM networks have been
utilized in hybrid architectures alongside CNNs [417]. The CNNs extract local features from speech
frames that are then processed by LSTMs over time [417]. LSTMs have also been employed for
speech synthesis, where they have been shown to enhance the quality of statistical parametric
speech synthesis [417].
Aside from their ASR and speech synthesis applications, LSTM networks have been utilized for
speech post-filtering. To improve the quality of synthesized speech, researchers have proposed deep
learning-based post-filters, with LSTMs demonstrating superior performance over other post-filter
types [99]. Bidirectional LSTM (Bi-LSTM) is another variant of RNN that has been widely used
for speech synthesis [136]. Several neural analysis/synthesis models such as WaveNet [402],
SampleRNN [373], and Tacotron have been developed. These neural vocoder models can generate
high-quality synthesized speech from acoustic features without requiring intermediate vocoding
steps.
3.2 Convolutional Neural Networks
A convolutional layer applies a set of learnable filters that slide across the input space. A pooling layer then converts the convolutional activations to a lower resolution by taking the
maximum filter activation within a specified window as it shifts across the activation map. CNNs
are variants of fully connected neural networks widely used for processing data with grid-like
topology. For example, time-series data (1D grid) with samples at regular intervals or images (2D
grid) with pixels constitute a grid-like structure.
As discussed in Section 2, the speech spectrogram retains more information than hand-crafted
features, including speaker characteristics such as vocal tract length differences across speakers,
distinct speaking styles causing formants to undershoot or overshoot, etc. These characteristics are
expressed explicitly in the frequency domain. The spectrogram representation shows very strong
correlations in time and frequency. Due to these characteristics of the spectrogram, it is a suitable
input for a CNN processing pipeline that requires preserving locality in both frequency and time
axis. For speech signals, modeling local correlations with CNNs will be beneficial. The CNNs can
also effectively extract the structural features from the spectrogram and reduce the complexity of
the model through weight sharing. This section will discuss the architecture of 1D and 2D CNNs
used in various speech-processing tasks.
3.2.1 CNN Model Variants
2D CNN. Since spectrograms are two-dimensional visual representations, one can leverage
CNN architectures widely used for visual data processing (images and videos) by performing
convolutions in two dimensions. The mathematical equation for a 2D convolutional layer can be
represented as:
$$y_{i,j}^{(k)} = \sigma\left(\sum_{l=1}^{L}\sum_{m=1}^{M} x_{i+l-1,\,j+m-1}^{(l)}\, w_{l,m}^{(k)} + b^{(k)}\right) \quad (14)$$
Here, $x_{i,j}^{(l)}$ is the pixel value of the $l^{th}$ input channel at spatial location $(i,j)$, $w_{l,m}^{(k)}$ is the weight of the $m^{th}$ filter element on the $l^{th}$ channel producing the $k^{th}$ feature map, and $b^{(k)}$ is the bias term for the $k^{th}$ feature map.
The output feature map $y_{i,j}^{(k)}$ is obtained by convolving the input image with the filters and then
applying an activation function 𝜎 to introduce non-linearity. The convolution operation involves
sliding the filter window over the input image, computing the dot product between the filter and
the input pixels at each location, and producing a single output pixel.
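The sketch below applies the 2D convolution of Equation (14), followed by a non-linearity and pooling, to a batch of log-Mel spectrograms in PyTorch; the channel counts, kernel size, and input dimensions are illustrative assumptions.

```python
# Minimal 2D CNN sketch over log-Mel spectrograms (cf. Equation (14)).
import torch
import torch.nn as nn

spec = torch.randn(8, 1, 80, 200)   # (batch, channels, mel bins, frames)

conv2d = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=32, kernel_size=(3, 3), padding=1),
    nn.ReLU(),                       # the non-linearity sigma in Eq. (14)
    nn.MaxPool2d(kernel_size=2),     # pooling over both frequency and time
)

feature_maps = conv2d(spec)
print(feature_maps.shape)            # torch.Size([8, 32, 40, 100])
```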
However, there are some drawbacks to using a 2D CNN for speech processing. One of the main
issues is that 2D convolutions are computationally expensive, especially for large inputs. This is
because 2D convolutions involve many multiplications and additions, and the computational cost
grows quickly with the input size.
To address this issue, a 1D CNN can be designed to operate directly on the speech signal
without needing a spectrogram. 1D convolutions are much less computationally expensive than
2D convolutions because they only operate on one dimension of the input. This reduces the
multiplications and additions required, making the network faster and more efficient. In addition,
1D feature maps require less memory during processing, which is especially important for real-time
applications. A neural network’s memory requirements are proportional to its feature maps’ size.
By using 1D convolutions, the size of the feature maps can be significantly reduced, which can
improve the efficiency of the network and reduce the memory requirements.
1D CNN. 1D CNN is essentially a special case of 2D CNN where the height of the filter is
equal to the height of the spectrogram. Thus, the filter only slides along the temporal dimension and
the height of the resultant feature maps is one. As such, 1D convolutions are computationally less
expensive and memory efficient [261], as compared to 2D CNNs. Several studies [6, 245, 262] have
shown that 1D CNNs are preferable to their 2D counterparts in certain applications. For example,
Alsabhan [12] found that the performance of predicting emotions with a 2D CNN model was lower
compared to a 1D CNN model.
1D convolution is useful in speech processing for several reasons:
• Temporal modeling: Since speech signals are sequences of amplitudes sampled over time, 1D convolution can
be applied along the temporal dimension to capture temporal variations in the signal.
• Robustness to distortion and noise: Since 1D convolution allows local feature extraction,
the resultant features are often resilient to global distortions of the signal. For instance,
a speaker might be interrupted in the middle of an utterance. Local features would still
produce robust representations for those relevant spans, which is key to ASR, among
many other speech processing tasks. On the other hand, speech signals are often contaminated
with noise, making extracting meaningful information difficult. 1D convolution followed
by pooling layers can mitigate the impact of noise [180], improving speech recognition
systems’ accuracy.
The basic building block of a 1D CNN is the convolutional layer, which applies a set of filters to
the input data. A convolutional layer employs a collection of adjustable parameters called filters
to carry out convolution operations on the input data, resulting in a set of feature maps as the
output, which represent the activation of each filter at each position in the input data. The size of
the feature maps depends on the size of the input data, the size of the filters, and the number of
filters used. The activation function used in a 1D CNN is typically a non-linear function, such as
the rectified linear unit (ReLU) function.
Given an input sequence 𝑥 of length 𝑁 , a set of 𝐾 filters 𝑊𝑘 of length 𝑀, and a bias term 𝑏𝑘 , the
output feature map 𝑦𝑘 of the 𝑘 𝑡ℎ filter is given by
$$y_k[n] = \mathrm{ReLU}\left(b_k + \sum_{m=0}^{M-1} W_k[m]\, x[n-m]\right) \quad (15)$$
where $n$ ranges from $M-1$ to $N-1$ and the summation implements the convolution operation. After the convolutional
layer, the output tensor is typically passed through a pooling layer, reducing the feature maps’ size
by down-sampling. The most commonly used pooling operation is the max-pooling, which keeps
the maximum value from a sliding window across each feature map.
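The following minimal sketch implements the 1D convolutional layer of Equation (15) with a ReLU activation and max-pooling, operating directly on raw waveforms; the filter length, stride, and number of filters are illustrative assumptions.

```python
# Minimal 1D CNN sketch on raw waveforms (cf. Equation (15)).
import torch
import torch.nn as nn

wave = torch.randn(8, 1, 16000)      # (batch, 1 channel, samples) ~ 1 s at 16 kHz

conv1d = nn.Sequential(
    nn.Conv1d(in_channels=1, out_channels=64, kernel_size=400, stride=160),
    nn.ReLU(),                        # ReLU activation as in Eq. (15)
    nn.MaxPool1d(kernel_size=4),      # down-sample the feature maps
)

features = conv1d(wave)               # (batch, K=64 filters, reduced time axis)
print(features.shape)                 # torch.Size([8, 64, 24])
```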
CNNs often replace previously popular methods like HMMs and GMM-UBM in various cases.
Moreover, CNNs possess the ability to acquire features that remain robust despite variations in
speech signals resulting from diverse speakers, accents, and background noise. This is made possible
due to three key properties of CNNs: locality, weight sharing, and pooling. The locality property
enhances resilience against non-white noise by enabling the computation of effective features from
cleaner portions of the spectrum. Consequently, only a smaller subset of features is affected by
the noise, allowing higher network layers a better opportunity to handle the noise by combining
higher-level features computed for each frequency band. This improvement over standard fully
connected neural networks, which process all input features in the lower layers, highlights the
significance of locality. As a result, locality reduces the number of network weights that must be
learned.
3.2.2 Application
CNNs have proven to be versatile tools for a range of speech-processing tasks. They have been
successfully applied to speech recognition [4, 390], including in hybrid NN-HMM models for speech
recognition, and can be used for multi-class classification of words [5]. In addition, CNNs have
been proposed for speaker recognition in an emotional speech, with a constrained CNN model
presented in [496].
CNNs, both 1D and 2D, have emerged as the core building block for various speech processing
models, including acoustic models [162, 273, 483] in ASR systems. For instance, in 2020, researchers
from Facebook AI proposed wav2vec 2.0 [483], a framework that uses a stack of CNNs to learn latent
representations of raw speech signals, which are then contextualized by a transformer network.
The system achieved state-of-the-art results on several benchmark datasets.
Similarly, Google’s VGGVox [92] used a CNN with VGG architecture to learn speaker embeddings
from Mel spectrograms, achieving state-of-the-art results in speaker recognition. CNNs have
also been widely used in developing state-of-the-art speech enhancement and text-to-speech
architectures. For instance, the architecture proposed in [311, 541] for Deep Noise Suppression
(DNS) [457] challenge and Google’s Tacotron2 [491] are examples of models that use CNNs as
their core building blocks. In addition to traditional tasks like ASR and speaker identification,
CNNs have also been applied to non-traditional speech processing tasks like emotion recognition
[230], Parkinson’s disease detection [224], language identification [498] and sleep apnea detection
[497]. In all these tasks, CNN extracted features from speech signals and fed them into the task
classification model.
Fig. 2. TCNNs leverage causal and dilated convolutions to model temporal dependencies in sequential data.
Causal convolutions ensure that future information is not used during training, while dilated convolutions
increase the receptive field without increasing computational complexity. This makes TCNNs an effective
and efficient solution for a wide range of tasks, including speech recognition, action recognition, and music
analysis.
Consider a 1-D sequence 𝑥 ∈ R𝑛 and a filter: 𝑓 : {0, ..., 𝑘 − 1} → R, the dilated convolution
operation 𝐹𝑑 on an element 𝑦 of the sequence is defined as
$$F_d(y) = (x *_d f)(y) = \sum_{i=0}^{k-1} f(i) \cdot x_{y - d \cdot i}, \quad (16)$$
where $k$ is the filter size, $d$ is the dilation factor, and $y - d \cdot i$ indexes the span into the past. The dilation
step introduces a fixed gap between every two adjacent filter taps. When $d = 1$, a dilated
convolution acts as a normal convolution, whereas for larger dilation the filter acts on
a wide but non-contiguous range of inputs. Dilation therefore effectively expands the
receptive field of the convolutional network, as illustrated in the sketch below.
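The sketch below implements a causal, dilated 1D convolution block of the kind used in TCNNs (cf. Equation (16) and Fig. 2); the channel count, kernel size, and choice of dilation factors are illustrative assumptions.

```python
# Minimal sketch of a causal, dilated 1D convolution block as used in TCNNs.
import torch
import torch.nn as nn

class CausalDilatedConv1d(nn.Module):
    def __init__(self, channels, kernel_size, dilation):
        super().__init__()
        # Left-pad so that the output at time t depends only on inputs <= t.
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              dilation=dilation)

    def forward(self, x):                        # x: (batch, channels, time)
        x = nn.functional.pad(x, (self.pad, 0))  # pad only on the left (past)
        return self.conv(x)

# Stacking layers with dilations 1, 2, 4 expands the receptive field
# exponentially while keeping the number of parameters per layer fixed.
tcn = nn.Sequential(*[CausalDilatedConv1d(channels=64, kernel_size=3,
                                          dilation=d) for d in (1, 2, 4)])
out = tcn(torch.randn(8, 64, 200))
print(out.shape)                                 # torch.Size([8, 64, 200])
```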
3.2.5 Application
Recent studies have shown that the TCNN architecture not only outperforms traditional recurrent
networks like LSTMs and GRUs in terms of accuracy but also possesses a set of advantageous
properties, including:
• Parallelism is a key advantage of TCNN over RNNs. In RNNs, time-step predictions depend
on their predecessors’ completion, which limits parallel computation. In contrast, TCNNs
apply the same filter to each span in the input, allowing parallel application thereof. This
feature enables more efficient processing of long input sequences compared to RNNs that
process sequentially.
• The receptive field size can be modified in various ways to enhance the performance of
TCNNs. For example, incorporating additional dilated convolutional layers, employing
larger dilation factors, or augmenting the filter size are all effective methods. Consequently,
TCNNs offer superior management of the model’s memory size and are highly adaptable to
diverse domains.
• When dealing with lengthy input sequences, LSTM and GRU models tend to consume a
significant amount of memory to retain the intermediate outcomes for their numerous
cell gates. On the other hand, TCNNs utilize shared filters throughout a layer, and the
back-propagation route depends solely on the depth of the network. This makes TCNNs
a more memory-efficient alternative to LSTMs and GRUs, especially in scenarios where
memory constraints are a concern.
TCNNs can perform real-time speech enhancement in the time domain [411]. They have far
fewer trainable parameters than earlier models, making them more efficient. TCNNs have also been
used for speech and music detection in radio broadcasts [212, 297] and for single-channel
speech enhancement [322, 464], and they have been trained as filter banks to extract features from the
waveform to improve the performance of ASR [307].
3.3 Transformers
While recurrence in RNNs (Section 3.1) is a boon for neural networks to model sequential data, it is
also a bane as the recurrence in time to update the hidden state intrinsically precludes parallelization.
Additionally, although dedicated gated RNNs such as LSTM and GRU have helped to mitigate
the vanishing gradient problem to some extent, it can still be a challenge to maintain long-term
dependencies in RNNs.
Proposed by Vaswani et al. [554], Transformer solved a critical shortcoming of RNNs by allowing
parallelization within the training sample, that is, facilitating the processing of the entire input
sequence at once. Since then, the primary idea of using only the attention mechanism to construct
an encoder and decoder has served as the basic recipe for many state-of-the-art architectures across
the domains of machine learning. In this survey, we use transformer to denote architectures
that are inspired by Transformer [46, 109, 167, 444, 445]. This section overviews the transformer’s
fundamental design proposed by Vaswani et al. [554] and its adaptations for different speech-related
applications.
3.3.1 Basic Architecture
Transformer architecture [554] comprises an attention-based encoder and decoder, with each
module consisting of a stack of identical blocks. Each block in the encoder and decoder consists
of two sub-layers: a multi-head attention (MHA) mechanism and a position-wise fully connected
feedforward network as described in Figure 3. The MHA mechanism in the encoder allows each
input element to attend to every other element in the sequence, enabling the model to capture
long-range dependencies in the input sequence. The decoder typically uses a combination of MHA
and encoder-decoder attention to attend to both the input sequence and the previously generated
output elements. The feedforward network in each block of the Transformer provides non-linear
transformations to the output of the attention mechanism. Next, we discuss operations involved in
transformer layers, that is, multi-head attention and position-wise feedforward network:
[Fig. 3: Scaled dot-product attention (with optional masking) and multi-head attention, in which M heads attend in parallel over linearly projected queries Q, keys K, and values V, and their outputs are concatenated and linearly projected.]

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V \quad (17)$$
Here, multiple query, key, and value vectors are packed together in matrix form, denoted respectively
by $Q \in \mathbb{R}^{N \times d_k}$, $K \in \mathbb{R}^{M \times d_k}$, and $V \in \mathbb{R}^{M \times d_v}$. $N$ and $M$ represent the lengths of queries and
keys (or values). Scaling of dot product attention becomes critical to tackling the issue of small
gradients with the increase in 𝑑𝑘 [554].
Instead of performing single attention in each transformer block, multiple attentions in lower-
dimensional space have been observed to work better [554]. This observation gave rise to Multi-
Head Attention: For ℎ heads and dimension of tokens in the model 𝑑𝑚 , the 𝑑𝑚 -dimensional
query, key, and values are projected ℎ times to 𝑑𝑘 , 𝑑𝑘 , and 𝑑 𝑣 dimensions using learnable linear
projections³. Each head performs the attention operation as per Equation (17). The $h$ resulting $d_v$-dimensional outputs
are concatenated and projected back to $d_m$ using another projection matrix.
Position-wise FFN. The position-wise FFN consists of two dense layers. It is referred to as
position-wise since the same two dense layers are applied to each position in the sequence,
which is equivalent to applying two 1 × 1 convolution layers.
³ Projection weights are neither shared across heads nor across the query, key, and value projections.
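The following minimal sketch implements the scaled dot-product attention of Equation (17); the optional mask argument and the tensor shapes are illustrative assumptions, and multi-head attention amounts to applying this function per head on projected inputs before concatenation and re-projection (torch.nn.MultiheadAttention packages the same computation).

```python
# Minimal sketch of scaled dot-product attention (Equation (17)).
import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q: (..., N, d_k), K: (..., M, d_k), V: (..., M, d_v)
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)    # (..., N, M)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ V                                    # (..., N, d_v)

Q = torch.randn(2, 8, 50, 64)    # (batch, heads, query length N, d_k)
K = torch.randn(2, 8, 60, 64)    # (batch, heads, key length M, d_k)
V = torch.randn(2, 8, 60, 64)
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)                 # torch.Size([2, 8, 50, 64])
```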
3.3.2 Application
Transformers have a distinct advantage in comprehending speech, as they analyse the entire sentence simultaneously, whereas
RNNs process input words one by one.
Transformers have been successfully applied in end-to-end speech processing, including auto-
matic speech recognition (ASR), speech translation (ST), and text-to-speech (TTS) [309]. In 2018,
the Speech-Transformer was introduced as a no-recurrence sequence-to-sequence model for speech
recognition. To reduce the dimension difference between input and output sequences, the model’s
architecture was modified by adding convolutional neural network (CNN) layers before feeding
the features to the transformer. In a later study [388], the authors proposed a method to improve
the performance of end-to-end speech recognition models based on transformers. They integrated
the connectionist temporal classification (CTC) with the transformer-based model to achieve better
accuracy and used language models to incorporate additional context and mitigate recognition
errors.
In addition to speech recognition, the transformer model has shown promising results in TTS
applications. The transformer based TTS model generates mel-spectrograms, followed by a WaveNet
vocoder to output the final audio results [309]. Several neural network-based TTS models, such as
Tacotron 2, DeepVoice 3, and transformer TTS, have outperformed traditional concatenative and
statistical parametric approaches in terms of speech quality [309, 426, 491].
One of the strengths of Transformer-based architectures for neural speech synthesis is their
high efficiency while considering the global context [162, 492]. The Transformer TTS model has
shown advantages in training and inference efficiency over RNN-based models such as Tacotron 2
[491]. The efficiency of the Transformer TTS network can speed up the training about 4.25 times
[309]. Moreover, Multi-Speech, a multi-speaker TTS model based on the Transformer [309], has
demonstrated the effectiveness of synthesizing a more robust and better quality multi-speaker
voice than naive Transformer-based TTS.
In contrast to the strengths of Transformer-based architectures in neural speech synthesis, large
language models based on Transformers such as BERT [109], GPT [444], XLNet [618], and T5
[448] have limitations when it comes to speech processing. One of the issues is that these models
require discrete tokens as input, necessitating using a tokenizer or a speech recognition system,
introducing errors and noise. Furthermore, pre-training on large-scale text corpora can lead to
domain mismatch problems when processing speech data. To address these limitations, dedicated
frameworks have been developed for learning speech representations using transformers, including
wav2vec [483], data2vec [24], Whisper [443], VALL-E [562], Unispeech [565], SpeechT5 [16] etc.
We discuss some of them as follows.
• Speech representation learning frameworks, such as wav2vec, have enabled significant ad-
vancements in speech processing tasks. One recent framework, w2v-BERT [585], combines
contrastive learning and masked language modeling (MLM) to achieve self-supervised speech pre-training on discrete
tokens. Fine-tuning wav2vec models with limited labeled data has also been demonstrated
to achieve state-of-the-art results in speech recognition tasks [25]. Moreover, XLS-R [20],
another model based on wav2vec 2.0, has shown state-of-the-art results in various tasks,
domains, data regimes, and languages, by leveraging multilingual data augmentation and
contrastive learning techniques on a large scale. These models learn universal speech rep-
resentations that can be transferred across languages and domains, thus representing a
significant advancement in speech representation learning.
• Transformers have been increasingly popular in the development of frameworks for learn-
ing representations from multi-modal data, such as speech, images, and text. Among these
frameworks, Data2vec [24] is a self-supervised training approach that aims to learn joint rep-
resentations to capture cross-modal correlations and transfer knowledge across modalities.
3.4 Conformer
The Transformer architecture, which utilizes a self-attention mechanism, has successfully replaced
recurrent operations in previous architectures. Over the past few years, various Transformer
variants have been proposed [162]. Architectures combining Transformers and CNNs have re-
cently shown promising results on speech-processing tasks [582]. To efficiently model both local
and global dependencies of an audio sequence, several attempts have been made to combine
CNNs and Transformers. One such architecture proposed by the authors is the Conformer [162], a
convolution-augmented transformer for speech recognition. Conformer outperforms RNNs, pre-
vious Transformers, and CNN-based models, achieving state-of-the-art performance in speech recognition.

Fig. 4. Timeline highlighting notable large Transformer models developed for speech processing, along with their corresponding parameter sizes (ranging from 34M to 8B parameters).

The Conformer model consists of several building blocks, including convolutional
layers, self-attention layers, and feedforward layers. The architecture of the Conformer model can
be summarized as follows:
• Input Layer: The Conformer model inputs a sequence of audio features, such as MFCCs or
Mel spectrograms.
• Convolutional Layers: Local features are extracted from the audio signal by processing the
input sequence through convolutional layers.
• Self-Attention Layers: The Conformer model incorporates self-attention layers following
the convolutional layers. Self-attention is a mechanism that enables the model to focus
on various sections of the input sequence while making predictions. This is especially
advantageous for speech recognition because it facilitates capturing long-term dependencies
in the audio signal.
• Feedforward Layers: After the self-attention layers, the Conformer model applies a sequence
of feedforward layers intended to process the output of the self-attention layers further and
ready it for the ultimate prediction.
• Output Layer: Finally, the output from the feedforward layers undergoes a softmax activation
function to generate the final prediction, typically representing a sequence of character
labels or phonemes.
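To make the structure above concrete, the following is a minimal sketch of a single Conformer block (two "macaron" half-step feed-forward modules around self-attention and a convolution module); the model dimension, number of heads, kernel size, and the omission of dropout and relative positional encoding are simplifying assumptions rather than the exact configuration of [162].

```python
# Minimal sketch of a single Conformer block.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConformerBlock(nn.Module):
    def __init__(self, d_model=256, heads=4, conv_kernel=31, ffn_mult=4):
        super().__init__()
        def ffn():
            return nn.Sequential(nn.LayerNorm(d_model),
                                 nn.Linear(d_model, ffn_mult * d_model),
                                 nn.SiLU(),
                                 nn.Linear(ffn_mult * d_model, d_model))
        self.ffn1, self.ffn2 = ffn(), ffn()
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.conv_norm = nn.LayerNorm(d_model)
        self.pointwise1 = nn.Conv1d(d_model, 2 * d_model, kernel_size=1)
        self.depthwise = nn.Conv1d(d_model, d_model, conv_kernel,
                                   padding=conv_kernel // 2, groups=d_model)
        self.batch_norm = nn.BatchNorm1d(d_model)
        self.pointwise2 = nn.Conv1d(d_model, d_model, kernel_size=1)
        self.out_norm = nn.LayerNorm(d_model)

    def conv_module(self, x):                        # x: (batch, time, d_model)
        y = self.conv_norm(x).transpose(1, 2)        # convolutions act on (B, C, T)
        y = F.glu(self.pointwise1(y), dim=1)         # gated linear unit
        y = F.silu(self.batch_norm(self.depthwise(y)))
        return self.pointwise2(y).transpose(1, 2)    # back to (batch, time, d_model)

    def forward(self, x):                            # x: (batch, time, d_model)
        x = x + 0.5 * self.ffn1(x)                   # half-step feed-forward
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]   # self-attention
        x = x + self.conv_module(x)                  # convolution module
        x = x + 0.5 * self.ffn2(x)                   # second half-step feed-forward
        return self.out_norm(x)

block = ConformerBlock()
print(block(torch.randn(2, 100, 256)).shape)         # torch.Size([2, 100, 256])
```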
The conformer model has emerged as a promising neural network architecture for various
speech-related research tasks, including but not limited to speech recognition, speaker recognition,
and language identification. In a recent study by Gulati et al. [162], the conformer model was
demonstrated to significantly outperform previous state-of-the-art models, particularly in speech
recognition. This highlights the potential of the conformer model as a key tool for advancing
speech-related research.
3.4.1 Application
The Conformer model stands out among other speech recognition models due to its ability to
efficiently model both local and global dependencies of an audio sequence. This is crucial for speech
recognition, language translation, and audio classification [1, 2, 162]. The model achieves this
through self-attention and convolution modules, combining the strengths of CNNs and Trans-
formers. While CNNs capture local information in audio sequences, the self-attention mechanism
captures global dependencies [2]. The Conformer model has achieved remarkable performance in
speech recognition tasks, setting benchmarks on datasets such as LibriSpeech and AISHELL-1.
Despite these successes, speech synthesis and recognition challenges persist, including difficulties
generating natural-sounding speech in non-English languages and real-time speech generation.
To address these limitations, Wang et al. [658] proposed a novel approach that combines noisy
student training with SpecAugment and large Conformer models pre-trained on the Libri-Light
dataset using the wav2vec 2.0 pre-training method. This approach achieved state-of-the-art word
error rates on the LibriSpeech dataset. Recently, Wang et al. [575] developed Conformer-LHUC,
an extension of the Conformer model that employs learning hidden unit contribution (LHUC) for
speaker adaptation. Conformer-LHUC has demonstrated exceptional performance in elderly speech
recognition and shows promise for the clinical diagnosis and treatment of Alzheimer’s disease.
Several enhancements have been made to the Conformer-based model to address high word
error rates without a language model, as documented in [336]. Wu [598] proposed a deep sparse
Conformer to improve its long-sequence representation capabilities. Furthermore, Burchi and
Timofte [49] have recently enhanced the noise robustness of the Efficient Conformer architecture
by processing both audio and visual modalities. In addition, models based on Conformer, such as
Transducers [252], have been adopted for real-time speech recognition [412] due to their ability to
process audio data much more quickly than conventional recurrent neural network (RNN) models.
3.5 Sequence to Sequence Models

Fig. 5. Unified formulation for sequence-to-sequence architectures in speech applications [244]: an Encoder (PreNet followed by a Main module built from bidirectional RNNs or self-attention) and a Decoder (PreNet, a Main module with source attention built from unidirectional RNNs or self-attention, and a PostNet), trained with CE/CTC losses for ASR, CE for ST, and L1/L2/BCE losses for TTS. 𝑋 and 𝑌 are the source and target sequences, respectively.
• Encoder
$$X_0 = \text{Encoder-PreNet}(X), \qquad X_e = \text{Encoder-Main}(X_0) \quad (22)$$
where $X$ is the sequence of speech features (e.g., Mel spectrogram) for ASR and ST, and a
phoneme or character sequence for TTS.
• Decoder
$$Y_0[1{:}t-1] = \text{Decoder-PreNet}(Y[1{:}t-1]), \quad Y_d[t] = \text{Decoder-Main}(X_e, Y_0[1{:}t-1]), \quad Y_{post}[1{:}t] = \text{Decoder-PostNet}(Y_d[1{:}t]) \quad (23)$$
During the training stage, the input to the decoder is the ground-truth target sequence 𝑌 [1 :
𝑡 − 1]. The Decoder-Main module produces the next target frame by
utilizing the encoded sequence 𝑋𝑒 and the target prefix
𝑌0 [1 : 𝑡 − 1]. The decoder is mostly unidirectional for sequence generation and often uses
an attention mechanism [28] to produce the output (a minimal sketch of this formulation is given below).
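The sketch below realizes Equations (22) and (23) with each sub-module reduced to a simple placeholder; all module choices (linear pre-nets, a Transformer encoder/decoder core, a linear post-net) are illustrative assumptions rather than any specific system from the literature.

```python
# Minimal sketch of the unified encoder/decoder formulation of Eq. (22)-(23).
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, in_dim=80, out_dim=80, d_model=256):
        super().__init__()
        self.encoder_prenet = nn.Linear(in_dim, d_model)
        self.encoder_main = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), 3)
        self.decoder_prenet = nn.Linear(out_dim, d_model)
        self.decoder_main = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), 3)
        self.decoder_postnet = nn.Linear(d_model, out_dim)

    def forward(self, X, Y_prev):
        X0 = self.encoder_prenet(X)                 # Eq. (22)
        Xe = self.encoder_main(X0)
        Y0 = self.decoder_prenet(Y_prev)            # Eq. (23), teacher forcing
        # Causal mask keeps the decoder unidirectional over the target prefix.
        T = Y_prev.size(1)
        mask = nn.Transformer.generate_square_subsequent_mask(T)
        Yd = self.decoder_main(Y0, Xe, tgt_mask=mask)
        return self.decoder_postnet(Yd)

model = Seq2Seq()
X = torch.randn(2, 120, 80)       # source feature sequence
Y_prev = torch.randn(2, 40, 80)   # ground-truth target prefix Y[1:t-1]
print(model(X, Y_prev).shape)     # torch.Size([2, 40, 80])
```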
Seq2seq models have been widely used in speech processing, initially based on RNNs. However,
RNNs face the challenge of processing long sequences, which can lead to the loss of the initial
context by the end of the sequence [244]. To overcome this limitation, the transformer architecture
has emerged, leveraging self-attention mechanisms to handle sequential data. The transformer has
shown remarkable performance in tasks such as ASR, ST, and speech synthesis. As a result, the use
of RNN-based seq2seq models has declined in favour of the transformer-based approach.
3.5.1 Application
Seq2seq models have been used for speech processing tasks such as voice conversion [210, 528],
speech synthesis [210, 398, 399, 567, 583], and speech recognition. The field of ASR has seen sig-
nificant progress, with several advanced techniques emerging as popular options. These include
the CTC approach, which has been further developed and improved upon through recent advance-
ments [160], as well as attention-based approaches that have also gained traction [85]. The growing
interest in these techniques has increased the use of seq2seq models in the speech community.
• Attention-based Approaches: The attention mechanism is a crucial component of sequence-
to-sequence models, allowing them to effectively weigh input acoustic features during
decoding [28, 355]. Attention-based Seq2seq models utilize previously generated output
tokens and the complete input sequence to factorize the joint probability of the target
sequence into individual time steps. The attention mechanism is conditioned on the current
decoder states and runs over the encoder output representations to incorporate information
from the input sequence into the decoder output. Incorporating attention mechanisms in
Seq2Seq models has resulted in an impressive performance in various speech processing
tasks, such as speech recognition [389, 434, 539, 591], text-to-speech [400, 491, 620], and
voice conversion [210, 528]. These models have demonstrated competitiveness with tra-
ditional state-of-the-art approaches. Additionally, attention-based Seq2Seq models have
been used for confidence estimation tasks in speech recognition, where confidence scores
generated by a speech recognizer can assess transcription quality [312]. Furthermore, these
models have been explored for few-shot learning, which has the potential to simplify the
training and deployment of speech recognition systems [183].
• Connectionist Temporal Classification: While attention-based methods create a soft align-
ment between input and target sequences, approaches that utilize CTC loss aim to maximize
log conditional likelihood by considering all possible monotonic alignments between them.
These CTC-based Seq2Seq models have delivered competitive results across various ASR
benchmarks [162, 182, 365, 524] and have been extended to other speech-processing tasks
such as voice conversion [339, 648, 655], speech synthesis [648] etc. Recent studies have
concentrated on enhancing the performance of Seq2Seq models by combining CTC with
attention-based mechanisms, resulting in promising outcomes. This combination remains a
subject of active investigation in the speech-processing domain.
3.6 Reinforcement Learning
In deep reinforcement learning (DRL), an agent learns by interacting with an environment, with states, actions, and rewards defined according to the task at hand. For instance, in ASR tasks, the environment can be composed of speech features,
the actions can be the choices of phonemes, and the reward could be the correctness of those
phonemes given the input. Audio signals are one-dimensional time-series signals that undergo
pre-processing and feature extraction procedures. Pre-processing steps include noise suppression,
silence removal, and channel equalization, improving audio signal quality and creating robust and
efficient audio-based systems. Previous research has demonstrated that pre-processing improves
the performance of deep learning-based audio systems [288].
Feature extraction is typically performed after pre-processing to convert the audio signal into
meaningful and informative features while reducing their number. MFCCs and spectrograms are
popular feature extraction choices in speech-based systems [288]. These features are then given to
the DRL agent to perform various tasks depending on the application. For instance, consider the
scenario where a human speaks to a DRL-trained machine, where the machine must act based on
features derived from audio signals.
• Value-based DRL: Given the state of the environment (𝑠), a value function 𝑄 : 𝑆 × 𝐴 → R is
learned to estimate overall future reward 𝑄 (𝑠, 𝑎) should an action 𝑎 be taken. This value
function is parameterized with deep networks like CNN, Transformers, etc.
• Policy-based DRL: As opposed to value-based RL, policy-based RL methods learn a policy
function 𝜋 : 𝑆 → 𝐴 that chooses the best possible action (𝑎) based on reward.
• Model-based DRL: Unlike the previous two approaches, model-based RL learns the dynamics
of the environment in terms of the state transition probabilities, i.e., a function 𝑀 : 𝑆 × 𝐴 ×
𝑆 → R. Given such a model, policy, or value functions are optimized.
3.6.2 Application
In speech-related research, deep reinforcement learning can be used for several purposes, including:
Speech recognition and Emotion modeling. Deep reinforcement learning (DRL) can be
used to train speech recognition systems [88, 89, 231, 451, 534] to transcribe speech accurately. In
this case, the system receives an audio input and outputs a text sequence corresponding to the
spoken words. The environmental states might be learned from the input audio features. The actions
might be the generated phonemes. The reward could be the similarity between the generated and
gold phonemes, quantified in edit distance. Several works have also achieved promising results for
non-native speech recognition [446].
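The following is a minimal sketch of a policy-gradient (REINFORCE) update for a toy frame-wise phoneme-selection task, mirroring the state/action/reward mapping described above; the feature dimension, phoneme inventory, and the simplified per-frame reward (a stand-in for a sequence-level edit-distance reward) are illustrative assumptions.

```python
# Minimal REINFORCE sketch: states = acoustic features, actions = phonemes,
# reward = correctness of the chosen phoneme.
import torch
import torch.nn as nn

n_phonemes, feat_dim = 40, 80
policy = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                       nn.Linear(128, n_phonemes))
opt = torch.optim.Adam(policy.parameters(), lr=1e-4)

states = torch.randn(50, feat_dim)          # 50 frames of acoustic features
gold = torch.randint(0, n_phonemes, (50,))  # reference phoneme per frame

logits = policy(states)
dist = torch.distributions.Categorical(logits=logits)
actions = dist.sample()                     # sampled phoneme per frame

rewards = (actions == gold).float()         # +1 if correct, 0 otherwise
baseline = rewards.mean()                   # simple variance-reduction baseline

# Policy-gradient loss: encourage actions that earned above-baseline reward.
loss = -((rewards - baseline) * dist.log_prob(actions)).mean()
opt.zero_grad()
loss.backward()
opt.step()
```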
DRL pre-training has shown promise in reducing training time and enhancing performance
in various Human-Computer Interaction (HCI) applications, including speech recognition [451].
Recently, researchers have suggested using a reinforcement learning algorithm to develop a Speech
Enhancement (SE) system that effectively improves ASR systems. However, ASR systems are often
complicated and composed of non-differentiable units, such as acoustic and language models.
Therefore, the ASR system’s recognition outcomes should be employed to establish the objective
function for optimizing the SE model. Beyond ASR and SE, some studies have also focused on speech emotion recognition (SER)
using DRL algorithms [243, 282, 452].
Speaker identification. Similarly, for speaker identification tasks, the actions can be the
speaker’s choices, and a binary reward can be the correctness of choice.
Speech synthesis and coding. Likewise, the states can be the input text, the actions can
be the generated audio, and the reward could be the similarity between the gold and generated
mel-spectrogram.
Fig. 6. A standard experimental pipeline for GCNs, which embeds the graph node and edge features,
applies several GNN layers to compute convolutional features, and finally makes node-, edge-, or
graph-level predictions with a task-specific MLP layer.
Deep reinforcement learning has several advantages over traditional machine learning techniques.
It can learn from raw data without needing hand-engineered features, making it more flexible
and adaptable. It can also learn from feedback, making it more robust and able to handle noisy
environments.
However, deep reinforcement learning also has some challenges that must be addressed. It
requires a lot of data to train and can be computationally expensive. It also requires careful selection
of the reward function to ensure that the system learns the desired behavior.
3.7 Graph Neural Network
Graph Representation of Speech Data. The first step in using GNNs for speech processing
is representing the speech data as a graph. One way to do this is to represent the speech signal as a
sequence of frames, each representing a short audio signal segment. We can then represent each
frame as a node in the graph, with edges connecting adjacent frames.
Graph Convolutional Layers. Once the speech data is represented as a graph, we can use
graph convolutional layers to learn representations of the graph nodes. Graph convolutional layers
are similar to traditional ones, but instead of operating on a grid-like structure, they operate on
graphs. These layers learn to aggregate information from neighboring nodes to update the features
of each node.
Graph Attention Layers. Graph attention layers can be combined with graph convolutional
layers to give more importance to certain nodes in the graph. Graph attention layers learn to assign
weights to neighbor nodes based on their features, which can help capture important patterns in
speech data. Several works have used graph attention layers for neural speech synthesis [338] or
speaker verification [227] and diarization [277].
Recurrent Layers. Recurrent layers can be used in GNNs for speech processing to capture
temporal dependencies between adjacent frames in the audio signal. Recurrent layers allow the
network to maintain an internal state that carries information from previous time steps, which can
be useful for modeling the dynamics of speech signals.
Output Layers. The output layer of a GNN for speech processing can be a classification layer
that predicts a label for the speech data (e.g., phoneme or word) or a regression layer that predicts
a continuous value (e.g., pitch or loudness). The output layer can be a traditional fully connected
layer or a graph pooling layer that aggregates information from all the nodes in the graph.
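As a concrete illustration of the components above, the following minimal sketch applies a single graph-convolution layer to a frame-level speech graph in which each frame is a node connected to its neighbouring frames; the propagation rule H' = sigma(A_hat H W) and all sizes are illustrative assumptions (in practice a library such as PyTorch Geometric would typically be used).

```python
# Minimal graph-convolution sketch over a chain graph of speech frames.
import torch
import torch.nn as nn

num_frames, feat_dim, hidden = 100, 80, 64
H = torch.randn(num_frames, feat_dim)            # node features (one per frame)

# Adjacency: self-loops plus edges between adjacent frames.
A = torch.eye(num_frames)
idx = torch.arange(num_frames - 1)
A[idx, idx + 1] = 1.0
A[idx + 1, idx] = 1.0

# Symmetric normalization A_hat = D^{-1/2} A D^{-1/2}.
deg_inv_sqrt = A.sum(dim=1).pow(-0.5)
A_hat = deg_inv_sqrt.unsqueeze(1) * A * deg_inv_sqrt.unsqueeze(0)

W = nn.Linear(feat_dim, hidden, bias=False)
H_next = torch.relu(A_hat @ W(H))                # aggregate neighbour information
print(H_next.shape)                              # torch.Size([100, 64])
```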
3.7.2 Application
The advantages of using GNNs for speech processing tasks include their ability to represent
the dependencies and interrelationships between various entities, which is suitable for speech
processing tasks such as speaker diarization [499, 500, 571], speaker verification [228, 494], speech
synthesis [338, 520, 521], or speech separation [558, 576], which require the analysis of complex
data representations. GNNs retain a state representing information from their neighborhood with
arbitrary depth, unlike standard neural networks. For example, GNNs can model the relationship
between phonemes and words, learning to recognize words in spoken language by treating
the phoneme sequence as a graph. GNNs can also be used to model the relationship between
different acoustic features, such as pitch, duration, and amplitude, in speech signals, improving
speech recognition accuracy.
GNNs have shown promising results in multichannel speech enhancement, where they are used
for extracting clean speech from noisy mixtures captured by multiple microphones [542]. The
authors of a recent study [391] propose a novel approach to multichannel speech enhancement by
combining Graph Convolutional Networks (GCNs) with spatial filtering techniques such as the
Minimum Variance Distortionless Response (MVDR) beamformer. The algorithm aims to extract
speech and noise from noisy signals by computing the Power Spectral Density (PSD) matrices of
the noise and the speech signal of interest and then obtaining optimal weights for the beamformer using a time-frequency mask. The proposed method combines the MVDR beamformer with a super-Gaussian joint maximum a posteriori (SGJMAP) based SE gain function and a GCN-based separation network: the SGJMAP-based gain function enhances the speech signals, while the GCN-based separation network further separates the speech from the noise.
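As a rough illustration of the spatial-filtering side of such a pipeline (not of the GCN separation network or the SGJMAP gain function themselves), the sketch below estimates speech and noise PSD matrices from a time-frequency mask and derives mask-based MVDR-style beamforming weights for a reference microphone; in a full system the mask would come from the learned separation network, and all names and shapes here are assumptions.

```python
import torch

def mask_based_mvdr(stft: torch.Tensor, speech_mask: torch.Tensor, ref_mic: int = 0) -> torch.Tensor:
    """Mask-based MVDR beamforming for one frequency bin.

    stft:        (mics, frames) complex STFT of the multichannel mixture at one frequency.
    speech_mask: (frames,) values in [0, 1] estimating how much each frame is speech.
    Returns the enhanced single-channel STFT (frames,).
    """
    noise_mask = 1.0 - speech_mask
    # Power spectral density matrices as mask-weighted averages of outer products.
    psd_s = (speech_mask * stft) @ stft.conj().T / speech_mask.sum().clamp(min=1e-6)
    psd_n = (noise_mask * stft) @ stft.conj().T / noise_mask.sum().clamp(min=1e-6)
    numerator = torch.linalg.solve(psd_n, psd_s)                        # Phi_n^{-1} Phi_s
    weights = numerator[:, ref_mic] / torch.diagonal(numerator).sum()   # MVDR weights
    return weights.conj() @ stft                                        # beamformed output

# Toy usage: 4 microphones, 200 frames at a single frequency bin.
mix = torch.randn(4, 200, dtype=torch.complex64)
mask = torch.rand(200)
enhanced = mask_based_mvdr(mix, mask)
```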
Fig. 7. The Diffusion Probabilistic Model is a generative model that progressively transforms a noise distribu-
tion into the target data distribution through a series of diffusion steps, where the noise level decreases as
the process continues. The model is trained by maximizing the likelihood of the data distribution and can be
used for tasks such as speech synthesis, enhancement, and denoising.
At every time step $t$, $q(x_t \mid x_{t-1}) := \mathcal{N}\big(x_t; \sqrt{1-\beta_t}\, x_{t-1}, \beta_t \mathbf{I}\big)$, where $\{\beta_t \in (0,1)\}_{t=1}^{T}$. As the forward process progresses, the data sample $x_0$ gradually loses its distinguishable features, and as $T \to \infty$, $x_T$ approaches a standard Gaussian distribution.
Reverse diffusion process. The reverse diffusion process is defined by a Markov chain from $x_T \sim \mathcal{N}(0, \mathbf{I})$ to $x_0$ and parameterized by $\theta$:
$$p_\theta(x_0, \ldots, x_{T-1} \mid x_T) = \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t) \qquad (25)$$
where $x_T \sim \mathcal{N}(0, \mathbf{I})$ and the transition probability $p_\theta(x_{t-1} \mid x_t)$ is learned through noise estimation. This process eliminates the Gaussian noise added in the forward diffusion process.
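A minimal sketch of how these two processes are typically trained in practice is given below, assuming the common parameterization in which a network $\epsilon_\theta$ is trained to predict the injected noise; the noise schedule, tensor shapes, and placeholder network are illustrative assumptions.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)                 # illustrative linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)        # \bar{alpha}_t = prod_s (1 - beta_s)

def forward_diffuse(x0: torch.Tensor, t: torch.Tensor):
    """Sample x_t ~ q(x_t | x_0) in closed form and return the injected noise."""
    noise = torch.randn_like(x0)
    a_bar = alphas_bar[t].view(-1, *([1] * (x0.dim() - 1)))
    xt = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
    return xt, noise

def training_step(eps_model, x0: torch.Tensor) -> torch.Tensor:
    """Noise-estimation objective: the network learns to predict the added noise,
    which is what parameterizes the reverse transitions p_theta(x_{t-1} | x_t)."""
    t = torch.randint(0, T, (x0.size(0),))
    xt, noise = forward_diffuse(x0, t)
    return torch.mean((eps_model(xt, t) - noise) ** 2)

# Toy usage with a stand-in noise estimator over waveform segments.
eps_model = lambda xt, t: torch.zeros_like(xt)        # placeholder network
loss = training_step(eps_model, torch.randn(8, 16000))
```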
3.8.1 Application
Diffusion models have emerged as a leading approach for generating high-quality speech in recent
years [67, 204, 218, 269, 431, 432]. These non-autoregressive models transform white noise signals
into structured waveforms via a Markov chain with a fixed number of steps. One such model,
FastDiff, has achieved impressive results in high-quality speech synthesis [204]. By leveraging a
stack of time-aware diffusion processes, FastDiff can generate high-quality speech samples 58 times
faster than real-time on a V100 GPU, making it practical for speech synthesis deployment for the
first time. It also outperforms other competing methods in end-to-end text-to-speech synthesis.
Another powerful diffusion probabilistic model proposed for audio synthesis is DiffWave [269]. It is
non-autoregressive and generates high-fidelity audio for different waveform generation tasks, such
as neural vocoding conditioned on mel spectrogram, class-conditional generation, and unconditional
generation. DiffWave delivers speech quality on par with the strong WaveNet vocoder [402] while
synthesizing audio much faster.
Diffusion models have shown great promise in speech processing, particularly in speech en-
hancement [347, 348, 440, 487]. Recent advances in diffusion probabilistic models have led to the
development of a new speech enhancement algorithm that incorporates the characteristics of the
noisy speech signal into the diffusion and reverse processes [349]. This new algorithm is a gener-
alized form of the probabilistic diffusion model, known as the conditional diffusion probabilistic
model. During its reverse process, it can adapt to non-Gaussian real noises in the estimated speech
signal. In addition, Qiu et al. [440] propose SRTNet, a novel method for speech enhancement that
uses the diffusion model as a module for stochastic refinement. The proposed method comprises a
joint network of deterministic and stochastic modules, forming the “enhance-and-refine” paradigm.
The paper also includes a theoretical demonstration of the proposed method’s feasibility and
presents experimental results to support its effectiveness.
Fig. 9. $x$-vector model architecture. $x_1, x_2, \ldots, x_T$ are the spectral features, such as Mel spectrograms, of the speech utterance; frame-level layers process these features, a statistics-pooling layer aggregates them into a segment-level representation, and fully connected hidden layers then produce the segment-level $x$-vector.
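The statistics-pooling step in Figure 9, which converts a variable number of frame-level features into a fixed segment-level representation, can be sketched as follows; concatenating the per-dimension mean and standard deviation is the usual choice, and the surrounding TDNN layers are omitted here.

```python
import torch

def statistics_pooling(frame_feats: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Aggregate frame-level features (batch, frames, dim) into a fixed-size
    segment-level vector by concatenating the mean and standard deviation."""
    mean = frame_feats.mean(dim=1)
    std = (frame_feats.var(dim=1, unbiased=False) + eps).sqrt()
    return torch.cat([mean, std], dim=-1)              # (batch, 2 * dim)

# Toy usage: utterances of 300 frames with 512-dimensional frame-level features.
segment_vec = statistics_pooling(torch.randn(4, 300, 512))   # (4, 1024)
```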
Table 1. The table summarizes various loss functions used in training speaker recognition models, including their formulations [91].
Fig. 10. Overview of the difference between probabilistic latent variable models and self-supervised learning. In latent variable models, the functions $f(\cdot)$ and $g(\cdot)$ learn the parameters of the distributions $p$ and $q$, and the latent variable $z$ is used for representation learning.
Probabilistic latent variable models provide a powerful way to learn a representation that captures
the underlying relationships between observed and unobserved variables, without requiring explicit
supervision or labels. These models involve unobserved latent variables that must be inferred from
the observed data, typically using probabilistic inference techniques such as Markov Chain Monte
Carlo (MCMC) methods. In the context of representation learning, variational autoencoders (VAEs) are a commonly used class of latent variable models for various speech processing tasks, leveraging the power of probabilistic modeling to capture complex patterns in speech data.
the deviation between the model's prediction $f_\theta(x)$ and the ground-truth label $y$. The expected supervised loss can be expressed as $\mathcal{L}_{sup} = \mathbb{E}_{(x,y)}\big[\ell\big(f_\theta(x), y\big)\big]$, where $\ell(\cdot,\cdot)$ is a per-example loss such as cross-entropy. The unsupervised term is computed from $p_\theta(y_j \mid x_i)$, the predicted probability of the $j$-th label for the unlabelled data point $x_i$ (for instance, by minimizing the entropy of these predictions). Finally, the overall objective function for semi-supervised learning can be expressed as $\mathcal{L} = \mathcal{L}_{sup} + \alpha \mathcal{L}_{unsup}$, where $\alpha$ is a hyperparameter that controls the weight of the unsupervised loss term, and the goal is to find the optimal parameters $\theta$ that minimize this objective. Semi-supervised learning thus learns a model from both labelled and unlabelled data by minimizing a combination of supervised and unsupervised loss terms. By leveraging the additional unlabelled data, it can improve the generalization and performance of the model in downstream tasks.
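A minimal sketch of this combined objective is shown below, assuming cross-entropy for the supervised term and entropy minimization over the model's own predictions for the unsupervised term (one common instantiation; consistency- or pseudo-label-based losses are equally valid choices for $\mathcal{L}_{unsup}$).

```python
import torch
import torch.nn.functional as F

def semi_supervised_loss(model, x_labelled, y_labelled, x_unlabelled, alpha: float = 0.5):
    """L = L_sup + alpha * L_unsup with cross-entropy on labelled data and
    entropy minimization on the model's predictions for unlabelled data."""
    # Supervised term: standard cross-entropy against ground-truth labels.
    sup_loss = F.cross_entropy(model(x_labelled), y_labelled)

    # Unsupervised term: encourage confident (low-entropy) predictions on unlabelled data.
    probs = F.softmax(model(x_unlabelled), dim=-1)
    unsup_loss = -(probs * probs.clamp(min=1e-8).log()).sum(dim=-1).mean()

    return sup_loss + alpha * unsup_loss

# Toy usage with a linear classifier over 40-dimensional acoustic features.
model = torch.nn.Linear(40, 10)
loss = semi_supervised_loss(
    model, torch.randn(16, 40), torch.randint(0, 10, (16,)), torch.randn(64, 40)
)
```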
Semi-supervised learning techniques are increasingly being employed to enhance the perfor-
mance of DNNs across a range of downstream tasks in speech processing, including ASR, TTS, etc.
The primary objective of such approaches is to leverage large unlabelled datasets to augment the
performance of supervised tasks that rely on labelled datasets. The recent advancements in speech
recognition have led to a growing interest in the integration of semi-supervised learning methods
to improve the performance of ASR and TTS systems [34, 89, 229, 605, 657, 658]. This approach is
particularly beneficial in scenarios where labelled data is scarce or expensive to acquire. In fact,
for many languages around the globe, labelled data for training ASR models are often inadequate,
making it challenging to achieve optimal results. Thus, using a semi-supervised learning model
trained on abundant resource data can offer a viable solution that can be readily extended to
low-resource languages.
Semi-supervised learning has emerged as a valuable tool for addressing the challenges of insuf-
ficient annotations and poor generalization [165]. Research in various domains, including image
quality assessment [341], has demonstrated that leveraging both labelled and unlabelled data
through semi-supervised learning can lead to improved performance and generalization. In the
domain of speech quality assessment, several studies [488] have exploited the generalization
capabilities of semi-supervised learning to enhance performance.
Moreover, semi-supervised learning has gained significant attention in other areas of speech
processing, such as end-to-end speech translation [428]. By leveraging large amounts of unlabelled
data, semi-supervised learning approaches have demonstrated promising results in improving
the performance and robustness of speech translation models. This highlights the potential of
semi-supervised learning to address the limitations of traditional supervised learning approaches
in a variety of speech processing tasks.
The detailed architecture of generative models, with three different variants, is shown in Figure 11. The earliest self-supervised method of this kind, predicting masked inputs from surrounding data, originated in the text domain in 2013 with word2vec. The continuous bag-of-words (CBOW) variant of word2vec predicts a central word from its neighbors, resembling the masked language modeling (MLM) of ELMo and BERT. These non-autoregressive generative approaches differ in their use of more advanced structures, such as bidirectional LSTMs (ELMo) and transformers (BERT), with recent models producing contextual embeddings. In the speech domain, Mockingjay [330] applies masking to all feature dimensions, whereas TERA [329] applies masking only to a subset of feature dimensions. A summary of generative self-supervised approaches, along with the data used for training the models, is given in Table 2. We further discuss the different generative approaches highlighted in Figure 11 as follows:
Table 2. Summary of generative self-supervised approaches and proposed models for speech processing with associated metrics and training data. ASR: Automatic Speech Recognition, PR: Phoneme Recognition, PC: Phoneme Classification, SR: Speaker Recognition, LS: LibriSpeech.
| Model | Task (Metric) | Pre-Training Dataset (hours) | Training | Test |
| Mockingjay [330] | PC | LS (360h) | LS (360h) | LS (test-clean) |
| Mockingjay [330] | SR | LS (360h) | LS (100h) | LS (100h) |
| PASE [416] | ASR | LS (50h) | DIRHA | DIRHA |
| PASE+ [456] | ASR | LS (50h) | DIRHA, CHiME-5 | DIRHA, CHiME-5 |
| DeCoAR [326] | ASR | LS (100h, 360h, 460h, 960h), WSJ si284 | LS (100h, 360h, 460h, 960h), WSJ si284 | LS (test-clean), LS (test-other) |
has explored similar pretext tasks for speech representation learning that help models
develop contextualized representations capturing information from the entire input, like
the DeCoAR model [326]. This approach assists the model in comprehending input data
better, leading to more precise and informative representations.
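As a rough sketch of the masked-prediction pretext task used by generative models such as Mockingjay, TERA, and DeCoAR, the snippet below masks random frames and reconstructs them; the masking ratio, the identity stand-in for the encoder, and the L1 loss are simplified assumptions.

```python
import torch

def masked_reconstruction_loss(encoder, feats: torch.Tensor, mask_prob: float = 0.15) -> torch.Tensor:
    """Generative SSL pretext task: zero out random frames of the input features
    and train the encoder to reconstruct the original frames at masked positions.

    feats: (batch, frames, dim) spectral features, e.g. log-mel frames.
    """
    mask = torch.rand(feats.shape[:2]) < mask_prob          # (batch, frames) boolean
    corrupted = feats.masked_fill(mask.unsqueeze(-1), 0.0)  # hide the selected frames
    reconstructed = encoder(corrupted)                      # (batch, frames, dim)
    # L1 reconstruction loss computed only over the masked frames.
    return (reconstructed - feats).abs()[mask].mean()

# Toy usage with an identity "encoder" as a stand-in for a bidirectional LSTM/transformer.
encoder = torch.nn.Identity()
loss = masked_reconstruction_loss(encoder, torch.randn(8, 200, 80))
```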
4.4.2 Contrastive Models
The technique involves training a model to differentiate between similar and dissimilar pairs of data
samples, which helps the model acquire valuable representations that can be utilized for various
tasks, as shown in Figure 12. The fundamental principle of contrastive learning is to generate
positive and negative pairs of training samples based on the comprehension of the data. The model
must learn a function that assigns high similarity scores to two positive samples and low similarity
scores to two negative samples. Therefore, generating appropriate samples is crucial for ensuring
that the model comprehends the fundamental features and structures of the data. Table 3 outlines
popular contrastive self-supervised models used for different speech-processing tasks. We discuss
Wav2Vec 2.0 since it has achieved state-of-the-art results in different downstream tasks.
• Wav2Vec 2.0 [26] is a framework for self-supervised learning of speech representations that
is one of the current state-of-the-art models for ASR [26]. The training of the model occurs
in two stages. Initially, the model operates in a self-supervised mode during the first phase,
where it uses unlabelled data and aims to achieve the best speech representation possible.
The second phase is fine-tuning a particular dataset for a specific purpose. Wav2Vec 2.0
takes advantage of self-supervised training and uses convolutional layers to extract features
from raw audio.
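The contrastive objective behind such models can be sketched as an InfoNCE-style loss in which, for each anchor, the matching representation is the positive and the other items in the batch act as negatives. The snippet below is a generic instantiation, not the exact wav2vec 2.0 objective, which additionally involves quantized targets and a codebook diversity term.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(context: torch.Tensor, targets: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Contrastive loss over a batch of (context, target) representation pairs.

    context, targets: (batch, dim). Row i of `targets` is the positive for row i of
    `context`; all other rows act as negatives.
    """
    context = F.normalize(context, dim=-1)
    targets = F.normalize(targets, dim=-1)
    logits = context @ targets.T / temperature           # scaled cosine similarities
    labels = torch.arange(context.size(0))               # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

# Toy usage: 32 masked positions with 256-dimensional context and target vectors.
loss = info_nce_loss(torch.randn(32, 256), torch.randn(32, 256))
```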
In the speech field, researchers have explored different approaches to avoid overfitting, including
augmentation techniques like Speech SimCLR [220] and the use of positive and negative pairs
through methods like Contrastive Predictive Coding (CPC) (Ooster and Meyer [404]), Wav2vec
(v1, v2.0) (Schneider et al. [483]), VQ-wav2vec (Baevski et al. [25]), and Discrete BERT [23].
Table 3. Summary of contrastive self-supervised approaches and proposed models for speech processing with associated metrics and training data. ASR: Automatic Speech Recognition, PR: Phoneme Recognition, PC: Phoneme Classification, SR: Speaker Recognition, LS: LibriSpeech, LL: LibriLight, WSJ: Wall Street Journal.
| Model | Task | Pre-Training Dataset (hours) | Training | Test |
| CPC [403] | PC | LS (100h) | LS (100h) | LS (100h) |
| CPC [403] | SR | LS (100h) | LS (100h) | LS (100h) |
| Modified CPC [465] | PC | LS (100h, 360h), Zerospeech2017 (45h) | CV-Dataset | CV-Dataset |
| Bidirectional CPC [247] | ASR | WSJ (80h), LS (960h), TIMIT (5h), SSA (1h), TED3 (440h), SwitchBoard (310h), Audio Set (2500h) | WSJ (80h), LS (960h), TIMIT (5h), SSA (1h), TED3 (440h), SwitchBoard (310h), Audio Set (2500h) | WSJ (test92, test93), LS (test-clean, test-other), TED3 (dev, test), SwitchBoard (eval2000) |
| Bidirectional CPC [247] | ASR-Multi | AVSpeech (3100h), CV-Dataset (430h) | AVSpeech (3100h), CV-Dataset (430h) | OpenSLR, ALFFA |
| wav2vec [483] | ASR | LS (80/860h), LS (960h) + WSJ (si284) | WSJ (si284) | WSJ (eval92) |
| wav2vec [483] | PR | TIMIT | TIMIT | TIMIT |
| wav2vec 2.0 [26] | ASR | LS (960h), LL (60000h) | LS (960h) | LS (test-clean), LS (test-other) |
| wav2vec 2.0 [26] | PR | LS (960h), LL (60000h) | TIMIT | TIMIT |
| vq-wav2vec 2.0 [25] | ASR | LS (960h) | WSJ (si284) | WSJ (eval92) |
| vq-wav2vec 2.0 [25] | PR | LS (960h) | TIMIT | TIMIT |
| wav2vec-C [474] | ASR | Alexa-10k | Alexa-eval | Alexa-eval |
| w2v-BERT [96] | ASR | LL (60000h) | LS (960h) | LS (test), LS (test-other), LS (dev), LS (dev-other) |
| Speech SimCLR [220] | ASR | LS (960h), WSJ (si284), TED2 | WSJ (si284) | WSJ (si284) |
| Speech SimCLR [220] | PR | LS (960h), WSJ (si284), TED2 | TIMIT | TIMIT |
| UnSpeech [381] | ASR-Multi | LL (60000h), GigaSpeech (10000h), VP (24000h) | SUPERB | SUPERB |
In the graph field, researchers have developed approaches like Deep Graph Infomax (DGI) (Velickovic
et al., 2019 [556]) to learn representations that maximize the mutual information between local
patches and global structures while minimizing mutual information between patches of corrupted
graphs and the original graph’s global representation.
Fig. 13. Predictive Self-supervised learning: (a) Discrete BERT (b) HuBERT.
these models. In the following section, we briefly discuss three popular predictive SSRL approaches that are widely used in various downstream tasks.
• The direct application of BERT-type training to speech input presents challenges due to the
unsegmented and unstructured nature of speech. To overcome this obstacle, a pioneering
model known as Discrete BERT [23] has been developed. This model converts continuous
speech input into a sequence of discrete codes, facilitating code representation learning.
The discrete units are obtained from a pre-trained vq-wav2vec model [25], and they serve as
both inputs and targets within a standard BERT model. The architecture of Discrete BERT,
illustrated in Figure 13 (a), incorporates a softmax normalized output layer. During training,
categorical cross-entropy loss is employed, with a masked perspective of the original speech
input utilized for predicting code representations. Remarkably, the Discrete BERT model
has exhibited impressive efficacy in self-supervised speech representation learning. Even
with a mere 10-minute fine-tuning set, it achieved a Word Error Rate (WER) of 25% on
the standard test-other subset. This approach effectively tackles the challenge of directly
applying BERT-type training to continuous speech input and holds substantial potential for
significantly enhancing speech recognition accuracy.
• The HuBERT [193] and TERA [329] models are two self-supervised approaches for speech
representation learning. HuBERT uses an offline clustering step to align target labels with a
BERT-like prediction loss, with the prediction loss applied only over the masked regions
as outlined in Figure 13 (b). This encourages the model to learn a combined acoustic and
language model over the continuous inputs. On the other hand, TERA is a self-supervised
speech pre-training method that reconstructs acoustic frames from their altered counterparts
using a stochastic policy that alters them along various dimensions, including time, frequency, and magnitude. These alterations help extract feature-based speech representations that can be fine-tuned as part of downstream models.
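A minimal sketch of the HuBERT-style objective described above: frames are masked, the encoder predicts a distribution over offline cluster labels, and cross-entropy is applied only at the masked positions. The cluster count, masking ratio, and linear stand-in encoder are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def masked_cluster_prediction_loss(encoder, feats, cluster_ids, mask_prob: float = 0.08):
    """HuBERT-style predictive SSL: predict offline cluster labels at masked frames.

    feats:       (batch, frames, dim) acoustic features.
    cluster_ids: (batch, frames) integer targets from an offline k-means step.
    """
    mask = torch.rand(feats.shape[:2]) < mask_prob
    corrupted = feats.masked_fill(mask.unsqueeze(-1), 0.0)
    logits = encoder(corrupted)                       # (batch, frames, num_clusters)
    # Cross-entropy only over the masked positions, as in the HuBERT prediction loss.
    return F.cross_entropy(logits[mask], cluster_ids[mask])

# Toy usage: a linear frame classifier standing in for the transformer encoder.
encoder = torch.nn.Linear(80, 100)                    # 100 assumed k-means clusters
feats = torch.randn(4, 200, 80)
cluster_ids = torch.randint(0, 100, (4, 200))
loss = masked_cluster_prediction_loss(encoder, feats, cluster_ids)
```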
Microsoft has introduced UniSpeech-SAT [72] and WavLM [71] models, which follow the HuBERT
framework. These models have been designed to enhance speaker representation and improve vari-
ous downstream tasks. The key focus of these models is data augmentation during the pre-training
stage, resulting in superior performance. WavLM model has exhibited outstanding effectiveness in
diverse downstream tasks, such as automatic speech recognition, phoneme recognition, speaker identification, and emotion recognition. It is worth highlighting that this model currently holds the top position on the SUPERB leaderboard [615], which evaluates speech representations' performance in terms of reusability.
Table 4. Summary of predictive self-supervised approaches and proposed models for speech processing with associated metrics and training data. ASR: Automatic Speech Recognition, PR: Phoneme Recognition, PC: Phoneme Classification, SR: Speaker Recognition, LL: LibriLight, LS: LibriSpeech.
| Model | Task (Metric) | Pre-Training Dataset (hours) | Training | Test |
| BEST-RQ [78] | ASR | LL (60000h) | LS (960h) | LS (test), LS (test-other), LS (dev), LS (dev-other) |
| BEST-RQ [78] | ASR-Multi | LL (60000h), GigaSpeech (10000h), VP (24000h) | SUPERB | SUPERB |
| data2vec [24] | ASR | LS (960h) | LS (10m, 1h, 100h, 960h) | LS (960h) |
| Discrete BERT [23] | ASR | LS (960h) | LS (100h) | LS (test), LS (test-other) |
| HuBERT [625] | ASR | LS (960h), LL (60000h) | LS (960h) | LS (test), LS (test-other) |
| WavLM [71] | ASR | LL (60000h) | SUPERB | SUPERB |
Self-supervised learning has emerged as a widely adopted and effective technique for speech
processing tasks due to its ability to train models with large amounts of unlabeled data. A compre-
hensive overview of self-supervised approaches, evaluation metrics, and training data is provided
in Table 4 for speech recognition, speaker recognition, and speech enhancement. Researchers
and practitioners can use this resource to select appropriate self-supervised methods and datasets
to enhance their speech-processing systems. As self-supervised learning techniques continue to
advance and refine, we can expect significant progress and advancements in speech processing.
Table 5. Comparative analysis of speech processing datasets: This table summarizes the essential features of
different speech-processing datasets, including their typical applications in various speech-processing tasks.
ASR: Automatic Speech Recognition, PR: Phoneme Recognition. PC: Phoneme Classification, SR: Speaker
Recognition, SV: Speaker Verification, SER: Speech Emotion Recognition, IC: Intent Classification, TTS:
Text-to-Speech, VC: Voice Conversion, ST: Speech Translation, SS: Speech Separation
Table 6. Comprehensive Evaluation Metrics for Speech Processing Tasks. This table provides a comprehensive
overview of the evaluation metrics used to assess the performance of speech-based systems across various
tasks such as ASR, speaker verification, and TTS. The table highlights the specific metrics employed for each
task, along with the score range and commonly used datasets.
in ASR systems [26, 443]. This paper provides an overview of the key components involved in ASR
and highlights the role of deep learning techniques in enhancing the technology’s accuracy.
Most speech recognition systems that use deep learning aim to simplify the processing pipeline
by training a single model to directly map speech signals to their corresponding text transcriptions.
Unlike traditional ASR systems that require multiple components to extract and model features,
such as HMMs and GMMs, end-to-end models do not rely on hand-designed components [19,
305]. Instead, end-to-end ASR systems use DNNs to learn acoustic and linguistic representations
directly from the input speech signals [305]. One popular type of end-to-end model is the encoder-
decoder model with attention. This model uses an encoder network to map input audio signals to
hidden representations, and a decoder network to generate text transcriptions from the hidden
representations. During the decoding process, the attention mechanism enables the decoder to
selectively focus on different parts of the input signal [305].
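The attention step described above can be sketched with a generic dot-product score function, in which the current decoder state is compared against every encoder frame and the resulting weights form a context vector; real systems differ in the score function and in how the context is consumed.

```python
import torch
import torch.nn.functional as F

def attention_context(decoder_state: torch.Tensor, encoder_states: torch.Tensor):
    """Dot-product attention for one decoding step.

    decoder_state:  (dim,) current decoder hidden state.
    encoder_states: (frames, dim) encoder representations of the input audio.
    Returns the context vector (dim,) and the attention weights (frames,).
    """
    scores = encoder_states @ decoder_state              # similarity to every frame
    weights = F.softmax(scores, dim=0)                   # focus over input frames
    context = weights @ encoder_states                   # weighted sum of encoder states
    return context, weights

# Toy usage: 120 encoder frames with 256-dimensional representations.
context, weights = attention_context(torch.randn(256), torch.randn(120, 256))
```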
End-to-end ASR models can be trained using various techniques such as CTC [245], which is
used to train models without explicit alignment between the input and output sequences, and
RNNs, which are commonly used to model temporal dependencies in sequential data such as
speech signals. Transfer learning-based approaches can also improve end-to-end ASR performance
by leveraging pre-trained models or features [106, 327, 489]. While end-to-end ASR models have
shown promising results in various applications, there is still room for improvement to achieve
human-level performance [106, 137, 236, 237, 327, 625]. Nonetheless, deep learning-based end-to-
end ASR architecture offers a promising and efficient approach to speech recognition that can
simplify the processing pipeline and improve recognition accuracy.
5.1.2 Dataset
The development and evaluation of ASR systems are heavily dependent on the availability of large
datasets. As a result, ASR is an active area of research, with numerous datasets used for this purpose.
In this context, several popular datasets have gained prominence for use in ASR systems.
• Common Voice: Mozilla’s Common Voice project [17] is dedicated to producing an accessible,
unrestricted collection of human speech for the purpose of training speech recognition
systems. This ever-expanding dataset features contributions from more than 9,000 speakers
spanning 60 different languages.
• LibriSpeech: LibriSpeech [410] is a corpus of approximately 1,000 hours of read English
speech created from audiobooks in the public domain. It is widely used for speech recognition
research and is notable for its high audio quality and clean transcription.
• VoxCeleb: VoxCeleb [92] is a large-scale dataset containing over 1 million short audio clips
of celebrities speaking, which can be used for speaker recognition and speech recognition research.
It includes a diverse range of speakers from different backgrounds and professions.
• TIMIT: The TIMIT corpus [153] is a widely used speech dataset consisting of recordings of 630 speakers representing eight major dialects of American English, each reading ten phonetically rich sentences. It has been used as a benchmark for speech recognition
research since its creation in 1986.
• CHiME-5: The CHiME-5 dataset [33] is a collection of recordings made in a domestic
environment to simulate a real-world speech recognition scenario. It includes 6.5 hours of
audio from multiple microphone arrays and is designed to test the performance of ASR
systems in noisy and reverberant environments.
Other notable datasets include Google’s Speech Commands Dataset [589], the Wall Street Journal
dataset4 , and TED-LIUM [468].
5.1.3 Models
The use of RNN-based architecture in speech recognition has many advantages over traditional
acoustic models. One of the most significant benefits is their ability to capture long-term temporal
dependencies [244] in speech data, enabling them to model the dynamic nature of speech signals.
Additionally, RNNs can effectively process variable-length audio sequences, which is essential
in speech recognition tasks where the duration of spoken words and phrases can vary widely.
RNN-based models can efficiently identify and segment phonemes, detect and transcribe spoken
words, and can be trained end-to-end, eliminating the need for intermediate steps. These features
make RNN-based models particularly useful in real-time applications, such as speech recognition
in mobile devices or smart homes [117, 178], where low latency and high accuracy are crucial.
In the past, RNNs were the go-to model for ASR. However, their limited ability to handle long-
range dependencies prompted the adoption of the Transformer architecture. For example, in 2019,
Google’s Speech-to-Text API transitioned to a Transformer-based architecture that surpassed
the previous RNN-based model, especially in noisy environments and for longer sentences, as
reported in [651]. Additionally, Facebook AI Research introduced wav2vec 2.0, a self-supervised
learning approach that leverages a Transformer-based architecture to perform unsupervised speech
recognition. wav2vec 2.0 has significantly outperformed the previous RNN-based model and
achieved state-of-the-art results on several benchmark datasets.
The Transformer was first proposed for the ASR task in [116], where the authors add CNN layers that process the preprocessed speech features before they are passed to the transformer input. Incorporating more CNN layers makes it feasible to reduce the mismatch between the lengths of the input and output sequences, given that the number of frames in the audio exceeds the number of tokens in the text, which has a favorable impact on training. The change to the original architecture is minimal, and the model achieves a competitive word error rate (WER) of 10.9% on the Wall Street Journal (WSJ) speech recognition dataset (Table 7). Despite its numerous advantages, the Transformer in its pristine state
4 https://www.ldc.upenn.edu/
Table 7. Table summarizing the performance of different ASR models in terms of WER% on five datasets (LibriSpeech test-clean/test-other, TIMIT, Common Voice, WSJ eval92, and GigaSpeech), also highlighting the use of extra data during training. ZS stands for Zero-Shot Performance.
LibriSpeech test:
| Model | Architecture | Extra Training Data | WER% (clean) ↓ | WER% (other) ↓ |
| Conformer + Wav2vec 2.0 [658] | Conformer + wav2vec 2.0 | Y | 1.4 | 2.6 |
| w2v-BERT XXL [96] | CNN + Transformer | Y | 1.4 | 2.5 |
| SpeechStew (1B) [58] | Conformer | Y | 1.7 | 3.3 |
| SpeechStew (100M) [58] | Conformer | N | 2.0 | 4.0 |
| ContextNet + SpecAugment [413] | LSTM + CNN | Y | 1.7 | 3.4 |
| Conformer (L) [162] | Conformer | N | 1.9 | 4.1 |
| ContextNet [169] | Conformer + wav2vec 2.0 | N | 1.9 | 3.4 |
| Squeezeformer [255] | Conformer | N | 2.47 | 5.97 |
| LSTM Transducer [636] | LSTM | N | 2.23 | 5.6 |
| Transformer Transducer [331] | Transformer | N | 2.0 | 4.2 |
| Whisper [443] | – | N | 2.7 (ZS) | 5.6 (ZS) |
TIMIT:
| Model | Architecture | Extra Training Data | WER% ↓ |
| wav2vec 2.0 [26] | Transformer + CNN | Y | 8.3 |
| vq-wav2vec [25] | Transformer + CNN | Y | 11.6 |
| LSTM + Monophone Reg [455] | LSTM | N | 14.5 |
Common Voice:
| Model | Architecture | Extra Training Data | WER% ↓ |
| SpeechStew (1B) [58] | Conformer | N | 10.8 |
| Whisper [443] | – | N | 9.5 |
WSJ eval92:
| Model | Architecture | Extra Training Data | WER% ↓ |
| SpeechStew (100M) [58] | Conformer | N | 1.3 |
| tdnn+chain [433] | TDNN | N | 2.32 |
GigaSpeech:
| Model | Architecture | Extra Training Data | WER% ↓ |
| Conformer/Transformer-AED [61] | Conformer | N | 10.80 |
has several issues when applied to ASR. RNN, with its overall training speed (i.e., convergence)
and better WER because of effective joint training and decoding methods, is still the best option.
The authors in [116] propose the Speech Transformer, which has the advantage of faster iteration
time, but slower convergence compared to RNN-based ASR. However, integrating the Speech
Transformer with the naive language model (LM) is challenging. To address this issue, various
improvements in the Speech Transformer architecture have been proposed in recent years. For
example, [245] suggests incorporating the Connectionist Temporal Classification (CTC) loss into
the Speech Transformer. CTC is a popular technique used in speech recognition to align input and
output sequences of varying lengths with one-to-many or many-to-one mappings. It introduces
a blank symbol representing gaps between output symbols and computes the loss function by
summing probabilities across all possible paths. The loss function encourages the model to assign
high probabilities to correct output symbols and low probabilities to incorrect output symbols
and the blank symbol, allowing the model to predict sequences of varying lengths. The CTC loss
is commonly used with RNNs such as LSTM and GRU, which are well-suited for sequential data.
CTC loss is a powerful tool for training neural networks to perform sequence-to-sequence tasks
where the input and output sequences have varying lengths and mappings between them are not
one-to-one.
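For reference, the sketch below shows how a CTC objective is typically attached to encoder outputs using PyTorch's built-in loss; the sequence lengths and vocabulary size are illustrative.

```python
import torch

# 50 encoder time steps, batch of 4, vocabulary of 28 tokens plus a blank at index 0.
log_probs = torch.randn(50, 4, 29).log_softmax(dim=-1)     # (time, batch, vocab)
targets = torch.randint(1, 29, (4, 12))                    # token IDs, excluding blank
input_lengths = torch.full((4,), 50, dtype=torch.long)
target_lengths = torch.full((4,), 12, dtype=torch.long)

ctc = torch.nn.CTCLoss(blank=0)
# CTC marginalizes over all alignments between the 50 encoder frames and 12 tokens.
loss = ctc(log_probs, targets, input_lengths, target_lengths)
```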
Various other improvements have also been proposed to enhance the performance of Speech
Transformer architecture and integrate it with the naive language model, as the use of the trans-
former directly for ASR has not been effective in exploiting the correlation among the speech
frames. The sequence order of speech, which the recurrent processing of input features can repre-
sent, is an important distinction. Degradation in performance for long sentences when using absolute positional embeddings is reported in [85], and the problems associated with long sequences can become more acute for the transformer [672]. To address this issue, a transition was made from absolute positional encodings to relative positional embeddings [672], whereas the authors in [537] replace positional embeddings with pooling layers. In a considerably different approach, the authors in [383] propose combining positional information with the speech features by replacing positional encodings with trainable convolution layers, which further improves the stability of optimization for large-scale training of transformer networks. These works confirmed the superiority of their techniques over sinusoidal positional encoding.
In 2016, Baidu introduced a hybrid ASR model called Deep Speech 2 [13] that combines RNNs with CNNs: convolutional layers extract features from the audio signal, followed
Table 8. Comparison of performance between wav2vec2.0 Large and Whisper on different datasets. The zero-
shot Whisper model consistently outperforms wav2vec2.0 Large on several datasets, indicating significant
performance differences.
accuracy when transcribing the LibriSpeech dataset. These two cutting-edge frameworks, Wav2Vec 2.0 and Whisper, currently represent the state of the art in ASR. The Whisper model is trained on an extensive supervised dataset, including over 680,000 hours of audio data collected from the web, which has made it more resilient to various accents, background noise, and technical jargon. The Whisper model is also capable of transcribing and translating audio in multiple languages, making it a versatile tool. OpenAI has released inference models and code, laying the groundwork for the development of practical applications based on the Whisper model.
In contrast to its predecessor, Wav2Vec 2.0 is a self-supervised learning framework that trains
models on unlabeled audio data before fine-tuning them on specific datasets. It uses a contrastive
predictive coding (CPC) loss function to learn speech representations directly from raw audio data,
requiring less labeled data. The model’s performance has been impressive, achieving state-of-the-art
results on several ASR benchmarks. These advances in unsupervised pre-training techniques and
the development of novel ASR frameworks like Whisper and Wav2Vec 2.0 have greatly improved
the field of speech recognition, paving the way for new real-world applications. In summary, Table 8 highlights the varying effectiveness of the wav2vec 2.0 Large and Whisper models across different datasets.
Table 9. Exploring the Landscape of TTS and Vocoder Architectures: Autoregressive and Non-Autoregressive
Models.
5.2.2 Datasets
The field of neural speech synthesis is rapidly advancing and relies heavily on high-quality datasets
for effective training and evaluation of models. One of the most frequently utilized datasets in
this field is the LJ Speech [217], which features about 24 hours of recorded speech from a single
female speaker reading passages from the public domain LJ Speech Corpus. This dataset is free
and has corresponding transcripts, making it an excellent choice for text-to-speech synthesis tasks.
Moreover, it has been used as a benchmark for numerous neural speech synthesis models, including
Tacotron [583], WaveNet [402], and DeepVoice [18, 156].
Apart from the LJ Speech dataset, several other datasets are widely used in neural speech synthesis
research. The CMU Arctic [267] and L2 Arctic [661] datasets contain recordings of English speakers
with diverse accents reading passages designed to capture various phonetic and prosodic aspects of
speech. The LibriSpeech [410], VoxCeleb [92], TIMIT Acoustic-Phonetic Continuous Speech Corpus
[153], and Common Voice Dataset [17] are other valuable datasets that offer ample opportunities
for training and evaluating text-to-speech synthesis models.
5.2.3 Models
Neural network-based text-to-speech (TTS) systems have been proposed using neural networks
as the basis for speech synthesis, particularly with the emergence of deep learning. In Statistical
Parametric Speech Synthesis (SPSS), early neural models replaced HMMs for acoustic modeling.
The first modern neural TTS model, WaveNet [402], generated waveforms directly from linguistic
features. Other models, such as DeepVoice 1/2 [18, 156], used neural network-based models to follow
the three components of statistical parametric synthesis. End-to-end models, including Tacotron
1 & 2 [491, 583], Deep Voice 3, and FastSpeech 1 & 2 [458, 460], simplified text analysis modules
and utilized mel-spectrograms to simplify acoustic features with character/phoneme sequences
as input. Fully end-to-end TTS systems, such as ClariNet [425], FastSpeech 2 [458], and EATS
[114], are capable of directly generating waveforms from text inputs. Compared to concatenative
synthesis 7 and statistical parametric synthesis, neural network-based speech synthesis offers
several advantages including superior voice quality, naturalness, intelligibility, and reduced reliance
on human preprocessing and feature development. Therefore, end-to-end TTS systems represent a
promising direction for advancing the field of speech synthesis.
7 https://en.wikipedia.org/wiki/Concatenative_synthesis
Transformer models have become increasingly popular for generating mel-spectrograms in TTS
systems [309, 458]. These models are preferred over RNN structures in end-to-end TTS systems
because they improve training and inference efficiency [309, 460]. In a study conducted by Li
et al. [309], a multi-head attention mechanism replaced both RNN structures and the vanilla
attention mechanism in Tacotron 2 [491]. This approach addressed the long-distance dependency
problem and improved parallelization. Phoneme sequences were used as input to generate the mel-
spectrogram, and speech samples were synthesized using WaveNet as a vocoder. Results showed
that the transformer-based TTS approach was 4.25 times faster than Tacotron 2 and achieved
similar MOS (Mean Opinion Score) performance.
Aside from the work mentioned above, there are other studies that are based on the Tacotron
architecture. For example, Skerry-Ryan et al. [503] and Wang et al. [584] proposed Tacotron-based
models for prosody control. These models use a separate encoder to compute style information from
reference audio that is not provided in the text. Another noteworthy work is the Global-style-Token
(GST) [584] which improves on style embeddings by adding an attention layer to capture a wider
range of acoustic styles.
The FastSpeech [460] algorithm aims to improve the inference speed of TTS systems. To achieve
this, it utilizes a feedforward network based on 1D convolution and the self-attention mechanism in
transformers to generate Mel-spectrograms in parallel. Additionally, it solves the issue of sequence
length mismatch between the Mel-spectrogram sequence and its corresponding phoneme sequence
by employing a length regulator based on a duration predictor. The FastSpeech model was evaluated
on the LJSpeech dataset and demonstrated significantly faster Mel-spectrogram generation than
the autoregressive transformer model while maintaining comparable performance. FastPitch builds
on FastSpeech by conditioning the TTS model on fundamental frequency or pitch contour, which
improves convergence and eliminates the need for knowledge distillation of Mel-spectrogram
targets in FastSpeech.
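The length regulator at the heart of these models can be sketched as follows: each phoneme-level hidden state is repeated according to its (predicted or ground-truth) duration so that the expanded sequence matches the length of the target mel-spectrogram; the duration values and dimensions here are illustrative.

```python
import torch

def length_regulator(phoneme_hidden: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """Expand phoneme-level hidden states to frame level by repeating each state
    `durations[i]` times, aligning the sequence with the target mel-spectrogram.

    phoneme_hidden: (num_phonemes, dim), durations: (num_phonemes,) integer frame counts.
    """
    return torch.repeat_interleave(phoneme_hidden, durations, dim=0)

# Toy usage: 5 phonemes expanded to 2 + 4 + 3 + 5 + 1 = 15 mel frames.
hidden = torch.randn(5, 256)
durations = torch.tensor([2, 4, 3, 5, 1])
frame_level = length_regulator(hidden, durations)      # (15, 256)
```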
FastSpeech 2 [458] represents a transformer-based Text-to-Speech (TTS) system that addresses
the limitations of its predecessor, FastSpeech, while effectively handling the challenging one-to-
many mapping problem in TTS. It introduces the utilization of a broader range of speech information,
including energy, pitch, and more accurate duration, as conditional inputs. Furthermore, FastSpeech
2 trains the system directly on a ground-truth target, enhancing the quality of the synthesized speech.
Additionally, a simplified variant called FastSpeech 2s has been proposed in [61], eliminating the
requirement for intermediate Mel-spectrograms and enabling the direct generation of speech from
text during inference. Experimental evaluations conducted on the LJSpeech dataset demonstrated
that both FastSpeech 2 and FastSpeech 2s offer a streamlined training pipeline, resulting in fast,
robust, and controllable speech synthesis compared to FastSpeech.
Furthermore, in addition to the transformer-based TTS systems like FastSpeech 2 and FastSpeech
2s, researchers have also been exploring the potential of Variational Autoencoder (VAE) based TTS
models [163, 196, 251, 296]. These models can learn a latent representation of speech signals from
textual input and may be able to produce high-quality speech with less training data and greater
control over the generated speech characteristics. For example, authors in [251] used a conditional
variational autoencoder (CVAE) to model the acoustic features of speech and an adversarial loss to
improve the naturalness of the generated speech. This approach involved conditioning the CVAE
on the linguistic features of the input text and using an adversarial loss to match the distribution of
the generated speech to that of natural speech. Results from this method have shown promise in
generating speech that exhibits natural prosody and intonation.
WaveGrad [67] and DiffWave [269] have emerged as significant contributions in the field, employing diffusion models to generate raw waveforms with exceptional performance. In contrast, GradTTS [431] and DiffTTS [218] utilize diffusion models to generate mel features rather than raw waveforms.
Fig. 14. Neural text-to-speech (TTS) pipeline: a diagram showing the main modules of a typical TTS system. The system takes text input and processes it through various stages to generate speech output. The text analysis module tokenizes the input text and generates linguistic features such as phonemes and prosody. The acoustic model module then converts these linguistic features into acoustic features, such as mel spectrograms, using a neural network. Finally, the waveform generation module synthesizes the speech waveform from the acoustic features using another neural network.
Addressing the intricate challenge of one-shot many-to-many voice conversion, DiffVC
[432] introduces a novel solver based on stochastic differential equations. Expanding the scope of
sound generation to include singing voice synthesis, DiffSinger [334] introduces a shallow diffusion
mechanism. Additionally, Diffsound [611] proposes a sound generation framework that incorporates
text conditioning and employs a discrete diffusion model, effectively resolving concerns related to
unidirectional bias and accumulated errors.
EdiTTS [525] introduces a diffusion-based audio model that is specifically tailored for the text-
to-speech task. Its innovative approach involves the utilization of the denoising reversal process to
incorporate desired edits through coarse perturbations in the prior space. Similarly, Guided-TTS
[249] and Guided-TTS2 [257] stand as early text-to-speech models that have effectively harnessed
diffusion models for sound generation. Furthermore, Levkovitch et al. [301] have made notable
contributions by combining a voice diffusion model with a spectrogram domain conditioning
technique. This combined approach facilitates text-to-speech synthesis, even with previously
unseen voices during the training phase, thereby enhancing the model’s versatility and capabilities.
InferGrad [74] enhances the diffusion-based text-to-speech model by incorporating the inference
process during training, particularly when a limited number of inference steps are available. This
improvement results in faster and higher-quality sampling. SpecGrad [264] introduces adaptations
to the time-varying spectral envelope of diffusion noise based on conditioning log-mel spectrograms,
drawing inspiration from signal processing techniques. ItoTTS [597] presents a unified framework
that combines text-to-speech and vocoder models, utilizing linear SDE (Stochastic Differential
Equation) as its fundamental principle. ProDiff [206] proposes a progressive and efficient diffusion
model specifically designed for generating high-quality text-to-speech synthesis. Unlike traditional
diffusion models that require a large number of iterations, ProDiff parameterizes the model by pre-
dicting clean data and incorporates a teacher-synthesized mel-spectrogram as a target to minimize
data discrepancies and improve the sharpness of predictions. Finally, Binaural Grad [299] explores
the application of diffusion models in binaural audio synthesis, aiming to generate binaural audio
from monaural audio sources. It accomplishes this through a two-stage diffusion-based framework.
5.2.4 Alignment
Improving the alignment of text and speech in TTS architecture has been the focus of recent
research [22, 29, 35, 64, 225, 250, 316, 375, 377, 431, 459, 490, 493, 646]. Traditional TTS models
require external aligners to provide attention alignments of phoneme-to-frame sequences, which
can be complex and inefficient. Although autoregressive TTS models use an attention mechanism
to learn these alignments online, these alignments tend to be brittle and often fail to generalize to
long utterances and out-of-domain text, resulting in missing or repeating words.
Fig. 15. The architecture of the Generative Spoken Language Model GSLM introduced by Meta in [281].
GSLM model operates through a three-part architecture. Firstly, the encoder takes the speech waveform and
transforms it into distinct units represented as S2u. Secondly, the decoder reverses this mapping by converting
the units back to the original waveform, represented as u2S. Finally, the language model is unit-based and
captures the distribution of unit sequences, which can be viewed as a form of pseudo-text.
In their study [121], the authors presented a novel text encoder network that includes an
additional objective function to explicitly align text and speech encodings. The text encoder
architecture is straightforward, consisting of an embedding layer, followed by two bidirectional
LSTM layers that maintain the input’s resolution. The study utilized the same subword segmentation
for the input text as for the ASR output targets. While RNN models with soft attention mechanisms
have been proven to be highly effective in various tasks, including speech synthesis, their use in
online settings results in quadratic time complexity due to the pass over the entire input sequence
for generating each element in the output sequence. In [447], the authors proposed an end-to-end
differentiable method for learning monotonic alignments, enabling the computation of attention in
linear time. Several enhancements, such as those proposed in [79], have been proposed in recent
years to improve alignment in TTS models. Additionally, in [21], the authors introduced a generic
alignment learning framework that can be easily extended to various neural TTS models.
The use of normalizing flow has been introduced to address output diversity issues in parallel
TTS architectures. This technique is utilized to model the duration of speech, as evidenced by
studies conducted in [250, 377, 493]. One such flow-based generative model is Glow-TTS [250],
developed specifically for parallel TTS without the need for an external aligner. The model employs
the generic Glow architecture previously used in computer vision and vocoder models to produce
mel-spectrograms from text inputs, which are then converted to speech audio. Glow-TTS has
demonstrated superior synthesis speed over the autoregressive model, Tacotron 2, while maintaining
comparable speech quality.
Recently, a new TTS model called EfficientTTS [377] has been introduced. This model outperforms
previous models such as Tacotron 2 and Glow-TTS in terms of speech quality, training efficiency,
and synthesis speed. The EfficientTTS model uses a multi-head attention mechanism to align input
text and speech encodings, enabling it to generate high-quality speech with fewer parameters
and faster synthesis speed. Overall, the introduction of normalizing flow and the development of
models such as Glow-TTS and EfficientTTS have significantly improved the quality and efficiency
of TTS systems.
5.2.5 Speech Resynthesis
Speech resynthesis is the process of generating speech from a given input signal. The input signal
can be in various forms, such as a digital recording, text, or other types of data. The aim of speech
resynthesis is to create an output that closely resembles the original signal in terms of sound
quality, prosody, and other acoustic characteristics. Speech resynthesis is an important research
area with various applications, including speech enhancement [194, 363, 526], and voice conversion
[362]. Recent advancements in speech resynthesis have revolutionized the field by incorporating
self-supervised discrete representations to generate disentangled representations of speech content,
prosodic information, and speaker identity. These techniques enable the generation of speech
in a controlled and precise manner, as seen in [281, 429, 437, 495]. The objective is to generate
high-quality speech that maintains or degrades acoustic cues, such as phonotactics, syllabic rhythm,
or intonation, from natural speech recordings.
Such self-supervised discrete representations have been used in the GSLM [281] architecture for acoustic modeling, speech recognition, and synthesis, as outlined in Figure 15. GSLM comprises a discrete speech encoder, a generative language model, and a speech decoder, all trained without supervision; it is the only prior work addressing the generative aspect of speech pre-training, building a text-free language model on discovered units.
spectral shaping to adapt the diffusion noise. This adaptation, achieved through time-varying
filtering, improves sound quality, particularly in high-frequency bands. Other examples
of diffusion-based vocoders include InferGrad [74], SpecGrad [264], and PriorGrad [293]. InferGrad incorporates the inference process into training to reduce inference iterations
while maintaining high quality. SpecGrad adapts the diffusion noise distribution to a given
acoustic feature and uses adaptive noise spectral shaping to generate high-fidelity speech
waveforms.
• Flow-based models: Parallel WaveNet, WaveGlow, etc. [258, 294, 354, 427, 435] are based
on normalizing flows and are capable of generating high-fidelity speech in real-time. While
flow-based vocoders generally perform worse than autoregressive vocoders with regard to
modeling the density of speech signals, recent research [354] has proposed new techniques
to improve their performance.
Universal neural vocoding is a challenging task that has achieved limited success to date. However,
recent advances in speech synthesis have shown a promising trend toward improving zero-shot
performance by scaling up model sizes. Despite its potential, this approach has yet to be extensively
explored. Nonetheless, several approaches have been proposed to address the challenges of universal
vocoding. For example, WaveRNN has been utilized in previous studies to achieve universal vocoding
(Lorenzo-Trueba et al. [344]; Paul et al. [419]). Another approach, developed by Jiao et al. [221], involves
constructing a universal vocoder using a flow-based model. Additionally, the GAN vocoder has
emerged as a promising candidate for this task, as suggested by You et al. [626].
5.2.8 Controllable Speech Synthesis
Controllable Speech Synthesis [122, 276, 460, 543, 547, 584, 676] is a rapidly evolving research area
that focuses on generating natural-sounding speech with the ability to control various aspects of
speech, including pitch, speed, and emotion. Controllable Speech Synthesis is positioned in the
emerging field of affective computing at the intersection of three disciplines: expressive speech
analysis [533], natural language processing, and machine learning. This field aims to develop
systems capable of recognizing, interpreting, and generating human-like emotional responses in
interactions between humans and machines.
Expressive speech analysis is a critical component of this field. It provides mathematical tools
to analyse speech signals and extract various acoustic features, including pitch, loudness, and
duration, that convey emotions in speech. Natural language processing is also crucial to this
field, as it helps to process the text input and extract the meaning and sentiment of the words.
Finally, machine learning techniques are used to model and control the expressive features of the
synthesized speech, enabling the systems to produce more expressive and controllable speech
[11, 205, 274, 295, 337, 408, 515, 548, 666].
In the last few years, notable advancements have been achieved in this field [164, 248, 450],
and several approaches have been proposed to enhance the quality of synthesized speech. For
example, some studies propose using deep learning techniques to synthesize expressive speech
and conditional generation models to control the prosodic features of speech [248, 450]. Others
propose using motion matching-based algorithms to synthesize gestures from speech [164].
5.2.9 Disentangling and Transferring
The importance of disentangled representations for neural speech synthesis cannot be overstated,
as it has been widely recognized in the literature that this approach can greatly improve the inter-
pretability and expressiveness of speech synthesis models [195, 360, 436]. Disentangling multiple
styles or prosody information during training is crucial to enhance the quality of expressive speech
synthesis and control. Various disentangling techniques have been developed using adversarial
and collaborative games, the VAE framework, bottleneck reconstructions, and frame-level noise
modeling combined with adversarial training.
For instance, Ma et al. [360] have employed adversarial and collaborative games to enhance the
disentanglement of content and style, resulting in improved controllability. Hsu et al. [195] have
utilized the VAE framework with adversarial training to separate speaker information from noise.
Qian et al. [436] have introduced speech flow, which can disentangle rhythm, pitch, content, and
timbre through three bottleneck reconstructions. In another work based on adversarial training,
Zhang et al. [642] have proposed a method that disentangles noise from the speaker by modeling
the noise at the frame level.
Developing high-quality speech synthesis models that can handle noisy data and generate
accurate representations of speech is a challenging task. To tackle this issue, Zhang et al. [650]
propose a novel approach involving multi-length adversarial training. This method allows for
modeling different noise conditions and improves the accuracy of pitch prediction by incorporating
discriminators on the mel-spectrogram. By replacing the traditional pitch predictor model with this
approach, the authors demonstrate significant improvements in the fidelity of synthesized speech.
5.2.10 Robustness
Using neural TTS models can present issues with robustness, leading to low-quality audio sam-
ples for unseen or atypical text. In response, Li et al. [310] proposed RobuTrans [310], a robust
transformer that converts input text to linguistic features before feeding it to the encoder. This
model also includes modifications to the attention mechanism and position embedding, resulting in
improved MOS scores compared to other TTS models. Another approach to enhancing robustness
is the s-Transformer, introduced by Wang et al. [579], which models speech at the segment level,
allowing it to capture long-term dependencies and use segment-level encoder-decoder attention.
This technique performs similarly to the standard transformer, exhibiting robustness for extra-long
sentences. Lastly, Zheng et al. [670] proposed an approach that combines a local recurrent neural
network with the transformer to capture sequential and local information in sequences. Evaluation
of a 20-hour Mandarin speech corpus demonstrated that this model outperforms the transformer
alone in performance.
In their recent paper [610], the authors proposed a novel method for extracting dynamic prosody
information from audio recordings, even in noisy environments. Their approach employs proba-
bilistic denoising diffusion models and knowledge distillation to learn speaking style features from
a teacher model, resulting in a highly accurate reproduction of prosody and timbre. This model
shows great potential in applications such as speech synthesis and recognition, where noise-robust
prosody information is crucial. Other noteworthy advances in the development of robust TTS
systems include the work by [493], which focuses on a robust speech-text alignment module, as
well as the use of normalizing flows for diverse speech synthesis.
5.2.11 Low-Resource Neural Speech Synthesis
High-quality paired text and speech data are crucial for building high-quality Text-to-Speech
(TTS) systems [147]. Unfortunately, most languages are not supported by popular commercialized
speech services due to the lack of sufficient training data [604]. To overcome this challenge,
researchers have developed TTS systems under low data resource scenarios using various techniques
[127, 147, 538, 604].
Several techniques have been proposed by researchers to enhance the efficiency of low-resource/Zero-
shot TTS systems. One of these is the use of semi-supervised speech synthesis methods that utilize
unpaired training data to improve data efficiency, as suggested in a study by Liu et al. [328]. Another
method involves cascading pre-trained models for ASR, MT, and TTS to increase data size from
unlabelled speech, as proposed by Nguyen et al. [394]. In addition, researchers have employed
crowdsourced acoustic data collection to develop TTS systems for low-resource languages, as
shown in a study by Butryna et al. [50]. Huang et al. [205] introduced a zero-shot style transfer
approach for out-of-domain speech synthesis that generates speech samples exhibiting a new and
distinctive style, such as speaker identity, emotion, and prosody.
5.3.2 Dataset
The VoxCeleb dataset (VoxCeleb 1 & 2) is widely used in speaker recognition research, as mentioned
in [92]. This dataset consists of speech data collected from publicly available media, employing a
fully automated pipeline that incorporates computer vision techniques. The pipeline retrieves videos
from YouTube and applies active speaker verification using a two-stream synchronization CNN.
Speaker identity is further confirmed through CNN-based facial recognition. Another commonly
employed dataset is TIMIT, which comprises recordings of phonetically balanced English sentences
spoken by a diverse set of speakers. TIMIT is commonly used for evaluating speech recognition
and speaker identification systems, as referenced in [153].
Other noteworthy datasets in the field include the SITW database [371], which provides hand-
annotated speech samples for benchmarking text-independent speaker recognition technology,
and the RSR2015 database [286], which contains speech recordings acquired in a typical office
environment using multiple mobile devices. Additionally, the RedDots project [291] and VOICES
corpus [463] offer unique collections of offline voice recordings in furnished rooms with background
noise, while the CN-CELEB database [135] focuses on persons of interest extracted from
bilibili.com using an automated pipeline followed by human verification.
The BookTubeSpeech dataset [424] was also collected using an automated pipeline from Book-
Tube videos, and the Hi-MIA database [438] was designed specifically for far-field scenarios using
multiple microphone arrays. The FFSVC20 challenge [439] and DIHARD challenge [471] are speaker
verification and diarization research initiatives focusing on far-field and robustness challenges,
respectively. Finally, the LibriSpeech dataset [410], originally intended for speech recognition, is
also useful for speaker recognition tasks due to its included speaker identity labels.
5.3.3 Models
Speaker identification (SI) and verification (SV) are crucial research topics in the field of speech
technology due to their significant importance in various applications such as security [125],
forensics [270], biometric authentication [170], and speaker diarization [601]. Speaker recognition
has become more popular with technological advancements, including the Internet of Things (IoT),
smart devices, voice assistants, smart homes, and humanoids. Consequently, a substantial body of
research has been conducted and many methods have been developed, making the state of the art
in this field quite mature and versatile. The sheer number of studies, however, makes it increasingly
challenging to provide a comprehensive overview of the various methods.
A neural network approach for speaker verification was first attempted by Variani et al. [553] in
2014, utilizing four fully connected layers for speaker classification. Their approach has successfully
verified speakers with short-duration utterances by obtaining the 𝑑-vector by averaging the output
of the last hidden layer across frames. Although various attempts have been made to directly learn
speaker representation from raw waveforms by other researchers (Jung et al. [226], Ravanelli and
Bengio [454]), other well-designed neural networks like CNNs and RNNs have been proposed for
speaker verification tasks by Ye and Yang [621]. Nevertheless, the field still requires more powerful
deep neural networks for superior extraction of speaker features.
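To make the 𝑑-vector idea concrete, the following is a minimal PyTorch sketch under simplifying assumptions (layer sizes and feature dimensions are illustrative, not those of Variani et al. [553]): frame-level features pass through stacked fully connected layers trained for speaker classification, and the utterance-level 𝑑-vector is obtained by averaging the last hidden layer across frames.

```python
import torch
import torch.nn as nn

class DVectorNet(nn.Module):
    """Minimal d-vector style network: stacked fully connected layers applied
    frame by frame; the utterance embedding is the average of the last hidden
    layer over frames (hyperparameters are illustrative)."""
    def __init__(self, feat_dim=40, hidden_dim=256, num_speakers=1000):
        super().__init__()
        self.frame_layers = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # Speaker classification head, used only during training.
        self.classifier = nn.Linear(hidden_dim, num_speakers)

    def forward(self, frames):            # frames: (batch, time, feat_dim)
        h = self.frame_layers(frames)     # (batch, time, hidden_dim)
        return self.classifier(h)         # frame-level speaker posteriors

    def d_vector(self, frames):
        h = self.frame_layers(frames)
        emb = h.mean(dim=1)               # average last hidden layer over frames
        return torch.nn.functional.normalize(emb, dim=-1)

# Usage: 200 frames of 40-dim filterbank features for one utterance.
net = DVectorNet()
utt = torch.randn(1, 200, 40)
print(net.d_vector(utt).shape)            # torch.Size([1, 256])
```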
Speaker verification has seen notable advancements with the advent of more powerful deep neural
networks. One such model is the 𝑥-vector-based system proposed by Snyder et al. [507], which has
gained widespread popularity due to its remarkable performance. Since its introduction, the 𝑥-vector
system has undergone significant architectural enhancements and optimized training procedures
[103]. The widely-used ResNet [176] architecture has been incorporated into the system to improve
its performance further. Adding residual connections between frame-level layers has been found
to improve the embeddings [152, 634]. This technique has also aided in faster convergence of the
back-propagation algorithm and mitigated the vanishing gradient problem [176]. Tang et al. [530]
proposed further improvements to the 𝑥-vector system. They introduced a hybrid structure based
on TDNN and LSTM to generate complementary speaker information at different levels. They
also suggested a multi-level pooling strategy to collect the speaker information from global and
local perspectives. These advancements have significantly improved speaker verification systems’
performance and paved the way for further developments in the field.
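The following sketch illustrates the general 𝑥-vector recipe described above, with frame-level TDNN layers realized as dilated 1-D convolutions, statistics pooling over time, and a segment-level embedding layer; all dimensions are illustrative rather than exactly those of Snyder et al. [507].

```python
import torch
import torch.nn as nn

class XVectorNet(nn.Module):
    """Sketch of an x-vector extractor: frame-level TDNN layers (dilated 1-D
    convolutions), statistics pooling (mean and standard deviation over time),
    and a segment-level layer from which the embedding is read."""
    def __init__(self, feat_dim=30, emb_dim=512, num_speakers=1000):
        super().__init__()
        self.tdnn = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=1), nn.ReLU(),
            nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(),
        )
        self.segment = nn.Linear(2 * 1500, emb_dim)   # after mean+std pooling
        self.classifier = nn.Linear(emb_dim, num_speakers)

    def forward(self, feats):                  # feats: (batch, feat_dim, time)
        h = self.tdnn(feats)                   # (batch, 1500, time')
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
        emb = self.segment(stats)              # the x-vector embedding
        return self.classifier(emb), emb

net = XVectorNet()
logits, xvec = net(torch.randn(2, 30, 300))    # two ~3-second utterances
print(xvec.shape)                              # torch.Size([2, 512])
```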
Desplanques et al. [108] propose a state-of-the-art architecture for speaker verification utilizing
a Time Delay Neural Network (TDNN) called ECAPA-TDNN. The paper presents a range of
enhancements to the existing 𝑥-vector architecture that leverages recent developments in face
verification and computer vision. Specifically, the authors suggest three major improvements.
Firstly, they propose restructuring the initial frame layers into 1-dimensional Res2Net modules
with impactful skip connections, which can better capture the relationships between different time
frames. Secondly, they introduce Squeeze-and-Excitation blocks to the TDNN layers, which help
highlight the most informative channels and improve feature discrimination. Lastly, the paper
proposes channel attention propagation and aggregation to efficiently propagate attention weights
through multiple TDNN layers, further enhancing the model’s ability to discriminate between
speakers.
Additionally, a related approach uses ECAPA-TDNN from the speaker recognition domain as the
backbone network for a multiscale channel-adaptive module and achieves promising results. Overall,
ECAPA-TDNN offers a comprehensive solution to speaker verification by introducing several novel
contributions that improve upon the 𝑥-vector architecture, which had been state-of-the-art for
several years, and its results suggest that it can effectively tackle the challenges of speaker
verification.
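As an illustration of the second improvement, the following is a minimal sketch of a 1-D Squeeze-and-Excitation block of the kind ECAPA-TDNN inserts into its TDNN layers; it is a simplified stand-in, not the exact SE-Res2Block of Desplanques et al. [108].

```python
import torch
import torch.nn as nn

class SEBlock1d(nn.Module):
    """1-D Squeeze-and-Excitation block in the spirit of ECAPA-TDNN:
    squeeze by averaging over time, excite with a small bottleneck MLP,
    then rescale the channels of the input feature map."""
    def __init__(self, channels, bottleneck=128):
        super().__init__()
        self.fc1 = nn.Linear(channels, bottleneck)
        self.fc2 = nn.Linear(bottleneck, channels)

    def forward(self, x):                      # x: (batch, channels, time)
        s = x.mean(dim=2)                      # squeeze: global time context
        w = torch.sigmoid(self.fc2(torch.relu(self.fc1(s))))
        return x * w.unsqueeze(-1)             # excite: per-channel rescaling

se = SEBlock1d(channels=512)
feats = torch.randn(4, 512, 200)               # frame-level TDNN activations
print(se(feats).shape)                          # torch.Size([4, 512, 200])
```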
The attention mechanism is a powerful method for obtaining a more discriminative utterance-
level feature by explicitly selecting frame-level representations that better represent speaker char-
acteristics. Recently, the Transformer model with a self-attention mechanism has become effective
in various application fields, including speaker verification. The Transformer architecture has been
extensively explored for speaker verification. TESA [370] is an architecture based on the Trans-
former’s encoder, proposed as a replacement for conventional PLDA-based speaker verification to
capture speaker characteristics better. TESA outperforms PLDA on the same dataset by utilizing
the next sentence prediction task of BERT [109]. Zhu et al. [675] proposed a method to create fixed-
dimensional speaker verification representation using a serialized multi-layer multi-head attention
mechanism. Unlike other studies that redesign the inner structure of the attention module, their
approach strictly follows the original Transformer, providing simple but effective modifications.
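A common way to realize such attention-based pooling is attentive statistics pooling, sketched below under simplifying assumptions (a single attention head and illustrative dimensions): a small network scores each frame, and the utterance embedding concatenates the attention-weighted mean and standard deviation of the frame-level representations.

```python
import torch
import torch.nn as nn

class AttentiveStatsPooling(nn.Module):
    """Attention-based pooling: a small network scores each frame, and the
    utterance embedding is the attention-weighted mean and standard deviation
    of the frame-level representations."""
    def __init__(self, dim, attn_dim=128):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Linear(dim, attn_dim), nn.Tanh(), nn.Linear(attn_dim, 1)
        )

    def forward(self, h):                        # h: (batch, time, dim)
        w = torch.softmax(self.attn(h), dim=1)   # (batch, time, 1)
        mean = (w * h).sum(dim=1)
        var = (w * (h - mean.unsqueeze(1)) ** 2).sum(dim=1)
        std = torch.sqrt(var.clamp(min=1e-8))
        return torch.cat([mean, std], dim=-1)    # (batch, 2*dim)

pool = AttentiveStatsPooling(dim=256)
frames = torch.randn(2, 150, 256)
print(pool(frames).shape)                         # torch.Size([2, 512])
```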
• Acoustic Feature Extraction: In the analysis of multi-speaker speech data, one critical
component is the extraction of acoustic features [14, 536]. This process involves extracting
features such as pitch, energy, and MFCCs from the audio signal. These acoustic features
play a crucial role in identifying different speakers by analyzing their unique characteristics.
• Segmentation: Segmentation is a crucial component in the analysis of multi-speaker audio
data, where the audio signal is divided into smaller segments based on the silence periods
between speakers [14, 536]. This process helps in reducing the complexity of the problem
and makes it easier to identify different speakers in smaller segments.
• Speaker Embedding Extraction: This process involves obtaining a low-dimensional repre-
sentation of each speaker’s voice, which is commonly referred to as speaker embedding.
Fig. 16. Speaker diarization system diagram showcasing the process of identifying and differentiating multiple
speakers in an audio recording using various techniques such as VAD, segmentation, clustering and re-
segmentation.
This is achieved by passing the acoustic features extracted from the speech signal through
a deep neural network, such as a CNN or RNN [506].
• Clustering: In this component, the extracted speaker embeddings are clustered based on
similarity, and each cluster represents a different speaker [14, 536]. This process commonly
uses unsupervised clustering algorithms, such as k-means or spectral clustering (a minimal
sketch is given after this list).
• Speaker Classification: In this component, the speaker embeddings are classified into different
speaker identities using a supervised classification algorithm, such as SVM or MLP [14, 536].
• Re-segmentation: This component is responsible for refining the initial segmentation by
adjusting the segment boundaries based on the classification results. It helps in improving the
accuracy of speaker diarization by reducing the errors made during the initial segmentation.
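To make the clustering step referenced above concrete, the following is a minimal, illustrative sketch (not tied to any particular published system): segment-level speaker embeddings, here simulated with random vectors, are L2-normalized and grouped with k-means, assuming the number of speakers is known.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy stand-in for speaker embeddings extracted from speech segments:
# each row is an L2-normalized segment embedding (e.g., a d- or x-vector).
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(0, 1, (20, 128)) + 5,     # segments of speaker A
                 rng.normal(0, 1, (15, 128)) - 5])    # segments of speaker B
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

# Cluster the segments into speakers; the number of speakers is assumed
# known (2) here. In practice it can be estimated, e.g., from the eigengap
# of the affinity matrix when spectral clustering is used instead.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(emb)
print(labels)   # cluster index (speaker) assigned to each segment
```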
Various studies focus on traditional speaker diarization systems [14, 536]. This paper will review
the recent efforts toward deep learning-based speaker diarization techniques.
5.4.2 Dataset
• NIST SRE 2000 (Disk-8) or CALLHOME dataset: The NIST SRE 2000 (Disk-8) corpus, also
referred to as the CALLHOME dataset, is a frequently utilized resource for speaker diariza-
tion in contemporary research papers. Originally released in 2000, this dataset comprises
conversational telephone speech (CTS) collected from diverse speakers representing a wide
range of ages, genders, and dialects. It includes 500 sessions of multilingual telephonic
speech, each containing two to seven speakers, with two primary speakers in each con-
versation. The dataset covers various topics, including personal and familial relationships,
work, education, and leisure activities. The audio recordings were obtained using a single
microphone and had a sampling rate of 8 kHz, with 16-bit linear quantization.
• Directions into Heterogeneous Audio Research (DIHARD) Challenge and dataset: The DIHARD
Challenge, organized by the National Institute of Standards and Technology (NIST), aims
to enhance the accuracy of speech recognition and diarization in challenging acoustic
environments, such as crowded spaces, distant microphones, and reverberant rooms. The
challenge comprises tasks requiring advanced machine-learning techniques, including
speaker diarization, recognition, and speech activity detection. The DIHARD dataset used
in the challenge comprises over 50 hours of speech from more than 500 speakers, gathered
from diverse sources like meetings, broadcast news, and telephone conversations. These
recordings feature various acoustic challenges, such as overlapping speech, background
noise, and distant or reverberant speech, captured through different microphone setups. To
aid in the evaluation process, the dataset has been divided into separate development and
evaluation sets. The assessment metrics used to gauge performance include diarization error
rate (DER), as well as accuracy in speaker verification, identification, and speech activity
detection.
• Augmented Multi-party Interaction (AMI) database: The AMI database is a collection of
audio and video recordings that capture real-world multi-party conversations in office
environments. The database was developed as part of the AMI project, which aimed to
develop technology for automatically analyzing multi-party meetings. The database contains
over 100 hours of audio and video recordings of meetings involving four to seven participants,
totaling 112 meetings. The meetings were held in multiple offices and were designed to
reflect the kinds of discussions that take place in typical business meetings. The audio
recordings were captured using close-talk microphones placed on each participant and
additional microphones placed in the room to capture ambient sound. The video recordings
were captured using multiple cameras placed around the room. In addition to the audio and
video recordings, the database also includes annotations that provide additional information
about the meetings, including speaker identities, speech transcriptions, and information
about the meeting structure (e.g., turn-taking patterns). The AMI database has been used
extensively in research on automatic speech recognition, speaker diarization, and other
related speech and language processing topics.
• VoxSRC Challenge and VoxConverse corpus: The VoxCeleb Speaker Recognition Challenge
(VoxSRC) is an annual competition designed to assess the capabilities of speaker recognition
systems in identifying speakers from speech recorded in real-world environments. The
challenge provides participants with a dataset of audio and visual recordings of interviews,
news shows, and talk shows featuring famous individuals. The VoxSRC encompasses several
tracks, including speaker diarization, and comprises a development set (20.3 hours, 216
recordings) and a test set (53.5 hours, 310 recordings). Recordings in the dataset may feature
between one and 21 speakers, with a diverse range of ambient noises, such as background
music and laughter. To facilitate the speaker diarization track of the VoxSRC-21 and VoxSRC-
22 competitions, VoxConverse, an audio-visual diarization dataset containing multi-speaker
clips of human speech sourced from YouTube videos, is available, and additional details are
provided on the project website 8 .
• LibriCSS: The LibriCSS corpus is a valuable resource for researchers studying speech sepa-
ration, recognition, and speaker diarization. The corpus comprises 10 hours of multichannel
recordings captured using a 7-channel microphone array in a real meeting room. The audio
was played from the LibriSpeech corpus, and each of the ten sessions was subdivided into
six 10-minute mini-sessions. Each mini-session contained audio from eight speakers and
was designed to have different overlap ratios ranging from 0% to 40%. To make research
easier, the corpus includes baseline systems for speech separation and Automatic Speech
Recognition (ASR) and a baseline system that integrates speech separation, speaker diariza-
tion, and ASR. These baseline systems have already been developed and made available to
researchers.
• Rich Transcription Evaluation Series: The Rich Transcription Evaluation Series dataset is a
collection of speech data used for speaker diarization evaluation. The Rich Transcription
Fall 2003 Evaluation (RT-03F) was the first evaluation in the series focused on "Who Said
What" tasks. The dataset has been used in subsequent evaluations, including the Second
DIHARD Diarization Challenge, which used the Jaccard index to compute the JER (Jaccard
Error Rate) for each pair of segmentations. The dataset is essential for data-driven spoken
8 https://www.robots.ox.ac.uk/~vgg/data/voxconverse/
language processing methods, with speaker diarization accuracy computed at the utterance
level. It includes rules, evaluation methods, and baseline systems to promote reproducible
research in the field, and it has been used in various speaker diarization systems and their
subtasks in the context of broadcast news and CTS data.
• CHiME-5/6 challenge and dataset: The CHiME-5/6 challenge is a speech processing challenge
focusing on distant multi-microphone conversational speech diarization and recognition
in everyday home environments. The challenge provides a dataset of recordings from
everyday home environments, including dinner recordings originally collected for and
exposed during the CHiME-5 challenge. The dataset is designed to be representative of
natural conversational speech. The challenge features two audio input conditions: single-
channel and multichannel. Participants are provided with baseline systems for speech
enhancement, speech activity detection (SAD), and diarization, as well as results obtained
with these systems for all tracks. The challenge aims to improve the robustness of diarization
systems to variations in recording equipment, noise conditions, and conversational domains.
• AMI dataset: The AMI database is a comprehensive collection of 100 hours of recordings
sourced from 171 meeting sessions held across various locations. It features two distinct
audio sources – one recorded using lapel microphones for individual speakers and the
other using omnidirectional microphone arrays placed on the table. It is an ideal dataset
for evaluating speaker diarization systems integrated with the ASR module. AMI’s value
proposition is further enhanced by providing forced alignment data, which captures the
timings at the word and phoneme levels and speaker labeling. Finally, it’s worth noting that
each meeting session involves a small group of three to five speakers.
5.4.3 Models
Speaker diarization has been a subject of research in the field of audio processing, with the goal
of separating speakers in an audio recording. In recent years, deep learning has emerged as a
powerful technique for speaker diarization, leading to significant advancements in this field. In this
article, we will explore some of the recent developments in deep learning architecture for speaker
diarization, focusing on different modules of speaker diarization as outlined in Figure 16. Through
this discussion, we will highlight major advancements in each module.
• Segmentation and clustering: Speaker diarization systems typically use a range of techniques
for segmenting speech, such as identifying speaker change, uniform speaker segmenta-
tion, ASR-based word segmentation, and supervised speaker turn detection. However, each
approach has its own benefits and drawbacks. Uniform speaker segmentation involves
dividing speech into segments of equal length, which can be difficult to optimize to capture
speaker turn boundaries and include enough speaker information. ASR-based word seg-
mentation identifies word boundaries using automatic speech recognition, but the resulting
segments may be too brief to provide adequate speaker information. Supervised speaker
turn detection, on the other hand, involves a specialized model that can accurately identify
speaker turn timestamps. While this method can achieve high accuracy, it requires labeled
data for training. These techniques have been widely discussed in previous research, and
choosing the appropriate one depends on the specific requirements of the application.
– The authors in [98] propose a real-time speaker diarization system that combines incre-
mental clustering and local diarization applied to a rolling window of speech data and
is designed to handle overlapping speech segments. The proposed pipeline is designed
to utilize end-to-end overlap-aware segmentation to detect and separate overlapping
speakers.
– In another related work, authors in [643] introduce a novel speaker diarization system
with a generalized neural speaker clustering module as the backbone.
– In a recent study conducted by Park et al. [415], a new framework for spectral clustering
is proposed that allows for automatic parameter tuning of the clustering algorithm
in the context of speaker diarization. The proposed technique utilizes normalized
maximum eigengap (NME) values to determine the number of clusters and threshold
parameters for each row in an affinity matrix during spectral clustering. The authors
demonstrated that their method outperformed existing state-of-the-art methods on
two different datasets for speaker diarization.
– The Bayesian HMM clustering of x-vector sequences (VBx) diarization approach, which
clusters x-vectors using a Bayesian hidden Markov model (BHMM) [285], combined
with a ResNet101 (He et al. [176]) 𝑥-vector extractor, achieves superior results on the
CALLHOME [111], AMI [53], and DIHARD II [472] datasets.
• Speaker Embedding Extraction and Classification:
– Attentive Aggregation for Speaker Diarization [278]: This approach uses an attention
mechanism to aggregate embeddings from multiple frames and generate speaker
embeddings. The speaker embeddings are then used for clustering to identify speaker
segments.
– End-to-End Speaker Diarization with Self-Attention [145]: This method uses a self-
attention mechanism to capture the correlations between the input frames and gen-
erates embeddings for each frame. The embeddings are then used for clustering to
identify speaker segments.
– Wang et al. [577] present an innovative method for measuring similarity between
speaker embeddings in speaker diarization using neural networks. The approach incor-
porates past and future contexts and uses a segmental pooling strategy. Furthermore,
the speaker embedding network and similarity measurement model are jointly trained.
The paper extends this framework to target-speaker voice activity detection (TS-VAD)
[372]. The proposed method effectively learns the similarity between speaker embed-
dings by considering both past and future contexts.
– Time-Depth Separable Convolutions for Speaker Diarization [266]: This approach uses
time-depth separable convolutions to generate embeddings for each frame, which are
then used for clustering to identify speaker segments. The method is computationally
efficient and achieves state-of-the-art performance on several benchmark datasets.
• Re-segmentation:
– Numerous studies in this field centre around developing a re-segmentation strategy
for diarization systems that can effectively handle both voice activity and overlapped
speech detection. This approach can also be a post-processing step to identify and
assign overlapped speech regions accurately. Notable examples of such works include
those by Bullock et al. [47] and Bredin and Laurent [45].
• End-to-End Neural Diarization: In addition to the above work, end-to-end speaker diarization
systems have gained the attention of the research community due to their ability to handle
speaker overlaps and their optimization to minimize diarization errors directly. In one such
work, the authors propose end-to-end neural speaker diarization that does not rely on
clustering and instead uses a self-attention-based neural network to directly output the
joint speech activities of all speakers for each segment [145]; a minimal sketch of this
permutation-free training objective is given below. Following this trend, several other
works propose enhanced architectures based on self-attention [324, 630].
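The sketch below illustrates the permutation-free objective underlying such end-to-end systems, under simplifying assumptions (two speaker slots; dimensions illustrative): the network emits frame-wise activity logits for each speaker slot, and the binary cross-entropy under the best permutation of the slots is minimized for each utterance.

```python
import itertools
import torch
import torch.nn.functional as F

def pit_diarization_loss(pred, target):
    """Permutation-free BCE loss for end-to-end neural diarization.
    pred:   (batch, time, num_speakers) frame-wise speaker-activity logits
    target: (batch, time, num_speakers) binary speaker-activity labels
    For each utterance, the loss under the best permutation of speaker slots
    is used, so no fixed speaker ordering is required."""
    num_spk = pred.shape[-1]
    per_perm = []
    for perm in itertools.permutations(range(num_spk)):
        p = pred[..., list(perm)]
        # BCE per utterance: average over time and speaker dimensions only.
        bce = F.binary_cross_entropy_with_logits(p, target, reduction="none")
        per_perm.append(bce.mean(dim=(1, 2)))
    return torch.stack(per_perm, dim=1).min(dim=1).values.mean()

# Two speaker slots, 100 frames, a batch of 4 utterances.
logits = torch.randn(4, 100, 2)
labels = torch.randint(0, 2, (4, 100, 2)).float()
print(pit_diarization_loss(logits, labels))
```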
Compared with the conventional cascade system, the feasibility of end-to-end direct speech-to-speech
translation was demonstrated in [219].
In a recent publication from 2020, researchers presented a study on an end-to-end speech
translation system. This system incorporates pre-trained models such as Wav2Vec 2.0 and mBART,
along with coupling modules between the encoder and decoder. The study also introduces an
efficient fine-tuning technique, which selectively trains only 20% of the total parameters [622]. The
system developed by the UPC Machine Translation group actively participated in the IWSLT 2021
offline speech translation task, which aimed to develop a system capable of translating English
audio recordings from TED talks into German text.
E2E ST is often improved by pretraining the encoder and/or decoder with transcripts from
speech recognition or text translation tasks [110, 563, 603, 639]. Consequently, it has become
the standard approach used in various toolkits [214, 563, 660, 669]. However, transcripts are not
always available, and the significance of pretraining for E2E ST is rarely studied. Zhang et al. [638]
explored the effectiveness of E2E ST trained solely on speech-translation pairs and proposed an
algorithm for training from scratch. The proposed system outperforms previous studies in four
benchmarks covering 23 languages without pretraining. The paper also discusses neural acoustic
feature modeling, which extracts acoustic features directly from raw speech signals to simplify
inductive biases and enhance speech description.
5.6.2 Dataset
One popular dataset for speech enhancement tasks is AISHELL-4, which comprises authentic
Mandarin speech recordings captured during conferences using an 8-channel circular microphone
array. In accordance with [144], AISHELL-4 is composed of 211 meeting sessions, each featuring 4
to 8 speakers, for a total of 120 hours of content. Owing to its realistic acoustics and rich speech
characteristics, this dataset is of great value for research on multi-speaker processing tasks such as
speaker diarization and speech recognition.
Another popular dataset used for speech enhancement is the dataset from Deep Noise Suppression
(DNS) challenge [457], a large-scale dataset of noisy speech signals and their corresponding clean
speech signals. The DNS dataset contains over 10,000 hours of noisy speech signals and over
1,000 hours of clean speech signals, making it useful for training deep learning models for speech
enhancement. The Voice Bank Corpus (VCTK) is another dataset containing speech recordings
from 109 speakers, each recording approximately 400 sentences. The dataset contains clean and
noisy speech recordings, making it useful for training speech enhancement models. These datasets
provide realistic acoustics, rich natural speech characteristics, and large-scale noisy and clean
speech signals, making them useful for training deep learning models.
Table 10. Performance of different speech enhancement algorithms on the Deep Noise Suppression (DNS)
Challenge dataset. The table showcases improvements in PESQ-WB, PESQ-NB, SI-SDR-WB, and SI-SDR-NB
metrics, and identifies the top-performing methods in each category.
5.6.3 Models
Several classical algorithms have been reported in the literature for speech enhancement, including
spectral subtraction [41], Wiener and Kalman filtering [319, 480], MMSE estimation [128], comb
filtering [222], subspace methods [171], and phase spectrum compensation [407]. However, classical
algorithms such as spectral subtraction and Wiener filtering approach the problem in the spectral
domain and are restricted to stationary or quasi-stationary noise.
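For reference, the following is a minimal sketch of the classical spectral subtraction idea mentioned above, assuming the first few frames of the recording contain noise only; parameters such as the FFT size and spectral floor are illustrative.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(noisy, fs, noise_frames=10, floor=0.02):
    """Basic spectral subtraction: the noise magnitude spectrum is estimated
    from the first few (assumed noise-only) frames, subtracted from the noisy
    magnitude, and the noisy phase is reused for reconstruction."""
    f, t, X = stft(noisy, fs=fs, nperseg=512)
    mag, phase = np.abs(X), np.angle(X)
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)
    clean_mag = np.maximum(mag - noise_mag, floor * mag)   # spectral floor
    _, enhanced = istft(clean_mag * np.exp(1j * phase), fs=fs, nperseg=512)
    return enhanced

fs = 16000
noisy = np.random.randn(fs)          # stand-in for a 1-second noisy signal
print(spectral_subtraction(noisy, fs).shape)
```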
Neural network-based approaches inspired from other areas such as computer vision [10, 146, 188]
and generative adversarial networks [142, 321, 469, 596] or developed for general audio processing
tasks [157, 588] have outperformed the classical approaches. Various neural network models
based on different architectures, including fully connected neural networks [606], deep denoising
autoencoder [346], CNN [143], LSTM [77], and Transformer [263] have effectively handled diverse
noisy conditions.
Diffusion-based models have also shown promising results for speech enhancement [298, 349,
623] and have led to the development of novel speech enhancement algorithms called Conditional
Diffusion Probabilistic Model (CDiffuSE) that incorporates characteristics of the observed noisy
speech signal into the diffusion and reverse processing [349]. CDiffuSE is a generalized formulation
of the diffusion probabilistic model that can adapt to non-Gaussian real noises in the estimated
speech signal. Another diffusion-based model for speech enhancement is StoRM [298], which
stands for Stochastic Regeneration Model. It uses a predictive model to remove vocalizing and
breathing artifacts while producing high-quality samples using a diffusion process, even in adverse
conditions. StoRM has shown great ability at bridging the performance gap between predictive
and generative approaches for speech enhancement. Furthermore, the authors in [623] propose a cold
diffusion process, an advanced iterative variant of the diffusion process, to recover clean speech from
noisy speech; according to the authors, it can restore high-quality samples from arbitrary degradations.
Table 10 summarizes the performance of different speech enhancement algorithms on the Deep Noise
Suppression (DNS) Challenge dataset using different metrics.
5.7.2 Datasets
This section provides an overview of the diverse datasets utilized in Audio Super Resolution
literature. One of the most frequently used datasets is the MUSDB18, specifically designed for
music source separation and enhancement. This dataset encompasses more than 150 songs with
distinct tracks for individual instruments. Another prominent dataset is UrbanSound8K, which
comprises over 8,000 environmental sound files collected from 10 different categories, making it
ideal for evaluating Audio Super Resolution algorithms in noisy environments. Furthermore, the
VoiceBank dataset is another essential resource for evaluating Audio Super Resolution systems,
comprising over 10,000 speech recordings from five distinct speakers. This dataset offers a rich
source of information for assessing speech processing systems, including Audio Super Resolution.
Another dataset, LibriSpeech, features more than 1000 hours of spoken words from several books
and speakers, making it valuable for evaluating Audio Super Resolution algorithms to enhance the
quality of spoken words. Finally, the TED-LIUM dataset, which includes over 140 hours of speech
recordings from various speakers giving TED talks, provides a real-world setting for evaluating
Audio Super Resolution algorithms for speech enhancement. By using these datasets, researchers
can evaluate Audio Super Resolution systems for a wide range of audio signals and improve the
generalizability of these algorithms for real-world scenarios.
5.7.3 Models
Audio super-resolution has been extensively explored using deep learning architectures [8, 40,
168, 253, 290, 320, 333, 392, 453, 624]. One notable paper by Rakotonirina [453] proposes a novel
network architecture that integrates convolution and self-attention mechanisms for audio super-
resolution. Specifically, they use Attention-based Feature-Wise Linear Modulation (AFiLM) [453]
to modulate the activations of the convolutional model. In another recent work by Yoneyama et al.
[624], the super-resolution task is decomposed into domain adaptation and resampling processes
to handle acoustic mismatch in unpaired low- and high-resolution signals. To address this, they
jointly optimize the two processes within the CycleGAN framework.
Moreover, the Time-Frequency Network (TFNet) [320] proposed a deep network that achieves
promising results by modeling the task as a regression problem in either time or frequency domain.
To further enhance audio super-resolution, the paper proposes a time-frequency network that
combines time and frequency domain information. Finally, recent advancements in diffusion models
have introduced new approaches to neural audio upsampling. Specifically, Lee and Han [290], and
Han and Lee [168] propose NU-Wave 1 and 2 diffusion probabilistic models, respectively, which
can produce high-quality waveforms with a sampling rate of 48kHz from coarse 16kHz or 24kHz
inputs. These models are a promising direction for improving audio super-resolution.
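A practical detail shared by most of these models is how training pairs are constructed: high-resolution audio is typically band-limited and downsampled to synthesize the low-resolution input. The sketch below shows this pairing step under illustrative sampling rates; it is a generic recipe, not the procedure of any specific paper above.

```python
import numpy as np
from scipy.signal import resample_poly

def make_super_resolution_pair(hr_audio, hr_rate=48000, lr_rate=16000):
    """Create a (low-resolution input, high-resolution target) training pair
    by downsampling and re-upsampling the high-resolution signal, so both
    waveforms share the same length and sampling grid."""
    lr = resample_poly(hr_audio, lr_rate, hr_rate)     # 48 kHz -> 16 kHz
    lr_up = resample_poly(lr, hr_rate, lr_rate)        # back to the 48 kHz grid
    n = min(len(lr_up), len(hr_audio))
    return lr_up[:n], hr_audio[:n]                     # model input, target

hr = np.random.randn(48000)                            # 1 s of 48 kHz audio
x, y = make_super_resolution_pair(hr)
print(x.shape, y.shape)
```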
5.9.2 Datasets
The speech quality assessment algorithms are evaluated using several datasets, each with unique
characteristics. The TIMIT Acoustic-Phonetic Continuous Speech Corpus [153] has clean speech
recordings and artificially generated degraded versions for speech synthesis and quality assessment
research. The NOIZEUS dataset [203] is designed for evaluating noise reduction and speech quality
assessment algorithms, with clean speech and artificially degraded versions containing various
types of noise and distortion. The ETSI Aurora databases [361] are used for evaluating speech
enhancement techniques and quality assessment algorithms, containing speech recordings with
different types of distortions like acoustic echo and background noise. Furthermore, for training
and validation, the clean speech recordings from the DNS Challenge [457] can be used along with
a noise dataset such as FSD50K [138] for additive noise degradation.
5.9.3 Models
Current objective methods such as Perceptual Evaluation of Speech Quality (PESQ) [466] and
Perceptual Objective Listening Quality Assessment (POLQA) [36] for evaluating the quality of
speech mostly rely on the availability of the corresponding clean reference. These methods fail
in real-world scenarios where the ground truth clean reference is unavailable. In recent years,
several attempts to automatically estimate the MOS using neural networks for performing quality
assessment and predicting ratings or scores have attracted much attention [55, 57, 118, 119, 404,
514]. These approaches outperform traditional approaches without the need for a clean reference.
However, they lack robustness and generalization capabilities, limiting their use in real-world
applications. The authors in [404] explore Deep machine listening for Estimating Speech Quality
(DESQ) for predicting the perceived speech quality based on phoneme posterior probabilities
obtained using a deep neural network.
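Such non-intrusive predictors are generally framed as regression from features of the degraded signal to a quality score. The following is a minimal, hypothetical sketch of this setup (a small CNN over log-mel features trained against MOS labels); it is not DESQ or any specific published model.

```python
import torch
import torch.nn as nn

class MOSPredictor(nn.Module):
    """Minimal non-intrusive quality predictor: a small CNN over a log-mel
    spectrogram followed by average pooling and a regression head that
    outputs a single MOS-like score."""
    def __init__(self, n_mels=80):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, 1)

    def forward(self, logmel):                  # (batch, 1, n_mels, frames)
        return self.head(self.conv(logmel).flatten(1)).squeeze(-1)

model = MOSPredictor()
batch = torch.randn(8, 1, 80, 300)              # 8 degraded utterances
scores = model(batch)                           # trained with MSE against MOS labels
print(scores.shape)                             # torch.Size([8])
```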
In recent years, there have been several quality assessment frameworks developed to estimate
speech quality, such as NORESQA [369] based on non-matching reference (NMR). NORESQA takes
inspiration from the human ability to assess speech quality even when the content is non-matching.
Additionally, NORESQA introduces two new metrics - NORESQA-score, which is based on SI-SDR
for speech, and NORESQA-MOS, which evaluates the Mean Opinion Score (MOS) of a speech
recording using non-matching references. A recent extension, NORESQA-MOS, has been proposed
in [368]; whereas the original NORESQA framework estimates general speech quality through the
NORESQA-score, NORESQA-MOS is specifically designed to assess the MOS of a given speech
recording using NMRs.
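Since several of these metrics, including the NORESQA-score, build on the scale-invariant signal-to-distortion ratio (SI-SDR), a minimal reference implementation of SI-SDR is sketched below; the test signal is synthetic and for illustration only.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio (SI-SDR) in dB. Both signals
    are zero-meaned, and the reference is rescaled by the optimal projection
    before computing the energy ratio."""
    est = estimate - estimate.mean()
    ref = reference - reference.mean()
    alpha = np.dot(est, ref) / (np.dot(ref, ref) + eps)
    target = alpha * ref
    noise = est - target
    return 10 * np.log10((np.sum(target**2) + eps) / (np.sum(noise**2) + eps))

ref = np.random.randn(16000)
est = ref + 0.1 * np.random.randn(16000)        # lightly distorted estimate
print(round(si_sdr(est, ref), 2))               # roughly 20 dB for this noise level
```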
Speech separation has seen a significant improvement in performance with the advent of deep neural
networks, which can learn complex relationships between input features and output sources.
5.10.2 Datasets
The WSJ0-2mix dataset comprises mixtures of two Wall Street Journal corpus (WSJ) speakers. It
consists of a training set of 30,000 mixtures and a test set of 5000 mixtures, and it has been widely
used to evaluate speech separation algorithms. CHiME-4 is a dataset that contains recordings of
multiple speakers in real-world environments, such as a living room, a kitchen, and a café and is
designed to test algorithms in challenging acoustic environments. TIMIT-2mix is a dataset based
on the TIMIT corpus, consisting of mixtures of two speakers, and includes a training set of 462
mixtures and a test set of 400 mixtures. The dataset provides a more controlled environment than
CHiME-4 to test speech separation algorithms. LibriMix is derived from the LibriSpeech corpus
and includes mixtures of up to four speakers, with a training set of 100,000 mixtures and a test set
of 1,000 mixtures, providing a more realistic and challenging environment than WSJ0-2mix. Lastly,
the MUSDB18 dataset contains mixtures of music tracks separated into individual stems, including
vocals, drums, bass, and other instruments. It consists of a training set of 100 songs and a test set of
50 songs. Despite not being specifically designed for that purpose, it has been used as a benchmark
for evaluating speech separation algorithms.
5.10.3 Models
Deep Clustering++ [181], first proposed in 2015, employs deep neural networks to extract features
from the input signal and cluster similar feature vectors in a latent space to separate different
speakers. The model’s performance is improved using spectral masking and a permutation invariant
training method. The advantage of this model is its ability to handle multiple speakers, but it also
has a high computational cost. Chimera++ [587] is another effective model that combines deep
clustering with mask-inference networks in a multi-objective training scheme. The model is trained
using a multitask learning approach, optimizing speech enhancement and speaker identification.
Chimera++ can perform speech enhancement and speaker identification but has a relatively long
training time.
TasNet v2 [352] employs a convolutional neural network (CNN) to process the input signal
and generate a time-frequency mask for each source. The model is trained using a permutation
invariant training (PIT) method [265], which enables it to separate multiple sources accurately.
TasNet v2 achieves state-of-the-art performance in various speech separation tasks with high
separation accuracy, but its disadvantage is its relatively high computational cost. The variant of
TasNet based on CNNs is proposed in [353]. The model is called Conv-TasNet and can generate a
time-frequency mask for each source to obtain the separated source’s signal. Compared to previous
models, Conv-TasNet has faster processing time but lower accuracy.
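The permutation invariant training idea used by these models can be summarized as follows: each estimated source is scored against each reference source, and the loss under the best assignment is optimized. The sketch below is a generic utterance-level PIT (uPIT) loss using negative SI-SDR as the per-source criterion; it is illustrative rather than the exact objective of any model above.

```python
import itertools
import torch

def neg_si_sdr(est, ref, eps=1e-8):
    """Negative SI-SDR (in dB) per utterance; est, ref: (batch, samples)."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    alpha = (est * ref).sum(-1, keepdim=True) / (ref.pow(2).sum(-1, keepdim=True) + eps)
    target = alpha * ref
    noise = est - target
    return -10 * torch.log10((target.pow(2).sum(-1) + eps) / (noise.pow(2).sum(-1) + eps))

def upit_loss(est_sources, ref_sources):
    """Utterance-level PIT loss; est_sources, ref_sources: (batch, sources, samples).
    For each utterance, the lowest average negative SI-SDR over all assignments
    of estimates to references is optimized."""
    n_src = est_sources.shape[1]
    losses = []
    for perm in itertools.permutations(range(n_src)):
        l = torch.stack([neg_si_sdr(est_sources[:, i], ref_sources[:, p])
                         for i, p in enumerate(perm)]).mean(dim=0)
        losses.append(l)                        # (batch,) loss per permutation
    return torch.stack(losses, dim=1).min(dim=1).values.mean()

est = torch.randn(4, 2, 16000)                  # two estimated sources per mixture
ref = torch.randn(4, 2, 16000)                  # two reference sources per mixture
print(upit_loss(est, ref))
```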
In recent research, encoder-decoder architectures have been explored for effectively separating
source signals. One promising approach is the Hybrid Tasnet architecture [613], which utilizes an
encoder to extract features from the input signal and a decoder to generate the independent sources.
This hybrid architecture captures both short-term and long-term dependencies in the input signal,
leading to improved separation performance. However, it should be noted that this model’s higher
computational cost should be considered when selecting an appropriate separation method.
Dual-path RNN [351] uses an RNN architecture to perform speech separation. The model uses a
dual-path structure [351] that alternates between intra-chunk and inter-chunk processing to capture
both local and long-range information in the input signal. Dual-path RNN achieves impressive
performance in various speech separation tasks. The advantage of this model is its ability to capture
both local and long-range dependencies, but its disadvantage is its high computational cost. Gated
DualPathRNN [387] is a variant of Dual-path
Table 11. Table comparing the performance of different speech separation methods using SI-SDRi metrics on
various speech separation benchmarks.
| Model | Architecture | WSJ0-2mix | WSJ0-3mix | WSJ0-5mix | Libri2Mix | Libri5Mix | Libri10Mix | Libri20Mix | WHAM |
|---|---|---|---|---|---|---|---|---|---|
| Separate And Diffuse [357] | Diffusion | 23.9 | 20.9 | - | 21.5 | 14.2 | 9 | 5.2 | - |
| MossFormer (L) [663] | Transformer | 22.8 | 21.2 | - | - | - | - | - | - |
| MossFormer (M) [663] | Transformer | 22.5 | 20.8 | - | - | - | - | - | 17.3 |
| SepFormer [518] | Transformer | 22.3 | 19.5 | - | - | - | - | - | - |
| Sandglasset [283] | Transformer + LSTM | 21.0 | 19.5 | - | - | - | - | - | - |
| Hungarian PIT [120] | RNN | - | - | 13.22 | - | 12.72 | 7.78 | 4.26 | - |
| TDANet (L) [308] | Transformer + CNN | - | - | - | 17.4 | - | - | - | 15.2 |
| TDANet [308] | Transformer + CNN | - | - | - | 16.9 | - | - | - | 14.8 |
| Sepit [356] | CNN | 22.4 | 20.1 | - | - | 13.7 | 8.2 | - | - |
| Gated DualPathRNN [387] | CNN + LSTM | 20.12 | 16.85 | 10.56 | - | - | - | - | - |
| Dual-path RNN [351] | LSTM | 18.8 | - | - | - | - | - | - | - |
| Conv-TasNet [353] | CNN | 15.3 | - | - | - | - | - | - | - |
RNN that employs gated recurrent units (GRUs) to improve the model’s performance. The model
uses a gating mechanism to control the flow of information in the recurrent network, allowing it to
capture long-term dependencies in the input signal. Gated DualPathRNN achieves state-of-the-art
performance in various speech separation tasks. The advantage of this model is its ability to capture
long-term dependencies, but its disadvantage is its higher computational cost than other models.
Wavesplit [633] employs a Wave-U-Net [517] architecture to perform speech separation. The
model uses a fully convolutional neural network to extract features from the input signal and
generate a time-frequency mask for each source. Wavesplit achieves impressive performance in
various speech separation tasks. The advantage of this model is its high separation accuracy and
relatively fast processing time, but its disadvantage is its relatively high memory usage.
Numerous studies have investigated the application of Transformer architecture in the context
of speech separation. One such study is SepFormer [518], which has yielded encouraging outcomes
on the WSJ0-2mix and WSJ0-3mix datasets, as evidenced by the data presented in Table 11. Addi-
tionally, MossFormer [663] is another cutting-edge architecture that has successfully pushed the
boundaries of monaural speech separation across multiple speech separation benchmarks. It is
worth noting that although both models employ attention mechanisms, MossFormer integrates a
blend of convolutional modules to further amplify its performance.
Diffusion models have been proven to be highly effective in various machine learning tasks related
to computer vision, as well as speech-processing tasks. The recent development of DiffSep [482] for
speech separation, which is based on score-matching of a stochastic differential equation, has shown
competitive performance on the VoiceBank-DEMAND dataset. Additionally, Separate And Diffuse
[357], another diffusion-based model that utilizes a pretrained diffusion model, currently represents
the state-of-the-art performance in various speech separation benchmarks (refer to Table 11). These
advancements demonstrate the significant potential of diffusion models in advancing the field of
machine learning and speech processing.
Spoken Language Understanding (SLU) lies at the intersection of speech processing and natural
language understanding. Typically, SLU tasks involve identifying the domain or topic of a spoken utterance,
determining the speaker’s intent or goal in making the utterance, and filling in any relevant slots
or variables associated with that intent. For example, consider the spoken utterance, "What is the
weather like in San Francisco today?" An SLU system would need to identify the domain (weather),
the intent (obtaining current weather information), and the specific slot to be filled (location-San
Francisco) to generate an appropriate response. By improving SLU capabilities, we can enable more
effective communication between humans and machines, making interactions more natural and
efficient.
Data-driven methods are frequently utilized to achieve these tasks, employing large datasets to
train models capable of accurately recognizing and interpreting spoken language. Among these
methods, machine learning techniques, such as deep neural networks, are widely employed, given
their exceptional ability to handle complex and ambiguous speech data. The SLU task may be
subdivided into the following categories for greater clarity.
• Keyword Spotting: Keyword Spotting (KS) is a technique used in speech processing to
identify specific words or phrases within spoken language. It involves analyzing audio
recordings and detecting instances of pre-defined keywords or phrases. This technique is
commonly used in applications such as voice assistants, where the system needs to recognize
specific commands or questions from the user (a minimal classifier sketch is given after this
list).
• Intent Classification: Intent Classification (IC) is a spoken language understanding task
that involves identifying the intent behind a spoken sentence. It is usually implemented
as a pipeline process, with a speech recognition module followed by text processing that
classifies the intents. However, end-to-end intent classification using speech has numerous
advantages compared to the conventional pipeline approach using ASR followed by NLP
modules.
• Slot Filling: Slot Filling (SF) is a widely used technique in Spoken Language Understanding
(SLU) that enables the extraction of important information, such as names, dates, and
locations, from a user’s speech. The process involves identifying the specific pieces of
information that are relevant to the user’s request and placing them into pre-defined slots.
For instance, if a user asks for the weather in a particular city, the system will identify the
city name and fill it into the appropriate slot, thereby providing an accurate and relevant
response.
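As referenced in the keyword spotting item above, KS is commonly cast as the classification of a short audio clip into a fixed keyword vocabulary. The following is a minimal, hypothetical classifier sketch (a small CNN over log-mel features; the 35-class setup mirrors Speech Commands) and is not any of the specific models discussed later.

```python
import torch
import torch.nn as nn

class KeywordSpotter(nn.Module):
    """Minimal keyword-spotting classifier: a small CNN over the log-mel
    spectrogram of a ~1 s clip, followed by a softmax over a fixed keyword
    vocabulary (35 classes, as in Speech Commands)."""
    def __init__(self, n_keywords=35, n_mels=40):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, n_keywords)

    def forward(self, logmel):                 # (batch, 1, n_mels, frames)
        return self.classifier(self.features(logmel).flatten(1))

model = KeywordSpotter()
clip = torch.randn(1, 1, 40, 101)              # one ~1-second utterance
keyword_id = model(clip).argmax(dim=-1)        # index of the predicted keyword
print(keyword_id)
```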
5.11.2 Dataset
• Keyword Spotting Datasets:
– Coucke et al. [100]: This dataset is a speech command recognition dataset that consists
of 105,000 spoken commands in English, with each command being one of 35 keywords.
The dataset is designed to be highly varied and challenging, with a diverse set of
speakers and background noise conditions.
– Leroy et al. [300]: This is a keyword spotting dataset for federated learning, composed
of data from multiple sources that are trained on jointly without sharing the raw data.
The dataset consists of audio recordings from multiple devices and environments, with
the goal of improving the robustness of KS across different devices and settings.
– Auto-KWS [570]: This dataset is automatically generated using a TTS approach. The
dataset consists of 1000 keywords spoken by 100 different synthetic voices, with
variations in accent, gender, and age.
– Speech Commands [589]: This is a large-scale dataset for the KS task, consisting of
over 100,000 spoken commands in English, with each command belonging to one of 35
different keywords. The dataset is specifically designed to be highly varied and chal-
lenging, with a diverse set of speakers and background noises. It is commonly used as
a benchmark dataset for KS research.
• Intent Classification and Slot Filling
– ATIS [179]: The Airline Travel Information System (ATIS) dataset is a collection of
spoken queries and responses related to airline travel, such as flight reservations, flight
status, and airport information. The dataset is annotated with both intent labels (e.g.
“flight booking”, “flight status inquiry") and slot labels (e.g. depart city, arrival city,
date). The ATIS dataset has been used extensively as a benchmark for natural language
understanding models.
– SNIPS [101]: SNIPS is a dataset of voice commands designed for building a natural
language understanding system. It consists of thousands of examples of spoken requests,
each annotated with the intent of the request (e.g. “play music”, “set an alarm”, etc.).
The dataset is widely used for training IC and SF models.
– Fluent Speech Commands [350]: It is a dataset of voice commands for controlling smart
home devices, such as lights, thermostats, and locks. The dataset consists of over 15,000
spoken commands, each labeled with the intended device and action (e.g. “turn on the
living room lights”, “set the thermostat to 72 degrees”). The dataset is designed to have
variations in speaker accent, background noise, and device placement.
– MIT-Restaurant and MIT-Movie [335]: These are two datasets created by researchers at
MIT for training natural language understanding models from restaurant and movie
information requests. The dataset contains spoken and text-based queries, each labeled
with the intent of the request (e.g. “find a nearby Italian restaurant”, “get information
about the movie Inception”) and relevant slot information (e.g. restaurant type,
movie name, etc). The datasets are widely used for benchmarking natural language
understanding models.
5.11.3 Models
• Keyword Spotting: The state-of-the-art techniques for keyword spotting in speech involve
deep learning models, such as CNNs [467] and transformers [37]. Wav2Keyword is one of
the popular models based on the Wav2Vec2.0 architecture [486] and has achieved SOTA results
on Speech Commands datasets V1 and V2. Another model that achieves SOTA classification
accuracy on the Google Speech Commands dataset is the Keyword Transformer (KWT) [486].
KWT uses a transformer model and achieves 98.6% and 97.7% accuracy on the 12 and
35-word tasks, respectively. KWT also has low latency and can be used on mobile devices.
• The DIET architecture, as introduced in [48], is a transformer-based multitask model that
addresses intent classification and entity recognition simultaneously. DIET allows for the
seamless integration of various pre-trained embeddings such as BERT, GloVe, and ConveRT.
Results from experiments show that DIET outperforms fine-tuned BERT and has the added
benefit of being six times faster to train.
• Chang et al. [59] investigated the effectiveness of prompt tuning on the GSLM architecture
and showcased its competitiveness on various SLU tasks, such as KS, IC, and SF. Impressively,
this approach achieves comparable results with fewer trainable parameters than full fine-
tuning. Despite being a popular and effective technique in numerous NLP tasks, prompt
tuning has not received much attention in the speech community. Additionally, other
researchers have pursued a different path by utilizing pre-trained wav2vec2.0 and different
adapters [315] to attain state-of-the-art outcomes.
Table 12. Comprehensive performance analysis of various models for Keyword Spotting (KS) and Slot Filling
(SF) tasks, evaluated on two benchmark datasets: Google Speech Commands (accuracy %, reported on V1
and V2) for KS and ATIS (F1) for SF.
Despite the remarkable progress made in the field of SLU, accurately comprehending human
speech in real-life situations continues to pose significant challenges. These challenges are amplified
by the presence of diverse accents, dialects, and linguistic variations. In a notable study, Vanzo et al.
[551] emphasize the significance of SLU in facilitating effective human-robot interaction, particularly
within the context of house service robots. The authors delve into the specific obstacles encountered
in this domain, which encompass handling noisy and unstructured speech, accommodating various
accents and speech variations, and deciphering complex commands involving multiple actions.
To overcome these obstacles, ongoing research endeavors are dedicated to developing innovative
solutions that enhance the precision and efficacy of SLU systems. By addressing these challenges,
the aim is to enable more robust and accurate speech comprehension in diverse real-life scenarios.
Recent studies, including the comprehensive analysis of the performance of different models
and techniques for Keyword Spotting (KS) and Slot Filling (SF) tasks on Google Speech Commands
and ATIS benchmark datasets (Table 12), have furnished valuable insights into the strengths and
limitations of such approaches in SLU. Capitalizing on these findings and leveraging the latest
advances in deep learning and speech recognition could help us continue to expand the frontiers of
spoken language understanding and drive further innovation in this domain.
advancements in deep learning technology have enabled the development of neural network-
based lip-reading models to accomplish this task with high accuracy. These models take
silent facial videos as input and produce the corresponding speech audio or characters as
output. The potential applications of automatic lip-reading models are vast and diverse,
including enabling videoconferencing in noisy environments, using surveillance videos
as long-range listening devices, and facilitating conversations in noisy social settings.
Developing these models could significantly improve our daily lives.
• Audiovisual speech separation: Recent years have witnessed a growing interest in audiovi-
sual speech separation, driven by the remarkable human capacity to selectively focus on a
specific sound source amidst background noise, commonly known as the "cocktail party
effect." This phenomenon poses a significant challenge in computer speech recognition,
prompting the development of automatic speech separation techniques aimed at isolating
individual speech sources from complex audio signals. In a noteworthy study, Ephrat et al. [130]
proposed that audiovisual speech separation surpasses audio-only approaches by leveraging visual
cues from a speaker’s face to resolve ambiguity
in speech signals. By integrating visual information, the model’s ability to disentangle over-
lapping speech signals is enhanced. The implications of automatic speech separation extend
across diverse applications, including assistive technologies for individuals with hearing
impairments and head-mounted devices designed to facilitate effective communication in
noisy meeting scenarios.
• Talking face generation: Generating a realistic talking face of a target character, synchronized
with a given speech and ensuring smooth transitions between facial images, is the objective
of talking face generation. This task has garnered substantial interest and poses a significant
challenge due to the dynamic nature of facial movements, which depend on both visual
information (input face image) and acoustic information (input speech audio) to achieve
accurate lip-speech synchronization. Despite its challenges, talking face generation holds
immense potential for various applications, including teleconferencing, creating virtual
characters with specific facial expressions, and enhancing speech comprehension. In recent
years, significant advancements have been made in the field of talking face generation, as
evidenced by notable studies [65, 133, 134, 513, 671].
5.12.2 Datasets
Several datasets are widely used for audiovisual multimodal research, including VoxCeleb and
TCD-TIMIT [173]. We briefly discuss some of them in the following section.
• TCD-TIMIT [173]: This is an extensive and diverse audiovisual dataset that encompasses
both audio and video recordings of 600 distinct sentences spoken by 60 participants. The
dataset features a wide range of speakers with different genders, accents, and backgrounds,
making it highly suitable for talker-independent speech recognition research. The audio
recordings are of exceptional quality, captured using high-fidelity microphones with a
sampling rate of 48kHz. Meanwhile, the video footage is of 720p resolution and includes
depth information for every frame.
• LipReading in the Wild (LRW) [93]: The LRW is a comprehensive audiovisual dataset that
encompasses 500 distinct words spoken by more than 1000 speakers. This dataset has been
segmented into distinct training, evaluation, and test sets to facilitate efficient research.
Additionally, the LRW-1000 dataset [617] represents a subset of LRW, featuring a 1000-
word vocabulary. Researchers can benefit from pre-trained weights included with this
dataset, simplifying the evaluation process. Overall, these datasets are highly regarded in
the scientific community for their size and versatility in supporting research related to
speech recognition and natural language processing.
• LRS2 and LRS3 10 : The LRS2 and LRS3 datasets are additional examples of audiovisual
speech recognition datasets that have been gathered from videos captured in real-world
settings. Each of these datasets has its own distinct train/test split and includes cropped face
tracks as well as corresponding audio clips sourced from British television. Both datasets
are considered to be of significant value to researchers in the field of speech recognition,
particularly those focused on audiovisual analysis.
• GRID [97]: This dataset comprises high-fidelity audio and video recordings of more than
1000 sentences spoken by 34 distinct speakers, including 18 males and 16 females. The
sentences were gathered using the prompt "put red at G9 now" and are widely employed in
research related to audio-visual speech separation and talking face synthesis. The dataset is
considered to be of exceptional quality and is highly sought after in the scientific community.
5.12.3 Models
In recent years, there has been a remarkable surge in the development of algorithms tailored
for multimodal tasks. Specifically, significant attention has been devoted to the advancement of
neural networks for Text-to-Speech (TTS) applications [251, 458–460]. The integration of visual
and auditory modalities through multimodal processing has played a pivotal role in enhancing
various tasks relevant to our daily lives. Lip-reading, for instance, has witnessed notable progress in
recent years, whether accompanied by audio or not. Son et al. have made a significant contribution
to this field with their hybrid model [511]. Combining convolutional neural networks (CNN),
long short-term memory (LSTM) networks, and an attention mechanism, their model captures
correlations between lip videos and audio, enabling accurate character generation. Additionally,
the authors introduce a new dataset called LRS, which facilitates the development of lip-reading
models.
Another noteworthy model, LiRA [359], focuses on self-supervised learning for lip-reading. It
leverages lip image sequences and audio waveforms to derive high-level representations during the
pre-training stage, achieving word-level and sentence-level lip-reading capabilities. For speech
reconstruction from silent video, Ephrat et al. [129] propose an
innovative model that frames the task as an acoustic regression problem instead of a visual-to-
text modeling approach. Their work emphasizes the advantages of this perspective. Furthermore,
Vid2Speech [131], a CNN-based model, takes facial image sequences as input and generates cor-
responding speech audio waveforms. It employs a two-tower CNN model that processes facial
grayscale images while calculating optical flow between frames. Additionally, other models such as
those based on mutual information maximization [667] and spatiotemporal fusion [653] have been
proposed for the lip-reading task, further expanding the methodologies explored in this domain.
In an early attempt to develop algorithms for audiovisual speech separation, the authors of
[130] proposed a CNN-based architecture that encodes facial images and speech spectrograms to
compute a complex mask for speech separation. Additionally, they introduced the AVspeech dataset
in this work. AV-CVAE [393] utilizes a conditional VAE to detect the lip movements of the speaker
and predict separated speech. In a deviation from speech signals, [385] focuses on audiovisual
singing separation and employs a two-stream CNN architecture, Y-Net [374], to process audio and
video separately. This work introduces a large dataset of solo singing videos for audiovisual singing
separation. The VisualSpeech [151] architecture takes a sequence of lip-movement face images
together with the mixed audio as input and predicts a complex mask. It also proposes a cross-modal embedding
space to facilitate the correlation of audio and visual modalities. Finally, FaceFilter [94] uses still
images as visual information, and other methods for the audiovisual speech separation task are
proposed in [10, 146, 379].
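To make the complex-masking idea concrete, the sketch below fuses a per-frame visual embedding of the target speaker with the mixture spectrogram and predicts a complex ratio mask applied to the mixture STFT. It is an illustrative simplification rather than the architecture of [130] or [151]; the module name, layer choices, and dimensions are assumptions.

```python
# Illustrative audio-visual masking sketch (not the exact architecture of the
# cited works): a GRU over the mixture spectrogram is fused with a projected
# visual embedding to predict a complex ratio mask for the target speaker.
import torch
import torch.nn as nn

class AVMaskNet(nn.Module):
    def __init__(self, n_freq=257, visual_dim=512, hidden=256):
        super().__init__()
        self.audio_enc = nn.GRU(2 * n_freq, hidden, batch_first=True)  # real+imag bins
        self.visual_proj = nn.Linear(visual_dim, hidden)
        self.mask_head = nn.Linear(2 * hidden, 2 * n_freq)             # real+imag mask

    def forward(self, mix_stft, visual_emb):
        # mix_stft: (batch, frames, n_freq), complex-valued mixture STFT
        # visual_emb: (batch, frames, visual_dim), face/lip embeddings aligned to the frames
        a = torch.cat([mix_stft.real, mix_stft.imag], dim=-1)
        a, _ = self.audio_enc(a)
        v = self.visual_proj(visual_emb)
        m_re, m_im = self.mask_head(torch.cat([a, v], dim=-1)).chunk(2, dim=-1)
        mask = torch.complex(m_re, m_im)
        return mask * mix_stft   # masked (complex) spectrogram of the target speaker
```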
The rise of Deepfake videos on the internet has led to a surge in demand for creating realistic
talking faces for various applications, such as video production, marketing, and entertainment.
Previously, the conventional approach involved manipulating 3D meshes to create specific faces,
which was time-consuming and limited to certain identities. However, recent advancements in deep
generative models have made significant progress. For example, DAVS [671] introduced an end-to-
end trainable deep neural network capable of learning a joint audiovisual representation, which uses
adversarial training to disentangle the latent space. Another architecture proposed by ATVGnet
[65] consists of an audio transformation network (AT-net) and a visual generation network (VG-net)
for processing acoustic and visual information, respectively. This method introduced a regression-
based discriminator, a dynamically adjustable pixel-wise loss, and an attention mechanism. In
[674], a novel framework for talking face generation was presented, which discovers audiovisual
coherence through an asymmetrical mutual information estimator. Furthermore, the authors in
[133] proposed an end-to-end approach based on generative adversarial networks that use noisy
speech for talking face generation. In addition, alternative methods based on conditional recurrent
adversarial networks and speech-driven talking face generation were introduced in [134, 513].
6.1.2 Models
Various techniques have been proposed to adapt a deep learning model for speech processing
tasks. One example is reconstruction-based domain adaptation, which leverages an auxiliary
reconstruction task to learn a representation shared across domains. The
Deep Reconstruction-Classification Network (DRCN) [154] is one such approach: it jointly
addresses (i) classification of the source data and (ii)
reconstruction of the input data. Another technique used in domain adaptation is the domain-
adversarial neural network architecture, which aims to learn domain-invariant features using a
gradient reversal layer [51, 574, 654].
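A minimal PyTorch sketch of the gradient-reversal idea is given below; the encoder, classifier heads, and dimensions are placeholders rather than the exact architectures of the cited works. The layer is an identity in the forward pass and flips (and scales) the gradient in the backward pass, so the domain classifier's loss pushes the encoder toward domain-invariant features.

```python
# Minimal gradient-reversal sketch for domain-adversarial training;
# the encoder and heads are illustrative placeholders.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)                      # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None    # reversed, scaled gradient

class DomainAdversarialNet(nn.Module):
    def __init__(self, input_dim=80, feat_dim=256, n_classes=500, n_domains=2, lambd=1.0):
        super().__init__()
        self.lambd = lambd
        self.encoder = nn.Sequential(nn.Linear(input_dim, feat_dim), nn.ReLU(),
                                     nn.Linear(feat_dim, feat_dim), nn.ReLU())
        self.task_head = nn.Linear(feat_dim, n_classes)    # e.g. speaker or senone classifier
        self.domain_head = nn.Linear(feat_dim, n_domains)  # source vs. target discriminator

    def forward(self, x):
        h = self.encoder(x)
        # The task loss trains encoder + task head normally; the domain loss,
        # back-propagated through the reversed gradient, makes h domain-invariant.
        return self.task_head(h), self.domain_head(GradReverse.apply(h, self.lambd))
```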
Domain adaptation techniques have been successfully applied to various speech processing
tasks, such as speaker recognition [44, 200, 313, 395] and verification [75, 76, 306, 645, 673], where
the goal is to verify a speaker's identity from their voice. One approach for domain adaptation
in speaker verification is to use adversarial domain training to learn speaker-independent features
insensitive to variations in the recording environment [75].
Domain adaptation has also been applied to speech recognition [213, 367, 519, 631] to improve
speech recognition accuracy in a target domain. One recent approach for domain adaptation in
ASR is prompt-tuning [112], which adapts the system to a new domain by training a small set of
prompt parameters on a limited amount of in-domain data. Another approach is to use adapter modules for transducer-based speech
recognition systems [364, 479], which can balance the recognition accuracy of general speech and
improve recognition on adaptation domains. The Machine Speech Chain integrates both end-to-end
(E2E) ASR and neural text-to-speech (TTS) into a single closed loop [631]. This integration can be used for
domain adaptation by fine-tuning the E2E ASR on a small amount of data from the new domain
and then using the TTS to generate synthetic speech in the new domain for further training.
In addition to domain adaptation techniques used in speech recognition, there has been growing
interest in adapting text-to-speech (TTS) models to specific speakers or domains. This research
direction is critical, especially in low-resource settings where collecting sufficient training data can
be challenging. Several recent works have proposed different approaches for speaker and domain
adaptation in TTS, such as AdaSpeech [66, 599, 609].
6.2.2 Models
In low-resource ASR, meta-learning is used to quickly adapt to unseen target languages by formulating
ASR for different languages as different tasks and meta-learning the initialization parameters from
many pretraining languages [192, 501]. The proposed approach, MetaASR [192], significantly
outperforms the state-of-the-art multitask pretraining approach on all target languages with
different combinations of pretraining languages. In speaker verification, [73] improves
meta-learning training by introducing two methods that strengthen the backbone
embedding network. The proposed methods obtain consistent improvements over the
existing meta-learning training framework [279].
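The sketch below illustrates the general first-order MAML-style recipe behind such approaches, treating each language as a task; the model interface (a loss(batch) method returning a scalar), the data handling, and the step sizes are illustrative assumptions, not the exact MetaASR training setup.

```python
# First-order MAML-style meta-update over languages-as-tasks (illustrative;
# assumes `model.loss(batch)` returns a scalar training loss).
import copy
import torch

def meta_train_step(model, language_tasks, inner_lr=1e-3, outer_lr=1e-4, inner_steps=1):
    meta_grads = [torch.zeros_like(p) for p in model.parameters()]
    for support_batch, query_batch in language_tasks:        # one (support, query) pair per language
        learner = copy.deepcopy(model)                       # adapt a copy, keep the shared init
        inner_opt = torch.optim.SGD(learner.parameters(), lr=inner_lr)
        for _ in range(inner_steps):                         # adapt to the language on support data
            inner_opt.zero_grad()
            learner.loss(support_batch).backward()
            inner_opt.step()
        learner.zero_grad()
        learner.loss(query_batch).backward()                 # evaluate the adapted weights
        for g, p in zip(meta_grads, learner.parameters()):
            if p.grad is not None:
                g += p.grad                                  # first-order approximation
    with torch.no_grad():                                    # update the shared initialization
        for p, g in zip(model.parameters(), meta_grads):
            p -= outer_lr * g / len(language_tasks)
```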
Meta-learning has proven to be a promising approach in various speech-related tasks, including
low-resource ASR and speaker verification. In addition to these tasks, meta-learning has also been
applied to few-shot speaker adaptive TTS and language-agnostic TTS, demonstrating its potential to
improve performance across different speech technologies. Meta-TTS [208] is an example of a meta-
learning model used for a few-shot speaker adaptive TTS. It can synthesize high-speaker-similarity
speech from a few enrolment samples with fewer adaptation steps. Similarly, a language-agnostic
meta-learning approach is proposed in [358] for low-resource TTS.
Fig. 17. Transformer architecture and Adapter, Prefix Tuning, and LoRA.
Fig. 18. The architecture of the 1D convolution layer-based lightweight adapter. 𝑘 is the kernel size of the 1D convolution. ∗ denotes depth-wise convolution.
Adapter Tuning. Adapters are a type of neural module that can be retrofitted onto a pre-trained
language model, with significantly fewer parameters than the original model. One such type is the
bottleneck or standard adapter (Houlsby et al., 2019; Pfeiffer et al., 2020) [189, 423]. The adapter
takes an input vector ℎ ∈ R𝑑 and down-projects it to a lower-dimensional space with dimensionality
𝑚 (where 𝑚 < 𝑑), applies a non-linear function 𝑔(·), and then up-projects the result back to the
original 𝑑-dimensional space. Finally, the output is obtained by adding a residual connection.
𝒉 ← 𝒉 + 𝑔(𝒉𝑾down)𝑾up    (30)
where 𝑾down ∈ R𝑑×𝑚 and 𝑾up ∈ R𝑚×𝑑 are the down- and up-projection matrices,
respectively. Previous studies have empir-
ically shown that a two-layer feedforward neural network with a bottleneck is effective. In this
work, we follow the experimental settings outlined in [423] for the adapter, which is inserted after
the feedforward layer of every transformer module, as depicted in Figure 17.
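A minimal PyTorch sketch of such a bottleneck adapter, directly following Eq. (30), is shown below; the model and bottleneck dimensions are illustrative.

```python
# Bottleneck adapter of Eq. (30): down-project, non-linearity, up-project,
# residual connection. Dimensions are illustrative.
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    def __init__(self, d_model=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)   # W_down: d x m
        self.act = nn.ReLU()                         # non-linearity g(.)
        self.up = nn.Linear(bottleneck, d_model)     # W_up:   m x d

    def forward(self, h):
        # h <- h + g(h W_down) W_up
        return h + self.up(self.act(self.down(h)))
```

During adaptation, only these adapter parameters are updated while the surrounding pre-trained transformer weights stay frozen.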
Prefix tuning. Recent studies have suggested modifying the attention module of the Transformer
model to improve its performance in natural language processing tasks. This approach involves
adding learnable vectors to the pre-trained multi-head attention keys and values at every layer, as
depicted in Figure 17. Specifically, two sets of learnable prefix vectors, 𝑷𝑲 and 𝑷𝑽 , are concatenated
with the original key and value matrices 𝑲 and 𝑽 , while the query matrix 𝑸 remains unchanged.
The resulting matrices are then used for multi-head attention, where each head of the attention
mechanism is computed as follows:
head𝑖 = Attn(𝑸𝑾𝑄^(𝑖), [𝑷𝐾^(𝑖), 𝑲𝑾𝐾^(𝑖)], [𝑷𝑉^(𝑖), 𝑽𝑾𝑉^(𝑖)])    (31)
where Attn(·) is scaled dot-product attention given by:
Attn(𝑸, 𝑲, 𝑽) = softmax(𝑸𝑲^T / √𝑑𝑘)𝑽    (32)
The attention heads in each layer are modified by prefix tuning, with only the prefix vectors 𝑷𝐾 and
𝑷𝑉 being updated during training. This approach provides greater control over the transmission of
acoustic information between layers and effectively activates the pre-trained model’s knowledge.
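The sketch below illustrates prefix tuning for a single multi-head attention layer along the lines of Eqs. (31)-(32): only 𝑷𝐾 and 𝑷𝑉 are trainable, while the pre-trained projections stay frozen. The prefix length, dimensions, and initialization are illustrative assumptions, and the output projection is omitted for brevity.

```python
# Prefix tuning for one attention layer (Eqs. 31-32): learnable prefix keys and
# values are prepended to the frozen projections' keys and values.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrefixAttention(nn.Module):
    def __init__(self, d_model=768, n_heads=12, prefix_len=16):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.W_q, self.W_k, self.W_v = (nn.Linear(d_model, d_model) for _ in range(3))
        for proj in (self.W_q, self.W_k, self.W_v):
            proj.requires_grad_(False)               # pre-trained projections stay frozen
        # Only the prefix vectors P_K and P_V are trained.
        self.P_K = nn.Parameter(0.02 * torch.randn(n_heads, prefix_len, self.d_head))
        self.P_V = nn.Parameter(0.02 * torch.randn(n_heads, prefix_len, self.d_head))

    def forward(self, x):                            # x: (batch, seq_len, d_model)
        B, T, _ = x.shape
        def split(t):                                # -> (batch, heads, seq_len, d_head)
            return t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(self.W_q(x)), split(self.W_k(x)), split(self.W_v(x))
        k = torch.cat([self.P_K.expand(B, -1, -1, -1), k], dim=2)   # [P_K ; K W_K]
        v = torch.cat([self.P_V.expand(B, -1, -1, -1), v], dim=2)   # [P_V ; V W_V]
        attn = F.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d_head), dim=-1)
        return (attn @ v).transpose(1, 2).reshape(B, T, -1)         # output projection omitted
```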
LoRA. LoRA is a novel approach proposed by Hu et al. (2021) [198], which aims to approximate
weight updates in the Transformer by injecting trainable low-rank matrices into its layers. In
this method, a pre-trained weight matrix 𝑾 ∈ R𝑑×𝑘 is updated by a low-rank decomposition
𝑾 + Δ𝑾 = 𝑾 + 𝑾down𝑾up, where 𝑾down ∈ R𝑑×𝑟 and 𝑾up ∈ R𝑟×𝑘 are tunable parameters and 𝑟
is the rank of the decomposition, with 𝑟 < 𝑑. Specifically, for a given input 𝒙 to
the linear projection in the multi-headed attention layer, LoRA modifies the projection output 𝒉 as
follows:
𝒉 ← 𝒉 + 𝑠 · 𝒙𝑾down𝑾up (33)
In this work, LoRA is integrated into four locations of the multi-head attention layer, as illustrated
in Figure 17. Thanks to its lightweight nature, the pre-trained model can accommodate many
small modules for different tasks, allowing for efficient task switching by replacing the modules.
Additionally, LoRA incurs no inference latency and achieves a convergence rate that is comparable
to that of training the original model, unlike fully fine-tuned models [198].
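A minimal sketch of a LoRA-augmented linear projection implementing Eq. (33) is given below; the rank, scaling, and initialization choices are illustrative (here 𝑾up is zero-initialized so training starts from the unmodified pre-trained projection).

```python
# LoRA-augmented linear projection (Eq. 33): the pre-trained weight is frozen
# and only the low-rank factors W_down, W_up (rank r) are trained.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_in=768, d_out=768, r=8, scale=1.0):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)               # pre-trained projection W (frozen)
        self.base.requires_grad_(False)
        self.W_down = nn.Parameter(0.02 * torch.randn(d_in, r))
        self.W_up = nn.Parameter(torch.zeros(r, d_out))  # zero init: no change at step 0
        self.scale = scale

    def forward(self, x):
        # h <- h + s * x W_down W_up
        return self.base(x) + self.scale * (x @ self.W_down @ self.W_up)
```

Because the product 𝑾down𝑾up can be merged into the frozen weight after training, this keeps the no-extra-inference-latency property noted above.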
Convolutional Adapter. CNNs have become increasingly popular in the field of speech processing
due to their ability to learn task-specific information and combine channel-wise information within
local receptive fields. To further improve the efficiency of CNNs for speech processing tasks, Li
et al. (2023) [315] proposed a lightweight adapter, called the ConvAdapter, which uses three 1D
convolutional layers, layer normalization, and a squeeze-and-excite module (Hu et al., 2017) [201],
as shown in Figure 18. By utilizing depth-wise convolution, which requires fewer parameters and
is more computationally efficient, the authors were able to achieve better performance while using
fewer resources. In this approach, the ConvAdapter is added to the same location as the Bottleneck
Adapter (Figure 17).
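A rough PyTorch sketch of a ConvAdapter-style module is shown below; it follows the general recipe of bottleneck 1D convolutions with a depth-wise convolution and a squeeze-and-excite gate, but the exact kernel sizes, channel ratios, and ordering used in [315] may differ.

```python
# ConvAdapter-style module (illustrative): bottleneck 1D convolutions with a
# depth-wise convolution and a squeeze-and-excite gate, added residually.
import torch.nn as nn

class SqueezeExcite(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                      # x: (batch, channels, time)
        w = self.gate(x.mean(dim=-1))          # squeeze over time
        return x * w.unsqueeze(-1)             # channel-wise re-weighting

class ConvAdapter(nn.Module):
    def __init__(self, d_model=768, bottleneck=64, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.norm = nn.LayerNorm(d_model)
        self.down = nn.Conv1d(d_model, bottleneck, kernel_size, padding=pad)
        self.dw = nn.Conv1d(bottleneck, bottleneck, kernel_size, padding=pad,
                            groups=bottleneck)                 # depth-wise convolution
        self.up = nn.Conv1d(bottleneck, d_model, kernel_size, padding=pad)
        self.se = SqueezeExcite(bottleneck)
        self.act = nn.ReLU()

    def forward(self, h):                      # h: (batch, time, d_model)
        x = self.norm(h).transpose(1, 2)       # Conv1d expects (batch, channels, time)
        x = self.act(self.down(x))
        x = self.se(self.act(self.dw(x)))
        x = self.up(x).transpose(1, 2)
        return h + x                           # residual connection
```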
Table 13. Evaluation of full fine-tuning and various parameter-efficient training methods on pre-trained
Wav2Vec 2.0 on the SURE benchmark. The fraction of trainable parameters is given as a percentage, with
the absolute number of trainable parameters shown for the KS task. On MELD, weighted-f1 (w-f1) is used
as the metric to account for class imbalance. The best performance is in bold and the second best is
underlined. Results are from Li et al. (2023) [315].
Fine Tuning 315,703,947 96.53 42.93 99.00 92.36 0.2295 0.135 0.0903 99.08
Adapter 25,467,915 (8.08%) 94.07 41.58 98.87 96.32 0.2290 0.214 0.2425 99.19
Prefix Tuning 1,739,787 (0.55%) 90.00 44.21 99.73 98.49 0.2255 0.166 0.1022 98.86
LoRA 3,804,171 (1.20%) 90.00 47.05 99.00 97.61 0.2428 0.149 0.1014 98.28
ConvAdapter 2,952,539 (0.94%) 91.87 46.30 99.60 97.61 0.2456 0.2062 0.2958 98.99
Table 14. Results on the SURE benchmark for full fine-tuning and other parameter-efficient training methods
on pre-trained Wav2Vec 2.0 for the IC, PR, and SF tasks, evaluated on FS: Fluent Speech [350], LS:
LibriSpeech [410], and SNIPS, respectively.
                 IC (FS)                      PR (LS)                      SF (SNIPS)
Method           #Parameters        ACC% ↑   #Parameters        PER ↓    #Parameters        F1% ↑   CER ↓
Fine-Tuning      315707288          99.60    311304394          0.0577   311375119          93.89   0.1411
Adapter          25471256 (8.06%)   99.39    25278538 (8.01%)   0.1571   25349263 (8.14%)   92.60   0.1666
Prefix Tuning    1743128 (0.55%)    93.43    1550410 (0.49%)    0.1598   1621135 (0.50%)    62.32   0.6041
LoRA             3807512 (1.20%)    99.68    3614794 (1.16%)    0.1053   3685519 (1.18%)    90.61   0.2016
ConvAdapter      3672344 (1.16%)    95.60    3479626 (1.11%)    0.1532   3550351 (1.14%)    59.27   0.6405
Table 15. Results on the SURE benchmark for the TTS task. MCD and WER are the metrics used to compare
fine-tuning and other parameter-efficient approaches.
                              LTS                   L2ARCTIC
Method         #Parameters    MCD ↓      WER ↓      MCD ↓      WER ↓
Fine-tuning    35802977       6.2038     0.2655     6.71469    0.2141
Adapter        659200         6.1634     0.3143     6.544      0.2504
Prefix         153600         6.2523     0.3334     7.4264     0.3244
LoRA           81920          6.8319     0.3786     7.0698     0.3291
ConvAdapter    108800         6.9202     0.3365     6.9712     0.3227
Table 13, Table 14, and Table 15 present the results of various speech processing tasks in the
SURE benchmark. The findings demonstrate that the adapter-based methods perform comparably
to full fine-tuning. However, there is no significant advantage of any particular adapter type over
others for these benchmark tasks and datasets.
(1) Large Speech Models: In addition to the advancements made with wav2vec2.0, further
progress in the field of ASR and TTS models involves the development of larger and more
comprehensive models, along with the utilization of larger datasets. By leveraging these
resources, it becomes possible to create TTS models that exhibit enhanced naturalness and
human-like prosody. One promising approach to achieve this is through the application of
adversarial training, where a discriminator is employed to distinguish between machine-
generated speech and reference speech. This adversarial framework facilitates the generation
of TTS models that closely resemble human speech, providing a significant step forward in
achieving more realistic and high-quality synthesized speech. By exploring these avenues,
researchers aim to push the boundaries of speech synthesis technology, ultimately enhancing
the overall performance and realism of TTS systems.
(2) Multilingual Models: Self-supervised learning has emerged as a transformative approach
in the field of speech recognition, particularly for low-resource languages characterized
by scarce or unavailable labeled datasets. The recent development of the XLS-R model, a
state-of-the-art self-supervised speech recognition model, represents a significant milestone
in this domain. With a remarkable scale of over 2 billion parameters, the XLS-R model has
been trained on a diverse dataset spanning 128 languages, surpassing its predecessor in
terms of language coverage. The notable advantage of scaling up larger multilingual models
like XLS-R lies in the substantial performance improvements they offer. As a result, these
models are poised to outperform single-language models and hold immense promise for
the future of speech recognition. By harnessing the power of self-supervised learning and
leveraging multilingual datasets, the XLS-R model showcases the potential for addressing the
challenges posed by low-resource languages and advancing the field of speech recognition
to new heights.
(3) Multimodal Speech Models: Traditional speech and text models have typically operated
within a single modality, focusing solely on either speech or text inputs and outputs. How-
ever, as the scale of generative models continues to grow exponentially, the integration
of multiple modalities becomes a natural progression. This trend is evident in the latest
developments, such as the unveiling of groundbreaking language models like GPT-4 [405]
and Kosmos-1 [207], which demonstrate the ability to process both images and text jointly.
These pioneering multimodal models pave the way for the emergence of large-scale ar-
chitectures that can seamlessly handle speech and other modalities in a unified manner.
The convergence of multiple modalities within a single model opens up new avenues for
comprehensive understanding and generation of multimodal content, and it is highly antic-
ipated that we will witness the rapid development of large multimodal models tailored for
speech and beyond in the near future.
(4) In-Context Learning: Utilizing mixed-modality models opens up possibilities for the devel-
opment of in-context learning approaches for a wide range of speech-related tasks. This
paradigm allows the tasks to be explicitly defined within the input, along with accompa-
nying examples. Remarkable progress has already been demonstrated in large language
models (LLMs), including notable works such as InstructGPT [406], FLAN-T5 [90], and
LLaMA [535]. These models showcase the efficacy of in-context learning, where the in-
tegration of context-driven information empowers the models to excel in various speech
tasks. By leveraging mixed-modality models and incorporating contextual cues, researchers
are advancing the boundaries of speech processing capabilities, paving the way for more
versatile and context-aware speech systems.
(5) Controllable Speech Generation: An intriguing application stemming from the aforementioned
concept is controllable text-to-speech (TTS), which allows for fine-grained control over
various attributes of the synthesized speech. Attributes such as tone, accent, age, gender,
and more can be precisely controlled through in-context text guidance. This controllability
in TTS opens up exciting possibilities for personalization and customization, enabling users
to tailor the synthesized speech to their specific requirements. By leveraging advanced
models and techniques, researchers are making significant strides in developing controllable
TTS systems that provide users with a powerful and flexible speech synthesis experience.
(6) Parameter-efficient Learning: With the increasing scale of LLMs and speech models, it
becomes imperative to adapt these models with minimal parameter updates. This necessi-
tates the development of specialized adapters that can efficiently update these emerging
mixed-modality large models. Additionally, model compression techniques have proven
to be practical solutions in addressing the challenges posed by these large models. Recent
research [280, 422, 593] has demonstrated the effectiveness of model compression, highlight-
ing the sparsity that exists within these models, particularly for specific tasks. By employing
model compression techniques, researchers can reduce the computational requirements
and memory footprint of these models while preserving their performance, making them
more practical and accessible for real-world applications.
(7) Explainability: Explainability remains elusive for these large networks as they grow. Re-
searchers continue to work on explaining these networks' functioning and learning dynamics.
Recently, much work has been done to study the fine-tuning and in-context learning dy-
namics of these large models for text under the neural-tangent-kernel (NTK) asymptotic
framework [366]. Such exploration is yet to be done in the speech domain. Moreover, ex-
plainability could be built in as an inductive bias in the architecture. To this end, brain-inspired
architectures [382] are being developed, which may shed more light on this aspect of large
models.
(8) Neuroscience-inspired Architectures: In recent years, there has been significant research
exploring the parallels between speech-processing architectures and the intricate workings
of the human brain [382]. These studies have unveiled compelling evidence of a strong
correlation between the layers of speech models and the functional hierarchy observed in
the human brain. This intriguing finding has served as a catalyst for the development of
neuroscience-inspired speech models that demonstrate comparable performance to state-
of-the-art (SOTA) models [382]. By drawing inspiration from the underlying principles
of neural processing in the human brain, these innovative speech models aim to enhance
our understanding of speech perception and production while pushing the boundaries of
performance in the field of speech processing.
(9) Text-to-Audio Models for Text-to-Speech: Lately, transformer and diffusion-based text-to-
audio (TTA) model development is turning into an exciting area of research. Until recently,
most of these models [155, 272, 332, 580, 611] overlooked speech in favour of general audio.
In the future, however, the models will likely strive to be equally performant in both audio
and speech. To that end, current TTS methods will likely be an integral part of those models.
Recently, Suno-AI [523] have aimed at striking a good balance between general audio and
speech, although their implementation is not public, nor have they provided any detailed
paper.
Acknowledgement
This research is supported by the Ministry of Education, Singapore, under its AcRF Tier-2 grant
(Project no. T2MOE2008, and Grantor reference no. MOE-T2EP20220-0017), and A*STAR under its
RIE 2020 AME programmatic grant (project reference no. RGAST2003). Any opinions, findings and
conclusions or recommendations expressed in this material are those of the author(s) and do not
reflect the views of the Ministry of Education, Singapore.
References
[1] 2022. Conformer-1. AssemblyAI (2022). https://www.assemblyai.com/blog/conformer-1/
[2] 2022. Speech Recognition With Conformer. Nvidia (2022). https://docs.nvidia.com/tao/tao-toolkit/text/asr/speech_
recognition_with_conformer.html
[3] Ossama Abdel-Hamid, Abdel-rahman Mohamed, Hui Jiang, Li Deng, Gerald Penn, and Dong Yu. 2014. Convolutional
neural networks for speech recognition. IEEE/ACM Transactions on audio, speech, and language processing 22, 10
(2014), 1533–1545.
[4] Ossama Abdel-Hamid, Abdel-rahman Mohamed, Hui Jiang, Li Deng, Gerald Penn, and Dong Yu. 2014. Convolutional
Neural Networks for Speech Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 22, 10
(2014), 1533–1545. https://doi.org/10.1109/TASLP.2014.2339736
[5] Ossama Abdel-Hamid, Abdel-rahman Mohamed, Hui Jiang, and Gerald Penn. 2012. Applying Convolutional Neural
Networks concepts to hybrid NN-HMM model for speech recognition. In 2012 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP). 4277–4280. https://doi.org/10.1109/ICASSP.2012.6288864
[6] Osama Abdeljaber, Onur Avci, Serkan Kiranyaz, Moncef Gabbouj, and Daniel J Inman. 2017. Real-time vibration-based
structural damage detection using one-dimensional convolutional neural networks. Journal of Sound and Vibration
388 (2017), 154–170.
[7] Zrar Kh. Abdul and Abdulbasit K. Al-Talabani. 2022. Mel Frequency Cepstral Coefficient and its Applications: A
Review. IEEE Access 10 (2022), 122136–122158. https://doi.org/10.1109/ACCESS.2022.3223444
[8] Sherif Abdulatif, Ruizhe Cao, and Bin Yang. 2022. CMGAN: Conformer-Based Metric-GAN for Monaural Speech
Enhancement. arXiv preprint arXiv:2209.11112 (2022).
[9] Sivanand Achanta, Albert Antony, Ladan Golipour, Jiangchuan Li, Tuomo Raitio, Ramya Rasipuram, Francesco
Rossi, Jennifer Shi, Jaimin Upadhyay, David Winarsky, et al. 2021. On-device neural speech synthesis. In 2021 IEEE
Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 1155–1161.
[10] Triantafyllos Afouras, Joon Son Chung, and Andrew Zisserman. 2018. The conversation: Deep audio-visual speech
enhancement. arXiv preprint arXiv:1804.04121 (2018).
[11] Vatsal Aggarwal, Marius Cotescu, Nishant Prateek, Jaime Lorenzo-Trueba, and Roberto Barra-Chicote. 2020. Using
Vaes and Normalizing Flows for One-Shot Text-To-Speech Synthesis of Expressive Speech. In ICASSP 2020 - 2020
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 6179–6183. https://doi.org/10.1109/
ICASSP40776.2020.9053678
[12] Waleed Alsabhan. 2023. Human–Computer Interaction with a Real-Time Speech Emotion Recognition with Ensem-
bling Techniques 1D Convolution Neural Network and Attention. Sensors 23, 3 (2023), 1386.
[13] Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared
Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al. 2016. Deep speech 2: End-to-end speech recognition in
english and mandarin. In International conference on machine learning. PMLR, 173–182.
[14] Xavier Anguera, Simon Bozonnet, Nicholas Evans, Corinne Fredouille, Gerald Friedland, and Oriol Vinyals. 2012.
Speaker diarization: A review of recent research. IEEE Transactions on audio, speech, and language processing 20, 2
(2012), 356–370.
[15] Ebrahim Ansari, Amittai Axelrod, Nguyen Bach, Ondřej Bojar, Roldano Cattoni, Fahim Dalvi, Nadir Durrani, Marcello
Federico, Christian Federmann, Jiatao Gu, et al. 2020. Findings of the IWSLT 2020 evaluation campaign. In Proceedings
of the 17th International Conference on Spoken Language Translation. 1–34.
[16] Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, et al.
2021. Speecht5: Unified-modal encoder-decoder pre-training for spoken language processing. arXiv preprint
arXiv:2110.07205 (2021).
[17] Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay
Saunders, Francis M Tyers, and Gregor Weber. 2019. Common voice: A massively-multilingual speech corpus. arXiv
preprint arXiv:1912.06670 (2019).
[18] Sercan Ö Arık, Mike Chrzanowski, Adam Coates, Gregory Diamos, Andrew Gibiansky, Yongguo Kang, Xian Li,
John Miller, Andrew Ng, Jonathan Raiman, et al. 2017. Deep voice: Real-time neural text-to-speech. In International
conference on machine learning. PMLR, 195–204.
[19] Kartik Audhkhasi, George Saon, Zoltán Tüske, Brian Kingsbury, and Michael Picheny. 2019. Forget a Bit to Learn
Better: Soft Forgetting for CTC-Based Automatic Speech Recognition.. In Interspeech. 2618–2622.
[20] Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick
von Platen, Yatharth Saraf, Juan Pino, et al. 2021. XLS-R: Self-supervised cross-lingual speech representation learning
[43] Herve A Bourlard and Nelson Morgan. 1994. Connectionist speech recognition: a hybrid approach. Vol. 247. Springer
Science & Business Media.
[44] Pierre-Michel Bousquet and Mickael Rouvier. 2019. On robustness of unsupervised domain adaptation for speaker
recognition. In Interspeech.
[45] Hervé Bredin and Antoine Laurent. 2021. End-to-end speaker segmentation for overlap-aware resegmentation. arXiv
preprint arXiv:2104.04045 (2021).
[46] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan,
Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural
information processing systems 33 (2020), 1877–1901.
[47] Latané Bullock, Hervé Bredin, and Leibny Paola Garcia-Perera. 2020. Overlap-aware diarization: Resegmentation
using neural end-to-end overlapped speech detection. In ICASSP 2020-2020 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP). IEEE, 7114–7118.
[48] Tanja Bunk, Daksh Varshneya, Vladimir Vlasov, and Alan Nichol. 2020. Diet: Lightweight language understanding
for dialogue systems. arXiv preprint arXiv:2004.09936 (2020).
[49] Maxime Burchi and Radu Timofte. 2023. Audio-Visual Efficient Conformer for Robust Speech Recognition. In
Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2258–2267.
[50] Alena Butryna, Shan-Hui Cathy Chu, Isin Demirsahin, Alexander Gutkin, Linne Ha, Fei He, Martin Jansche, Cibu
Johny, Anna Katanova, Oddur Kjartansson, et al. 2020. Google crowdsourced speech corpora and related open-source
resources for low-resource languages and dialects: an overview. arXiv preprint arXiv:2010.06778 (2020).
[51] Anoop C S, Prathosh A P, and A G Ramakrishnan. 2021. Unsupervised Domain Adaptation Schemes for Building
ASR in Low-Resource Languages. In 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).
342–349. https://doi.org/10.1109/ASRU51503.2021.9688269
[52] William Campbell, Joseph Campbell, Douglas Reynolds, Douglas Jones, and Timothy Leek. 2003. Phonetic speaker
recognition with support vector machines. Advances in neural information processing systems 16 (2003).
[53] Jean Carletta, Simone Ashby, Sebastien Bourban, Mike Flynn, Mael Guillemot, Thomas Hain, Jaroslav Kadlec, Vasilis
Karaiskos, Wessel Kraaij, Melissa Kronenthal, et al. 2006. The AMI meeting corpus: A pre-announcement. In Machine
Learning for Multimodal Interaction: Second International Workshop, MLMI 2005, Edinburgh, UK, July 11-13, 2005,
Revised Selected Papers 2. Springer, 28–39.
[54] Paolo Castiglioni. 2005. Levinson-durbin algorithm. Encyclopedia of Biostatistics 4 (2005).
[55] Andrew A Catellier and Stephen D Voran. 2020. Wawenets: A no-reference convolutional waveform-based approach
to estimating narrowband and wideband speech quality. In ICASSP 2020-2020 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP). IEEE, 331–335.
[56] Roldano Cattoni, Mattia Antonino Di Gangi, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2021. MuST-C: A
multilingual corpus for end-to-end speech translation. Computer Speech & Language 66 (2021), 101155.
[57] Benjamin Cauchi, Kai Siedenburg, Joao F Santos, Tiago H Falk, Simon Doclo, and Stefan Goetze. 2019. Non-intrusive
speech quality prediction using modulation energies and lstm-network. IEEE/ACM Transactions on Audio, Speech,
and Language Processing 27, 7 (2019), 1151–1163.
[58] William Chan, Daniel Park, Chris Lee, Yu Zhang, Quoc Le, and Mohammad Norouzi. 2021. Speechstew: Simply mix
all available speech recognition data to train one large neural network. arXiv preprint arXiv:2104.02133 (2021).
[59] Kai-Wei Chang et al. 2022. An Exploration of Prompt Tuning on Generative Spoken Language Model for Speech
Processing Tasks. arXiv preprint arXiv:2203.16773 (2022).
[60] Sneha Chaudhari, Varun Mithal, Gungor Polatkan, and Rohan Ramanath. 2021. An attentive survey of attention
models. ACM Transactions on Intelligent Systems and Technology (TIST) 12, 5 (2021), 1–32.
[61] Guoguo Chen, Shuzhou Chai, Guanbo Wang, Jiayu Du, Wei-Qiang Zhang, Chao Weng, Dan Su, Daniel Povey, Jan
Trmal, Junbo Zhang, et al. 2021. Gigaspeech: An evolving, multi-domain asr corpus with 10,000 hours of transcribed
audio. arXiv preprint arXiv:2106.06909 (2021).
[62] Junkun Chen, Mingbo Ma, Renjie Zheng, and Liang Huang. 2020. Mam: Masked acoustic modeling for end-to-end
speech-to-text translation. arXiv preprint arXiv:2010.11445 (2020).
[63] Junkun Chen, Mingbo Ma, Renjie Zheng, and Liang Huang. 2021. SpecRec: An Alternative Solution for Improving
End-to-End Speech-to-Text Translation via Spectrogram Reconstruction.. In Interspeech. 2232–2236.
[64] Jiawei Chen, Xu Tan, Yichong Leng, Jin Xu, Guihua Wen, Tao Qin, and Tie-Yan Liu. 2021. Speech-t: Transducer for
text to speech and beyond. Advances in Neural Information Processing Systems 34 (2021), 6621–6633.
[65] Lele Chen, Ross K Maddox, Zhiyao Duan, and Chenliang Xu. 2019. Hierarchical cross-modal talking face generation
with dynamic pixel-wise loss. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.
7832–7841.
[66] Mingjian Chen, Xu Tan, Bohan Li, Yanqing Liu, Tao Qin, Sheng Zhao, and Tie-Yan Liu. 2021. Adaspeech: Adaptive
text to speech for custom voice. arXiv preprint arXiv:2103.00993 (2021).
[67] Nanxin Chen, Yu Zhang, Heiga Zen, Ron J Weiss, Mohammad Norouzi, and William Chan. 2020. WaveGrad: Estimating
gradients for waveform generation. arXiv preprint arXiv:2009.00713 (2020).
[68] Nanxin Chen, Yu Zhang, Heiga Zen, Ron J Weiss, Mohammad Norouzi, and William Chan. 2020. WaveGrad: Estimating
Gradients for Waveform Generation. In International Conference on Learning Representations.
[69] Nanxin Chen, Yu Zhang, Heiga Zen, Ron J Weiss, Mohammad Norouzi, Najim Dehak, and William Chan. 2021.
Wavegrad 2: Iterative refinement for text-to-speech synthesis. arXiv preprint arXiv:2106.09660 (2021).
[70] Qian Chen, Zhu Zhuo, and Wen Wang. 2019. Bert for joint intent classification and slot filling. arXiv preprint
arXiv:1902.10909 (2019).
[71] Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya
Yoshioka, Xiong Xiao, et al. 2022. Wavlm: Large-scale self-supervised pre-training for full stack speech processing.
IEEE Journal of Selected Topics in Signal Processing 16, 6 (2022), 1505–1518.
[72] Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu
Li, et al. 2022. Unispeech-sat: Universal speech representation learning with speaker aware pre-training. In ICASSP
2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 6152–6156.
[73] Yafeng Chen, Wu Guo, and Bin Gu. 2021. Improved meta-learning training for speaker verification. arXiv preprint
arXiv:2103.15421 (2021).
[74] Zehua Chen, Xu Tan, Ke Wang, Shifeng Pan, Danilo Mandic, Lei He, and Sheng Zhao. 2022. Infergrad: Improving
Diffusion Models for Vocoder by Considering Inference in Training. In ICASSP 2022-2022 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 8432–8436.
[75] Zhengyang Chen, Shuai Wang, and Yanmin Qian. 2020. Adversarial Domain Adaptation for Speaker Verification
Using Partially Shared Network.. In Interspeech. 3017–3021.
[76] Zhengyang Chen, Shuai Wang, and Yanmin Qian. 2021. Self-supervised learning based domain adaptation for robust
speaker verification. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP). IEEE, 5834–5838.
[77] Zhuo Chen, Shinji Watanabe, Hakan Erdogan, and John R Hershey. 2015. Speech enhancement and recognition using
multi-task learning of long short-term memory recurrent neural networks. In Sixteenth Annual Conference of the
International Speech Communication Association.
[78] Chung-Cheng Chiu, James Qin, Yu Zhang, Jiahui Yu, and Yonghui Wu. 2022. Self-supervised learning with random-
projection quantizer for speech recognition. In International Conference on Machine Learning. PMLR, 3915–3924.
[79] Chung-Cheng Chiu and Colin Raffel. 2017. Monotonic chunkwise attention. arXiv preprint arXiv:1712.05382 (2017).
[80] Kyunghyun Cho, Aaron Courville, and Yoshua Bengio. 2015. Describing multimedia content using attention-based
encoder-decoder networks. IEEE Transactions on Multimedia 17, 11 (2015), 1875–1886.
[81] Won Ik Cho, Donghyun Kwak, J. Yoon, and Nam Soo Kim. 2020. Speech to Text Adaptation: Towards an Efficient
Cross-Modal Distillation. In Interspeech.
[82] Hyeong-Seok Choi, Juheon Lee, Wansoo Kim, Jie Lee, Hoon Heo, and Kyogu Lee. 2021. Neural analysis and synthesis:
Reconstructing speech from self-supervised representations. Advances in Neural Information Processing Systems 34
(2021), 16251–16265.
[83] Hyeong-Seok Choi, Jinhyeok Yang, Juheon Lee, and Hyeongju Kim. 2022. NANSY++: Unified Voice Synthesis with
Neural Analysis and Synthesis. arXiv preprint arXiv:2211.09407 (2022).
[84] Jan Chorowski, Ron J Weiss, Samy Bengio, and Aäron Van Den Oord. 2019. Unsupervised speech representation
learning using wavenet autoencoders. IEEE/ACM transactions on audio, speech, and language processing 27, 12 (2019),
2041–2053.
[85] Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. 2015. Attention-based
models for speech recognition. Advances in neural information processing systems 28 (2015).
[86] Ju-chieh Chou, Cheng-chieh Yeh, and Hung-yi Lee. 2019. One-shot voice conversion by separating speaker and
content representations with instance normalization. arXiv preprint arXiv:1904.05742 (2019).
[87] Anurag Chowdhury, Austin Cozzo, and Arun Ross. 2022. Domain Adaptation for Speaker Recognition in Singing and
Spoken Voice. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
IEEE, 7192–7196.
[88] Hoon Chung, Hyeong-Bae Jeon, and Jeon Gue Park. 2020. Semi-supervised training for sequence-to-sequence speech
recognition using reinforcement learning. In 2020 International Joint Conference on Neural Networks (IJCNN). IEEE,
1–6.
[89] Hoon Chung, Hyeong-Bae Jeon, and Jeon Gue Park. 2020. Semi-supervised Training for Sequence-to-Sequence
Speech Recognition Using Reinforcement Learning. In 2020 International Joint Conference on Neural Networks (IJCNN).
1–6. https://doi.org/10.1109/IJCNN48605.2020.9207023
[90] Hyung Won Chung, Le Hou, S. Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani,
Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha
Chowdhery, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Wei Yu, Vincent Zhao, Yanping Huang, Andrew M.
Dai, Hongkun Yu, Slav Petrov, Ed Huai hsin Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le,
and Jason Wei. 2022. Scaling Instruction-Finetuned Language Models. ArXiv abs/2210.11416 (2022).
[91] Joon Son Chung, Jaesung Huh, Seongkyu Mun, Minjae Lee, Hee Soo Heo, Soyeon Choe, Chiheon Ham, Sunghwan
Jung, Bong-Jin Lee, and Icksang Han. 2020. In defence of metric learning for speaker recognition. arXiv preprint
arXiv:2003.11982 (2020).
[92] J. S. Chung, A. Nagrani, and A. Zisserman. 2018. VoxCeleb2: Deep Speaker Recognition. In INTERSPEECH.
[93] Joon Son Chung and Andrew Zisserman. 2017. Lip reading in the wild. In Computer Vision–ACCV 2016: 13th Asian
Conference on Computer Vision, Taipei, Taiwan, November 20-24, 2016, Revised Selected Papers, Part II 13. Springer,
87–103.
[94] Soo-Whan Chung, Soyeon Choe, Joon Son Chung, and Hong-Goo Kang. 2020. Facefilter: Audio-visual speech
separation using still images. arXiv preprint arXiv:2005.07074 (2020).
[95] Yu-An Chung, Wei-Ning Hsu, Hao Tang, and James Glass. 2019. An unsupervised autoregressive model for speech
representation learning. arXiv preprint arXiv:1904.03240 (2019).
[96] Yu-An Chung, Yu Zhang, Wei Han, Chung-Cheng Chiu, James Qin, Ruoming Pang, and Yonghui Wu. 2021. W2v-bert:
Combining contrastive learning and masked language modeling for self-supervised speech pre-training. In 2021 IEEE
Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 244–250.
[97] Martin Cooke, Jon Barker, Stuart Cunningham, and Xu Shao. 2006. An audio-visual corpus for speech perception
and automatic speech recognition. The Journal of the Acoustical Society of America 120, 5 (2006), 2421–2424.
[98] Juan M Coria, Hervé Bredin, Sahar Ghannay, and Sophie Rosset. 2021. Overlap-aware low-latency online speaker
diarization based on end-to-end local segmentation. In 2021 IEEE Automatic Speech Recognition and Understanding
Workshop (ASRU). IEEE, 1139–1146.
[99] Marvin Coto-Jiménez. 2019. Improving post-filtering of artificial speech using pre-trained LSTM neural networks.
Biomimetics 4, 2 (2019), 39.
[100] Alice Coucke, Mohammed Chlieh, Thibault Gisselbrecht, David Leroy, Mathieu Poumeyrol, and Thibaut Lavril. 2019.
Efficient keyword spotting using dilated convolutions and gating. In ICASSP 2019-2019 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 6351–6355.
[101] Alice Coucke, Alaa Saade, Adrien Ball, Théodore Bluche, Alexandre Caulier, David Leroy, Clément Doumouro,
Thibault Gisselbrecht, Francesco Caltagirone, Thibaut Lavril, et al. 2018. Snips voice platform: an embedded spoken
language understanding system for private-by-design voice interfaces. arXiv preprint arXiv:1805.10190 (2018).
[102] Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. 2017. Language modeling with gated convolutional
networks. In International conference on machine learning. PMLR, 933–941.
[103] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. 2019. Arcface: Additive angular margin loss for deep
face recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 4690–4699.
[104] Keqi Deng, Songjun Cao, Yike Zhang, and Long Ma. 2021. Improving Hybrid CTC/Attention End-to-End Speech Recog-
nition with Pretrained Acoustic and Language Models. In 2021 IEEE Automatic Speech Recognition and Understanding
Workshop (ASRU). 76–82. https://doi.org/10.1109/ASRU51503.2021.9688009
[105] Keqi Deng, Songjun Cao, Yike Zhang, Long Ma, Gaofeng Cheng, Ji Xu, and Pengyuan Zhang. 2022. Improving
CTC-Based Speech Recognition Via Knowledge Transferring from Pre-Trained Language Models. In ICASSP 2022 -
2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 8517–8521. https://doi.org/10.
1109/ICASSP43922.2022.9747887
[106] Keqi Deng, Songjun Cao, Yike Zhang, Long Ma, Gaofeng Cheng, Ji Xu, and Pengyuan Zhang. 2022. Improving
CTC-based speech recognition via knowledge transferring from pre-trained language models. In ICASSP 2022-2022
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 8517–8521.
[107] Pavel Denisov and Ngoc Thang Vu. 2020. Pretrained Semantic Speech Embeddings for End-to-End Spoken Language
Understanding via Cross-Modal Teacher-Student Learning. In Interspeech.
[108] Brecht Desplanques, Jenthe Thienpondt, and Kris Demuynck. 2020. Ecapa-tdnn: Emphasized channel attention,
propagation and aggregation in tdnn based speaker verification. arXiv preprint arXiv:2005.07143 (2020).
[109] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional
transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[110] Mattia A Di Gangi, Matteo Negri, and Marco Turchi. 2019. Adapting transformer to end-to-end spoken language
translation. In Proceedings of INTERSPEECH 2019. International Speech Communication Association (ISCA), 1133–
1137.
[111] Mireia Diez, Lukáš Burget, Federico Landini, Shuai Wang, and Honza Černockỳ. 2020. Optimizing Bayesian HMM
based x-vector clustering for the second DIHARD speech diarization challenge. In ICASSP 2020-2020 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 6519–6523.
[112] Saket Dingliwal, Ashish Shenoy, Sravan Bodapati, Ankur Gandhe, Ravi Teja Gadde, and Katrin Kirchhoff. 2021.
Prompt-tuning in ASR systems for efficient domain-adaptation. arXiv preprint arXiv:2110.06502 (2021).
[113] Chris Donahue, Julian McAuley, and Miller Puckette. 2018. Adversarial audio synthesis. arXiv preprint arXiv:1802.04208
(2018).
[114] Jeff Donahue, Sander Dieleman, Mikolaj Binkowski, Erich Elsen, and Karen Simonyan. 2020. End-to-end Adversarial
Text-to-Speech. In International Conference on Learning Representations.
[115] Jeff Donahue, Sander Dieleman, Mikołaj Bińkowski, Erich Elsen, and Karen Simonyan. 2020. End-to-end adversarial
text-to-speech. arXiv preprint arXiv:2006.03575 (2020).
[116] Linhao Dong, Shuang Xu, and Bo Xu. 2018. Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model
for Speech Recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
5884–5888. https://doi.org/10.1109/ICASSP.2018.8462506
[117] Peiyan Dong, Siyue Wang, Wei Niu, Chengming Zhang, Sheng Lin, Zhengang Li, Yifan Gong, Bin Ren, Xue Lin, and
Dingwen Tao. 2020. Rtmobile: Beyond real-time mobile acceleration of rnns for speech recognition. In 2020 57th
ACM/IEEE Design Automation Conference (DAC). IEEE, 1–6.
[118] Xuan Dong and Donald S Williamson. 2020. An attention enhanced multi-task model for objective speech assessment
in real-world environments. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP). IEEE, 911–915.
[119] Xuan Dong and Donald S Williamson. 2020. A pyramid recurrent network for predicting crowdsourced speech-quality
ratings of real-world signals. arXiv preprint arXiv:2007.15797 (2020).
[120] Shaked Dovrat, Eliya Nachmani, and Lior Wolf. 2021. Many-speakers single channel speech separation with optimal
permutation training. arXiv preprint arXiv:2104.08955 (2021).
[121] Jennifer Drexler and James Glass. 2019. Explicit alignment of text and speech encodings for attention-based end-
to-end speech recognition. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE,
913–919.
[122] Chenpeng Du and Kai Yu. 2022. Phone-Level Prosody Modelling With GMM-Based MDN for Diverse and Controllable
Speech Synthesis. IEEE/ACM Transactions on Audio, Speech, and Language Processing 30 (2022), 190–201. https:
//doi.org/10.1109/TASLP.2021.3133205
[123] Jun Du, Qing Wang, Tian Gao, Yong Xu, Li-Rong Dai, and Chin-Hui Lee. 2014. Robust speech recognition with speech
enhanced deep neural networks. In Fifteenth annual conference of the international speech communication association.
[124] Amanda Duarte, Shruti Palaskar, Lucas Ventura, Deepti Ghadiyaram, Kenneth DeHaan, Florian Metze, Jordi Torres,
and Xavier Giro-i Nieto. 2021. How2sign: a large-scale multimodal dataset for continuous american sign language. In
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2735–2744.
[125] Jide S Edu, Jose M Such, and Guillermo Suarez-Tangil. 2020. Smart home personal assistants: a security and privacy
review. ACM Computing Surveys (CSUR) 53, 6 (2020), 1–36.
[126] Isaac Elias, Heiga Zen, Jonathan Shen, Yu Zhang, Ye Jia, Ron J Weiss, and Yonghui Wu. 2021. Parallel tacotron:
Non-autoregressive and controllable tts. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP). IEEE, 5709–5713.
[127] Ashraf Elneima and Mikołaj Bińkowski. 2022. Adversarial Text-to-Speech for low-resource languages. In Proceedings
of the The Seventh Arabic Natural Language Processing Workshop (WANLP). 76–84.
[128] Yariv Ephraim. 1992. A Bayesian estimation approach for speech enhancement using hidden Markov models. IEEE
Transactions on Signal Processing 40, 4 (1992), 725–735.
[129] Ariel Ephrat, Tavi Halperin, and Shmuel Peleg. 2017. Improved speech reconstruction from silent video. In Proceedings
of the IEEE International Conference on Computer Vision Workshops. 455–462.
[130] Ariel Ephrat, Inbar Mosseri, Oran Lang, Tali Dekel, Kevin Wilson, Avinatan Hassidim, William T Freeman, and
Michael Rubinstein. 2018. Looking to listen at the cocktail party: A speaker-independent audio-visual model for
speech separation. arXiv preprint arXiv:1804.03619 (2018).
[131] Ariel Ephrat and Shmuel Peleg. 2017. Vid2speech: speech reconstruction from silent video. In 2017 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 5095–5099.
[132] Linus Ericsson, Henry Gouk, Chen Change Loy, and Timothy M Hospedales. 2022. Self-supervised representation
learning: Introduction, advances, and challenges. IEEE Signal Processing Magazine 39, 3 (2022), 42–62.
[133] Sefik Emre Eskimez, Ross K Maddox, Chenliang Xu, and Zhiyao Duan. 2020. End-to-end generation of talking
faces from noisy speech. In ICASSP 2020-2020 IEEE international conference on acoustics, speech and signal processing
(ICASSP). IEEE, 1948–1952.
[134] Sefik Emre Eskimez, You Zhang, and Zhiyao Duan. 2021. Speech driven talking face generation from a single image
and an emotion condition. IEEE Transactions on Multimedia 24 (2021), 3480–3490.
[135] Yue Fan, JW Kang, LT Li, KC Li, HL Chen, ST Cheng, PY Zhang, ZY Zhou, YQ Cai, and Dong Wang. 2020. Cn-celeb:
a challenging chinese speaker recognition dataset. In ICASSP 2020-2020 IEEE International Conference on Acoustics,
[160] Alex Graves and Navdeep Jaitly. 2014. Towards end-to-end speech recognition with recurrent neural networks. In
International conference on machine learning. PMLR, 1764–1772.
[161] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013. Speech recognition with deep recurrent neural
networks. In 2013 IEEE international conference on acoustics, speech and signal processing. Ieee, 6645–6649.
[162] Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong
Zhang, Yonghui Wu, et al. 2020. Conformer: Convolution-augmented transformer for speech recognition. arXiv
preprint arXiv:2005.08100 (2020).
[163] Haohan Guo, Fenglong Xie, Frank K Soong, Xixin Wu, and Helen Meng. 2022. A Multi-Stage Multi-Codebook
VQ-VAE Approach to High-Performance Neural TTS. arXiv preprint arXiv:2209.10887 (2022).
[164] Ikhsanul Habibie, Mohamed Elgharib, Kripasindhu Sarkar, Ahsan Abdullah, Simbarashe Nyatsanga, Michael Neff,
and Christian Theobalt. 2022. A motion matching-based framework for controllable gesture synthesis from speech.
In ACM SIGGRAPH 2022 Conference Proceedings. 1–9.
[165] Mohamed Farouk Abdel Hady and Friedhelm Schwenker. 2013. Semi-supervised learning. Handbook on Neural
Information Processing (2013), 215–239.
[166] Chi Han, Mingxuan Wang, Heng Ji, and Lei Li. 2021. Learning shared semantic space for speech-to-text translation.
arXiv preprint arXiv:2105.03095 (2021).
[167] Kai Han, Yunhe Wang, Hanting Chen, Xinghao Chen, Jianyuan Guo, Zhenhua Liu, Yehui Tang, An Xiao, Chunjing Xu,
Yixing Xu, et al. 2022. A survey on vision transformer. IEEE transactions on pattern analysis and machine intelligence
45, 1 (2022), 87–110.
[168] Seungu Han and Junhyeok Lee. 2022. NU-Wave 2: A general neural audio upsampling model for various sampling
rates. arXiv preprint arXiv:2206.08545 (2022).
[169] Wei Han, Zhengdong Zhang, Yu Zhang, Jiahui Yu, Chung-Cheng Chiu, James Qin, Anmol Gulati, Ruoming Pang, and
Yonghui Wu. 2020. Contextnet: Improving convolutional neural networks for automatic speech recognition with
global context. arXiv preprint arXiv:2005.03191 (2020).
[170] Rafizah Mohd Hanifa, Khalid Isa, and Shamsul Mohamad. 2021. A review on speaker recognition: Technology and
challenges. Computers & Electrical Engineering 90 (2021), 107005.
[171] Peter SK Hansen. 1997. Signal subspace methods for speech enhancement. Ph. D. Dissertation. Citeseer.
[172] Xiang Hao, Xiangdong Su, Radu Horaud, and Xiaofei Li. 2021. Fullsubnet: A full-band and sub-band fusion model for
real-time single-channel speech enhancement. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP). IEEE, 6633–6637.
[173] Naomi Harte and Eoin Gillen. 2015. TCD-TIMIT: An audio-visual corpus of continuous speech. IEEE Transactions on
Multimedia 17, 5 (2015), 603–615.
[174] Andrew O Hatch, Sachin Kajarekar, and Andreas Stolcke. 2006. Within-class covariance normalization for SVM-based
speaker recognition. In Ninth international conference on spoken language processing.
[175] Simon Haykin and Zhe Chen. 2005. The cocktail party problem. Neural computation 17, 9 (2005), 1875–1902.
[176] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In
Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
[177] Yanzhang He, Rohit Prabhavalkar, Kanishka Rao, Wei Li, Anton Bakhtin, and Ian McGraw. 2017. Streaming small-
footprint keyword spotting using sequence-to-sequence models. In 2017 IEEE Automatic Speech Recognition and
Understanding Workshop (ASRU). IEEE, 474–481.
[178] Yanzhang He, Tara N Sainath, Rohit Prabhavalkar, Ian McGraw, Raziel Alvarez, Ding Zhao, David Rybach, Anjuli
Kannan, Yonghui Wu, Ruoming Pang, et al. 2019. Streaming end-to-end speech recognition for mobile devices. In
ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 6381–6385.
[179] Charles T Hemphill, John J Godfrey, and George R Doddington. 1990. The ATIS spoken language systems pilot corpus.
In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990.
[180] Dan Hendrycks and Thomas Dietterich. 2019. Benchmarking Neural Network Robustness to Common Corruptions and
Perturbations. In International Conference on Learning Representations. https://openreview.net/forum?id=HJz6tiCqYm
[181] John R Hershey, Zhuo Chen, Jonathan Le Roux, and Shinji Watanabe. 2016. Deep clustering: Discriminative
embeddings for segmentation and separation. In 2016 IEEE international conference on acoustics, speech and signal
processing (ICASSP). IEEE, 31–35.
[182] Yosuke Higuchi, Brian Yan, Siddhant Arora, Tetsuji Ogawa, Tetsunori Kobayashi, and Shinji Watanabe. 2022. BERT
Meets CTC: New Formulation of End-to-End Speech Recognition with Pre-trained Masked Language Model. arXiv
preprint arXiv:2210.16663 (2022).
[183] Bertrand Higy and Peter Bell. 2018. Few-shot learning with attention-based sequence-to-sequence models. arXiv
preprint arXiv:1811.03519 (2018).
[184] Ivan Himawan, Fernando Villavicencio, Sridha Sridharan, and Clinton Fookes. 2019. Deep domain adaptation for
anti-spoofing in speaker verification systems. Computer Speech & Language 58 (2019), 377–402.
[185] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent
Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. 2012. Deep neural networks for acoustic modeling in speech
recognition: The shared views of four research groups. IEEE Signal processing magazine 29, 6 (2012), 82–97.
[186] Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. Advances in Neural
Information Processing Systems 33 (2020), 6840–6851.
[187] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735–1780.
[188] Jen-Cheng Hou, Syu-Siang Wang, Ying-Hui Lai, Yu Tsao, Hsiu-Wen Chang, and Hsin-Min Wang. 2018. Audio-visual
speech enhancement using multimodal deep convolutional neural networks. IEEE Transactions on Emerging Topics in
Computational Intelligence 2, 2 (2018), 117–128.
[189] Neil Houlsby, , et al. 2019. Parameter-efficient transfer learning for NLP. In International Conference on Machine
Learning. 2790–2799.
[190] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo,
Mona Attariyan, and Sylvain Gelly. 2019. Parameter-Efficient Transfer Learning for NLP. In Proceedings of the 36th
International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 97), Kamalika Chaudhuri
and Ruslan Salakhutdinov (Eds.). PMLR, 2790–2799. https://proceedings.mlr.press/v97/houlsby19a.html
[191] Chin-Cheng Hsu, Hsin-Te Hwang, Yi-Chiao Wu, Yu Tsao, and Hsin-Min Wang. 2017. Voice conversion from unaligned
corpora using variational autoencoding wasserstein generative adversarial networks. arXiv preprint arXiv:1704.00849
(2017).
[192] Jui-Yang Hsu, Yuan-Jui Chen, and Hung-yi Lee. 2020. Meta learning for end-to-end low-resource speech recognition. In
ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 7844–7848.
[193] Wei-Ning Hsu et al. 2021. Hubert: Self-supervised speech representation learning by masked prediction of hidden
units. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021), 3451–3460.
[194] Wei-Ning Hsu, Tal Remez, Bowen Shi, Jacob Donley, and Yossi Adi. 2022. ReVISE: Self-Supervised Speech Resynthesis
with Visual Input for Universal and Generalized Speech Enhancement. arXiv preprint arXiv:2212.11377 (2022).
[195] Wei-Ning Hsu, Yu Zhang, Ron J Weiss, Yu-An Chung, Yuxuan Wang, Yonghui Wu, and James Glass. 2019. Disen-
tangling correlated speaker and noise for speech synthesis via data augmentation and adversarial factorization. In
ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 5901–5905.
[196] Wei-Ning Hsu, Yu Zhang, Ron J Weiss, Heiga Zen, Yonghui Wu, Yuxuan Wang, Yuan Cao, Ye Jia, Zhifeng Chen,
Jonathan Shen, et al. 2018. Hierarchical Generative Modeling for Controllable Speech Synthesis. In International
Conference on Learning Representations.
[197] Yen-Chang Hsu, Ting Hua, Sung-En Chang, Qiang Lou, Yilin Shen, and Hongxia Jin. 2022. Language model
compression with weighted low-rank factorization. ArXiv abs/2207.00112 (2022).
[198] Edward J Hu et al. 2021. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on
Learning Representations.
[199] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen.
2022. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations.
https://openreview.net/forum?id=nZeVKeeFYf9
[200] Hang-Rui Hu, Yan Song, Ying Liu, Li-Rong Dai, Ian McLoughlin, and Lin Liu. 2022. Domain Robust Deep Embedding
Learning for Speaker Recognition. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP). IEEE, 7182–7186.
[201] Jie Hu, Li Shen, Samuel Albanie, Gang Sun, and Enhua Wu. 2017. Squeeze-and-Excitation Networks. https://doi.org/10.48550/ARXIV.1709.01507
[202] Yanxin Hu, Yun Liu, Shubo Lv, Mengtao Xing, Shimin Zhang, Yihui Fu, Jian Wu, Bihong Zhang, and Lei Xie.
2020. DCCRN: Deep complex convolution recurrent network for phase-aware speech enhancement. arXiv preprint
arXiv:2008.00264 (2020).
[203] Yi Hu and Philipos C Loizou. 2007. Evaluation of objective quality measures for speech enhancement. IEEE
Transactions on audio, speech, and language processing 16, 1 (2007), 229–238.
[204] Rongjie Huang, Max WY Lam, Jun Wang, Dan Su, Dong Yu, Yi Ren, and Zhou Zhao. 2022. Fastdiff: A fast conditional
diffusion model for high-quality speech synthesis. arXiv preprint arXiv:2204.09934 (2022).
[205] Rongjie Huang, Yi Ren, Jinglin Liu, Chenye Cui, and Zhou Zhao. 2022. Generspeech: Towards style transfer for
generalizable out-of-domain text-to-speech. Advances in Neural Information Processing Systems 35 (2022), 10970–
10983.
[206] Rongjie Huang, Zhou Zhao, Huadai Liu, Jinglin Liu, Chenye Cui, and Yi Ren. 2022. Prodiff: Progressive fast diffusion
model for high-quality text-to-speech. In Proceedings of the 30th ACM International Conference on Multimedia.
2595–2605.
[207] Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan
Mohammed, Qiang Liu, Kriti Aggarwal, Zewen Chi, Johan Bjorck, Vishrav Chaudhary, Subhojit Som, Xia Song, and
Furu Wei. 2023. Language Is Not All You Need: Aligning Perception with Language Models. ArXiv abs/2302.14045
(2023).
[208] Sung-Feng Huang, Chyi-Jiunn Lin, Da-Rong Liu, Yi-Chen Chen, and Hung-yi Lee. 2022. Meta-TTS: Meta-learning
for few-shot speaker adaptive text-to-speech. IEEE/ACM Transactions on Audio, Speech, and Language Processing 30
(2022), 1558–1571.
[209] Wen-Chin Huang, Tomoki Hayashi, Xinjian Li, Shinji Watanabe, and Tomoki Toda. 2021. On prosody modeling for
ASR+ TTS based voice conversion. In 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).
IEEE, 642–649.
[210] Wen-Chin Huang, Tomoki Hayashi, Yi-Chiao Wu, Hirokazu Kameoka, and Tomoki Toda. 2019. Voice transformer
network: Sequence-to-sequence voice conversion using transformer with text-to-speech pretraining. arXiv preprint
arXiv:1912.06813 (2019).
[211] Zhiying Huang, Hao Li, and Ming Lei. 2020. Devicetts: A small-footprint, fast, stable network for on-device text-to-
speech. arXiv preprint arXiv:2010.15311 (2020).
[212] Yun-Ning Hung, Chih-Wei Wu, Iroro Orife, Aaron Hipple, William Wolcott, and Alexander Lerch. 2022. A large TV
dataset for speech and music activity detection. EURASIP Journal on Audio, Speech, and Music Processing 2022, 1
(2022), 21.
[213] Dongseong Hwang, Ananya Misra, Zhouyuan Huo, Nikhil Siddhartha, Shefali Garg, David Qiu, Khe Chai Sim,
Trevor Strohman, Françoise Beaufays, and Yanzhang He. 2022. Large-scale asr domain adaptation using self- and
semi-supervised learning. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP). IEEE, 6627–6631.
[214] Hirofumi Inaguma, Shun Kiyono, Kevin Duh, Shigeki Karita, Nelson Enrique Yalta Soplin, Tomoki Hayashi, and
Shinji Watanabe. 2020. ESPnet-ST: All-in-one speech translation toolkit. arXiv preprint arXiv:2004.10234 (2020).
[215] Sathish Indurthi, Houjeung Han, Nikhil Kumar Lakumarapu, Beomseok Lee, Insoo Chung, Sangha Kim, and Chanwoo
Kim. 2020. End-end speech-to-text translation with modality agnostic meta-learning. In ICASSP 2020-2020 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 7904–7908.
[216] Umut Isik, Ritwik Giri, Neerad Phansalkar, Jean-Marc Valin, Karim Helwani, and Arvindh Krishnaswamy. 2020.
Poconet: Better speech enhancement with frequency-positional embeddings, semi-supervised conversational data,
and biased loss. arXiv preprint arXiv:2008.04470 (2020).
[217] Keith Ito and Linda Johnson. 2017. The LJ Speech Dataset. https://keithito.com/LJ-Speech-Dataset/.
[218] Myeonghun Jeong, Hyeongju Kim, Sung Jun Cheon, Byoung Jin Choi, and Nam Soo Kim. 2021. Diff-tts: A denoising
diffusion model for text-to-speech. arXiv preprint arXiv:2104.01409 (2021).
[219] Ye Jia, Michelle Tadmor Ramanovich, Tal Remez, and Roi Pomerantz. 2022. Translatotron 2: High-quality direct
speech-to-speech translation with voice preservation. In International Conference on Machine Learning. PMLR,
10120–10134.
[220] Dongwei Jiang, Wubo Li, Miao Cao, Wei Zou, and Xiangang Li. 2020. Speech simclr: Combining contrastive and
reconstruction objective for self-supervised speech representation learning. arXiv preprint arXiv:2010.13991 (2020).
[221] Yunlong Jiao, Adam Gabryś, Georgi Tinchev, Bartosz Putrycz, Daniel Korzekwa, and Viacheslav Klimkov. 2021.
Universal neural vocoding with parallel wavenet. In ICASSP 2021-2021 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP). IEEE, 6044–6048.
[222] Wen Jin, Xin Liu, Michael S Scordilis, and Lu Han. 2009. Speech enhancement using harmonic emphasis and adaptive
comb filtering. IEEE transactions on audio, speech, and language processing 18, 2 (2009), 356–368.
[223] Yong Rae Jo, Young Ki Moon, Won Ik Cho, and Geun Sik Jo. 2021. Self-attentive vad: Context-aware detection of voice
from noise. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
IEEE, 6808–6812.
[224] Anubhav Johri, Ashish Tripathi, et al. 2019. Parkinson disease detection using deep neural networks. In 2019 twelfth
international conference on contemporary computing (IC3). IEEE, 1–4.
[225] Yooncheol Ju, Ilhwan Kim, Hongsun Yang, Ji-Hoon Kim, Byeongyeol Kim, Soumi Maiti, and Shinji Watanabe. 2022.
TriniTTS: Pitch-controllable End-to-end TTS without External Aligner. In Proc. Interspeech. 16–20.
[226] Jee-weon Jung, Hee-Soo Heo, Ju-ho Kim, Hye-jin Shim, and Ha-Jin Yu. 2019. Rawnet: Advanced end-to-end deep
neural network using raw waveforms for text-independent speaker verification. arXiv preprint arXiv:1904.08104
(2019).
[227] Jee-weon Jung, Hee-Soo Heo, Ha-Jin Yu, and Joon Son Chung. 2021. Graph attention networks for speaker verification.
In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 6149–
6153.
[228] Jee-weon Jung, Hee-Soo Heo, Ha-Jin Yu, and Joon Son Chung. 2021. Graph Attention Networks for Speaker
Verification. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
6149–6153. https://doi.org/10.1109/ICASSP39728.2021.9414057
[229] Jacob Kahn, Ann Lee, and Awni Hannun. 2020. Self-training for end-to-end speech recognition. In ICASSP 2020-2020
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 7084–7088.
[230] Samuel Kakuba, Alwin Poulose, and Dong Seog Han. 2022. Deep Learning-Based Speech Emotion Recognition Using
Multi-Level Fusion of Concurrent Features. IEEE Access 10 (2022), 125538–125551.
[231] Taku Kala and Takahiro Shinozaki. 2018. Reinforcement learning of speech recognition system based on policy
gradient and hypothesis selection. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP). IEEE, 5759–5763.
[232] Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg,
Aaron Oord, Sander Dieleman, and Koray Kavukcuoglu. 2018. Efficient neural audio synthesis. In International
Conference on Machine Learning. PMLR, 2410–2419.
[233] Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and Koray Kavukcuoglu. 2016.
Neural machine translation in linear time. arXiv preprint arXiv:1610.10099 (2016).
[234] Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling
sentences. arXiv preprint arXiv:1404.2188 (2014).
[235] Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, and Nobukatsu Hojo. 2019. ACVAE-VC: Non-parallel voice
conversion with auxiliary classifier variational autoencoder. IEEE/ACM Transactions on Audio, Speech, and Language
Processing 27, 9 (2019), 1432–1443.
[236] Naoyuki Kanda, Jian Wu, Yu Wu, Xiong Xiao, Zhong Meng, Xiaofei Wang, Yashesh Gaur, Zhuo Chen, Jinyu Li, and
Takuya Yoshioka. 2022. Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings. arXiv preprint
arXiv:2203.16685 (2022).
[237] Naoyuki Kanda, Xiong Xiao, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, and Takuya Yoshioka. 2022.
Transcribe-to-diarize: Neural speaker diarization for unlimited number of speakers using end-to-end speaker-
attributed asr. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
IEEE, 8082–8086.
[238] Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, and Nobukatsu Hojo. 2019. Cyclegan-vc2: Improved cyclegan-
based non-parallel voice conversion. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP). IEEE, 6820–6824.
[239] Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, and Nobukatsu Hojo. 2020. Cyclegan-vc3: Examining and
improving cyclegan-vcs for mel-spectrogram conversion. arXiv preprint arXiv:2010.11672 (2020).
[240] Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, and Nobukatsu Hojo. 2021. Maskcyclegan-vc: Learning non-
parallel voice conversion with filling in frames. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP). IEEE, 5919–5923.
[241] Takuhiro Kaneko, Kou Tanaka, Hirokazu Kameoka, and Shogo Seki. 2022. iSTFTNet: Fast and lightweight mel-
spectrogram vocoder incorporating inverse short-time Fourier transform. In ICASSP 2022-2022 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 6207–6211.
[242] Jiawen Kang, Ruiqi Liu, Lantian Li, Yunqi Cai, Dong Wang, and Thomas Fang Zheng. 2020. Domain-invariant speaker
vector projection by model-agnostic meta-learning. arXiv preprint arXiv:2005.11900 (2020).
[243] Ioannis Kansizoglou, Loukas Bampis, and Antonios Gasteratos. 2019. An active learning paradigm for online
audio-visual emotion recognition. IEEE Transactions on Affective Computing 13, 2 (2019), 756–768.
[244] Shigeki Karita, Nanxin Chen, Tomoki Hayashi, Takaaki Hori, Hirofumi Inaguma, Ziyan Jiang, Masao Someki, Nelson
Enrique Yalta Soplin, Ryuichi Yamamoto, Xiaofei Wang, et al. 2019. A comparative study on transformer vs rnn in
speech applications. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 449–456.
[245] Shigeki Karita, Nelson Yalta, Shinji Watanabe, Marc Delcroix, Atsunori Ogawa, and Tomohiro Nakatani. 2019.
Improving Transformer-Based End-to-End Speech Recognition with Connectionist Temporal Classification and
Language Model Integration. In INTERSPEECH.
[246] Kazuya Kawakami. 2008. Supervised sequence labelling with recurrent neural networks. Ph. D. Dissertation. Technical
University of Munich.
[247] Kazuya Kawakami, Luyu Wang, Chris Dyer, Phil Blunsom, and Aaron van den Oord. 2020. Learning robust and
multilingual speech representations. arXiv preprint arXiv:2001.11128 (2020).
[248] Tom Kenter, Vincent Wan, Chun-An Chan, Rob Clark, and Jakub Vit. 2019. CHiVE: Varying prosody in speech
synthesis with a linguistically driven dynamic hierarchical conditional variational network. In International Conference
on Machine Learning. PMLR, 3331–3340.
[249] Heeseung Kim, Sungwon Kim, and Sungroh Yoon. 2022. Guided-tts: A diffusion model for text-to-speech via classifier
guidance. In International Conference on Machine Learning. PMLR, 11119–11133.
[250] Jaehyeon Kim, Sungwon Kim, Jungil Kong, and Sungroh Yoon. 2020. Glow-tts: A generative flow for text-to-speech
via monotonic alignment search. Advances in Neural Information Processing Systems 33 (2020), 8067–8077.
[251] Jaehyeon Kim, Jungil Kong, and Juhee Son. 2021. Conditional variational autoencoder with adversarial learning for
end-to-end text-to-speech. In International Conference on Machine Learning. PMLR, 5530–5540.
[252] Juntae Kim and Jeehye Lee. 2021. Generalizing RNN-transducer to out-domain audio via sparse self-attention layers.
arXiv preprint arXiv:2108.10752 (2021).
[253] Jaechang Kim, Yunjoo Lee, Seunghoon Hong, and Jungseul Ok. 2022. Learning continuous representation of audio
for arbitrary scale super resolution. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP). IEEE, 3703–3707.
[254] Ji-Hoon Kim, Sang-Hoon Lee, Ji-Hyun Lee, and Seong-Whan Lee. 2021. Fre-gan: Adversarial frequency-consistent
audio synthesis. arXiv preprint arXiv:2106.02297 (2021).
[255] Sehoon Kim, Amir Gholami, Albert Shaw, Nicholas Lee, Karttikeya Mangalam, Jitendra Malik, Michael W Mahoney,
and Kurt Keutzer. 2022. Squeezeformer: An efficient transformer for automatic speech recognition. arXiv preprint
arXiv:2206.00888 (2022).
[256] Seongbin Kim, Gyuwan Kim, Seongjin Shin, and Sangmin Lee. 2021. Two-Stage Textual Knowledge Distillation
for End-to-End Spoken Language Understanding. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP). 7463–7467. https://doi.org/10.1109/ICASSP39728.2021.9414619
[257] Sungwon Kim, Heeseung Kim, and Sungroh Yoon. 2022. Guided-TTS 2: A Diffusion Model for High-quality Adaptive
Text-to-Speech with Untranscribed Data. arXiv preprint arXiv:2205.15370 (2022).
[258] Sungwon Kim, Sang-gil Lee, Jongyoon Song, Jaehyeon Kim, and Sungroh Yoon. 2018. FloWaveNet: A generative flow
for raw audio. arXiv preprint arXiv:1811.02155 (2018).
[259] Tomi Kinnunen, Evgeny Karpov, and Pasi Franti. 2005. Real-time speaker identification and verification. IEEE
Transactions on Audio, Speech, and Language Processing 14, 1 (2005), 277–288.
[260] Keisuke Kinoshita, Tsubasa Ochiai, Marc Delcroix, and Tomohiro Nakatani. 2020. Improving noise robust automatic
speech recognition with single-channel time-domain enhancement network. In ICASSP 2020-2020 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 7009–7013.
[261] Serkan Kiranyaz, Onur Avci, Osama Abdeljaber, Turker Ince, Moncef Gabbouj, and Daniel J. Inman. 2021. 1D
convolutional neural networks and applications: A survey. Mechanical Systems and Signal Processing 151 (2021),
107398. https://doi.org/10.1016/j.ymssp.2020.107398
[262] Serkan Kiranyaz, Turker Ince, Ridha Hamila, and Moncef Gabbouj. 2015. Convolutional neural networks for patient-
specific ECG classification. In 2015 37th Annual International Conference of the IEEE Engineering in Medicine and
Biology Society (EMBC). IEEE, 2608–2611.
[263] Yuma Koizumi, Kohei Yatabe, Marc Delcroix, Yoshiki Masuyama, and Daiki Takeuchi. 2020. Speech enhancement
using self-adaptation and multi-head self-attention. In ICASSP 2020-2020 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP). IEEE, 181–185.
[264] Yuma Koizumi, Heiga Zen, Kohei Yatabe, Nanxin Chen, and Michiel Bacchiani. 2022. SpecGrad: Diffusion Probabilistic
Model based Neural Vocoder with Adaptive Noise Spectral Shaping. arXiv preprint arXiv:2203.16749 (2022).
[265] Morten Kolbæk, Dong Yu, Zheng-Hua Tan, and Jesper Jensen. 2017. Multitalker speech separation with utterance-level
permutation invariant training of deep recurrent neural networks. IEEE/ACM Transactions on Audio, Speech, and
Language Processing 25, 10 (2017), 1901–1913.
[266] Nithin Rao Koluguri, Taejin Park, and Boris Ginsburg. 2022. TitaNet: Neural Model for speaker representation with
1D Depth-wise separable convolutions and global context. In ICASSP 2022-2022 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP). IEEE, 8102–8106.
[267] John Kominek and Alan W Black. 2004. The CMU Arctic speech databases. In Fifth ISCA workshop on speech synthesis.
[268] Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. 2020. Hifi-gan: Generative adversarial networks for efficient and
high fidelity speech synthesis. Advances in Neural Information Processing Systems 33 (2020), 17022–17033.
[269] Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. 2020. Diffwave: A versatile diffusion model
for audio synthesis. arXiv preprint arXiv:2009.09761 (2020).
[270] Sergey Koval and Sergey Krynov. 2020. Practice of usage of spectral analysis for forensic speaker identification. In
RLA2C 1998-Speaker Recognition and its Commercial and Forensic Applications. 136–140.
[271] Yuichiro Koyama, Tyler Vuong, Stefan Uhlich, and Bhiksha Raj. 2020. Exploring the best loss function for DNN-based
low-latency speech enhancement with temporal convolutional networks. arXiv preprint arXiv:2005.11611 (2020).
[272] Felix Kreuk, Gabriel Synnaeve, Adam Polyak, Uriel Singer, Alexandre D’efossez, Jade Copet, Devi Parikh, Yaniv
Taigman, and Yossi Adi. 2022. AudioGen: Textually Guided Audio Generation. ArXiv abs/2209.15352 (2022).
[273] Samuel Kriman, Stanislav Beliaev, Boris Ginsburg, Jocelyn Huang, Oleksii Kuchaiev, Vitaly Lavrukhin, Ryan Leary,
Jason Li, and Yang Zhang. 2020. Quartznet: Deep automatic speech recognition with 1d time-channel separable
convolutions. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
IEEE, 6124–6128.
[274] Ajinkya Kulkarni, Vincent Colotte, and Denis Jouvet. 2020. Transfer learning of the expressivity using FLOW metric
learning in multispeaker text-to-speech synthesis. In INTERSPEECH 2020.
[275] Kundan Kumar, Rithesh Kumar, Thibault De Boissiere, Lucas Gestin, Wei Zhen Teoh, Jose Sotelo, Alexandre de
Brébisson, Yoshua Bengio, and Aaron C Courville. 2019. Melgan: Generative adversarial networks for conditional
waveform synthesis. Advances in neural information processing systems 32 (2019).
[276] Ohsung Kwon, Inseon Jang, ChungHyun Ahn, and Hong-Goo Kang. 2019. An Effective Style Token Weight Control
Technique for End-to-End Emotional Speech Synthesis. IEEE Signal Processing Letters 26, 9 (2019), 1383–1387.
https://doi.org/10.1109/LSP.2019.2931673
[277] Youngki Kwon, Hee-Soo Heo, Jee-weon Jung, You Jin Kim, Bong-Jin Lee, and Joon Son Chung. 2022. Multi-scale
speaker embedding-based graph attention networks for speaker diarisation. In ICASSP 2022-2022 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 8367–8371.
[278] Youngki Kwon, Jee-weon Jung, Hee-Soo Heo, You Jin Kim, Bong-Jin Lee, and Joon Son Chung. 2021. Adapting
speaker embeddings for speaker diarisation. arXiv preprint arXiv:2104.02879 (2021).
[279] Seong Min Kye, Youngmoon Jung, Hae Beom Lee, Sung Ju Hwang, and Hoirin Kim. 2020. Meta-learning for short
utterance speaker recognition with imbalance length pairs. arXiv preprint arXiv:2004.02863 (2020).
[280] Cheng-I Jeff Lai, Yang Zhang, Alexander H. Liu, Shiyu Chang, Yi-Lun Liao, Yung-Sung Chuang, Kaizhi Qian, Sameer
Khurana, David D. Cox, and James R. Glass. 2021. PARP: Prune, Adjust and Re-Prune for Self-Supervised Speech
Recognition. CoRR abs/2106.05933 (2021). arXiv:2106.05933 https://arxiv.org/abs/2106.05933
[281] Kushal Lakhotia et al. 2021. On generative spoken language modeling from raw audio. Transactions of the Association
for Computational Linguistics 9 (2021).
[282] Egor Lakomkin, Mohammad Ali Zamani, Cornelius Weber, Sven Magg, and Stefan Wermter. 2018. Emorl: continuous
acoustic emotion classification using deep reinforcement learning. In 2018 IEEE International Conference on Robotics
and Automation (ICRA). IEEE, 4445–4450.
[283] Max WY Lam, Jun Wang, Dan Su, and Dong Yu. 2021. Sandglasset: A light multi-granularity self-attentive network
for time-domain speech separation. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP). IEEE, 5759–5763.
[284] Adrian Łańcucki. 2021. Fastpitch: Parallel text-to-speech with pitch prediction. In ICASSP 2021-2021 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 6588–6592.
[285] Federico Landini, Ján Profant, Mireia Diez, and Lukáš Burget. 2022. Bayesian hmm clustering of x-vector sequences
(vbx) in speaker diarization: theory, implementation and analysis on standard tasks. Computer Speech & Language 71
(2022), 101254.
[286] Anthony Larcher, Kong Aik Lee, Bin Ma, and Haizhou Li. 2012. The RSR2015: Database for text-dependent speaker
verification using multiple pass-phrases. In Annual Conference of the International Speech Communication Association
(Interspeech).
[287] Anthony Larcher, Ambuj Mehrish, Marie Tahon, Sylvain Meignier, Jean Carrive, David Doukhan, Olivier Galibert,
and Nicholas Evans. 2021. Speaker embeddings for diarization of broadcast data in the allies challenge. In ICASSP
2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 5799–5803.
[288] Siddique Latif, Junaid Qadir, Adnan Qayyum, Muhammad Usama, and Shahzad Younis. 2020. Speech technology for
healthcare: Opportunities, challenges, and state of the art. IEEE Reviews in Biomedical Engineering 14 (2020), 342–356.
[289] Hung-yi Lee, Abdelrahman Mohamed, Shinji Watanabe, Tara Sainath, Karen Livescu, Shang-Wen Li, Shu-wen Yang,
and Katrin Kirchhoff. 2022. Self-supervised Representation Learning for Speech Processing. In Proceedings of the
2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language
Technologies: Tutorial Abstracts. 8–13.
[290] Junhyeok Lee and Seungu Han. 2021. Nu-wave: A diffusion probabilistic model for neural audio upsampling. arXiv
preprint arXiv:2104.02321 (2021).
[291] Kong Aik Lee, Anthony Larcher, Guangsen Wang, Patrick Kenny, Niko Brümmer, David Van Leeuwen, Hagai
Aronowitz, Marcel Kockmann, Carlos Vaquero, Bin Ma, et al. 2015. The RedDots data collection for speaker
recognition. In Interspeech 2015.
[292] Kong Aik Lee, Qiongqiong Wang, and Takafumi Koshinaka. 2019. The CORAL+ algorithm for unsupervised domain
adaptation of PLDA. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP). IEEE, 5821–5825.
[293] Sang-gil Lee, Heeseung Kim, Chaehun Shin, Xu Tan, Chang Liu, Qi Meng, Tao Qin, Wei Chen, Sungroh Yoon, and
Tie-Yan Liu. 2021. PriorGrad: Improving Conditional Denoising Diffusion Models with Data-Dependent Adaptive
Prior. In International Conference on Learning Representations.
[294] Sang-gil Lee, Sungwon Kim, and Sungroh Yoon. 2020. Nanoflow: Scalable normalizing flows with sublinear parameter
complexity. Advances in Neural Information Processing Systems 33 (2020), 14058–14067.
[295] Sang-Hoon Lee, Seung-Bin Kim, Ji-Hyun Lee, Eunwoo Song, Min-Jae Hwang, and Seong-Whan Lee. 2022. Hier-
Speech: Bridging the Gap between Text and Speech by Hierarchical Variational Inference using Self-supervised
Representations for Speech Synthesis. Advances in Neural Information Processing Systems 35 (2022), 16624–16636.
[296] Yoonhyung Lee, Joongbo Shin, and Kyomin Jung. 2021. Bidirectional variational inference for non-autoregressive
text-to-speech. In International Conference on Learning Representations.
[297] Quentin Lemaire and Andre Holzapfel. 2019. Temporal convolutional networks for speech and music detection in
radio broadcast. In 20th International Society for Music Information Retrieval Conference, ISMIR 2019, 4-8 November
2019. International Society for Music Information Retrieval.
[298] Jean-Marie Lemercier, Julius Richter, Simon Welker, and Timo Gerkmann. 2022. StoRM: A Diffusion-based Stochastic
Regeneration Model for Speech Enhancement and Dereverberation. arXiv preprint arXiv:2212.11851 (2022).
[299] Yichong Leng, Zehua Chen, Junliang Guo, Haohe Liu, Jiawei Chen, Xu Tan, Danilo Mandic, Lei He, Xiang-Yang
Li, Tao Qin, et al. 2022. BinauralGrad: A Two-Stage Conditional Diffusion Probabilistic Model for Binaural Audio
Synthesis. arXiv preprint arXiv:2205.14807 (2022).
[300] David Leroy, Alice Coucke, Thibaut Lavril, Thibault Gisselbrecht, and Joseph Dureau. 2019. Federated learning
for keyword spotting. In ICASSP 2019-2019 IEEE international conference on acoustics, speech and signal processing
(ICASSP). IEEE, 6341–6345.
[301] Alon Levkovitch, Eliya Nachmani, and Lior Wolf. 2022. Zero-Shot Voice Conditioning for Denoising Diffusion TTS
Models. arXiv preprint arXiv:2206.02246 (2022).
[302] Bo Li, Shuo-yiin Chang, Tara N Sainath, Ruoming Pang, Yanzhang He, Trevor Strohman, and Yonghui Wu. 2020.
Towards fast and accurate streaming end-to-end ASR. In ICASSP 2020-2020 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP). IEEE, 6069–6073.
[303] Bo Li, Anmol Gulati, Jiahui Yu, Tara N Sainath, Chung-Cheng Chiu, Arun Narayanan, Shuo-Yiin Chang, Ruoming
Pang, Yanzhang He, James Qin, et al. 2021. A better and faster end-to-end model for streaming asr. In ICASSP
2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 5634–5638.
[304] Bo Li, Tara N Sainath, Arun Narayanan, Joe Caroselli, Michiel Bacchiani, Ananya Misra, Izhak Shafran, Hasim Sak,
Golan Pundak, Kean K Chin, et al. 2017. Acoustic Modeling for Google Home.. In Interspeech. 399–403.
[305] Jinyu Li et al. 2022. Recent advances in end-to-end automatic speech recognition. APSIPA Transactions on Signal and
Information Processing 11, 1 (2022).
[306] Jingyu Li, Wei Liu, and Tan Lee. 2022. EDITnet: A Lightweight Network for Unsupervised Domain Adaptation in
Speaker Verification. arXiv preprint arXiv:2206.07548 (2022).
[307] Jingdong Li, Hui Zhang, Xueliang Zhang, and Changliang Li. 2019. Single channel speech enhancement using
temporal convolutional recurrent neural networks. In 2019 Asia-Pacific Signal and Information Processing Association
Annual Summit and Conference (APSIPA ASC). IEEE, 896–900.
[308] Kai Li, Runxuan Yang, and Xiaolin Hu. 2022. An efficient encoder-decoder architecture with top-down attention for
speech separation. arXiv preprint arXiv:2209.15200 (2022).
[309] Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, and Ming Liu. 2019. Neural speech synthesis with transformer
network. In Proceedings of the AAAI conference on artificial intelligence, Vol. 33. 6706–6713.
[310] Naihan Li, Yanqing Liu, Yu Wu, Shujie Liu, Sheng Zhao, and Ming Liu. 2020. Robutrans: A robust transformer-based
text-to-speech model. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 8228–8235.
[311] Qinglong Li, Fei Gao, Haixin Guan, and Kaichi Ma. 2021. Real-time monaural speech enhancement with short-time
discrete cosine transform. arXiv preprint arXiv:2102.04629 (2021).
[312] Qiujia Li, David Qiu, Yu Zhang, Bo Li, Yanzhang He, Philip C Woodland, Liangliang Cao, and Trevor Strohman. 2021.
Confidence estimation for attention-based sequence-to-sequence models for speech recognition. In ICASSP 2021-2021
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 6388–6392.
[313] Rongjin Li, Weibin Zhang, and Dongpeng Chen. 2022. The coral++ algorithm for unsupervised domain adaptation of
speaker recognition. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP). IEEE, 7172–7176.
[314] Xiang Lisa Li and Percy Liang. 2021. Prefix-Tuning: Optimizing Continuous Prompts for Generation. In Proceedings of
the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on
Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Online, 4582–4597.
https://doi.org/10.18653/v1/2021.acl-long.353
[315] Yingting Li, Ambuj Mehrish, Shuai Zhao, Rishabh Bhardwaj, Amir Zadeh, Navonil Majumder, Rada Mihalcea, and
Soujanya Poria. 2023. Evaluating Parameter-Efficient Transfer Learning Approaches on SURE Benchmark for Speech
Understanding. arXiv preprint arXiv:2303.03267 (2023).
[316] Yinghao Aaron Li, Cong Han, and Nima Mesgarani. 2022. StyleTTS: A Style-Based Generative Model for Natural and
Diverse Text-to-Speech Synthesis. arXiv preprint arXiv:2205.15439 (2022).
[317] Dan Lim, Won Jang, Heayoung Park, Bongwan Kim, Jaesam Yoon, et al. 2020. Jdi-t: Jointly trained duration informed
transformer for text-to-speech without explicit alignment. arXiv preprint arXiv:2005.07799 (2020).
[318] Dan Lim, Sunghee Jung, and Eesung Kim. 2022. JETS: Jointly training FastSpeech2 and HiFi-GAN for end to end text
to speech. arXiv preprint arXiv:2203.16852 (2022).
[319] Jae Lim and Alan Oppenheim. 1978. All-pole modeling of degraded speech. IEEE Transactions on Acoustics, Speech,
and Signal Processing 26, 3 (1978), 197–210.
[320] Teck Yian Lim, Raymond A. Yeh, Yijia Xu, Minh N. Do, and Mark Hasegawa-Johnson. 2018. Time-Frequency Networks
for Audio Super-Resolution. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
646–650. https://doi.org/10.1109/ICASSP.2018.8462049
[321] Ju Lin, Sufeng Niu, Zice Wei, Xiang Lan, Adriaan J Wijngaarden, Melissa C Smith, and Kuang-Ching Wang. 2019.
Speech enhancement using forked generative adversarial networks with spectral subtraction. Proceedings of Interspeech
2019 (2019).
[322] Ju Lin, Adriaan J. de Lind van Wijngaarden, Kuang-Ching Wang, and Melissa C. Smith. 2021. Speech Enhancement
Using Multi-Stage Self-Attentive Temporal Convolutional Networks. IEEE/ACM Transactions on Audio, Speech, and
Language Processing 29 (2021), 3440–3450. https://doi.org/10.1109/TASLP.2021.3125143
[323] Jheng-hao Lin, Yist Y Lin, Chung-Ming Chien, and Hung-yi Lee. 2021. S2vc: a framework for any-to-any voice
conversion with self-supervised pretrained representations. arXiv preprint arXiv:2104.02901 (2021).
[324] Qingjian Lin, Yu Hou, and Ming Li. 2020. Self-Attentive Similarity Measurement Strategies in Speaker Diarization..
In INTERSPEECH. 284–288.
[325] Wei-Wei Lin and Man-Wai Mak. 2020. Wav2Spk: A Simple DNN Architecture for Learning Speaker Embeddings
from Waveforms.. In INTERSPEECH. 3211–3215.
[326] Shaoshi Ling and Yuzong Liu. 2020. Decoar 2.0: Deep contextualized acoustic representations with vector quantization.
arXiv preprint arXiv:2012.06659 (2020).
[327] Alexander H Liu, Wei-Ning Hsu, Michael Auli, and Alexei Baevski. 2023. Towards end-to-end unsupervised speech
recognition. In 2022 IEEE Spoken Language Technology Workshop (SLT). IEEE, 221–228.
[328] Alexander H Liu, Cheng-I Jeff Lai, Wei-Ning Hsu, Michael Auli, Alexei Baevskiv, and James Glass. 2022. Simple and
effective unsupervised speech synthesis. arXiv preprint arXiv:2204.02524 (2022).
[329] Andy T Liu, Shang-Wen Li, and Hung-yi Lee. 2021. Tera: Self-supervised learning of transformer encoder representa-
tion for speech. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021), 2351–2366.
[330] Andy T Liu, Shu-wen Yang, Po-Han Chi, Po-chun Hsu, and Hung-yi Lee. 2020. Mockingjay: Unsupervised speech rep-
resentation learning with deep bidirectional transformer encoders. In ICASSP 2020-2020 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 6419–6423.
[331] Chunxi Liu, Frank Zhang, Duc Le, Suyoun Kim, Yatharth Saraf, and Geoffrey Zweig. 2021. Improving RNN transducer
based ASR with auxiliary tasks. In 2021 IEEE Spoken Language Technology Workshop (SLT). IEEE, 172–179.
[332] Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo P. Mandic, Wenwu Wang, and Mark D. Plumbley.
2023. AudioLDM: Text-to-Audio Generation with Latent Diffusion Models. ArXiv abs/2301.12503 (2023).
[333] Haohe Liu, Woosung Choi, Xubo Liu, Qiuqiang Kong, Qiao Tian, and DeLiang Wang. 2022. Neural vocoder is all you
need for speech super-resolution. arXiv preprint arXiv:2203.14941 (2022).
[334] Jinglin Liu, Chengxi Li, Yi Ren, Feiyang Chen, and Zhou Zhao. 2022. Diffsinger: Singing voice synthesis via shallow
diffusion mechanism. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36. 11020–11028.
[335] Jingjing Liu, Panupong Pasupat, Scott Cyphers, and Jim Glass. 2013. Asgard: A portable architecture for multilingual
dialogue systems. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 8386–8390.
[336] Mengzhuo Liu and Yangjie Wei. 2022. An Improvement to Conformer-Based Model for High-Accuracy Speech
Feature Extraction and Learning. Entropy 24, 7 (2022), 866.
[337] Rui Liu, Berrak Sisman, Guanglai Gao, and Haizhou Li. 2021. Expressive TTS Training With Frame and Style
Reconstruction Loss. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021), 1806–1818.
https://doi.org/10.1109/TASLP.2021.3076369
[338] Rui Liu, Berrak Sisman, and Haizhou Li. 2021. Graphspeech: Syntax-aware graph attention network for neural speech
synthesis. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
IEEE, 6059–6063.
[339] Songxiang Liu, Yuewen Cao, Disong Wang, Xixin Wu, Xunying Liu, and Helen Meng. 2021. Any-to-many voice
conversion with location-relative sequence-to-sequence modeling. IEEE/ACM Transactions on Audio, Speech, and
Language Processing 29 (2021), 1717–1728.
[340] Songxiang Liu, Dan Su, and Dong Yu. 2022. Diffgan-tts: High-fidelity and efficient text-to-speech with denoising
diffusion gans. arXiv preprint arXiv:2201.11972 (2022).
[341] Xialei Liu, Joost Van De Weijer, and Andrew D Bagdanov. 2019. Exploiting unlabeled data in cnns by self-supervised
learning to rank. IEEE transactions on pattern analysis and machine intelligence 41, 8 (2019), 1862–1878.
[342] Yanqing Liu, Zhihang Xu, Gang Wang, Kuan Chen, Bohan Li, Xu Tan, Jinzhu Li, Lei He, and Sheng Zhao. 2021.
Delightfultts: The microsoft speech synthesis system for blizzard challenge 2021. arXiv preprint arXiv:2110.12612
(2021).
[343] Zhengxi Liu, Qiao Tian, Chenxu Hu, Xudong Liu, Menglin Wu, Yuping Wang, Hang Zhao, and Yuxuan Wang. 2022.
Controllable and Lossless Non-Autoregressive End-to-End Text-to-Speech. arXiv preprint arXiv:2207.06088 (2022).
[344] Jaime Lorenzo-Trueba, Thomas Drugman, Javier Latorre, Thomas Merritt, Bartosz Putrycz, Roberto Barra-Chicote,
Alexis Moinet, and Vatsal Aggarwal. 2018. Towards achieving robust universal neural vocoding. arXiv preprint
arXiv:1811.06292 (2018).
[345] Xugang Lu, Sheng Li, and Masakiyo Fujimoto. 2020. Automatic speech recognition. Speech-to-Speech Translation
(2020), 21–38.
[346] Xugang Lu, Yu Tsao, Shigeki Matsuda, and Chiori Hori. 2013. Speech enhancement based on deep denoising
autoencoder.. In Interspeech, Vol. 2013. 436–440.
[347] Yen-Ju Lu, Yu Tsao, and Shinji Watanabe. 2021. A Study on Speech Enhancement Based on Diffusion Probabilistic
Model. In 2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA
ASC). 659–666.
[348] Yen-Ju Lu, Zhong-Qiu Wang, Shinji Watanabe, Alexander Richard, Cheng Yu, and Yu Tsao. 2022. Conditional Diffusion
Probabilistic Model for Speech Enhancement. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP). 7402–7406. https://doi.org/10.1109/ICASSP43922.2022.9746901
[349] Yen-Ju Lu, Zhong-Qiu Wang, Shinji Watanabe, Alexander Richard, Cheng Yu, and Yu Tsao. 2022. Conditional diffusion
probabilistic model for speech enhancement. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP). IEEE, 7402–7406.
[350] Loren Lugosch, Mirco Ravanelli, Patrick Ignoto, Vikrant Singh Tomar, and Yoshua Bengio. 2019. Speech model
pre-training for end-to-end spoken language understanding. arXiv preprint arXiv:1904.03670 (2019).
[351] Yi Luo, Zhuo Chen, and Takuya Yoshioka. 2020. Dual-path rnn: efficient long sequence modeling for time-domain
single-channel speech separation. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP). IEEE, 46–50.
[352] Yi Luo and Nima Mesgarani. 2018. Real-time single-channel dereverberation and separation with time-domain audio
separation network.. In Interspeech. 342–346.
[353] Yi Luo and Nima Mesgarani. 2019. Conv-tasnet: Surpassing ideal time–frequency magnitude masking for speech
separation. IEEE/ACM transactions on audio, speech, and language processing 27, 8 (2019), 1256–1266.
[354] Manh Luong and Viet Anh Tran. 2021. FlowVocoder: A small Footprint Neural Vocoder based Normalizing flow for
Speech Synthesis. arXiv preprint arXiv:2109.13675 (2021).
[355] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural
machine translation. arXiv preprint arXiv:1508.04025 (2015).
[356] Shahar Lutati, Eliya Nachmani, and Lior Wolf. 2022. Sepit: Approaching a single channel speech separation bound.
arXiv preprint arXiv:2205.11801 (2022).
[357] Shahar Lutati, Eliya Nachmani, and Lior Wolf. 2023. Separate And Diffuse: Using a Pretrained Diffusion Model for
Improving Source Separation. arXiv preprint arXiv:2301.10752 (2023).
[358] Florian Lux and Ngoc Thang Vu. 2022. Language-Agnostic Meta-Learning for Low-Resource Text-to-Speech with
Articulatory Features. arXiv preprint arXiv:2203.03191 (2022).
[359] Pingchuan Ma, Rodrigo Mira, Stavros Petridis, Björn W Schuller, and Maja Pantic. 2021. Lira: Learning visual speech
representations from audio through self-supervision. arXiv preprint arXiv:2106.09171 (2021).
[360] Shuang Ma, Daniel Mcduff, and Yale Song. 2019. Neural TTS stylization with adversarial and collaborative games. In
International Conference on Learning Representations.
[361] Duncan Macho, Laurent Mauuary, Bernhard Noé, Yan Ming Cheng, Doug Ealey, Denis Jouvet, Holly Kelleher, David
Pearce, and Fabien Saadoun. 2002. Evaluation of a noise-robust DSR front-end on Aurora databases. In Seventh
International Conference on Spoken Language Processing.
[362] Gallil Maimon and Yossi Adi. 2022. Speaking Style Conversion With Discrete Self-Supervised Units. arXiv preprint
arXiv:2212.09730 (2022).
[363] Soumi Maiti and Michael I Mandel. 2020. Speaker independence of neural vocoders and their effect on parametric
resynthesis speech enhancement. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP). IEEE, 206–210.
[364] Somshubra Majumdar, Shantanu Acharya, Vitaly Lavrukhin, and Boris Ginsburg. 2023. Damage Control During
Domain Adaptation for Transducer Based Automatic Speech Recognition. In 2022 IEEE Spoken Language Technology
Workshop (SLT). IEEE, 130–135.
[365] Somshubra Majumdar, Jagadeesh Balam, Oleksii Hrinchuk, Vitaly Lavrukhin, Vahid Noroozi, and Boris Ginsburg.
2021. Citrinet: Closing the gap between non-autoregressive and autoregressive end-to-end models for automatic
speech recognition. arXiv preprint arXiv:2104.01721 (2021).
[387] Eliya Nachmani, Yossi Adi, and Lior Wolf. 2020. Voice separation with an unknown number of multiple speakers. In
International Conference on Machine Learning. PMLR, 7164–7175.
[388] Tomohiro Nakatani. 2019. Improving transformer-based end-to-end speech recognition with connectionist temporal
classification and language model integration. In Proc. Interspeech, Vol. 2019.
[389] Yoshihiko Nankaku, Kenta Sumiya, Takenori Yoshimura, Shinji Takaki, Kei Hashimoto, Keiichiro Oura, and Keiichi
Tokuda. 2021. Neural sequence-to-sequence speech synthesis using a hidden semi-Markov model based structured
attention mechanism. arXiv preprint arXiv:2108.13985 (2021).
[390] Ali Bou Nassif, Ismail Shahin, Imtinan Attili, Mohammad Azzeh, and Khaled Shaalan. 2019. Speech recognition using
deep neural networks: A systematic review. IEEE access 7 (2019), 19143–19165.
[391] Huu Binh Nguyen, Duong Van Hai, Tien Dat Bui, Hoang Ngoc Chau, and Quoc Cuong Nguyen. 2022. Multi-Channel
Speech Enhancement using a Minimum Variance Distortionless Response Beamformer based on Graph Convolutional
Network. International Journal of Advanced Computer Science and Applications 13, 10 (2022).
[392] Viet-Anh Nguyen, Anh HT Nguyen, and Andy WH Khong. 2022. Tunet: A block-online bandwidth extension model
based on transformers and self-supervised pretraining. In ICASSP 2022-2022 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP). IEEE, 161–165.
[393] Viet-Nhat Nguyen, Mostafa Sadeghi, Elisa Ricci, and Xavier Alameda-Pineda. 2021. Deep variational generative
models for audio-visual speech separation. In 2021 IEEE 31st International Workshop on Machine Learning for Signal
Processing (MLSP). IEEE, 1–6.
[394] Xuan-Phi Nguyen, Sravya Popuri, Changhan Wang, Yun Tang, Ilia Kulikov, and Hongyu Gong. 2022. Improving
Speech-to-Speech Translation Through Unlabeled Text. arXiv preprint arXiv:2210.14514 (2022).
[395] Phani Sankar Nidadavolu, Jesús Villalba, and Najim Dehak. 2019. Cycle-gans for domain adaptation of acoustic
features for speaker recognition. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP). IEEE, 6206–6210.
[396] Yishuang Ning, Sheng He, Zhiyong Wu, Chunxiao Xing, and Liang-Jie Zhang. 2019. A review of deep learning based
speech synthesis. Applied Sciences 9, 19 (2019), 4050.
[397] Peiqing Niu, Zhongfu Chen, Meina Song, et al. 2019. A novel bi-directional interrelated model for joint intent
detection and slot filling. arXiv preprint arXiv:1907.00390 (2019).
[398] Takuma Okamoto, Tomoki Toda, Yoshinori Shiga, and Hisashi Kawai. 2019. Real-Time Neural Text-to-Speech with
Sequence-to-Sequence Acoustic Model and WaveGlow or Single Gaussian WaveRNN Vocoders.. In INTERSPEECH.
1308–1312.
[399] Takuma Okamoto, Tomoki Toda, Yoshinori Shiga, and Hisashi Kawai. 2019. Tacotron-Based Acoustic Model Using
Phoneme Alignment for Practical Neural Text-to-Speech Systems. In 2019 IEEE Automatic Speech Recognition and
Understanding Workshop (ASRU). 214–221. https://doi.org/10.1109/ASRU46091.2019.9003956
[400] Takuma Okamoto, Tomoki Toda, Yoshinori Shiga, and Hisashi Kawai. 2020. Transformer-Based Text-to-Speech
with Weighted Forced Attention. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP). 6729–6733. https://doi.org/10.1109/ICASSP40776.2020.9053915
[401] Aaron Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George Driessche,
Edward Lockhart, Luis Cobo, Florian Stimberg, et al. 2018. Parallel wavenet: Fast high-fidelity speech synthesis. In
International conference on machine learning. PMLR, 3918–3926.
[402] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbren-
ner, Andrew Senior, and Koray Kavukcuoglu. 2016. Wavenet: A generative model for raw audio. arXiv preprint
arXiv:1609.03499 (2016).
[403] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding.
arXiv preprint arXiv:1807.03748 (2018).
[404] Jasper Ooster and Bernd T Meyer. 2019. Improving deep models of speech quality prediction through voice activity
detection and entropy-based measures. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP). IEEE, 636–640.
[405] OpenAI. 2023. GPT-4 Technical Report. ArXiv abs/2303.08774 (2023).
[406] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini
Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke E. Miller, Maddie Simens,
Amanda Askell, Peter Welinder, Paul Francis Christiano, Jan Leike, and Ryan J. Lowe. 2022. Training language models
to follow instructions with human feedback. ArXiv abs/2203.02155 (2022).
[407] Kuldip Paliwal, Kamil Wójcicki, and Benjamin Shannon. 2011. The importance of phase in speech enhancement.
speech communication 53, 4 (2011), 465–494.
[408] Giridhar Pamisetty and K Sri Rama Murty. 2023. Prosody-TTS: An end-to-end speech synthesis system with prosody
control. Circuits, Systems, and Signal Processing 42, 1 (2023), 361–384.
[409] Jing Pan, Tao Lei, Kwangyoun Kim, Kyu J. Han, and Shinji Watanabe. 2022. SRU++: Pioneering Fast Recurrence with
Attention for Speech Recognition. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP). 7872–7876. https://doi.org/10.1109/ICASSP43922.2022.9746187
[410] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. 2015. Librispeech: an asr corpus based on
public domain audio books. In 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP).
IEEE, 5206–5210.
[411] Ashutosh Pandey and DeLiang Wang. 2019. TCNN: Temporal convolutional neural network for real-time speech
enhancement in the time domain. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP). IEEE, 6875–6879.
[412] Ilias Papastratis. 2021. Speech Recognition: a review of the different deep learning approaches. AI Summer (2021).
[413] Daniel S Park, Yu Zhang, Ye Jia, Wei Han, Chung-Cheng Chiu, Bo Li, Yonghui Wu, and Quoc V Le. 2020. Improved
noisy student training for automatic speech recognition. arXiv preprint arXiv:2005.09629 (2020).
[414] Sangjun Park, Kihyun Choo, Joohyung Lee, Anton V Porov, Konstantin Osipov, and June Sig Sung. 2022. Bunched
LPCNet2: Efficient Neural Vocoders Covering Devices from Cloud to Edge. arXiv preprint arXiv:2203.14416 (2022).
[415] Tae Jin Park, Kyu J Han, Manoj Kumar, and Shrikanth Narayanan. 2019. Auto-tuning spectral clustering for speaker
diarization using normalized maximum eigengap. IEEE Signal Processing Letters 27 (2019), 381–385.
[416] Santiago Pascual, Mirco Ravanelli, Joan Serra, Antonio Bonafonte, and Yoshua Bengio. 2019. Learning problem-
agnostic speech representations from multiple self-supervised tasks. arXiv preprint arXiv:1904.03416 (2019).
[417] Vishal Passricha and Rajesh Kumar Aggarwal. 2019. A hybrid of deep CNN and bidirectional LSTM for automatic
speech recognition. Journal of Intelligent Systems 29, 1 (2019), 1261–1274.
[418] Dipjyoti Paul, Sankar Mukherjee, Yannis Pantazis, and Yannis Stylianou. 2021. A Universal Multi-Speaker Multi-Style
Text-to-Speech via Disentangled Representation Learning Based on Rényi Divergence Minimization.. In Interspeech.
3625–3629.
[419] Dipjyoti Paul, Yannis Pantazis, and Yannis Stylianou. 2020. Speaker conditional WaveRNN: Towards universal neural
vocoder for unseen speaker and recording conditions. arXiv preprint arXiv:2008.05289 (2020).
[420] Blanca Pena and Luofeng Huang. 2021. Wave-GAN: a deep learning approach for the prediction of nonlinear regular
wave loads and run-up on a fixed cylinder. Coastal Engineering 167 (2021), 103902.
[421] Kainan Peng, Wei Ping, Zhao Song, and Kexin Zhao. 2020. Non-autoregressive neural text-to-speech. In International
conference on machine learning. PMLR, 7586–7598.
[422] Zilun Peng, Akshay Budhkar, Ilana Tuil, Jason Levy, Parinaz Sobhani, Raphael Cohen, and Jumana Nassour. 2021.
Shrinking Bigfoot: Reducing wav2vec 2.0 footprint. In Proceedings of the Second Workshop on Simple and Efficient
Natural Language Processing. Association for Computational Linguistics, Virtual, 134–141. https://doi.org/10.18653/v1/2021.sustainlp-1.14
[423] Jonas Pfeiffer, Aishwarya Kamath, Andreas Rücklé, Kyunghyun Cho, and Iryna Gurevych. 2020. AdapterFusion:
Non-destructive task composition for transfer learning. arXiv preprint arXiv:2005.00247 (2020).
[424] Minh Pham, Zeqian Li, and Jacob Whitehill. 2020. Toward better speaker embeddings: Automated collection of
speech samples from unknown distinct speakers. In ICASSP 2020-2020 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP). IEEE, 7089–7093.
[425] Wei Ping, Kainan Peng, and Jitong Chen. 2018. Clarinet: Parallel wave generation in end-to-end text-to-speech. arXiv
preprint arXiv:1807.07281 (2018).
[426] Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan O Arik, Ajay Kannan, Sharan Narang, Jonathan Raiman, and
John Miller. 2017. Deep voice 3: Scaling text-to-speech with convolutional sequence learning. arXiv preprint
arXiv:1710.07654 (2017).
[427] Wei Ping, Kainan Peng, Kexin Zhao, and Zhao Song. 2020. Waveflow: A compact flow-based model for raw audio. In
International Conference on Machine Learning. PMLR, 7706–7716.
[428] Juan Pino, Qiantong Xu, Xutai Ma, Mohammad Javad Dousti, and Yun Tang. 2020. Self-Training for End-to-End
Speech Translation. Proc. Interspeech 2020 (2020), 1476–1480.
[429] Adam Polyak, Yossi Adi, Jade Copet, Eugene Kharitonov, Kushal Lakhotia, Wei-Ning Hsu, Abdelrahman Mohamed,
and Emmanuel Dupoux. 2021. Speech resynthesis from discrete disentangled self-supervised representations. arXiv
preprint arXiv:2104.00355 (2021).
[430] Adam Polyak, Lior Wolf, and Yaniv Taigman. 2019. TTS skins: Speaker conversion via ASR. arXiv preprint
arXiv:1904.08983 (2019).
[431] Vadim Popov, Ivan Vovk, Vladimir Gogoryan, Tasnima Sadekova, and Mikhail Kudinov. 2021. Grad-tts: A diffusion
probabilistic model for text-to-speech. In International Conference on Machine Learning. PMLR, 8599–8608.
[432] Vadim Popov, Ivan Vovk, Vladimir Gogoryan, Tasnima Sadekova, Mikhail Kudinov, and Jiansheng Wei. 2021. Diffusion-
based voice conversion with fast maximum likelihood sampling scheme. arXiv preprint arXiv:2109.13821 (2021).
[433] Daniel Povey, Vijayaditya Peddinti, Daniel Galvez, Pegah Ghahremani, Vimal Manohar, Xingyu Na, Yiming Wang,
and Sanjeev Khudanpur. 2016. Purely sequence-trained neural networks for ASR based on lattice-free MMI.. In
Interspeech. 2751–2755.
[434] Rohit Prabhavalkar, Kanishka Rao, Tara N Sainath, Bo Li, Leif Johnson, and Navdeep Jaitly. 2017. A Comparison of
sequence-to-sequence models for speech recognition.. In Interspeech. 939–943.
[435] Ryan Prenger, Rafael Valle, and Bryan Catanzaro. 2019. Waveglow: A flow-based generative network for speech
synthesis. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
IEEE, 3617–3621.
[436] Kaizhi Qian, Yang Zhang, Shiyu Chang, Mark Hasegawa-Johnson, and David Cox. 2020. Unsupervised speech
decomposition via triple information bottleneck. In International Conference on Machine Learning. PMLR, 7836–7846.
[437] Kaizhi Qian, Yang Zhang, Heting Gao, Junrui Ni, Cheng-I Lai, David Cox, Mark Hasegawa-Johnson, and Shiyu Chang.
2022. Contentvec: An improved self-supervised speech representation by disentangling speakers. In International
Conference on Machine Learning. PMLR, 18003–18017.
[438] Xiaoyi Qin, Hui Bu, and Ming Li. 2020. Hi-mia: A far-field text-dependent speaker verification database and the
baselines. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE,
7609–7613.
[439] Xiaoyi Qin, Ming Li, Hui Bu, Rohan Kumar Das, Wei Rao, Shrikanth Narayanan, and Haizhou Li. 2020. The ffsvc
2020 evaluation plan. arXiv preprint arXiv:2002.00387 (2020).
[440] Zhibin Qiu, Mengfan Fu, Yinfeng Yu, LiLi Yin, Fuchun Sun, and Hao Huang. 2022. SRTNet: Time Domain Speech
Enhancement Via Stochastic Refinement. arXiv preprint arXiv:2210.16805 (2022).
[441] Lawrence Rabiner, M Cheng, A Rosenberg, and C McGonegal. 1976. A comparative performance study of several
pitch detection algorithms. IEEE Transactions on Acoustics, Speech, and Signal Processing 24, 5 (1976), 399–418.
[442] Lawrence R Rabiner. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proc.
IEEE 77, 2 (1989), 257–286.
[443] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2022. Robust speech
recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356 (2022).
[444] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. 2018. Improving language understanding by
generative pre-training. (2018).
[445] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are
unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9.
[446] Kacper Radzikowski, Robert Nowak, Le Wang, and Osamu Yoshie. 2019. Dual supervised learning for non-native
speech recognition. EURASIP Journal on Audio, Speech, and Music Processing 2019 (2019), 1–10.
[447] Colin Raffel, Minh-Thang Luong, Peter J Liu, Ron J Weiss, and Douglas Eck. 2017. Online and linear-time attention
by enforcing monotonic alignments. In International conference on machine learning. PMLR, 2837–2846.
[448] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and
Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of
Machine Learning Research 21, 1 (2020), 5485–5551.
[449] Mehrdad Rafiepour and Javad Salimi Sartakhti. 2023. CTRAN: CNN-Transformer-based Network for Natural Language
Understanding. arXiv preprint arXiv:2303.10606 (2023).
[450] Tuomo Raitio, Ramya Rasipuram, and Dan Castellani. 2020. Controllable neural text-to-speech synthesis using
intuitive prosodic features. arXiv preprint arXiv:2009.06775 (2020).
[451] Thejan Rajapakshe, Siddique Latif, Rajib Rana, Sara Khalifa, and Björn W Schuller. 2020. Deep reinforcement learning
with pre-training for time-efficient training of automatic speech recognition. arXiv preprint arXiv:2005.11172 (2020).
[452] Thejan Rajapakshe, Rajib Rana, Sara Khalifa, Jiajun Liu, and Bjorn Schuller. 2022. A novel policy for pre-trained deep
reinforcement learning for speech emotion recognition. In Australasian Computer Science Week 2022. 96–105.
[453] Nathanaël Carraz Rakotonirina. 2021. Self-attention for audio super-resolution. In 2021 IEEE 31st International
Workshop on Machine Learning for Signal Processing (MLSP). IEEE, 1–6.
[454] Mirco Ravanelli and Yoshua Bengio. 2018. Speaker recognition from raw waveform with sincnet. In 2018 IEEE Spoken
Language Technology Workshop (SLT). IEEE, 1021–1028.
[455] Mirco Ravanelli, Titouan Parcollet, and Yoshua Bengio. 2019. The pytorch-kaldi speech recognition toolkit. In ICASSP
2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 6465–6469.
[456] Mirco Ravanelli, Jianyuan Zhong, Santiago Pascual, Pawel Swietojanski, Joao Monteiro, Jan Trmal, and Yoshua Bengio.
2020. Multi-task self-supervised learning for robust speech recognition. In ICASSP 2020-2020 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 6989–6993.
[457] Chandan KA Reddy, Vishak Gopal, Ross Cutler, Ebrahim Beyrami, Roger Cheng, Harishchandra Dubey, Sergiy
Matusevych, Robert Aichner, Ashkan Aazami, Sebastian Braun, et al. 2020. The interspeech 2020 deep noise
suppression challenge: Datasets, subjective testing framework, and challenge results. arXiv preprint arXiv:2005.13981
(2020).
[458] Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. 2020. Fastspeech 2: Fast and
high-quality end-to-end text to speech. arXiv preprint arXiv:2006.04558 (2020).
[459] Yi Ren, Jinglin Liu, and Zhou Zhao. 2021. Portaspeech: Portable and high-quality generative text-to-speech. Advances
in Neural Information Processing Systems 34 (2021), 13963–13974.
[460] Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. 2019. Fastspeech: Fast, robust and
controllable text to speech. Advances in neural information processing systems 32 (2019).
[461] Douglas A Reynolds. 2003. Channel robust speaker verification via feature mapping. In 2003 IEEE International
Conference on Acoustics, Speech, and Signal Processing (ICASSP ’03), Vol. 2. IEEE, II–53.
[462] Daniel Rho, Jinhyeok Park, and Jong Hwan Ko. 2022. Nas-vad: Neural architecture search for voice activity detection.
arXiv preprint arXiv:2201.09032 (2022).
[463] Colleen Richey, Maria A Barrios, Zeb Armstrong, Chris Bartels, Horacio Franco, Martin Graciarena, Aaron Lawson,
Mahesh Kumar Nandwana, Allen Stauffer, Julien van Hout, et al. 2018. Voices obscured in complex environmental
settings (voices) corpus. arXiv preprint arXiv:1804.05053 (2018).
[464] Julius Richter, Guillaume Carbajal, and Timo Gerkmann. 2020. Speech Enhancement with Stochastic Temporal
Convolutional Networks.. In Interspeech. 4516–4520.
[465] Morgane Riviere, Armand Joulin, Pierre-Emmanuel Mazaré, and Emmanuel Dupoux. 2020. Unsupervised pretraining
transfers well across languages. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP). IEEE, 7414–7418.
[466] Antony W Rix, John G Beerends, Michael P Hollier, and Andries P Hekstra. 2001. Perceptual evaluation of speech
quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs. In 2001 IEEE
international conference on acoustics, speech, and signal processing. Proceedings (Cat. No. 01CH37221), Vol. 2. IEEE,
749–752.
[467] Amir Mohammad Rostami, Ali Karimi, and Mohammad Ali Akhaee. 2022. Keyword spotting in continuous speech
using convolutional neural network. Speech Communication 142 (2022), 15–21.
[468] Anthony Rousseau, Paul Deléglise, and Yannick Esteve. 2012. TED-LIUM: an Automatic Speech Recognition dedicated
corpus.. In LREC. 125–129.
[469] Sidheswar Routray and Qirong Mao. 2022. Phase sensitive masking-based single channel speech enhancement using
conditional generative adversarial network. Computer Speech & Language 71 (2022), 101270.
[470] Mickael Rouvier, Richard Dufour, and Pierre-Michel Bousquet. 2021. Review of different robust x-vector extractors
for speaker verification. In 2020 28th European Signal Processing Conference (EUSIPCO). 1–5. https://doi.org/10.23919/
Eusipco47968.2020.9287426
[471] Neville Ryant, Kenneth Church, Christopher Cieri, Alejandrina Cristia, Jun Du, Sriram Ganapathy, and Mark Liberman.
2018. First DIHARD challenge evaluation plan. Tech. Rep. (2018).
[472] Neville Ryant, Kenneth Church, Christopher Cieri, Alejandrina Cristia, Jun Du, Sriram Ganapathy, and Mark Liberman.
2019. The second dihard diarization challenge: Dataset, task, and baselines. arXiv preprint arXiv:1906.07839 (2019).
[473] Oleg Rybakov, Natasha Kononenko, Niranjan Subrahmanya, Mirkó Visontai, and Stella Laurenzo. 2020. Streaming
keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020).
[474] Samik Sadhu, Di He, Che-Wei Huang, Sri Harish Mallidi, Minhua Wu, Ariya Rastrow, Andreas Stolcke, Jasha Droppo,
and Roland Maas. 2021. Wav2vec-c: A self-supervised model for speech representation learning. arXiv preprint
arXiv:2103.08393 (2021).
[475] Seyed Omid Sadjadi, Jason Pelecanos, and Weizhong Zhu. 2014. Nearest neighbor discriminant analysis for robust
speaker recognition. In Fifteenth Annual Conference of the International Speech Communication Association.
[476] Yuki Saito, Shinnosuke Takamichi, and Hiroshi Saruwatari. 2021. Perceptual-similarity-aware deep speaker repre-
sentation learning for multi-speaker generative modeling. IEEE/ACM Transactions on Audio, Speech, and Language
Processing 29 (2021), 1033–1048.
[477] Hojjat Salehinejad, Sharan Sankar, Joseph Barfett, Errol Colak, and Shahrokh Valaee. 2017. Recent advances in
recurrent neural networks. arXiv preprint arXiv:1801.01078 (2017).
[478] Elizabeth Salesky, Matthias Sperber, and Alan W Black. 2019. Exploring phoneme-level speech representations for
end-to-end speech translation. arXiv preprint arXiv:1906.01199 (2019).
[479] Kanthashree Mysore Sathyendra, Thejaswi Muniyappa, Feng-Ju Chang, Jing Liu, Jinru Su, Grant P Strimel, Athanasios
Mouchtaris, and Siegfried Kunzmann. 2022. Contextual adapters for personalized speech recognition in neural
transducers. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
IEEE, 8537–8541.
[480] Pascal Scalart et al. 1996. Speech enhancement based on a priori signal to noise estimation. In 1996 IEEE International
Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, Vol. 2. IEEE, 629–632.
[481] Carolina Scarton, Mikel L Forcada, Miquel Espla-Gomis, and Lucia Specia. 2019. Estimating post-editing effort: a
study on human judgements, task-based and reference-based metrics of MT quality. arXiv preprint arXiv:1910.06204
(2019).
[482] Robin Scheibler, Youna Ji, Soo-Whan Chung, Jaeuk Byun, Soyeon Choe, and Min-Seok Choi. 2022. Diffusion-based
Generative Speech Source Separation. arXiv preprint arXiv:2210.17327 (2022).
[483] Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli. 2019. wav2vec: Unsupervised pre-training for
speech recognition. arXiv preprint arXiv:1904.05862 (2019).
[484] Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. Facenet: A unified embedding for face recognition
and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition. 815–823.
[485] Mike Schuster and Kuldip K Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal
Processing 45, 11 (1997), 2673–2681.
[486] Deokjin Seo, Heung-Seon Oh, and Yuchul Jung. 2021. Wav2kws: Transfer learning from speech representations for
keyword spotting. IEEE Access 9 (2021), 80682–80691.
[487] Joan Serrà, Santiago Pascual, Jordi Pons, R Oguz Araz, and Davide Scaini. 2022. Universal speech enhancement with
score-based diffusion. arXiv preprint arXiv:2206.03065 (2022).
[488] Joan Serrà, Jordi Pons, and Santiago Pascual. 2021. SESQA: semi-supervised learning for speech quality assessment.
In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 381–385.
[489] Benjamin Sertolli, Zhao Ren, Björn W Schuller, and Nicholas Cummins. 2021. Representation transfer learning from
deep end-to-end speech recognition networks for the classification of health states from speech. Computer Speech &
Language 68 (2021), 101204.
[490] Jonathan Shen, Ye Jia, Mike Chrzanowski, Yu Zhang, Isaac Elias, Heiga Zen, and Yonghui Wu. 2020. Non-attentive
tacotron: Robust and controllable neural tts synthesis including unsupervised duration modeling. arXiv preprint
arXiv:2010.04301 (2020).
[491] Jonathan Shen, Ruoming Pang, Ron J Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang,
Yuxuan Wang, RJ Skerry-Ryan, et al. 2018. Natural tts synthesis by conditioning wavenet on mel spectrogram
predictions. In 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 4779–4783.
[492] Yangyang Shi, Yongqiang Wang, Chunyang Wu, Christian Fuegen, Frank Zhang, Duc Le, Ching-Feng Yeh, and
Michael L Seltzer. 2020. Weak-attention suppression for transformer based speech recognition. arXiv preprint
arXiv:2005.09137 (2020).
[493] Kevin J Shih, Rafael Valle, Rohan Badlani, Adrian Lancucki, Wei Ping, and Bryan Catanzaro. 2021. RAD-TTS: Parallel
flow-based TTS with robust alignment learning and diverse synthesis. In ICML Workshop on Invertible Neural Networks,
Normalizing Flows, and Explicit Likelihood Models.
[494] Hye-Jin Shim, Jungwoo Heo, Jae-Han Park, Ga-Hui Lee, and Ha-Jin Yu. 2022. Graph Attentive Feature Aggregation
for Text-Independent Speaker Verification. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP). 7972–7976. https://doi.org/10.1109/ICASSP43922.2022.9746257
[495] Amitay Sicherman and Yossi Adi. 2023. Analysing Discrete Self Supervised Speech Representation for Spoken
Language Modeling. arXiv preprint arXiv:2301.00591 (2023).
[496] Nikola Simić, Siniša Suzić, Tijana Nosek, Mia Vujović, Zoran Perić, Milan Savić, and Vlado Delić. 2022. Speaker
recognition using constrained convolutional neural networks in emotional speech. Entropy 24, 3 (2022), 414.
[497] Ruby Melody Simply, Eliran Dafna, and Yaniv Zigel. 2019. Diagnosis of Obstructive Sleep Apnea using Speech Signals
from Awake Subjects. IEEE Journal of Selected Topics in Signal Processing 14, 2 (2019), 251–260.
[498] Gundeep Singh, Sahil Sharma, Vijay Kumar, Manjit Kaur, Mohammed Baz, and Mehedi Masud. 2021. Spoken language
identification using deep learning. Computational Intelligence and Neuroscience 2021 (2021).
[499] Prachi Singh and Sriram Ganapathy. 2021. Self-Supervised Metric Learning With Graph Clustering For Speaker
Diarization. In 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). 90–97. https://doi.org/
10.1109/ASRU51503.2021.9688271
[500] Prachi Singh, Amrit Kaul, and Sriram Ganapathy. 2023. Supervised Hierarchical Clustering using Graph Neural
Networks for Speaker Diarization. arXiv preprint arXiv:2302.12716 (2023).
[501] Satwinder Singh, Ruili Wang, and Feng Hou. 2022. Improved Meta Learning for Low Resource Speech Recognition. In
ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 4798–4802.
[502] Hubert Siuzdak, Piotr Dura, Pol van Rijn, and Nori Jacoby. 2022. WavThruVec: Latent speech representation as
intermediate features for neural speech synthesis. arXiv preprint arXiv:2203.16930 (2022).
[503] RJ Skerry-Ryan, Eric Battenberg, Ying Xiao, Yuxuan Wang, Daisy Stanton, Joel Shor, Ron Weiss, Rob Clark, and Rif A
Saurous. 2018. Towards end-to-end prosody transfer for expressive speech synthesis with tacotron. In international
conference on machine learning. PMLR, 4693–4702.
[504] Nathan Smith and Mark Gales. 2001. Speech recognition using SVMs. Advances in neural information processing
systems 14 (2001).
[505] Jake Snell, Kevin Swersky, and Richard Zemel. 2017. Prototypical networks for few-shot learning. Advances in neural
information processing systems 30 (2017).
[506] David Snyder, Daniel Garcia-Romero, Daniel Povey, and Sanjeev Khudanpur. 2017. Deep neural network embeddings
for text-independent speaker verification.. In Interspeech, Vol. 2017. 999–1003.
[507] David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur. 2018. X-vectors: Robust
dnn embeddings for speaker recognition. In 2018 IEEE international conference on acoustics, speech and signal processing
(ICASSP). IEEE, 5329–5333.
[508] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. 2015. Deep unsupervised learning
using nonequilibrium thermodynamics. In International Conference on Machine Learning. PMLR, 2256–2265.
[509] Alex Solomonoff, William M Campbell, and Ian Boardman. 2005. Advances in channel compensation for SVM speaker
recognition. In 2005 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’05), Vol. 1.
IEEE, I–629.
[510] Alex Solomonoff, Carl Quillen, and William M Campbell. 2004. Channel compensation for SVM speaker recognition..
In Odyssey, Vol. 4. 219–226.
[511] Joon Son Chung, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. 2017. Lip reading sentences in the wild. In
Proceedings of the IEEE conference on computer vision and pattern recognition. 6447–6456.
[512] Man Sondhi and J. Schroeter. 1987. A hybrid time-frequency domain articulatory speech synthesizer. IEEE Transactions
on Acoustics, Speech, and Signal Processing 35, 7 (1987), 955–967. https://doi.org/10.1109/TASSP.1987.1165240
[513] Yang Song, Jingwen Zhu, Dawei Li, Xiaolong Wang, and Hairong Qi. 2018. Talking face generation by conditional
recurrent adversarial network. arXiv preprint arXiv:1804.04786 (2018).
[514] Meet H Soni and Hemant A Patil. 2016. Novel deep autoencoder features for non-intrusive speech quality assessment.
In 2016 24th European Signal Processing Conference (EUSIPCO). IEEE, 2315–2319.
[515] Alexander Sorin, Slava Shechtman, and Ron Hoory. 2020. Principal Style Components: Expressive Style Control and
Cross-Speaker Transfer in Neural TTS.. In INTERSPEECH. 3411–3415.
[516] Matthias Sperber and Matthias Paulik. 2020. Speech translation and the end-to-end promise: Taking stock of where
we are. arXiv preprint arXiv:2004.06358 (2020).
[517] Daniel Stoller, Sebastian Ewert, and Simon Dixon. 2018. Wave-u-net: A multi-scale neural network for end-to-end
audio source separation. arXiv preprint arXiv:1806.03185 (2018).
[518] Cem Subakan, Mirco Ravanelli, Samuele Cornell, Mirko Bronzi, and Jianyuan Zhong. 2021. Attention is all you need
in speech separation. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP). IEEE, 21–25.
[519] Vrunda N Sukhadia and S Umesh. 2023. Domain Adaptation of low-resource Target-Domain models using well-trained
ASR Conformer Models. In 2022 IEEE Spoken Language Technology Workshop (SLT). IEEE, 295–301.
[520] Aolan Sun, Jianzong Wang, Ning Cheng, Huayi Peng, Zhen Zeng, Lingwei Kong, and Jing Xiao. 2021. Graphpb:
Graphical representations of prosody boundary in speech synthesis. In 2021 IEEE Spoken Language Technology
Workshop (SLT). IEEE, 438–445.
[521] Aolan Sun, Jianzong Wang, Ning Cheng, Huayi Peng, Zhen Zeng, and Jing Xiao. 2020. GraphTTS: Graph-to-Sequence
Modelling in Neural Text-to-Speech. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP). 6719–6723. https://doi.org/10.1109/ICASSP40776.2020.9053355
[522] Tzu-Wei Sung, Jun-You Liu, Hung-yi Lee, and Lin-shan Lee. 2019. Towards end-to-end speech-to-text translation
with two-pass decoding. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP). IEEE, 7175–7179.
[523] Suno-AI. 2023. Bark. https://github.com/suno-ai/bark
[524] Gabriel Synnaeve, Qiantong Xu, Jacob Kahn, Tatiana Likhomanenko, Edouard Grave, Vineel Pratap, Anuroop Sriram,
Vitaliy Liptchinsky, and Ronan Collobert. 2020. End-to-End ASR: from Supervised to Semi-Supervised Learning with
Modern Architectures. In ICML 2020 Workshop on Self-supervision in Audio and Speech.
[525] Jaesung Tae, Hyeongju Kim, and Taesu Kim. 2021. EdiTTS: Score-based Editing for Controllable Text-to-Speech.
CoRR abs/2110.02584 (2021). arXiv:2110.02584 https://arxiv.org/abs/2110.02584
[526] Ke Tan and DeLiang Wang. 2019. Learning complex spectral mapping with gated convolutional recurrent networks
for monaural speech enhancement. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28 (2019),
380–390.
[527] Li Tan and Montri Karnjanadecha. 2003. Pitch detection algorithm: autocorrelation method and AMDF. In Proceedings
of the 3rd international symposium on communications and information technology, Vol. 2. 551–556.
[528] Kou Tanaka, Hirokazu Kameoka, Takuhiro Kaneko, and Nobukatsu Hojo. 2019. ATTS2S-VC: Sequence-to-sequence
Voice Conversion with Attention and Context Preservation Mechanisms. In ICASSP 2019 - 2019 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP). 6805–6809. https://doi.org/10.1109/ICASSP.2019.
8683282
[529] Chuanxin Tang, Chong Luo, Zhiyuan Zhao, Wenxuan Xie, and Wenjun Zeng. 2021. Joint time-frequency and time
domain learning for speech enhancement. In Proceedings of the Twenty-Ninth International Conference on International
Joint Conferences on Artificial Intelligence. 3816–3822.
[530] Yun Tang, Guohong Ding, Jing Huang, Xiaodong He, and Bowen Zhou. 2019. Deep speaker embedding learning with
multi-level pooling for text-independent speaker verification. In ICASSP 2019-2019 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP). IEEE, 6116–6120.
[531] Fei Tao and Carlos Busso. 2019. End-to-end audiovisual speech activity detection with bimodal recurrent neural
models. Speech Communication 113 (2019), 25–35.
[532] Xiaohai Tian, Eng Siong Chng, and Haizhou Li. 2019. A vocoder-free WaveNet voice conversion with non-parallel
data. arXiv preprint arXiv:1902.03705 (2019).
[533] Noé Tits, Fengna Wang, Kevin El Haddad, Vincent Pagel, and Thierry Dutoit. 2019. Visualization and Interpretation
of Latent Spaces for Controlling Expressive Speech Synthesis Through Audio Analysis. Proc. Interspeech 2019 (2019),
4475–4479.
[534] Andros Tjandra, Sakriani Sakti, and Satoshi Nakamura. 2018. Sequence-to-sequence ASR optimization via rein-
forcement learning. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE,
5829–5833.
[535] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste
Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume
Lample. 2023. LLaMA: Open and Efficient Foundation Language Models. ArXiv abs/2302.13971 (2023).
[536] Sue E Tranter and Douglas A Reynolds. 2006. An overview of automatic speaker diarization systems. IEEE Transactions
on Audio, Speech, and Language Processing 14, 5 (2006), 1557–1565.
[537] Emiru Tsunoo, Yosuke Kashiwagi, Toshiyuki Kumakura, and Shinji Watanabe. 2019. Transformer ASR with contextual
block processing. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 427–433.
[538] Tao Tu, Yuan-Jui Chen, Cheng-chieh Yeh, and Hung-Yi Lee. 2019. End-to-end text-to-speech for low-resource
languages by cross-lingual transfer learning. arXiv preprint arXiv:1904.06508 (2019).
[539] Zoltán Tüske, Kartik Audhkhasi, and George Saon. 2019. Advancing Sequence-to-Sequence Based Speech Recognition..
In Interspeech. 3780–3784.
[540] Efthymios Tzinis, Yossi Adi, Vamsi K Ithapu, Buye Xu, and Anurag Kumar. 2022. Continual self-training with
bootstrapped remixing for speech enhancement. In ICASSP 2022-2022 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP). IEEE, 6947–6951.
[541] Efthymios Tzinis, Yossi Adi, Vamsi K Ithapu, Buye Xu, Paris Smaragdis, and Anurag Kumar. 2022. RemixIT: Continual
self-training of speech enhancement models via bootstrapped remixing. IEEE Journal of Selected Topics in Signal
Processing 16, 6 (2022), 1329–1341.
[542] Panagiotis Tzirakis, Anurag Kumar, and Jacob Donley. 2021. Multi-channel speech enhancement using graph neural
networks. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
IEEE, 3415–3419.
[543] Se-Yun Um, Sangshin Oh, Kyungguen Byun, Inseon Jang, ChungHyun Ahn, and Hong-Goo Kang. 2020. Emotional
Speech Synthesis with Rich and Granularized Control. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP). 7254–7258. https://doi.org/10.1109/ICASSP40776.2020.9053732
[544] Jan Vainer and Ondřej Dušek. 2020. Speedyspeech: Efficient neural speech synthesis. arXiv preprint arXiv:2008.03802
(2020).
[545] Jean-Marc Valin, Umut Isik, Paris Smaragdis, and Arvindh Krishnaswamy. 2022. Neural speech synthesis on a
shoestring: Improving the efficiency of lpcnet. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP). IEEE, 8437–8441.
[546] Jean-Marc Valin and Jan Skoglund. 2019. LPCNet: Improving neural speech synthesis through linear prediction. In
ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 5891–5895.
[547] Rafael Valle, Jason Li, Ryan Prenger, and Bryan Catanzaro. 2020. Mellotron: Multispeaker Expressive Voice Synthesis
by Conditioning on Rhythm, Pitch and Global Style Tokens. In ICASSP 2020 - 2020 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP). 6189–6193. https://doi.org/10.1109/ICASSP40776.2020.9054556
[548] Rafael Valle, Kevin Shih, Ryan Prenger, and Bryan Catanzaro. 2020. Flowtron: an autoregressive flow-based generative
network for text-to-speech synthesis. arXiv preprint arXiv:2005.05957 (2020).
[549] Aaron Van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. 2016. Conditional image
generation with pixelcnn decoders. Advances in neural information processing systems 29 (2016).
[550] Aaron Van Den Oord, Oriol Vinyals, et al. 2017. Neural discrete representation learning. Advances in neural
information processing systems 30 (2017).
[551] Andrea Vanzo, Danilo Croce, Emanuele Bastianelli, Roberto Basili, and Daniele Nardi. 2016. Robust spoken language
understanding for house service robots. Polibits 54 (2016), 11–16.
[552] Ehsan Variani, Xin Lei, Erik McDermott, Ignacio Lopez Moreno, and Javier Gonzalez-Dominguez. 2014. Deep neural
networks for small footprint text-dependent speaker verification. In 2014 IEEE international conference on acoustics,
speech and signal processing (ICASSP). IEEE, 4052–4056.
[553] Ehsan Variani, Xin Lei, Erik McDermott, Ignacio Lopez Moreno, and Javier Gonzalez-Dominguez. 2014. Deep neural
networks for small footprint text-dependent speaker verification. In 2014 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP). 4052–4056. https://doi.org/10.1109/ICASSP.2014.6854363
[554] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia
Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).
[555] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2017. Graph
attention networks. arXiv preprint arXiv:1710.10903 (2017).
[556] Petar Veličković, William Fedus, William L Hamilton, Pietro Liò, Yoshua Bengio, and R Devon Hjelm. 2018. Deep
graph infomax. arXiv preprint arXiv:1809.10341 (2018).
[557] Emmanuel Vincent, Tuomas Virtanen, and Sharon Gannot. 2018. Audio source separation and speech enhancement.
John Wiley & Sons.
[558] Thilo von Neumann, Keisuke Kinoshita, Christoph Boeddeker, Marc Delcroix, and Reinhold Haeb-Umbach. 2021.
Graph-PIT: Generalized permutation invariant training for continuous separation of arbitrary numbers of speakers.
arXiv preprint arXiv:2107.14446 (2021).
[559] Tyler Vuong, Yangyang Xia, and Richard M Stern. 2021. A modulation-domain loss for neural-network-based
real-time speech enhancement. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP). IEEE, 6643–6647.
[560] Roman Vygon and Nikolay Mikhaylovskiy. 2021. Learning efficient representations for keyword spotting with triplet
loss. In Speech and Computer: 23rd International Conference, SPECOM 2021, St. Petersburg, Russia, September 27–30,
2021, Proceedings 23. Springer, 773–785.
[561] Li Wan, Quan Wang, Alan Papir, and Ignacio Lopez Moreno. 2018. Generalized end-to-end loss for speaker verification.
In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 4879–4883.
[562] Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming
Wang, Jinyu Li, et al. 2023. Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers. arXiv preprint
arXiv:2301.02111 (2023).
[563] Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, and Juan Pino. 2020. fairseq s2t:
Fast speech-to-text modeling with fairseq. arXiv preprint arXiv:2010.05171 (2020).
[564] Changhan Wang, Anne Wu, and Juan Pino. 2020. Covost 2 and massively multilingual speech-to-text translation.
arXiv preprint arXiv:2007.10310 (2020).
[565] Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, and Xuedong Huang. 2021.
Unispeech: Unified speech representation learning with labeled and unlabeled data. In International Conference on
Machine Learning. PMLR, 10937–10947.
[566] Feng Wang and David MJ Tax. 2016. Survey on the attention based RNN model and its applications in computer
vision. arXiv preprint arXiv:1601.06823 (2016).
[567] Gary Wang. 2019. Deep text-to-speech system with seq2seq model. arXiv preprint arXiv:1903.07398 (2019).
[568] Heming Wang and DeLiang Wang. 2020. Time-Frequency Loss for CNN Based Speech Super-Resolution. In ICASSP
2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 861–865. https:
//doi.org/10.1109/ICASSP40776.2020.9053712
[569] Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Dihong Gong, Jingchao Zhou, Zhifeng Li, and Wei Liu. 2018. Cosface:
Large margin cosine loss for deep face recognition. In Proceedings of the IEEE conference on computer vision and
pattern recognition. 5265–5274.
[570] Jingsong Wang, Yuxuan He, Chunyu Zhao, Qijie Shao, Wei-Wei Tu, Tom Ko, Hung-yi Lee, and Lei Xie. 2021. Auto-KWS
2021 Challenge: Task, datasets, and baselines. arXiv preprint arXiv:2104.00513 (2021).
[571] Jixuan Wang, Xiong Xiao, Jian Wu, Ranjani Ramamurthy, Frank Rudzicz, and Michael Brudno. 2020. Speaker
Diarization with Session-Level Speaker Embedding Refinement Using Graph Neural Networks. In ICASSP 2020 - 2020
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 7109–7113. https://doi.org/10.1109/
ICASSP40776.2020.9054176
[572] Quan Wang, Carlton Downey, Li Wan, Philip Andrew Mansfield, and Ignacio Lopez Moreno. 2018. Speaker diarization
with LSTM. In 2018 IEEE International conference on acoustics, speech and signal processing (ICASSP). IEEE, 5239–5243.
[573] Qing Wang, Pengcheng Guo, Sining Sun, Lei Xie, and John HL Hansen. 2019. Adversarial Regularization for
End-to-End Robust Speaker Verification.. In Interspeech. 4010–4014.
[574] Qing Wang, Wei Rao, Sining Sun, Lei Xie, Eng Siong Chng, and Haizhou Li. 2018. Unsupervised Domain Adaptation
via Domain Adversarial Training for Speaker Recognition. In 2018 IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP). 4889–4893. https://doi.org/10.1109/ICASSP.2018.8461423
[575] Tianzi Wang, Jiajun Deng, Mengzhe Geng, Zi Ye, Shoukang Hu, Yi Wang, Mingyu Cui, Zengrui Jin, Xunying Liu, and
Helen Meng. 2022. Conformer Based Elderly Speech Recognition System for Alzheimer’s Disease Detection. arXiv
preprint arXiv:2206.13232 (2022).
[576] Tingting Wang, Zexu Pan, Meng Ge, Zhen Yang, and Haizhou Li. 2023. Time-Domain Speech Separation Networks
With Graph Encoding Auxiliary. IEEE Signal Processing Letters 30 (2023), 110–114.
[577] Weiqing Wang, Qingjian Lin, Danwei Cai, and Ming Li. 2022. Similarity measurement of segment-level speaker
embeddings in speaker diarization. IEEE/ACM Transactions on Audio, Speech, and Language Processing 30 (2022),
2645–2658.
[578] Xueyi Wang, Lantian Li, and Dong Wang. 2019. VAE-based domain adaptation for speaker verification. In 2019
Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE, 535–539.
[579] Xi Wang, Huaiping Ming, Lei He, and Frank K Soong. 2020. s-transformer: Segment-transformer for robust neural
speech synthesis. arXiv preprint arXiv:2011.08480 (2020).
[580] Yuancheng Wang, Zeqian Ju, Xu Tan, Lei He, Zhizheng Wu, Jiang Bian, and Sheng Zhao. 2023. AUDIT: Audio Editing
by Following Instructions with Latent Diffusion Models. arXiv:2304.00830 [cs.SD]
[581] Yu Wang, Yilin Shen, and Hongxia Jin. 2018. A bi-model based rnn semantic frame parsing model for intent detection
and slot filling. arXiv preprint arXiv:1812.10235 (2018).
[582] Yongqiang Wang, Yangyang Shi, Frank Zhang, Chunyang Wu, Julian Chan, Ching-Feng Yeh, and Alex Xiao. 2021.
Transformer in Action: A Comparative Study of Transformer-Based Acoustic Models for Large Scale Speech Recog-
nition Applications. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP). 6778–6782. https://doi.org/10.1109/ICASSP39728.2021.9414087
[583] Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J Weiss, Navdeep Jaitly, Zongheng Yang, Ying
Xiao, Zhifeng Chen, Samy Bengio, et al. 2017. Tacotron: Towards end-to-end speech synthesis. arXiv preprint
arXiv:1703.10135 (2017).
[584] Yuxuan Wang, Daisy Stanton, Yu Zhang, RJ Skerry-Ryan, Eric Battenberg, Joel Shor, Ying Xiao, Ye Jia, Fei Ren, and
Rif A Saurous. 2018. Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis.
In International Conference on Machine Learning. PMLR, 5180–5189.
[585] Yi Wang, Shiqi Zhang, and Joohyung Lee. 2019. Bridging commonsense reasoning and probabilistic planning via a
probabilistic action language. Theory and Practice of Logic Programming 19, 5-6 (2019), 1090–1106.
[586] Ziheng Wang, Jeremy Wohlwend, and Tao Lei. 2019. Structured Pruning of Large Language Models. CoRR
abs/1910.04732 (2019). arXiv:1910.04732 http://arxiv.org/abs/1910.04732
[587] Zhong-Qiu Wang, Jonathan Le Roux, and John R Hershey. 2018. Alternative objective functions for deep clustering.
In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 686–690.
[588] Zhong-Qiu Wang, Peidong Wang, and DeLiang Wang. 2020. Complex spectral mapping for single-and multi-channel
speech enhancement and robust ASR. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28 (2020),
1778–1787.
[589] Pete Warden. 2018. Speech commands: A dataset for limited-vocabulary speech recognition. arXiv preprint
arXiv:1804.03209 (2018).
[590] Ron J Weiss, RJ Skerry-Ryan, Eric Battenberg, Soroosh Mariooryad, and Diederik P Kingma. 2021. Wave-tacotron:
Spectrogram-free end-to-end text-to-speech synthesis. In ICASSP 2021-2021 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP). IEEE, 5679–5683.
[591] Chao Weng, Jia Cui, Guangsen Wang, Jun Wang, Chengzhu Yu, Dan Su, and Dong Yu. 2018. Improving Attention
Based Sequence-to-Sequence Models for End-to-End English Conversational Speech Recognition.. In Interspeech.
761–765.
[592] Nils L Westhausen and Bernd T Meyer. 2020. Dual-signal transformation lstm network for real-time noise suppression.
arXiv preprint arXiv:2005.07551 (2020).
[593] Genta Indra Winata, Samuel Cahyawijaya, Zhaojiang Lin, Zihan Liu, and Pascale Fung. 2020. Lightweight and Efficient
End-To-End Speech Recognition Using Low-Rank Transformer. In ICASSP 2020 - 2020 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP). 6144–6148. https://doi.org/10.1109/ICASSP40776.2020.9053878
[594] Da-Yi Wu and Hung-yi Lee. 2020. One-shot voice conversion by vector quantization. In ICASSP 2020-2020 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 7734–7738.
[595] Da-Yi Wu and Hung-yi Lee. 2020. One-Shot Voice Conversion by Vector Quantization. In ICASSP 2020 - 2020 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP). 7734–7738. https://doi.org/10.1109/
ICASSP40776.2020.9053854
[596] Jianfeng Wu, Yongzhu Hua, Shengying Yang, Hongshuai Qin, and Huibin Qin. 2019. Speech enhancement using
generative adversarial network by distilling knowledge from statistical method. Applied Sciences 9, 16 (2019), 3396.
[597] Shoule Wu and Ziqiang Shi. 2021. ItoTTS and ItoWave: Linear Stochastic Differential Equation Is All You Need For
Audio Generation. arXiv preprint arXiv:2105.07583 (2021).
[598] Xianchao Wu. 2022. Deep Sparse Conformer for Speech Recognition. arXiv preprint arXiv:2209.00260 (2022).
[599] Yihan Wu, Xu Tan, Bohan Li, Lei He, Sheng Zhao, Ruihua Song, Tao Qin, and Tie-Yan Liu. 2022. Adaspeech 4:
Adaptive text to speech in zero-shot scenarios. arXiv preprint arXiv:2204.00436 (2022).
[600] Wei Xia, Jing Huang, and John HL Hansen. 2019. Cross-lingual text-independent speaker verification using unsuper-
vised adversarial discriminative domain adaptation. In ICASSP 2019-2019 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP). IEEE, 5816–5820.
[601] Xiong Xiao, Naoyuki Kanda, Zhuo Chen, Tianyan Zhou, Takuya Yoshioka, Sanyuan Chen, Yong Zhao, Gang Liu, Yu
Wu, Jian Wu, et al. 2021. Microsoft speaker diarization system for the voxceleb speaker recognition challenge 2020. In
ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 5824–5828.
[602] Detai Xin, Yuki Saito, Shinnosuke Takamichi, Tomoki Koriyama, and Hiroshi Saruwatari. 2020. Cross-Lingual Text-
To-Speech Synthesis via Domain Adaptation and Perceptual Similarity Regression in Speaker Space.. In Interspeech.
2947–2951.
[603] Chen Xu, Bojie Hu, Yanyang Li, Yuhao Zhang, Qi Ju, Tong Xiao, Jingbo Zhu, et al. 2021. Stacked acoustic-and-textual
encoding: Integrating the pre-trained models into speech translation encoders. arXiv preprint arXiv:2105.05752 (2021).
[604] Jin Xu, Xu Tan, Yi Ren, Tao Qin, Jian Li, Sheng Zhao, and Tie-Yan Liu. 2020. Lrspeech: Extremely low-resource
speech synthesis and recognition. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge
Discovery & Data Mining. 2802–2812.
[605] Qiantong Xu, Alexei Baevski, Tatiana Likhomanenko, Paden Tomasello, Alexis Conneau, Ronan Collobert, Gabriel
Synnaeve, and Michael Auli. 2021. Self-training and pre-training are complementary for speech recognition. In
ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 3030–3034.
[606] Yong Xu, Jun Du, Li-Rong Dai, and Chin-Hui Lee. 2014. A regression approach to speech enhancement based on
deep neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing 23, 1 (2014), 7–19.
[607] Jinlong Xue, Yayue Deng, Yichen Han, Ya Li, Jianqing Sun, and Jiaen Liang. 2022. ECAPA-TDNN for Multi-speaker
Text-to-speech Synthesis. In 2022 13th International Symposium on Chinese Spoken Language Processing (ISCSLP).
IEEE, 230–234.
[608] Ryuichi Yamamoto, Eunwoo Song, and Jae-Min Kim. 2020. Parallel WaveGAN: A fast waveform generation model
based on generative adversarial networks with multi-resolution spectrogram. In ICASSP 2020-2020 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 6199–6203.
[609] Yuzi Yan, Xu Tan, Bohan Li, Tao Qin, Sheng Zhao, Yuan Shen, and Tie-Yan Liu. 2021. Adaspeech 2: Adaptive text to
speech with untranscribed data. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP). IEEE, 6613–6617.
[610] Dongchao Yang, Songxiang Liu, Jianwei Yu, Helin Wang, Chao Weng, and Yuexian Zou. 2022. NoreSpeech: Knowledge
Distillation based Conditional Diffusion Model for Noise-robust Expressive TTS. arXiv preprint arXiv:2211.02448
(2022).
[611] Dongchao Yang, Jianwei Yu, Helin Wang, Wen Wang, Chao Weng, Yuexian Zou, and Dong Yu. 2022. Diffsound:
Discrete diffusion model for text-to-sound generation. arXiv preprint arXiv:2207.09983 (2022).
[612] Geng Yang, Shan Yang, Kai Liu, Peng Fang, Wei Chen, and Lei Xie. 2021. Multi-band melgan: Faster waveform
generation for high-quality text-to-speech. In 2021 IEEE Spoken Language Technology Workshop (SLT). IEEE, 492–498.
[613] Gene-Ping Yang, Chao-I Tuan, Hung-Yi Lee, and Lin-shan Lee. 2019. Improved speech separation with time-and-
frequency cross-domain joint embedding and clustering. arXiv preprint arXiv:1904.07845 (2019).
[614] Jinhyeok Yang, Junmo Lee, Youngik Kim, Hoonyoung Cho, and Injung Kim. 2020. VocGAN: A high-fidelity real-time
vocoder with a hierarchically-nested adversarial network. arXiv preprint arXiv:2007.15256 (2020).
[615] Shu-Wen Yang, Po-Han Chi, Yung-Sung Chuang, Cheng-I Jeff Lai, Kushal Lakhotia, Yist Y. Lin, Andy T. Liu, Jiatong Shi,
Xuankai Chang, Guan-Ting Lin, Tzu-Hsien Huang, Wei-Cheng Tseng, Ko-tik Lee, Da-Rong Liu, Zili Huang, Shuyan
Dong, Shang-Wen Li, Shinji Watanabe, Abdelrahman Mohamed, and Hung-yi Lee. 2021. SUPERB: Speech processing
Universal PERformance Benchmark. CoRR abs/2105.01051 (2021). arXiv:2105.01051 https://arxiv.org/abs/2105.01051
[616] Shiqing Yang and Min Liu. 2022. Data augmentation for speaker verification. In Proceedings of the 2022 6th International
Conference on Electronic Information Technology and Computer Engineering. 1247–1251.
[617] Shuang Yang, Yuanhang Zhang, Dalu Feng, Mingmin Yang, Chenhao Wang, Jingyun Xiao, Keyu Long, Shiguang
Shan, and Xilin Chen. 2019. LRW-1000: A Naturally-Distributed Large-Scale Benchmark for Lip Reading in the
Wild. In 2019 14th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2019). 1–8. https:
//doi.org/10.1109/FG.2019.8756582
[618] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. Xlnet: Generalized
autoregressive pretraining for language understanding. Advances in neural information processing systems 32 (2019).
[619] Zhewei Yao, Zhen Dong, Zhangcheng Zheng, Amir Gholami, Jiali Yu, Eric Tan, Leyuan Wang, Qijing Huang, Yida
Wang, Michael W. Mahoney, and Kurt Keutzer. 2020. HAWQV3: Dyadic Neural Network Quantization. CoRR
abs/2011.10680 (2020). arXiv:2011.10680 https://arxiv.org/abs/2011.10680
[620] Yusuke Yasuda, Xin Wang, Shinji Takaki, and Junichi Yamagishi. 2019. Investigation of Enhanced Tacotron Text-to-
speech Synthesis Systems with Self-attention for Pitch Accent Language. In ICASSP 2019 - 2019 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP). 6905–6909. https://doi.org/10.1109/ICASSP.2019.
8682353
[621] Feng Ye and Jun Yang. 2021. A deep neural network model for speaker identification. Applied Sciences 11, 8 (2021),
3603.
[622] Rong Ye, Mingxuan Wang, and Lei Li. 2021. End-to-end speech translation via cross-modal progressive training.
arXiv preprint arXiv:2104.10380 (2021).
[623] Hao Yen, François G Germain, Gordon Wichern, and Jonathan Le Roux. 2022. Cold Diffusion for Speech Enhancement.
arXiv preprint arXiv:2211.02527 (2022).
[624] Reo Yoneyama, Ryuichi Yamamoto, and Kentaro Tachibana. 2022. Nonparallel High-Quality Audio Super Resolution
with Domain Adaptation and Resampling CycleGANs. arXiv preprint arXiv:2210.15887 (2022).
[625] Ji Won Yoon, Beom Jun Woo, and Nam Soo Kim. 2022. Hubert-ee: Early exiting hubert for efficient speech recognition.
arXiv preprint arXiv:2204.06328 (2022).
[626] Jaeseong You, Dalhyun Kim, Gyuhyeon Nam, Geumbyeol Hwang, and Gyeongsu Chae. 2021. Gan vocoder: Multi-
resolution discriminator is all you need. arXiv preprint arXiv:2103.05236 (2021).
[627] Chengzhu Yu, Heng Lu, Na Hu, Meng Yu, Chao Weng, Kun Xu, Peng Liu, Deyi Tuo, Shiyin Kang, Guangzhi Lei, et al.
2019. Durian: Duration informed attention network for multimodal synthesis. arXiv preprint arXiv:1909.01700 (2019).
[628] Dong Yu and Li Deng. 2016. Automatic speech recognition. Vol. 1. Springer.
[629] Fisher Yu and Vladlen Koltun. 2015. Multi-Scale Context Aggregation by Dilated Convolutions. CoRR abs/1511.07122
(2015).
[630] Yechan Yu, Dongkeon Park, and Hong Kook Kim. 2022. Auxiliary loss of transformer with residual connection for
end-to-end speaker diarization. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP). IEEE, 8377–8381.
[631] Fengpeng Yue, Yan Deng, Lei He, Tom Ko, and Yu Zhang. 2022. Exploring machine speech chain for domain
adaptation. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
IEEE, 6757–6761.
[632] Seongjun Yun, Minbyul Jeong, Raehyun Kim, Jaewoo Kang, and Hyunwoo J Kim. 2019. Graph transformer networks.
Advances in neural information processing systems 32 (2019).
[633] Neil Zeghidour and David Grangier. 2021. Wavesplit: End-to-end speech separation by speaker clustering. IEEE/ACM
Transactions on Audio, Speech, and Language Processing 29 (2021), 2840–2849.
[634] Hossein Zeinali, Shuai Wang, Anna Silnova, Pavel Matějka, and Oldřich Plchot. 2019. But system description to
voxceleb speaker recognition challenge 2019. arXiv preprint arXiv:1910.12592 (2019).
[635] Jihen Zeremdini, Mohamed Anouar Ben Messaoud, and Aicha Bouzid. 2015. A comparison of several computational
auditory scene analysis (CASA) techniques for monaural speech segregation. Brain informatics 2 (2015), 155–166.
[636] Albert Zeyer, André Merboldt, Wilfried Michel, Ralf Schlüter, and Hermann Ney. 2021. Librispeech transducer model
with internal language model prior correction. arXiv preprint arXiv:2104.03006 (2021).
[637] Aonan Zhang, Quan Wang, Zhenyao Zhu, John Paisley, and Chong Wang. 2019. Fully supervised speaker diarization.
In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 6301–
6305.
[638] Biao Zhang, Barry Haddow, and Rico Sennrich. 2022. Revisiting end-to-end speech-to-text translation from scratch.
In International Conference on Machine Learning. PMLR, 26193–26205.
[639] Biao Zhang, Ivan Titov, Barry Haddow, and Rico Sennrich. 2020. Adaptive feature selection for end-to-end speech
translation. arXiv preprint arXiv:2010.08518 (2020).
[640] Chunlei Zhang and Kazuhito Koishida. 2017. End-to-end text-independent speaker verification with triplet loss on
short utterances.. In Interspeech. 1487–1491.
[641] Chenwei Zhang, Yaliang Li, Nan Du, Wei Fan, and Philip S Yu. 2018. Joint slot filling and intent detection via capsule
neural networks. arXiv preprint arXiv:1812.09471 (2018).
[642] Chen Zhang, Yi Ren, Xu Tan, Jinglin Liu, Kejun Zhang, Tao Qin, Sheng Zhao, and Tie-Yan Liu. 2021. Denoispeech:
Denoising text to speech with frame-level noise modeling. In ICASSP 2021-2021 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP). IEEE, 7063–7067.
[643] Chunlei Zhang, Jiatong Shi, Chao Weng, Meng Yu, and Dong Yu. 2022. Towards end-to-end speaker diarization with
generalized neural speaker clustering. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP). IEEE, 8372–8376.
[644] Hanyi Zhang, Longbiao Wang, Kong Aik Lee, Meng Liu, Jianwu Dang, and Hui Chen. 2021. Meta-learning for
cross-channel speaker verification. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP). IEEE, 5839–5843.
[645] Hanyi Zhang, Longbiao Wang, Kong Aik Lee, Meng Liu, Jianwu Dang, and Helen Meng. 2023. Meta-Generalization
for Domain-Invariant Speaker Verification. IEEE/ACM Transactions on Audio, Speech, and Language Processing 31
(2023), 1024–1036.
[646] Jing-Xuan Zhang, Zhen-Hua Ling, and Li-Rong Dai. 2018. Forward attention in sequence-to-sequence acoustic
modeling for speech synthesis. In 2018 IEEE International conference on acoustics, speech and signal processing (ICASSP).
IEEE, 4789–4793.
[647] Jing-Xuan Zhang, Zhen-Hua Ling, and Li-Rong Dai. 2019. Non-parallel sequence-to-sequence voice conversion
with disentangled linguistic and speaker representations. IEEE/ACM Transactions on Audio, Speech, and Language
Processing 28 (2019), 540–552.
[648] Jing-Xuan Zhang, Zhen-Hua Ling, Li-Juan Liu, Yuan Jiang, and Li-Rong Dai. 2019. Sequence-to-sequence acoustic
modeling for voice conversion. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27, 3 (2019),
631–644.
[649] Jing-Xuan Zhang, Li-Juan Liu, Yan-Nian Chen, Ya-Jun Hu, Yuan Jiang, Zhen-Hua Ling, and Li-Rong Dai. 2020. Voice
conversion by cascading automatic speech recognition and text-to-speech synthesis with prosody transfer. arXiv
preprint arXiv:2009.01475 (2020).
[650] Lichao Zhang, Yi Ren, Liqun Deng, and Zhou Zhao. 2022. Hifidenoise: High-fidelity denoising text to speech with
adversarial networks. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP). IEEE, 7232–7236.
[651] Qian Zhang, Han Lu, Hasim Sak, Anshuman Tripathi, Erik McDermott, Stephen Koo, and Shankar Kumar. 2020.
Transformer transducer: A streamable speech recognition model with transformer encoders and rnn-t loss. In ICASSP
2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 7829–7833.
[652] Si Zhang, Hanghang Tong, Jiejun Xu, and Ross Maciejewski. 2019. Graph convolutional networks: a comprehensive
review. Computational Social Networks 6, 1 (2019), 1–23.
[653] Xingxuan Zhang, Feng Cheng, and Shilin Wang. 2019. Spatio-temporal fusion based convolutional sequence learning
for lip reading. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 713–722.
[654] Xulong Zhang, Jianzong Wang, Ning Cheng, and Jing Xiao. 2022. TDASS: Target Domain Adaptation Speech Synthesis
Framework for Multi-speaker Low-Resource TTS. In 2022 International Joint Conference on Neural Networks (IJCNN).
1–7. https://doi.org/10.1109/IJCNN55064.2022.9892596
[655] Ying Zhang, Hao Che, and Xiaorui Wang. 2021. Non-parallel Sequence-to-Sequence Voice Conversion for Arbitrary
Speakers. In 2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP). 1–5. https:
//doi.org/10.1109/ISCSLP49672.2021.9362095
[656] Yu Zhang, Wei Han, James Qin, Yongqiang Wang, Ankur Bapna, Zhehuai Chen, Nanxin Chen, Bo Li, Vera Axelrod,
Gary Wang, et al. 2023. Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages. arXiv preprint
arXiv:2303.01037 (2023).
[657] Yu Zhang, Daniel S. Park, Wei Han, James Qin, Anmol Gulati, Joel Shor, Aren Jansen, Yuanzhong Xu, Yanping Huang,
Shibo Wang, Zongwei Zhou, Bo Li, Min Ma, William Chan, Jiahui Yu, Yongqiang Wang, Liangliang Cao, Khe Chai
Sim, Bhuvana Ramabhadran, Tara N. Sainath, Françoise Beaufays, Zhifeng Chen, Quoc V. Le, Chung-Cheng Chiu,
Ruoming Pang, and Yonghui Wu. 2022. BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning
for Automatic Speech Recognition. IEEE Journal of Selected Topics in Signal Processing 16, 6 (2022), 1519–1532.
https://doi.org/10.1109/JSTSP.2022.3182537
[658] Yu Zhang, James Qin, Daniel S Park, Wei Han, Chung-Cheng Chiu, Ruoming Pang, Quoc V Le, and Yonghui Wu. 2020.
Pushing the limits of semi-supervised learning for automatic speech recognition. arXiv preprint arXiv:2010.10504
(2020).
[659] Ziqiang Zhang, Long Zhou, Chengyi Wang, Sanyuan Chen, Yu Wu, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming
Wang, Jinyu Li, et al. 2023. Speak foreign languages with your own voice: Cross-lingual neural codec language
modeling. arXiv preprint arXiv:2303.03926 (2023).
[660] Chengqi Zhao, Mingxuan Wang, Qianqian Dong, Rong Ye, and Lei Li. 2020. NeurST: Neural speech translation toolkit.
arXiv preprint arXiv:2012.10018 (2020).
[661] Guanlong Zhao, Sinem Sonsaat, Alif Silpachai, Ivana Lucic, Evgeny Chukharev-Hudilainen, John Levis, and Ricardo
Gutierrez-Osuna. 2018. L2-ARCTIC: A non-native English speech corpus.. In Interspeech. 2783–2787.
[662] Hongyu Zhao, Hao Tan, and Hongyuan Mei. 2022. Tiny-Attention Adapter: Contexts Are More Important Than the
Number of Parameters. arXiv preprint arXiv:2211.01979 (2022).
[663] Shengkui Zhao and Bin Ma. 2023. MossFormer: Pushing the Performance Limit of Monaural Speech Separation using
Gated Single-Head Transformer with Convolution-Augmented Joint Self-Attentions. arXiv preprint arXiv:2302.11824
(2023).
[664] Shengkui Zhao, Trung Hieu Nguyen, and Bin Ma. 2021. Monaural speech enhancement with complex convolutional
block attention module and joint time frequency losses. In ICASSP 2021-2021 IEEE International Conference on Acoustics,