A Review of Deep Learning Techniques for Speech Processing
[Figure 1: Performance of speech processing systems over time, progressing from HMM+GMM and classical recurrent models (LSTM, GRU) to ContextNet, Conformer, FastSpeech2, Wav2Vec 2.0, SpeechStew, HuBERT, Whisper, and VALL-E.]
tools in speech processing, offering remarkable improvements in various tasks. Pioneering studies, such as [185], have demonstrated the substantial gains achieved by deep neural networks (DNNs) in speech recognition accuracy compared to traditional HMM-based systems. Complementing this, research in [3] showcased the effectiveness of convolutional neural networks (CNNs) for speech recognition. Moreover, recurrent neural networks (RNNs) have proven their efficacy in both speech recognition and synthesis, as highlighted in [161]. Recent advancements in deep learning have further enhanced speech processing systems, with attention mechanisms [85] and transformers [555] playing significant roles. Attention mechanisms enable the model to focus on salient sections of the input signal, while transformers facilitate modeling long-range dependencies within the signal. These developments have led to substantial improvements in the performance and versatility of speech processing systems, unlocking new possibilities for applications in diverse domains.

Although deep learning has made remarkable progress in speech processing, it still faces certain challenges that need to be addressed. These challenges include the requirement for substantial amounts of labeled data, the interpretability of the models, and their robustness to different environmental conditions. To provide a comprehensive understanding of the advancements in this domain, this paper presents an extensive overview of deep learning architectures employed in speech-processing applications. Speech processing encompasses the analysis, synthesis, and recognition of speech signals, and the integration of deep learning techniques has led to significant advancements in these areas. By examining the current state-of-the-art approaches, this paper aims to shed light on the potential of deep learning for tackling the existing challenges and further advancing speech processing research.

The paper provides a comprehensive exploration of deep-learning architectures in the field of speech processing. It begins by establishing the background, encompassing the definition of speech signals, speech features, and traditional non-neural models. Subsequently, the focus shifts towards an in-depth examination of various deep-learning architectures specifically tailored for speech processing, including RNNs, CNNs, Transformers, GNNs, and diffusion models. Recognizing the significance of representation learning techniques in this domain, the survey dedicates a separate section to their exploration.

Moving forward, the paper delves into an extensive range of speech processing tasks where deep learning has demonstrated substantial advancements. These tasks encompass critical areas such as speech recognition, speech synthesis, speaker recognition, and speech-to-speech translation. By thoroughly analyzing the fundamentals, model architectures, and specific tasks within the field, the paper then progresses to discuss advanced transfer learning techniques, including domain adaptation, meta-learning, and parameter-efficient transfer learning.

Finally, in the conclusion, the paper reflects on the current state of the field and identifies potential future directions. By considering emerging trends and novel approaches, the paper aims to shed light on the evolving landscape of deep learning in speech processing and provide insights into promising avenues for further research and development.

Why this paper? Deep learning has become a powerful tool in speech processing because it automatically learns high-level representations of speech signals from raw audio data. As a result, significant advancements have been made in various speech-processing tasks, including speech recognition, speaker identification, speech synthesis, and more. These tasks are essential in various applications, such as human-computer interaction, speech-based search, and assistive technology for people with speech impairments. For example, virtual assistants like Siri and Alexa use speech recognition technology, while audiobooks and in-car navigation systems rely on text-to-speech systems.

Given the wide range of applications and the rapidly evolving nature of deep learning, a comprehensive review paper that surveys the current state-of-the-art techniques and their applications in speech processing is necessary. Such a paper can help researchers and practitioners stay up-to-date with the latest developments and trends and provide insights into potential areas for future research. However, to the best of our knowledge, no current work covers a broad spectrum of speech-processing tasks.

A review paper on deep learning for speech processing can also be a valuable resource for beginners interested in learning about the field. It can provide an overview of the fundamental concepts and techniques used in deep learning for speech processing and help them gain a deeper understanding of the field. While some survey papers focus on specific speech-processing tasks such as speech recognition, a broad survey would cover a wide range of other tasks such as speaker recognition, speech synthesis, and more. A broad survey would highlight the commonalities and differences between these tasks and provide a comprehensive view of the advancements made in the field.

2. Background
Before moving on to deep neural architectures, we discuss basic terms used in speech processing, low-level representations of speech signals, and traditional models used in the field.

2.1. Speech Signals
Signal processing is a fundamental discipline that encompasses the study of quantities that exhibit variations in space or time. In the realm of signal processing, a quantity exhibiting spatial or temporal variations is commonly referred to as a signal. Specifically, sound signals are defined as variations in air pressure. Consequently, a speech signal is identified as a type of sound signal, namely pressure variations, generated by humans to facilitate spoken communication. Transducers play a vital role in converting these signals from one form, such as air pressure, to another form, typically an electrical signal.

In signal processing, a signal that repetitively manifests after a fixed duration, known as a period, is classified as periodic. The reciprocal of this period represents the frequency of the signal. The waveform of a periodic signal defines its shape and concurrently determines its timbre, which pertains to the subjective perception of sound quality by humans. To facilitate the processing of speech, speech signals are commonly digitized. This entails converting them into a series of numerical values by measuring the signal's amplitude at consistent time intervals. The sampling rate, defined by the number of samples collected per second, determines the granularity of this digitization process.
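To make the notion of sampling concrete, the following minimal NumPy sketch digitizes a synthetic 440 Hz tone at a 16 kHz sampling rate (both values are illustrative choices, not prescribed by the text):

```python
import numpy as np

sample_rate = 16_000          # samples collected per second (Hz)
duration = 2.0                # seconds of audio
frequency = 440.0             # an illustrative periodic signal (A4 tone)

# Uniformly spaced sampling instants: the higher the sampling rate,
# the finer the granularity of the digitized signal.
t = np.arange(int(sample_rate * duration)) / sample_rate

# "Digitized" signal: amplitude measured at each sampling instant.
signal = 0.5 * np.sin(2 * np.pi * frequency * t)

print(signal.shape)           # (32000,) -> sample_rate * duration samples
print(1.0 / frequency)        # period of the signal in seconds (~2.27 ms)
```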
2.2. Speech Features
Speech features are numerical representations of speech signals that are used for analysis, recognition, and synthesis. Broadly, speech signals can be classified into two categories: time-domain features and frequency-domain features.

Time-domain features are derived directly from the amplitude of the speech signal over time. These are simple to compute and often used in real-time speech-processing applications. Some common time-domain features include:

• Energy: Energy is a quantitative measure of the amplitude characteristics of a speech signal over time. It is computed by squaring each sample in the signal and summing them within a specific time window. This captures the overall strength and dynamics of the signal, revealing temporal variations in intensity. The energy measure provides insights into segments with higher or lower amplitudes, aiding in speech recognition, audio segmentation, and speaker diarization. It also helps identify events and transitions indicative of changes in vocal activity. By quantifying amplitude variations, energy analysis contributes to a comprehensive understanding of speech signals and their acoustic properties.

• Zero-crossing rate: The zero-crossing rate indicates how frequently the speech signal crosses the zero-axis within a defined time frame. It is computed by counting the number of polarity changes in the signal during a specific window.

• Pitch: Pitch refers to the perceived tonal quality in a speaker's voice, which is determined by analyzing the fundamental frequency of the speech signal. The fundamental frequency can be estimated through the application of pitch detection algorithms [443] or by utilizing autocorrelation techniques [529].

• Linear predictive coding (LPC): Linear Predictive Coding (LPC) is a powerful technique that represents the speech signal as a linear combination of past samples, employing an autoregressive model. The estimation of model parameters is accomplished through methods
like the Levinson-Durbin algorithm [54]. The obtained
coefficients serve as a valuable feature representation for various speech-processing tasks.

Frequency-domain features are derived from the signal represented in the frequency domain, also known as its spectrum. A spectrum captures the distribution of energy as a function of frequency. Spectrograms are two-dimensional visual representations capturing the variations in a signal's spectrum over time. Compared against time-domain features, it is generally more complex to compute frequency-domain features as they tend to involve time-frequency transform operations such as the Fourier transform.

• Mel-spectrogram: A Mel spectrogram, also known as a Mel-frequency spectrogram or Melspectrogram, is a representation of the short-term power spectrum of a sound signal. It is widely used in audio signal processing and speech recognition tasks. It is obtained by converting the power spectrum of a speech signal into a mel-scale, which is a perceptual scale of pitches based on the human auditory system's response to different frequencies. The mel-scale divides the frequency range into a set of mel-frequency bands, with higher resolution in the lower frequencies and coarser resolution in the higher frequencies. This scale is designed to mimic the non-linear frequency perception of human hearing. To compute the Mel spectrogram, the speech signal is typically divided into short overlapping frames. For each frame, the Fast Fourier Transform (FFT) is applied to obtain the power spectrum. The power spectrum is then transformed into the mel-scale using a filterbank that converts the power values at different frequencies to their corresponding mel-frequency bands. Finally, the logarithm of the mel-scale power values is computed, resulting in the Mel spectrogram. The Mel spectrogram provides a time-frequency representation of the audio signal, where the time dimension corresponds to the frame index and the frequency dimension represents the mel-frequency bands. It captures both the spectral content and temporal dynamics of the signal, making it useful for tasks such as speech recognition, music analysis, and sound classification. By using the Mel spectrogram, the representation of the audio signal is transformed to a more perceptually meaningful domain, which can enhance the performance of various audio processing algorithms. It is particularly beneficial in scenarios where capturing the spectral patterns and frequency content of the signal is important for the analysis or classification task at hand.

• Mel-frequency cepstral coefficients (MFCCs): Mel-frequency cepstral coefficients (MFCCs) are a feature representation widely utilized in various applications such as speech recognition, gesture recognition, speaker identification, and cetacean auditory perception systems. MFCCs capture the power spectrum of a sound over a short duration by utilizing a linear cosine transform of the log power spectrum on a non-linear mel frequency scale. The MFCCs consist of a set of coefficients that collectively form a Mel-frequency cepstrum¹. With just 12 parameters related to the amplitude of frequencies, MFCCs provide an adequate number of frequency channels to analyze audio, while still maintaining a compact representation. The main objectives of MFCC extraction are to eliminate vocal fold excitation (F0) information related to pitch, ensure the independence of the extracted features, align with human perception of loudness and frequency, and capture the contextual dynamics of phones. The process of extracting MFCC features involves A/D conversion, pre-emphasis filtering, framing, windowing, Fourier transform, Mel filter bank application, logarithmic operation, discrete cosine transform (DCT), and liftering. By following these steps, MFCCs enable the extraction of informative audio features while avoiding redundancy and preserving the relevant characteristics of the sound signal.

¹ https://en.wikipedia.org/wiki/Mel-frequency_cepstrum

Other types of speech features include formant frequencies, pitch contour, cepstral coefficients, wavelet coefficients, and spectral envelope. These features can be used for various speech-processing tasks, including speech recognition, speaker identification, emotion recognition, and speech synthesis.

In the field of speech processing, frequency-based representations such as the Mel spectrogram and MFCCs are widely used since they are more robust to noise as compared to temporal variations of the sound [7]. Time-domain features can be useful when the task warrants this information (such as pauses, emotions, phoneme duration, and speech segments). It is noteworthy that time-domain and frequency-domain features tend to capture different sets of information and thus can be used in conjunction to solve a task [514, 568, 531].
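To make the features of Section 2.2 concrete, the short sketch below computes frame energy, zero-crossing rate, a log-Mel spectrogram, MFCCs, and LPC coefficients for a waveform. It assumes the librosa library and an illustrative file name speech.wav; frame, hop, and filterbank sizes are arbitrary example values.

```python
import numpy as np
import librosa

# Load a mono waveform at a 16 kHz sampling rate ("speech.wav" is a placeholder).
y, sr = librosa.load("speech.wav", sr=16_000)

frame_len, hop = 400, 160                      # 25 ms frames, 10 ms hop at 16 kHz
frames = librosa.util.frame(y, frame_length=frame_len, hop_length=hop)

# Time-domain features: per-frame energy and zero-crossing rate.
energy = np.sum(frames ** 2, axis=0)
zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=0)) > 0, axis=0)

# Frequency-domain features: log-Mel spectrogram and MFCCs.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512,
                                     hop_length=hop, n_mels=80)
log_mel = librosa.power_to_db(mel)             # shape: (80, num_frames)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)

# LPC coefficients of an autoregressive model of the signal.
lpc = librosa.lpc(y, order=12)

print(energy.shape, zcr.shape, log_mel.shape, mfcc.shape, lpc.shape)
```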
2.3. Traditional models for speech processing
Traditional speech representation learning algorithms based on shallow models utilize basic non-parametric models for extracting features from speech signals. The primary objective of these models is to extract significant features from the speech signal through mathematical operations, such as Fourier transforms, wavelet transforms, and linear predictive coding (LPC). The extracted features serve as inputs to classification or regression models. The shallow models aim to extract meaningful features from the speech signal, enabling the classification or regression model to learn and make accurate predictions.

• Gaussian Mixture Models (GMMs): Gaussian Mixture Models (GMMs) are powerful generative models employed to represent the probability distribution of a speech feature vector. They achieve this by combining multiple Gaussian distributions with different weights. GMMs have found widespread applications in speaker identification [259] and speech recognition tasks [463].
Specifically, in speaker identification, GMMs are utilized to capture the distribution of speaker-specific features, enabling the recognition of individuals based on their unique characteristics. Conversely, in speech recognition, GMMs are employed to model the acoustic properties of speech sounds, facilitating accurate recognition of spoken words and phrases. GMMs play a crucial role in these domains, enabling robust and efficient analysis of speech-related data (a minimal GMM scoring sketch is given after this list).

• Support Vector Machines (SVMs): Support Vector Machines (SVMs) are a widely adopted class of supervised learning algorithms extensively utilized for various speech classification tasks [506]. They are particularly effective in domains like speaker recognition [174, 512, 511] and phoneme recognition [52]. SVMs excel in their ability to identify optimal hyperplanes that effectively separate different classes in the feature space. By leveraging this optimal separation, SVMs enable accurate classification and recognition of speech patterns. As a result, SVMs have become a fundamental tool in the field of speech analysis and play a vital role in enhancing the performance of speech-related classification tasks.

• Hidden Markov Models (HMMs): Hidden Markov Models (HMMs) have gained significant popularity as a powerful tool for performing various speech recognition tasks, particularly ASR [149, 444]. In ASR, HMMs are employed to model the probability distribution of speech sounds by incorporating a sequential arrangement of hidden states along with corresponding observations. The training of HMMs is commonly carried out using the Baum-Welch algorithm, a variant of the Expectation-Maximization algorithm, which enables effective parameter estimation and model optimization². By leveraging HMMs in speech recognition, it becomes possible to predict the most likely sequence of speech sounds given an input speech signal. This enables accurate and efficient recognition of spoken language, making HMMs a crucial component in advancing speech recognition technology. Their flexibility and ability to model temporal dependencies contribute to their widespread use in ASR and various other speech-related applications, further enhancing our understanding and utilization of spoken language.

• The K-nearest neighbors (KNN) algorithm is a simple yet effective classification approach utilized in a wide range of speech-related applications, including speaker recognition [477] and language recognition. The core principle of KNN involves identifying the K nearest neighbors of a given input feature vector within the training data and assigning it to the class that appears most frequently among those neighbors. This algorithm has gained significant popularity due to its practicality and intuitive nature, making it a reliable choice for classifying speech data in numerous real-world scenarios. By leveraging proximity-based classification, KNN provides a straightforward yet powerful method for accurately categorizing speech samples based on their similarities to the training data. Its versatility and ease of implementation contribute to its widespread adoption in various speech-related domains, facilitating advancements in speaker recognition, language identification, and other applications in the field of speech processing.

• Decision trees: Decision trees are widely employed in speech classification tasks as a class of supervised learning algorithms. Their operation involves recursively partitioning the feature space into smaller regions, guided by the values of the features. Within each partition, a decision rule is established to assign the input feature vector to a specific class. The strength of decision trees lies in their ability to capture complex decision boundaries by hierarchically dividing the feature space. By analyzing the values of the input features at each node, decision trees efficiently navigate the classification process. This approach not only provides interpretability, but also facilitates the identification of key features contributing to the classification outcome. Through their recursive partitioning mechanism, decision trees offer a flexible and versatile framework for speech classification. They excel in scenarios where the decision rules are based on discernible thresholds or ranges of feature values. The simplicity and transparency of decision trees make them a valuable tool for understanding and solving speech-related classification tasks.

² Wikipedia: Baum-Welch algorithm: http://en.wikipedia.org/wiki/Baum%e2%80%93Welch_algorithm
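As referenced in the GMM item above, here is a minimal, hedged sketch of GMM-based speaker identification with scikit-learn: one GMM is fit per enrolled speaker on that speaker's feature frames, and a test utterance is assigned to the speaker whose model gives the highest average log-likelihood. Names such as train_mfccs and the number of mixture components are illustrative assumptions, not values from the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def enroll_speakers(train_mfccs):
    """train_mfccs: dict speaker_id -> array of shape (num_frames, num_ceps)."""
    models = {}
    for speaker, feats in train_mfccs.items():
        gmm = GaussianMixture(n_components=16, covariance_type="diag",
                              max_iter=200, random_state=0)
        gmm.fit(feats)                     # EM estimation of the mixture parameters
        models[speaker] = gmm
    return models

def identify(models, test_feats):
    """Return the speaker whose GMM best explains the test frames."""
    # score() is the average per-frame log-likelihood under each speaker model.
    scores = {spk: gmm.score(test_feats) for spk, gmm in models.items()}
    return max(scores, key=scores.get)

# Usage with random stand-in features (real systems would use MFCC frames):
rng = np.random.default_rng(0)
train = {"alice": rng.normal(0, 1, (500, 13)), "bob": rng.normal(1, 1, (500, 13))}
models = enroll_speakers(train)
print(identify(models, rng.normal(1, 1, (200, 13))))   # most likely "bob"
```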
To summarize, conventional speech representation learning algorithms based on shallow models entail feature extraction from the speech signal, which is subsequently used as input for classification or regression models. These algorithms have found extensive applications in speech processing tasks like speech recognition, speaker identification, and speech synthesis. However, they have been progressively superseded by more advanced representation learning algorithms, particularly deep neural networks, due to their enhanced capabilities.

3. Deep Learning Architectures and Their Applications in Speech Processing Tasks
Deep learning architectures have revolutionized the field of speech processing by demonstrating remarkable performance across various tasks. With their ability to automatically learn hierarchical representations from raw speech data, deep learning models have surpassed traditional approaches in areas such as speech recognition, speaker identification, and speech synthesis. These architectures have been instrumental in capturing intricate patterns,
uncovering latent features, and extracting valuable information from vast amounts of speech data. In this section, we delve into the applications of deep learning architectures in speech processing tasks, exploring their potential, advancements, and the impact they have had on the field. By examining the key components and techniques employed in these architectures, we aim to provide insights into the current state-of-the-art in deep learning for speech processing and shed light on the exciting prospects it holds for future advancements in the field.

3.1. Recurrent Neural Networks (RNNs)
It is natural to consider recurrent neural networks for various speech processing tasks since the input speech signal is inherently a dynamic process [479]. RNNs can model time-varying (sequential) patterns that were otherwise hard to capture by standard feedforward neural architectures. Initially, RNNs were used in conjunction with HMMs, where the sequential data is first modeled by HMMs while localized classification is done by the neural network. However, such a hybrid model tends to inherit the limitations of HMMs; for instance, an HMM requires task-specific knowledge and independence constraints for observed states [43]. To overcome the limitations inherited by the hybrid approach, end-to-end systems completely based on RNNs became popular for sequence transduction tasks such as speech recognition and text-to-speech [158, 246]. Next, we discuss the RNN and its variants.

3.1.1. RNN Models
Vanilla RNN. Given an input sequence of $T$ states $(x_1, \ldots, x_T)$ with $x_i \in \mathbb{R}^d$, the hidden and output states at time $t$ can be computed as
$$h_t = \mathcal{H}(W_{hh} h_{t-1} + W_{xh} x_t + b_h) \tag{1}$$
$$y_t = W_{hy} h_t + b_y \tag{2}$$
where $\mathcal{H}(\cdot)$ denotes the hidden-layer activation function.

Bidirectional RNNs. A vanilla RNN only has access to past context; this limitation is addressed by bidirectional RNNs [487]. BRNNs encode both future and past (input) context in separate hidden layers. The outputs of the two RNNs are then combined at each time step, typically by concatenating them together, to create a new, richer representation that includes both past and future context:
$$\overrightarrow{h}_t = \mathcal{H}(W_{\overrightarrow{h}\overrightarrow{h}}\, \overrightarrow{h}_{t-1} + W_{x\overrightarrow{h}}\, x_t + b_{\overrightarrow{h}}) \tag{3}$$
$$\overleftarrow{h}_t = \mathcal{H}(W_{\overleftarrow{h}\overleftarrow{h}}\, \overleftarrow{h}_{t+1} + W_{x\overleftarrow{h}}\, x_t + b_{\overleftarrow{h}}) \tag{4}$$
$$y_t = W_{\overrightarrow{h}y}\, \overrightarrow{h}_t + W_{\overleftarrow{h}y}\, \overleftarrow{h}_t + b_y \tag{5}$$
where the high-dimensional hidden states $\overrightarrow{h}_{t-1}$ and $\overleftarrow{h}_{t+1}$ model the forward context from $1, 2, \ldots, t-1$ and the backward context from $T, T-1, \ldots, t+1$, respectively.

Long Short-Term Memory. Vanilla RNNs are observed to face another limitation, that is, vanishing gradients, which do not allow them to learn from long-range context information. To overcome this, a variant of the RNN, named LSTM, was specifically designed to address the vanishing gradient problem and enable the network to selectively retain (or forget) information over longer periods of time [187]. This attribute is achieved by maintaining separate purpose-built memory cells in the network: the long-term memory cell $c_t$ and the short-term memory cell $h_t$. The LSTM redefines the operator $\mathcal{H}$ in Equation (1) in terms of a forget gate $f_t$, an input gate $i_t$, and an output gate $o_t$:
$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i) \tag{6}$$
$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f) \tag{7}$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \tag{8}$$
$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o) \tag{9}$$
$$h_t = o_t \odot \tanh(c_t) \tag{10}$$
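As a concrete illustration of Equations (1)-(2) and (6)-(10), the following NumPy sketch implements a single vanilla-RNN step and a single LSTM step; the weight shapes, the random initialization, and the choice of tanh for the activation $\mathcal{H}$ are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d, h = 40, 64                                  # input and hidden sizes (illustrative)
rng = np.random.default_rng(0)
W = {name: rng.normal(0, 0.1, size=shape) for name, shape in {
    "hh": (h, h), "xh": (h, d), "hy": (h, h),           # vanilla RNN, Eqs. (1)-(2)
    "xi": (h, d), "hi": (h, h), "ci": (h, h),           # LSTM gates, Eqs. (6)-(10)
    "xf": (h, d), "hf": (h, h), "cf": (h, h),
    "xc": (h, d), "hc": (h, h),
    "xo": (h, d), "ho": (h, h), "co": (h, h)}.items()}
b = {k: np.zeros(h) for k in ["h", "y", "i", "f", "c", "o"]}

def rnn_step(x_t, h_prev):
    h_t = np.tanh(W["hh"] @ h_prev + W["xh"] @ x_t + b["h"])      # Eq. (1)
    y_t = W["hy"] @ h_t + b["y"]                                   # Eq. (2)
    return h_t, y_t

def lstm_step(x_t, h_prev, c_prev):
    i = sigmoid(W["xi"] @ x_t + W["hi"] @ h_prev + W["ci"] @ c_prev + b["i"])  # (6)
    f = sigmoid(W["xf"] @ x_t + W["hf"] @ h_prev + W["cf"] @ c_prev + b["f"])  # (7)
    c = f * c_prev + i * np.tanh(W["xc"] @ x_t + W["hc"] @ h_prev + b["c"])    # (8)
    o = sigmoid(W["xo"] @ x_t + W["ho"] @ h_prev + W["co"] @ c + b["o"])       # (9)
    h_t = o * np.tanh(c)                                                       # (10)
    return h_t, c

x_t = rng.normal(size=d)
h_t, y_t = rnn_step(x_t, np.zeros(h))
h_t, c_t = lstm_step(x_t, np.zeros(h), np.zeros(h))
print(y_t.shape, h_t.shape, c_t.shape)
```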
3.2. Convolutional Neural Networks (CNNs)
Represented as a two-dimensional spectrogram, the speech signal is a suitable input for a CNN processing pipeline that requires preserving locality in both the frequency and time axes. For speech signals, modeling local correlations with CNNs is beneficial. CNNs can also effectively extract structural features from the spectrogram and reduce the complexity of the model through weight sharing. This section discusses the architecture of 1D and 2D CNNs used in various speech-processing tasks.

3.2.1. CNN Model Variants
2D CNN. Since spectrograms are two-dimensional visual representations, one can leverage CNN architectures widely used for visual data processing (images and videos) by performing convolutions in two dimensions. The mathematical equation for a 2D convolutional layer can be represented as:
$$y^{(k)}_{i,j} = \sigma\!\left(\sum_{l=1}^{L}\sum_{m=1}^{M} x^{(l)}_{i+l-1,\,j+m-1}\, w^{(k)}_{l,m} + b^{(k)}\right) \tag{14}$$

1D CNN. One-dimensional CNNs convolve along a single axis of the signal and have in some cases proven comparable or even superior to their 2D counterparts in certain applications. For example, Alsabhan [12] found that the performance of predicting emotions with a 2D CNN model was lower compared to a 1D CNN model.

1D convolution is useful in speech processing for several reasons:

• Temporal modeling: Since speech signals are sequences of amplitudes sampled over time, 1D convolution can be applied along the temporal dimension to capture temporal variations in the signal.

• Robustness to distortion and noise: Since 1D convolution allows local feature extraction, the resultant features are often resilient to global distortions of the signal. For instance, a speaker might be interrupted in the middle of an utterance; local features would still produce robust representations for those relevant spans, which is key to ASR, among many other speech processing tasks. On the other hand, speech signals are often

3.2.2. Application
CNNs have proven to be versatile tools for a range of speech-processing tasks. They have been successfully applied to speech recognition [390, 4], including in hybrid NN-HMM models for speech recognition, and can be used for multi-class classification of words [5]. In addition, CNNs have been proposed for speaker recognition in emotional speech, with a constrained CNN model presented in [498].

CNNs, both 1D and 2D, have emerged as the core building block for various speech processing models, including acoustic models [485, 162, 273] in ASR systems. For instance, in 2021, researchers from Facebook AI proposed wav2vec 2.0 [485], a hybrid ASR system based on CNNs for learning representations of raw speech signals that were then fed into a transformer-based language model. The system achieved state-of-the-art results on several benchmark datasets. Similarly, Google's VGGVox [92] used a CNN with the VGG architecture to learn speaker embeddings from Mel spectrograms, achieving state-of-the-art results in speaker recognition. CNNs have also been widely used in developing state-of-the-art speech enhancement and text-to-speech architectures. For instance, the architectures proposed in [311, 543] for the Deep Noise Suppression (DNS) challenge [459] and Google's Tacotron 2 [493] are examples of models that use CNNs as their core building blocks. In addition to traditional tasks like ASR and speaker identification, CNNs have also been applied to non-traditional speech processing tasks like emotion recognition [230], Parkinson's disease detection [224], language identification [500], and sleep apnea detection [499]. In all these tasks, the CNN extracts features from speech signals and feeds them into a task classification model.

3.2.3. Temporal Convolutional Neural Networks
Recurrent neural networks, including RNNs, LSTMs, and GRUs, have long been popular for deep-learning sequence modeling tasks. They are especially favored in the speech-processing domain. However, recent studies have revealed that certain CNN architectures can achieve state-of-the-art accuracy in tasks such as audio synthesis, word-level language modelling, and machine translation, as reported in [233, 234, 102]. The advantage of convolutional neural networks is that they enable faster training by allowing parallel computation. They can avoid common issues associated with recurrent models, such as the vanishing or exploding gradient problem or the inability to retain long-term memory.

In a recent study, Bai et al. [30] proposed a generic Temporal Convolutional Neural Network (TCNN) architecture that can be applied to various speech-related tasks. This architecture combines the best practices of modern CNNs and has demonstrated comparable performance to recurrent architectures such as LSTMs and GRUs. The TCN approach could revolutionize speech processing by providing an alternative to the widely used recurrent neural network models.

Figure 2: TCNNs leverage causal and dilated convolutions to model temporal dependencies in sequential data. Causal convolutions ensure that future information is not used during training, while dilated convolutions increase the receptive field without increasing computational complexity. This makes TCNNs an effective and efficient solution for a wide range of tasks, including speech recognition, action recognition, and music analysis.

3.2.4. TCNN Model Variants
The architecture of the TCNN is based upon two principles: (1) there is no information "leakage" from future to past; and (2) the architecture can map an input sequence of any length to an output sequence of the same length, similar to an RNN. The TCN consists of dilated, causal 1D fully-convolutional layers with the same input and output lengths to satisfy the above conditions. In other words, a TCNN is simply a 1D fully-convolutional network (FCN) with causal convolutions, as shown in Figure 2.

• Causal Convolution [404]: Causal convolution convolves the input at a specific time point $t$ solely with the temporally-prior elements.

• Dilated Convolution [629]: By itself, causal convolution filters have a limited range of perception, meaning they can only consider a fixed number of elements in the past. Therefore, it is challenging to learn any dependency between temporally distant elements for longer sequences. Dilated convolution ameliorates this limitation by repeatedly applying dilating filters to expand the range of perception, as shown in Figure 2. The dilation is achieved by uniformly inserting zeros between the filter weights.

Consider a 1-D sequence $x \in \mathbf{R}^n$ and a filter $f: \{0, \ldots, k-1\} \rightarrow \mathbf{R}$; the dilated convolution operation $F$ on an element $s$ of the sequence can be written as $F(s) = \sum_{i=0}^{k-1} f(i)\, x_{s - d \cdot i}$, where $d$ is the dilation factor and $k$ is the filter size.
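The following PyTorch sketch shows one way to realize the causal, dilated 1D convolutions described above: the input is left-padded by (k-1)·d frames so that the output at time t depends only on inputs at times up to t. Channel sizes, kernel size, and dilation factors are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalDilatedConv1d(nn.Module):
    """1D convolution that never looks at future frames (TCN-style)."""
    def __init__(self, in_ch, out_ch, kernel_size=3, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation   # receptive field minus one
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                 # x: (batch, channels, time)
        x = F.pad(x, (self.left_pad, 0))  # pad on the left only -> causality
        return self.conv(x)               # output length equals input length

# Stacking layers with dilations 1, 2, 4, ... grows the receptive field
# exponentially, matching the dilated-convolution description above.
tcn = nn.Sequential(
    CausalDilatedConv1d(80, 128, dilation=1), nn.ReLU(),
    CausalDilatedConv1d(128, 128, dilation=2), nn.ReLU(),
    CausalDilatedConv1d(128, 128, dilation=4), nn.ReLU(),
)

frames = torch.randn(8, 80, 200)          # e.g. a batch of 80-dim log-Mel frames
print(tcn(frames).shape)                  # torch.Size([8, 128, 200])
```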
[Figure: Scaled dot-product attention (MatMul, Scale, optional Mask, Softmax, MatMul over Q, K, V) and multi-head attention, in which M heads of scaled dot-product attention over linearly projected queries, keys, and values are concatenated and passed through a final linear layer to produce MultiHeadAttn(Q, K, V).]

Multi-head self-attention is used as a module for building a deep model. For example, each encoder block output can be defined as follows:
$$H' = \mathrm{LayerNorm}(\mathrm{SelfAttention}(X) + X) \tag{20}$$
$$H = \mathrm{LayerNorm}(\mathrm{FFN}(H') + H') \tag{21}$$
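A minimal PyTorch sketch of the encoder block defined by Equations (20)-(21), using the post-norm residual connections as written there; the model width, number of heads, and FFN size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One Transformer encoder block: Eq. (20) self-attention, Eq. (21) FFN."""
    def __init__(self, d_model=256, n_heads=4, d_ff=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                         # x: (batch, time, d_model)
        attn_out, _ = self.attn(x, x, x)          # self-attention: Q = K = V = X
        h_prime = self.norm1(attn_out + x)        # Eq. (20)
        h = self.norm2(self.ffn(h_prime) + h_prime)   # Eq. (21)
        return h

frames = torch.randn(8, 120, 256)                 # e.g. 120 encoded speech frames
print(EncoderBlock()(frames).shape)               # torch.Size([8, 120, 256])
```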
combine word vector embeddings and positional encodings, discrete tokens as input, necessitating using a tokenizer or a
which are subsequently subjected to a sequence of encoders speech recognition system, introducing errors and noise. Fur-
and decoders. These fundamental differences between RNNs thermore, pre-training on large-scale text corpora can lead to
and Transformers establish the latter as a promising option domain mismatch problems when processing speech data. To
for various natural language processing tasks [244]. address these limitations, dedicated frameworks have been
A comparative study on transformer vs. RNN [244] in developed for learning speech representations using trans-
speech applications found that transformer neural networks formers, including wav2vec [485], data2vec [24], Whisper
achieve state-of-the-art performance in neural machine trans- [445], VALL-E [562], Unispeech [565], SpeechT5 [16] etc.
lation and other natural language processing applications We discuss some of them as follows.
[244]. The study compared and analysed transformer and
conventional RNNs in a total of 15 ASR, one multilingual • Speech representation learning frameworks, such as
ASR, one ST, and two TTS applications. The study found wav2vec, have enabled significant advancements in
that transformer neural networks outperformed RNNs in most speech processing tasks. One recent framework, w2v-
applications tested. Another survey of transformer-based BERT [585], combines contrastive learning and MLM
models in speech processing found that transformers have to achieve self-supervised speech pre-training on dis-
an advantage in comprehending speech, as they analyse the crete tokens. Fine-tuning wav2vec models with lim-
entire sentence simultaneously, whereas RNNs process input ited labeled data has also been demonstrated to achieve
words one by one. state-of-the-art results in speech recognition tasks [25].
Transformers have been successfully applied in end-to- Moreover, XLS-R [20], another model based on wav2vec
end speech processing, including automatic speech recogni- 2.0, has shown state-of-the-art results in various tasks,
tion (ASR), speech translation (ST), and text-to-speech (TTS) domains, data regimes, and languages, by leveraging
[309]. In 2018, the Speech-Transformer was introduced as a multilingual data augmentation and contrastive learn-
no-recurrence sequence-to-sequence model for speech recog- ing techniques on a large scale. These models learn
nition. To reduce the dimension difference between input universal speech representations that can be transferred
and output sequences, the model’s architecture was modified across languages and domains, thus representing a sig-
by adding convolutional neural network (CNN) layers before nificant advancement in speech representation learn-
feeding the features to the transformer. In a later study [388], ing.
the authors proposed a method to improve the performance • Transformers have been increasingly popular in the
of end-to-end speech recognition models based on transform- development of frameworks for learning representa-
ers. They integrated the connectionist temporal classification tions from multi-modal data, such as speech, images,
(CTC) with the transformer-based model to achieve better and text. Among these frameworks, Data2vec [24] is
accuracy and used language models to incorporate additional a self-supervised training approach that aims to learn
context and mitigate recognition errors. joint representations to capture cross-modal correla-
In addition to speech recognition, the transformer model tions and transfer knowledge across modalities. It has
has shown promising results in TTS applications. The trans- outperformed other unsupervised methods for learn-
former based TTS model generates mel-spectrograms, fol- ing multi-modal representations in benchmark datasets.
lowed by a WaveNet vocoder to output the final audio results However, for tasks that require domain-specific models,
[309]. Several neural network-based TTS models, such as such as speech recognition or speaker identification,
Tacotron 2, DeepVoice 3, and transformer TTS, have outper- domain-specific models may be more effective, partic-
formed traditional concatenative and statistical parametric ularly when dealing with data in specific domains or
approaches in terms of speech quality [309, 493, 428]. languages. The self-supervised training approach of
One of the strengths of Transformer-based architectures Data2vec enables cost-effective and scalable learning
for neural speech synthesis is their high efficiency while con- of representations without requiring labeled data, mak-
sidering the global context [162, 494]. The Transformer TTS ing it a promising framework for various multi-modal
model has shown advantages in training and inference effi- learning applications.
ciency over RNN-based models such as Tacotron 2 [493].
The efficiency of the Transformer TTS network can speed • The field of speech recognition has undergone a revolu-
up the training about 4.25 times [309]. Moreover, Multi- tionary change with the advent of the Whisper model
Speech, a multi-speaker TTS model based on the Transformer [445]. This innovative solution has proven to be highly
[309], has demonstrated the effectiveness of synthesizing a versatile, providing exceptional accuracy for various
more robust and better quality multi-speaker voice than naive speech-related tasks, even in challenging environments.
Transformer-based TTS. The Whisper model achieves its outstanding perfor-
In contrast to the strengths of Transformer-based architec- mance through a minimalist approach to data pre-processing
tures in neural speech synthesis, large language models based and weak supervision, which allows it to deliver state-
on Transformers such as BERT [109], GPT [446], XLNet of-the-art results in speech processing. The model is
[618], and T5 [450] have limitations when it comes to speech capable of performing multilingual speech recogni-
processing. One of the issues is that these models require tion, translation, and language identification, thanks to
Figure 4: Timeline highlighting notable large Transformer models developed for speech processing, along with their corresponding parameter sizes (ranging from tens of millions to several billion parameters).
its training on a diverse audio dataset. Its multitask- ing a significant advancement in TTS technology.
ing model can cater to various speech-related tasks,
The timeline highlights the development of large trans-
such as transcription, voice assistants, education, en-
former based models for speech processing is shown in Fig-
tertainment, and accessibility. One of the unique fea-
ure 4. The size of the models has grown exponentially, with
tures of Whisper is its minimalist approach to data
significant breakthroughs achieved in speech recognition,
pre-processing, which eliminates the need for signifi-
synthesis, and translation. These large models have set new
cant standardization and simplifies the speech recogni-
performance benchmarks in the field of speech processing,
tion pipeline. The resulting models generalize well to
but also pose significant computational and data requirements
standard benchmarks and deliver competitive perfor-
for training and inference.
mance without fine-tuning, demonstrating the potential
of advanced machine learning techniques in speech 3.4. Conformer
processing.
The Transformer architecture, which utilizes a self-attention
• Text-to-speech synthesis has been a topic of interest for mechanism, has successfully replaced recurrent operations in
many years, and recent advancements have led to the previous architectures. Over the past few years, various Trans-
development of new models such as VALL-E [562]. former variants have been proposed [162]. Architectures
VALL-E is a novel text-to-speech synthesis model that combining Transformers and CNNs have recently shown
has gained significant attention due to its unique ap- promising results on speech-processing tasks [582]. To ef-
proach to the task. Unlike traditional TTS systems, ficiently model both local and global dependencies of an
VALL-E treats the task as a conditional language mod- audio sequence, several attempts have been made to com-
elling problem and leverages a large amount of semi- bine CNNs and Transformers. One such architecture pro-
supervised data to train a generalized TTS system. It posed by the authors is the Conformer [162], a convolution-
can generate high-quality personalized speech with augmented transformer for speech recognition. Conformer
a 3-second acoustic prompt from an unseen speaker outperforms RNNs, previous Transformers, and CNN-based
and provides diverse outputs with the same input text. models, achieving state-of-the-art performance in speech
VALL-E also preserves the acoustic environment and recognition. The Conformer model consists of several build-
the speaker’s emotions about the acoustic prompt, with- ing blocks, including convolutional layers, self-attention lay-
out requiring additional structure engineering, pre- ers, and feedforward layers. The architecture of the Con-
designed acoustic features, or fine-tuning. Further- former model can be summarized as follows:
more, VALL-E X [659] is an extension of VALL-E • Input Layer: The Conformer model inputs a sequence
that enables cross-lingual speech synthesis, represent- of audio features, such as MFCCs or Mel spectrograms.
• Convolutional Layers: Local features are extracted has demonstrated exceptional performance in elderly speech
from the audio signal by processing the input sequence recognition and shows promise for the clinical diagnosis and
through convolutional layers. treatment of Alzheimer’s disease.
Several enhancements have been made to the Conformer-
• Self-Attention Layers: The Conformer model incorpo- based model to address high word error rates without a lan-
rates self-attention layers following the convolutional guage model, as documented in [336]. Wu [598] proposed
layers. Self-attention is a mechanism that enables the a deep sparse Conformer to improve its long-sequence rep-
model to focus on various sections of the input se- resentation capabilities. Furthermore, Burchi and Timofte
quence while making predictions. This is especially [49] have recently enhanced the noise robustness of the Effi-
advantageous for speech recognition because it facil- cient Conformer architecture by processing both audio and
itates capturing long-term dependencies in the audio visual modalities. In addition, models based on Conformer,
signal. such as Transducers [252], have been adopted for real-time
• Feedforward Layers: After the self-attention layers, speech recognition [414] due to their ability to process audio
the Conformer model applies a sequence of feedfor- data much more quickly than conventional recurrent neural
ward layers intended to process the output of the self- network (RNN) models.
attention layers further and ready it for the ultimate
3.5. Sequence to Sequence Models
prediction.
The sequence-to-sequence (seq2seq) model in speech
• Output Layer: Finally, the output from the feedfor- processing is popularly used for ASR, ST, and TTS tasks.
ward layers undergoes a softmax activation function to The general architecture of the seq2seq model involves an
generate the final prediction, typically representing a encoder-decoder network that learns to map an input sequence
sequence of character labels or phonemes. to an output sequence of varying lengths. In the case of ASR,
the input sequence is the speech signal, which is processed
The conformer model has emerged as a promising neu- by the encoder network to produce a fixed-length feature
ral network architecture for various speech-related research vector representation of the input signal. The decoder network
tasks, including but not limited to speech recognition, speaker inputs this feature vector and produces the corresponding text
recognition, and language identification. In a recent study by sequence. This can be achieved through a stack of RNNs
Gulati et al. [162], the conformer model was demonstrated [436], Transformer [116] or Conformer [162] in the encoder
to outperform previous state-of-the-art models, particularly and decoder networks.
in speech recognition significantly. This highlights the po- The sequence-to-sequence model has emerged as a potent
tential of the conformer model as a key tool for advancing tool in speech translation. It can train end-to-end to efficiently
speech-related research. map speech spectrograms in one language to their correspond-
ing spectrograms in another. The notable advantage of this
3.4.1. Application approach is eliminating the need for an intermediate text rep-
The Conformer model stands out among other speech resentation, resulting in improved efficiency. Additionally,
recognition models due to its ability to efficiently model both the Seq2seq models have been successfully implemented
local and global dependencies of an audio sequence. This in speech generation tasks, where they reverse the ASR ap-
is crucial for speech recognition, language translation, and proach. In such applications, the input text sequence serves as
audio classification [1, 162, 2]. The model achieves this the input, with the encoder network creating a feature vector
through self-attention and convolution modules, combining representation of the input text. The decoder network then
the strengths of CNNs and Transformers. While CNNs cap- leverages this representation to generate the desired speech
ture local information in audio sequences, the self-attention signal.
mechanism captures global dependencies [2]. The Conformer Karita et al. [244] conducted an extensive study com-
model has achieved remarkable performance in speech recog- paring the performance of transformer and traditional RNN
nition tasks, setting benchmarks on datasets such as Lib- models on 15 different benchmarks for Automatic Speech
riSpeech and AISHELL-1. Recognition (ASR), including a multilingual ASR bench-
Despite these successes, speech synthesis and recognition mark, a Speech Translation (ST) benchmark, and two Text-
challenges persist, including difficulties generating natural- to-Speech (TTS) benchmarks. In addition, they proposed
sounding speech in non-English languages and real-time a shared Sequence-to-Sequence (S2S) architecture for AST,
speech generation. To address these limitations, Wang et TTS, and ST tasks, which is depicted in Figure 5.
al. [658] proposed a novel approach that combines noisy stu-
dent training with SpecAugment and large Conformer models • Encoder
pre-trained on the Libri-Light dataset using the wav2vec 2.0 𝑋0 = Encoder−PreNet(𝑋),
pre-training method. This approach achieved state-of-the-art (22)
𝑋𝑒 = Encoder−Main(𝑋0 )
word error rates on the LibriSpeech dataset. Recently, Wang
et al. [575] developed Conformer-LHUC, an extension of the where 𝑋 is the sequence of speech features (e.g. Mel
Conformer model that employs learning hidden unit contri- spectrogram) for AST and ST and phoneme or charac-
bution (LHUC) for speaker adaptation. Conformer-LHUC ter sequence for TTS.
[Figure 5 (shared S2S architecture): task-specific training losses — ASR: CE, CTC; ST: CE; TTS: L1, L2, BCE.]

...of ASR has seen significant progress, with several advanced techniques emerging as popular options. These include the CTC approach, which has been further developed and improved.
diarization, and speech enhancement tasks in the speech field. accurately. In this case, the system receives an audio input and
One of the significant benefits of using RL for speech tasks outputs a text sequence corresponding to the spoken words.
is its ability to learn directly from raw audio data, eliminat- The environmental states might be learned from the input
ing the need for hand-engineered features. This can result audio features. The actions might be the generated phonemes.
in better performance compared to traditional methods that The reward could be the similarity between the generated and
rely on feature extraction. By capturing intricate patterns gold phonemes, quantified in edit distance. Several works
and relationships in the audio data, RL-based speech systems have also achieved promising results for non-native speech
have the potential to enhance accuracy and robustness. recognition [448]
DRL pre-training has shown promise in reducing training
3.6.1. Basic Models time and enhancing performance in various Human-Computer
The utilization of deep reinforcement learning (DRL) in Interaction (HCI) applications, including speech recognition
speech processing involves the environment (a set of states [453]. Recently, researchers have suggested using a reinforce-
𝑆), agent, actions (𝐴), and reward (𝑟). The semantics of ment learning algorithm to develop a Speech Enhancement
these components depends on the task at hand. For instance, (SE) system that effectively improves ASR systems. However,
in ASR tasks, the environment can be composed of speech ASR systems are often complicated and composed of non-
features, the action can be the choices of phonemes, and the differentiable units, such as acoustic and language models.
reward could be the correctness of those phonemes given the Therefore, the ASR system’s recognition outcomes should be
input. Audio signals are one-dimensional time-series signals employed to establish the objective function for optimizing
that undergo pre-processing and feature extraction procedures. the SE model. Other than ASR, SE, some studies have also
Pre-processing steps include noise suppression, silence re- focused on SER using DRL algorithms [282, 454, 243]
moval, and channel equalization, improving audio signal
quality and creating robust and efficient audio-based systems. Speaker identification Similarly, for speaker identification
Previous research has demonstrated that pre-processing im- tasks, the actions can be the speaker’s choices, and a binary
proves the performance of deep learning-based audio systems reward can be the correctness of choice.
[288].
Feature extraction is typically performed after pre-processing Speech synthesis and coding Likewise, the states can be
to convert the audio signal into meaningful and informative the input text, the actions can be the generated audio, and the
features while reducing their number. MFCCs and spectro- reward could be the similarity between the gold and generated
grams are popular feature extraction choices in speech-based mel-spectrogram.
systems [288]. These features are then given to the DRL Deep reinforcement learning has several advantages over
agent to perform various tasks depending on the application. traditional machine learning techniques. It can learn from
For instance, consider the scenario where a human speaks to raw data without needing hand-engineered features, making it
a DRL-trained machine, where the machine must act based more flexible and adaptable. It can also learn from feedback,
on features derived from audio signals. making it more robust and able to handle noisy environments.
However, deep reinforcement learning also has some chal-
• Value-based DRL: Given the state of the environment lenges that must be addressed. It requires a lot of data to train
(𝑠), a value function 𝑄 ∶ 𝑆 × 𝐴 → ℝ is learned to and can be computationally expensive. It also requires care-
estimate overall future reward 𝑄(𝑠, 𝑎) should an action ful selection of the reward function to ensure that the system
𝑎 be taken. This value function is parameterized with learns the desired behavior.
deep networks like CNN, Transformers, etc.
3.7. Graph Neural Network
• Policy-based DRL: As opposed to value-based RL, Over the past few years, the field of Graph Neural Net-
policy-based RL methods learns a policy function 𝜋 ∶ works (GNNs) has witnessed a remarkable expansion as a
𝑆 → 𝐴 that chooses the best possible action (𝑎) based widely adopted approach for analysing and learning from data
on reward. on graphs. GNNs have demonstrated their potential in various
• Model-based DRL: Unlike the previous two approaches, domains, including computer science, physics, mathematics,
model-based RL learns the dynamics of the environ- chemistry, and biology, by delivering successful outcomes.
ment in terms of the state transition probabilities, i.e., Furthermore, in recent times, the speech-processing domain
a function 𝑀 ∶ 𝑆 × 𝐴 × 𝑆 → ℝ. Given such a model, has also witnessed the growth of GNNs.
policy, or value functions are optimized.
3.7.1. Basic Models
3.6.2. Application Speech processing involves analysing and processing au-
In speech-related research, deep reinforcement learning dio signals, and GNNs can be useful in this context when we
can be used for several purposes, including: represent the audio data as a graph. In this answer, we will
explain the architecture of GNNs for speech processing. The
Speech recognition and Emotion modeling Deep rein- standard GNN pipeline is shown in Figure 6, according to
forcement learning (DRL) can be used to train speech recog- the application the GNN layer can consist of Graph Convolu-
nition systems [231, 453, 536, 88, 89] to transcribe speech tional Layers [652], Graph Attention Layers [556], or Graph
Figure 6: A standard GNN pipeline: input node and edge features are embedded, several GNN layers compute convolutional node and edge representations, and task-specific MLP prediction heads produce graph-, edge-, or node-level predictions.
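To illustrate the pipeline in Figure 6, here is a minimal NumPy sketch of a simple graph-convolution layer (mean aggregation over neighbors followed by a linear transform and non-linearity) applied to speech features arranged as graph nodes; the adjacency structure and feature sizes are illustrative assumptions, not a specific model from the text.

```python
import numpy as np

def gnn_layer(H, A, W):
    """One simple graph-convolution layer.

    H: (num_nodes, d_in) node features, e.g. frame- or segment-level embeddings.
    A: (num_nodes, num_nodes) adjacency matrix of the speech graph.
    W: (d_in, d_out) learnable weight matrix.
    """
    A_hat = A + np.eye(A.shape[0])              # add self-loops
    deg = A_hat.sum(axis=1, keepdims=True)      # node degrees
    H_agg = (A_hat @ H) / deg                   # mean-aggregate neighbor features
    return np.maximum(H_agg @ W, 0.0)           # linear transform + ReLU

rng = np.random.default_rng(0)
num_nodes, d_in, d_hid = 10, 40, 32
A = (rng.random((num_nodes, num_nodes)) < 0.3).astype(float)
A = np.maximum(A, A.T)                          # make the graph undirected
H = rng.normal(size=(num_nodes, d_in))          # initial node embeddings

H1 = gnn_layer(H, A, rng.normal(size=(d_in, d_hid)))
H2 = gnn_layer(H1, A, rng.normal(size=(d_hid, d_hid)))
graph_embedding = H2.mean(axis=0)               # readout for a graph-level MLP head
print(H2.shape, graph_embedding.shape)          # (10, 32) (32,)
```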
[Figure: the forward diffusion process gradually adds noise to clean speech, while the learned reverse process removes it step by step.]

Diffusion models keep the dimensionality of the latent variables fixed. While mostly used for image and audio synthesis, diffusion models have potential applications in speech-processing tasks, such as speech synthesis and enhancement. This section offers a comprehensive overview of the fundamental principles of diffusion models and explores their potential uses in the speech domain.

Forward diffusion process. Given clean speech data $x_0 \sim q_{data}(x_0)$, the forward process produces a sequence of increasingly noisy latents $x_1, \ldots, x_T$ through a Markov chain:
$$q(x_1, \ldots, x_T \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}). \tag{24}$$
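Equation (24) only specifies the Markov factorization; a common concrete choice (assumed here, following standard denoising-diffusion practice rather than anything stated explicitly above) is a Gaussian transition $q(x_t \mid x_{t-1}) = \mathcal{N}(\sqrt{1-\beta_t}\,x_{t-1}, \beta_t I)$ with a small noise schedule $\beta_t$. The sketch below applies it to a stand-in waveform:

```python
import numpy as np

def forward_diffusion(x0, betas, rng):
    """Sample the chain x_1, ..., x_T of Eq. (24) with Gaussian transitions."""
    xs = [x0]
    for beta_t in betas:
        x_prev = xs[-1]
        noise = rng.normal(size=x_prev.shape)
        # q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)
        xs.append(np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise)
    return xs

rng = np.random.default_rng(0)
clean_speech = np.sin(2 * np.pi * 220.0 * np.arange(16_000) / 16_000)  # stand-in x0
betas = np.linspace(1e-4, 0.05, num=50)        # illustrative noise schedule
chain = forward_diffusion(clean_speech, betas, rng)
print(len(chain), np.std(chain[0]), np.std(chain[-1]))  # x_T is much noisier than x_0
```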
DiffWave, for example, is non-autoregressive and generates high-fidelity audio for different waveform generation tasks, such as neural vocoding conditioned on a mel spectrogram, class-conditional generation, and unconditional generation. DiffWave delivers speech quality on par with the strong WaveNet vocoder [404] while synthesizing audio much faster.

Diffusion models have shown great promise in speech processing, particularly in speech enhancement [347, 489, 442, 348]. Recent advances in diffusion probabilistic models have led to the development of a new speech enhancement algorithm that incorporates the characteristics of the noisy speech signal into the diffusion and reverse processes [349]. This new algorithm is a generalized form of the probabilistic
informative and meaningful features from speech signals and outperform existing approaches. Therefore, this approach is considered a promising direction for future research in speech representation learning.

This section provides a comprehensive overview of the evolution of speech representation learning with neural networks. We will examine various techniques and architectures developed over the years, including the emergence of unsupervised representation learning methods like autoencoders, generative adversarial networks (GANs), and self-supervised representation learning frameworks. We will also examine

[Figure: a d-vector-style speaker-embedding network, in which stacked filter-bank features pass through fully connected hidden layers trained with a cross-entropy loss.]

Table 1
The table summarizes various loss functions used in training the speaker recognition models, including their formulation [91].

Loss | Category | Formulation
Triplet [486] | Metric learning [640] | $L_T = \frac{1}{N}\sum_{j=1}^{N}\max\!\big(0,\ \lVert x_{j,0}-x_{j,1}\rVert_2^2 - \lVert x_{j,0}-x_{k\neq j,1}\rVert_2^2 + m\big)$
Prototypical [507] | Metric learning [507] | $L_P = -\frac{1}{N}\sum_{j=1}^{N}\log\frac{\exp S_{j,j}}{\sum_{k=1}^{N}\exp S_{j,k}}$
Generalized end-to-end (GE2E) [561] | Metric learning [573] | $L_G = -\frac{1}{N}\sum_{j,i}\log\frac{\exp S_{j,i,j}}{\sum_{k=1}^{N}\exp S_{j,i,k}}$
Angular Prototypical | Metric learning | $L_{AP} = -\frac{1}{N}\sum_{j,i}\log\frac{\exp S_{j,i,j}}{\sum_{k=1}^{N}\exp S_{j,i,k}}$

$$\mathcal{L}_{unsup} = \frac{1}{N_U}\sum_{x_i \in X_U}\ \sum_{j=1}^{|y|} p_\theta(y_j, x_i)\,\log p_\theta(y_j, x_i) \tag{29}$$
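As an illustration of the prototypical loss in Table 1, the sketch below computes $L_P$ for a batch of speaker embeddings, using the negative squared Euclidean distance to each class prototype as the similarity $S_{j,k}$ (one common choice; the table above does not fix the form of $S$).

```python
import torch
import torch.nn.functional as F

def prototypical_loss(support, query):
    """L_P from Table 1.

    support: (num_speakers, num_support, dim) embeddings used to build prototypes.
    query:   (num_speakers, dim) one query embedding per speaker, ordered so that
             query[j] belongs to speaker j.
    """
    prototypes = support.mean(dim=1)                      # (num_speakers, dim)
    # S[j, k]: similarity of query j to prototype k (negative squared distance).
    S = -torch.cdist(query, prototypes) ** 2              # (num_speakers, num_speakers)
    targets = torch.arange(query.size(0))                 # correct prototype is k = j
    # -1/N * sum_j log( exp(S[j, j]) / sum_k exp(S[j, k]) )
    return F.cross_entropy(S, targets)

support = torch.randn(8, 5, 192)          # 8 speakers, 5 utterances each, 192-dim
query = support.mean(dim=1) + 0.1 * torch.randn(8, 192)
print(prototypical_loss(support, query).item())
```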
with the data used for training the models are outlined in
Table 2. We further discuss different generative approaches
Table 2
Summary of generative self-supervised approaches and proposed models for speech pro-
cessing with associated metrics and training Data. ASR: Automatic Speech Recognition,
PR: Phoneme Recognition. PC: Phoneme Classification, SR: Speaker Recognition, LS:
LibriSpeech.
Pre-Training Dataset
Model Reference Task (Metric)
Dataset (hours) Training Test
PC LS (360h) LS (360h) LS (test-clean)
Mockingjay [330]
SR LS (360h) LS (100h) LS (100h)
PASE [418] ASR LS (50 hr) DIRHA DIRHA
DIRHA DIRHA
PASE+ [458] ASR LS (50 hr)
CHiME-5 CHiME-5
LS (100h, 360h, 460 h, 960h) LS (100h, 360h, 460 h, 960h) LS (test-clean)
DeCoAR [326] ASR
WSJ si284 WSJ si284 LS (test-other)
Table 3
Summary of contrastive self-supervised approaches and proposed models for speech processing with associated metrics and training data. ASR: Automatic Speech Recognition, PR: Phoneme Recognition, PC: Phoneme Classification, SR: Speaker Recognition, LS: LibriSpeech, LL: LibriLight, WSJ: Wall Street Journal.
Model | Task | Pre-Training Dataset (hours) | Training | Test
CPC [405] | PC | LS (100h) | LS (100h) | LS (100h)
CPC [405] | SR | LS (100h) | LS (100h) | LS (100h)
Modified CPC [467] | PC | LS (100h, 360h), Zerospeech2017 (45h) | CV-Dataset | CV-Dataset
Bidirectional CPC [247] | ASR | WSJ (80h), LS (960h), TIMIT (5h), SSA (1h), TED3 (440h), SwitchBoard (310h) | same corpora | WSJ (test92, test93), LS (test-clean, test-other), TED3 (dev, test), SwitchBoard (eval2000)
Bidirectional CPC [247] | ASR-Multi | Audio Set (2500h), AVSpeech (3100h), CV-Dataset (430h) | same corpora | OpenSLR, ALFFA
wav2vec [485] | ASR | LS (80h/860h), LS (960h) + WSJ (si284) | WSJ (si284) | WSJ (eval92)
wav2vec [485] | PR | TIMIT | TIMIT | TIMIT
wav2vec 2.0 [26] | ASR | LS (960h), LL (60000h) | LS (960h) | LS (test-clean), LS (test-other)
wav2vec 2.0 [26] | PR | LS (960h), LL (60000h) | TIMIT | TIMIT
vq-wav2vec [25] | ASR | LS (960h) | WSJ (si284) | WSJ (eval92)
vq-wav2vec [25] | PR | LS (960h) | TIMIT | TIMIT
wav2vec-C [476] | ASR | Alexa-10k | Alexa-eval | Alexa-eval
w2v-BERT [96] | ASR | LL (60000h) | LS (960h) | LS (test, test-other, dev, dev-other)
Speech SimCLR [220] | ASR | LS (960h), WSJ (si284), TED2 | WSJ (si284) | WSJ (si284)
Speech SimCLR [220] | PR | LS (960h), WSJ (si284), TED2 | TIMIT | TIMIT
UnSpeech [381] | ASR-Multi | LL (60000h), GigaSpeech (10000h), VP (24000h) | SUPERB | SUPERB
• The direct application of BERT-type training to speech input presents challenges due to the unsegmented and unstructured nature of speech. To overcome this obstacle, a pioneering model known as Discrete BERT [23] has been developed. This model converts continuous speech input into a sequence of discrete codes, facilitating code representation learning. The discrete units are obtained from a pre-trained vq-wav2vec model [25], and they serve as both inputs and targets within a standard BERT model. The architecture of Discrete BERT, illustrated in Figure 13 (a), incorporates a softmax-normalized output layer. During training, categorical cross-entropy loss is employed, with a masked view of the original speech input used to predict the code representations. Remarkably, the Discrete BERT model has exhibited impressive efficacy in self-supervised speech representation learning. Even with a mere 10-minute fine-tuning set, it achieved a Word Error Rate (WER) of 25% on the standard test-other subset. This approach effectively tackles the challenge of directly applying BERT-type training to continuous speech input and holds substantial potential for significantly enhancing speech recognition accuracy.
• The HuBERT [196] and TERA [329] models are two self-supervised approaches for speech representation learning. HuBERT uses an offline clustering step to align target labels with a BERT-like prediction loss, with the prediction loss applied only over the masked regions, as outlined in Figure 13 (b). This encourages the model to learn a combined acoustic and language model over the continuous inputs.
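The sketch below illustrates the masked-prediction objective shared by Discrete BERT and HuBERT as described above: speech is assumed to have already been mapped to discrete unit IDs (for example by vq-wav2vec or k-means clustering), and cross-entropy is applied only at masked positions. The module names, vocabulary size, and masking rate are illustrative assumptions, not the published configurations.

```python
# Minimal sketch of masked prediction over discrete speech units (illustrative settings).
import torch
import torch.nn as nn

VOCAB, MASK_ID, DIM = 320, 0, 256

class MaskedUnitPredictor(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True), num_layers=4)
        self.head = nn.Linear(DIM, VOCAB)   # softmax-normalized output layer via the CE loss

    def forward(self, units, mask):
        x = units.masked_fill(mask, MASK_ID)            # corrupt the masked positions
        logits = self.head(self.encoder(self.embed(x)))
        # cross-entropy computed only over the masked region, as in HuBERT
        return nn.functional.cross_entropy(logits[mask], units[mask])

units = torch.randint(1, VOCAB, (2, 100))               # discrete codes for 2 utterances
mask = torch.rand(2, 100) < 0.15                        # random 15% masking (assumed rate)
loss = MaskedUnitPredictor()(units, mask)
```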
Table 4
Summary of predictive self-supervised approaches and proposed models for speech processing with associated metrics and training data. ASR: Automatic Speech Recognition, PR: Phoneme Recognition, PC: Phoneme Classification, SR: Speaker Recognition, LL: LibriLight, LS: LibriSpeech.
Model | Task (Metric) | Pre-Training Dataset (hours) | Training | Test
BEST-RQ [78] | ASR | LL (60000h) | LS (960h) | LS (test, test-other, dev, dev-other)
BEST-RQ [78] | ASR-Multi | LL (60000h), GigaSpeech (10000h), VP (24000h) | SUPERB | SUPERB
data2vec [24] | ASR | LS (960h) | LS (10m, 1h, 100h, 960h) | LS (960h)
Discrete BERT [23] | ASR | LS (960h) | LS (100h) | LS (test, test-other)
HuBERT [625] | ASR | LS (960h), LL (60000h) | LS (960h) | LS (test, test-other)
WavLM [71] | ASR | LL (60000h) | SUPERB | SUPERB
Figure 13: Predictive self-supervised learning: (a) Discrete BERT, (b) HuBERT.
On the other hand, TERA is a self-supervised speech pre-training method that reconstructs acoustic frames from their altered counterparts, using a stochastic policy to alter them along various dimensions, including time, frequency, and tasks. These alterations help extract feature-based speech representations that can be fine-tuned as part of downstream models.
Microsoft has introduced the UniSpeech-SAT [72] and WavLM [71] models, which follow the HuBERT framework. These models have been designed to enhance speaker representation and improve various downstream tasks. The key focus of these models is data augmentation during the pre-training stage, resulting in superior performance. The WavLM model has exhibited outstanding effectiveness in diverse downstream tasks, such as automatic speech recognition, phoneme recognition, speaker identification, and emotion recognition. It is worth highlighting that this model currently holds the top position on the SUPERB leaderboard [615], which evaluates speech representations across a broad range of downstream tasks.
5. Speech Processing Tasks
In recent times, the field of speech processing has gained significant attention due to its rapid evolution and its crucial role in modern technological applications. This field involves the use of diverse techniques and algorithms to analyse and understand spoken language, ranging from basic speech recognition to more complex tasks such as spoken language understanding and speaker identification. Since speech is one of the most natural forms of communication, speech processing has become a critical component of many applications such as virtual assistants, call centres, and speech-to-text transcription. In this section, we provide a comprehensive overview of the various speech-processing tasks and the techniques used to achieve them, while also discussing the current challenges and limitations faced in this field and its potential for future development.
The assessment of speech-processing models depends greatly on the calibre of the datasets employed. By utilizing standardized datasets, researchers can objectively gauge the efficacy of different approaches and identify scope for advancement.
Table 5
Comparative analysis of speech processing datasets: This table summarizes the essential
features of different speech-processing datasets, including their typical applications in various
speech-processing tasks. ASR: Automatic Speech Recognition, PR: Phoneme Recognition.
PC: Phoneme Classification, SR: Speaker Recognition, SV: Speaker Verification, SER:
Speech Emotion Recognition, IC: Intent Classification, TTS: Text-to-Speech, VC: Voice
Conversion, ST: Speech Translation, SS: Speech Separation
Dataset Language Length (hours) ASR PR PC SR SV SER IC TTS VC ST SS
TIMIT Acoustic-Phonetic Continuous Speech Corpus English 5.4
Lip Reading Sentences 2 (LRS2) English
LibriSpeech (LS) English 1000
GigaSpeech English 10000
Fleurs Multilingual 12
LibriTTS English 585
L2ARCTIC English 11.2
CMUARCTIC English 20
Wall Street Journal (WSJ) English
VoxPopuli (VP) Multilingual 1800
BABEL (BBL) Multilingual
Common Voice (CV-dataset) Multilingual 9283
CSTR VCTK English
HUB 5 English 2000
CHiME-5 English 50.12
TED-LIUM 3 (TED 3) English 452
TED-LIUM 2 (TED 2) English 118
AISHELL-1 Mandarin 520
AISHELL-3 Mandarin 85
AISHELL-4 Mandarin 120
Arabic Speech Corpus Arabic 3.7
Persian Consonant Vowel Combination Persian -
ALFFA Multilingual 5.2-18.3
OpenSLR-multi Multilingual 4.4-265.9
VCTK English 44
VoxCeleb1/2 English
Fluent Speech Commands (FSC) English 14.7
Emotional Speech Dataset (ESD) English 29
Interactive Emotional Dyadic Motion Capture (IEMOCAP) English 12
Multimodal EmotionLines Dataset ( MELD) English -
LibriSpeech En-Fr English/French -
CoVoST-2 Multilingual 2880
LibriLight (LL) English 60000
The selection of evaluation metrics plays a critical role in this process, hinging on the task at hand and the desired outcome. Therefore, it is essential that researchers conduct a meticulous appraisal of different metrics to make informed decisions. This paper offers a thorough summary of frequently utilized datasets and metrics across diverse downstream tasks, as presented in Table 5 and Table 6.
5.1. Automatic speech recognition (ASR) & conversational multi-speaker AST
5.1.1. Task Description
Automatic speech recognition (ASR) technology enables machines to convert spoken language into text or commands, serving as a cornerstone of human-machine communication and facilitating a wide range of applications such as speech-to-speech translation and information retrieval [345]. ASR involves multiple intricate steps, starting with the extraction and analysis of acoustic features, including spectral and prosodic features, which are then employed to recognize spoken words. Next, an acoustic model matches the extracted features to phonetic units, while a language model predicts the most probable sequence of words based on the recognized phonetic units. Ultimately, the acoustic and language model outcomes are merged to produce the transcription of the spoken words. Deep learning techniques have gained popularity in recent years, allowing for improved accuracy in ASR systems [26, 445]. This paper provides an overview of the key components involved in ASR and highlights the role of deep learning techniques in enhancing the technology's accuracy. Most speech recognition systems that use deep learning aim to simplify the processing pipeline by training a single model to directly map speech signals to their corresponding text transcriptions.
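As a concrete illustration of the acoustic feature extraction step described above, the sketch below computes log-mel and MFCC features with torchaudio; torchaudio is only one possible toolkit, and the window sizes, hop length, and number of bins are illustrative defaults rather than the settings of any cited system.

```python
# Minimal sketch of acoustic feature extraction (illustrative parameters).
import torch
import torchaudio

waveform = torch.randn(1, 16000)      # stand-in for one second of 16 kHz speech
sample_rate = 16000

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate, n_fft=400, hop_length=160, n_mels=80)(waveform)
log_mel = torch.log(mel + 1e-6)       # (1, 80, frames) spectral features

mfcc = torchaudio.transforms.MFCC(sample_rate=sample_rate, n_mfcc=13)(waveform)
print(log_mel.shape, mfcc.shape)
```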
Table 6
Comprehensive Evaluation Metrics for Speech Processing Tasks. This table provides a
comprehensive overview of the evaluation metrics used to assess the performance of speech-
based systems across various tasks such as ASR, speaker verification, and TTS. The table
highlights the specific metrics employed for each task, along with the score range and
commonly used datasets.
Tasks Metric Description Score range Evaluation dataset
Automatic speech recognition WER Word Error Rate 0-1 TIMIT
CER Character Error Rate 0-1 LibriSpeech
Phoneme recognition Accuracy Classification accuracy 0-1 TIMIT
Phoneme classification F1-score Harmonic mean of precision and recall 0-1 TIMIT
Speaker recognition EER Equal Error Rate 0-1 VoxCeleb1
Speaker verification FAR/FRR False Acceptance Rate / False Rejection Rate 0-1 VoxCeleb1
Speech emotion recognition Accuracy Classification accuracy 0-1 IEMOCAP, ESD
Intent classification F1-score Harmonic mean of precision and recall 0-1 ATIS, SNIPS
Text-to-speech MOS Mean Opinion Score 1-5 LJSpeech, LibriTTS
Voice conversion MOS Mean Opinion Score 1-5 VCC 2016
Speech translation BLEU Bilingual Evaluation Understudy 0-1 MuST-C
Speech separation SI-SDRi Scale-invariant Signal-to-Distortion Ratio improvement (dB) -20 to 30 WSJ0-2mix
Speech enhancement PESQ Perceptual Evaluation of Speech Quality -0.5-4.5 NOIZEUS
Voice activity detection F1-score Harmonic mean of precision and recall 0-1 QUT-NOISE
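To make the WER metric in Table 6 concrete, the following sketch computes it as the word-level Levenshtein distance between a reference and a hypothesis divided by the reference length; the function name and example strings are illustrative.

```python
# Minimal sketch of Word Error Rate (WER): edit distance over words / reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        for j in range(len(hyp) + 1):
            if i == 0 or j == 0:
                dp[i][j] = i + j
            else:
                sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)  # sub / del / ins
    return dp[-1][-1] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deletion -> 1/6
```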
Unlike traditional ASR systems that require multiple components to extract and model features, such as HMMs and GMMs, end-to-end models do not rely on hand-designed components [19, 307]. Instead, end-to-end ASR systems use DNNs to learn acoustic and linguistic representations directly from the input speech signals [307]. One popular type of end-to-end model is the encoder-decoder model with attention. This model uses an encoder network to map input audio signals to hidden representations, and a decoder network to generate text transcriptions from the hidden representations. During the decoding process, the attention mechanism enables the decoder to selectively focus on different parts of the input signal [307].
End-to-end ASR models can be trained using various techniques such as CTC [245], which is used to train models without explicit alignment between the input and output sequences, and RNNs, which are commonly used to model temporal dependencies in sequential data such as speech signals. Transfer learning-based approaches can also improve end-to-end ASR performance by leveraging pre-trained models or features [327, 106, 491]. While end-to-end ASR models have shown promising results in various applications, there is still room for improvement to achieve human-level performance [327, 625, 236, 237, 106, 137]. Nonetheless, deep learning-based end-to-end ASR architectures offer a promising and efficient approach to speech recognition that can simplify the processing pipeline and improve recognition accuracy.
5.1.2. Dataset
The development and evaluation of ASR systems are heavily dependent on the availability of large datasets. As a result, ASR is an active area of research, with numerous datasets used for this purpose. In this context, several popular datasets have gained prominence for use in ASR systems.
• Common Voice: Mozilla's Common Voice project [17] is dedicated to producing an accessible, unrestricted collection of human speech for the purpose of training speech recognition systems. This ever-expanding dataset features contributions from more than 9,000 speakers spanning 60 different languages.
• LibriSpeech: LibriSpeech [412] is a corpus of approximately 1,000 hours of read English speech created from audiobooks in the public domain. It is widely used for speech recognition research and is notable for its high audio quality and clean transcription.
• VoxCeleb: VoxCeleb [92] is a large-scale dataset containing over 1 million short audio clips of celebrities speaking, which can be used for speech and speaker recognition research. It includes a diverse range of speakers from different backgrounds and professions.
• TIMIT: The TIMIT corpus [153] is a widely used speech dataset consisting of recordings of 630 speakers representing eight major dialects of American English, each reading ten phonetically rich sentences. It has been used as a benchmark for speech recognition research since its creation in 1986.
• CHiME-5: The CHiME-5 dataset [33] is a collection of recordings made in a domestic environment to simulate a real-world speech recognition scenario. It includes 6.5 hours of audio from multiple microphone arrays and is designed to test the performance of ASR systems in noisy and reverberant environments.
Table 7
Performance of different ASR models in terms of WER% on LibriSpeech (test-clean / test-other), TIMIT, Common Voice, WSJ eval92, and GigaSpeech, also indicating whether extra training data was used. ZS stands for zero-shot performance.
LibriSpeech (test-clean / test-other): Model | Architecture | Extra training data | WER%
Conformer + Wav2vec 2.0 [658] | Conformer + wav2vec 2.0 | Y | 1.4 / 2.6
w2v-BERT XXL [96] | CNN + Transformer | Y | 1.4 / 2.5
SpeechStew (1B) [58] | Conformer | Y | 1.7 / 3.3
SpeechStew (100M) [58] | Conformer | N | 2.0 / 4.0
ContextNet + SpecAugment [415] | LSTM + CNN | Y | 1.7 / 3.4
Conformer (L) [162] | Conformer | N | 1.9 / 4.1
ContextNet [169] | Conformer + wav2vec 2.0 | N | 1.9 / 3.4
Squeezeformer [255] | Conformer | N | 2.47 / 5.97
LSTM Transducer [636] | LSTM | N | 2.23 / 5.6
Transformer Transducer [331] | Transformer | N | 2.0 / 4.2
Whisper [445] | - | N | 2.7 (ZS) / 5.6 (ZS)
TIMIT: Model | Architecture | Extra training data | WER%
wav2vec 2.0 [26] | Transformer + CNN | Y | 8.3
vq-wav2vec [25] | Transformer + CNN | Y | 11.6
LSTM + Monophone Reg [457] | LSTM | N | 14.5
Common Voice: Model | Architecture | Extra training data | WER%
SpeechStew (1B) [58] | Conformer | N | 10.8
Whisper [445] | - | N | 9.5
WSJ eval92: Model | Architecture | Extra training data | WER%
SpeechStew (100M) [58] | Conformer | N | 1.3
tdnn+chain [435] | TDNN | N | 2.32
GigaSpeech: Model | Architecture | Extra training data | WER%
Conformer/Transformer-AED [61] | Conformer | N | 10.80
Other notable datasets include Google's Speech Commands Dataset [589], the Wall Street Journal dataset4, and TED-LIUM [470].
5.1.3. Models
The use of RNN-based architectures in speech recognition has many advantages over traditional acoustic models. One of the most significant benefits is their ability to capture long-term temporal dependencies [244] in speech data, enabling them to model the dynamic nature of speech signals. Additionally, RNNs can effectively process variable-length audio sequences, which is essential in speech recognition tasks where the duration of spoken words and phrases can vary widely. RNN-based models can efficiently identify and segment phonemes, detect and transcribe spoken words, and can be trained end-to-end, eliminating the need for intermediate steps. These features make RNN-based models particularly useful in real-time applications, such as speech recognition on mobile devices or in smart homes [117, 178], where low latency and high accuracy are crucial.
In the past, RNNs were the go-to model for ASR. However, their limited ability to handle long-range dependencies prompted the adoption of the Transformer architecture. For example, in 2019, Google's Speech-to-Text API transitioned to a Transformer-based architecture that surpassed the previous RNN-based model, especially in noisy environments and for longer sentences, as reported in [651]. Additionally, Facebook AI Research introduced wav2vec 2.0, a self-supervised learning approach that leverages a Transformer-based architecture to perform unsupervised speech recognition. wav2vec 2.0 has significantly outperformed the previous RNN-based model and achieved state-of-the-art results on several benchmark datasets.
The Transformer for the ASR task was first proposed in [116], where the authors include CNN layers before feeding the preprocessed speech features to the model input. By incorporating more CNN layers, it becomes feasible to diminish the gap between the sizes of the input and output sequences, given that the number of frames in audio exceeds the number of tokens in text. This has a favorable impact on the training process. The change to the original architecture is minimal, and the model achieves a competitive word error rate (WER) of 10.9% on the Wall Street Journal (WSJ) speech recognition dataset (Table 7). Despite its numerous advantages, the Transformer in its pristine state has several issues when applied to ASR; the RNN, with its better overall training speed (i.e., convergence) and better WER owing to effective joint training and decoding methods, is still a strong option.
The authors in [116] propose the Speech Transformer, which has the advantage of faster iteration time but slower convergence compared to RNN-based ASR. However, integrating the Speech Transformer with a naive language model (LM) is challenging. To address this issue, various improvements to the Speech Transformer architecture have been proposed in recent years. For example, [245] suggests incorporating the Connectionist Temporal Classification (CTC) loss into the Speech Transformer. CTC is a popular technique used in speech recognition to align input and output sequences of varying lengths with one-to-many or many-to-one mappings. It introduces a blank symbol representing gaps between output symbols and computes the loss function by summing probabilities across all possible alignment paths. The loss function encourages the model to assign high probabilities to correct output symbols and low probabilities to incorrect output symbols and the blank symbol, allowing the model to predict sequences of varying lengths. The CTC loss is commonly used with RNNs such as LSTMs and GRUs, which are well-suited to sequential data. CTC loss is a powerful tool for training neural networks to perform sequence-to-sequence tasks where the input and output sequences have varying lengths and the mappings between them are not one-to-one. Various other improvements have also been proposed.
4 https://www.ldc.upenn.edu/
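The sketch below shows how the CTC loss described above is typically applied during training, using PyTorch's built-in torch.nn.CTCLoss; the shapes, vocabulary size, and lengths are illustrative assumptions rather than the configuration of any cited system.

```python
# Minimal sketch of CTC training (illustrative shapes and vocabulary).
import torch
import torch.nn as nn

vocab_size = 32                 # characters + 1, index 0 reserved for the CTC blank
T, N, U = 200, 4, 25            # input frames, batch size, max target length

logits = torch.randn(T, N, vocab_size, requires_grad=True)      # stand-in encoder outputs
log_probs = logits.log_softmax(dim=-1)
targets = torch.randint(1, vocab_size, (N, U))                   # token IDs (no blanks)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.randint(10, U + 1, (N,), dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)    # sums over all alignments
loss.backward()                                                  # would update a real acoustic encoder
```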
5.2. Neural Speech Synthesis
5.2.1. Task Description
Neural speech synthesis is a technology that utilizes artificial intelligence and deep learning techniques to create speech from text or other inputs. Its applications are widespread, including in healthcare, where it can be used to develop assistive technologies for those who are unable to communicate due to neurological impairments. To generate speech, deep neural networks like CNNs, RNNs, transformers, and diffusion models are trained using phonemes and the mel spectrum. The process involves several components, such as text analysis, acoustic models, and vocoders, as shown in Figure 14. Acoustic models convert linguistic features into acoustic features, which are then used by the vocoder to synthesize the final speech signal. Various architectures, including neural vocoders based on GANs like HiFi-GAN [268], are used by the vocoder to generate speech. Neural speech synthesis also enables manipulation of voice, pitch, and speed of speech signals using frameworks such as FastSpeech 2 [460] and NANSY/NANSY++ [82, 83]. These frameworks use an information bottleneck to disentangle analysis features for controllable synthesis. Research in neural speech synthesis can be classified into two prominent approaches: autoregressive and non-autoregressive models. Autoregressive models generate speech one element at a time, sequentially, while non-autoregressive models generate all the elements simultaneously, in parallel. Table 9 outlines the different architectures proposed under each category.
The evaluation of synthesized speech is of paramount importance for assessing its quality and fidelity. It serves as a means to gauge the effectiveness of different speech synthesis techniques, algorithms, and parameterization methods. In this regard, the application of statistical tests has emerged as a valuable approach to objectively measure the similarity between synthesized speech and natural speech [139]. These tests complement the traditional Mean Opinion Score (MOS) evaluations and provide quantitative insights into the performance of speech synthesis systems. Additionally, widely used objective metrics such as Mel Cepstral Distortion (MCD) and Word Error Rate (WER) contribute to the comprehensive evaluation of synthesized speech, enabling researchers and practitioners to identify areas for improvement and refine the synthesis process. By employing these objective metrics and statistical tests, the evaluation of synthesized speech becomes a rigorous and systematic process, enhancing the overall quality and fidelity of speech synthesis techniques.
5.2.2. Datasets
The field of neural speech synthesis is rapidly advancing and relies heavily on high-quality datasets for effective training and evaluation of models. One of the most frequently utilized datasets in this field is LJ Speech [217], which features about 24 hours of recorded speech from a single female speaker reading passages from the public-domain LJ Speech Corpus. This dataset is free and has corresponding transcripts, making it an excellent choice for text-to-speech synthesis tasks. Moreover, it has been used as a benchmark for numerous neural speech synthesis models, including Tacotron [583], WaveNet [404], and DeepVoice [18, 156].
Apart from the LJ Speech dataset, several other datasets are widely used in neural speech synthesis research. The CMU Arctic [267] and L2 Arctic [661] datasets contain recordings of English speakers with diverse accents reading passages designed to capture various phonetic and prosodic aspects of speech. The LibriSpeech [412], VoxCeleb [92], TIMIT Acoustic-Phonetic Continuous Speech Corpus [153], and Common Voice [17] datasets are other valuable resources that offer ample opportunities for training and evaluating text-to-speech synthesis models.
5.2.3. Models
Neural network-based text-to-speech (TTS) systems have been proposed using neural networks as the basis for speech synthesis, particularly with the emergence of deep learning. In Statistical Parametric Speech Synthesis (SPSS), early neural models replaced HMMs for acoustic modeling. The first modern neural TTS model, WaveNet [404], generated waveforms directly from linguistic features. Other models, such as DeepVoice 1/2 [18, 156], used neural network-based models to follow the three components of statistical parametric synthesis. End-to-end models, including Tacotron 1 & 2 [583, 493], Deep Voice 3, and FastSpeech 1 & 2 [462, 460], simplified the text analysis modules and used mel-spectrograms as simplified acoustic features with character/phoneme sequences as input. Fully end-to-end TTS systems, such as ClariNet [427], FastSpeech 2s [460], and EATS [114], are capable of directly generating waveforms from text inputs. Compared to concatenative synthesis 7 and statistical parametric synthesis, neural network-based speech synthesis offers several advantages, including superior voice quality, naturalness, intelligibility, and reduced reliance on human preprocessing and feature development. Therefore, end-to-end TTS systems represent a promising direction for advancing the field of speech synthesis.
Transformer models have become increasingly popular for generating mel-spectrograms in TTS systems [460, 309]. These models are preferred over RNN structures in end-to-end TTS systems because they improve training and inference efficiency [462, 309]. In a study conducted by Li et al. [309], a multi-head attention mechanism replaced both the RNN structures and the vanilla attention mechanism in Tacotron 2 [493]. This approach addressed the long-distance dependency problem and improved training parallelization. Phoneme sequences were used as input to generate the mel-spectrogram, and speech samples were synthesized using WaveNet as a vocoder. Results showed that the transformer-based TTS approach was 4.25 times faster than Tacotron 2 and achieved similar MOS (Mean Opinion Score) performance.
Aside from the work mentioned above, there are other studies that are based on the Tacotron architecture. For example, Skerry-Ryan et al. [505] and Wang et al. [584] proposed Tacotron-based models for prosody control.
7 https://en.wikipedia.org/wiki/Concatenative_synthesis
Table 9
Exploring the landscape of TTS and vocoder architectures: autoregressive and non-autoregressive models.
Autoregressive models
Text-to-Speech: Tacotron [583], Tacotron 2 [493], Deep Voice 1/2/3, Transformer-TTS [309], DurIAN [627], Flowtron [550], RobuTrans [310], DeviceTTS [211], Wave-Tacotron [590], Apple TTS [9]
Vocoder: WaveNet [404], WaveRNN [232], WaveGAN [422], LPCNet [548], GAN-TTS [38], MultiBand-WaveRNN [627], ImprovedLPCNet [547], Bunched LPCNet2 [416]
Non-autoregressive models
Text-to-Speech: ParaNet [423], FastSpeech [462], JDI-T [317], EATS [115], FastSpeech 2 [460], FastPitch [284], Glow-TTS [250], Flow-TTS [376], SpeedySpeech [546], Parallel Tacotron [126], BVAE-TTS [296], Parallel Tacotron 2 [126], Grad-TTS [433], VITS [251], RAD-TTS [495], WaveGrad 2 [69], DelightfulTTS [342], PortaSpeech [461], DiffGAN-TTS [340], JETS [318], WavThruVec [504], FastDiff [204], CLONE [343]
Vocoder: Parallel WaveNet [403], WaveGlow [437], Parallel WaveGAN [608], MelGAN [275], MultiBand-MelGAN [612], VocGAN [614], WaveGrad [67], DiffWave [269], HiFi-GAN [268], StyleMelGAN [386], Fre-GAN [254], iSTFTNet [241], Avocodo [32]
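The following toy sketch contrasts the two families of Table 9: an autoregressive decoder emits one acoustic frame at a time conditioned on its own history, while a non-autoregressive decoder emits all frames in parallel. The two linear "decoders", the frame counts, and the crude attention summary are all illustrative stand-ins, not real TTS components.

```python
# Minimal sketch: autoregressive vs. non-autoregressive mel generation (toy modules).
import torch
import torch.nn as nn

text_hidden = torch.randn(1, 50, 256)          # encoded phoneme sequence (assumed)
n_frames, n_mels = 200, 80

# Autoregressive: frame t depends on frame t-1 (sequential loop).
ar_step = nn.Linear(256 + n_mels, n_mels)
frames, prev = [], torch.zeros(1, n_mels)
for t in range(n_frames):
    ctx = text_hidden.mean(dim=1)              # crude stand-in for attention over the text
    prev = ar_step(torch.cat([ctx, prev], dim=-1))
    frames.append(prev)
ar_mel = torch.stack(frames, dim=1)            # (1, 200, 80), generated step by step

# Non-autoregressive: all frames predicted at once from upsampled text states.
nar_decoder = nn.Linear(256, n_mels)
upsampled = text_hidden.repeat_interleave(n_frames // 50, dim=1)
nar_mel = nar_decoder(upsampled)               # (1, 200, 80), generated in parallel
```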
These models use a separate encoder to compute style information from reference audio that is not provided in the text. Another noteworthy work is the Global Style Token (GST) model [584], which improves on style embeddings by adding an attention layer to capture a wider range of acoustic styles.
The FastSpeech [462] algorithm aims to improve the inference speed of TTS systems. To achieve this, it utilizes a feedforward network based on 1D convolution and the self-attention mechanism of transformers to generate mel-spectrograms in parallel. Additionally, it solves the issue of sequence-length mismatch between the mel-spectrogram sequence and its corresponding phoneme sequence by employing a length regulator based on a duration predictor. The FastSpeech model was evaluated on the LJSpeech dataset and demonstrated significantly faster mel-spectrogram generation than the autoregressive transformer model while maintaining comparable performance. FastPitch builds on FastSpeech by conditioning the TTS model on the fundamental frequency or pitch contour, which improves convergence and eliminates the need for knowledge distillation of mel-spectrogram targets in FastSpeech.
FastSpeech 2 [460] represents a transformer-based Text-to-Speech (TTS) system that addresses the limitations of its predecessor, FastSpeech, while effectively handling the challenging one-to-many mapping problem in TTS. It introduces the utilization of a broader range of speech information, including energy, pitch, and more accurate duration, as conditional inputs. Furthermore, FastSpeech 2 trains the system directly on a ground-truth target, enhancing the quality of the synthesized speech. Additionally, a simplified variant called FastSpeech 2s has been proposed in [61], eliminating the requirement for intermediate mel-spectrograms and enabling the direct generation of speech from text during inference. Experimental evaluations conducted on the LJSpeech dataset demonstrated that both FastSpeech 2 and FastSpeech 2s offer a streamlined training pipeline, resulting in fast, robust, and controllable speech synthesis compared to FastSpeech.
Furthermore, in addition to the transformer-based TTS systems like FastSpeech 2 and FastSpeech 2s, researchers have also been exploring the potential of Variational Autoencoder (VAE) based TTS models [296, 195, 163, 251]. These models can learn a latent representation of speech signals from textual input and may be able to produce high-quality speech with less training data and greater control over the generated speech characteristics. For example, the authors in [251] used a conditional variational autoencoder (CVAE) to model the acoustic features of speech and an adversarial loss to improve the naturalness of the generated speech. This approach involved conditioning the CVAE on the linguistic features of the input text and using an adversarial loss to match the distribution of the generated speech to that of natural speech. Results from this method have shown promise in generating speech that exhibits natural prosody and intonation.
WaveGrad [67] and DiffWave [269] have emerged as significant contributions in the field, employing diffusion models to generate raw waveforms with exceptional performance. In contrast, Grad-TTS [433] and DiffTTS [218] utilize diffusion models to generate mel features rather than raw waveforms. Addressing the intricate challenge of one-shot many-to-many voice conversion, DiffVC [434] introduces a novel solver based on stochastic differential equations. Expanding the scope of sound generation to include singing voice synthesis, DiffSinger [334] introduces a shallow diffusion mechanism. Additionally, Diffsound [611] proposes a sound generation framework that incorporates text conditioning and employs a discrete diffusion model, effectively resolving concerns related to unidirectional bias and accumulated errors.
EdiTTS [527] introduces a diffusion-based audio model that is specifically tailored for the text-to-speech task. Its innovative approach involves the utilization of the denoising reversal process to incorporate desired edits through coarse perturbations in the prior space. Similarly, Guided-TTS [249] and Guided-TTS 2 [257] stand as early text-to-speech models that have effectively harnessed diffusion models for sound generation. Furthermore, Levkovitch et al. [301] have made notable contributions by combining a voice diffusion model
Figure 14: Neural Text-to-speech (TTS) pipeline: a diagram showing the main modules of a typical TTS system. The system
takes text input and processes it through various stages to generate speech output. The text analysis module tokenizes the input
text and generates linguistic features such as phonemes and prosody. The acoustic model module then converts these linguistic
features into acoustic features, such as mel spectrograms, using a neural network. Finally, the waveform generation module
synthesizes the speech waveform from the acoustic features using another neural network.
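The sketch below mirrors the three-stage pipeline of Figure 14 (text analysis, acoustic model, waveform generation). The grapheme handling, module classes, and dimensions are toy placeholders meant only to show how the stages connect, not a real TTS implementation.

```python
# Minimal sketch of the TTS pipeline in Figure 14 with placeholder modules.
import torch
import torch.nn as nn

def text_analysis(text: str) -> torch.Tensor:
    """Tokenize text into linguistic features (here: naive character IDs)."""
    return torch.tensor([[ord(c) % 100 for c in text.lower()]])

class AcousticModel(nn.Module):
    """Maps linguistic features to acoustic features (e.g., an 80-bin mel spectrogram)."""
    def __init__(self, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(100, 128)
        self.proj = nn.Linear(128, n_mels)
    def forward(self, tokens):
        return self.proj(self.embed(tokens))          # (batch, frames, n_mels)

class Vocoder(nn.Module):
    """Converts acoustic features into a waveform (here: a toy upsampling network)."""
    def __init__(self, n_mels=80, hop=256):
        super().__init__()
        self.up = nn.Linear(n_mels, hop)
    def forward(self, mel):
        return self.up(mel).flatten(1)                # (batch, samples)

tokens = text_analysis("hello world")
mel = AcousticModel()(tokens)
waveform = Vocoder()(mel)
```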
To resolve output diversity issues in parallel TTS architectures, normalizing flow has been introduced to model the duration of speech [250, 495, 377]. A notable flow-based generative model is Glow-TTS [250], developed specifically for parallel TTS without the need for an external aligner. The model employs the generic Glow architecture previously used in computer vision and vocoder models to produce mel-spectrograms from text inputs, which are then converted to speech audio. Glow-TTS has demonstrated superior synthesis speed over the autoregressive model Tacotron 2 while maintaining comparable speech quality.
Recently, a new TTS model called EfficientTTS [377] has been introduced. This model outperforms previous models such as Tacotron 2 and Glow-TTS in terms of speech quality, training efficiency, and synthesis speed. The EfficientTTS model uses a multi-head attention mechanism to align the input text and speech encodings, enabling it to generate high-quality speech with fewer parameters and faster synthesis speed. Overall, the introduction of normalizing flow and the development of models such as Glow-TTS and EfficientTTS have significantly improved the quality and efficiency of TTS systems.
5.2.5. Speech Resynthesis
Speech resynthesis is the process of generating speech from a given input signal. The input signal can be in various forms, such as a digital recording, text, or other types of data. The aim of speech resynthesis is to create an output that closely resembles the original signal in terms of sound quality, prosody, and other acoustic characteristics. Speech resynthesis is an important research area with various applications, including speech enhancement [528, 193, 363] and voice conversion [362]. Recent advancements have revolutionized the field by incorporating self-supervised discrete representations to generate disentangled representations of speech content, prosodic information, and speaker identity. These techniques enable the generation of speech in a controlled and precise manner, as seen in [281, 431, 439, 497]. The objective is to generate high-quality speech that maintains or degrades acoustic cues, such as phonotactics, syllabic rhythm, or intonation, from natural speech recordings. In particular, such discrete representations have been used in the GSLM [281] architecture for acoustic modeling, speech recognition, and synthesis, as outlined in Figure 15. GSLM comprises a discrete speech encoder, a generative language model, and a speech decoder, all trained without supervision, and it is the only prior work addressing the generative aspect of speech pre-training, building a text-free language model over discovered units.
5.2.6. Voice Conversion
Modifying a speaker's voice in a provided audio sample to that of another individual, while preserving the linguistic content, is called voice conversion. TTS and voice conversion share a common objective of generating natural speech. While models based on RNNs and CNNs have been successfully applied to voice conversion, the use of the transformer has shown promising results. The Voice Transformer Network (VTN) [210] is a seq2seq voice conversion (VC) model based on the transformer architecture with TTS pre-training. Seq2seq VC models are attractive as they can convert prosody, and the VTN is a novel approach in this field that has been proven effective in converting speech from a source to a target without changing the linguistic content.
ASR- and TTS-based voice conversion is a promising approach [534]. It involves using an ASR model to transcribe the source speech into a linguistic representation and then using a TTS model to synthesize the target speech with the desired voice characteristics [432]. However, this approach overlooks the modeling of prosody, which plays an important role in speech naturalness and conversion similarity. To address this issue, researchers have proposed to directly predict prosody from the linguistic representation in a target-speaker-dependent manner [649]. Other researchers have explored using a mix of ASR and TTS features to improve the quality of voice conversion [209, 665, 86, 647].
CycleGAN [238, 239, 240], VAE [82, 595, 235], and VAE combined with a generative adversarial network [191] are other popular approaches for non-parallel voice conversion. CycleGAN-VC [238] uses a cycle-consistent adversarial network to convert the source voice to the target voice and can generate high-quality speech without any extra data, modules, or alignment procedure. Several improvements and modifications have also been proposed in recent years [239, 240, 191]. VAE-based voice conversion is a promising approach that can generate high-quality speech with a small amount of training data [82, 595, 235].
5.2.7. Vocoders
The field of audio synthesis has undergone significant advancements in recent years, with various approaches proposed to enhance the quality of synthesized audio.
Prior studies have concentrated on improving discriminator architectures or incorporating auxiliary training losses. For instance, MelGAN introduced a multiscale discriminator that uses window-based discriminators at different scales and applies average pooling to downsample the raw waveform. It enforces the correspondence between the input mel spectrogram and the synthesized waveform using an L1 feature-matching loss from the discriminator. In contrast, GAN-TTS [38] utilizes an ensemble of discriminators that operate on random windows of different sizes and enforces the mapping between the conditioner and the waveform adversarially using conditional discriminators. Another approach, Parallel WaveGAN [608], extends the single short-time Fourier transform loss to multi-resolution and employs it as an auxiliary loss for GAN training. Recently, some researchers have improved MelGAN by integrating the multi-resolution short-time Fourier transform loss. HiFi-GAN reuses the multi-scale discriminator from MelGAN and introduces the multi-period discriminator for high-fidelity synthesis. UnivNet employs a multi-resolution discriminator that takes multi-resolution spectrograms as input and can enhance the spectral structure of the synthesized waveform. In contrast, CARGAN integrates partial autoregression into the generator to enhance pitch and periodicity accuracy. Recent generative models for modeling raw audio can be categorized into the following groups.
• Autoregressive models: Although WaveNet is renowned for its exceptional ability to generate high-quality speech, including natural-sounding intonation and prosody, other neural vocoders have emerged as potential alternatives in recent years. For instance, LPCNet [548] employs a combination of linear predictive coding (LPC) and deep neural networks (DNNs) to generate speech of similar quality while being computationally efficient and capable of producing low-bitrate speech. Similarly, SampleRNN [373], an unconditional end-to-end model, has demonstrated potential as it leverages a hierarchical RNN architecture and is trained end-to-end to generate raw speech of high quality.
• Generative Adversarial Network (GAN) vocoders: Numerous vocoders have been created that employ Generative Adversarial Networks (GANs) to generate speech of exceptional quality. These GAN-based vocoders, which include MelGAN [275] and HiFi-GAN [268], are capable of producing high-fidelity raw audio by conditioning on mel spectrograms. Furthermore, they can synthesize audio at speeds several hundred times faster than real time on a single GPU, as evidenced by research conducted in [113, 39, 608, 268, 275].
• Diffusion-based models: In recent years, several novel architectures based on diffusion have been proposed. Two prominent examples are WaveGrad [68] and DiffWave [269]. The WaveGrad model architecture builds upon prior work on score matching and diffusion probabilistic models, while the DiffWave model uses adaptive noise spectral shaping to adapt the diffusion noise. This adaptation, achieved through time-varying filtering, improves sound quality, particularly in high-frequency bands. Other examples of diffusion-based vocoders include InferGrad [74], SpecGrad [264], and PriorGrad [293]. InferGrad incorporates the inference process into training to reduce inference iterations while maintaining high quality. SpecGrad adapts the diffusion noise distribution to a given acoustic feature and uses adaptive noise spectral shaping to generate high-fidelity speech waveforms.
• Flow-based models: Parallel WaveNet, WaveGlow, and related models [354, 437, 258, 429, 294] are based on normalizing flows and are capable of generating high-fidelity speech in real time. While flow-based vocoders generally perform worse than autoregressive vocoders with regard to modeling the density of speech signals, recent research [354] has proposed new techniques to improve their performance.
Universal neural vocoding is a challenging task that has achieved limited success to date. However, recent advances in speech synthesis have shown a promising trend toward improving zero-shot performance by scaling up model sizes. Despite its potential, this approach has yet to be extensively explored. Nonetheless, several approaches have been proposed to address the challenges of universal vocoding. For example, WaveRNN has been utilized in previous studies to achieve universal vocoding (Lorenzo-Trueba et al. [344]; Paul et al. [421]). Another approach, developed by Jiao et al. [221], involves constructing a universal vocoder using a flow-based model. Additionally, the GAN vocoder has emerged as a promising candidate for this task, as suggested by You et al. [626].
5.2.8. Controllable Speech Synthesis
Controllable speech synthesis [549, 122, 676, 545, 462, 584, 276] is a rapidly evolving research area that focuses on generating natural-sounding speech with the ability to control various aspects of speech, including pitch, speed, and emotion. Controllable speech synthesis is positioned in the emerging field of affective computing, at the intersection of three disciplines: expressive speech analysis [535], natural language processing, and machine learning. This field aims to develop systems capable of recognizing, interpreting, and generating human-like emotional responses in interactions between humans and machines.
Expressive speech analysis is a critical component of this field. It provides mathematical tools to analyse speech signals and extract various acoustic features, including pitch, loudness, and duration, that convey emotions in speech. Natural language processing is also crucial to this field, as it helps to process the text input and extract the meaning and sentiment of the words. Finally, machine learning techniques are used to model and control the expressive features of the synthesized speech, enabling the systems to produce more expressive and controllable speech [11, 337, 550, 274, 517, 666, 410, 205, 295].
In the last few years, notable advancements have been achieved in this field [452, 248, 164], and several approaches have been proposed to enhance the quality of synthesized speech. For example, some studies propose using deep learning techniques to synthesize expressive speech and conditional generation models to control the prosodic features of speech [452, 248]. Others propose using motion-matching-based algorithms to synthesize gestures from speech [164].
5.2.9. Disentangling and Transferring
The importance of disentangled representations for neural speech synthesis cannot be overstated, as it has been widely recognized in the literature that this approach can greatly improve the interpretability and expressiveness of speech synthesis models [360, 194, 438]. Disentangling multiple styles or prosody information during training is crucial to enhance the quality of expressive speech synthesis and control. Various disentangling techniques have been developed using adversarial and collaborative games, the VAE framework, bottleneck reconstructions, and frame-level noise modeling combined with adversarial training.
For instance, Ma et al. [360] have employed adversarial and collaborative games to enhance the disentanglement of content and style, resulting in improved controllability. Hsu et al. [194] have utilized the VAE framework with adversarial training to separate speaker information from noise. Qian et al. [438] have introduced speech flow, which can disentangle rhythm, pitch, content, and timbre through three bottleneck reconstructions. In another work based on adversarial training, Zhang et al. [642] have proposed a method that disentangles noise from the speaker by modeling the noise at the frame level.
Developing high-quality speech synthesis models that can handle noisy data and generate accurate representations of speech is a challenging task. To tackle this issue, Zhang et al. [650] propose a novel approach involving multi-length adversarial training. This method allows for modeling different noise conditions and improves the accuracy of pitch prediction by incorporating discriminators on the mel-spectrogram. By replacing the traditional pitch predictor model with this approach, the authors demonstrate significant improvements in the fidelity of synthesized speech.
5.2.10. Robustness
Using neural TTS models can present issues with robustness, leading to low-quality audio samples for unseen or atypical text. In response, Li et al. [310] proposed RobuTrans [310], a robust transformer that converts input text to linguistic features before feeding them to the encoder. This model also includes modifications to the attention mechanism and position embedding, resulting in improved MOS scores compared to other TTS models. Another approach to enhancing robustness is the s-Transformer, introduced by Wang et al. [579], which models speech at the segment level, allowing it to capture long-term dependencies and use segment-level encoder-decoder attention. This technique performs similarly to the standard transformer while exhibiting robustness for extra-long sentences. Lastly, Zheng et al. [670] proposed an approach that combines a local recurrent neural network with the transformer to capture sequential and local information in sequences. Evaluation on a 20-hour Mandarin speech corpus demonstrated that this model outperforms the transformer alone.
In their recent paper [610], the authors proposed a novel method for extracting dynamic prosody information from audio recordings, even in noisy environments. Their approach employs probabilistic denoising diffusion models and knowledge distillation to learn speaking-style features from a teacher model, resulting in a highly accurate reproduction of prosody and timbre. This model shows great potential in applications such as speech synthesis and recognition, where noise-robust prosody information is crucial. Other noteworthy advances in the development of robust TTS systems include the work by [495], which focuses on a robust speech-text alignment module, as well as the use of normalizing flows for diverse speech synthesis.
5.2.11. Low-Resource Neural Speech Synthesis
High-quality paired text and speech data are crucial for building high-quality Text-to-Speech (TTS) systems [147]. Unfortunately, most languages are not supported by popular commercialized speech services due to the lack of sufficient training data [604]. To overcome this challenge, researchers have developed TTS systems under low-data-resource scenarios using various techniques [147, 604, 127, 540].
Several techniques have been proposed by researchers to enhance the efficiency of low-resource and zero-shot TTS systems. One of these is the use of semi-supervised speech synthesis methods that utilize unpaired training data to improve data efficiency, as suggested in a study by Liu et al. [328]. Another method involves cascading pre-trained models for ASR, MT, and TTS to increase data size from unlabelled speech, as proposed by Nguyen et al. [395]. In addition, researchers have employed crowdsourced acoustic data collection to develop TTS systems for low-resource languages, as shown in a study by Butryna et al. [50]. Huang et al. [205] introduced a zero-shot style transfer approach for out-of-domain speech synthesis that generates speech samples exhibiting a new and distinctive style, such as speaker identity, emotion, and prosody.
5.3. Speaker recognition
5.3.1. Task Description
The speech signal carries information about various characteristics of a speaker, such as origin, identity, gender, and emotion. This property of speech allows speech-based speaker profiling, with a wide range of applications in forensics, recommendation systems, etc. The research on recognizing speakers is extensive and aims to solve two major tasks: speaker identification (what is the identity?) and speaker verification (is the speaker who he/she claims to be?). Speaker recognition/verification tasks require extracting a fixed-length vector, called a speaker embedding, from unconstrained utterances. These embeddings represent the speakers and can be used for identification or verification tasks. Recent state-of-the-art speaker-embedding-extractor models are based on DNNs and have shown superior performance on both speaker identification and verification tasks.
• Speaker Recognition (SR) relies on speaker identification as a key aspect, where an unknown speaker's speech sample is compared to speech models of known speakers to determine their identity. The primary aim of speaker identification is to distinguish an individual's identity from a group of known speakers. This process involves a detailed analysis of the speaker's voice characteristics, such as pitch, tone, accent, and other pertinent features, to establish their identity. Recent advancements in deep learning techniques have significantly enhanced speaker identification, leading to the creation of accurate, efficient, and end-to-end models. Various deep learning-based models such as CNNs, RNNs, and their combinations have demonstrated exceptional performance in several subtasks of speaker identification, including verification, identification, diarization, and robust recognition [458, 247, 260].
• Speaker Verification (SV) is a process that involves confirming the identity of a speaker through their speech. It differs from speaker identification, which aims to identify unknown speakers by comparing their voices with those of registered speakers in a database. Speaker verification verifies whether a speaker is who they claim to be by comparing their voice with an available speaker template. Deep learning-based speaker verification relies on speaker representations based on embeddings, which involves learning low-dimensional vector representations from speech signals that capture speaker characteristics, such as pitch and speaking style, and can be used to compare different speech signals and determine their similarity.
5.3.2. Dataset
The VoxCeleb dataset (VoxCeleb 1 & 2) is widely used in speaker recognition research, as mentioned in [92]. This dataset consists of speech data collected from publicly available media, employing a fully automated pipeline that incorporates computer vision techniques. The pipeline retrieves videos from YouTube and applies active speaker verification using a two-stream synchronization CNN. Speaker identity is further confirmed through CNN-based facial recognition. Another commonly employed dataset is TIMIT, which comprises recordings of phonetically balanced English sentences spoken by a diverse set of speakers. TIMIT is commonly used for evaluating speech recognition and speaker identification systems, as referenced in [153].
Other noteworthy datasets in the field include the SITW database [371], which provides hand-annotated speech samples for benchmarking text-independent speaker recognition technology, and the RSR2015 database [286], which contains speech recordings acquired in a typical office environment using multiple mobile devices. Additionally, the RedDots project [291] and VOICES corpus [465] offer unique collections of offline voice recordings in furnished rooms with background noise, while the CN-CELEB database [135] focuses on a specific person of interest extracted from bilibili.com using an automated pipeline followed by human verification.
The BookTubeSpeech dataset [426] was also collected using an automated pipeline from BookTube videos, and the Hi-MIA database [440] was designed specifically for far-field scenarios using multiple microphone arrays. The FFSVC20 challenge [441] and DIHARD challenge [473] are speaker verification and diarization research initiatives focusing on far-field and robustness challenges, respectively. Finally, the LibriSpeech dataset [412], originally intended for speech recognition, is also useful for speaker recognition tasks due to its included speaker identity labels.
5.3.3. Models
Speaker identification (SI) and verification (SV) are crucial research topics in the field of speech technology due to their significant importance in various applications such as security [125], forensics [270], biometric authentication [170], and speaker diarization [601]. Speaker recognition has become more popular with technological advancements, including the Internet of Things (IoT), smart devices, voice assistants, smart homes, and humanoids. Therefore, a significant quantity of research has been conducted in this field, and many methods have been developed, making the state of the art quite mature and versatile. However, it has become increasingly challenging to provide an overview of the various methods due to the high number of studies in the field.
A neural network approach for speaker verification was first attempted by Variani et al. [554] in 2014, utilizing four fully connected layers for speaker classification. Their approach successfully verified speakers with short-duration utterances by obtaining the d-vector through averaging the output of the last hidden layer across frames. Although various attempts have been made to directly learn speaker representations from raw waveforms (Ravanelli and Bengio [456], Jung et al. [226]), other well-designed neural networks like CNNs and RNNs have been proposed for speaker verification tasks by Ye and Yang [621]. Nevertheless, the field still requires more powerful deep neural networks for superior extraction of speaker features.
Speaker verification has seen notable advancements with the advent of more powerful deep neural networks. One such model is the x-vector-based system proposed by Snyder et al. [509], which has gained widespread popularity due to its remarkable performance. Since its introduction, the x-vector system has undergone significant architectural enhancements and optimized training procedures [103]. The widely used ResNet [176] architecture has been incorporated into the system to further improve its performance. Adding residual connections between frame-level layers has been found to improve the embeddings [152, 634]. This technique has also aided in faster convergence of the back-propagation algorithm and mitigated the vanishing gradient problem [176].
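The sketch below illustrates the d-vector style verification scheme attributed above to Variani et al.: frame-level hidden activations are averaged into an utterance embedding, and two utterances are compared with cosine similarity against a decision threshold. The encoder, feature dimensions, and threshold value are illustrative assumptions, not the published configuration.

```python
# Minimal sketch of d-vector extraction and cosine-similarity verification (toy encoder).
import torch
import torch.nn as nn

frame_encoder = nn.Sequential(nn.Linear(40, 256), nn.ReLU(), nn.Linear(256, 256))

def utterance_embedding(frames: torch.Tensor) -> torch.Tensor:
    """frames: (num_frames, 40) acoustic features -> L2-normalized d-vector."""
    dvec = frame_encoder(frames).mean(dim=0)          # average over frames
    return dvec / dvec.norm()

enroll = utterance_embedding(torch.randn(300, 40))    # enrollment utterance
test = utterance_embedding(torch.randn(250, 40))      # test utterance
score = torch.dot(enroll, test)                       # cosine similarity of unit vectors
accept = score > 0.7                                  # threshold tuned on a dev set (e.g., at the EER)
```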
Tang et al. [532] proposed further improvements to the x-vector system. They introduced a hybrid structure based on TDNN and LSTM to generate complementary speaker information at different levels. They also suggested a multi-level pooling strategy to collect speaker information from global and local perspectives. These advancements have significantly improved speaker verification systems' performance and paved the way for further developments in the field.
Desplanques et al. [108] propose a state-of-the-art architecture for speaker verification utilizing a Time Delay Neural Network (TDNN) called ECAPA-TDNN. The paper presents a range of enhancements to the existing x-vector architecture that leverage recent developments in face verification and computer vision. Specifically, the authors suggest three major improvements. Firstly, they propose restructuring the initial frame layers into 1-dimensional Res2Net modules with impactful skip connections, which can better capture the relationships between different time frames. Secondly, they introduce Squeeze-and-Excitation blocks in the TDNN layers, which help highlight the most informative channels and improve feature discrimination. Lastly, the paper proposes channel attention propagation and aggregation to efficiently propagate attention weights through multiple TDNN layers, further enhancing the model's ability to discriminate between speakers.
Additionally, the paper presents a new approach that utilizes ECAPA-TDNN from the speaker recognition domain as the backbone network for a multiscale channel adaptive module. The proposed method achieves promising results, demonstrating the effectiveness of the proposed architecture in speaker verification. Overall, ECAPA-TDNN offers a comprehensive solution to speaker verification by introducing several novel contributions that improve the existing x-vector architecture, which has been state-of-the-art in speaker verification for several years. The proposed approach also achieves promising results, suggesting that the proposed architecture can effectively tackle the challenges of speaker verification.
The attention mechanism is a powerful method for obtaining a more discriminative utterance-level feature by explicitly selecting frame-level representations that better represent
5.4. Speaker Diarization
5.4.1. Task Description
Speaker diarization is a critical component in the analysis of multi-speaker audio data, and it addresses the question of "who spoke when." The term "diarize" refers to the process of making a note or keeping a record of events, as per the English dictionary. A traditional speaker diarization system comprises several crucial components that work together to achieve accurate and efficient speaker diarization. In this section, we discuss the different components of a speaker diarization system (Figure 16) and their role in achieving accurate speaker diarization.
• Acoustic Feature Extraction: In the analysis of multi-speaker speech data, one critical component is the extraction of acoustic features [14, 538]. This process involves extracting features such as pitch, energy, and MFCCs from the audio signal. These acoustic features play a crucial role in identifying different speakers by analyzing their unique characteristics.
• Segmentation: Segmentation is a crucial component in the analysis of multi-speaker audio data, where the audio signal is divided into smaller segments based on the silence periods between speakers [14, 538]. This process helps in reducing the complexity of the problem and makes it easier to identify different speakers in smaller segments.
• Speaker Embedding Extraction: This process involves obtaining a low-dimensional representation of each speaker's voice, commonly referred to as a speaker embedding. This is achieved by passing the acoustic features extracted from the speech signal through a deep neural network, such as a CNN or RNN [508].
• Clustering: In this component, the extracted speaker embeddings are clustered based on similarity, and each cluster represents a different speaker [14, 538]. This process commonly uses unsupervised clustering algorithms, such as k-means clustering.
• Speaker Classification: In this component, the speaker
speaker characteristics. Recently, the Transformer model
embeddings are classified into different speaker identi-
with a self-attention mechanism has become effective in var-
ties using a supervised classification algorithm, such
ious application fields, including speaker verification. The
as SVM or MLP [14, 538].
Transformer architecture has been extensively explored for
speaker verification. TESA [370] is an architecture based • Re-segmentation: This component is responsible for
on the Transformer’s encoder, proposed as a replacement refining the initial segmentation by adjusting the seg-
for conventional PLDA-based speaker verification to capture ment boundaries based on the classification results. It
speaker characteristics better. TESA outperforms PLDA on helps in improving the accuracy of speaker diarization
the same dataset by utilizing the next sentence prediction by reducing the errors made during the initial segmen-
task of BERT [109]. Zhu et al. [675] proposed a method to tation.
create fixed-dimensional speaker verification representation
using a serialized multi-layer multi-head attention mecha- Various studies focus on traditional speaker diarization sys-
nism. Unlike other studies that redesign the inner structure tems [14, 538]. This paper will review the recent efforts
of the attention module, their approach strictly follows the toward deep learning-based speaker diarizations techniques.
original Transformer, providing simple but effective modifi-
cations.
Figure 16: Speaker diarization system diagram showcasing the process of identifying and differentiating multiple speakers in an
audio recording using various techniques such as VAD, segmentation, clustering and re-segmentation.
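To make the pipeline of Figure 16 concrete, the following sketch clusters segment-level speaker embeddings to produce "who spoke when" labels. It is a minimal illustration under stated assumptions: the `embed` callable stands in for any pre-trained speaker-embedding extractor, and the number of speakers is assumed known, whereas real systems estimate it (for example, with the eigengap-based criterion of Park et al. [417] discussed later in this section).

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def length_normalize(X):
    # Unit-length embeddings so that Euclidean distance between rows
    # behaves like cosine distance.
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def diarize(segments, embed, n_speakers=2):
    """Toy diarization: embed each speech segment, then cluster.

    segments   : list of (start_sec, end_sec, waveform) tuples from a VAD/segmenter
    embed      : callable mapping a waveform to a fixed-size embedding (placeholder)
    n_speakers : assumed known here; real systems estimate it
    """
    X = length_normalize(np.stack([embed(wav) for _, _, wav in segments]))
    labels = AgglomerativeClustering(n_clusters=n_speakers).fit_predict(X)
    # Return "who spoke when" as (start, end, speaker_id) triples.
    return [(s, e, int(l)) for (s, e, _), l in zip(segments, labels)]
```

Real systems add overlap-aware segmentation and a re-segmentation pass on top of this basic embed-and-cluster loop.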
5.4.2. Datasets

• VoxConverse: … provided on the project website 8.
8 https://www.robots.ox.ac.uk/~vgg/data/voxconverse/

• LibriCSS: The LibriCSS corpus is a valuable resource for researchers studying speech separation, recognition, and speaker diarization. The corpus comprises 10 hours of multichannel recordings captured with a 7-channel microphone array in a real meeting room. The audio was played back from the LibriSpeech corpus, and each of the ten sessions was subdivided into six 10-minute mini-sessions. Each mini-session contained audio from eight speakers and was designed with a different overlap ratio, ranging from 0% to 40%. To facilitate research, the corpus includes baseline systems for speech separation and automatic speech recognition (ASR), as well as a baseline system that integrates speech separation, speaker diarization, and ASR; these baseline systems have already been developed and made available to researchers.

• Rich Transcription Evaluation Series: The Rich Transcription Evaluation Series dataset is a collection of speech data used for speaker diarization evaluation. The Rich Transcription Fall 2003 Evaluation (RT-03F) was the first evaluation in the series focused on "Who Said What" tasks. The dataset has been used in subsequent evaluations, including the Second DIHARD Diarization Challenge, which used the Jaccard index to compute the Jaccard Error Rate (JER) for each pair of segmentations. The dataset is essential for data-driven spoken language processing methods and for measuring speaker diarization accuracy at the utterance level. It includes rules, evaluation methods, and baseline systems to promote reproducible research, and it has been used in various speaker diarization systems and their subtasks in the context of broadcast news and CTS data.

• CHiME-5/6 challenge and dataset: The CHiME-5/6 challenge is a speech processing challenge focusing on distant multi-microphone conversational speech diarization and recognition in everyday home environments. The challenge provides a dataset of recordings from everyday home environments, including the dinner recordings originally collected for and exposed during the CHiME-5 challenge, and it is designed to be representative of natural conversational speech. The challenge features two audio input conditions, single-channel and multichannel, and participants are provided with baseline systems for speech enhancement, speech activity detection (SAD), and diarization, together with the results obtained with these systems for all tracks. The challenge aims to improve the robustness of diarization systems to variations in recording equipment, noise conditions, and conversational domains.

• AMI dataset: The AMI database is a comprehensive collection of 100 hours of recordings sourced from 171 meeting sessions held across various locations. It features two distinct audio sources: one recorded using lapel microphones for individual speakers and the other using omnidirectional microphone arrays placed on the table. It is an ideal dataset for evaluating speaker diarization systems integrated with an ASR module. AMI's value is further enhanced by the provision of forced-alignment data, which captures timings at the word and phoneme levels together with speaker labeling. Each meeting session involves a small group of three to five speakers.

5.4.3. Models
Speaker diarization, whose goal is to separate the speakers in an audio recording, has long been a subject of research in audio processing. In recent years, deep learning has emerged as a powerful technique for speaker diarization, leading to significant advancements in the field. Below, we explore recent developments in deep learning architectures for speaker diarization, organized around the modules outlined in Figure 16, and highlight major advancements in each module.

• Segmentation and clustering: Speaker diarization systems typically use a range of techniques for segmenting speech, such as speaker change detection, uniform speaker segmentation, ASR-based word segmentation, and supervised speaker turn detection, each with its own benefits and drawbacks. Uniform speaker segmentation divides speech into segments of equal length, which can be difficult to tune so that segments both capture speaker turn boundaries and contain enough speaker information. ASR-based word segmentation identifies word boundaries using automatic speech recognition, but the resulting segments may be too brief to provide adequate speaker information. Supervised speaker turn detection, on the other hand, uses a specialized model that can accurately identify speaker-turn timestamps; while this method can achieve high accuracy, it requires labeled training data. These techniques have been widely discussed in previous research, and choosing the appropriate one depends on the specific requirements of the application.

– The authors in [98] propose a real-time speaker diarization system that combines incremental clustering and local diarization applied to a rolling window of speech data and is designed to handle overlapping speech segments. The pipeline uses end-to-end overlap-aware segmentation to detect and separate overlapping speakers.

– In another related work, the authors in [643] introduce a novel speaker diarization system with a generalized neural speaker clustering module as its backbone.
– In a recent study, Park et al. [417] propose a new spectral clustering framework that allows automatic parameter tuning of the clustering algorithm in the context of speaker diarization. The technique uses normalized maximum eigengap (NME) values to determine the number of clusters and the threshold parameters for each row of the affinity matrix during spectral clustering. The authors demonstrate that their method outperforms existing state-of-the-art methods on two speaker diarization datasets.

– The Bayesian HMM clustering of x-vector sequences (VBx) approach, which clusters x-vectors using a Bayesian hidden Markov model (BHMM) [285], combined with a ResNet101 (He et al. [176]) x-vector extractor, achieves superior results on the CALLHOME [111], AMI [53], and DIHARD II [474] datasets.

• Speaker Embedding Extraction and Classification:

– Attentive Aggregation for Speaker Diarization [278]: This approach uses an attention mechanism to aggregate embeddings from multiple frames and generate speaker embeddings, which are then clustered to identify speaker segments.

– End-to-End Speaker Diarization with Self-Attention [145]: This method uses a self-attention mechanism to capture the correlations between input frames and generates embeddings for each frame, which are then clustered to identify speaker segments.

– Wang et al. [577] present a method for measuring the similarity between speaker embeddings in diarization using neural networks. The approach incorporates past and future contexts and uses a segmental pooling strategy; the speaker embedding network and the similarity measurement model are jointly trained. The paper extends this framework to target-speaker voice activity detection (TS-VAD) [372]. The proposed method effectively learns the similarity between speaker embeddings by considering both past and future contexts.

– Time-Depth Separable Convolutions for Speaker Diarization [266]: This approach uses time-depth separable convolutions to generate embeddings for each frame, which are then clustered to identify speaker segments. The method is computationally efficient and achieves state-of-the-art performance on several benchmark datasets.

• Re-segmentation:

– Numerous studies in this area centre on developing re-segmentation strategies for diarization systems that can effectively handle both voice activity and overlapped speech detection. This approach can also serve as a post-processing step to accurately identify and assign overlapped speech regions. Notable examples include the works of Bullock et al. [47] and Bredin and Laurent [45].

• End-to-End Neural Diarization: In addition to the above work, end-to-end speaker diarization systems have gained the attention of the research community due to their ability to handle speaker overlaps and their direct optimization to minimize diarization errors. In one such work, the authors propose end-to-end neural speaker diarization that does not rely on clustering and instead uses a self-attention-based neural network to directly output the joint speech activities of all speakers for each segment [145]. Following this trend, several other works propose enhanced architectures based on self-attention [324, 630].

5.5. Speech-to-speech translation
5.5.1. Task Description
Speech-to-text translation (ST) is the process of converting spoken language in one language into text in another language. Traditionally, this has been achieved with a cascaded structure that combines automatic speech recognition (ASR) and machine translation (MT) components. More recently, end-to-end (E2E) methods [524, 480, 639, 62, 166, 669, 15] have gained popularity because they avoid the error propagation and high latency associated with cascaded methods [518, 63]. The E2E approach uses an audio encoder to analyze the audio signal and a text decoder to generate the translated text.

One notable advantage of ST systems is that they allow for more natural and fluent communication than other language translation methods. By translating speech in real time, ST systems can capture the subtleties of speech, including tone, intonation, and rhythm, which are essential for effective communication. Developing ST systems is a highly intricate process that involves integrating various technologies, such as speech recognition, natural language processing, and machine translation. One significant obstacle in ST is the variation in accents and dialects across languages, which can significantly impact translation accuracy.

5.5.2. Dataset
There are numerous datasets available for the end-to-end speech translation task; some of the most widely used are MuST-C [56], IWSLT [483], and CoVoST 2 [564]. These datasets cover a variety of languages, including English, German, Spanish, French, Italian, Dutch, Portuguese, Romanian, Arabic, Chinese, Japanese, Korean, and Russian. For instance, TED-LIUM [470] is a suitable dataset for speech-to-text, text-to-speech, and speech-to-speech translation tasks, as it contains transcriptions and audio recordings of TED talks in English, French, German, Italian, and Spanish. Another open-source dataset is Common Voice, which covers
several languages, including English, French, German, Italian, and Spanish. Additionally, VoxForge 9 is designed for acoustic model training and includes speech recordings and transcriptions in several languages, including English, French, German, Italian, and Spanish. LibriSpeech [412] is a dataset of spoken English specifically designed for speech recognition and speech-to-text translation tasks. Lastly, How2 [124] is a multimodal machine translation dataset that includes speech recordings, text transcriptions, and video and image data, covering English, German, Italian, and Spanish. These datasets have been instrumental in training state-of-the-art speech-to-speech translation models and will continue to play a crucial role in advancing the field.
9 http://www.voxforge.org/

5.5.3. Models
End-to-end models are a promising approach to direct speech translation. These models use a single sequence-to-sequence model for speech-to-text translation followed by text-to-speech synthesis. In 2017, researchers demonstrated that end-to-end models outperform cascade models [3]. A study published in 2019 provides an overview of different end-to-end architectures for speech-to-text translation and of the use of an additional connectionist temporal classification (CTC) loss for better convergence [27]. In 2019, Google introduced Translatotron [219], an end-to-end speech-to-speech translation system. Translatotron uses a single sequence-to-sequence model for speech-to-text translation followed by text-to-speech translation; no transcripts or other intermediate text representations are used during inference. The system was validated by measuring the BLEU score computed on text transcribed by a speech recognition system. Although the results lag behind a conventional cascade system, the feasibility of direct end-to-end speech-to-speech translation was demonstrated [219].

In a 2020 publication, researchers presented an end-to-end speech translation system that incorporates pre-trained models such as Wav2Vec 2.0 and mBART, along with coupling modules between the encoder and decoder, and introduces an efficient fine-tuning technique that selectively trains only 20% of the total parameters [622]. The system developed by the UPC Machine Translation group participated in the IWSLT 2021 offline speech translation task, which aimed to develop a system capable of translating English audio recordings of TED talks into German text.

E2E ST is often improved by pretraining the encoder and/or decoder with transcripts from speech recognition or text translation tasks [110, 563, 639, 603]; consequently, this has become the standard approach in various toolkits [214, 563, 660, 669]. However, transcripts are not always available, and the significance of pretraining for E2E ST is rarely studied. Zhang et al. [638] explored the effectiveness of E2E ST trained solely on speech-translation pairs and proposed an algorithm for training from scratch. Their system outperforms previous studies on four benchmarks covering 23 languages without pretraining. The paper also discusses neural acoustic feature modeling, which extracts acoustic features directly from raw speech signals to simplify inductive biases and enhance speech description.

5.6. Speech enhancement
5.6.1. Task Description
In the presence of ambient noise, speech recognition systems can have difficulty correctly interpreting spoken language, resulting in reduced performance [123]. One possible solution is the development of speech enhancement systems that remove noise and other signal distortions from spoken language, thereby improving signal quality. Such systems are frequently used as a preprocessing step to improve the accuracy of speech recognition and can be an effective way to boost the performance of ASR systems in noisy environments. This section discusses the significance of speech enhancement technology for improving speech recognition accuracy.

5.6.2. Dataset
One popular dataset for speech enhancement tasks is AISHELL-4, which comprises authentic Mandarin speech recordings captured during conferences using an 8-channel circular microphone array. According to [144], AISHELL-4 is composed of 211 meeting sessions, each featuring 4 to 8 speakers, for a total of 120 hours of content. This dataset is of great value for research into multi-speaker processing, including speaker diarization and speech recognition, owing to its realistic acoustics and rich speech characteristics. Another popular dataset for speech enhancement comes from the Deep Noise Suppression (DNS) challenge [459], a large-scale collection of noisy speech signals and their corresponding clean speech signals; it contains over 10,000 hours of noisy speech and over 1,000 hours of clean speech, making it useful for training deep learning models for speech enhancement. The Voice Bank Corpus (VCTK) is another dataset, containing speech recordings from 109 speakers, each recording approximately 400 sentences; it contains clean and noisy speech recordings, making it useful for training speech enhancement models. These datasets provide realistic acoustics, rich natural speech characteristics, and large-scale noisy and clean speech signals, making them well suited for training deep learning models.

5.6.3. Models
Several classical algorithms have been reported in the literature for speech enhancement, including spectral subtraction [41], Wiener and Kalman filtering [319, 482], MMSE estimation [128], comb filtering [222], subspace methods [171], and phase spectrum compensation [409]. However, classical algorithms such as spectral subtraction and Wiener filtering approach the problem in the spectral domain and are restricted to stationary or quasi-stationary noise. Neural network-based approaches inspired by other areas such as computer vision [188, 146, 10] and generative adversarial networks [596, 321, 471, 142], or developed for general audio processing tasks [588, 157], have outperformed the classical approaches.
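To make the classical baseline concrete, the snippet below sketches magnitude spectral subtraction with the noise spectrum estimated from the first few frames. It is an illustrative simplification rather than any of the cited methods: the frame count, the spectral floor, and the assumption that the recording begins with noise only are arbitrary choices.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(noisy, fs, noise_frames=10, floor=0.002):
    # STFT of the noisy signal.
    f, t, X = stft(noisy, fs=fs, nperseg=512)
    mag, phase = np.abs(X), np.angle(X)
    # Estimate the noise magnitude spectrum from the first frames
    # (assumes the recording starts with noise only).
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)
    # Subtract and floor to avoid negative magnitudes ("musical noise" remains an issue).
    clean_mag = np.maximum(mag - noise_mag, floor * mag)
    # Resynthesize with the noisy phase.
    _, enhanced = istft(clean_mag * np.exp(1j * phase), fs=fs, nperseg=512)
    return enhanced
```

The fixed noise estimate is exactly why such methods degrade on non-stationary noise, which motivates the learned approaches discussed next.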
Table 10
Performance of different speech enhancement algorithms on the Deep Noise Suppression
(DNS) Challenge dataset. The table showcases improvements in PESQ-WB, PESQ-NB,
SI-SDR-WB, and SI-SDR-NB metrics, and identifies the top-performing methods in each
category.
Model PESQ-WB PESQ-NB SI-SDR-WB SI-SDR-NB Architecture
FRCRN [664] 3.23 - - - U-Net + CRN
Sudo rm -rf [543] 2.95 - 19.7 - UConvBlock + CNN
DCTCRN-P [311] 2.82 - - - CNN
PoCoNet [216] 2.7885 - - - -
FullSubNet [172] 2.777 3.305 17.29 - LSTM
RNN-Modulation [559] 2.75 - - - GRU
Conv-TasNet-SNR [271] 2.73 - - - CNN
Sudo rm-rf [542] 2.69 - 18.6 - UConvBlock + CNN
RemixIT [543] 2.34 - 16.0 - UConvBlock
SN-Net [668] - 3.39 - 19.52 CNN
DCCRN-E-Aug [202] - 3.214 - - CNN + LSTM
DTLN [592] - 3.04 16.34 - LSTM
DCCRN-E [202] - 3.04 - - CNN + LSTM
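Many of the neural systems in Table 10, particularly the LSTM- and CNN-based ones, work by predicting a time-frequency mask that is applied to the noisy spectrogram. The sketch below shows only that mask-application step; `predict_mask` is a placeholder for whatever network is used and is not the API of any listed system.

```python
import numpy as np
from scipy.signal import stft, istft

def enhance_with_mask(noisy, fs, predict_mask):
    """Apply a model-predicted magnitude mask to a noisy STFT.

    predict_mask: callable taking a (freq, time) magnitude array and returning
                  a mask of the same shape with values in [0, 1] (placeholder).
    """
    _, _, X = stft(noisy, fs=fs, nperseg=512)
    mask = np.clip(predict_mask(np.abs(X)), 0.0, 1.0)
    # Scale each time-frequency bin; keep the noisy phase, as many systems do.
    _, enhanced = istft(mask * X, fs=fs, nperseg=512)
    return enhanced
```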
Various neural network models based on different architectures, including fully connected neural networks [606], deep denoising autoencoders [346], CNNs [143], LSTMs [77], and Transformers [263], have effectively handled diverse noisy conditions.

Diffusion-based models have also shown promising results for speech enhancement [298, 623, 349] and have led to novel speech enhancement algorithms such as the Conditional Diffusion Probabilistic Model (CDiffuSE), which incorporates characteristics of the observed noisy speech signal into the diffusion and reverse processes [349]. CDiffuSE is a generalized formulation of the diffusion probabilistic model that can adapt to the non-Gaussian real noises present in the estimated speech signal. Another diffusion-based model for speech enhancement is StoRM (Stochastic Regeneration Model) [298], which uses a predictive model to remove vocalizing and breathing artifacts while producing high-quality samples through a diffusion process, even in adverse conditions; StoRM has shown great ability to bridge the performance gap between predictive and generative approaches to speech enhancement. Furthermore, the authors in [623] propose the cold diffusion process, an advanced iterative version of the diffusion process, to recover clean speech from noisy speech; according to the authors, it can be used to restore high-quality samples from arbitrary degradations. Table 10 summarizes the performance of different speech enhancement algorithms on the Deep Noise Suppression (DNS) Challenge dataset using different metrics.

5.7. Audio Super Resolution
5.7.1. Task Description
Audio super-resolution is the task of predicting the missing high-resolution components of a low-resolution audio signal. The task is difficult due to the continuous nature of audio signals. Current methods typically approach super-resolution by treating audio as discrete data and focusing on fixed scale factors. To accomplish audio super-resolution, deep neural networks are trained on pairs of low- and high-quality audio examples; at test time, the model predicts the missing samples of a low-resolution signal. Some recent deep network approaches have shown promise by framing the problem as a regression task in either the time or the frequency domain [320], and these methods have achieved impressive results.

5.7.2. Datasets
This section provides an overview of the diverse datasets used in the audio super-resolution literature. One of the most frequently used is MUSDB18, specifically designed for music source separation and enhancement; it encompasses more than 150 songs with distinct tracks for individual instruments. Another prominent dataset is UrbanSound8K, which comprises over 8,000 environmental sound files collected from 10 different categories, making it ideal for evaluating audio super-resolution algorithms in noisy environments. Furthermore, the VoiceBank dataset is another essential resource for evaluating audio super-resolution systems, comprising over 10,000 speech recordings from five distinct speakers; it offers a rich source of information for assessing speech processing systems, including audio super-resolution. Another dataset, LibriSpeech,
features more than 1,000 hours of spoken words from several books and speakers, making it valuable for evaluating audio super-resolution algorithms that enhance the quality of spoken words. Finally, the TED-LIUM dataset, which includes over 140 hours of speech recordings from various speakers giving TED talks, provides a real-world setting for evaluating audio super-resolution algorithms for speech enhancement. Using these datasets, researchers can evaluate audio super-resolution systems on a wide range of audio signals and improve the generalizability of these algorithms to real-world scenarios.

5.7.3. Models
Audio super-resolution has been extensively explored with deep learning architectures [455, 624, 320, 290, 168, 40, 8, 393, 253, 333]. One notable paper by Rakotonirina [455] proposes a network architecture that integrates convolution and self-attention for audio super-resolution; specifically, it uses Attention-based Feature-Wise Linear Modulation (AFiLM) [455] to modulate the activations of the convolutional model. In another recent work, Yoneyama et al. [624] decompose the super-resolution task into domain adaptation and resampling processes to handle the acoustic mismatch between unpaired low- and high-resolution signals, and they jointly optimize the two processes within the CycleGAN framework. Moreover, the Time-Frequency Network (TFNet) [320] is a deep network that achieves promising results by modeling the task as a regression problem in either the time or the frequency domain; to further enhance audio super-resolution, the paper proposes a time-frequency network that combines time- and frequency-domain information. Finally, recent advances in diffusion models have introduced new approaches to neural audio upsampling. Specifically, Lee and Han [290] and Han and Lee [168] propose the NU-Wave 1 and NU-Wave 2 diffusion probabilistic models, respectively, which can produce high-quality waveforms with a sampling rate of 48 kHz from coarse 16 kHz or 24 kHz inputs. These models are a promising direction for improving audio super-resolution.

5.8. Voice Activity Detection (VAD)
5.8.1. Task Description
With the increasing sophistication of mobile devices such as smartphones, speech-controlled applications have become incredibly popular. These apps offer a hands-free way of controlling home devices, facilitating telephony, and allowing drivers to safely use their vehicle's infotainment systems while on the go. However, accurately distinguishing between noise and human speech is critical for these applications to work without interruption. To address this issue, Voice Activity Detection (VAD) systems have been created to recognize the presence or absence of speech, thus ensuring consistent and effective operation.

5.8.2. Dataset
The TIMIT dataset is popular, providing 6,300 phonetically transcribed utterances from 630 speakers. CHiME-5, on the other hand, is designed for speech separation and recognition in real-world environments and includes multichannel recordings of 20 speakers in locations such as cafés, buses, and pedestrian areas; despite its primary purpose, it is widely used for voice activity detection. AURORA-4 is specifically designed to evaluate the robustness of ASR systems and contains over 10,000 noisy speech utterances recorded in conditions such as car noise, babble noise, and street noise; it has also been extended to VAD for evaluating challenging scenarios. DEMAND is a suitable dataset for evaluating VAD algorithms, as it includes over 1,200 artificially created noise signals with various noise types such as white noise, pink noise, and café noise. Finally, VoxCeleb contains over 100,000 utterances from more than 6,000 speakers; it is primarily designed for evaluating speaker recognition systems, but it can also be used for voice activity detection.

5.8.3. Models
Recent advances in deep learning have greatly improved the performance of voice activity detection (VAD), particularly in noisy environments [464, 380]. To further improve VAD accuracy, researchers have explored various deep learning architectures, including NAS-VAD [464] and self-attentive VAD [223]. NAS-VAD employs neural architecture search to reduce the human effort needed for network design and has demonstrated superior performance in terms of AUC and F1-score compared to other models. Similarly, self-attentive VAD uses a self-attention mechanism to capture long-term dependencies in the input signal and has also outperformed other models on the TIMIT dataset. Additionally, a deep neural network (DNN) system has been proposed for automatic speech detection in audio signals [380]; this system uses MLPs, RNNs, and CNNs, with CNNs delivering the best performance. Furthermore, a hybrid acoustic-lexical deep learning approach combining both acoustic and lexical features has been proposed for deception detection.

5.9. Speech Quality Assessment
5.9.1. Task Description
Speech quality assessment is the objective evaluation of speech signals using various metrics and measures. The primary aim of this assessment is to determine the level of intelligibility and comprehensibility of speech to a human listener. Although human evaluation is considered the gold standard for assessing speech quality, it can be time-consuming, expensive, and not scalable. The mean opinion score (MOS) is the most commonly used and reliable method of obtaining human judgments for speech quality estimation. Accurate speech quality assessment is essential in the development and design of real-world applications such as ASR, speech enhancement, and VoIP.
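Because MOS underlies most of the subjective evaluations discussed here, the short sketch below shows how a per-system MOS and a confidence interval are commonly aggregated from listener ratings; the 1-5 scale and the normal-approximation interval are standard conventions rather than details taken from a specific cited work.

```python
import numpy as np

def mean_opinion_score(ratings, confidence_z=1.96):
    """Aggregate listener ratings (1-5 scale) into a MOS with a ~95% interval.

    ratings: array-like of opinion scores collected for one system or utterance.
    """
    r = np.asarray(ratings, dtype=float)
    mos = r.mean()
    # Normal-approximation confidence interval on the mean rating.
    half_width = confidence_z * r.std(ddof=1) / np.sqrt(len(r))
    return mos, (mos - half_width, mos + half_width)

# Example: 12 listeners rate one enhanced utterance.
print(mean_opinion_score([4, 5, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4]))
```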
5.9.2. Dataset
… has clean speech recordings and artificially generated degraded versions for speech synthesis and quality assessment research. The NOIZEUS dataset [203] is designed for evaluating noise reduction and speech quality assessment algorithms, with clean speech and artificially degraded versions containing various types of noise and distortion. The ETSI Aurora databases [361] are used for evaluating speech enhancement techniques and quality assessment algorithms, and contain speech recordings with different types of distortion, such as acoustic echo and background noise. Furthermore, for training and validation, the clean speech recordings from the DNS Challenge [459] can be used together with a noise dataset such as FSDK50 [138] for additive noise degradation.

5.9.3. Models
Current objective methods for evaluating speech quality, such as Perceptual Evaluation of Speech Quality (PESQ) [468] and Perceptual Objective Listening Quality Assessment (POLQA) [36], mostly rely on the availability of a corresponding clean reference. These methods fail in real-world scenarios where a ground-truth clean reference is unavailable. In recent years, attempts to automatically estimate the MOS with neural networks that predict ratings or scores have attracted much attention [516, 406, 55, 118, 119, 57]. These approaches outperform traditional approaches without needing a clean reference; however, they lack robustness and generalization capabilities, which limits their use in real-world applications. The authors in [406] explore Deep machine listening for Estimating Speech Quality (DESQ), which predicts perceived speech quality from phoneme posterior probabilities obtained using a deep neural network.

In recent years, several quality assessment frameworks have been developed to estimate speech quality, such as NORESQA [369], which is based on non-matching references (NMRs). NORESQA takes inspiration from the human ability to assess speech quality even when the content does not match, and it introduces two new metrics: the NORESQA-score, based on SI-SDR for speech, and NORESQA-MOS, which evaluates the mean opinion score (MOS) of a speech recording using non-matching references. A recent extension, NORESQA-MOS, has been proposed in [368]. The primary difference between these frameworks is that NORESQA estimates speech quality using non-matching references through the NORESQA-score and NORESQA-MOS metrics, whereas NORESQA-MOS is specifically designed to assess the MOS of a given speech recording using NMRs.

5.10. Speech Separation
5.10.1. Task Description
Speech separation refers to separating a mixed audio signal into its constituent sources, which may include speech, music, and background noise. The problem is often referred to as the cocktail party problem [175], as it mimics the difficulty of following a conversation in a noisy room with multiple speakers. It is particularly relevant in real-world scenarios such as phone conversations, meetings, and live events, where various extraneous sounds may contaminate the speech. Traditionally, speech separation has been studied as a signal-processing problem, with researchers developing algorithms to separate sources based on their spectral characteristics [635, 558]. However, recent advances in machine learning have led to a new approach that formulates speech separation as a supervised learning problem [181, 587, 352]. This approach has seen a significant improvement in performance with the advent of deep neural networks, which can learn complex relationships between input features and output sources.

5.10.2. Datasets
The WSJ0-2mix dataset comprises mixtures of two Wall Street Journal (WSJ) corpus speakers. It consists of a training set of 30,000 mixtures and a test set of 5,000 mixtures, and it has been widely used to evaluate speech separation algorithms. CHiME-4 contains recordings of multiple speakers in real-world environments, such as a living room, a kitchen, and a café, and is designed to test algorithms in challenging acoustic environments. TIMIT-2mix is based on the TIMIT corpus and consists of two-speaker mixtures, with a training set of 462 mixtures and a test set of 400 mixtures; it provides a more controlled environment than CHiME-4 for testing speech separation algorithms. LibriMix is derived from the LibriSpeech corpus and includes mixtures of up to four speakers, with a training set of 100,000 mixtures and a test set of 1,000 mixtures, providing a more realistic and challenging environment than WSJ0-2mix. Lastly, the MUSDB18 dataset contains mixtures of music tracks separated into individual stems, including vocals, drums, bass, and other instruments; it consists of a training set of 100 songs and a test set of 50 songs. Despite not being designed specifically for that purpose, it has been used as a benchmark for evaluating speech separation algorithms.

5.10.3. Models
Deep Clustering++ [181], first proposed in 2015, employs deep neural networks to extract features from the input signal and cluster similar feature vectors in a latent space to separate different speakers; its performance is improved using spectral masking and a permutation-invariant training method. The advantage of this model is its ability to handle multiple speakers, but it also has a high computational cost. Chimera++ [587] is another effective model that combines deep clustering with mask-inference networks in a multi-objective training scheme. The model is trained with a multitask learning approach that optimizes speech enhancement and speaker identification; Chimera++ can perform both tasks but has a relatively long training time.

TasNet v2 [352] employs a convolutional neural network (CNN) to process the input signal and generate a time-frequency mask for each source. The model is trained using a permutation invariant training (PIT) method [265], which enables it to separate multiple sources accurately.
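Permutation invariant training resolves the arbitrary ordering of estimated sources by scoring every pairing of estimates with references and keeping only the best one. The sketch below illustrates the idea with a mean-squared-error criterion; the loss choice and the brute-force search over permutations are illustrative simplifications of the method in [265].

```python
import itertools
import numpy as np

def pit_mse_loss(estimates, references):
    """Permutation-invariant MSE between estimated and reference sources.

    estimates, references: arrays of shape (n_sources, n_samples).
    Returns the lowest loss over all source orderings and the best permutation.
    """
    n = estimates.shape[0]
    best_loss, best_perm = np.inf, None
    for perm in itertools.permutations(range(n)):
        loss = np.mean((estimates[list(perm)] - references) ** 2)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm
```

Brute force is adequate for two or three sources; for larger speaker counts, assignment algorithms such as the Hungarian method (used by Hungarian PIT in Table 11) keep the search tractable.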
Table 11
Table comparing the performance of different speech separation methods using SI-SDRi
metrics on various speech separation benchmarks.
Model Architecture WSJ0-2mix WSJ0-3mix WSJ0-5mix Libri2Mix Libri5Mix Libri10Mix Libri20Mix WHAM
Separate And Diffuse [357] Diffusion 23.9 20.9 - 21.5 14.2 9 5.2 -
MossFormer (L) [663] Transformer 22.8 21.2 - - - - - -
MossFormer (M) [663] Transformer 22.5 20.8 - - - - - 17.3
SepFormer [520] Transformer 22.3 19.5 - - - - - -
Sandglasset [283] Transformer + LSTM 21.0 19.5 - - - - - -
Hungarian PIT [120] RNN - - 13.22 - 12.72 7.78 4.26 -
TDANet (L) [308] Transformer + CNN - - - 17.4 - - - 15.2
TDANet [308] Transformer + CNN - - - 16.9 - - - 14.8
Sepit [356] CNN 22.4 20.1 - - 13.7 8.2 - -
Gated DualPathRNN [387] CNN + LSTM 20.12 16.85 10.56 - - - - -
Dual-path RNN [351] LSTM 18.8 - - - - - - -
Conv-Tasnet [353] CNN 15.3 - - - - - - -
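Table 11 reports SI-SDRi, the improvement in scale-invariant signal-to-distortion ratio of the separated signal over the unprocessed mixture. A minimal reference implementation is sketched below; the zero-mean preprocessing follows the usual convention.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB between an estimated and a reference signal."""
    est = estimate - estimate.mean()
    ref = reference - reference.mean()
    # Project the estimate onto the reference to obtain the scaled target.
    target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    noise = est - target
    return 10 * np.log10((np.dot(target, target) + eps) / (np.dot(noise, noise) + eps))

def si_sdr_improvement(estimate, mixture, reference):
    # SI-SDRi: gain of the separated estimate over simply using the mixture.
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)
```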
TasNet v2 achieves state-of-the-art performance in various speech separation tasks with high separation accuracy, but its disadvantage is its relatively high computational cost. A variant of TasNet based on CNNs, called Conv-TasNet, is proposed in [353]; it generates a time-frequency mask for each source to obtain the separated signal. Compared to previous models, Conv-TasNet has a faster processing time but lower accuracy.

In recent research, encoder-decoder architectures have been explored for effectively separating source signals. One promising approach is the Hybrid TasNet architecture [613], which uses an encoder to extract features from the input signal and a decoder to generate the independent sources. This hybrid architecture captures both short-term and long-term dependencies in the input signal, leading to improved separation performance; however, its higher computational cost should be considered when selecting a separation method.

Dual-path RNN [351] uses an RNN architecture to perform speech separation. The model uses a dual-path structure [351] to capture both low-frequency and high-frequency information in the input signal and achieves impressive performance in various speech separation tasks; its disadvantage is its high computational cost. Gated DualPathRNN [387] is a variant of Dual-path RNN that employs gated recurrent units (GRUs) to improve performance. The model uses a gating mechanism to control the flow of information in the recurrent network, allowing it to capture long-term dependencies in the input signal, and it achieves state-of-the-art performance in various speech separation tasks; its disadvantage is a higher computational cost than other models.

Wavesplit [633] employs a Wave-U-Net [519] architecture to perform speech separation. The model uses a fully convolutional neural network to extract features from the input signal and generate a time-frequency mask for each source. Wavesplit achieves impressive performance in various speech separation tasks; its advantages are high separation accuracy and relatively fast processing time, while its disadvantage is relatively high memory usage.

Numerous studies have investigated the application of the Transformer architecture to speech separation. One such study is SepFormer [520], which has yielded encouraging results on the WSJ0-2mix and WSJ0-3mix datasets, as shown in Table 11. Additionally, MossFormer [663] is another cutting-edge architecture that has pushed the boundaries of monaural speech separation across multiple benchmarks. Although both models employ attention mechanisms, MossFormer integrates a blend of convolutional modules to further amplify its performance.

Diffusion models have proven highly effective in various machine learning tasks related to computer vision, as well as in speech processing. The recent DiffSep [484] model for speech separation, based on score matching of a stochastic differential equation, has shown competitive performance on the VoiceBank-DEMAND dataset. Additionally, Separate And Diffuse [357], another diffusion-based model that utilizes a pretrained diffusion model, currently represents the state of the art on several speech separation benchmarks (see Table 11). These advancements demonstrate the significant potential of diffusion models in advancing machine learning and speech processing.

5.11. Spoken Language Understanding
5.11.1. Task Description
Spoken Language Understanding (SLU) is a rapidly developing field that brings together speech processing and natural language processing to help machines comprehend human speech and respond appropriately. The ultimate goal of SLU is to bridge the gap between human and machine understanding.
Typically, SLU tasks involve identifying the domain or topic of a spoken utterance, determining the speaker's intent or goal in making the utterance, and filling in any relevant slots or variables associated with that intent. For example, consider the spoken utterance "What is the weather like in San Francisco today?" An SLU system would need to identify the domain (weather), the intent (obtaining current weather information), and the specific slot to be filled (location: San Francisco) to generate an appropriate response. By improving SLU capabilities, we can enable more effective communication between humans and machines, making interactions more natural and efficient.

Data-driven methods are frequently used to achieve these tasks, employing large datasets to train models capable of accurately recognizing and interpreting spoken language. Among these methods, machine learning techniques such as deep neural networks are widely employed, given their exceptional ability to handle complex and ambiguous speech data. The SLU task may be subdivided into the following categories for greater clarity.

• Keyword Spotting: Keyword Spotting (KS) is a technique used in speech processing to identify specific words or phrases within spoken language. It involves analysing audio recordings and detecting instances of pre-defined keywords or phrases. This technique is commonly used in applications such as voice assistants, where the system needs to recognize specific commands or questions from the user.

• Intent Classification: Intent Classification (IC) is a spoken language understanding task that involves identifying the intent behind a spoken sentence. It is usually implemented as a pipeline, with a speech recognition module followed by text processing that classifies the intents. However, end-to-end intent classification from speech has numerous advantages over the conventional pipeline of ASR followed by NLP modules.

• Slot Filling: Slot Filling (SF) is a widely used technique in spoken language understanding that extracts important information, such as names, dates, and locations, from a user's speech. The process involves identifying the specific pieces of information that are relevant to the user's request and placing them into pre-defined slots. For instance, if a user asks for the weather in a particular city, the system identifies the city name and fills it into the appropriate slot, thereby providing an accurate and relevant response.

5.11.2. Dataset
• Keyword Spotting Datasets:

– Coucke et al. [100]: This speech command recognition dataset consists of 105,000 spoken commands in English, each command being one of 35 keywords. The dataset is designed to be highly varied and challenging, with a diverse set of speakers and background noise conditions.

– Leroy et al. [300]: This is a federated-learning keyword spotting dataset composed of data from multiple sources that are trained together without sharing the raw data. It consists of audio recordings from multiple devices and environments, with the goal of improving the robustness of KS across different devices and settings.

– Auto-KWS [570]: This dataset is automatically generated using a TTS approach. It consists of 1,000 keywords spoken by 100 different synthetic voices, with variations in accent, gender, and age.

– Speech Commands [589]: This is a large-scale dataset for the KS task consisting of over 100,000 spoken commands in English, each belonging to one of 35 keywords. The dataset is specifically designed to be highly varied and challenging, with a diverse set of speakers and background noises, and it is commonly used as a benchmark dataset for KS research.

• Intent Classification and Slot Filling Datasets:

– ATIS [179]: The Airline Travel Information System (ATIS) dataset is a collection of spoken queries and responses related to airline travel, such as flight reservations, flight status, and airport information. The dataset is annotated with both intent labels (e.g., "flight booking", "flight status inquiry") and slot labels (e.g., departure city, arrival city, date). ATIS has been used extensively as a benchmark for natural language understanding models.

– SNIPS [101]: SNIPS is a dataset of voice commands designed for building natural language understanding systems. It consists of thousands of spoken requests, each annotated with the intent of the request (e.g., "play music", "set an alarm"). The dataset is widely used for training IC and SF models.

– Fluent Speech Commands [350]: This is a dataset of voice commands for controlling smart home devices such as lights, thermostats, and locks. It consists of over 15,000 spoken commands, each labeled with the intended device and action (e.g., "turn on the living room lights", "set the thermostat to 72 degrees"). The dataset is designed to include variations in speaker accent, background noise, and device placement.

– MIT-Restaurant and MIT-Movie [335]: These are two datasets created by researchers at MIT for training natural language understanding models on restaurant and movie information requests. The datasets contain spoken and text-based queries, each labeled with the intent of the request (e.g., "find a nearby Italian restaurant", "get information about the movie Inception") and the relevant slot information (e.g., restaurant type, movie name). They are widely used for benchmarking natural language understanding models.
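To illustrate the annotation scheme shared by ATIS-style corpora (one intent label per utterance plus per-token slot tags), a minimal example in the common BIO convention is shown below; the label names are illustrative rather than copied from any of the datasets above.

```python
# One ATIS-style training example: an intent label plus BIO slot tags,
# one tag per token (label names are illustrative, not copied from the corpus).
example = {
    "tokens": ["show", "flights", "from", "boston", "to", "denver", "tomorrow"],
    "intent": "flight_search",
    "slots":  ["O", "O", "O", "B-from_city", "O", "B-to_city", "B-date"],
}

def extract_slots(tokens, tags):
    """Group BIO tags into (slot_name, value) pairs."""
    spans, name, words = [], None, []
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if name:
                spans.append((name, " ".join(words)))
            name, words = tag[2:], [tok]
        elif tag.startswith("I-") and name:
            words.append(tok)
        else:
            if name:
                spans.append((name, " ".join(words)))
            name, words = None, []
    if name:
        spans.append((name, " ".join(words)))
    return spans

print(example["intent"], extract_slots(example["tokens"], example["slots"]))
# flight_search [('from_city', 'boston'), ('to_city', 'denver'), ('date', 'tomorrow')]
```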
Table 12
Comprehensive performance analysis of various models for Keyword Spotting (KS) and Slot
Filling (SF) tasks, evaluated on two benchmark datasets: Google Speech Commands for
KS and ATIS for SF.
Keyword Spotting on Google Speech Commands (Accuracy %, higher is better): Model | Reference | V1 (12) | V2 (12) | V2 (35)
Slot Filling on ATIS (F1, higher is better): Model | Reference | ATIS
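Slot filling in Table 12 is scored with span-level F1: a predicted slot counts as correct only if both its label and its value match the reference. A small sketch of the micro-averaged computation over (slot name, value) pairs, such as those produced by the extraction helper above, is given below; published evaluation scripts differ in details, so this reflects only the usual convention.

```python
def slot_f1(predicted, reference):
    """Micro-averaged span-level F1 over lists of per-utterance slot sets.

    predicted, reference: lists (one entry per utterance) of collections of
    (slot_name, value) pairs, e.g. the output of extract_slots above.
    """
    tp = fp = fn = 0
    for pred, ref in zip(predicted, reference):
        pred, ref = set(pred), set(ref)
        tp += len(pred & ref)
        fp += len(pred - ref)
        fn += len(ref - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```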
5.11.3. Models
• Keyword Spotting: The state-of-the-art techniques for keyword spotting involve deep learning models such as CNNs [469] and Transformers [37]. Wav2Keyword is a popular model based on the Wav2Vec 2.0 architecture [488] that has achieved state-of-the-art results on Speech Commands V1 and V2. Another model that achieves state-of-the-art classification accuracy on the Google Speech Commands dataset is the Keyword Transformer (KWT) [488]. KWT uses a Transformer model and achieves 98.6% and 97.7% accuracy on the 12- and 35-word tasks, respectively; it also has low latency and can be used on mobile devices.

• The DIET architecture, introduced in [48], is a Transformer-based multitask model that addresses intent classification and entity recognition simultaneously. DIET allows the seamless integration of various pre-trained embeddings such as BERT, GloVe, and ConveRT. Experiments show that DIET outperforms fine-tuned BERT while being six times faster to train.

• Chang et al. [59] investigated the effectiveness of prompt tuning on the GSLM architecture and showcased its competitiveness on various SLU tasks, such as KS, IC, and SF. Impressively, this approach achieves comparable results with fewer trainable parameters than full fine-tuning. Despite being a popular and effective technique in numerous NLP tasks, prompt tuning has not received much attention in the speech community. Additionally, other researchers have pursued a different path by combining pre-trained wav2vec 2.0 with different adapters [315] to attain state-of-the-art results.

Despite the remarkable progress made in SLU, accurately comprehending human speech in real-life situations continues to pose significant challenges, which are amplified by the presence of diverse accents, dialects, and linguistic variations. In a notable study, Vanzo et al. [552] emphasize the significance of SLU for effective human-robot interaction, particularly in the context of house service robots. The authors examine the specific obstacles encountered in this domain, which encompass handling noisy and unstructured speech, accommodating various accents and speech variations, and deciphering complex commands involving multiple actions. To overcome these obstacles, ongoing research is dedicated to developing solutions that enhance the precision and efficacy of SLU systems, with the aim of enabling more robust and accurate speech comprehension in diverse real-life scenarios.

Recent studies, including the performance analysis of different models and techniques for Keyword Spotting (KS) and Slot Filling (SF) on the Google Speech Commands and ATIS benchmarks (Table 12), have furnished valuable insights into the strengths and limitations of such approaches in SLU. Capitalizing on these findings and leveraging the latest advances in deep learning and speech recognition could help us continue to expand the frontiers of spoken language understanding and drive further innovation in this domain.

5.12. Audio/visual multimodal speech processing
The process of speech perception in humans is intricate and involves multiple sensory modalities, including auditory and visual cues. The generation of speech sounds involves articulators such as the tongue, lips, and teeth, whose movements are critical for producing different speech sounds and are visible to others. The importance of visual cues becomes more pronounced for individuals with hearing impairments who depend on lip-reading to comprehend spoken language, while individuals with normal hearing can also benefit from visual cues in noisy environments.

When investigating language comprehension and communication, it is essential to consider both auditory and visual information, as studies have demonstrated that visual information can assist in distinguishing between acoustically similar
sounds that differ in articulatory characteristics. A comprehensive understanding of the interaction between these sensory modalities can lead to the development of assistive technologies for individuals with hearing impairments and enhance communication strategies in challenging listening environments.

5.12.1. Task Description
The tasks under audiovisual multimodal processing can be subdivided into the following categories.

• Lip-reading: Lip-reading is a remarkable ability that allows us to comprehend spoken language from silent videos; however, it is a challenging task even for humans. Recent advancements in deep learning have enabled neural network-based lip-reading models that accomplish this task with high accuracy. These models take silent facial videos as input and produce the corresponding speech audio or characters as output. The potential applications of automatic lip-reading models are vast and diverse, including enabling videoconferencing in noisy environments, using surveillance videos as long-range listening devices, and facilitating conversations in noisy social settings. Developing these models could significantly improve our daily lives.

• Audiovisual speech separation: Recent years have witnessed growing interest in audiovisual speech separation, driven by the remarkable human capacity to selectively focus on a specific sound source amidst background noise, commonly known as the "cocktail party effect." This phenomenon poses a significant challenge in computer speech recognition, prompting the development of automatic speech separation techniques aimed at isolating individual speech sources from complex audio signals. In a noteworthy study, Ephrat et al. [130] proposed that audiovisual speech separation surpasses audio-only approaches by leveraging visual cues from a speaker's face to resolve ambiguity in speech signals; integrating visual information enhances the model's ability to disentangle overlapping speech signals. The implications of automatic speech separation extend across diverse applications, including assistive technologies for individuals with hearing impairments and head-mounted devices designed to facilitate effective communication in noisy meeting scenarios.

• Talking face generation: The objective of talking face generation is to generate a realistic talking face of a target character, synchronized with a given speech signal and with smooth transitions between facial images. This task has garnered substantial interest and poses a significant challenge due to the dynamic nature of facial movements, which depend on both visual information (the input face image) and acoustic information (the input speech audio) to achieve accurate lip-speech synchronization. Despite these challenges, talking face generation holds immense potential for various applications, including teleconferencing, creating virtual characters with specific facial expressions, and enhancing speech comprehension. In recent years, significant advancements have been made in this field, as evidenced by notable studies [515, 671, 65, 133, 134].

5.12.2. Datasets
Several datasets are widely used for audiovisual multimodal research, including VoxCeleb and TCD-TIMIT [173]. We briefly discuss some of them below.

• TCD-TIMIT [173]: This is an extensive and diverse audiovisual dataset that encompasses audio and video recordings of 600 distinct sentences spoken by 60 participants. The dataset features a wide range of speakers with different genders, accents, and backgrounds, making it highly suitable for talker-independent speech recognition research. The audio recordings are of exceptional quality, captured using high-fidelity microphones at a sampling rate of 48 kHz, while the video footage is of 720p resolution and includes depth information for every frame.

• Lip Reading in the Wild (LRW) [93]: LRW is a comprehensive audiovisual dataset that encompasses 500 distinct words spoken by more than 1,000 speakers. It is segmented into distinct training, evaluation, and test sets to facilitate efficient research. Additionally, the LRW-1000 dataset [617] features a 1,000-word vocabulary, and researchers can benefit from the pre-trained weights included with it, which simplify evaluation. Overall, these datasets are highly regarded in the scientific community for their size and versatility in supporting research on speech recognition and natural language processing.

• LRS2 and LRS3 10: The LRS2 and LRS3 datasets are additional examples of audiovisual speech recognition datasets gathered from videos captured in real-world settings. Each of these datasets has its own distinct train/test split and includes cropped face tracks as well as corresponding audio clips sourced from British television. Both datasets are considered to be of significant value to researchers in the field of speech recognition, particularly those focused on audiovisual analysis.
10 https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs2.html

• GRID [97]: This dataset comprises high-fidelity audio and video recordings of more than 1,000 sentences spoken by 34 distinct speakers (18 males and 16 females). The sentences follow prompts such as "put red at G9 now" and are widely employed in research on audio-visual speech separation and talking face synthesis. The dataset is considered to be of exceptional quality and is highly sought after in the scientific community.
to be of exceptional quality and is highly sought after are proposed in [10, 379, 146].
in the scientific community. The rise of Deepfake videos on the internet has led to a
surge in demand for creating realistic talking faces for various
5.12.3. Models
In recent years, there has been a remarkable surge in the development of algorithms tailored for multimodal tasks. Specifically, significant attention has been devoted to the advancement of neural networks for Text-to-Speech (TTS) applications [462, 460, 461, 251]. The integration of visual and auditory modalities through multimodal processing has played a pivotal role in enhancing various tasks relevant to our daily lives. Lip-reading, for instance, has witnessed notable progress in recent years, whether accompanied by audio or not. Son et al. have made a significant contribution to this field with their hybrid model [513]: combining convolutional neural networks (CNNs), long short-term memory (LSTM) networks, and an attention mechanism, their model captures correlations between lip videos and audio, enabling accurate character generation. Additionally, the authors introduce a new dataset called LRS, which facilitates the development of lip-reading models.
Another noteworthy model, LiRA [359], focuses on self-supervised learning for lip-reading. It leverages lip image sequences and audio waveforms to derive high-level representations during the pre-training stage, achieving word-level and sentence-level lip-reading capabilities. In the realm of capturing human emotions expressed through acoustic signals, Ephrat et al. [129] propose an innovative model that frames the task as an acoustic regression problem instead of a visual-to-text modeling approach, and their work emphasizes the advantages of this perspective. Furthermore, Vid2Speech [131], a CNN-based model, takes facial image sequences as input and generates the corresponding speech audio waveforms. It employs a two-tower CNN that processes grayscale facial images while calculating optical flow between frames. Additionally, other models, such as those based on mutual information maximization [667] and spatiotemporal fusion [653], have been proposed for the lip-reading task, further expanding the methodologies explored in this domain.
In an early attempt to develop algorithms for audiovisual speech separation, the authors of [130] proposed a CNN-based architecture that encodes facial images and speech spectrograms to compute a complex mask for speech separation; they also introduced the AVSpeech dataset in this work. AV-CVAE [394] utilizes a conditional VAE to detect the lip movements of the speaker and predict the separated speech. In a departure from speech signals, [385] focuses on audiovisual singing separation and employs a two-stream CNN architecture, Y-Net [374], to process audio and video separately; this work introduces a large dataset of solo singing videos for audiovisual singing separation. The VisualVoice [151] architecture takes a face image sequence and the mixed audio as input and predicts a complex mask; it also proposes a cross-modal embedding space to facilitate the correlation of audio and visual modalities. Finally, FaceFilter [94] uses still images as visual information, and other methods for the audiovisual speech separation task are proposed in [10, 379, 146].
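Most of the audiovisual separation models discussed here follow a fuse-and-mask recipe: a visual embedding of the target speaker is fused with an encoding of the mixture spectrogram, and the network predicts a time-frequency mask. The sketch below illustrates this recipe in a simplified, real-valued form; the feature dimensions and the LSTM fusion module are assumptions made for illustration and do not correspond to the exact architectures of [130], AV-CVAE [394], or VisualVoice [151].

```python
import torch
import torch.nn as nn

class AVMaskEstimator(nn.Module):
    """Fuse per-frame visual embeddings with the mixture spectrogram and
    predict a time-frequency mask for the target speaker (simplified, real-valued)."""

    def __init__(self, n_freq: int = 257, visual_dim: int = 512, hidden_dim: int = 256):
        super().__init__()
        self.audio_proj = nn.Linear(n_freq, hidden_dim)
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        self.fusion = nn.LSTM(2 * hidden_dim, hidden_dim, batch_first=True)
        self.mask_head = nn.Sequential(nn.Linear(hidden_dim, n_freq), nn.Sigmoid())

    def forward(self, mix_spec: torch.Tensor, face_emb: torch.Tensor) -> torch.Tensor:
        # mix_spec: (batch, time, n_freq) magnitude spectrogram of the mixture
        # face_emb: (batch, time, visual_dim) embeddings of the target speaker's face
        fused = torch.cat([self.audio_proj(mix_spec), self.visual_proj(face_emb)], dim=-1)
        out, _ = self.fusion(fused)
        mask = self.mask_head(out)
        return mask * mix_spec  # masked spectrogram of the target speaker
```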
The rise of Deepfake videos on the internet has led to a surge in demand for creating realistic talking faces for various applications, such as video production, marketing, and entertainment. Previously, the conventional approach involved manipulating 3D meshes to create specific faces, which was time-consuming and limited to certain identities. However, recent advances in deep generative models have enabled significant progress. For example, DAVS [671] introduced an end-to-end trainable deep neural network capable of learning a joint audiovisual representation, using adversarial training to disentangle the latent space. Another architecture, ATVGnet [65], consists of an audio transformation network (AT-net) and a visual generation network (VG-net) for processing acoustic and visual information, respectively; this method introduced a regression-based discriminator, a dynamically adjustable pixel-wise loss, and an attention mechanism. In [674], a novel framework for talking face generation was presented that discovers audiovisual coherence through an asymmetrical mutual information estimator. Furthermore, the authors in [133] proposed an end-to-end approach based on generative adversarial networks that uses noisy speech for talking face generation. In addition, alternative methods based on conditional recurrent adversarial networks and speech-driven talking face generation were introduced in [515, 134].

6. Advanced Transfer Learning Techniques for Speech Processing

6.1. Domain Adaptation
6.1.1. Task Description
Domain adaptation deals with adapting a model trained on a labeled dataset from a source domain to a target domain whose data distribution differs from that of the source. The goal of domain adaptation is to reduce the performance gap between the source and target domains by minimizing the difference between their distributions. In speech processing, domain adaptation has various applications, such as speech recognition [44, 396, 292, 87, 200], speaker verification [600, 76, 578, 645, 184], and speech synthesis [602, 631]. This section explores the use of domain adaptation in these tasks by reviewing recent literature on the subject. Specifically, we discuss the techniques used in domain adaptation, their effectiveness, and the challenges that arise when applying them to speech processing.

6.1.2. Models
Various techniques have been proposed to adapt a deep learning model for speech processing tasks. One such technique is reconstruction-based domain adaptation, which leverages an additional reconstruction task to generate a common representation for all the domains. The Deep Reconstruction Classification Network (DRCN) [154] is an illustration of this approach, as it addresses both tasks concurrently: (i) classification of the source data and (ii) reconstruction of the input data.
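To make the reconstruction-based idea concrete, the following is a minimal sketch in the spirit of DRCN [154]: a shared encoder feeds both a classifier trained on labeled source data and a decoder that reconstructs inputs from both domains. The network shapes and the loss weighting are illustrative assumptions rather than the configuration of [154].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReconstructionDA(nn.Module):
    """Shared encoder + source classifier + reconstruction decoder (cf. DRCN [154])."""

    def __init__(self, feat_dim: int, hidden_dim: int, num_classes: int):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.ReLU())
        self.classifier = nn.Linear(hidden_dim, num_classes)
        self.decoder = nn.Linear(hidden_dim, feat_dim)

    def forward(self, x):
        z = self.encoder(x)
        return self.classifier(z), self.decoder(z)

def drcn_step(model, src_x, src_y, tgt_x, recon_weight: float = 0.7):
    """(i) classify labeled source data; (ii) reconstruct inputs from both domains."""
    src_logits, src_recon = model(src_x)
    _, tgt_recon = model(tgt_x)
    cls_loss = F.cross_entropy(src_logits, src_y)
    recon_loss = F.mse_loss(src_recon, src_x) + F.mse_loss(tgt_recon, tgt_x)
    return cls_loss + recon_weight * recon_loss
```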
Figure 17: Transformer architecture and Adapter, Prefix Tuning, and LoRA.
The different approaches are illustrated in Figure 17 and Figure 18.

Figure 18: The architecture of the 1D convolution layer-based lightweight adapter. k is the kernel size of the 1D convolution; ∗ denotes depth-wise convolution.

Adapter Tuning. Adapters are a type of neural module that can be retrofitted onto a pre-trained language model, with significantly fewer parameters than the original model. One such type is the bottleneck or standard adapter (Houlsby et al., 2019; Pfeiffer et al., 2020) [189, 425]. The adapter takes an input vector h ∈ ℝ^d, down-projects it to a lower-dimensional space of dimensionality m (where m < d), applies a non-linear function g(⋅), and then up-projects the result back to the original d-dimensional space. Finally, the output is obtained by adding a residual connection:

h ← h + g(h W_down) W_up    (30)

where W_down ∈ ℝ^{d×m} and W_up ∈ ℝ^{m×d} are the down- and up-projection matrices, respectively. Previous studies have empirically shown that a two-layer feed-forward network with such a bottleneck is effective. In this work, we follow the experimental settings outlined in [425] for the adapter, which is inserted after the feed-forward layer of every Transformer module, as depicted in Figure 17.
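A minimal PyTorch sketch of the bottleneck adapter in Eq. (30) is shown below; the GELU non-linearity, the near-zero initialization of the up-projection, and the dimensions are illustrative choices rather than the exact settings of [189, 425].

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Bottleneck adapter: h <- h + g(h W_down) W_up (Eq. 30)."""

    def __init__(self, d_model: int, bottleneck_dim: int):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck_dim)   # W_down: d x m
        self.up = nn.Linear(bottleneck_dim, d_model)     # W_up:   m x d
        self.activation = nn.GELU()                      # non-linearity g(.)
        # Start the up-projection near zero so the adapter initially acts as an
        # identity mapping and does not disturb the pre-trained model.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(self.activation(self.down(h)))  # residual connection

# Example: insert after the feed-forward block of a Transformer layer.
adapter = BottleneckAdapter(d_model=768, bottleneck_dim=64)
hidden = torch.randn(8, 100, 768)   # (batch, frames, d_model)
out = adapter(hidden)               # same shape as the input
```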
Prefix tuning. Recent studies have suggested modifying the attention module of the Transformer model to improve its performance in natural language processing tasks. This approach involves adding learnable vectors to the pre-trained multi-head attention keys and values at every layer, as depicted in Figure 17. Specifically, two sets of learnable prefix vectors, P_K and P_V, are concatenated with the original key and value matrices K and V, while the query matrix Q remains unchanged. The resulting matrices are then used for multi-head attention, where each attention head is computed as

head = Attn(Q, [P_K; K], [P_V; V])    (31)

and Attn(⋅) is the scaled dot-product attention given by

Attn(Q, K, V) = softmax(QK^T / √d_k) V    (32)

The attention heads in each layer are thus modified by prefix tuning, with only the prefix vectors P_K and P_V being updated during training. This approach provides greater control over the transmission of acoustic information between layers and effectively activates the pre-trained model's knowledge.
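The prefix-tuning computation of Eqs. (31)-(32) can be sketched for a single attention head as follows; the prefix length, the initialization, and the single-head simplification are assumptions made for brevity rather than the exact implementation of [315].

```python
import math
import torch
import torch.nn as nn

class PrefixAttention(nn.Module):
    """Single-head attention with learnable prefix keys/values (Eqs. 31-32).
    Only P_K and P_V are trained; the projections W_q, W_k, W_v stay frozen."""

    def __init__(self, d_model: int, prefix_len: int):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        for proj in (self.w_q, self.w_k, self.w_v):
            proj.weight.requires_grad_(False)             # frozen pre-trained weights
        self.p_k = nn.Parameter(torch.randn(prefix_len, d_model) * 0.02)
        self.p_v = nn.Parameter(torch.randn(prefix_len, d_model) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b = x.size(0)
        q = self.w_q(x)                                   # queries are left unchanged
        k = torch.cat([self.p_k.expand(b, -1, -1), self.w_k(x)], dim=1)
        v = torch.cat([self.p_v.expand(b, -1, -1), self.w_v(x)], dim=1)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        return torch.softmax(scores, dim=-1) @ v          # Attn(Q, [P_K; K], [P_V; V])
```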
LoRA. LoRA is a novel approach proposed by Hu et al. (2021) [199], which aims to approximate weight updates in the Transformer by injecting trainable low-rank matrices into its layers. In this method, a pre-trained weight matrix W ∈ ℝ^{d×k} is updated by a low-rank decomposition W + ΔW = W + W_down W_up, where W_down ∈ ℝ^{d×r} and W_up ∈ ℝ^{r×k} are tunable parameters and r is the rank of the decomposition matrices, with r < d. Specifically, for a given input x to the linear projection in the multi-head attention layer, LoRA modifies the projection output h as follows:

h ← h + s · x W_down W_up    (33)

In this work, LoRA is integrated into four locations of the multi-head attention layer, as illustrated in Figure 17. Thanks to its lightweight nature, the pre-trained model can accommodate many small modules for different tasks, allowing efficient task switching by replacing the modules. Additionally, LoRA incurs no inference latency and achieves a convergence rate comparable to that of training the original model, unlike fully fine-tuned models [199].
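A sketch of the LoRA update in Eq. (33), written as a wrapper around a frozen linear projection, is given below; the rank, scaling factor, and initialization follow common practice and are not taken verbatim from [199].

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear projection with a trainable low-rank update:
    h = x W + s * x W_down W_up (Eq. 33)."""

    def __init__(self, base: nn.Linear, rank: int = 8, scale: float = 1.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)            # pre-trained W stays frozen
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.w_down = nn.Parameter(torch.randn(base.in_features, rank) * 0.02)
        self.w_up = nn.Parameter(torch.zeros(rank, base.out_features))  # start with a zero update
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.w_down @ self.w_up)

# Example: wrap the query projection of a multi-head attention layer.
q_proj = LoRALinear(nn.Linear(768, 768), rank=8)
```

Because only W_down and W_up are trained, many such modules can be stored per task and swapped in and out of the same frozen backbone.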
Convolutional Adapter. CNNs have become increasingly popular in the field of speech processing due to their ability to learn task-specific information and combine channel-wise information, achieving better performance while using fewer resources. In this approach, the ConvAdapter (Figure 18) is added at the same location as the Bottleneck Adapter (Figure 17).
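Based on the description of Figure 18, a lightweight convolutional adapter can be sketched as a pointwise down-projection, a depth-wise 1D convolution with kernel size k, and a pointwise up-projection inside a residual connection; the exact layer ordering and the choice of non-linearity below are assumptions.

```python
import torch
import torch.nn as nn

class ConvAdapter(nn.Module):
    """1D convolution-based lightweight adapter (cf. Figure 18):
    down-projection -> depth-wise 1D convolution -> up-projection, with a residual."""

    def __init__(self, d_model: int, bottleneck_dim: int, kernel_size: int = 3):
        super().__init__()
        self.down = nn.Conv1d(d_model, bottleneck_dim, kernel_size=1)
        self.depthwise = nn.Conv1d(bottleneck_dim, bottleneck_dim, kernel_size,
                                   padding=kernel_size // 2, groups=bottleneck_dim)
        self.up = nn.Conv1d(bottleneck_dim, d_model, kernel_size=1)
        self.activation = nn.GELU()

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, frames, d_model); Conv1d expects (batch, channels, frames)
        x = h.transpose(1, 2)
        x = self.up(self.activation(self.depthwise(self.down(x))))
        return h + x.transpose(1, 2)
```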
Table 13
The study evaluated various parameter-efficient training methods, including full fine-tuning, on pre-trained Wav2Vec 2.0 using the SURE benchmark. The fraction of trainable parameters is given as a percentage, with the number of trainable parameters reported for the KS task. Results on MELD are reported with weighted F1 (w-f1) as the metric, chosen to account for data imbalance; the best performance is in bold and the second best is underlined. See Li et al. (2023) [315].
Fine Tuning 315,703,947 96.53 42.93 99.00 92.36 0.2295 0.135 0.0903 99.08
Adapter 25,467,915 (8.08%) 94.07 41.58 98.87 96.32 0.2290 0.214 0.2425 99.19
Prefix Tuning 1,739,787 (0.55%) 90.00 44.21 99.73 98.49 0.2255 0.166 0.1022 98.86
LoRA 3,804,171 (1.20%) 90.00 47.05 99.00 97.61 0.2428 0.149 0.1014 98.28
ConvAdapter 2,952,539 (0.94%) 91.87 46.30 99.60 97.61 0.2456 0.2062 0.2958 98.99
Table 14
Results on the SURE benchmark for full fine-tuning and other parameter-efficient training methods on pre-trained Wav2Vec 2.0, for the IC, PR, and SF tasks on FS: Fluent Speech [350], LS: LibriSpeech [412], and SNIPS, respectively.

Method          IC (FS): #Parameters / ACC% ↑     PR (LS): #Parameters / PER ↓      SF (SNIPS): #Parameters / F1% ↑ / CER ↓
Fine-Tuning     315,707,288 / 99.60               311,304,394 / 0.0577              311,375,119 / 93.89 / 0.1411
Adapter         25,471,256 (8.06%) / 99.39        25,278,538 (8.01%) / 0.1571       25,349,263 (8.14%) / 92.60 / 0.1666
Prefix Tuning   1,743,128 (0.55%) / 93.43         1,550,410 (0.49%) / 0.1598        1,621,135 (0.50%) / 62.32 / 0.6041
LoRA            3,807,512 (1.20%) / 99.68         3,614,794 (1.16%) / 0.1053        3,685,519 (1.18%) / 90.61 / 0.2016
ConvAdapter     3,672,344 (1.16%) / 95.60         3,479,626 (1.11%) / 0.1532        3,550,351 (1.14%) / 59.27 / 0.6405
Table 15
Adapter        659,200    6.1634    0.3143    6.544     0.2504
Prefix         153,600    6.2523    0.3334    7.4264    0.3244
LoRA            81,920    6.8319    0.3786    7.0698    0.3291
ConvAdapter    108,800    6.9202    0.3365    6.9712    0.3227

Table 13, Table 14, and Table 15 present the results of various speech processing tasks in the SURE benchmark. The findings demonstrate that the adapter-based methods perform comparably to full fine-tuning. However, no particular adapter type shows a significant advantage over the others on these benchmark tasks and datasets.

6.3.2. Knowledge Distillation (KD)
Knowledge distillation involves training a smaller model to mimic the behavior of a larger and more complex model. This can be done by training the smaller model to predict the outputs of the larger model or by using the larger model's hidden representations as input to the smaller model. Knowledge distillation is effective in reducing the computational cost of training and inference.
Cho et al. [81] conducted knowledge distillation (KD) by applying it directly to the downstream task. One way to improve this approach is to use KD as pre-training for various downstream tasks, thus allowing for knowledge reuse. A noteworthy result achieved by Denisov and Vu [107] was using KD in pretraining; they achieved this by initializing an utterance encoder with a trained ASR model's backbone, followed by a trained NLU backbone. Knowledge distillation can be applied directly to a wav2vec 2.0 encoder without ASR training and a trained NLU module to enhance this method. Kim et al. [256] implemented a more complex architecture, utilizing KD in both the pretraining and fine-tuning stages.
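As an illustration of the distillation objectives discussed in this subsection, the sketch below combines a soft-target (logit) loss with a hidden-representation matching loss; the temperature, loss weighting, and choice of matched layer are illustrative assumptions rather than the settings used in [81, 107, 256].

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_hidden, teacher_hidden,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Student mimics the teacher's output distribution and hidden representations."""
    # Soft-target loss on the output distributions (the teacher is not updated).
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits.detach() / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # Hidden-representation matching (assumes equal sizes, e.g., via a projection).
    hidden = F.mse_loss(student_hidden, teacher_hidden.detach())
    return alpha * soft + (1.0 - alpha) * hidden
```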
6.3.3. Model Compression
Researchers have also explored various architectural modifications to existing models to make them more parameter-efficient. One such approach is pruning [141, 586], where, motivated by the lottery-ticket hypothesis (LTH) [140], task-irrelevant parameters are masked based on a threshold over an importance score, such as a parameter norm. Another form of compression is low-rank factorization [197], where the parameter matrices are factorized into lower-rank matrices with far fewer parameters. Finally, quantization is a popular approach to reduce model size and improve energy efficiency with a minimal performance penalty. It involves transforming 32-bit floating-point model weights into integers with fewer bits (8-bit, 4-bit, 2-bit, and even 1-bit) through scaling and shifting [619]; the quantization of the activations is handled in the same way, based on the input.
Lai et al. [280] iteratively prune and subsequently fine-tune wav2vec2.0 on downstream tasks to obtain improved results over fine-tuned wav2vec2.0. Winata et al. [593] employ low-rank transformers to cut the model size in half and increase the inference speed by 1.35 times. Peng et al. [424] employ KD and quantization to make wav2vec2.0 twice as fast, twice as energy-efficient, and 4.8 times smaller at the cost of a 7% increase in WER; without the KD step, the model is 3.6 times smaller with a mere 0.1% WER degradation.
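The two operations described above can be illustrated in their simplest form: magnitude-based pruning, which masks weights below an importance threshold, and uniform 8-bit quantization through scaling and shifting. The sparsity level and bit-width below are placeholders rather than the values used in [280, 593, 424, 619].

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """Mask out the smallest-magnitude weights (importance score = |w|)."""
    k = int(weight.numel() * sparsity)
    threshold = weight.abs().flatten().kthvalue(k).values
    mask = (weight.abs() > threshold).float()
    return weight * mask

def quantize_int8(weight: torch.Tensor):
    """Uniform 8-bit quantization through scaling and shifting."""
    w_min, w_max = weight.min(), weight.max()
    scale = (w_max - w_min) / 255.0
    q = torch.round((weight - w_min) / scale).clamp(0, 255).to(torch.uint8)
    return q, scale, w_min                       # dequantize: q * scale + w_min

# Example: compress one parameter matrix of a fine-tuned model.
w = torch.randn(1024, 768)
w_sparse = magnitude_prune(w, sparsity=0.5)
q, scale, zero_point = quantize_int8(w_sparse)
```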
7. Conclusion and Future Research Directions

The rapid advancements in deep learning techniques have revolutionized speech processing tasks, enabling significant progress in speech recognition, speaker recognition, and speech synthesis. This paper provides a comprehensive review of the latest developments in deep learning techniques for speech-processing tasks. We begin by examining the early developments in speech processing, including representation learning and HMM-based modeling, before presenting a concise summary of fundamental deep learning techniques and their applications in speech processing. Furthermore, we discuss key speech-processing tasks, highlight the datasets used in these tasks, and present the latest and most relevant research works utilizing deep learning techniques.
We envisage several lines of development in speech processing:

1. Large Speech Models: In addition to the advancements made with wav2vec2.0, further progress in the field of ASR and TTS models involves the development of larger and more comprehensive models, along with the utilization of larger datasets. By leveraging these resources, it becomes possible to create TTS models that exhibit enhanced naturalness and human-like prosody. One promising approach to achieve this is through the application of adversarial training, where a discriminator is employed to distinguish between machine-generated speech and reference speech. This adversarial framework facilitates the generation of TTS models that closely resemble human speech, providing a significant step forward in achieving more realistic and high-quality synthesized speech. By exploring these avenues, researchers aim to push the boundaries of speech synthesis technology, ultimately enhancing the overall performance and realism of TTS systems.

2. Multilingual Models: Self-supervised learning has emerged as a transformative approach in the field of speech recognition, particularly for low-resource languages characterized by scarce or unavailable labeled datasets. The recent development of the XLS-R model, a state-of-the-art self-supervised speech recognition model, represents a significant milestone in this domain. With a remarkable scale of over 2 billion parameters, the XLS-R model has been trained on a diverse dataset spanning 128 languages, surpassing its predecessor in terms of language coverage. The notable advantage of scaling up larger multilingual models like XLS-R lies in the substantial performance improvements they offer. As a result, these models are poised to outperform single-language models and hold immense promise for the future of speech recognition. By harnessing the power of self-supervised learning and leveraging multilingual datasets, the XLS-R model showcases the potential for addressing the challenges posed by low-resource languages and advancing the field of speech recognition to new heights.

3. Multimodal Speech Models: Traditional speech and text models have typically operated within a single modality, focusing solely on either speech or text inputs and outputs. However, as the scale of generative models continues to grow exponentially, the integration of multiple modalities becomes a natural progression. This trend is evident in the latest developments, such as the unveiling of groundbreaking language models like GPT-4 [407] and Kosmos-1 [207], which demonstrate the ability to process both images and text jointly. These pioneering multimodal models pave the way for the emergence of large-scale architectures that can seamlessly handle speech and other modalities in a unified manner. The convergence of multiple modalities within a single model opens up new avenues for comprehensive understanding and generation of multimodal content, and it is highly anticipated that we will witness the rapid development of large multimodal models tailored for speech and beyond in the near future.
4. In-Context Learning: Utilizing mixed-modality models opens up possibilities for the development of in-context learning approaches for a wide range of speech-related tasks. This paradigm allows the tasks to be explicitly defined within the input, along with accompanying examples. Remarkable progress has already been demonstrated in large language models (LLMs), including notable works such as InstructGPT [408], FLAN-T5 [90], and LLaMA [537]. These models showcase the efficacy of in-context learning, where the integration of context-driven information empowers the models to excel in various speech tasks. By leveraging mixed-modality models and incorporating contextual cues, researchers are advancing the boundaries of speech processing capabilities, paving the way for more versatile and context-aware speech systems.

5. Controllable Speech Generation: An intriguing application stemming from the aforementioned concept is controllable text-to-speech (TTS), which allows for fine-grained control over various attributes of the synthesized speech. Attributes such as tone, accent, age, gender, and more can be precisely controlled through in-context text guidance. This controllability in TTS opens up exciting possibilities for personalization and customization, enabling users to tailor the synthesized speech to their specific requirements. By leveraging advanced models and techniques, researchers are making significant strides in developing controllable TTS systems that provide users with a powerful and flexible speech synthesis experience.

6. Parameter-efficient Learning: With the increasing scale of LLMs and speech models, it becomes imperative to adapt these models with minimal parameter updates. This necessitates the development of specialized adapters that can efficiently update these emerging mixed-modality large models. Additionally, model compression techniques have proven to be practical solutions for addressing the challenges posed by these large models. Recent research [280, 593, 424] has demonstrated the effectiveness of model compression, highlighting the sparsity that exists within these models, particularly for specific tasks. By employing model compression techniques, researchers can reduce the computational requirements and memory footprint of these models while preserving their performance, making them more practical and accessible for real-world applications.

7. Explainability: Explainability remains elusive for these large networks as they grow. Researchers are steadfast in explaining these networks' functioning and learning dynamics. Recently, much work has been done to study the fine-tuning and in-context learning dynamics of these large models for text under the neural-tangent-kernel (NTK) asymptotic framework [366]. Such exploration is yet to be done in the speech domain. Moreover, explainability could be built in as an inductive bias in the architecture. To this end, brain-inspired architectures [382] are being developed, which may shed more light on this aspect of large models.

8. Neuroscience-inspired Architectures: In recent years, there has been significant research exploring the parallels between speech-processing architectures and the intricate workings of the human brain [382]. These studies have unveiled compelling evidence of a strong correlation between the layers of speech models and the functional hierarchy observed in the human brain. This intriguing finding has served as a catalyst for the development of neuroscience-inspired speech models that demonstrate comparable performance to state-of-the-art (SOTA) models [382]. By drawing inspiration from the underlying principles of neural processing in the human brain, these innovative speech models aim to enhance our understanding of speech perception and production while pushing the boundaries of performance in the field of speech processing.

9. Text-to-Audio Models for Text-to-Speech: Lately, transformer- and diffusion-based text-to-audio (TTA) model development has turned into an exciting area of research. Until recently, most of these models [332, 272, 611, 155, 580] overlooked speech in favour of general audio. In the future, however, such models will likely strive to be equally performant on both general audio and speech. To that end, current TTS methods will likely be an integral part of those models. Recently, Suno-AI [525] has aimed at striking a good balance between general audio and speech, although their implementation is not public, nor have they provided a detailed paper.

CRediT authorship contribution statement

Ambuj Mehrish: Conceptualization, Writing - Original Draft. Navonil Majumder: Writing - Original Draft. Rishabh Bhardwaj: Review, Editing. Rada Mihalcea: Review, Editing. Soujanya Poria: Review, Editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This research is supported by the Ministry of Education, Singapore, under its AcRF Tier-2 grant (Project no. T2MOE2008, and Grantor reference no. MOE-T2EP20220-0017), and A*STAR under its RIE 2020 AME programmatic grant (project reference no. RGAST2003). Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the Ministry of Education, Singapore.
References [18] Arık, S.Ö., Chrzanowski, M., Coates, A., Diamos, G., Gibiansky, A.,
Kang, Y., Li, X., Miller, J., Ng, A., Raiman, J., et al., 2017. Deep
[1] , 2022. Conformer-1. AssemblyAI URL: https://www.assemblyai. voice: Real-time neural text-to-speech, in: International conference
com/blog/conformer-1/. on machine learning, PMLR. pp. 195–204.
[2] , 2022. Speech recognition with conformer. Nvidia URL: [19] Audhkhasi, K., Saon, G., Tüske, Z., Kingsbury, B., Picheny, M., 2019.
https://docs.nvidia.com/tao/tao-toolkit/text/asr/speech_ Forget a bit to learn better: Soft forgetting for ctc-based automatic
recognition_with_conformer.html. speech recognition., in: Interspeech, pp. 2618–2622.
[3] Abdel-Hamid, O., Mohamed, A.r., Jiang, H., Deng, L., Penn, G., Yu, [20] Babu, A., Wang, C., Tjandra, A., Lakhotia, K., Xu, Q., Goyal, N.,
D., 2014a. Convolutional neural networks for speech recognition. Singh, K., von Platen, P., Saraf, Y., Pino, J., et al., 2021. Xls-r:
IEEE/ACM Transactions on audio, speech, and language processing Self-supervised cross-lingual speech representation learning at scale.
22, 1533–1545. arXiv preprint arXiv:2111.09296 .
[4] Abdel-Hamid, O., Mohamed, A.r., Jiang, H., Deng, L., Penn, G., [21] Badlani, R., Łańcucki, A., Shih, K.J., Valle, R., Ping, W., Catanzaro,
Yu, D., 2014b. Convolutional neural networks for speech recogni- B., 2022a. One tts alignment to rule them all, in: ICASSP 2022-
tion. IEEE/ACM Transactions on Audio, Speech, and Language 2022 IEEE International Conference on Acoustics, Speech and Signal
Processing 22, 1533–1545. doi:10.1109/TASLP.2014.2339736. Processing (ICASSP), IEEE. pp. 6092–6096.
[5] Abdel-Hamid, O., Mohamed, A.r., Jiang, H., Penn, G., 2012. Apply- [22] Badlani, R., Łańcucki, A., Shih, K.J., Valle, R., Ping, W., Catanzaro,
ing convolutional neural networks concepts to hybrid nn-hmm model B., 2022b. One tts alignment to rule them all, in: ICASSP 2022 -
for speech recognition, in: 2012 IEEE International Conference on 2022 IEEE International Conference on Acoustics, Speech and Signal
Acoustics, Speech and Signal Processing (ICASSP), pp. 4277–4280. Processing (ICASSP), pp. 6092–6096. doi:10.1109/ICASSP43922.2022.
doi:10.1109/ICASSP.2012.6288864. 9747707.
[6] Abdeljaber, O., Avci, O., Kiranyaz, S., Gabbouj, M., Inman, D.J., [23] Baevski, A., Auli, M., Mohamed, A., 2019a. Effectiveness of
2017. Real-time vibration-based structural damage detection using self-supervised pre-training for speech recognition. arXiv preprint
one-dimensional convolutional neural networks. Journal of Sound arXiv:1911.03912 .
and Vibration 388, 154–170. [24] Baevski, A., Hsu, W.N., Xu, Q., Babu, A., Gu, J., Auli, M., 2022.
[7] Abdul, Z.K., Al-Talabani, A.K., 2022. Mel frequency cepstral co- Data2vec: A general framework for self-supervised learning in
efficient and its applications: A review. IEEE Access 10, 122136– speech, vision and language, in: International Conference on Machine
122158. doi:10.1109/ACCESS.2022.3223444. Learning, PMLR. pp. 1298–1312.
[8] Abdulatif, S., Cao, R., Yang, B., 2022. Cmgan: Conformer- [25] Baevski, A., Schneider, S., Auli, M., 2019b. vq-wav2vec: Self-
based metric-gan for monaural speech enhancement. arXiv preprint supervised learning of discrete speech representations. arXiv preprint
arXiv:2209.11112 . arXiv:1910.05453 .
[9] Achanta, S., Antony, A., Golipour, L., Li, J., Raitio, T., Rasipuram, [26] Baevski, A., Zhou, Y., Mohamed, A., Auli, M., 2020. wav2vec 2.0:
R., Rossi, F., Shi, J., Upadhyay, J., Winarsky, D., et al., 2021. On- A framework for self-supervised learning of speech representations.
device neural speech synthesis, in: 2021 IEEE Automatic Speech Advances in Neural Information Processing Systems 33.
Recognition and Understanding Workshop (ASRU), IEEE. pp. 1155– [27] Bahar, P., Bieschke, T., Ney, H., 2019. A comparative study on end-
1161. to-end speech to text translation, in: 2019 IEEE Automatic Speech
[10] Afouras, T., Chung, J.S., Zisserman, A., 2018. The conversa- Recognition and Understanding Workshop (ASRU), IEEE. pp. 792–
tion: Deep audio-visual speech enhancement. arXiv preprint 799.
arXiv:1804.04121 . [28] Bahdanau, D., Cho, K., Bengio, Y., 2014. Neural machine trans-
[11] Aggarwal, V., Cotescu, M., Prateek, N., Lorenzo-Trueba, J., Barra- lation by jointly learning to align and translate. arXiv preprint
Chicote, R., 2020. Using vaes and normalizing flows for one-shot arXiv:1409.0473 .
text-to-speech synthesis of expressive speech, in: ICASSP 2020 - [29] Bai, H., Zheng, R., Chen, J., Ma, M., Li, X., Huang, L., 2022. A3t:
2020 IEEE International Conference on Acoustics, Speech and Signal Alignment-aware acoustic and text pretraining for speech synthe-
Processing (ICASSP), pp. 6179–6183. doi:10.1109/ICASSP40776.2020. sis and editing, in: International Conference on Machine Learning,
9053678. PMLR. pp. 1399–1411.
[12] Alsabhan, W., 2023. Human–computer interaction with a real-time [30] Bai, S., Kolter, J.Z., Koltun, V., 2018. An empirical evaluation of
speech emotion recognition with ensembling techniques 1d convolu- generic convolutional and recurrent networks for sequence modeling.
tion neural network and attention. Sensors 23, 1386. arXiv preprint arXiv:1803.01271 .
[13] Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J., Battenberg, [31] Bai, Z., Zhang, X.L., 2021. Speaker recognition based on deep
E., Case, C., Casper, J., Catanzaro, B., Cheng, Q., Chen, G., et al., learning: An overview. Neural Networks 140, 65–99.
2016. Deep speech 2: End-to-end speech recognition in english and [32] Bak, T., Lee, J., Bae, H., Yang, J., Bae, J.S., Joo, Y.S., 2022. Av-
mandarin, in: International conference on machine learning, PMLR. ocodo: Generative adversarial network for artifact-free vocoder.
pp. 173–182. arXiv preprint arXiv:2206.13404 .
[14] Anguera, X., Bozonnet, S., Evans, N., Fredouille, C., Friedland, G., [33] Barker, J., Watanabe, S., Vincent, E., Trmal, J., 2018. The
Vinyals, O., 2012. Speaker diarization: A review of recent research. fifth’chime’speech separation and recognition challenge: dataset,
IEEE Transactions on audio, speech, and language processing 20, task and baselines. arXiv preprint arXiv:1803.10609 .
356–370. [34] Baskar, M.K., Watanabe, S., Astudillo, R., Hori, T., Burget, L., Čer-
[15] Ansari, E., Axelrod, A., Bach, N., Bojar, O., Cattoni, R., Dalvi, nockỳ, J., 2019. Semi-supervised sequence-to-sequence asr using
F., Durrani, N., Federico, M., Federmann, C., Gu, J., et al., 2020. unpaired speech and text. arXiv preprint arXiv:1905.01152 .
Findings of the iwslt 2020 evaluation campaign, in: Proceedings of [35] Battenberg, E., Skerry-Ryan, R., Mariooryad, S., Stanton, D., Kao,
the 17th International Conference on Spoken Language Translation, D., Shannon, M., Bagby, T., 2020. Location-relative attention mecha-
pp. 1–34. nisms for robust long-form speech synthesis, in: ICASSP 2020-2020
[16] Ao, J., Wang, R., Zhou, L., Wang, C., Ren, S., Wu, Y., Liu, S., Ko, IEEE International Conference on Acoustics, Speech and Signal Pro-
T., Li, Q., Zhang, Y., et al., 2021. Speecht5: Unified-modal encoder- cessing (ICASSP), IEEE. pp. 6194–6198.
decoder pre-training for spoken language processing. arXiv preprint [36] Beerends, J.G., Schmidmer, C., Berger, J., Obermann, M., Ullmann,
arXiv:2110.07205 . R., Pomy, J., Keyhl, M., 2013. Perceptual objective listening quality
[17] Ardila, R., Branson, M., Davis, K., Henretty, M., Kohler, M., Meyer, assessment (polqa), the third generation itu-t standard for end-to-end
J., Morais, R., Saunders, L., Tyers, F.M., Weber, G., 2019. Com- speech quality measurement part i—temporal alignment. Journal of
mon voice: A massively-multilingual speech corpus. arXiv preprint the Audio Engineering Society 61, 366–384.
arXiv:1912.06670 .
[37] Berg, A., O’Connor, M., Cruz, M.T., 2021. Keyword trans- [56] Cattoni, R., Di Gangi, M.A., Bentivogli, L., Negri, M., Turchi, M.,
former: A self-attention model for keyword spotting. arXiv preprint 2021. Must-c: A multilingual corpus for end-to-end speech transla-
arXiv:2104.00769 . tion. Computer Speech & Language 66, 101155.
[38] Bińkowski, M., Donahue, J., Dieleman, S., Clark, A., Elsen, E., [57] Cauchi, B., Siedenburg, K., Santos, J.F., Falk, T.H., Doclo, S., Goetze,
Casagrande, N., Cobo, L.C., Simonyan, K., 2019a. High fi- S., 2019. Non-intrusive speech quality prediction using modula-
delity speech synthesis with adversarial networks. arXiv preprint tion energies and lstm-network. IEEE/ACM Transactions on Audio,
arXiv:1909.11646 . Speech, and Language Processing 27, 1151–1163.
[39] Bińkowski, M., Donahue, J., Dieleman, S., Clark, A., Elsen, E., [58] Chan, W., Park, D., Lee, C., Zhang, Y., Le, Q., Norouzi, M., 2021.
Casagrande, N., Cobo, L.C., Simonyan, K., 2019b. High fidelity Speechstew: Simply mix all available speech recognition data to train
speech synthesis with adversarial networks, in: International Confer- one large neural network. arXiv preprint arXiv:2104.02133 .
ence on Learning Representations. [59] Chang, K.W., et al., 2022. An exploration of prompt tuning on
[40] Birnbaum, S., Kuleshov, V., Enam, Z., Koh, P.W.W., Ermon, S., generative spoken language model for speech processing tasks. arXiv
2019. Temporal film: Capturing long-range sequence dependencies preprint arXiv:2203.16773 .
with feature-wise modulations. Advances in Neural Information [60] Chaudhari, S., Mithal, V., Polatkan, G., Ramanath, R., 2021. An
Processing Systems 32. attentive survey of attention models. ACM Transactions on Intelligent
[41] Boll, S., 1979. Suppression of acoustic noise in speech using spectral Systems and Technology (TIST) 12, 1–32.
subtraction. IEEE Transactions on acoustics, speech, and signal [61] Chen, G., Chai, S., Wang, G., Du, J., Zhang, W.Q., Weng, C., Su,
processing 27, 113–120. D., Povey, D., Trmal, J., Zhang, J., et al., 2021a. Gigaspeech: An
[42] Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., von evolving, multi-domain asr corpus with 10,000 hours of transcribed
Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., et al., audio. arXiv preprint arXiv:2106.06909 .
2021. On the opportunities and risks of foundation models. arXiv [62] Chen, J., Ma, M., Zheng, R., Huang, L., 2020a. Mam: Masked
preprint arXiv:2108.07258 . acoustic modeling for end-to-end speech-to-text translation. arXiv
[43] Bourlard, H.A., Morgan, N., 1994. Connectionist speech recognition: preprint arXiv:2010.11445 .
a hybrid approach. volume 247. Springer Science & Business Media. [63] Chen, J., Ma, M., Zheng, R., Huang, L., 2021b. Specrec: An alterna-
[44] Bousquet, P.M., Rouvier, M., 2019. On robustness of unsupervised tive solution for improving end-to-end speech-to-text translation via
domain adaptation for speaker recognition, in: Interspeech. spectrogram reconstruction., in: Interspeech, pp. 2232–2236.
[45] Bredin, H., Laurent, A., 2021. End-to-end speaker segmentation for [64] Chen, J., Tan, X., Leng, Y., Xu, J., Wen, G., Qin, T., Liu, T.Y., 2021c.
overlap-aware resegmentation. arXiv preprint arXiv:2104.04045 . Speech-t: Transducer for text to speech and beyond. Advances in
[46] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhari- Neural Information Processing Systems 34, 6621–6633.
wal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al., [65] Chen, L., Maddox, R.K., Duan, Z., Xu, C., 2019a. Hierarchical
2020. Language models are few-shot learners. Advances in neural cross-modal talking face generation with dynamic pixel-wise loss, in:
information processing systems 33, 1877–1901. Proceedings of the IEEE/CVF conference on computer vision and
[47] Bullock, L., Bredin, H., Garcia-Perera, L.P., 2020. Overlap-aware di- pattern recognition, pp. 7832–7841.
arization: Resegmentation using neural end-to-end overlapped speech [66] Chen, M., Tan, X., Li, B., Liu, Y., Qin, T., Zhao, S., Liu, T.Y., 2021d.
detection, in: ICASSP 2020-2020 IEEE International Conference Adaspeech: Adaptive text to speech for custom voice. arXiv preprint
on Acoustics, Speech and Signal Processing (ICASSP), IEEE. pp. arXiv:2103.00993 .
7114–7118. [67] Chen, N., Zhang, Y., Zen, H., Weiss, R.J., Norouzi, M., Chan, W.,
[48] Bunk, T., Varshneya, D., Vlasov, V., Nichol, A., 2020. Diet: 2020b. Wavegrad: Estimating gradients for waveform generation.
Lightweight language understanding for dialogue systems. arXiv arXiv preprint arXiv:2009.00713 .
preprint arXiv:2004.09936 . [68] Chen, N., Zhang, Y., Zen, H., Weiss, R.J., Norouzi, M., Chan, W.,
[49] Burchi, M., Timofte, R., 2023. Audio-visual efficient conformer for 2020c. Wavegrad: Estimating gradients for waveform generation, in:
robust speech recognition, in: Proceedings of the IEEE/CVF Winter International Conference on Learning Representations.
Conference on Applications of Computer Vision, pp. 2258–2267. [69] Chen, N., Zhang, Y., Zen, H., Weiss, R.J., Norouzi, M., Dehak, N.,
[50] Butryna, A., Chu, S.H.C., Demirsahin, I., Gutkin, A., Ha, L., He, Chan, W., 2021e. Wavegrad 2: Iterative refinement for text-to-speech
F., Jansche, M., Johny, C., Katanova, A., Kjartansson, O., et al., synthesis. arXiv preprint arXiv:2106.09660 .
2020. Google crowdsourced speech corpora and related open-source [70] Chen, Q., Zhuo, Z., Wang, W., 2019b. Bert for joint intent classifica-
resources for low-resource languages and dialects: an overview. arXiv tion and slot filling. arXiv preprint arXiv:1902.10909 .
preprint arXiv:2010.06778 . [71] Chen, S., Wang, C., Chen, Z., Wu, Y., Liu, S., Chen, Z., Li, J., Kanda,
[51] C S, A., A P, P., Ramakrishnan, A.G., 2021. Unsupervised domain N., Yoshioka, T., Xiao, X., et al., 2022a. Wavlm: Large-scale self-
adaptation schemes for building asr in low-resource languages, in: supervised pre-training for full stack speech processing. IEEE Journal
2021 IEEE Automatic Speech Recognition and Understanding Work- of Selected Topics in Signal Processing 16, 1505–1518.
shop (ASRU), pp. 342–349. doi:10.1109/ASRU51503.2021.9688269. [72] Chen, S., Wu, Y., Wang, C., Chen, Z., Chen, Z., Liu, S., Wu, J., Qian,
[52] Campbell, W., Campbell, J., Reynolds, D., Jones, D., Leek, T., 2003. Y., Wei, F., Li, J., et al., 2022b. Unispeech-sat: Universal speech
Phonetic speaker recognition with support vector machines. Advances representation learning with speaker aware pre-training, in: ICASSP
in neural information processing systems 16. 2022-2022 IEEE International Conference on Acoustics, Speech and
[53] Carletta, J., Ashby, S., Bourban, S., Flynn, M., Guillemot, M., Hain, Signal Processing (ICASSP), IEEE. pp. 6152–6156.
T., Kadlec, J., Karaiskos, V., Kraaij, W., Kronenthal, M., et al., 2006. [73] Chen, Y., Guo, W., Gu, B., 2021f. Improved meta-learning training
The ami meeting corpus: A pre-announcement, in: Machine Learning for speaker verification. arXiv preprint arXiv:2103.15421 .
for Multimodal Interaction: Second International Workshop, MLMI [74] Chen, Z., Tan, X., Wang, K., Pan, S., Mandic, D., He, L., Zhao,
2005, Edinburgh, UK, July 11-13, 2005, Revised Selected Papers 2, S., 2022c. Infergrad: Improving diffusion models for vocoder by
Springer. pp. 28–39. considering inference in training, in: ICASSP 2022-2022 IEEE In-
[54] Castiglioni, P., 2005. Levinson-durbin algorithm. Encyclopedia of ternational Conference on Acoustics, Speech and Signal Processing
Biostatistics 4. (ICASSP), IEEE. pp. 8432–8436.
[55] Catellier, A.A., Voran, S.D., 2020. Wawenets: A no-reference con- [75] Chen, Z., Wang, S., Qian, Y., 2020d. Adversarial domain adap-
volutional waveform-based approach to estimating narrowband and tation for speaker verification using partially shared network., in:
wideband speech quality, in: ICASSP 2020-2020 IEEE International Interspeech, pp. 3017–3021.
Conference on Acoustics, Speech and Signal Processing (ICASSP), [76] Chen, Z., Wang, S., Qian, Y., 2021g. Self-supervised learning based
IEEE. pp. 331–335. domain adaptation for robust speaker verification, in: ICASSP 2021-
2021 IEEE International Conference on Acoustics, Speech and Signal [96] Chung, Y.A., Zhang, Y., Han, W., Chiu, C.C., Qin, J., Pang, R., Wu,
Processing (ICASSP), IEEE. pp. 5834–5838. Y., 2021. W2v-bert: Combining contrastive learning and masked
[77] Chen, Z., Watanabe, S., Erdogan, H., Hershey, J.R., 2015. Speech language modeling for self-supervised speech pre-training, in: 2021
enhancement and recognition using multi-task learning of long short- IEEE Automatic Speech Recognition and Understanding Workshop
term memory recurrent neural networks, in: Sixteenth Annual Con- (ASRU), IEEE. pp. 244–250.
ference of the International Speech Communication Association. [97] Cooke, M., Barker, J., Cunningham, S., Shao, X., 2006. An audio-
[78] Chiu, C.C., Qin, J., Zhang, Y., Yu, J., Wu, Y., 2022. Self-supervised visual corpus for speech perception and automatic speech recognition.
learning with random-projection quantizer for speech recognition, in: The Journal of the Acoustical Society of America 120, 2421–2424.
International Conference on Machine Learning, PMLR. pp. 3915– [98] Coria, J.M., Bredin, H., Ghannay, S., Rosset, S., 2021. Overlap-
3924. aware low-latency online speaker diarization based on end-to-end
[79] Chiu, C.C., Raffel, C., 2017. Monotonic chunkwise attention. arXiv local segmentation, in: 2021 IEEE Automatic Speech Recognition
preprint arXiv:1712.05382 . and Understanding Workshop (ASRU), IEEE. pp. 1139–1146.
[80] Cho, K., Courville, A., Bengio, Y., 2015. Describing multimedia [99] Coto-Jiménez, M., 2019. Improving post-filtering of artificial speech
content using attention-based encoder-decoder networks. IEEE Trans- using pre-trained lstm neural networks. Biomimetics 4, 39.
actions on Multimedia 17, 1875–1886. [100] Coucke, A., Chlieh, M., Gisselbrecht, T., Leroy, D., Poumeyrol, M.,
[81] Cho, W.I., Kwak, D., Yoon, J., Kim, N.S., 2020. Speech to text adap- Lavril, T., 2019. Efficient keyword spotting using dilated convolutions
tation: Towards an efficient cross-modal distillation, in: Interspeech. and gating, in: ICASSP 2019-2019 IEEE International Conference
[82] Choi, H.S., Lee, J., Kim, W., Lee, J., Heo, H., Lee, K., 2021. Neural on Acoustics, Speech and Signal Processing (ICASSP), IEEE. pp.
analysis and synthesis: Reconstructing speech from self-supervised 6351–6355.
representations. Advances in Neural Information Processing Systems [101] Coucke, A., Saade, A., Ball, A., Bluche, T., Caulier, A., Leroy, D.,
34, 16251–16265. Doumouro, C., Gisselbrecht, T., Caltagirone, F., Lavril, T., et al.,
[83] Choi, H.S., Yang, J., Lee, J., Kim, H., 2022. Nansy++: Unified 2018. Snips voice platform: an embedded spoken language under-
voice synthesis with neural analysis and synthesis. arXiv preprint standing system for private-by-design voice interfaces. arXiv preprint
arXiv:2211.09407 . arXiv:1805.10190 .
[84] Chorowski, J., Weiss, R.J., Bengio, S., Van Den Oord, A., 2019. [102] Dauphin, Y.N., Fan, A., Auli, M., Grangier, D., 2017. Language mod-
Unsupervised speech representation learning using wavenet autoen- eling with gated convolutional networks, in: International conference
coders. IEEE/ACM transactions on audio, speech, and language on machine learning, PMLR. pp. 933–941.
processing 27, 2041–2053. [103] Deng, J., Guo, J., Xue, N., Zafeiriou, S., 2019. Arcface: Additive
[85] Chorowski, J.K., Bahdanau, D., Serdyuk, D., Cho, K., Bengio, Y., angular margin loss for deep face recognition, in: Proceedings of the
2015. Attention-based models for speech recognition. Advances in IEEE/CVF conference on computer vision and pattern recognition,
neural information processing systems 28. pp. 4690–4699.
[86] Chou, J.c., Yeh, C.c., Lee, H.y., 2019. One-shot voice conversion by [104] Deng, K., Cao, S., Zhang, Y., Ma, L., 2021. Improving hybrid
separating speaker and content representations with instance normal- ctc/attention end-to-end speech recognition with pretrained acoustic
ization. arXiv preprint arXiv:1904.05742 . and language models, in: 2021 IEEE Automatic Speech Recogni-
[87] Chowdhury, A., Cozzo, A., Ross, A., 2022. Domain adaptation for tion and Understanding Workshop (ASRU), pp. 76–82. doi:10.1109/
speaker recognition in singing and spoken voice, in: ICASSP 2022- ASRU51503.2021.9688009.
2022 IEEE International Conference on Acoustics, Speech and Signal [105] Deng, K., Cao, S., Zhang, Y., Ma, L., Cheng, G., Xu, J., Zhang,
Processing (ICASSP), IEEE. pp. 7192–7196. P., 2022a. Improving ctc-based speech recognition via knowledge
[88] Chung, H., Jeon, H.B., Park, J.G., 2020a. Semi-supervised training transferring from pre-trained language models, in: ICASSP 2022 -
for sequence-to-sequence speech recognition using reinforcement 2022 IEEE International Conference on Acoustics, Speech and Signal
learning, in: 2020 International Joint Conference on Neural Networks Processing (ICASSP), pp. 8517–8521. doi:10.1109/ICASSP43922.2022.
(IJCNN), IEEE. pp. 1–6. 9747887.
[89] Chung, H., Jeon, H.B., Park, J.G., 2020b. Semi-supervised training [106] Deng, K., Cao, S., Zhang, Y., Ma, L., Cheng, G., Xu, J., Zhang,
for sequence-to-sequence speech recognition using reinforcement P., 2022b. Improving ctc-based speech recognition via knowledge
learning, in: 2020 International Joint Conference on Neural Networks transferring from pre-trained language models, in: ICASSP 2022-
(IJCNN), pp. 1–6. doi:10.1109/IJCNN48605.2020.9207023. 2022 IEEE International Conference on Acoustics, Speech and Signal
[90] Chung, H.W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Processing (ICASSP), IEEE. pp. 8517–8521.
Li, E., Wang, X., Dehghani, M., Brahma, S., Webson, A., Gu, S.S., [107] Denisov, P., Vu, N.T., 2020. Pretrained semantic speech embed-
Dai, Z., Suzgun, M., Chen, X., Chowdhery, A., Valter, D., Narang, dings for end-to-end spoken language understanding via cross-modal
S., Mishra, G., Yu, A.W., Zhao, V., Huang, Y., Dai, A.M., Yu, H., teacher-student learning, in: Interspeech.
Petrov, S., hsin Chi, E.H., Dean, J., Devlin, J., Roberts, A., Zhou, [108] Desplanques, B., Thienpondt, J., Demuynck, K., 2020. Ecapa-tdnn:
D., Le, Q.V., Wei, J., 2022. Scaling instruction-finetuned language Emphasized channel attention, propagation and aggregation in tdnn
models. ArXiv abs/2210.11416. based speaker verification. arXiv preprint arXiv:2005.07143 .
[91] Chung, J.S., Huh, J., Mun, S., Lee, M., Heo, H.S., Choe, S., Ham, C., [109] Devlin, J., Chang, M.W., Lee, K., Toutanova, K., 2018. Bert: Pre-
Jung, S., Lee, B.J., Han, I., 2020c. In defence of metric learning for training of deep bidirectional transformers for language understand-
speaker recognition. arXiv preprint arXiv:2003.11982 . ing. arXiv preprint arXiv:1810.04805 .
[92] Chung, J.S., Nagrani, A., Zisserman, A., 2018. Voxceleb2: Deep [110] Di Gangi, M.A., Negri, M., Turchi, M., 2019. Adapting transformer
speaker recognition, in: INTERSPEECH. to end-to-end spoken language translation, in: Proceedings of INTER-
[93] Chung, J.S., Zisserman, A., 2017. Lip reading in the wild, in: Com- SPEECH 2019. International Speech Communication Association
puter Vision–ACCV 2016: 13th Asian Conference on Computer (ISCA), pp. 1133–1137.
Vision, Taipei, Taiwan, November 20-24, 2016, Revised Selected [111] Diez, M., Burget, L., Landini, F., Wang, S., Černockỳ, H., 2020.
Papers, Part II 13, Springer. pp. 87–103. Optimizing bayesian hmm based x-vector clustering for the second
[94] Chung, S.W., Choe, S., Chung, J.S., Kang, H.G., 2020d. Facefilter: dihard speech diarization challenge, in: ICASSP 2020-2020 IEEE
Audio-visual speech separation using still images. arXiv preprint International Conference on Acoustics, Speech and Signal Processing
arXiv:2005.07074 . (ICASSP), IEEE. pp. 6519–6523.
[95] Chung, Y.A., Hsu, W.N., Tang, H., Glass, J., 2019. An unsuper- [112] Dingliwal, S., Shenoy, A., Bodapati, S., Gandhe, A., Gadde, R.T.,
vised autoregressive model for speech representation learning. arXiv Kirchhoff, K., 2021. Prompt-tuning in asr systems for efficient
preprint arXiv:1904.03240 . domain-adaptation. arXiv preprint arXiv:2110.06502 .
[113] Donahue, C., McAuley, J., Puckette, M., 2018. Adversarial audio lenges. IEEE Signal Processing Magazine 39, 42–62.
synthesis. arXiv preprint arXiv:1802.04208 . [133] Eskimez, S.E., Maddox, R.K., Xu, C., Duan, Z., 2020. End-to-end
[114] Donahue, J., Dieleman, S., Binkowski, M., Elsen, E., Simonyan, generation of talking faces from noisy speech, in: ICASSP 2020-
K., 2020a. End-to-end adversarial text-to-speech, in: International 2020 IEEE international conference on acoustics, speech and signal
Conference on Learning Representations. processing (ICASSP), IEEE. pp. 1948–1952.
[115] Donahue, J., Dieleman, S., Bińkowski, M., Elsen, E., Simonyan, [134] Eskimez, S.E., Zhang, Y., Duan, Z., 2021. Speech driven talking
K., 2020b. End-to-end adversarial text-to-speech. arXiv preprint face generation from a single image and an emotion condition. IEEE
arXiv:2006.03575 . Transactions on Multimedia 24, 3480–3490.
[116] Dong, L., Xu, S., Xu, B., 2018. Speech-transformer: A no-recurrence [135] Fan, Y., Kang, J., Li, L., Li, K., Chen, H., Cheng, S., Zhang, P., Zhou,
sequence-to-sequence model for speech recognition, in: 2018 IEEE Z., Cai, Y., Wang, D., 2020. Cn-celeb: a challenging chinese speaker
International Conference on Acoustics, Speech and Signal Processing recognition dataset, in: ICASSP 2020-2020 IEEE International Con-
(ICASSP), pp. 5884–5888. doi:10.1109/ICASSP.2018.8462506. ference on Acoustics, Speech and Signal Processing (ICASSP), IEEE.
[117] Dong, P., Wang, S., Niu, W., Zhang, C., Lin, S., Li, Z., Gong, Y., pp. 7604–7608.
Ren, B., Lin, X., Tao, D., 2020. Rtmobile: Beyond real-time mobile [136] Fan, Y., Qian, Y., Xie, F.L., Soong, F.K., 2014. Tts synthesis with bidi-
acceleration of rnns for speech recognition, in: 2020 57th ACM/IEEE rectional lstm based recurrent neural networks, in: Fifteenth annual
Design Automation Conference (DAC), IEEE. pp. 1–6. conference of the international speech communication association.
[118] Dong, X., Williamson, D.S., 2020a. An attention enhanced multi-task [137] Fazel, A., Yang, W., Liu, Y., Barra-Chicote, R., Meng, Y., Maas,
H., Matusevych, S., Aichner, R., Aazami, A., Braun, S., et al., 2020. Recent advances in recurrent neural networks. arXiv preprint
The interspeech 2020 deep noise suppression challenge: Datasets, arXiv:1801.01078 .
subjective testing framework, and challenge results. arXiv preprint [480] Salesky, E., Sperber, M., Black, A.W., 2019. Exploring phoneme-
arXiv:2005.13981 . level speech representations for end-to-end speech translation. arXiv
[460] Ren, Y., Hu, C., Tan, X., Qin, T., Zhao, S., Zhao, Z., Liu, T.Y., 2020. preprint arXiv:1906.01199 .
Fastspeech 2: Fast and high-quality end-to-end text to speech. arXiv [481] Sathyendra, K.M., Muniyappa, T., Chang, F.J., Liu, J., Su, J., Strimel,
preprint arXiv:2006.04558 . G.P., Mouchtaris, A., Kunzmann, S., 2022. Contextual adapters for
[461] Ren, Y., Liu, J., Zhao, Z., 2021. Portaspeech: Portable and high- personalized speech recognition in neural transducers, in: ICASSP
quality generative text-to-speech. Advances in Neural Information 2022-2022 IEEE International Conference on Acoustics, Speech and
Processing Systems 34, 13963–13974. Signal Processing (ICASSP), IEEE. pp. 8537–8541.
[462] Ren, Y., Ruan, Y., Tan, X., Qin, T., Zhao, S., Zhao, Z., Liu, T.Y., 2019. [482] Scalart, P., et al., 1996. Speech enhancement based on a priori
Fastspeech: Fast, robust and controllable text to speech. Advances in signal to noise estimation, in: 1996 IEEE International Conference
neural information processing systems 32. on Acoustics, Speech, and Signal Processing Conference Proceedings,
[463] Reynolds, D.A., 2003. Channel robust speaker verification via feature IEEE. pp. 629–632.
mapping, in: 2003 IEEE International Conference on Acoustics, [483] Scarton, C., Forcada, M.L., Espla-Gomis, M., Specia, L., 2019. Es-
Speech, and Signal Processing, 2003. Proceedings.(ICASSP’03)., timating post-editing effort: a study on human judgements, task-
IEEE. pp. II–53. based and reference-based metrics of mt quality. arXiv preprint
[464] Rho, D., Park, J., Ko, J.H., 2022. Nas-vad: Neural architecture search arXiv:1910.06204 .
for voice activity detection. arXiv preprint arXiv:2201.09032 . [484] Scheibler, R., Ji, Y., Chung, S.W., Byun, J., Choe, S., Choi, M.S.,
[465] Richey, C., Barrios, M.A., Armstrong, Z., Bartels, C., Franco, H., 2022. Diffusion-based generative speech source separation. arXiv
Graciarena, M., Lawson, A., Nandwana, M.K., Stauffer, A., van Hout, preprint arXiv:2210.17327 .
J., et al., 2018. Voices obscured in complex environmental settings [485] Schneider, S., Baevski, A., Collobert, R., Auli, M., 2019. wav2vec:
(voices) corpus. arXiv preprint arXiv:1804.05053 . Unsupervised pre-training for speech recognition. arXiv preprint
[466] Richter, J., Carbajal, G., Gerkmann, T., 2020. Speech enhancement arXiv:1904.05862 .
with stochastic temporal convolutional networks., in: Interspeech, pp. [486] Schroff, F., Kalenichenko, D., Philbin, J., 2015. Facenet: A unified
4516–4520. embedding for face recognition and clustering, in: Proceedings of
[467] Riviere, M., Joulin, A., Mazaré, P.E., Dupoux, E., 2020. Unsu- the IEEE conference on computer vision and pattern recognition, pp.
pervised pretraining transfers well across languages, in: ICASSP 815–823.
2020-2020 IEEE International Conference on Acoustics, Speech and [487] Schuster, M., Paliwal, K.K., 1997. Bidirectional recurrent neural
Signal Processing (ICASSP), IEEE. pp. 7414–7418. networks. IEEE transactions on Signal Processing 45, 2673–2681.
[468] Rix, A.W., Beerends, J.G., Hollier, M.P., Hekstra, A.P., 2001. Per- [488] Seo, D., Oh, H.S., Jung, Y., 2021. Wav2kws: Transfer learning
ceptual evaluation of speech quality (pesq)-a new method for speech from speech representations for keyword spotting. IEEE Access 9,
quality assessment of telephone networks and codecs, in: 2001 IEEE 80682–80691.
international conference on acoustics, speech, and signal processing. [489] Serrà, J., Pascual, S., Pons, J., Araz, R.O., Scaini, D., 2022. Univer-
Proceedings (Cat. No. 01CH37221), IEEE. pp. 749–752. sal speech enhancement with score-based diffusion. arXiv preprint
[469] Rostami, A.M., Karimi, A., Akhaee, M.A., 2022. Keyword spotting arXiv:2206.03065 .
in continuous speech using convolutional neural network. Speech [490] Serrà, J., Pons, J., Pascual, S., 2021. Sesqa: semi-supervised learning
Communication 142, 15–21. for speech quality assessment, in: ICASSP 2021-2021 IEEE Inter-
[470] Rousseau, A., Deléglise, P., Esteve, Y., 2012. Ted-lium: an automatic national Conference on Acoustics, Speech and Signal Processing
speech recognition dedicated corpus., in: LREC, pp. 125–129. (ICASSP), IEEE. pp. 381–385.
[471] Routray, S., Mao, Q., 2022. Phase sensitive masking-based single [491] Sertolli, B., Ren, Z., Schuller, B.W., Cummins, N., 2021. Repre-
channel speech enhancement using conditional generative adversarial sentation transfer learning from deep end-to-end speech recognition
network. Computer Speech & Language 71, 101270. networks for the classification of health states from speech. Computer
[472] Rouvier, M., Dufour, R., Bousquet, P.M., 2021. Review of different Speech & Language 68, 101204.
robust x-vector extractors for speaker verification, in: 2020 28th [492] Shen, J., Jia, Y., Chrzanowski, M., Zhang, Y., Elias, I., Zen, H., Wu,
European Signal Processing Conference (EUSIPCO), pp. 1–5. doi:10. Y., 2020. Non-attentive tacotron: Robust and controllable neural tts
23919/Eusipco47968.2020.9287426. synthesis including unsupervised duration modeling. arXiv preprint
[473] Ryant, N., Church, K., Cieri, C., Cristia, A., Du, J., Ganapathy, S., arXiv:2010.04301 .
Liberman, M., 2018. First dihard challenge evaluation plan. 2018, [493] Shen, J., Pang, R., Weiss, R.J., Schuster, M., Jaitly, N., Yang, Z., Chen,
tech. Rep. . Z., Zhang, Y., Wang, Y., Skerrv-Ryan, R., et al., 2018. Natural tts
[474] Ryant, N., Church, K., Cieri, C., Cristia, A., Du, J., Ganapathy, synthesis by conditioning wavenet on mel spectrogram predictions,
S., Liberman, M., 2019. The second dihard diarization challenge: in: 2018 IEEE international conference on acoustics, speech and
Dataset, task, and baselines. arXiv preprint arXiv:1906.07839 . signal processing (ICASSP), IEEE. pp. 4779–4783.
[475] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Lau- [494] Shi, Y., Wang, Y., Wu, C., Fuegen, C., Zhang, F., Le, D., Yeh, C.F.,
renzo, S., 2020. Streaming keyword spotting on mobile devices. Seltzer, M.L., 2020. Weak-attention suppression for transformer
arXiv preprint arXiv:2005.06720 . based speech recognition. arXiv preprint arXiv:2005.09137 .
[476] Sadhu, S., He, D., Huang, C.W., Mallidi, S.H., Wu, M., Rastrow, [495] Shih, K.J., Valle, R., Badlani, R., Lancucki, A., Ping, W., Catanzaro,
A., Stolcke, A., Droppo, J., Maas, R., 2021. Wav2vec-c: A self- B., 2021. Rad-tts: Parallel flow-based tts with robust alignment learn-
ing and diverse synthesis, in: ICML Workshop on Invertible Neural ation by conditional recurrent adversarial network. arXiv preprint
Networks, Normalizing Flows, and Explicit Likelihood Models. arXiv:1804.04786 .
[496] Shim, H.J., Heo, J., Park, J.H., Lee, G.H., Yu, H.J., 2022. Graph [516] Soni, M.H., Patil, H.A., 2016. Novel deep autoencoder features for
attentive feature aggregation for text-independent speaker verification, non-intrusive speech quality assessment, in: 2016 24th European
in: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Signal Processing Conference (EUSIPCO), IEEE. pp. 2315–2319.
Speech and Signal Processing (ICASSP), pp. 7972–7976. doi:10. [517] Sorin, A., Shechtman, S., Hoory, R., 2020. Principal style compo-
1109/ICASSP43922.2022.9746257. nents: Expressive style control and cross-speaker transfer in neural
[497] Sicherman, A., Adi, Y., 2023. Analysing discrete self supervised tts., in: INTERSPEECH, pp. 3411–3415.
speech representation for spoken language modeling. arXiv preprint [518] Sperber, M., Paulik, M., 2020. Speech translation and the end-
arXiv:2301.00591 . to-end promise: Taking stock of where we are. arXiv preprint
[498] Simić, N., Suzić, S., Nosek, T., Vujović, M., Perić, Z., Savić, M., arXiv:2004.06358 .
Delić, V., 2022. Speaker recognition using constrained convolutional [519] Stoller, D., Ewert, S., Dixon, S., 2018. Wave-u-net: A multi-scale
neural networks in emotional speech. Entropy 24, 414. neural network for end-to-end audio source separation. arXiv preprint
[499] Simply, R.M., Dafna, E., Zigel, Y., 2019. Diagnosis of obstructive arXiv:1806.03185 .
sleep apnea using speech signals from awake subjects. IEEE Journal [520] Subakan, C., Ravanelli, M., Cornell, S., Bronzi, M., Zhong, J., 2021.
of Selected Topics in Signal Processing 14, 251–260. Attention is all you need in speech separation, in: ICASSP 2021-2021
[500] Singh, G., Sharma, S., Kumar, V., Kaur, M., Baz, M., Masud, M., IEEE International Conference on Acoustics, Speech and Signal
2021. Spoken language identification using deep learning. Computa- Processing (ICASSP), IEEE. pp. 21–25.
tional Intelligence and Neuroscience 2021. [521] Sukhadia, V.N., Umesh, S., 2023. Domain adaptation of low-resource
[501] Singh, P., Ganapathy, S., 2021. Self-supervised metric learning with target-domain models using well-trained asr conformer models, in:
graph clustering for speaker diarization, in: 2021 IEEE Automatic 2022 IEEE Spoken Language Technology Workshop (SLT), IEEE.
Speech Recognition and Understanding Workshop (ASRU), pp. 90– pp. 295–301.
97. doi:10.1109/ASRU51503.2021.9688271. [522] Sun, A., Wang, J., Cheng, N., Peng, H., Zeng, Z., Kong, L., Xiao,
[502] Singh, P., Kaul, A., Ganapathy, S., 2023. Supervised hierarchical J., 2021. Graphpb: Graphical representations of prosody boundary
clustering using graph neural networks for speaker diarization. arXiv in speech synthesis, in: 2021 IEEE Spoken Language Technology
preprint arXiv:2302.12716 . Workshop (SLT), IEEE. pp. 438–445.
[503] Singh, S., Wang, R., Hou, F., 2022. Improved meta learning for [523] Sun, A., Wang, J., Cheng, N., Peng, H., Zeng, Z., Xiao, J., 2020.
low resource speech recognition, in: ICASSP 2022-2022 IEEE In- Graphtts: Graph-to-sequence modelling in neural text-to-speech, in:
ternational Conference on Acoustics, Speech and Signal Processing ICASSP 2020 - 2020 IEEE International Conference on Acoustics,
(ICASSP), IEEE. pp. 4798–4802. Speech and Signal Processing (ICASSP), pp. 6719–6723. doi:10.
[504] Siuzdak, H., Dura, P., van Rijn, P., Jacoby, N., 2022. Wavthruvec: La- 1109/ICASSP40776.2020.9053355.
tent speech representation as intermediate features for neural speech [524] Sung, T.W., Liu, J.Y., Lee, H.y., Lee, L.s., 2019. Towards end-to-
synthesis. arXiv preprint arXiv:2203.16930 . end speech-to-text translation with two-pass decoding, in: ICASSP
[505] Skerry-Ryan, R., Battenberg, E., Xiao, Y., Wang, Y., Stanton, D., 2019-2019 IEEE International Conference on Acoustics, Speech and
Shor, J., Weiss, R., Clark, R., Saurous, R.A., 2018. Towards end-to- Signal Processing (ICASSP), IEEE. pp. 7175–7179.
end prosody transfer for expressive speech synthesis with tacotron, [525] Suno-AI, 2023. Bark. URL: https://github.com/suno-ai/bark.
in: international conference on machine learning, PMLR. pp. 4693– [526] Synnaeve, G., Xu, Q., Kahn, J., Likhomanenko, T., Grave, E., Pratap,
4702. V., Sriram, A., Liptchinsky, V., Collobert, R., 2020. End-to-end asr:
[506] Smith, N., Gales, M., 2001. Speech recognition using svms. Advances from supervised to semi-supervised learning with modern architec-
in neural information processing systems 14. tures, in: ICML 2020 Workshop on Self-supervision in Audio and
[507] Snell, J., Swersky, K., Zemel, R., 2017. Prototypical networks for few- Speech.
shot learning. Advances in neural information processing systems [527] Tae, J., Kim, H., Kim, T., 2021. Editts: Score-based editing for
30. controllable text-to-speech. CoRR abs/2110.02584. URL: https:
[508] Snyder, D., Garcia-Romero, D., Povey, D., Khudanpur, S., 2017. //arxiv.org/abs/2110.02584, arXiv:2110.02584.
Deep neural network embeddings for text-independent speaker verifi- [528] Tan, K., Wang, D., 2019. Learning complex spectral mapping with
cation., in: Interspeech, pp. 999–1003. gated convolutional recurrent networks for monaural speech enhance-
[509] Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., Khudanpur, S., ment. IEEE/ACM Transactions on Audio, Speech, and Language
2018. X-vectors: Robust dnn embeddings for speaker recognition, in: Processing 28, 380–390.
2018 IEEE international conference on acoustics, speech and signal [529] Tan, L., Karnjanadecha, M., 2003. Pitch detection algorithm: auto-
processing (ICASSP), IEEE. pp. 5329–5333. correlation method and amdf, in: Proceedings of the 3rd international
[510] Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., Ganguli, S., symposium on communications and information technology, pp. 551–
2015. Deep unsupervised learning using nonequilibrium thermody- 556.
namics, in: International Conference on Machine Learning, PMLR. [530] Tanaka, K., Kameoka, H., Kaneko, T., Hojo, N., 2019. Atts2s-
pp. 2256–2265. vc: Sequence-to-sequence voice conversion with attention and con-
[511] Solomonoff, A., Campbell, W.M., Boardman, I., 2005. Advances in text preservation mechanisms, in: ICASSP 2019 - 2019 IEEE In-
channel compensation for svm speaker recognition, in: Proceed- ternational Conference on Acoustics, Speech and Signal Processing
ings.(ICASSP’05). IEEE International Conference on Acoustics, (ICASSP), pp. 6805–6809. doi:10.1109/ICASSP.2019.8683282.
Speech, and Signal Processing, 2005., IEEE. pp. I–629. [531] Tang, C., Luo, C., Zhao, Z., Xie, W., Zeng, W., 2021. Joint time-
[512] Solomonoff, A., Quillen, C., Campbell, W.M., 2004. Channel com- frequency and time domain learning for speech enhancement, in:
pensation for svm speaker recognition., in: Odyssey, pp. 219–226. Proceedings of the Twenty-Ninth International Conference on Inter-
[513] Son Chung, J., Senior, A., Vinyals, O., Zisserman, A., 2017. Lip national Joint Conferences on Artificial Intelligence, pp. 3816–3822.
reading sentences in the wild, in: Proceedings of the IEEE conference [532] Tang, Y., Ding, G., Huang, J., He, X., Zhou, B., 2019. Deep speaker
on computer vision and pattern recognition, pp. 6447–6456. embedding learning with multi-level pooling for text-independent
[514] Sondhi, M., Schroeter, J., 1987. A hybrid time-frequency domain speaker verification, in: ICASSP 2019-2019 IEEE International Con-
articulatory speech synthesizer. IEEE Transactions on Acoustics, ference on Acoustics, Speech and Signal Processing (ICASSP), IEEE.
Speech, and Signal Processing 35, 955–967. doi:10.1109/TASSP.1987. pp. 6116–6120.
1165240. [533] Tao, F., Busso, C., 2019. End-to-end audiovisual speech activity
[515] Song, Y., Zhu, J., Li, D., Wang, X., Qi, H., 2018. Talking face gener- detection with bimodal recurrent neural models. Speech Communi-
cation 113, 25–35. [553] Variani, E., Lei, X., McDermott, E., Moreno, I.L., Gonzalez-
[534] Tian, X., Chng, E.S., Li, H., 2019. A vocoder-free wavenet voice Dominguez, J., 2014a. Deep neural networks for small footprint
conversion with non-parallel data. arXiv preprint arXiv:1902.03705 . text-dependent speaker verification, in: 2014 IEEE international con-
[535] Tits, N., Wang, F., El Haddad, K., Pagel, V., Dutoit, T., 2019. Visual- ference on acoustics, speech and signal processing (ICASSP), IEEE.
ization and interpretation of latent spaces for controlling expressive pp. 4052–4056.
speech synthesis through audio analysis. Proc. Interspeech 2019 , [554] Variani, E., Lei, X., McDermott, E., Moreno, I.L., Gonzalez-
4475–4479. Dominguez, J., 2014b. Deep neural networks for small footprint
[536] Tjandra, A., Sakti, S., Nakamura, S., 2018. Sequence-to-sequence asr text-dependent speaker verification, in: 2014 IEEE International
optimization via reinforcement learning, in: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),
Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4052–4056. doi:10.1109/ICASSP.2014.6854363.
IEEE. pp. 5829–5833. [555] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez,
[537] Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., A.N., Kaiser, Ł., Polosukhin, I., 2017. Attention is all you need.
Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, Advances in neural information processing systems 30.
A., Joulin, A., Grave, E., Lample, G., 2023. Llama: Open and efficient [556] Velickovic, P., Cucurull, G., Casanova, A., Romero, A., Lio, P.,
foundation language models. ArXiv abs/2302.13971. Bengio, Y., et al., 2017. Graph attention networks. stat 1050, 10–
[538] Tranter, S.E., Reynolds, D.A., 2006. An overview of automatic 48550.
speaker diarization systems. IEEE Transactions on audio, speech, [557] Veličković, P., Fedus, W., Hamilton, W.L., Liò, P., Bengio, Y., Hjelm,
and language processing 14, 1557–1565. R.D., 2018. Deep graph infomax. arXiv preprint arXiv:1809.10341 .
[539] Tsunoo, E., Kashiwagi, Y., Kumakura, T., Watanabe, S., 2019. Trans- [558] Vincent, E., Virtanen, T., Gannot, S., 2018. Audio source separation
former asr with contextual block processing, in: 2019 IEEE Auto- and speech enhancement. John Wiley & Sons.
matic Speech Recognition and Understanding Workshop (ASRU), [559] Vuong, T., Xia, Y., Stern, R.M., 2021. A modulation-domain loss
IEEE. pp. 427–433. for neural-network-based real-time speech enhancement, in: ICASSP
[540] Tu, T., Chen, Y.J., Yeh, C.c., Lee, H.Y., 2019. End-to-end text-to- 2021-2021 IEEE International Conference on Acoustics, Speech and
speech for low-resource languages by cross-lingual transfer learning. Signal Processing (ICASSP), IEEE. pp. 6643–6647.
arXiv preprint arXiv:1904.06508 . [560] Vygon, R., Mikhaylovskiy, N., 2021. Learning efficient representa-
[541] Tüske, Z., Audhkhasi, K., Saon, G., 2019. Advancing sequence-to- tions for keyword spotting with triplet loss, in: Speech and Computer:
sequence based speech recognition., in: Interspeech, pp. 3780–3784. 23rd International Conference, SPECOM 2021, St. Petersburg, Rus-
[542] Tzinis, E., Adi, Y., Ithapu, V.K., Xu, B., Kumar, A., 2022a. Continual sia, September 27–30, 2021, Proceedings 23, Springer. pp. 773–785.
self-training with bootstrapped remixing for speech enhancement, [561] Wan, L., Wang, Q., Papir, A., Moreno, I.L., 2018. Generalized
in: ICASSP 2022-2022 IEEE International Conference on Acoustics, end-to-end loss for speaker verification, in: 2018 IEEE International
Speech and Signal Processing (ICASSP), IEEE. pp. 6947–6951. Conference on Acoustics, Speech and Signal Processing (ICASSP),
[543] Tzinis, E., Adi, Y., Ithapu, V.K., Xu, B., Smaragdis, P., Kumar, A., IEEE. pp. 4879–4883.
2022b. Remixit: Continual self-training of speech enhancement [562] Wang, C., Chen, S., Wu, Y., Zhang, Z., Zhou, L., Liu, S., Chen,
models via bootstrapped remixing. IEEE Journal of Selected Topics Z., Liu, Y., Wang, H., Li, J., et al., 2023a. Neural codec language
in Signal Processing 16, 1329–1341. models are zero-shot text to speech synthesizers. arXiv preprint
[544] Tzirakis, P., Kumar, A., Donley, J., 2021. Multi-channel speech arXiv:2301.02111 .
enhancement using graph neural networks, in: ICASSP 2021-2021 [563] Wang, C., Tang, Y., Ma, X., Wu, A., Popuri, S., Okhonko, D., Pino, J.,
IEEE International Conference on Acoustics, Speech and Signal 2020a. fairseq s2t: Fast speech-to-text modeling with fairseq. arXiv
Processing (ICASSP), IEEE. pp. 3415–3419. preprint arXiv:2010.05171 .
[545] Um, S.Y., Oh, S., Byun, K., Jang, I., Ahn, C., Kang, H.G., 2020. [564] Wang, C., Wu, A., Pino, J., 2020b. Covost 2 and massively multi-
Emotional speech synthesis with rich and granularized control, in: lingual speech-to-text translation. arXiv preprint arXiv:2007.10310
ICASSP 2020 - 2020 IEEE International Conference on Acoustics, .
Speech and Signal Processing (ICASSP), pp. 7254–7258. doi:10. [565] Wang, C., Wu, Y., Qian, Y., Kumatani, K., Liu, S., Wei, F., Zeng,
1109/ICASSP40776.2020.9053732. M., Huang, X., 2021a. Unispeech: Unified speech representation
[546] Vainer, J., Dušek, O., 2020. Speedyspeech: Efficient neural speech learning with labeled and unlabeled data, in: International Conference
synthesis. arXiv preprint arXiv:2008.03802 . on Machine Learning, PMLR. pp. 10937–10947.
[547] Valin, J.M., Isik, U., Smaragdis, P., Krishnaswamy, A., 2022. Neural [566] Wang, F., Tax, D.M., 2016. Survey on the attention based rnn
speech synthesis on a shoestring: Improving the efficiency of lpcnet, model and its applications in computer vision. arXiv preprint
in: ICASSP 2022-2022 IEEE International Conference on Acoustics, arXiv:1601.06823 .
Speech and Signal Processing (ICASSP), IEEE. pp. 8437–8441. [567] Wang, G., 2019. Deep text-to-speech system with seq2seq model.
[548] Valin, J.M., Skoglund, J., 2019. Lpcnet: Improving neural speech arXiv preprint arXiv:1903.07398 .
synthesis through linear prediction, in: ICASSP 2019-2019 IEEE [568] Wang, H., Wang, D., 2020. Time-frequency loss for cnn based
International Conference on Acoustics, Speech and Signal Processing speech super-resolution, in: ICASSP 2020 - 2020 IEEE International
(ICASSP), IEEE. pp. 5891–5895. Conference on Acoustics, Speech and Signal Processing (ICASSP),
[549] Valle, R., Li, J., Prenger, R., Catanzaro, B., 2020a. Mellotron: Multi- pp. 861–865. doi:10.1109/ICASSP40776.2020.9053712.
speaker expressive voice synthesis by conditioning on rhythm, pitch [569] Wang, H., Wang, Y., Zhou, Z., Ji, X., Gong, D., Zhou, J., Li, Z.,
and global style tokens, in: ICASSP 2020 - 2020 IEEE International Liu, W., 2018a. Cosface: Large margin cosine loss for deep face
Conference on Acoustics, Speech and Signal Processing (ICASSP), recognition, in: Proceedings of the IEEE conference on computer
pp. 6189–6193. doi:10.1109/ICASSP40776.2020.9054556. vision and pattern recognition, pp. 5265–5274.
[550] Valle, R., Shih, K., Prenger, R., Catanzaro, B., 2020b. Flowtron: [570] Wang, J., He, Y., Zhao, C., Shao, Q., Tu, W.W., Ko, T., Lee, H.y., Xie,
an autoregressive flow-based generative network for text-to-speech L., 2021b. Auto-kws 2021 challenge: Task, datasets, and baselines.
synthesis. arXiv preprint arXiv:2005.05957 . arXiv preprint arXiv:2104.00513 .
[551] Van Den Oord, A., Vinyals, O., et al., 2017. Neural discrete represen- [571] Wang, J., Xiao, X., Wu, J., Ramamurthy, R., Rudzicz, F., Brudno, M.,
tation learning. Advances in neural information processing systems 2020c. Speaker diarization with session-level speaker embedding
30. refinement using graph neural networks, in: ICASSP 2020 - 2020
[552] Vanzo, A., Croce, D., Bastianelli, E., Basili, R., Nardi, D., 2016. Ro- IEEE International Conference on Acoustics, Speech and Signal
bust spoken language understanding for house service robots. Polibits Processing (ICASSP), pp. 7109–7113. doi:10.1109/ICASSP40776.2020.
, 11–16. 9054176.
[572] Wang, Q., Downey, C., Wan, L., Mansfield, P.A., Moreno, I.L., 2018b. [590] Weiss, R.J., Skerry-Ryan, R., Battenberg, E., Mariooryad, S.,
Speaker diarization with lstm, in: 2018 IEEE International conference Kingma, D.P., 2021. Wave-tacotron: Spectrogram-free end-to-end
on acoustics, speech and signal processing (ICASSP), IEEE. pp. 5239– text-to-speech synthesis, in: ICASSP 2021-2021 IEEE International
5243. Conference on Acoustics, Speech and Signal Processing (ICASSP),
[573] Wang, Q., Guo, P., Sun, S., Xie, L., Hansen, J.H., 2019a. Adver- IEEE. pp. 5679–5683.
sarial regularization for end-to-end robust speaker verification., in: [591] Weng, C., Cui, J., Wang, G., Wang, J., Yu, C., Su, D., Yu, D., 2018.
Interspeech, pp. 4010–4014. Improving attention based sequence-to-sequence models for end-to-
[574] Wang, Q., Rao, W., Sun, S., Xie, L., Chng, E.S., Li, H., 2018c. end english conversational speech recognition., in: Interspeech, pp.
Unsupervised domain adaptation via domain adversarial training 761–765.
for speaker recognition, in: 2018 IEEE International Conference on [592] Westhausen, N.L., Meyer, B.T., 2020. Dual-signal transforma-
Acoustics, Speech and Signal Processing (ICASSP), pp. 4889–4893. tion lstm network for real-time noise suppression. arXiv preprint
doi:10.1109/ICASSP.2018.8461423. arXiv:2005.07551 .
[575] Wang, T., Deng, J., Geng, M., Ye, Z., Hu, S., Wang, Y., Cui, M., [593] Winata, G.I., Cahyawijaya, S., Lin, Z., Liu, Z., Fung, P., 2020.
Jin, Z., Liu, X., Meng, H., 2022a. Conformer based elderly speech Lightweight and efficient end-to-end speech recognition using low-
recognition system for alzheimer’s disease detection. arXiv preprint rank transformer, in: ICASSP 2020 - 2020 IEEE International Con-
arXiv:2206.13232 . ference on Acoustics, Speech and Signal Processing (ICASSP), pp.
[576] Wang, T., Pan, Z., Ge, M., Yang, Z., Li, H., 2023b. Time-domain 6144–6148. doi:10.1109/ICASSP40776.2020.9053878.
speech separation networks with graph encoding auxiliary. IEEE [594] Wu, D.Y., Lee, H.y., 2020a. One-shot voice conversion by vector
Signal Processing Letters 30, 110–114. quantization, in: ICASSP 2020-2020 IEEE International Conference
[577] Wang, W., Lin, Q., Cai, D., Li, M., 2022b. Similarity measure- on Acoustics, Speech and Signal Processing (ICASSP), IEEE. pp.
ment of segment-level speaker embeddings in speaker diarization. 7734–7738.
IEEE/ACM Transactions on Audio, Speech, and Language Process- [595] Wu, D.Y., Lee, H.y., 2020b. One-shot voice conversion by vector
ing 30, 2645–2658. quantization, in: ICASSP 2020 - 2020 IEEE International Conference
[578] Wang, X., Li, L., Wang, D., 2019b. Vae-based domain adaptation for on Acoustics, Speech and Signal Processing (ICASSP), pp. 7734–
speaker verification, in: 2019 Asia-Pacific Signal and Information 7738. doi:10.1109/ICASSP40776.2020.9053854.
Processing Association Annual Summit and Conference (APSIPA [596] Wu, J., Hua, Y., Yang, S., Qin, H., Qin, H., 2019. Speech enhance-
ASC), IEEE. pp. 535–539. ment using generative adversarial network by distilling knowledge
[579] Wang, X., Ming, H., He, L., Soong, F.K., 2020d. s-transformer: from statistical method. Applied Sciences 9, 3396.
Segment-transformer for robust neural speech synthesis. arXiv [597] Wu, S., Shi, Z., 2021. Itotts and itowave: Linear stochastic differ-
preprint arXiv:2011.08480 . ential equation is all you need for audio generation. arXiv preprint
[580] Wang, Y., Ju, Z., Tan, X., He, L., Wu, Z., Bian, J., Zhao, S., 2023c. arXiv:2105.07583 .
Audit: Audio editing by following instructions with latent diffusion [598] Wu, X., 2022. Deep sparse conformer for speech recognition. arXiv
models. arXiv:2304.00830. preprint arXiv:2209.00260 .
[581] Wang, Y., Shen, Y., Jin, H., 2018d. A bi-model based rnn seman- [599] Wu, Y., Tan, X., Li, B., He, L., Zhao, S., Song, R., Qin, T., Liu, T.Y.,
tic frame parsing model for intent detection and slot filling. arXiv 2022. Adaspeech 4: Adaptive text to speech in zero-shot scenarios.
preprint arXiv:1812.10235 . arXiv preprint arXiv:2204.00436 .
[582] Wang, Y., Shi, Y., Zhang, F., Wu, C., Chan, J., Yeh, C.F., Xiao, A., [600] Xia, W., Huang, J., Hansen, J.H., 2019. Cross-lingual text-
2021c. Transformer in action: A comparative study of transformer- independent speaker verification using unsupervised adversarial dis-
based acoustic models for large scale speech recognition applications, criminative domain adaptation, in: ICASSP 2019-2019 IEEE Inter-
in: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, national Conference on Acoustics, Speech and Signal Processing
Speech and Signal Processing (ICASSP), pp. 6778–6782. doi:10. (ICASSP), IEEE. pp. 5816–5820.
1109/ICASSP39728.2021.9414087. [601] Xiao, X., Kanda, N., Chen, Z., Zhou, T., Yoshioka, T., Chen, S.,
[583] Wang, Y., Skerry-Ryan, R., Stanton, D., Wu, Y., Weiss, R.J., Jaitly, N., Zhao, Y., Liu, G., Wu, Y., Wu, J., et al., 2021. Microsoft speaker
Yang, Z., Xiao, Y., Chen, Z., Bengio, S., et al., 2017. Tacotron: To- diarization system for the voxceleb speaker recognition challenge
wards end-to-end speech synthesis. arXiv preprint arXiv:1703.10135 2020, in: ICASSP 2021-2021 IEEE International Conference on
. Acoustics, Speech and Signal Processing (ICASSP), IEEE. pp. 5824–
[584] Wang, Y., Stanton, D., Zhang, Y., Ryan, R.S., Battenberg, E., Shor, 5828.
J., Xiao, Y., Jia, Y., Ren, F., Saurous, R.A., 2018e. Style tokens: Un- [602] Xin, D., Saito, Y., Takamichi, S., Koriyama, T., Saruwatari, H., 2020.
supervised style modeling, control and transfer in end-to-end speech Cross-lingual text-to-speech synthesis via domain adaptation and
synthesis, in: International Conference on Machine Learning, PMLR. perceptual similarity regression in speaker space., in: Interspeech,
pp. 5180–5189. pp. 2947–2951.
[585] Wang, Y., Zhang, S., Lee, J., 2019c. Bridging commonsense rea- [603] Xu, C., Hu, B., Li, Y., Zhang, Y., Ju, Q., Xiao, T., Zhu, J., et al.,
soning and probabilistic planning via a probabilistic action language. 2021a. Stacked acoustic-and-textual encoding: Integrating the pre-
Theory and Practice of Logic Programming 19, 1090–1106. trained models into speech translation encoders. arXiv preprint
[586] Wang, Z., Wohlwend, J., Lei, T., 2019d. Structured pruning of large arXiv:2105.05752 .
language models. CoRR abs/1910.04732. URL: http://arxiv.org/ [604] Xu, J., Tan, X., Ren, Y., Qin, T., Li, J., Zhao, S., Liu, T.Y., 2020.
abs/1910.04732, arXiv:1910.04732. Lrspeech: Extremely low-resource speech synthesis and recognition,
[587] Wang, Z.Q., Le Roux, J., Hershey, J.R., 2018f. Alternative objective in: Proceedings of the 26th ACM SIGKDD International Conference
functions for deep clustering, in: 2018 IEEE International Conference on Knowledge Discovery & Data Mining, pp. 2802–2812.
on Acoustics, Speech and Signal Processing (ICASSP), IEEE. pp. [605] Xu, Q., Baevski, A., Likhomanenko, T., Tomasello, P., Conneau,
686–690. A., Collobert, R., Synnaeve, G., Auli, M., 2021b. Self-training and
[588] Wang, Z.Q., Wang, P., Wang, D., 2020e. Complex spectral mapping pre-training are complementary for speech recognition, in: ICASSP
for single-and multi-channel speech enhancement and robust asr. 2021-2021 IEEE International Conference on Acoustics, Speech and
IEEE/ACM transactions on audio, speech, and language processing Signal Processing (ICASSP), IEEE. pp. 3030–3034.
28, 1778–1787. [606] Xu, Y., Du, J., Dai, L.R., Lee, C.H., 2014. A regression approach
[589] Warden, P., 2018. Speech commands: A dataset for limited- to speech enhancement based on deep neural networks. IEEE/ACM
vocabulary speech recognition. arXiv preprint arXiv:1804.03209 Transactions on Audio, Speech, and Language Processing 23, 7–19.
. [607] Xue, J., Deng, Y., Han, Y., Li, Y., Sun, J., Liang, J., 2022. Ecapa-tdnn
for multi-speaker text-to-speech synthesis, in: 2022 13th International [626] You, J., Kim, D., Nam, G., Hwang, G., Chae, G., 2021. Gan
Symposium on Chinese Spoken Language Processing (ISCSLP), vocoder: Multi-resolution discriminator is all you need. arXiv
IEEE. pp. 230–234. preprint arXiv:2103.05236 .
[608] Yamamoto, R., Song, E., Kim, J.M., 2020. Parallel wavegan: A fast [627] Yu, C., Lu, H., Hu, N., Yu, M., Weng, C., Xu, K., Liu, P., Tuo, D.,
waveform generation model based on generative adversarial networks Kang, S., Lei, G., et al., 2019. Durian: Duration informed attention
with multi-resolution spectrogram, in: ICASSP 2020-2020 IEEE network for multimodal synthesis. arXiv preprint arXiv:1909.01700
International Conference on Acoustics, Speech and Signal Processing .
(ICASSP), IEEE. pp. 6199–6203. [628] Yu, D., Deng, L., 2016. Automatic speech recognition. volume 1.
[609] Yan, Y., Tan, X., Li, B., Qin, T., Zhao, S., Shen, Y., Liu, T.Y., 2021. Springer.
Adaspeech 2: Adaptive text to speech with untranscribed data, in: [629] Yu, F., Koltun, V., 2015. Multi-scale context aggregation by dilated
ICASSP 2021-2021 IEEE International Conference on Acoustics, convolutions. CoRR abs/1511.07122.
Speech and Signal Processing (ICASSP), IEEE. pp. 6613–6617. [630] Yu, Y., Park, D., Kim, H.K., 2022. Auxiliary loss of transformer with
[610] Yang, D., Liu, S., Yu, J., Wang, H., Weng, C., Zou, Y., 2022a. Nore- residual connection for end-to-end speaker diarization, in: ICASSP
speech: Knowledge distillation based conditional diffusion model for 2022-2022 IEEE International Conference on Acoustics, Speech and
noise-robust expressive tts. arXiv preprint arXiv:2211.02448 . Signal Processing (ICASSP), IEEE. pp. 8377–8381.
[611] Yang, D., Yu, J., Wang, H., Wang, W., Weng, C., Zou, Y., Yu, D., [631] Yue, F., Deng, Y., He, L., Ko, T., Zhang, Y., 2022. Exploring machine
2022b. Diffsound: Discrete diffusion model for text-to-sound genera- speech chain for domain adaptation, in: ICASSP 2022-2022 IEEE
tion. arXiv preprint arXiv:2207.09983 . International Conference on Acoustics, Speech and Signal Processing
[612] Yang, G., Yang, S., Liu, K., Fang, P., Chen, W., Xie, L., 2021a. (ICASSP), IEEE. pp. 6757–6761.
Multi-band melgan: Faster waveform generation for high-quality text- [632] Yun, S., Jeong, M., Kim, R., Kang, J., Kim, H.J., 2019. Graph
to-speech, in: 2021 IEEE Spoken Language Technology Workshop transformer networks. Advances in neural information processing
(SLT), IEEE. pp. 492–498. systems 32.
[613] Yang, G.P., Tuan, C.I., Lee, H.Y., Lee, L.s., 2019a. Improved speech [633] Zeghidour, N., Grangier, D., 2021. Wavesplit: End-to-end speech
separation with time-and-frequency cross-domain joint embedding separation by speaker clustering. IEEE/ACM Transactions on Audio,
and clustering. arXiv preprint arXiv:1904.07845 . Speech, and Language Processing 29, 2840–2849.
[614] Yang, J., Lee, J., Kim, Y., Cho, H., Kim, I., 2020. Vocgan: A high- [634] Zeinali, H., Wang, S., Silnova, A., Matějka, P., Plchot, O., 2019. But
fidelity real-time vocoder with a hierarchically-nested adversarial system description to voxceleb speaker recognition challenge 2019.
network. arXiv preprint arXiv:2007.15256 . arXiv preprint arXiv:1910.12592 .
[615] Yang, S., Chi, P., Chuang, Y., Lai, C.J., Lakhotia, K., Lin, Y.Y., Liu, [635] Zeremdini, J., Ben Messaoud, M.A., Bouzid, A., 2015. A comparison
A.T., Shi, J., Chang, X., Lin, G., Huang, T., Tseng, W., Lee, K., of several computational auditory scene analysis (casa) techniques
Liu, D., Huang, Z., Dong, S., Li, S., Watanabe, S., Mohamed, A., for monaural speech segregation. Brain informatics 2, 155–166.
Lee, H., 2021b. SUPERB: speech processing universal performance [636] Zeyer, A., Merboldt, A., Michel, W., Schlüter, R., Ney, H., 2021.
benchmark. CoRR abs/2105.01051. URL: https://arxiv.org/abs/ Librispeech transducer model with internal language model prior
2105.01051, arXiv:2105.01051. correction. arXiv preprint arXiv:2104.03006 .
[616] Yang, S., Liu, M., 2022. Data augmentation for speaker verifica- [637] Zhang, A., Wang, Q., Zhu, Z., Paisley, J., Wang, C., 2019a. Fully
tion, in: Proceedings of the 2022 6th International Conference on supervised speaker diarization, in: ICASSP 2019-2019 IEEE Inter-
Electronic Information Technology and Computer Engineering, pp. national Conference on Acoustics, Speech and Signal Processing
1247–1251. (ICASSP), IEEE. pp. 6301–6305.
[617] Yang, S., Zhang, Y., Feng, D., Yang, M., Wang, C., Xiao, J., Long, [638] Zhang, B., Haddow, B., Sennrich, R., 2022a. Revisiting end-to-end
K., Shan, S., Chen, X., 2019b. Lrw-1000: A naturally-distributed speech-to-text translation from scratch, in: International Conference
large-scale benchmark for lip reading in the wild, in: 2019 14th IEEE on Machine Learning, PMLR. pp. 26193–26205.
International Conference on Automatic Face and Gesture Recognition [639] Zhang, B., Titov, I., Haddow, B., Sennrich, R., 2020a. Adaptive
(FG 2019), pp. 1–8. doi:10.1109/FG.2019.8756582. feature selection for end-to-end speech translation. arXiv preprint
[618] Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R.R., Le, arXiv:2010.08518 .
Q.V., 2019c. Xlnet: Generalized autoregressive pretraining for lan- [640] Zhang, C., Koishida, K., 2017. End-to-end text-independent speaker
guage understanding. Advances in neural information processing verification with triplet loss on short utterances., in: Interspeech, pp.
systems 32. 1487–1491.
[619] Yao, Z., Dong, Z., Zheng, Z., Gholami, A., Yu, J., Tan, E., Wang, L., [641] Zhang, C., Li, Y., Du, N., Fan, W., Yu, P.S., 2018a. Joint slot filling
Huang, Q., Wang, Y., Mahoney, M.W., Keutzer, K., 2020. HAWQV3: and intent detection via capsule neural networks. arXiv preprint
dyadic neural network quantization. CoRR abs/2011.10680. URL: arXiv:1812.09471 .
https://arxiv.org/abs/2011.10680, arXiv:2011.10680. [642] Zhang, C., Ren, Y., Tan, X., Liu, J., Zhang, K., Qin, T., Zhao, S.,
[620] Yasuda, Y., Wang, X., Takaki, S., Yamagishi, J., 2019. Investiga- Liu, T.Y., 2021a. Denoispeech: Denoising text to speech with frame-
tion of enhanced tacotron text-to-speech synthesis systems with self- level noise modeling, in: ICASSP 2021-2021 IEEE International
attention for pitch accent language, in: ICASSP 2019 - 2019 IEEE Conference on Acoustics, Speech and Signal Processing (ICASSP),
International Conference on Acoustics, Speech and Signal Processing IEEE. pp. 7063–7067.
(ICASSP), pp. 6905–6909. doi:10.1109/ICASSP.2019.8682353. [643] Zhang, C., Shi, J., Weng, C., Yu, M., Yu, D., 2022b. Towards end-to-
[621] Ye, F., Yang, J., 2021. A deep neural network model for speaker end speaker diarization with generalized neural speaker clustering,
identification. Applied Sciences 11, 3603. in: ICASSP 2022-2022 IEEE International Conference on Acoustics,
[622] Ye, R., Wang, M., Li, L., 2021. End-to-end speech translation via Speech and Signal Processing (ICASSP), IEEE. pp. 8372–8376.
cross-modal progressive training. arXiv preprint arXiv:2104.10380 . [644] Zhang, H., Wang, L., Lee, K.A., Liu, M., Dang, J., Chen, H., 2021b.
[623] Yen, H., Germain, F.G., Wichern, G., Roux, J.L., 2022. Cold diffusion Meta-learning for cross-channel speaker verification, in: ICASSP
for speech enhancement. arXiv preprint arXiv:2211.02527 . 2021-2021 IEEE International Conference on Acoustics, Speech and
[624] Yoneyama, R., Yamamoto, R., Tachibana, K., 2022. Nonparallel high- Signal Processing (ICASSP), IEEE. pp. 5839–5843.
quality audio super resolution with domain adaptation and resampling [645] Zhang, H., Wang, L., Lee, K.A., Liu, M., Dang, J., Meng, H.,
cyclegans. arXiv preprint arXiv:2210.15887 . 2023a. Meta-generalization for domain-invariant speaker verifica-
[625] Yoon, J.W., Woo, B.J., Kim, N.S., 2022. Hubert-ee: Early exiting hu- tion. IEEE/ACM Transactions on Audio, Speech, and Language
bert for efficient speech recognition. arXiv preprint arXiv:2204.06328 Processing 31, 1024–1036.
. [646] Zhang, J.X., Ling, Z.H., Dai, L.R., 2018b. Forward attention in
☒ The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
☐ The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: