FluentNet - End-to-End Detection of Speech Disfluency With Deep Learning
Abstract—Strong presentation skills are valuable and sought-after in workplace and classroom environments alike. Of the possible improvements to vocal presentations, disfluencies and

referred to as a speech disfluency [5]. There are hundreds of different speech disfluencies often grouped together alongside language and swallowing disorders. Of these afflictions,
terjections from the user and use Speech-to-Text (STT) tools in order to match the spoken word with any interjections in the list. Though this may work fine for interjections such as ‘um’ and ‘uh’ (assuming the used STT tool has the necessary embeddings), this can lead to serious overall errors in classification for most other utterances that are actual words, such as ‘like’, which is commonly used as a filler word in the English language.

Early works in stutter detection, realizing the challenges mentioned above, first sought out to test the viability of identifying stutters from clean speech. These models primarily focused on machine learning models with very small datasets, consisting of a single stutter type, or even a single word [14], [19]. In more recent years, and due to the rise of automatic speech recognition (ASR), language models have been used to tackle stutter recognition. These works have proven to be strong at identifying certain stutter types, and have been showing ever-improving results [17], [16]. However, due to the uncertainty surrounding relations between cleanly spoken and stuttered word embeddings, it remains difficult for these models to generalize across multiple stutter types. It is hypothesized that by bypassing the use of language models, and by focusing solely on phonetics through the use of convolution networks, a model can be created that both maintains a strong average accuracy while also being effective across all stutter types.

In this paper, we propose a model capable of detecting speech disfluencies. To this end, we design FluentNet, a deep neural network (DNN) for automated speech disfluency detection. The proposed network does not apply any language model aspects, but instead focuses on the direct classification of speech signals. This allows for the avoidance of complex and time-consuming ASR as a pre-processing step in our model, and would provide the ability to view the scenario as an end-to-end solution using a single deep neural network. We validate our model on a commonly used benchmark dataset, UCLASS [18]. To tackle the issue of scarce stutter-related speech datasets, we also develop a synthetic dataset based on a non-stuttered speech dataset (LibriSpeech [20]), which we entitle LibriStutter. This dataset is created to mimic stuttered speech and vastly expand the amount of data available for use. Our end-to-end neural network takes spectrogram feature images as inputs, and uses Squeeze-and-Excitation residual (SE-ResNet) blocks for learning the speech embedding. Next, a bidirectional long short-term memory (BLSTM) network is used to learn the temporal relationships, followed by an attention mechanism to focus on the more salient parts of the speech. Experiments show the effectiveness of our approach in generalizing across multiple classes of stutters while maintaining a high accuracy and strong consistency between classes on both datasets.

The key contributions of our work can be summarized as follows: (1) We propose FluentNet, an end-to-end deep neural network capable of detecting several types of speech disfluencies; (2) We develop a synthesized disfluency dataset called LibriStutter based on the publicly available LibriSpeech dataset by artificially generating several types of disfluencies, namely sound, word, and phrase repetitions, as well as prolongations and interjections. The dataset contains the output labels that can be used in training deep learning methods; (3) We evaluate our model (FluentNet) on two datasets, UCLASS and LibriStutter. The experiments show that our model achieves state-of-the-art results on both datasets, outperforming a number of other baselines as well as previously published work; (4) We make our annotations on the existing UCLASS dataset, along with the entire LibriStutter dataset and its labels, publicly available¹ to contribute to the field and facilitate further research.

¹ http://aiimlab.com/resources.html

This is an extension of our earlier work titled “Detecting Multiple Speech Disfluencies using a Deep Residual Network with Bidirectional Long Short-Term Memory”, published in the 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). That paper focused on tackling the problem of detection and classification of different forms of stutters. The model used a deep residual network and bidirectional long short-term memory layers to classify different types of stutters. In this extended work, we replace the previously used residual blocks of the spectral encoder with residual squeeze-and-excitation blocks. Additionally, we add an attention mechanism after the recurrent network to better focus the network on salient parts of input speech. Furthermore, we develop a new dataset, which we present in this paper and make publicly available. Lastly, we perform thorough experiments, for instance through additional benchmark comparisons and ablation studies. Our experiments show the improvements made by FluentNet over our preliminary work, as validated on both the UCLASS dataset (previously used) as well as the newly developed dataset. This new model provides greater advancement towards end-to-end disfluency detection and classification.

The rest of this paper is organized as follows: a discussion of previous contributions towards stutter recognition in Section II is followed by our methodology, including a breakdown of the model, in Section III, the datasets and benchmark models applied in Section IV, a discussion of our results in Section V, and our conclusion in the final section.

II. RELATED WORK

There has recently been increasing interest in the fields of deep learning, speech, and audio processing. However, as discussed earlier in Section I, there has been minimal research targeting automated detection of speech disfluencies including stuttering, most likely as a result of insufficient data and the smaller number of potential applications in comparison to other speech-related problems such as speech recognition [21], [9] and speaker recognition [10], [11]. In the following sections we first provide a summary of the types of disfluencies commonly targeted in the area, followed by a review of the existing work that falls under the umbrella of speech disfluency detection and classification.

A. Background: Types of Speech Disfluency

There are a number of different stuttering types, often categorized into four main groups: repetitions, prolongations,
interjections, and blocks. A summary of all these disfluency types and examples of each have been presented in Table I. The descriptions for each of these categories are as follows.

Repetitions are classified as any part of an utterance repeated at a quick pace. As this definition still remains general, repetitions are often further sub-categorized [5]. These sub-categories have been used in previous works classifying stutter disfluencies [22], [23], [17], which include sound, word, and phrase repetitions, as well as revisions. Sound repetitions (S) are repetitions of a single phoneme, or short sound, often consisting of a single letter. Part-word, or syllable, repetitions (PW), as the name suggests, are classified as the repetition of syllables, which can consist of multiple phonemes. Similarly, word repetitions (W) are defined as any repetition of a single word, and phrase repetitions (PH) are the repetition of phrases consisting of multiple consecutive words. The final repetition-type disfluency is revision (R). Similar to phrase repetitions, revisions consist of repeated phrases, where the repeated segment is rephrased, containing new or different information from the first iteration. A rise in pitch may accompany this disfluency type [24].

Interjections (I), often referred to as filler words, consist of the addition of any utterance that does not logically belong in the spoken phrase. Common interjections in the English language include exclamations, such as ‘um’ and ‘uh’, as well as discourse markers such as ‘like’, ‘okay’, and ‘right’.

Prolongation (PR) stutters are presented as a lengthened or sustained phoneme. The duration of these prolonged utterances varies alongside the severity of the stutter. Similar to repetitions, this disfluency is often accompanied by a rise in pitch.

The final category of stuttering is silent blocks (B), which are sudden cutoffs of vocal utterances. These are often involuntary and are presented as pauses within a given phrase.

B. Stutter Recognition with Classical Machine Learning

Before the focus of stutter recognition targeted maximizing accuracy in the classification of stammers, a number of works were performed toward testing the viability of stutter detection. In 1995, Howell et al. [14], who later helped to create the UCLASS dataset [18] used in this paper, employed a set of pre-defined words to identify repetition and prolongation stutters. From these, they extracted autocorrelation features, spectral information, and envelope parameters from the audio. Each was used as an input to a fully connected artificial neural network (ANN). Findings showed that the model achieved its strongest classification results against severe disfluencies, and was weakest for mild ones. These models were able to achieve a maximum detection rate of 0.82 on severe prolongation stutters. Howell et al. [15] later furthered their work using a larger set of data, as well as a wider variety of audio parameters. This work also introduced an ANN model for both repetition and prolongation types, and more judges were used to identify stutters with strict restrictions towards agreement of disfluency labeling. Results showed that the best parameters for disfluency classification were fragmentation spectral measures for whole words, as well as duration and supralexical disfluencies of energy in part-words.

Tan et al. [19] worked on testing the viability of stutter detection through a simplified approach in order to maximize the possible results. By collecting audio samples of clean, stuttered, and artificially generated copies of single pre-chosen words, they were able to reach an average accuracy of 96% on the human samples using a hidden Markov model (HMM). This served as a temporary benchmark towards the possible best average results for stutter detection.

Ravikumar et al. have attempted a variety of classifiers on syllable repetitions, including an HMM [25] and support vector machine (SVM) [26] using Mel-frequency cepstral coefficient (MFCC) features. Their best results were obtained when classifying this stutter type using the SVM on 15 participants, achieving an accuracy of 94.35%. No other disfluency types were considered.

A detailed summary of previously attempted stutter classification methods, including some of the aforementioned classical models, is available in the form of a review paper in [27]. This paper provides background on the use of three different models (ANNs, HMMs, and SVMs) towards the application of stutter recognition. Of the works considered in that 2009 review paper, it was concluded that HMMs achieve the best results in stutter recognition.

C. Stutter Recognition with Deep Learning

With the recent advancements in deep learning, disfluency detection and classification has seen an increase in popularity within the field, with a higher tendency towards end-to-end approaches. ASR has become an increasingly popular method of tackling the problem of disfluency classification. As some stuttered speech results in repeated words, as well as prolonged utterances, these can be represented by word embeddings and sound amplitude features, respectively. To exploit this concept, Alharbi et al. [17] detected sound and word repetitions, as well
as revision disfluencies using task-oriented finite state transducer (FST) lattices. They also utilized amplitude thresholding techniques to detect prolongations in speech. These methods resulted in an average 37% miss rate across the 4 different types of disfluencies.

Dash et al. [16] have used an STT model in order to identify word and phrase repetitions within stuttered speech. To detect prolongation stutters, they integrated a neural network capable of finding optimal cutoff amplitudes for a given speaker to expand upon simple thresholding methods. As these ASR works required full word embeddings to classify repetitions, they either fared poorly against, or did not attempt, sound or part-word repetitions.

Deep recurrent neural networks (RNNs), namely BLSTMs, have been used to tackle stutter classification. Zayats et al. [28] trained a BLSTM with Integer Linear Programming (ILP) [32] on a set of MFCC features to detect repetitions with an F-score of 85.9. Similarly, a work done by Santoso et al. applied a BLSTM followed by an attention mechanism to perform stutter detection based on input MFCC features, obtaining a maximum F-score of 69.1 [30]. More recently, in a study by Chen et al., a Controllable Time-delay Transformer (CT-Transformer) has been used to detect speech disfluencies and correct punctuation in real time [31]. In our initial work on stutter classification, we utilized spectrogram features of stuttered audio and used a BLSTM [33] to learn temporal relationships following spectral frame-level representation learning by a ResNet. This model achieved a 91.15% average accuracy across six different stutter categories.

In an interesting recent work, Villegas et al. utilized respiratory biosignals in order to better detect stutters [29]. By correlating respiratory volume and flow, as well as heart rate measurements, to the time when a stutter occurs, they were able to classify block stutters with an accuracy of 95.4% using an MLP.

A 2018 summary and comparison of different features and classification methods for stuttering has been conducted by Khara et al. [34]. This work discusses and compares the different popular feature extraction methods, classifiers and their uses, as well as their advantages and shortcomings. The paper discusses that MFCC feature extraction has historically provided the strongest results. Similarly, ANNs provide the most flexibility and adaptability compared to other models, especially linear ones.

Table II provides a summary of the related works on disfluency detection and classification. It can be observed and concluded that disfluency classification has been progressing on one of two fronts: i) end-to-end speech-based methods, or ii) language-based models relying on an ASR pre-processing step. Our work in this paper is positioned in the first category in order to avoid the reliance on an ASR step. Moreover, from Table II, we observe that although progress is being made in the area of speech disfluency recognition, the lack of available data remains a hindrance to potential further achievements in the field.

III. PROPOSED METHOD

A. Problem and Solution Overview

Our goal in this section is to design and develop a system that can be used for detecting various types of disfluencies. While one approach to tackle this concept is to design a multi-class problem, another approach is to design an ensemble of single-disfluency detectors. In this paper, given the relatively small size of available stutter datasets, we use the latter approach, which can help reduce the complexity of each binary task. Accordingly, the goal is to design a single network architecture that can be trained separately to detect different disfluency types with each trained instance, where together they could detect a number of different disfluencies. Figure 1 shows the overview of our system. The designed network should possess the capability of learning spectral frame-level representations as well as temporal relationships. Moreover, the model should be able to focus on salient parts of the inputs in order to effectively learn the disfluencies and perform accurately.

B. Proposed Network: FluentNet

We propose an end-to-end network, FluentNet, which uses short-time Fourier transform (STFT) spectrograms of audio clips as inputs. These inputs are passed through a Squeeze-and-Excitation Residual Network (SE-ResNet) to learn frame-level spectral representations. As most stutter types have distinct spectral and temporal properties, a bidirectional LSTM network is introduced to learn the temporal relationships
Fig. 1: Full model overview using FluentNet for disfluency classification. (The input audio feeds parallel SE-ResNet, BLSTM, and attention stages, one branch per class: sound repetition, word repetition, phrase repetition, revision, interjection, and prolongation classification.)

present among different spectrograms. An attention mechanism is then added to the final recurrent layer to better focus on the necessary features needed for stutter classification. FluentNet's final output reveals a binary classification to detect a specific disfluency type that it has been trained for. The architecture of the network is presented in Figure 2(a). In the following, we describe each of the components of our model in detail.

1) Data Representation: Input audio clips recorded with a sampling rate of 16 kHz are converted to spectrograms using STFT with 256 filters (frequency bins) to be fed to our end-to-end model. A sample spectrogram can be seen in Figure 2, where the colours represent the amplitude of each frequency bin at a given frame, with blue representing lower amplitudes, and green and yellow representing higher amplitudes. Following the common practice in audio-signal processing, a 25 ms frame has been used with an overlap of 10 ms.

2) Learning Frame-level Spectral Representations: FluentNet first focuses on learning effective representations from each input spectrogram. To do so, CNN architectures are often used. Though both residual networks [35] and squeeze-and-excitation (SE) networks [36] are relatively new in the field of deep learning, both have proven to improve on previous state-of-the-art models in a variety of different application areas [37], [38]. The ResNet architecture has proven a reliable solution to the vanishing and exploding gradient problems, both common issues when back-propagating through a deep neural network. In many cases, as the model depth increases, the gradients of the weights in the model become increasingly smaller, or inversely, explosively larger with each layer. This may eventually prevent the gradients from actually changing the weights, or the weights from becoming too large, thus preventing improvements in the model. A ResNet overcomes this by utilizing shortcuts all through its CNN blocks, resulting in norm-preserving blocks capable of carrying gradients through very deep models.

Squeeze-and-excitation modules have been recently proposed and have shown to outperform various DNN models using previous CNN architectures, namely VGG and ResNet, as their backbone architectures [36]. SE networks were first proposed for image classification, reducing the relative error compared to previous works on the ImageNet dataset by approximately 25% [36].

Every kernel within a convolution layer of a CNN results in an added channel (depth) for the output feature map. Whereas recent works have focused on expanding on the spectral relationships within these models [39], [40], SE blocks build stronger focus on channel-wise relationships within a CNN. These blocks consist of two major operations. The squeeze operation aggregates a feature map across both its height and width, resulting in a one-dimensional channel descriptor. The excitation operation consists of fully connected layers providing channel-wise weights, which are then applied back to the original feature map.

To exploit the capabilities of both ResNet and SE architectures and learn effective spectral frame-level representations from the input, we use an SE-ResNet model in our end-to-end network. The network consists of 8 SE-ResNet blocks, as shown in Figure 2(a). Each SE-ResNet block in FluentNet, illustrated in Figure 2(b), consists of three sets of two-dimensional convolution layers, each succeeded by a batch normalization and Rectified Linear Unit (ReLU) activation function. A separate residual connection shares the same input as the block's non-identity branch, and is added back to the non-identity branch before the final ReLU function, but after the SE unit (described below). Each residual connection contains a convolution layer followed by batch normalization. The Squeeze-and-Excitation unit within each SE-ResNet block begins with a global pooling layer. The output is then passed through two fully connected layers: the first followed by a ReLU activation function, and the second succeeded by a sigmoid gating function. The main convolution branch is scaled with the output of the SE unit using channel-wise multiplication.

3) Learning Temporal Relationships: In order to learn the temporal relationships between the representations learned from the input spectrogram, we use an RNN. In particular, LSTM [41] networks have shown to be effective for this purpose in the past and are widely used for learning sequences of spectral representations obtained from consecutive segments of time-series data [42], [43], [44].

Each LSTM unit contains a cell state, which holds information contained in previous units, allowing the network to learn temporal relationships. This cell state is part of the LSTM's memory unit, where there lie several gates that together control which information from the inputs, as well as from the previous cell and hidden states, will be used to generate the current cell and hidden states. Namely, the forget gate, $f_t$, and input gate, $i_t$, are utilized to learn what information from each of these respective states will be saved within the new current state, $C_t$. This is shown by the following equations:

$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$  (1)

$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$  (2)
Fig. 2: a) A full workflow of FluentNet is presented. This network consists of 8 SE residual blocks, two BLSTM layers, and
a global attention mechanism. b) The breakdown of a single SE-ResNet block in FluentNet is presented.
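To make the block structure of Fig. 2(b) concrete, the following is a minimal PyTorch sketch of a single SE-ResNet block. The kernel sizes, channel counts, and the squeeze-and-excitation reduction ratio are assumptions chosen for illustration; only the overall layout (three Conv-BN stages, an SE gate, a projection shortcut, and a final ReLU after the addition) follows the description above.

```python
import torch
import torch.nn as nn

class SEResNetBlock(nn.Module):
    """Sketch of one SE-ResNet block: Conv-BN(-ReLU) x3, SE scaling,
    projection shortcut, then the final ReLU after the addition."""

    def __init__(self, in_ch: int, out_ch: int, reduction: int = 16):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                                   nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.conv2 = nn.Sequential(nn.Conv2d(out_ch, out_ch, 3, padding=1),
                                   nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.conv3 = nn.Sequential(nn.Conv2d(out_ch, out_ch, 3, padding=1),
                                   nn.BatchNorm2d(out_ch))
        # Squeeze-and-excitation: global pool -> FC + ReLU -> FC + sigmoid.
        self.se = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                nn.Linear(out_ch, out_ch // reduction),
                                nn.ReLU(inplace=True),
                                nn.Linear(out_ch // reduction, out_ch),
                                nn.Sigmoid())
        # Residual connection: a convolution layer followed by batch normalization.
        self.shortcut = nn.Sequential(nn.Conv2d(in_ch, out_ch, 1),
                                      nn.BatchNorm2d(out_ch))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.conv3(self.conv2(self.conv1(x)))
        gate = self.se(y).view(y.size(0), -1, 1, 1)   # channel-wise weights in [0, 1]
        y = y * gate                                   # scale the main branch (channel-wise multiplication)
        return self.relu(y + self.shortcut(x))         # add the shortcut, then the final ReLU
```

Stacking eight such blocks, as in Fig. 2(a), would produce the frame-level spectral representations that are subsequently passed to the BLSTM layers.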
$C_t = f_t * C_{t-1} + i_t * \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$  (3)

where $\sigma$ represents the sigmoid function, and the $*$ operator denotes point-wise multiplication. This new cell state, along with an output gate, $o_t$, is used to generate the hidden state of the unit, $h_t$, as represented by:

$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$  (4)

$h_t = o_t * \tanh(C_t)$  (5)

The cell state and hidden state are then passed to successive LSTM units, allowing the network to learn long-term dependencies.

We used a BLSTM network [45] in FluentNet. BLSTMs consist of two LSTMs advancing in opposite directions, maximizing the available context from relationships of both the past and future. The outputs of these two networks are multiplied together into a single output layer. FluentNet consists of two consecutive BLSTMs, each utilizing LSTM cells with 512

The final output value of the second layer of the BLSTM, $h_t$, as well as a context vector, $C_t$, derived through the use of the attention mechanism, are used to generate FluentNet's final classification, $\tilde{h}_t$. This is done by applying a tanh activation function, as shown by:

$\tilde{h}_t = \tanh(W_c [C_t; h_t])$  (6)

The context vector of the global attention is the weighted sum of all hidden state outputs of the encoder. An alignment vector, generated as a relation between $h_t$ and each hidden state value, is passed through a softmax layer, which is then used to represent the weights of the context vector. Dot product was used as the alignment score function for this attention mechanism. The calculation for the context vector can be represented by:

$C_t = \sum_{i=1}^{t} \bar{h}_i \left( \frac{e^{h_t^{\top} \bar{h}_i}}{\sum_{i'=1}^{t} e^{h_t^{\top} \bar{h}_{i'}}} \right)$  (7)
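A minimal PyTorch sketch of this global dot-product attention is given below; treating $W_c$ as a learned linear layer and leaving the hidden size as a free parameter are assumptions of the sketch rather than details stated above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalDotAttention(nn.Module):
    """Sketch of Eqs. (6)-(7): dot-product alignment, softmax weights,
    context vector C_t, and output h~_t = tanh(W_c [C_t ; h_t])."""

    def __init__(self, hidden: int):
        super().__init__()
        self.w_c = nn.Linear(2 * hidden, hidden)  # W_c in Eq. (6)

    def forward(self, h_t: torch.Tensor, states: torch.Tensor) -> torch.Tensor:
        # h_t: (batch, hidden) final BLSTM output; states: (batch, time, hidden) hidden states h-bar_i.
        scores = torch.bmm(states, h_t.unsqueeze(-1)).squeeze(-1)      # dot-product alignment scores
        weights = F.softmax(scores, dim=-1)                            # softmax over time (Eq. 7)
        context = torch.bmm(weights.unsqueeze(1), states).squeeze(1)   # context vector C_t
        return torch.tanh(self.w_c(torch.cat([context, h_t], dim=-1)))  # Eq. (6)
```

In FluentNet this mechanism would operate on the outputs of the second BLSTM layer, with the attended vector feeding the final binary classifier for the targeted disfluency.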
between the ages of 8 and 18, totalling to just over one hour of audio.

In order to pair the utterances with their transcriptions, each audio file and its corresponding orthographic transcription were passed through a forced time alignment tool. The resulting table related each alphabetical token in the transcription to its matching timestamp within the audio. This process was then manually checked for outlying utterances not matching their transcriptions.

The provided orthographic transcriptions only flagged the existence of disfluencies (through the use of capitalization), but gave no information towards the disfluency type. To build a more detailed dataset and be able to classify the type of disfluency, all utterances were manually labelled as one of the seven represented classes for our model. These included
etitions were generated by copying the first fraction of a random spoken word within the sample and repeating this short utterance several times before said word. Although repetitions of sounds can occur at the end of words, known as word-final disfluencies, this is rarely the case [60]. One to three repeated sound utterances were added in each stuttered word. After each instance of the repeated sound, a random empty pause duration of 100 to 350 ms was appended, as this range sounded most natural. Inserted audio may leave sharp cutoffs, especially part-way through an utterance. To avoid this, interpolation was used to smooth the added audio's transition into the existing clip.

Both word and phrase repetitions underwent similar processes to that of sound repetitions. For word repetitions we repeated one to two copies of a randomly selected word before the original utterance. For phrase repetitions, a similar approach was taken, where instead of repeating a particular word, a phrase consisting of two to three words was repeated. The same pause duration and interpolation techniques used for sound repetitions were applied to both word and phrase repetition disfluencies.

Prolongations consist of sustained sounds, primarily at the end of a word. To mimic this behaviour, the last portion of a word was stretched to simulate prolonged speech. For a randomly chosen word, the latter 20% of the signal was stretched by a factor of 5. This prolonged speech segment replaced the original word ending. As applying time stretching to audio results in a drop in pitch, pitch shifting was used to realign the pitch with the original audio. The average pitch of the given speech segment was used to normalize the disfluent utterance.

Unlike the aforementioned classes, interjection disfluencies cannot be created from existing speech within a sample, as they require the addition of filler words absent from the original audio (for example ‘umm’). Multiple samples of common filler words from UCLASS were isolated and saved separately to create a pool of interjections. To create interjection disfluencies, a random filler word from this pool was inserted between two random words, followed by a short empty pause. The same pitch scaling and normalization method as used for prolongations was applied to match the pitches between the interjection and the audio clip. Interpolation was used, as in repetition disfluencies, to smooth sharp cutoffs caused by the added utterance.

Fig. 4: Spectrograms of the same stutters (sound repetition, word repetition, and prolongation) found in the UCLASS dataset and generated in the LibriStutter dataset.

To ensure that sufficient realism was incorporated into the dataset, a registered speech language pathologist was consulted for this project. Nonetheless, it should be mentioned that despite our attention to creating a perceptually valid and realistic dataset, the notion of “realism” itself is not a focus of this dataset. Instead, much like synthetic datasets in other areas such as image processing, the aim is for the dataset to be valid enough such that machine learning and deep learning methods can be trained and evaluated with it, and later on transferred to real large-scale datasets [in the future] with little to no adjustments to the model architectures.

Figure 4 displays side-by-side comparisons of spectrograms of real stuttered data from the UCLASS dataset and artificial stutters from LibriStutter. Each pairing represents a single stutter type, with the same word or sound being spoken in each. It can be observed that the UCLASS stutter samples and their corresponding LibriStutter examples show clear similarities. Moreover, to numerically compare the samples, cosine similarity [61] was calculated between the UCLASS and LibriStutter spectrogram samples shown earlier. To add relevance to these values, a second comparison was performed for each UCLASS spectrogram with respect to 100 random samples from the LibriStutter dataset, and the average score was used as the represented comparison value. These scores are summarized in Table III. We observe that the UCLASS cosine similarity scores corresponding to the matching LibriStutter samples are noticeably (approximately between 10× to 30×) greater than those compared to random audio samples, confirming that the disfluent utterances contained in LibriStutter share phonetic similarities with real stuttered samples, empirically showing the similarity between a few sample real and synthesized stutters.

TABLE III: Cosine similarity between a UCLASS dataset stutter and a matching LibriStutter stutter, as well as the average of 100 random samples from the LibriStutter dataset.

Stutter            | UCLASS vs. LibriStutter | UCLASS vs. Random
Sound Repetition   | 3.73893e-3              | 2.58116e-4
Word Repetition    | 3.14077e-3              | 2.61084e-4
Prolongation       | 7.70236e-3              | 2.57234e-4

The LibriStutter dataset consists of approximately 20 hours of speech data from the LibriSpeech train-clean-100 set (the training set of 100 hours of “clean” speech). In turn, LibriStutter shares a similar makeup to that of its predecessor. It consists of disfluent prompted English speech from audiobooks. It also contains 23 male and 27 female speakers, with approximately 53% of the audio coming from males, and 47% from females. There are 15000 disfluencies in this dataset, with equal counts for each of the five disfluency types: 3000 sound, word, and phrase repetitions, as well as prolongations and interjections. Each audio file has a corresponding CSV file containing each word or utterance spoken, the start and end time of the utterance, and its disfluency type, if any.
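As an illustration of the synthesis procedures described above, the following Python sketch generates a sound-repetition and a prolongation disfluency from a clean clip. The word boundaries are assumed to come from the alignment CSV files; the 25% onset fraction, the interpolation-based stretch, and the use of librosa for pitch shifting are assumptions of this sketch, and the boundary-smoothing and pitch-normalization steps are omitted for brevity.

```python
import numpy as np
import librosa

def add_sound_repetition(clip, sr, start, end, rng=np.random.default_rng(0)):
    """Copy the first fraction of the word at clip[start:end] and repeat it
    one to three times before the word, separated by 100-350 ms pauses."""
    onset = clip[start:start + (end - start) // 4]               # first ~25% of the word (assumed fraction)
    pieces = [clip[:start]]
    for _ in range(rng.integers(1, 4)):                          # one to three repeated sound utterances
        pieces.append(onset)
        pieces.append(np.zeros(int(sr * rng.uniform(0.10, 0.35))))  # 100-350 ms empty pause
    pieces.append(clip[start:])
    return np.concatenate(pieces)

def add_prolongation(clip, sr, start, end, tail_frac=0.2, stretch=5.0):
    """Stretch the last 20% of the word by a factor of 5, then shift the pitch
    back up to compensate for the drop introduced by the stretch."""
    split = start + int((end - start) * (1.0 - tail_frac))
    tail = clip[split:end]
    # Resampling-based stretch: lengthens the segment but lowers its pitch.
    idx = np.linspace(0, len(tail) - 1, num=int(len(tail) * stretch))
    stretched = np.interp(idx, np.arange(len(tail)), tail).astype(np.float32)
    # Shift the pitch up by log2(stretch) octaves to realign it with the original audio.
    repaired = librosa.effects.pitch_shift(stretched, sr=sr, n_steps=12 * np.log2(stretch))
    return np.concatenate([clip[:split], repaired, clip[end:]])
```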
B. Benchmarks

For a thorough analysis of our results, we compare the results obtained by the proposed FluentNet to a number of other models. In particular, we employ two types of solutions for comparison purposes. First, we compare our results to related works and the state-of-the-art as follows:

Alharbi et al. [17]: This work conducted classification of sound repetitions, word repetitions, revisions, and prolongations on the UCLASS dataset through the application of two different methods. First, an original speech prompt was aligned and then passed to a task-oriented FST to generate word lattices. These lattices were used to detect repeated part-words, words, and phrases within the sample. This method scored perfect results on word repetition classification, though the results on sound repetitions and revisions proved much weaker. To classify prolongation stutters, an autocorrelation algorithm consisting of two thresholds was used: the first to detect speech with similar amplitudes (sustained speech), and another dynamic threshold to decide whether the duration of similar speech would be considered a prolongation. Using this algorithm, perfect prolongation classification was achieved.

Chen et al. [31]: A CT-Transformer was designed to conduct repetition and interjection disfluency detection on an in-house Chinese speech dataset. Both word and position embeddings of a provided audio sample were passed through a series of CT self-attention layers and fully connected layers. This work was able to achieve an overall disfluency classification miss rate of 38.5% (F1 score of 70.5). Notably, this is one of the few works to have attempted interjection disfluency classification, yielding a miss rate of 25.1%. Note that the performance on repetition disfluencies encompasses all repetition-type stutters, including sound, word, and phrase repetitions, as well as revisions.

Kourkounakis et al. [33]: As opposed to other current models focusing on ASR and language models, our previous work proposed a model relying solely on acoustic and phonetic features, allowing for the classification of several disfluency types without the need for speech recognition methods. This model applied a deep residual network, consisting of 6 residual blocks (18 convolution layers), and two bidirectional long short-term memory layers to classify six different types of stutters. This work achieved an average miss rate of 10.03% on the UCLASS dataset, and sustained strong accuracy and miss rates across all stutter types, prominently word repetitions and revisions.

Zayats et al. [28]: A recurrent network was used to classify repetition disfluencies within the Switchboard corpus. It consists of a BLSTM followed by an ILP post-processing method. The input embedding to this network consisted of a vector containing each word's index and part of speech, as well as 18 other disfluency-based features. The method achieved a miss rate of 19.4% across all repetition disfluencies.

Villegas et al. [29]: This model was used as a reference to compare the effectiveness of respiratory signals towards stutter classification. These features included the means, standard deviations, and distances of respiratory volume, respiratory flow, and heart rate. Sixty-eight participants were used to generate the data for their experiments. The best performing model in this work was an MLP with 40 hidden layers, resulting in an 82.6% average classification accuracy between block and non-block type stutters.

Dash et al. [16]: This method passed the maximum amplitude of the provided audio sample through a neural network to generate a custom threshold for each sample, trained on a set of 60 speech samples. This amplitude threshold was used to remove any perceived prolongations and interjections. The audio was then passed through an STT tool, which allowed for the removal of any repeated words, phrases, or characters, achieving an overall stutter classification accuracy of 86% on a test set of 50 speech segments.

Note that the latter three works only provide results on a group of disfluency types [28], a single disfluency type [29], or overall stutter classification [16]. As such, only their average disfluency classification results could be compared. Moreover, these works ([31], [28], [29], and [16]) have not used the UCLASS dataset, therefore the comparisons should be taken cautiously.

Next, we also compare the performance of our solution to a number of other models for benchmarking purposes. These models were selected due to their popularity for time-series learning, and the hyperparameters of these models are all tuned to obtain the best possible results given their architectures. These benchmarks are as follows: (i) VGG-16 (Benchmark 1): VGG-16 [62] consists of 16 convolutional or fully connected layers, comprised of groups of two or three convolution layers with ReLU activation, with each grouping being followed by a max pooling layer. The model concludes with three fully connected layers and a final softmax function. (ii) VGG-19 (Benchmark 2): This network is very similar to its VGG-16 counterpart, with the only difference being the addition of three more convolution layers spread throughout the model. (iii) ResNet-18 (Benchmark 3): ResNet-18 was chosen as a benchmark, which contains 18 layers: eight consecutive residual blocks, each containing two convolutional layers with ReLU activation, followed by an average pooling layer and a final fully connected layer.

V. RESULTS AND ANALYSIS

A. Validation

In order to rigorously test FluentNet on the UCLASS dataset, a leave-one-subject-out (LOSO) cross validation method was used. The results of models tested on this dataset are represented as the average over 25 experiments, each consisting of audio samples from 24 of the participants as training data, and a unique single participant's audio as a test set. A 10-fold cross validation method was used on the LibriStutter dataset, with a random 90% subset of the samples from each stutter being used for training along with 90% of the clean samples chosen randomly. The remaining 10% of both clean and stuttered samples were used for testing. All experiments were trained over 30 epochs, with minimal change in loss seen in further epochs.

The two metrics used to measure the performance of the aforementioned experiments were miss rate and accuracy. Miss rate (1 - recall) is used to determine the proportion of disfluencies which were incorrectly classified by the model. To balance out any bias this metric may hold, accuracy was used as a second performance metric.
TABLE IV: Percent miss rates (MR) and accuracy (Acc) of the six stutter types on the UCLASS and LibriStutter datasets.

Paper                    | Method       | Dataset      | S MR↓/Acc↑    | W MR↓/Acc↑   | PH MR↓/Acc↑   | I MR↓/Acc↑    | PR MR↓/Acc↑   | R MR↓/Acc↑
Alharbi et al. [17]      | Word Lattice | UCLASS       | 60 / –        | 0 / –        | – / –         | – / –         | 0 / –         | 25 / –
Kourkounakis et al. [33] | ResNet+BLSTM | UCLASS       | 18.10 / 84.10 | 3.20 / 96.60 | 4.46 / 95.54  | 25.12 / 81.40 | 5.92 / 94.08  | 2.86 / 97.14
Benchmark 1              | VGG-16       | UCLASS       | 20.80 / 81.03 | 6.54 / 93.01 | 12.82 / 87.91 | 28.44 / 72.03 | 9.04 / 90.83  | 5.20 / 94.90
Benchmark 2              | VGG-19       | UCLASS       | 19.41 / 81.35 | 5.22 / 95.42 | 10.13 / 91.60 | 26.06 / 73.64 | 5.72 / 94.21  | 4.72 / 96.32
Benchmark 3              | ResNet-18    | UCLASS       | 19.51 / 81.38 | 5.26 / 94.50 | 7.32 / 94.01  | 25.55 / 76.38 | 7.02 / 93.22  | 5.16 / 94.74
Ours                     | FluentNet    | UCLASS       | 16.78 / 84.46 | 3.43 / 96.57 | 3.86 / 96.26  | 24.05 / 81.95 | 5.34 / 94.89  | 2.62 / 97.38
Kourkounakis et al. [33] | ResNet+BLSTM | LibriStutter | 19.23 / 79.80 | 5.17 / 92.52 | 6.12 / 92.52  | 31.49 / 69.22 | 9.80 / 89.44  | – / –
Benchmark 1              | VGG-16       | LibriStutter | 20.97 / 79.33 | 6.27 / 92.74 | 8.90 / 91.94  | 36.47 / 64.05 | 10.63 / 89.10 | – / –
Benchmark 2              | VGG-19       | LibriStutter | 20.79 / 79.66 | 6.45 / 93.44 | 7.92 / 92.44  | 34.46 / 66.92 | 10.78 / 89.98 | – / –
Benchmark 3              | ResNet-18    | LibriStutter | 22.47 / 78.71 | 6.22 / 92.70 | 6.74 / 93.36  | 35.56 / 64.78 | 10.52 / 90.32 | – / –
Ours                     | FluentNet    | LibriStutter | 17.65 / 82.24 | 4.11 / 94.69 | 5.71 / 94.32  | 29.78 / 70.12 | 7.88 / 92.14  | – / –
B. Performance and Comparison

The results of our model for recognition of each stutter type are presented for the UCLASS and LibriStutter datasets in Table IV. FluentNet achieves strong results against all the disfluency types within both datasets, outperforming nearly all of the related work as well as the benchmark models.

As some previous works have been designed to tackle specific disfluency types as opposed to offering a general solution for detecting different types of disfluencies, a few of FluentNet's individual class accuracies do not surpass previous works', namely word repetitions and prolongations. In particular, the work by Alharbi et al. [17] offers perfect word repetition classification, as word lattices can easily identify two words repeated one after the other. Amplitude thresholding also proves to be a successful prolongation classification method. It should be noted that FluentNet does achieve strong results for these disfluency types as well. Notably, our work is one of the only ones that has attempted classification of interjection disfluencies. These disfluent utterances lack the unique phonetic and temporal patterns that, for instance, repetition or prolongation disfluencies contain. Moreover, they may be present as a combination of other disfluency types; for example, an interjection can be both prolonged or repeated. For these reasons, interjections remain the hardest category, with a 24.05% and 29.78% miss rate on the UCLASS and LibriStutter datasets, respectively. Nonetheless, FluentNet provides good results, especially given that interjections have been historically avoided.

The task-oriented lattices generated in [17] show strong performance on word repetitions and prolongations, but struggle to detect sound repetitions and revisions. Likewise, as is presented in [31], the CT-Transformer yields a comparable interjection classification miss rate to that of FluentNet. However, when the same model is applied to repetition stutters, the performance of the model drops severely, hindering its overall disfluency detection capabilities. The use of an attention-based transformer proves a viable method of classifying interjection disfluencies; however, as the results suggest, the convolutional and recurrent architecture in FluentNet allows for effective representations to be learned for interjection disfluencies alongside repetitions and prolongations.

FluentNet's achievements surpass our previous work's across all disfluency types on the LibriStutter dataset, and all but word repetition accuracy on the UCLASS dataset. The results show a greater margin of improvement on the LibriStutter dataset as compared to UCLASS between the two models. Notably, word repetitions and prolongations show a decrease in miss rate of approximately 20% between FluentNet and [33]. This implies the SE and attention mechanisms assist in better representing the disfluent utterances within stuttered speech found in the synthetic dataset.

An interesting observation is that LibriStutter proves a more difficult dataset compared to UCLASS, as evidenced by the lower performance of all the solutions including FluentNet. This is likely due to the fact that, given the large number of controllable parameters for each stutter type, LibriStutter is likely to contain a larger variance of stutter styles and variations, resulting in a more difficult problem space.

Table V presents the overall performance of our model with respect to all disfluency types on the UCLASS and LibriStutter datasets. The results are compared with other works on respective datasets, and the benchmarks which we implemented for comparison purposes. We observe that FluentNet achieves an average miss rate and accuracy of 9.35% and 91.75% on the UCLASS dataset, surpassing the other models and setting a new state-of-the-art. A similar trend can be seen for the LibriStutter dataset, where FluentNet outperforms the previous model along with all the benchmark models.

The BLSTM used in [28] yields successful results towards repetition stutter classification by learning temporal relationships between words; however, it remains impaired by its reliance solely on lexical model inputs. On the other hand, as shown by the results, FluentNet is better able to learn these phonetic details through the spectral and temporal representations of speech.

The work from [16] uses similar classification techniques to [17], however it improves upon the thresholding technique with the addition of a neural network. Though achieving an average accuracy of 86% across the same disfluency types used in this work, FluentNet remains a stronger model given its effective spectral frame-level and temporal embeddings. Nonetheless, the results of this work contain only a single overall accuracy value across all of repetition, interjection, and prolongation disfluency detection. Little is discussed on the origin and makeup of the dataset used.

Of the benchmark models without an RNN component, ResNet performs better than both VGG networks for both datasets, indicating that ResNet-style architectures are able to learn effective spectral representations of speech. This further justifies the use of a ResNet as the backbone of our
TABLE V: Average percent miss rates (MR) and accuracy (Acc) of disfluency classification models.

Paper                    | Dataset      | Ave. MR↓ | Ave. Acc.↑
Zayats et al. [28]       | Switchboard  | 19.4     | –
Villegas et al. [29]     | Custom       | –        | 82.6
Dash et al. [16]         | Custom       | –        | 86
Chen et al. [31]         | Custom       | 38.5     | –
Alharbi et al. [17]      | UCLASS       | 37       | –
Kourkounakis et al. [33] | UCLASS       | 10.03    | 91.15
Benchmark 1 (VGG-16)     | UCLASS       | 13.81    | 86.62
Benchmark 2 (VGG-19)     | UCLASS       | 12.21    | 87.92
Benchmark 3 (ResNet-18)  | UCLASS       | 12.14    | 89.14
FluentNet                | UCLASS       | 9.35     | 91.75
Kourkounakis et al. [33] | LibriStutter | 14.36    | 85.30
Benchmark 1 (VGG-16)     | LibriStutter | 16.65    | 83.43
Benchmark 2 (VGG-19)     | LibriStutter | 16.08    | 84.49
Benchmark 3 (ResNet-18)  | LibriStutter | 16.30    | 83.97
FluentNet                | LibriStutter | 13.03    | 86.70

Fig. 5: ROC curves for each stutter type tested on the UCLASS and LibriStutter datasets.

Fig. 6: Average training accuracy for FluentNet on the considered stutter types for the UCLASS and LibriStutter datasets.

UCLASS dataset. Similarly, we experimented with the use of different numbers of BLSTM layers, ranging between 0 to 3 layers. The use of 2 layers yielded the best results. Moreover, the use of bi-directional layers proved slightly more effective than uni-directional layers. Lastly, we experimented with a number of different values and strategies for the learning rate, where $10^{-4}$ showed the best results.

Figures 6(a) and 6(b) show FluentNet's performance for each stutter type against different epochs on the UCLASS and LibriStutter datasets, respectively. It can be seen that the training accuracy stabilizes after around 20 epochs. Whereas all disfluency types in the UCLASS dataset approach perfect training accuracy, training accuracy plateaus at much lower accuracies for interjections and sound repetitions within the LibriStutter dataset.
TABLE VI: Ablation experiment results, providing accuracy (Acc) and miss rates (MR) for each stutter type and model on the UCLASS and LibriStutter datasets.

Method                                 | Dataset      | S MR↓/Acc↑    | W MR↓/Acc↑   | PH MR↓/Acc↑  | I MR↓/Acc↑    | PR MR↓/Acc↑  | R MR↓/Acc↑   | Average MR↓/Acc↑
FluentNet                              | UCLASS       | 16.78 / 84.46 | 3.43 / 96.57 | 3.86 / 96.26 | 24.05 / 81.95 | 5.34 / 94.89 | 2.62 / 97.38 | 9.35 / 91.75
w/o Attention                          | UCLASS       | 16.97 / 83.13 | 3.51 / 96.29 | 4.23 / 95.78 | 24.22 / 80.78 | 6.88 / 92.50 | 3.25 / 96.55 | 9.84 / 90.84
w/o Squeeze-and-Excitation             | UCLASS       | 17.37 / 82.01 | 4.82 / 95.34 | 4.81 / 95.17 | 24.59 / 79.84 | 6.22 / 93.10 | 3.14 / 96.98 | 10.16 / 90.41
w/o Squeeze-and-Excitation & Attention | UCLASS       | 18.18 / 82.83 | 4.96 / 95.04 | 5.32 / 93.68 | 28.89 / 71.01 | 8.30 / 91.72 | 3.30 / 96.70 | 11.49 / 88.50
FluentNet                              | LibriStutter | 17.65 / 82.24 | 4.11 / 94.69 | 5.71 / 94.32 | 29.78 / 70.12 | 7.88 / 92.14 | – / –        | 13.03 / 86.70
w/o Attention                          | LibriStutter | 18.91 / 81.14 | 4.17 / 94.01 | 5.92 / 93.73 | 31.26 / 68.91 | 8.53 / 91.24 | – / –        | 13.76 / 85.81
w/o Squeeze-and-Excitation             | LibriStutter | 19.11 / 80.72 | 4.95 / 94.60 | 5.87 / 94.15 | 31.14 / 70.02 | 8.80 / 91.28 | – / –        | 13.97 / 86.15
w/o Squeeze-and-Excitation & Attention | LibriStutter | 19.23 / 79.80 | 5.17 / 92.52 | 6.12 / 92.52 | 31.49 / 69.22 | 9.80 / 89.44 | – / –        | 14.36 / 85.30
reinforcing the validity of LibriStutter's similarity to real stutters.

VI. CONCLUSION

Of the measurable metrics of speech, stuttering continues to be the most difficult to identify, as their diversity and uniqueness make them challenging for simple algorithms to model. To this end, we proposed a deep neural network, FluentNet, to accurately classify these disfluencies. FluentNet is an end-to-end deep neural network designed to accurately classify stuttered speech across six different stutter types: sound, word, and phrase repetitions, as well as revisions, interjections, and prolongations. This model uses a Squeeze-and-Excitation residual network to learn effective spectral frame-level speech representations, followed by recurrent bidirectional long short-term memory layers to learn temporal relationships from stuttered speech. A global attention mechanism was then added to focus on salient parts of speech in order to accurately detect the required disfluencies. Through comprehensive experiments, we demonstrate that FluentNet achieves state-of-the-art results on disfluency classification with respect to other works in the area as well as a number of benchmark models on the public UCLASS dataset. Given the lack of sufficient data to facilitate more in-depth research on disfluency detection, we developed a synthetic dataset, LibriStutter, based on the public LibriSpeech dataset.

Future works may include improving on LibriStutter's realism, which could constitute conducting further research into the physical sound generation of stutters and how they translate to audio signals. Whereas this work focuses on the educational and business applications of speech metric analysis, further work may focus towards medical and therapeutic use-cases.

ACKNOWLEDGMENT

The authors would like to thank Prof. Jim Hamilton for his support and valuable discussion throughout this work. We also wish to acknowledge Adrienne Nobbe for her consultation towards this project.

REFERENCES

[1] S. H. Ferguson and S. D. Morgan, “Talker differences in clear and conversational speech: Perceived sentence clarity for young adults with normal hearing and older adults with hearing loss,” Journal of Speech, Language, and Hearing Research, vol. 61, no. 1, pp. 159–173, 2018.
[2] J. S. Robinson, B. L. Garton, and P. R. Vaughn, “Becoming employable: A look at graduates’ and supervisors’ perceptions of the skills needed for employability,” in NACTA Journal, vol. 51, 2007, pp. 19–26.
[3] Mayo Foundation for Medical Education and Research. (2017) Stuttering. [Online]. Available: https://www.mayoclinic.org/diseases-conditions/stuttering/diagnosis-treatment/drc-20353577
[4] H. Trinh, R. Asadi, D. Edge, and T. Bickmore, “Robocop: A robotic coach for oral presentations,” ACM Conference on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 1, no. 2, p. 27, 2017.
[5] ASHA. (2020) Childhood fluency disorders. [Online]. Available: https://www.asha.org/Practice-Portal/Clinical-Topics/Childhood-Fluency-Disorders
[6] United Kingdom National Health Service. (2019) Stammering. [Online]. Available: https://www.nhs.uk/conditions/stammering/
[7] Anxiety and Depression Association of America. (2019). [Online]. Available: https://adaa.org/understanding-anxiety/social-anxiety-disorder/treatment/conquering-stage-fright
[8] Y. Wang, A. Mohamed, D. Le, C. Liu, A. Xiao, J. Mahadeokar, H. Huang, A. Tjandra, X. Zhang, F. Zhang et al., “Transformer-based acoustic modeling for hybrid speech recognition,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6874–6878.
[9] Y. He, T. N. Sainath, R. Prabhavalkar, I. McGraw, R. Alvarez, D. Zhao, D. Rybach, A. Kannan, Y. Wu, R. Pang et al., “Streaming end-to-end speech recognition for mobile devices,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6381–6385.
[10] A. Hajavi and A. Etemad, “A deep neural network for short-segment speaker recognition,” INTERSPEECH, 2019.
[11] D. Snyder, D. Garcia-Romero, G. Sell, A. McCree, D. Povey, and S. Khudanpur, “Speaker recognition for multi-speaker conversations using x-vectors,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5796–5800.
[12] R. Prenger, R. Valle, and B. Catanzaro, “Waveglow: A flow-based generative network for speech synthesis,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 3617–3621.
[13] N. Li, S. Liu, Y. Liu, S. Zhao, and M. Liu, “Neural speech synthesis with transformer network,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 6706–6713.
[14] P. Howell and S. Sackin, “Automatic recognition of repetitions and prolongations in stuttered speech,” Proceedings of the First World Congress on Fluency Disorders, 01 1995.
[15] P. Howell, S. Sackin, and K. Glenn, “Development of a two-stage procedure for the automatic recognition of dysfluencies in the speech of children who stutter: II. ANN recognition of repetitions and prolongations with supplied word segment markers,” Journal of Speech, Language, and Hearing Research, 10 1997.
[16] A. Dash, N. Subramani, T. Manjunath, V. Yaragarala, and S. Tripathi, “Speech recognition and correction of a stuttered speech,” in 2018 International Conference on Advances in Computing, Communications and Informatics (ICACCI), 2018, pp. 1757–1760.
[17] S. Alharbi, M. Hasan, A. Simons, S. Brumfitt, and P. Green, “A lightly supervised approach to detect stuttering in children’s speech,” INTERSPEECH, pp. 3433–3437, 2018.
[18] P. Howell, S. Davis, and J. Bartrip, “The University College London archive of stuttered speech (UCLASS),” Journal of Speech, Language, and Hearing Research, vol. 52, pp. 556–569, 2009.
[19] T. Tan, Helbin-Liboh, A. K. Ariff, C. Ting, and S. Salleh, “Application of Malay speech technology in Malay speech therapy assistance tools,” International Conference on Intelligent and Advanced Systems, pp. 330–334, 2007.
[20] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in 2015 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5206–5210.
[21] N. Zeghidour, Q. Xu, V. Liptchinsky, N. Usunier, G. Synnaeve, and R. Collobert, “Fully convolutional speech recognition,” arXiv preprint arXiv:1812.06864, 2018.
[22] E. Yairi and N. G. Ambrose, “Early childhood stuttering I: persistency and recovery rates,” Journal of Speech, Language, and Hearing Research, vol. 42, 1999.
[23] F. S. Juste and C. R. F. de Andrade, “Speech disfluency types of fluent and stuttering individuals: Age effects,” International Journal of Phoniatrics, Speech Therapy and Communication Pathology, vol. 63, 2011.
[24] Stuttering Foundation. (2020) Differential diagnosis. [Online]. Available: https://www.stutteringhelp.org/differential-diagnosis
[25] K. Ravikumar, S. Kudva, R. Rajagopal, and H. Nagaraj, “Development of a procedure for the automatic recognition of disfluencies in the speech of people who stutter,” in International Conference on Advanced Computing Technologies, Hyderabad, India, 2008, pp. 514–519.
[26] H. K. M. Ravikumar and R. Rajagopal, “An approach for objective assessment of stuttered speech using MFCC features,” Digital Signal Processing Journal, vol. 9, pp. 19–24, 2019.
[27] L. S. Chee, O. C. Ai, and S. Yaacob, “Overview of automatic stuttering recognition system,” in Proc. International Conference on Man-Machine Systems, no. October, Batu Ferringhi, Penang, Malaysia, 2009, pp. 1–6.
[28] V. Zayats, M. Ostendorf, and H. Hajishirzi, “Disfluency detection using a bidirectional LSTM,” INTERSPEECH, pp. 2523–2527, 2016.
[29] B. Villegas, K. M. Flores, K. Jos Acua, K. Pacheco-Barrios, and D. Elias, “A novel stuttering disfluency classification system based on respiratory biosignals,” in 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), 2019, pp. 4660–4663.
[30] J. Santoso, T. Yamada, and S. Makino, “Classification of causes of speech recognition errors using attention-based bidirectional long short-term memory and modulation spectrum,” in 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2019, pp. 302–306.
[31] Q. Chen, M. Chen, B. Li, and W. Wang, “Controllable time-delay transformer for real-time punctuation prediction and disfluency detection,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 8069–8073.
[32] K. Georgila, “Using integer linear programming for detecting speech disfluencies,” in Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, 2009, pp. 109–112.
[33] T. Kourkounakis, A. Hajavi, and A. Etemad, “Detecting multiple speech disfluencies using a deep residual network with bidirectional long short-term memory,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6089–6093.
[34] S. Khara, S. Singh, and D. Vir, “A comparative study of the techniques for feature extraction and classification in stuttering,” in 2018 Second International Conference on Inventive Communication and Computational Technologies (ICICCT), 2018, pp. 887–893.
[35] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
[36] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” CoRR, vol. abs/1709.01507, 2017. [Online]. Available: http://arxiv.org/abs/1709.01507
[37] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi, “Inception-v4, Inception-ResNet and the impact of residual connections on learning,” arXiv preprint arXiv:1602.07261, 2016.
[38] A. G. Roy, N. Navab, and C. Wachinger, “Concurrent spatial and channel squeeze & excitation in fully convolutional networks,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2018, pp. 421–429.
[39] S. Bell, C. Lawrence Zitnick, K. Bala, and R. Girshick, “Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2874–2883.
[40] A. Newell, K. Yang, and J. Deng, “Stacked hourglass networks for human pose estimation,” in European Conference on Computer Vision. Springer, 2016, pp. 483–499.
[41] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[42] Y. Ma, H. Peng, and E. Cambria, “Targeted aspect-based sentiment analysis via embedding commonsense knowledge into an attentive LSTM,” in AAAI, 2018, pp. 5876–5883.
[43] H. Y. Kim and C. H. Won, “Forecasting the volatility of stock price index: A hybrid model integrating LSTM with multiple GARCH-type models,” Expert Systems with Applications, vol. 103, pp. 25–37, 2018.
[44] P. Li, M. Abdel-Aty, and J. Yuan, “Real-time crash risk prediction on arterials based on LSTM-CNN,” Accident Analysis & Prevention, vol. 135, p. 105371, 2020.
[45] M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural networks,” IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
[46] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[47] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
[48] A. Hajavi and A. Etemad, “Knowing what to listen to: Early attention for deep speech representation learning,” arXiv preprint arXiv:2009.01822, 2020.
[49] S. Mirsamadi, E. Barsoum, and C. Zhang, “Automatic speech emotion recognition using recurrent neural networks with local attention,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 2227–2231.
[50] T. Sun and A. A. Wu, “Sparse autoencoder with attention mechanism for speech emotion recognition,” in 2019 IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS), 2019, pp. 146–149.
[51] M.-T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural machine translation,” arXiv preprint arXiv:1508.04025, 2015.
[52] F. Chollet et al. (2015) Keras. [Online]. Available: https://keras.io
[53] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “Tensorflow: A system for large-scale machine learning,” in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265–283.
[54] B. McFee, C. Raffel, D. Liang, D. P. Ellis, M. McVicar, E. Battenberg, and O. Nieto, “librosa: Audio and music signal analysis in Python,” in Proceedings of the 14th Python in Science Conference, vol. 8, 2015.
[55] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett, “DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1,” STIN, vol. 93, p. 27403, 1993.
[56] A. Nagrani, J. S. Chung, and A. Zisserman, “VoxCeleb: a large-scale speaker identification dataset,” arXiv preprint arXiv:1706.08612, 2017.
[57] L. S. Chee, O. C. Ai, M. Hariharan, and S. Yaacob, “MFCC based recognition of repetitions and prolongations in stuttered speech using k-NN and LDA,” in 2009 IEEE Student Conference on Research and Development (SCOReD), 2009, pp. 146–149.
[58] O. C. Ai, M. Hariharan, S. Yaacob, and L. S. Chee, “Classification of speech dysfluencies with MFCC and LPCC features,” Expert Systems with Applications, vol. 39, no. 2, pp. 2157–2165, 2012.
[59] (2020) Google Cloud Speech-to-Text. [Online]. Available: https://cloud.google.com/speech-to-text/
[60] J. Van Borsel, E. Geirnaert, and R. Van Coster, “Another case of word-final disfluencies,” Folia Phoniatrica et Logopaedica, vol. 57, no. 3, pp. 148–162, 2005.
[61] J. Han, M. Kamber, and J. Pei, Data Mining, 3rd ed. Elsevier Inc., 2012.
[62] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.