
FluentNet: End-to-End Detection of Speech Disfluency with Deep Learning

Tedd Kourkounakis, Amirhossein Hajavi, and Ali Etemad

arXiv:2009.11394v1 [eess.AS] 23 Sep 2020

Abstract—Strong presentation skills are valuable and sought-after in workplace and classroom environments alike. Of the possible improvements to vocal presentations, disfluencies and stutters in particular remain one of the most common and prominent factors of someone’s demonstration. Millions of people are affected by stuttering and other speech disfluencies, with the majority of the world having experienced mild stutters while communicating under stressful conditions. While there has been much research in the field of automatic speech recognition and language models, there lacks a sufficient body of work when it comes to disfluency detection and recognition. To this end, we propose an end-to-end deep neural network, FluentNet, capable of detecting a number of different disfluency types. FluentNet consists of a Squeeze-and-Excitation Residual convolutional neural network which facilitates the learning of strong spectral frame-level representations, followed by a set of bidirectional long short-term memory layers that aid in learning effective temporal relationships. Lastly, FluentNet uses an attention mechanism to focus on the important parts of speech to obtain a better performance. We perform a number of different experiments, comparisons, and ablation studies to evaluate our model. Our model achieves state-of-the-art results by outperforming other solutions in the field on the publicly available UCLASS dataset. Additionally, we present LibriStutter: a disfluency dataset based on the public LibriSpeech dataset with synthesized stutters. We also evaluate FluentNet on this dataset, showing the strong performance of our model versus a number of benchmark techniques.

Index Terms—Speech, stutter, disfluency, deep learning, squeeze-and-excitation, BLSTM, attention.

(All the authors are with the Department of Electrical and Computer Engineering, Queen’s University, Kingston, ON, K7L3N6 Canada, e-mail: [email protected].)

I. INTRODUCTION

Clear and comprehensive speech is the vital backbone of strong communication and presentation skills [1]. Where some occupations consist mainly of presenting, most careers require and thrive on the ability to communicate effectively. Research has shown that oral communication remains one of the more employable skills in both the perception of employers and new graduates [2]. Simple changes to one’s speaking patterns, such as volume or the appearance of disfluencies, can have a huge impact on the ability to convey information effectively. By providing simplified, quantifiable data concerning one’s speech patterns, as well as feedback on how to change one’s speaking habits, drastic improvements could be made to anyone’s communication skills [3].

In regard to presentation skills, disfluent speech remains one of the more common factors [4]. Any abnormality or generally uncommon component of one’s speech patterns is referred to as a speech disfluency [5]. There are hundreds of different speech disfluencies, often grouped together alongside language and swallowing disorders. Of these afflictions, stuttering proves to be one of the most common and most recognized of the lot [5].

Stuttering, also known as stammering, as a disorder can be generally defined as issues pertaining to the consistency of the flow and fluency of speech. This often involves involuntary additions of sounds and words, and the delay or inability to consistently progress through a phrase. Although labelled as a disorder, stuttering can occur in any person’s speech, often induced by stress or nervousness [6]. These cases, however, do not correlate with stammering as a disorder, but are caused by performance anxiety [7]. The use of stutter detection does not only apply to those with long-term stutter impairments, but can appeal to the majority of the world as it can help with the improvement of communication skills.

As the breadth of applications using machine learning techniques has flourished in recent decades, they have only recently begun to be utilized in the field of speech disfluency and disorder detection. While deep learning has dominated many areas of speech processing, for instance speech recognition [8] [9], speaker recognition [10] [11], and speech synthesis [12] [13], very little work has been done toward the problem of speech disfluency detection. Disfluencies, including stutters, are not easily definable; they come in many shapes and variations. This means that factors such as gender, age, accent, and the language itself will affect the contents of each stutter, greatly complicating the problem space. As well, there are many classes of stutter, each with their own sub-classes and with wildly different structures, making the identification of all stutter types with a single model a difficult task. Even a specific type of stutter applied to a single word can be conducted in a wide variety of ways. Where people are great at identifying stutters through their experience with them, machine learning models have historically struggled with this (as we show in Section II).

Another common issue is the sheer lack of sufficient training data available. Many previous works often rely on their own manually recorded, transcribed, and labelled datasets, which are often quite small due to the work involved in their creation [14] [15] [16] [17]. There is only one commonly used public dataset, UCLASS [18], that is widely used amongst works in this area, though it still is also quite small.

Many disfluency detection solutions provide some form of filler word identification, flagging and counting any spoken interjections (e.g. ‘okay’, ‘right’, etc.). However, upon further investigation, these applications simply request a list of
interjections from the user and use Speech-to-Text (STT) tools in order to match the spoken word with any interjections in the list. Though this may work fine for interjections such as ‘um’ and ‘uh’ (assuming the used STT tool has the necessary embeddings), this can lead to serious overall errors in classification for most other utterances that are actual words, such as ‘like’, which is commonly used as a filler word in the English language.

Early works in stutter detection, realizing the challenges mentioned above, first sought out to test the viability of identifying stutters from clean speech. These models primarily focused on machine learning models with very small datasets, consisting of a single stutter type, or even a single word [14], [19]. In more recent years, and due to the rise of automatic speech recognition (ASR), language models have been used to tackle stutter recognition. These works have proven to be strong at identifying certain stutter types, and have been showing ever-improving results [17], [16]. However, due to the uncertainty surrounding relations between cleanly spoken and stuttered word embeddings, it remains difficult for these models to generalize across multiple stutter types. It is hypothesized that by bypassing the use of language models, and by focusing solely on phonetics through the use of convolutional networks, a model can be created that both maintains a strong average accuracy while also being effective across all stutter types.

In this paper, we propose a model capable of detecting speech disfluencies. To this end, we design FluentNet, a deep neural network (DNN) for automated speech disfluency detection. The proposed network does not apply any language model aspects, but instead focuses on the direct classification of speech signals. This allows for the avoidance of complex and time-consuming ASR as a pre-processing step in our model, and provides the ability to view the scenario as an end-to-end solution using a single deep neural network. We validate our model on a commonly used benchmark dataset, UCLASS [18]. To tackle the issue of scarce stutter-related speech datasets, we also develop a synthetic dataset based on a non-stuttered speech dataset (LibriSpeech [20]), which we entitle LibriStutter. This dataset is created to mimic stuttered speech and vastly expand the amount of data available for use. Our end-to-end neural network takes spectrogram feature images as inputs, and uses Squeeze-and-Excitation residual (SE-ResNet) blocks for learning the speech embedding. Next, a bidirectional long short-term memory (BLSTM) network is used to learn the temporal relationships, followed by an attention mechanism to focus on the more salient parts of the speech. Experiments show the effectiveness of our approach in generalizing across multiple classes of stutters while maintaining a high accuracy and strong consistency between classes on both datasets.

The key contributions of our work can be summarized as follows: (1) We propose FluentNet, an end-to-end deep neural network capable of detection of several types of speech disfluencies; (2) We develop a synthesized disfluency dataset called LibriStutter based on the publicly available LibriSpeech dataset by artificially generating several types of disfluencies, namely sound, word, and phrase repetitions, as well as prolongations and interjections. The dataset contains the output labels that can be used in training deep learning methods; (3) We evaluate our model (FluentNet) on two datasets, UCLASS and LibriStutter. The experiments show that our model achieves state-of-the-art results on both datasets, outperforming a number of other baselines as well as previously published work; (4) We make our annotations on the existing UCLASS dataset, along with the entire LibriStutter dataset and its labels, publicly available1 to contribute to the field and facilitate further research.

1 http://aiimlab.com/resources.html

This is an extension of our earlier work titled “Detecting Multiple Speech Disfluencies using a Deep Residual Network with Bidirectional Long Short-Term Memory”, published in the 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). This paper focused on tackling the problem of detection and classification of different forms of stutters. The model used a deep residual network and bidirectional long short-term memory layers to classify different types of stutters. In this extended work, we replace the previously used residual blocks of the spectral encoder with residual squeeze-and-excitation blocks. Additionally, we add an attention mechanism after the recurrent network to better focus the network on salient parts of input speech. Furthermore, we develop a new dataset, which we present in this paper and make publicly available. Lastly, we perform thorough experiments, for instance through additional benchmark comparisons and ablation studies. Our experiments show the improvements made by FluentNet over our preliminary work, as validated on both the UCLASS dataset (previously used) as well as the newly developed dataset. This new model provides greater advancement towards end-to-end disfluency detection and classification.

The rest of this paper is organized as follows: a discussion of previous contributions towards stutter recognition in Section II, followed by our methodology including a breakdown of the model in Section III, the datasets and benchmark models applied in Section IV, a discussion of our results in Section V, and our conclusion in the final section.

II. RELATED WORK

There has recently been increasing interest in the fields of deep learning, speech, and audio processing. However, as discussed earlier in Section I, there has been minimal research targeting automated detection of speech disfluencies including stuttering, most likely as a result of insufficient data and the smaller number of potential applications in comparison to other speech-related problems such as speech recognition [21] [9] and speaker recognition [10] [11]. In the following sections we first provide a summary of the types of disfluencies commonly targeted in the area, followed by a review of the existing work that falls under the umbrella of speech disfluency detection and classification.

A. Background: Types of Speech Disfluency

There are a number of different stuttering types, often categorized into four main groups: repetitions, prolongations,
TABLE I: Types of stutters considered for training and testing labels.


Label Stutter Disfluency Description Example
S Sound Repetition Repetition of a phoneme th-th-this
PW Part-Word Repetition Repetition of a syllable bec-because
W Word Repetition Repetition of any word why why
PH Phrase Repetition Repetition of multiple successive words I know I know that
R Revision Repetition of thought, rephrased mid sentence I think that- I believe that
I Interjection Addition of fabricated words or sounds um, uh, like
PR Prolongation Prolonged sounds whoooooo is there
B Block Involuntary pause within a phrase I want (pause) to

interjections, and blocks. A summary of all these disfluency types and examples of each are presented in Table I. The descriptions for each of these categories are as follows.

Repetitions are classified as any part of an utterance repeated at a quick pace. As this definition still remains general, repetitions are often further sub-categorized [5]. These sub-categories have been used in previous works classifying stutter disfluencies [22] [23] [17], and include sound, word, and phrase repetitions, as well as revisions. Sound repetitions (S) are repetitions of a single phoneme, or short sound, often consisting of a single letter. Part-word, or syllable, repetitions (PW), as the name suggests, are classified as the repetition of syllables, which can consist of multiple phonemes. Similarly, word repetitions (W) are defined as any repetition of a single word, and phrase repetitions (PH) are the repetition of phrases, consisting of multiple consecutive words. The final repetition-type disfluency is revision (R). Similar to phrase repetitions, they consist of repeated phrases, where the repeated segment is rephrased, containing new or different information from the first iteration. A rise in pitch may accompany this disfluency type [24].

Interjections (I), often referred to as filler words, consist of the addition of any utterance that does not logically belong in the spoken phrase. Common interjections in the English language include exclamations, such as ‘um’ and ‘uh’, as well as discourse markers such as ‘like’, ‘okay’, and ‘right’.

Prolongation (PR) stutters are presented as a lengthened or sustained phoneme. The duration of these prolonged utterances varies alongside the severity of the stutter. Similar to repetitions, this disfluency is often accompanied by a rise in pitch.

The final category of stuttering is silent blocks (B), which are sudden cutoffs of vocal utterances. These are often involuntary and are presented as pauses within a given phrase.

B. Stutter Recognition with Classical Machine Learning

Before the focus of stutter recognition targeted maximizing accuracy in the classification of stammers, a number of works were performed toward testing the viability of stutter detection. In 1995, Howell et al. [14], who later helped to create the UCLASS dataset [18] used in this paper, employed a set of pre-defined words to identify repetition and prolongation stutters. From these, they extracted autocorrelation features, spectral information, and envelope parameters from the audio. Each was used as an input to a fully connected artificial neural network (ANN). Findings showed that the model achieved its strongest classification results against severe disfluencies, and was weakest for mild ones. These models were able to achieve a maximum detection rate of 0.82 on severe prolongation stutters. Howell et al. [15] later furthered their work using a larger set of data, as well as a wider variety of audio parameters. This work also introduced an ANN model for both repetition and prolongation types, and more judges were used to identify stutters with strict restrictions towards agreement of disfluency labeling. Results showed that the best parameters for disfluency classification were fragmentation spectral measures for whole words, as well as duration and supralexical disfluencies of energy in part-words.

Tan et al. [19] worked on testing the viability of stutter detection through a simplified approach in order to maximize the possible results. By collecting audio samples of clean, stuttered, and artificially generated copies of single pre-chosen words, they were able to reach an average accuracy of 96% on the human samples using a hidden Markov model (HMM). This served as a temporary benchmark towards the possible best average results for stutter detection.

Ravikumar et al. have attempted a variety of classifiers on syllable repetitions, including an HMM [25] and a support vector machine (SVM) [26] using Mel-frequency cepstral coefficient (MFCC) features. Their best results were obtained when classifying this stutter type using the SVM on 15 participants, achieving an accuracy of 94.35%. No other disfluency types were considered.

A detailed summary of previously attempted stutter classification methods, including some of the aforementioned classical models, is available in the form of a review paper in [27]. This paper provides background on the use of three different models (ANNs, HMMs and SVMs) towards the application of stutter recognition. Of the works considered in that review paper in 2009, it was concluded that HMMs achieve the best results in stutter recognition.

C. Stutter Recognition with Deep Learning

With the recent advancements in deep learning, disfluency detection and classification has seen an increase in popularity within the field, with a higher tendency towards end-to-end approaches. ASR has become an increasingly popular method of tackling the problem of disfluency classification. As some stuttered speech results in repeated words, as well as prolonged utterances, these can be represented by word embeddings and sound amplitude features, respectively. To exploit this concept, Alharbi et al. [17] detected sound and word repetitions, as well
TABLE II: Summary of previous stutter disfluency classification methods.


Year Author Dataset Features Classification Method Results
1995 Howell et al. [14] N/A autocorrelation function, spectral information, envelope parameters ANN Acc.: 82%
1997 Howell et al. [15] 12 Speech Samples oscillographic and spectrographic parameters ANN Avg. Acc.: 92%
2007 Tan et al. [19] 20 Samples (single word) MFCC HMM Acc.: 96%
2009 Ravikumar et al. [26] 15 Speech Samples MFCC SVM Acc.: 94.35%
2016 Zayats et al. [28] Switchboard Corpus MFCC BLSTM w/ Attention F1: 85.9
2018 Alharbi et al. [17] UCLASS Word Lattice Finite State Transducer, Amplitude and Time Thresholding Avg. MR: 37%
2018 Dash et al. [16] 60 Speech Samples Amplitude STT, Amplitude Thresholding Acc.: 86%
2019 Villegas et al. [29] 68 Participants Respiratory Biosignals Perceptron Acc.: 95.4%
2019 Santoso et al. [30] PASD, UUDB MFCC BLSTM w/ Attention F1: 69.1
2020 Chen et al. [31] In-house Chinese Corpus Word & Position Embeddings CT-Transformer MR: 38.5%

as revision disfluencies using task-oriented finite state transducer (FST) lattices. They also utilized amplitude thresholding techniques to detect prolongations in speech. These methods resulted in an average 37% miss rate across the 4 different types of disfluencies.

Dash et al. [16] have used an STT model in order to identify word and phrase repetitions within stuttered speech. To detect prolongation stutters, they integrated a neural network capable of finding optimal cutoff amplitudes for a given speaker to expand upon simple thresholding methods. As these ASR works required full word embeddings to classify repetitions, they either fared poorly against, or did not attempt, sound or part-word repetitions.

Deep recurrent neural networks (RNNs), namely BLSTMs, have been used to tackle stutter classification. Zayats et al. [28] trained a BLSTM with Integer Linear Programming (ILP) [32] on a set of MFCC features to detect repetitions with an F-score of 85.9. Similarly, a work done by Santoso et al. applied a BLSTM followed by an attention mechanism to perform stutter detection based on input MFCC features, obtaining a maximum F-score of 69.1 [30]. More recently, in a study by Chen et al., a Controllable Time-delay Transformer (CT-Transformer) has been used to detect speech disfluencies and correct punctuation in real time [31]. In our initial work on stutter classification, we utilized spectrogram features of stuttered audio and used a BLSTM [33] to learn temporal relationships following spectral frame-level representation learning by a ResNet. This model achieved a 91.15% average accuracy across six different stutter categories.

In an interesting recent work, Villegas et al. utilized respiratory biosignals in order to better detect stutters [29]. By correlating respiratory volume and flow, as well as heart rate measurements, with the time when a stutter occurs, they were able to classify block stutters with an accuracy of 95.4% using an MLP.

A 2018 summary and comparison of different features and classification methods for stuttering has been conducted by Khara et al. [34]. This work discusses and compares the different popular feature extraction methods, classifiers and their uses, as well as their advantages and shortcomings. The paper discusses that MFCC feature extraction has historically provided the strongest results. Similarly, ANNs provide the most flexibility and adaptability compared to other models, especially linear ones.

Table II provides a summary of the related works on disfluency detection and classification. It can be observed and concluded that disfluency classification has been progressing on one of two fronts: i) end-to-end speech-based methods, or ii) language-based models relying on an ASR pre-processing step. Our work in this paper is positioned in the first category in order to avoid the reliance on an ASR step. Moreover, from Table II, we observe that although progress is being made in the area of speech disfluency recognition, the lack of available data remains a hindrance to potential further achievements in the field.

III. PROPOSED METHOD

A. Problem and Solution Overview

Our goal in this section is to design and develop a system that can be used for detecting various types of disfluencies. While one approach to tackle this concept is to design a multi-class problem, another approach is to design an ensemble of single-disfluency detectors. In this paper, given the relatively small size of available stutter datasets, we use the latter approach, which can help reduce the complexity of each binary task. Accordingly, the goal is to design a single network architecture that can be trained separately to detect different disfluency types with each trained instance, where together they could detect a number of different disfluencies. Figure 1 shows the overview of our system. The designed network should possess the capability of learning spectral frame-level representations as well as temporal relationships. Moreover, the model should be able to focus on salient parts of the inputs in order to effectively learn the disfluencies and perform accurately.
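To illustrate the ensemble-of-binary-detectors design described above, the short Python sketch below trains one instance of the same architecture per disfluency type. It is an illustration only, not the authors' released code: build_fluentnet is a hypothetical constructor standing in for the network detailed in the next subsection, and the Keras-style fit/predict calls are assumptions.

# One binary detector per disfluency type; together they form the ensemble in Figure 1.
DISFLUENCY_TYPES = ["sound_repetition", "word_repetition", "phrase_repetition",
                    "revision", "interjection", "prolongation"]

def train_ensemble(build_fluentnet, spectrograms, labels_by_type, epochs=30):
    detectors = {}
    for dtype in DISFLUENCY_TYPES:
        model = build_fluentnet()                 # a fresh instance per disfluency type
        y = labels_by_type[dtype]                 # 1 = clip contains this disfluency
        model.fit(spectrograms, y, epochs=epochs)
        detectors[dtype] = model
    return detectors

def detect(detectors, spectrogram):
    # Each trained instance answers its own binary question about the clip.
    return {dtype: float(m.predict(spectrogram[None, ...])[0, 0])
            for dtype, m in detectors.items()}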
B. Proposed Network: FluentNet

We propose an end-to-end network, FluentNet, which uses short-time Fourier transform (STFT) spectrograms of audio clips as inputs. These inputs are passed through a Squeeze-and-Excitation Residual Network (SE-ResNet) to learn frame-level spectral representations. As most stutter types have distinct spectral and temporal properties, a bidirectional LSTM network is introduced to learn the temporal relationships
present among different spectrograms. An attention mechanism is then added to the final recurrent layer to better focus on the necessary features needed for stutter classification. FluentNet’s final output reveals a binary classification to detect a specific disfluency type that it has been trained for. The architecture of the network is presented in Figure 2(a). In the following, we describe each of the components of our model in detail.

[Figure 1: the input audio feeds six parallel branches, one per disfluency type (sound repetition, word repetition, phrase repetition, revision, interjection, prolongation); each branch consists of an SE-ResNet, a BLSTM, and an attention stage followed by a binary classification.]
Fig. 1: Full model overview using FluentNet for disfluency classification.

1) Data Representation: Input audio clips recorded with a sampling rate of 16 kHz are converted to spectrograms using STFT with 256 filters (frequency bins) to be fed to our end-to-end model. A sample spectrogram can be seen in Figure 2, where the colours represent the amplitude of each frequency bin at a given frame, with blue representing lower amplitudes, and green and yellow representing higher amplitudes. Following the common practice in audio-signal processing, a 25 ms frame has been used with an overlap of 10 ms.
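As a concrete illustration of this data-representation step, the following sketch uses Librosa (which the implementation section lists for audio handling) to build the STFT spectrogram of a four-second clip. The mapping of "256 filters" and the 25 ms / 10 ms framing onto n_fft, win_length, and hop_length values, as well as the use of log-magnitude, are assumptions rather than parameters stated in the paper.

import librosa
import numpy as np

def audio_to_spectrogram(path, sr=16000, clip_seconds=4.0):
    """Load a clip and compute an STFT magnitude spectrogram.

    At 16 kHz, a 25 ms frame with a 10 ms hop corresponds to
    win_length=400 and hop_length=160 samples; n_fft=510 gives
    n_fft // 2 + 1 = 256 frequency bins (assumed interpretation
    of the paper's "256 filters").
    """
    y, _ = librosa.load(path, sr=sr, duration=clip_seconds)
    # Zero-pad shorter clips so every input shares the same shape.
    y = librosa.util.fix_length(y, size=int(sr * clip_seconds))
    stft = librosa.stft(y, n_fft=510, win_length=400, hop_length=160)
    # Log-magnitude values serve as the spectrogram "image".
    return librosa.amplitude_to_db(np.abs(stft), ref=np.max)

spec = audio_to_spectrogram("clip.wav")  # shape: (256, ~401)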
2) Learning Frame-level Spectral Representations: FluentNet first focuses on learning effective representations from each input spectrogram. To do so, CNN architectures are often used. Though both residual networks [35] and squeeze-and-excitation (SE) networks [36] are relatively new in the field of deep learning, both have proven to improve on previous state-of-the-art models in a variety of different application areas [37], [38]. The ResNet architecture has proven a reliable solution to the vanishing or exploding gradient problems, both common issues when back-propagating through a deep neural network. In many cases, as the model depth increases, the gradients of weights in the model become increasingly smaller, or inversely, explosively larger with each layer. This may eventually prevent the gradients from actually changing the weights, or the weights from becoming too large, thus preventing improvements in the model. A ResNet overcomes this by utilizing shortcuts throughout its CNN blocks, resulting in norm-preserving blocks capable of carrying gradients through very deep models.

Squeeze-and-excitation modules have been recently proposed and have been shown to outperform various DNN models using previous CNN architectures, namely VGG and ResNet, as their backbone architectures [36]. SE networks were first proposed for image classification, reducing the relative error compared to previous works on the ImageNet dataset by approximately 25% [36].

Every kernel within a convolution layer of a CNN results in an added channel (depth) for the output feature map. Whereas recent works have focused on expanding on the spectral relationships within these models [39] [40], SE blocks build a stronger focus on channel-wise relationships within a CNN. These blocks consist of two major operations. The squeeze operation aggregates a feature map across both its height and width, resulting in a one-dimensional channel descriptor. The excitation operation consists of fully connected layers providing channel-wise weights, which are then applied back to the original feature map.

To exploit the capabilities of both ResNet and SE architectures and learn effective spectral frame-level representations from the input, we use an SE-ResNet model in our end-to-end network. The network consists of 8 SE-ResNet blocks, as shown in Figure 2(a). Each SE-ResNet block in FluentNet, illustrated in Figure 2(b), consists of three sets of two-dimensional convolution layers, each succeeded by a batch normalization and Rectified Linear Unit (ReLU) activation function. A separate residual connection shares the same input as the block’s non-identity branch, and is added back to the non-identity branch before the final ReLU function, but after the SE unit (described below). Each residual connection contains a convolution layer followed by batch normalization. The Squeeze-and-Excitation unit within each SE-ResNet block begins with a global pooling layer. The output is then passed through two fully connected layers: the first followed by a ReLU activation function, and the second succeeded by a sigmoid gating function. The main convolution branch is scaled with the output of the SE unit using channel-wise multiplication.
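The sketch below is one way to express a single SE-ResNet block as described above in tf.keras: two Conv2D/BatchNorm/ReLU sets plus a third Conv2D/BatchNorm whose ReLU is deferred until after the residual addition, a squeeze-and-excitation unit, and a convolutional shortcut. Filter counts, kernel sizes, and the reduction ratio are assumptions, since the paper does not specify them.

import tensorflow as tf
from tensorflow.keras import layers

def se_resnet_block(x, filters, reduction=16):
    """One SE-ResNet block (sketch): conv branch, SE re-scaling, conv shortcut."""
    shortcut = layers.Conv2D(filters, 1, padding="same")(x)
    shortcut = layers.BatchNormalization()(shortcut)

    h = x
    for _ in range(2):
        h = layers.Conv2D(filters, 3, padding="same")(h)
        h = layers.BatchNormalization()(h)
        h = layers.ReLU()(h)
    h = layers.Conv2D(filters, 3, padding="same")(h)
    h = layers.BatchNormalization()(h)
    # (the third ReLU is applied after the residual addition below)

    # Squeeze: global average pooling to a per-channel descriptor.
    s = layers.GlobalAveragePooling2D()(h)
    # Excitation: two fully connected layers, ReLU then sigmoid gating.
    s = layers.Dense(filters // reduction, activation="relu")(s)
    s = layers.Dense(filters, activation="sigmoid")(s)
    s = layers.Reshape((1, 1, filters))(s)

    h = layers.Multiply()([h, s])     # channel-wise re-scaling by the SE weights
    h = layers.Add()([h, shortcut])   # residual connection after the SE unit
    return layers.ReLU()(h)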
3) Learning Temporal Relationships: In order to learn the temporal relationships between the representations learned from the input spectrogram, we use an RNN. In particular, LSTM [41] networks have been shown to be effective for this purpose in the past and are widely used for learning sequences of spectral representations obtained from consecutive segments of time-series data [42] [43] [44].

Each LSTM unit contains a cell state, which holds information contained in previous units, allowing the network to learn temporal relationships. This cell state is part of the LSTM’s memory unit, where there lie several gates that together control which information from the inputs, as well as from the previous cell and hidden states, will be used to generate the current cell and hidden states. Namely, the forget gate, f_t, and input gate, i_t, are utilized to learn what information from each of these respective states will be saved within the new current state, C_t. This is shown by the following equations:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)    (1)

i_t = σ(W_i · [h_{t-1}, x_t] + b_i)    (2)
Fig. 2: a) A full workflow of FluentNet is presented. This network consists of 8 SE residual blocks, two BLSTM layers, and
a global attention mechanism. b) The breakdown of a single SE-ResNet block in FluentNet is presented.

C_t = f_t ∗ C_{t-1} + i_t ∗ tanh(W_C · [h_{t-1}, x_t] + b_C)    (3)

where σ represents the sigmoid function, and the ∗ operator denotes point-wise multiplication. This new cell state, along with an output gate, o_t, is used to generate the hidden state of the unit, h_t, as represented by:

o_t = σ(W_o · [h_{t-1}, x_t] + b_o)    (4)

h_t = o_t ∗ tanh(C_t)    (5)

The cell state and hidden state are then passed to successive LSTM units, allowing the network to learn long-term dependencies.

We used a BLSTM network [45] in FluentNet. BLSTMs consist of two LSTMs advancing in opposite directions, maximizing the available context from relationships of both the past and future. The outputs of these two networks are multiplied together into a single output layer. FluentNet consists of two consecutive BLSTMs, each utilizing LSTM cells with 512 hidden units. A dropout [46] of 20% was also applied to each of these recurrent layers. To avoid overfitting given the size of the dataset, the randomly masked neurons caused by dropout force the model to be trained using a sparse representation of the given data.
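A minimal sketch of the recurrent part described above: two stacked bidirectional LSTM layers with 512 hidden units each and 20% dropout, applied to the frame-level feature sequence produced by the SE-ResNet encoder. How the encoder output is reshaped into a (time, features) sequence is an assumption here.

from tensorflow.keras import layers

def temporal_encoder(frame_features):
    """frame_features: tensor of shape (batch, time_steps, feature_dim)."""
    h = layers.Bidirectional(
        layers.LSTM(512, return_sequences=True, dropout=0.2))(frame_features)
    # The second BLSTM also returns the full sequence so that the global
    # attention mechanism can attend over every time step.
    h = layers.Bidirectional(
        layers.LSTM(512, return_sequences=True, dropout=0.2))(h)
    return h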
4) Attention: The recent introduction of attention mechanisms [47] and their subsequent variations [48] has allowed for added focus on more salient sections of the learned embedding. These mechanisms have recently been applied to speech recognition models to better focus on strong emotional characteristics within utterances [49] [50], and have similarly been used in FluentNet to improve focus on specific parts of utterances with disfluent attributes. FluentNet uses global attention [51], which incorporates all hidden state values of the encoder (in this case the BLSTM). A diagram showing the attention model is presented in Figure 3.

The final output value of the second layer of the BLSTM, h_t, as well as a context vector, C_t, derived through the use of the attention mechanism, are used to generate FluentNet’s final classification, h̃_t. This is done by applying a tanh activation function, as shown by:

h̃_t = tanh(W_c [C_t ; h_t])    (6)

The context vector of the global attention is the weighted sum of all hidden state outputs of the encoder. An alignment vector, generated as a relation between h_t and each hidden state value, is passed through a softmax layer, which is then used to represent the weights to the context vector. Dot product was used as the alignment score function for this attention mechanism. The calculation for the context vector can be represented by:

C_t = Σ_{i=1}^{t} ( exp(h_t^T · h̄_i) / Σ_{i′=1}^{t} exp(h_t^T · h̄_{i′}) ) · h̄_i    (7)

where h̄_i represents the i-th BLSTM hidden state’s output.

[Figure 3: the global attention head: the BLSTM hidden states are scored against h_t (alignment), normalized with a softmax, and combined into the context vector C_t, which is fused with h_t to produce h̃_t.]
Fig. 3: Global attention addition to binary classifier of recurrent network.
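To make Equations (6) and (7) concrete, here is a small dot-product global-attention sketch over the BLSTM outputs, written with plain TensorFlow ops rather than as the authors' exact layer; using the last time step as the query and the size of the W_c projection are assumptions, and the final sigmoid unit stands in for the binary disfluency output.

import tensorflow as tf
from tensorflow.keras import layers

def global_attention_head(states, fused_units=256):
    """states: BLSTM outputs of shape (batch, T, D)."""
    h_t = states[:, -1, :]                               # query: last output, (batch, D)
    scores = tf.einsum("bd,btd->bt", h_t, states)        # dot-product alignment, h_t^T · h̄_i
    weights = tf.nn.softmax(scores, axis=-1)             # softmax over time steps
    context = tf.einsum("bt,btd->bd", weights, states)   # context vector C_t (Eq. 7)
    fused = layers.Dense(fused_units, activation="tanh")(
        tf.concat([context, h_t], axis=-1))              # tanh(W_c [C_t ; h_t]) (Eq. 6)
    return layers.Dense(1, activation="sigmoid")(fused)  # binary disfluency decision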
C. Implementation

FluentNet was implemented using Keras [52] with a Tensorflow [53] backend. The model was trained with a learning rate of 10^-4, which yielded the strongest results. A root mean square propagation (RMSProp) optimizer and a binary cross-entropy loss function were used. All experiments were trained using an Nvidia GeForce GTX 1080 Ti GPU. Python’s Librosa library [54] was used for audio importing and manipulation towards creating our synthetic dataset, as described later. Each STFT spectrogram was generated using four-second audio clips. This length of time can encapsulate any stutter apparent in the dataset, with no stutters lasting longer than four seconds.
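The training configuration above can be expressed in a few lines of Keras; the optimizer, learning rate, loss, and 30-epoch schedule follow the text (Sections III-C and V-A), while the metric list and the use of tf.data datasets are assumptions.

import tensorflow as tf

def compile_and_train(model, train_ds, val_ds, epochs=30):
    # RMSProp at 1e-4 with binary cross-entropy, as stated in Section III-C.
    model.compile(
        optimizer=tf.keras.optimizers.RMSprop(learning_rate=1e-4),
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model.fit(train_ds, validation_data=val_ds, epochs=epochs)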
IV. EXPERIMENTS

A. Datasets

Despite an abundance of datasets for speech-related tasks such as ASR and speaker recognition [20] [55] [56], there is a clear lack of corpora that are focused on speech disfluencies. An ideal speech disfluency dataset would require the labelling and categorization of each existing disfluent utterance. In this paper, to tackle this problem, in addition to using the UCLASS dataset, which is a commonly used stuttered speech corpus [57] [58] [17], a second dataset was created by adding speech disfluencies into clean speech. This synthetic corpus contributes a drastic expansion to the available training and testing data for disfluency classification. In the following subsections, we describe the UCLASS dataset used in our study, as well as the approach for creating the synthetic dataset, LibriStutter, which we created using the original non-stuttered LibriSpeech dataset.

1) UCLASS: The University College London’s Archive of Stuttered Speech (UCLASS) [18] is the most commonly used dataset for disfluency-related studies with machine learning. This corpus came in two releases, in 2004 and 2008, from the university’s Division of Psychology and Language Sciences. The dataset consists of 457 audio recordings including monologues, readings, and conversations of children with known stutter disfluency issues. Of those recordings, a select few contain written transcriptions of their respective audio files; these were either standard, phonetic or orthographic transcriptions. The orthographic format is the best option for manual labelling of the dataset for disfluency, as it tries to transcribe the exact sounds uttered by the speaker in the form of the standard alphabet. This helps to identify the presence of disfluency in an utterance more easily. The resulting applicable data consisted of 25 unique conversations between an examiner and a child between the ages of 8 and 18, totalling just over one hour of audio.

In order to pair the utterances with their transcriptions, each audio file and its corresponding orthographic transcription were passed through a forced time alignment tool. The resulting table related each alphabetical token in the transcription to its matching timestamp within the audio. This process was then manually checked for outlying utterances not matching their transcriptions.

The provided orthographic transcriptions only flagged the existence of disfluencies (through the use of capitalization), but gave no information towards a disfluency type. To build a more detailed dataset and be able to classify the type of disfluency, all utterances were manually labelled as one of the seven represented classes for our model. These included clean (no stutter), sound repetitions, word repetitions, phrase repetitions, revisions, interjections, and prolongations. The annotation methods applied in [22] and [23] were used as guidelines when manually categorizing each utterance. Out of the 8 disfluencies, 6 were used: sound, word, and phrase repetitions, as well as revisions, interjections, and prolongations. Of the usable audio in the dataset, only three instances of ‘part-word repetitions’ appeared, lacking sufficient positive training samples to feasibly classify these types of stutters. As ‘block disfluencies’ exist as the absence of sound, they could not feasibly be represented in the orthographic transcriptions, which represent how utterances are performed.

2) LibriStutter: The 2015 LibriSpeech ASR corpus by Panayotov et al. [20] includes 1000 hours of prompted English speech extracted from audio books derived from the LibriVox project. We used this dataset as the basis for our synthetic stutter dataset, which we name LibriStutter. LibriStutter’s creation compensates for two shortcomings of the UCLASS corpus: the small amount of labelled stuttered speech data available and the imbalance of the dataset (several disfluency types in UCLASS consisted of a small number of samples). To allow for a manageable size for LibriStutter and feasible training times, we used a subset of LibriSpeech and set the size of LibriStutter to 20 hours. LibriStutter includes synthetic stutters for sound repetitions, word repetitions, phrase repetitions, prolongations, and interjections. We generated these stutter types by sampling the audio within the same utterance, the details of which are described below. Revisions were excluded from LibriStutter, as this disfluency type requires the speaker to change and revise what was initially said. This would require added speech through the use of complex language models and voice augmentation tools to mimic the revised phrase, both of which fall out of scope for this project.

For each audio file selected from the LibriSpeech dataset, we used the Google Cloud Speech-to-Text API [59] to generate a timestamp corresponding to each spoken word. For every four-second window of speech within a given audio file, either a random disfluency type was inserted and labelled accordingly, or the window was alternatively left clean. Each disfluency type underwent a number of processes to best simulate natural stutters.
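As an illustration of this generation loop (not the released generation code), the sketch below walks over four-second windows, either leaves a window clean or injects one randomly chosen disfluency type, and records a label row per insertion. insert_disfluency is a hypothetical helper standing in for the per-type processes detailed next, and the 50/50 clean-versus-stuttered split is an assumption.

import csv
import random

DISFLUENCY_TYPES = ["sound_rep", "word_rep", "phrase_rep", "prolongation", "interjection"]

def synthesize_file(audio, word_timestamps, insert_disfluency,
                    sr=16000, window_s=4.0, p_clean=0.5):
    """audio: 1-D waveform; word_timestamps: list of (word, start_s, end_s)
    tuples from a speech-to-text alignment step."""
    labels = []
    # Note: insertions lengthen the audio; a full implementation would track
    # the resulting offset for later windows and timestamps.
    for w in range(int(len(audio) / (sr * window_s))):
        if random.random() < p_clean:
            continue                                   # leave this window clean
        dtype = random.choice(DISFLUENCY_TYPES)
        start_s, end_s = w * window_s, (w + 1) * window_s
        words = [t for t in word_timestamps if start_s <= t[1] < end_s]
        if not words:
            continue
        audio, span = insert_disfluency(audio, dtype, words, sr)
        labels.append({"type": dtype, "start": span[0], "end": span[1]})
    return audio, labels

def write_label_csv(path, rows):
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["type", "start", "end"])
        writer.writeheader()
        writer.writerows(rows)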
All repetition stutters relied upon copying existing audio segments already present within each audio file. Sound repetitions were generated by copying the first fraction of a random spoken word within the sample and repeating this short utterance several times before said word. Although repetitions of sounds can occur at the end of words, known as word-final disfluencies, this is rarely the case [60]. One to three repeated sound utterances were added in each stuttered word. After each instance of the repeated sound, a random empty pause with a duration of 100 to 350 ms was appended, as this range sounded most natural. Inserted audio may leave sharp cutoffs, especially part-way through an utterance. To avoid this, interpolation was used to smooth the added audio’s transition into the existing clip.

Both word and phrase repetitions underwent similar processes to that of sound repetitions. For word repetitions, we repeated one to two copies of a randomly selected word before the original utterance. For phrase repetitions, a similar approach was taken, where instead of repeating a particular word, a phrase consisting of two to three words was repeated. The same pause duration and interpolation techniques used for sound repetitions were applied to both word and phrase repetition disfluencies.
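A minimal NumPy sketch of the sound-repetition process above: copy the first fraction of a word, repeat it one to three times separated by 100-350 ms pauses, and blend the insert back into the clip. The fraction of the word that is copied and the use of a short linear crossfade (standing in for the interpolation-based smoothing) are assumptions.

import numpy as np

def make_sound_repetition(audio, word_start, word_end, sr=16000, frac=0.25):
    """Insert a synthetic sound repetition before the word spanning
    [word_start, word_end) (sample indices)."""
    rng = np.random.default_rng()
    onset = audio[word_start:word_start + int(frac * (word_end - word_start))]

    pieces = []
    for _ in range(int(rng.integers(1, 4))):                   # 1 to 3 repeats
        pause = np.zeros(int(rng.uniform(0.100, 0.350) * sr))  # 100-350 ms of silence
        pieces += [onset.copy(), pause]
    insert = np.concatenate(pieces)

    # Short linear crossfade so the inserted audio does not cut off sharply.
    fade = min(len(insert), int(0.010 * sr))
    ramp = np.linspace(0.0, 1.0, fade)
    insert[-fade:] = insert[-fade:] * (1 - ramp) + audio[word_start:word_start + fade] * ramp

    return np.concatenate([audio[:word_start], insert, audio[word_start:]])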
Prolongations consist of sustained sounds, primarily at the end of a word. To mimic this behaviour, the last portion of a word was stretched to simulate prolonged speech. For a randomly chosen word, the latter 20% of the signal was stretched by a factor of 5. This prolonged speech segment replaced the original word ending. As applying time stretching to audio results in a drop in pitch, pitch shifting was used to realign the pitch with the original audio. The average pitch of the given speech segment was used to normalize the disfluent utterance.
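The sketch below illustrates the prolongation process just described with Librosa utilities. Stretching the final 20% of the word by a factor of 5 follows the text; using naive resampling for the stretch (which lowers pitch, as the text notes) and correcting by a fixed log2-based number of semitones are assumptions that simplify the average-pitch normalization the authors describe.

import librosa
import numpy as np

def make_prolongation(audio, word_start, word_end, sr=16000,
                      tail_frac=0.2, stretch=5.0):
    """Replace the last portion of a word with a stretched, pitch-corrected version."""
    split = word_end - int(tail_frac * (word_end - word_start))
    tail = audio[split:word_end].astype(np.float32)

    # Resampling-based stretching slows the tail down and lowers its pitch...
    slowed = librosa.resample(tail, orig_sr=sr, target_sr=int(sr * stretch))
    # ...so shift the pitch back up by log2(stretch) octaves to realign it.
    realigned = librosa.effects.pitch_shift(slowed, sr=sr, n_steps=12.0 * np.log2(stretch))

    return np.concatenate([audio[:split], realigned, audio[word_end:]])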
Unlike the aforementioned classes, interjection disfluencies cannot be created from existing speech within a sample, as they require the addition of filler words absent from the original audio (for example ‘umm’). Multiple samples of common filler words from UCLASS were isolated and saved separately to create a pool of interjections. To create interjection disfluencies, a random filler word from this pool was inserted between two random words, followed by a short empty pause. The same pitch scaling and normalization method used for prolongations was applied to match the pitches between the interjection and the audio clip. Interpolation was used, as in repetition disfluencies, to smooth sharp cutoffs caused by the added utterance.

[Figure 4: spectrogram pairs for sound repetition, word repetition, and prolongation; top row from UCLASS, bottom row from LibriStutter.]
Fig. 4: Spectrograms of the same stutters found in the UCLASS dataset and generated in the LibriStutter dataset.

TABLE III: Cosine similarity between a UCLASS dataset stutter and a matching LibriStutter stutter, as well as the average of 100 random samples from the LibriStutter dataset.

Stutter | UCLASS vs. LibriStutter | UCLASS vs. Random
Sound Repetition | 3.73893e−3 | 2.58116e−4
Word Repetition | 3.14077e−3 | 2.61084e−4
Prolongation | 7.70236e−3 | 2.57234e−4

To ensure that sufficient realism was incorporated into the dataset, a registered speech language pathologist was consulted for this project. Nonetheless, it should be mentioned that despite our attention to creating a perceptually valid and realistic dataset, the notion of “realism” itself is not a focus of this dataset. Instead, much like synthetic datasets in other areas such as image processing, the aim is for the dataset to be valid enough such that machine learning and deep learning methods can be trained and evaluated with it, and later on transferred to real large-scale datasets [in the future] with little to no adjustments to the model architectures.

Figure 4 displays side-by-side comparisons of spectrograms of real stuttered data from the UCLASS dataset, and artificial stutters from LibriStutter. Each pairing represents a single stutter type, with the same word or sound being spoken in each. It can be observed that the UCLASS stutter samples and their corresponding LibriStutter examples show clear similarities. Moreover, to numerically compare the samples, cosine similarity [61] was calculated between the UCLASS and LibriStutter spectrogram samples shown earlier. To add relevance to these values, a second comparison was performed for each UCLASS spectrogram with respect to 100 random samples from the LibriStutter dataset, and the average score was used as the represented comparison value. These scores are summarized in Table III. We observe that the UCLASS cosine similarity scores corresponding to the matching LibriStutter samples are noticeably (approximately between 10× to 30×) greater than those compared to random audio samples, confirming that the disfluent utterances contained in LibriStutter share phonetic similarities with real stuttered samples, empirically showing the similarity between a few sample real and synthesized stutters.
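A small sketch of the comparison reported in Table III: flatten two spectrograms and compute their cosine similarity, with an average over randomly chosen LibriStutter samples as the reference value. How spectrograms of different lengths are cropped to a common size is an assumption.

import numpy as np

def cosine_similarity(spec_a, spec_b):
    a, b = spec_a.ravel(), spec_b.ravel()
    n = min(len(a), len(b))                      # crop to a common length
    a, b = a[:n], b[:n]
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def random_baseline(uclass_spec, libristutter_specs, n_samples=100, seed=0):
    rng = np.random.default_rng(seed)
    picks = rng.choice(len(libristutter_specs), size=n_samples, replace=False)
    return float(np.mean([cosine_similarity(uclass_spec, libristutter_specs[i])
                          for i in picks]))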
The LibriStutter dataset consists of approximately 20 hours of speech data from LibriSpeech train-clean-100 (the training set of 100 hours of “clean” speech). In turn, LibriStutter shares a similar make-up to that of its predecessor. It consists of disfluent prompted English speech from audiobooks. It also contains 23 male and 27 female speakers, with an approximate 53% of the audio coming from males, and 47% from females. There are 15000 disfluencies in this dataset, with equal counts for each of the five disfluency types: 3000 sound, word, and phrase repetitions, as well as prolongations and interjections. Each audio file has a corresponding CSV file containing each word or utterance spoken, the start and end time of the utterance, and its disfluency type, if any.

B. Benchmarks

For a thorough analysis of our results, we compare the results obtained by the proposed FluentNet to a number of
other models. In particular, we employ two types of solutions for comparison purposes. First, we compare our results to related works and the state-of-the-art as follows:

Alharbi et al. [17]: This work conducted classification of sound repetitions, word repetitions, revisions, and prolongations on the UCLASS dataset through the application of two different methods. First, an original speech prompt was aligned, and then passed to a task-oriented FST to generate word lattices. These lattices were used to detect repeated part-words, words, and phrases within the sample. This method scored perfect results on word repetition classification, though the results on sound repetitions and revisions proved much weaker. To classify prolongation stutters, an autocorrelation algorithm consisting of two thresholds was used: the first to detect speech with similar amplitudes (sustained speech), and another dynamic threshold to decide whether the duration of similar speech would be considered a prolongation. Using this algorithm, perfect prolongation classification was achieved.

Chen et al. [31]: A CT-Transformer was designed to conduct repetition and interjection disfluency detection on an in-house Chinese speech dataset. Both word and position embeddings of a provided audio sample were passed through a series of CT self-attention layers and fully connected layers. This work was able to achieve an overall disfluency classification miss rate of 38.5% (F1 score of 70.5). Notably, this is one of the few works to have attempted interjection disfluency classification, yielding a miss rate of 25.1%. Note that the performance on repetition disfluencies encompasses all repetition-type stutters, including sound, word, and phrase repetitions, as well as revisions.

Kourkounakis et al. [33]: As opposed to other current models focusing on ASR and language models, our previous work proposed a model relying solely on acoustic and phonetic features, allowing for the classification of several disfluency types without the need for speech recognition methods. This model applied a deep residual network, consisting of 6 residual blocks (18 convolution layers), and two bidirectional long short-term memory layers to classify six different types of stutters. This work achieved an average miss rate of 10.03% on the UCLASS dataset, and sustained strong accuracy and miss rates across all stutter types, prominently word repetitions and revisions.

Zayats et al. [28]: A recurrent network was used to classify repetition disfluencies within the Switchboard corpus. It consists of a BLSTM followed by an ILP post-processing method. The input embedding to this network consisted of a vector containing each word’s index, part of speech, as well as 18 other disfluency-based features. The method achieved a miss rate of 19.4% across all repetition disfluencies.

Villegas et al. [29]: This model was used as a reference to compare the effectiveness of respiratory signals towards stutter classification. These features included the means, standard deviations, and distances of respiratory volume, respiratory flow, and heart rate. Sixty-eight participants were used to generate the data for their experiments. The best performing model in this work was an MLP with 40 hidden layers, resulting in an 82.6% average classification accuracy between block and non-block type stutters.

Dash et al. [16]: This method passed the maximum amplitude of the provided audio sample through a neural network to generate a custom threshold for each sample, trained on a set of 60 speech samples. This amplitude threshold was used to remove any perceived prolongations and interjections. The audio was then passed through an STT tool, which allowed for the removal of any repeated words, phrases, or characters, achieving an overall stutter classification accuracy of 86% on a test set of 50 speech segments.

Note that the latter three works only provide results on a group of disfluency types [28], a single disfluency type [29], or overall stutter classification [16]. As such, only their average disfluency classification results could be compared. Moreover, these works ([31], [28], [29], and [16]) have not used the UCLASS dataset, therefore the comparisons should be taken cautiously.

Next, we also compare the performance of our solution to a number of other models for benchmarking purposes. These models were selected due to their popularity for time-series learning, and the hyperparameters of these models are all tuned to obtain the best possible results given their architectures. These benchmarks are as follows: (i) VGG-16 (Benchmark 1): VGG-16 [62] consists of 16 convolutional or fully connected layers, comprised of groups of two or three convolution layers with ReLU activation, with each grouping being followed by a max pooling layer. The model concludes with three fully connected layers and a final softmax function. (ii) VGG-19 (Benchmark 2): This network is very similar to its VGG-16 counterpart, with the only difference being an addition of three more convolution layers spread throughout the model. (iii) ResNet-18 (Benchmark 3): ResNet-18 was chosen as a benchmark, which contains 18 layers: eight consecutive residual blocks each containing two convolutional layers with ReLU activation, followed by an average pooling layer and a final fully connected layer.

V. RESULTS AND ANALYSIS

A. Validation

In order to rigorously test FluentNet on the UCLASS dataset, a leave-one-subject-out (LOSO) cross-validation method was used. The results of models tested on this dataset are represented as the average over 25 experiments, each consisting of audio samples from 24 of the participants as training data, and a unique single participant’s audio as a test set. A 10-fold cross-validation method was used on the LibriStutter dataset, with a random 90% subset of the samples from each stutter being used for training along with 90% of the clean samples chosen randomly. The remaining 10% of both clean and stuttered samples were used for testing. All experiments were trained over 30 epochs, with minimal change in loss seen in further epochs.

The two metrics used to measure the performance of the aforementioned experiments were miss rate and accuracy. Miss rate (1 − recall) is used to determine the proportion of disfluencies which were incorrectly classified by the model. To balance out any bias this metric may hold, accuracy was used as a second performance metric.
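The evaluation protocol above can be made concrete with a few NumPy helpers: leave-one-subject-out splits over the UCLASS participants, and the two reported metrics, miss rate (1 − recall) and accuracy. This is an illustrative sketch of the protocol, not the authors' evaluation code.

import numpy as np

def loso_splits(sample_subject_ids):
    """Yield (train_idx, test_idx) pairs, one per held-out subject."""
    ids = np.asarray(sample_subject_ids)
    for subject in np.unique(ids):
        yield np.where(ids != subject)[0], np.where(ids == subject)[0]

def miss_rate(y_true, y_pred):
    """Miss rate = 1 - recall: the fraction of true disfluencies not detected."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    positives = y_true == 1
    if positives.sum() == 0:
        return 0.0
    return float(np.mean(y_pred[positives] != 1))

def accuracy(y_true, y_pred):
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))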
TABLE IV: Percent miss rates (MR) and accuracy (Acc) of the six stutter types, trained on the UCLASS and LibriStutter datasets.
S W PH I PR R
Paper Method Dataset MR↓ Acc.↑ MR↓ Acc.↑ MR↓ Acc.↑ MR↓ Acc.↑ MR↓ Acc.↑ MR↓ Acc.↑
Alharbi et al. [17] Word Lattice UCLASS 60 – 0 – 0 – 25 –
Kourkounakis et al. [33] ResNet+BLSTM UCLASS 18.10 84.10 3.20 96.60 4.46 95.54 25.12 81.40 5.92 94.08 2.86 97.14
Benchmark 1 VGG-16 UCLASS 20.80 81.03 6.54 93.01 12.82 87.91 28.44 72.03 9.04 90.83 5.20 94.90
Benchmark 2 VGG-19 UCLASS 19.41 81.35 5.22 95.42 10.13 91.60 26.06 73.64 5.72 94.21 4.72 96.32
Benchmark 3 ResNet-18 UCLASS 19.51 81.38 5.26 94.50 7.32 94.01 25.55 76.38 7.02 93.22 5.16 94.74
Ours FluentNet UCLASS 16.78 84.46 3.43 96.57 3.86 96.26 24.05 81.95 5.34 94.89 2.62 97.38
Kourkounakis et al. [33] ResNet+BLSTM LibriStutter 19.23 79.80 5.17 92.52 6.12 92.52 31.49 69.22 9.80 89.44
Benchmark 1 VGG-16 LibriStutter 20.97 79.33 6.27 92.74 8.90 91.94 36.47 64.05 10.63 89.10
Benchmark 2 VGG-19 LibriStutter 20.79 79.66 6.45 93.44 7.92 92.44 34.46 66.92 10.78 89.98
Benchmark 3 ResNet-18 LibriStutter 22.47 78.71 6.22 92.70 6.74 93.36 35.56 64.78 10.52 90.32
Ours FluentNet LibriStutter 17.65 82.24 4.11 94.69 5.71 94.32 29.78 70.12 7.88 92.14

B. Performance and Comparison

The results of our model for the recognition of each stutter type are presented for the UCLASS and LibriStutter datasets in Table IV. FluentNet achieves strong results for all the disfluency types within both datasets, outperforming nearly all of the related work as well as the benchmark models.

As some previous works have been designed to tackle specific disfluency types rather than provide a general solution for detecting different types of disfluencies, a few of FluentNet's individual class accuracies do not surpass previous works', namely word repetitions and prolongations. In particular, the work by Alharbi et al. [17] offers perfect word repetition classification, as word lattices can easily identify two words repeated one after the other. Amplitude thresholding also proves to be a successful prolongation classification method. It should be noted that FluentNet does achieve strong results for these disfluency types as well. Notably, our work is one of the only ones that has attempted classification of interjection disfluencies. These disfluent utterances lack the unique phonetic and temporal patterns that, for instance, repetition or prolongation disfluencies contain. Moreover, they may be present as a combination of other disfluency types; for example, an interjection can be both prolonged and repeated. For these reasons, interjections remain the hardest category, with 24.05% and 29.78% miss rates on the UCLASS and LibriStutter datasets, respectively. Nonetheless, FluentNet provides good results, especially given that interjections have historically been avoided.

The task-oriented lattices generated in [17] show strong performance on word repetitions and prolongations, but struggle to detect sound repetitions and revisions. Likewise, as presented in [31], the CT-Transformer yields an interjection classification miss rate comparable to that of FluentNet. However, when the same model is applied to repetition stutters, its performance drops severely, hindering its overall disfluency detection capabilities. The use of an attention-based transformer proves a viable method of classifying interjection disfluencies; however, as the results suggest, the convolutional and recurrent architecture in FluentNet allows effective representations to be learned for interjection disfluencies alongside repetitions and prolongations.

FluentNet surpasses our previous work across all disfluency types on the LibriStutter dataset, and all but word repetition accuracy on the UCLASS dataset. The results show a greater margin of improvement on the LibriStutter dataset as compared to UCLASS between the two models. Notably, word repetitions and prolongations show a relative decrease in miss rate of approximately 20% between [33] and FluentNet. This implies that the SE and attention mechanisms assist in better representing the disfluent utterances within the stuttered speech found in the synthetic dataset.

An interesting observation is that LibriStutter proves a more difficult dataset than UCLASS, as evidenced by the lower performance of all the solutions, including FluentNet. This is likely because, given the large number of controllable parameters for each stutter type, LibriStutter contains a larger variance of stutter styles and variations, resulting in a more difficult problem space.

Table V presents the overall performance of our model with respect to all disfluency types on the UCLASS and LibriStutter datasets. The results are compared with other works on the respective datasets, and with the benchmarks which we implemented for comparison purposes. We observe that FluentNet achieves an average miss rate and accuracy of 9.35% and 91.75% on the UCLASS dataset, surpassing the other models and setting a new state-of-the-art. A similar trend can be seen for the LibriStutter dataset, where FluentNet outperforms the previous model along with all the benchmark models.

The BLSTM used in [28] yields successful results towards repetition stutter classification by learning temporal relationships between words; however, it remains impaired by its reliance solely on lexical model inputs. On the other hand, as shown by the results, FluentNet is better able to learn these phonetic details through the spectral and temporal representations of speech.

The work from [16] uses classification techniques similar to [17], but improves upon the thresholding technique with the addition of a neural network. Though it achieves an average accuracy of 86% across the same disfluency types used in this work, FluentNet remains a stronger model given its effective spectral frame-level and temporal embeddings. Moreover, the results of that work contain only a single overall accuracy value across repetition, interjection, and prolongation disfluency detection, and little is discussed on the origin and makeup of the dataset used.

Of the benchmark models without an RNN component, ResNet performs better than both VGG networks on both datasets, indicating that ResNet-style architectures are able to learn effective spectral representations of speech. This further justifies the use of a ResNet as the backbone of our model. Moreover, the addition of the LSTM component to the benchmarks shows that learning the temporal relationships through an RNN contributes to the performance.

To further demonstrate the performance of FluentNet, Receiver Operating Characteristic (ROC) curves were generated for each disfluency class on the UCLASS and LibriStutter datasets, as shown in Figures 5(a) and 5(b), respectively. It can be seen that word repetitions, phrase repetitions, revisions, and prolongations reveal very strong classification on both datasets. Sound repetition and interjection classification fare weakest, with LibriStutter proving to be a more difficult dataset for FluentNet, as previously observed and discussed.
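For readers who wish to reproduce this style of evaluation, the snippet below is a minimal sketch (not the authors' released evaluation code) of how per-class miss rates, accuracies, and ROC curves such as those in Table IV and Figure 5 could be computed with NumPy and scikit-learn. It assumes the miss rate is the per-class false-negative rate and the accuracy is per-class binary accuracy over clips; the class list, array shapes, and decision threshold are illustrative assumptions.

import numpy as np
from sklearn.metrics import roc_curve, auc

# Hypothetical class list covering the six UCLASS disfluency types.
CLASSES = ["sound repetition", "word repetition", "phrase repetition",
           "interjection", "prolongation", "revision"]

def per_class_metrics(y_true, y_score, threshold=0.5):
    """y_true, y_score: arrays of shape (num_clips, num_classes)."""
    results = {}
    for i, name in enumerate(CLASSES):
        truth = y_true[:, i].astype(bool)
        pred = y_score[:, i] >= threshold
        tp = np.sum(pred & truth)
        fn = np.sum(~pred & truth)
        miss_rate = 100.0 * fn / max(tp + fn, 1)       # % of true disfluencies missed
        accuracy = 100.0 * np.mean(pred == truth)      # per-class binary accuracy (%)
        fpr, tpr, _ = roc_curve(truth, y_score[:, i])  # points for one ROC curve
        results[name] = {"MR": miss_rate, "Acc": accuracy, "AUC": auc(fpr, tpr)}
    return results

# Shape check with random placeholders for real labels and model scores:
rng = np.random.default_rng(0)
print(per_class_metrics(rng.integers(0, 2, (200, 6)), rng.random((200, 6))))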

Fig. 5: ROC curves for each stutter type tested on the UCLASS and LibriStutter datasets. [Two panels: (a) UCLASS, (b) LibriStutter; axes: False Positive Rate vs. True Positive Rate, one curve per disfluency class.]

TABLE V: Average percent miss rates (MR) and accuracy (Acc) of disfluency classification models.

Paper | Dataset | Ave. MR↓ | Ave. Acc.↑
Zayats et al. [28] | Switchboard | 19.4 | –
Villegas et al. [29] | Custom | – | 82.6
Dash et al. [16] | Custom | – | 86
Chen et al. [31] | Custom | 38.5 | –
Alharbi et al. [17] | UCLASS | 37 | –
Kourkounakis et al. [33] | UCLASS | 10.03 | 91.15
Benchmark 1 (VGG-16) | UCLASS | 13.81 | 86.62
Benchmark 2 (VGG-19) | UCLASS | 12.21 | 87.92
Benchmark 3 (ResNet-18) | UCLASS | 12.14 | 89.14
FluentNet | UCLASS | 9.35 | 91.75
Kourkounakis et al. [33] | LibriStutter | 14.36 | 85.30
Benchmark 1 (VGG-16) | LibriStutter | 16.65 | 83.43
Benchmark 2 (VGG-19) | LibriStutter | 16.08 | 84.49
Benchmark 3 (ResNet-18) | LibriStutter | 16.30 | 83.97
FluentNet | LibriStutter | 13.03 | 86.70

C. Parameters

Multiple parameters have been tuned in order to maximize the accuracy of FluentNet and the baseline experiments on both datasets. These include convolution window sizes, the number of epochs, and learning rates, among others. Each has been individually tested in order to find the optimal values for the given model. Note that all of FluentNet's hyper-parameters remain the same across all disfluency types.

Thorough experiments were performed to obtain the optimum architecture of FluentNet. For the SE-ResNet component, we tested different numbers of convolution blocks, ranging from 3 to 12, with each block consisting of 3 convolutional layers. Eight blocks were found to be the approximate optimal depth for training the model on the UCLASS dataset. Similarly, we experimented with the number of BLSTM layers, ranging from 0 to 3; the use of 2 layers yielded the best results. Moreover, the use of bi-directional layers proved slightly more effective than uni-directional layers. Lastly, we experimented with a number of different values and strategies for the learning rate, where 10^-4 showed the best results.
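To make these choices concrete, the following is a minimal Keras sketch of a FluentNet-style network reflecting the tuned values above: eight SE-residual blocks of three convolutional layers each, two bidirectional LSTM layers, a global soft attention stage, and a learning rate of 10^-4. It is an illustrative reconstruction rather than the authors' released implementation; the input spectrogram shape, filter counts, kernel sizes, pooling, attention formulation, and sigmoid multi-label output are all assumptions.

from tensorflow.keras import layers, models, optimizers

def se_block(x, ratio=8):
    # Squeeze-and-excitation: global pooling, bottleneck, channel re-weighting.
    channels = x.shape[-1]
    s = layers.GlobalAveragePooling2D()(x)
    s = layers.Dense(channels // ratio, activation="relu")(s)
    s = layers.Dense(channels, activation="sigmoid")(s)
    s = layers.Reshape((1, 1, channels))(s)
    return layers.Multiply()([x, s])

def se_residual_block(x, filters):
    # Three convolutional layers per block, followed by an SE stage and a
    # projected skip connection.
    shortcut = layers.Conv2D(filters, 1, padding="same")(x)
    y = x
    for _ in range(3):
        y = layers.Conv2D(filters, 3, padding="same", activation="relu")(y)
    y = se_block(y)
    return layers.Add()([shortcut, y])

def build_fluentnet_sketch(input_shape=(128, 128, 1), num_classes=6):
    inputs = layers.Input(shape=input_shape)  # (time frames, mel bins, 1), assumed
    x = inputs
    for i, filters in enumerate((32, 32, 64, 64, 128, 128, 256, 256)):  # 8 blocks
        x = se_residual_block(x, filters)
        if i % 2 == 1:
            x = layers.MaxPooling2D(pool_size=(1, 2))(x)  # shrink frequency axis only
    # Treat each time frame as one step of a sequence for the recurrent layers.
    x = layers.Reshape((x.shape[1], x.shape[2] * x.shape[3]))(x)
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
    # Global soft attention: score each frame, softmax over time, weighted sum.
    scores = layers.Dense(1)(x)
    weights = layers.Softmax(axis=1)(scores)
    context = layers.Flatten()(layers.Dot(axes=1)([weights, x]))
    outputs = layers.Dense(num_classes, activation="sigmoid")(context)
    model = models.Model(inputs, outputs)
    model.compile(optimizer=optimizers.Adam(learning_rate=1e-4),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model

model = build_fluentnet_sketch()

The sketch is only meant to show how the reported depth, recurrent-layer, and learning-rate choices fit together; the exact widths, strides, and attention formulation in FluentNet may differ.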
Figures 6(a) and 6(b) show FluentNet's training performance for each stutter type across epochs on the UCLASS and LibriStutter datasets, respectively. The training accuracy stabilizes after around 20 epochs. Whereas all disfluency types in the UCLASS dataset approach perfect training accuracy, the curves plateau at much lower accuracies for interjections and sound repetitions within the LibriStutter dataset.

Fig. 6: Average training accuracy for FluentNet on the considered stutter types for the UCLASS and LibriStutter datasets. [Two panels: (a) UCLASS, (b) LibriStutter; axes: Epoch vs. Training Accuracy, one curve per disfluency class.]
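Per-class training curves like those in Fig. 6 can be logged with a small callback; the sketch below is an assumed setup (not the authors' training code) that records per-class accuracy on a given set after every epoch for a multi-label model with sigmoid outputs. The variable names in the usage comment are hypothetical.

import numpy as np
import tensorflow as tf

class PerClassAccuracy(tf.keras.callbacks.Callback):
    """Records per-class binary accuracy on (x, y) after each epoch."""
    def __init__(self, x, y, class_names, threshold=0.5):
        super().__init__()
        self.x, self.y = x, y
        self.class_names = class_names
        self.threshold = threshold
        self.history = {name: [] for name in class_names}

    def on_epoch_end(self, epoch, logs=None):
        pred = self.model.predict(self.x, verbose=0) >= self.threshold
        for i, name in enumerate(self.class_names):
            acc = float(np.mean(pred[:, i] == (self.y[:, i] > 0.5)))
            self.history[name].append(acc)

# Hypothetical usage, assuming `model`, `x_train`, and `y_train` already exist:
# tracker = PerClassAccuracy(x_train, y_train, ["sound", "word", "phrase",
#                                               "interjection", "prolongation", "revision"])
# model.fit(x_train, y_train, epochs=30, callbacks=[tracker])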

D. Ablation Experiments

To further analyze FluentNet, an ablation study was performed in order to systematically evaluate how each component contributes towards the overall performance. Both the SE blocks and the attention mechanism were removed, individually and together, in order to analyse the relationship between their absences and how these affect both accuracy and miss rates for each disfluency class. The ablation results for both the UCLASS and LibriStutter datasets are summarized in Table VI. Overall, FluentNet shows stronger accuracy and lower miss rates across both datasets and all stutter types compared to the three variants. Although the drop in performance varies across stutter types with the removal of each element, the experiment shows the general advantages of the different components of FluentNet.
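As a concrete illustration of what these ablations entail, the sketch below (a hypothetical helper, not the authors' code) builds a toy-sized stand-in for FluentNet with flags that disable the squeeze-and-excitation stage and the global attention stage; toggling the flags yields the four variants reported in Table VI. The layer widths, depths, and the average-pooling fallback used when attention is removed are assumptions.

from tensorflow.keras import layers, models

def build_variant(use_se=True, use_attention=True,
                  input_shape=(128, 128, 1), num_classes=6):
    # Minimal stand-in used only to illustrate the ablation settings.
    inputs = layers.Input(shape=input_shape)
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
    if use_se:  # dropped for the "w/o Squeeze-and-Excitation" rows
        s = layers.GlobalAveragePooling2D()(x)
        s = layers.Dense(32 // 8, activation="relu")(s)
        s = layers.Dense(32, activation="sigmoid")(s)
        x = layers.Multiply()([x, layers.Reshape((1, 1, 32))(s)])
    x = layers.MaxPooling2D(pool_size=(1, 4))(x)
    x = layers.Reshape((x.shape[1], x.shape[2] * x.shape[3]))(x)
    x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
    if use_attention:  # dropped for the "w/o Attention" rows
        w = layers.Softmax(axis=1)(layers.Dense(1)(x))
        x = layers.Flatten()(layers.Dot(axes=1)([w, x]))
    else:
        x = layers.GlobalAveragePooling1D()(x)  # one plausible replacement
    outputs = layers.Dense(num_classes, activation="sigmoid")(x)
    return models.Model(inputs, outputs)

# Full model, w/o SE, w/o attention, and w/o both:
variants = {name: build_variant(use_se="SE" in name, use_attention="Attn" in name)
            for name in ["SE+Attn", "Attn", "SE", "plain"]}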

TABLE VI: Ablation experiment results, providing miss rates (MR) and accuracy (Acc) for each stutter type and model on the UCLASS and LibriStutter datasets. (S: sound repetition, W: word repetition, PH: phrase repetition, I: interjection, PR: prolongation, R: revision; LibriStutter does not include the revision class.)

Method | Dataset | S MR↓/Acc↑ | W MR↓/Acc↑ | PH MR↓/Acc↑ | I MR↓/Acc↑ | PR MR↓/Acc↑ | R MR↓/Acc↑ | Average MR↓/Acc↑
FluentNet | UCLASS | 16.78/84.46 | 3.43/96.57 | 3.86/96.26 | 24.05/81.95 | 5.34/94.89 | 2.62/97.38 | 9.35/91.75
w/o Attention | UCLASS | 16.97/83.13 | 3.51/96.29 | 4.23/95.78 | 24.22/80.78 | 6.88/92.50 | 3.25/96.55 | 9.84/90.84
w/o Squeeze-and-Excitation | UCLASS | 17.37/82.01 | 4.82/95.34 | 4.81/95.17 | 24.59/79.84 | 6.22/93.10 | 3.14/96.98 | 10.16/90.41
w/o Squeeze-and-Excitation & Attention | UCLASS | 18.18/82.83 | 4.96/95.04 | 5.32/93.68 | 28.89/71.01 | 8.30/91.72 | 3.30/96.70 | 11.49/88.50
FluentNet | LibriStutter | 17.65/82.24 | 4.11/94.69 | 5.71/94.32 | 29.78/70.12 | 7.88/92.14 | – | 13.03/86.70
w/o Attention | LibriStutter | 18.91/81.14 | 4.17/94.01 | 5.92/93.73 | 31.26/68.91 | 8.53/91.24 | – | 13.76/85.81
w/o Squeeze-and-Excitation | LibriStutter | 19.11/80.72 | 4.95/94.60 | 5.87/94.15 | 31.14/70.02 | 8.80/91.28 | – | 13.97/86.15
w/o Squeeze-and-Excitation & Attention | LibriStutter | 19.23/79.80 | 5.17/92.52 | 6.12/92.52 | 31.49/69.22 | 9.80/89.44 | – | 14.36/85.30

The results show that across both datasets, the SE component and the attention mechanism each individually benefit the model for most stutter types. Removing the SE component yields the greatest drop in accuracy and increase in miss rates across nearly all stutter types, making it the single most impactful removal. Removing the global attention mechanism, which forms the final stage of the model, also reduces the classification accuracy of FluentNet. With both the SE component and attention removed, the model shows a decline in accuracy and an increase in miss rates across all classes tested. Note that these ablation experiments lead to similar conclusions for both the UCLASS and our synthesized dataset (with a slightly higher impact observed on UCLASS vs. LibriStutter), thereby reinforcing the validity of LibriStutter's similarity to real stutters.

VI. CONCLUSION

Of the measurable metrics of speech, stutters continue to be among the most difficult to identify, as their diversity and uniqueness make them challenging for simple algorithms to model. To this end, we proposed FluentNet, an end-to-end deep neural network designed to accurately classify stuttered speech across six different stutter types: sound, word, and phrase repetitions, as well as revisions, interjections, and prolongations. The model uses a Squeeze-and-Excitation residual network to learn effective spectral frame-level speech representations, followed by recurrent bidirectional long short-term memory layers to learn temporal relationships from stuttered speech. A global attention mechanism was then added to focus on the salient parts of speech in order to accurately detect the relevant disfluencies. Through comprehensive experiments, we demonstrate that FluentNet achieves state-of-the-art results on disfluency classification with respect to other works in the area as well as a number of benchmark models on the public UCLASS dataset. Given the lack of sufficient data to facilitate more in-depth research on disfluency detection, we also developed a synthetic dataset, LibriStutter, based on the public LibriSpeech dataset.

Future work may include improving LibriStutter's realism, which could involve further research into the physical sound generation of stutters and how they translate to audio signals. Whereas this work focuses on the educational and business applications of speech metric analysis, further work may also target medical and therapeutic use-cases.

ACKNOWLEDGMENT

The authors would like to thank Prof. Jim Hamilton for his support and valuable discussion throughout this work. We also wish to acknowledge Adrienne Nobbe for her consultation towards this project.

REFERENCES

[1] S. H. Ferguson and S. D. Morgan, "Talker differences in clear and conversational speech: Perceived sentence clarity for young adults with normal hearing and older adults with hearing loss," Journal of Speech, Language, and Hearing Research, vol. 61, no. 1, pp. 159-173, 2018.
[2] J. S. Robinson, B. L. Garton, and P. R. Vaughn, "Becoming employable: A look at graduates' and supervisors' perceptions of the skills needed for employability," in NACTA Journal, vol. 51, 2007, pp. 19-26.
[3] Mayo Foundation for Medical Education and Research. (2017) Stuttering. [Online]. Available: https://www.mayoclinic.org/diseases-conditions/stuttering/diagnosis-treatment/drc-20353577
[4] H. Trinh, R. Asadi, D. Edge, and T. Bickmore, "Robocop: A robotic coach for oral presentations," ACM Conference on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 1, no. 2, p. 27, 2017.
[5] ASHA. (2020) Childhood fluency disorders. [Online]. Available: https://www.asha.org/Practice-Portal/Clinical-Topics/Childhood-Fluency-Disorders
[6] United Kingdom National Health Service. (2019) Stammering. [Online]. Available: https://www.nhs.uk/conditions/stammering/
[7] Anxiety and Depression Association of America. (2019). [Online]. Available: https://adaa.org/understanding-anxiety/social-anxiety-disorder/treatment/conquering-stage-fright
[8] Y. Wang, A. Mohamed, D. Le, C. Liu, A. Xiao, J. Mahadeokar, H. Huang, A. Tjandra, X. Zhang, F. Zhang et al., "Transformer-based acoustic modeling for hybrid speech recognition," in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6874-6878.
[9] Y. He, T. N. Sainath, R. Prabhavalkar, I. McGraw, R. Alvarez, D. Zhao, D. Rybach, A. Kannan, Y. Wu, R. Pang et al., "Streaming end-to-end speech recognition for mobile devices," in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6381-6385.
[10] A. Hajavi and A. Etemad, "A deep neural network for short-segment speaker recognition," INTERSPEECH, 2019.
[11] D. Snyder, D. Garcia-Romero, G. Sell, A. McCree, D. Povey, and S. Khudanpur, "Speaker recognition for multi-speaker conversations using x-vectors," in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5796-5800.
[12] R. Prenger, R. Valle, and B. Catanzaro, "WaveGlow: A flow-based generative network for speech synthesis," in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 3617-3621.
[13] N. Li, S. Liu, Y. Liu, S. Zhao, and M. Liu, "Neural speech synthesis with transformer network," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 6706-6713.
[14] P. Howell and S. Sackin, "Automatic recognition of repetitions and prolongations in stuttered speech," Proceedings of the First World Congress on Fluency Disorders, 1995.
[15] P. Howell, S. Sackin, and K. Glenn, "Development of a two-stage procedure for the automatic recognition of dysfluencies in the speech of children who stutter: II. ANN recognition of repetitions and prolongations with supplied word segment markers," Journal of Speech, Language, and Hearing Research, 1997.
[16] A. Dash, N. Subramani, T. Manjunath, V. Yaragarala, and S. Tripathi, "Speech recognition and correction of a stuttered speech," in 2018 International Conference on Advances in Computing, Communications and Informatics (ICACCI), 2018, pp. 1757-1760.
[17] S. Alharbi, M. Hasan, A. Simons, S. Brumfitt, and P. Green, "A lightly supervised approach to detect stuttering in children's speech," INTERSPEECH, pp. 3433-3437, 2018.
[18] P. Howell, S. Davis, and J. Bartrip, "The University College London archive of stuttered speech (UCLASS)," Journal of Speech, Language, and Hearing Research, vol. 52, pp. 556-569, 2009.
[19] T. Tan, Helbin-Liboh, A. K. Ariff, C. Ting, and S. Salleh, "Application of Malay speech technology in Malay speech therapy assistance tools," International Conference on Intelligent and Advanced Systems, pp. 330-334, 2007.
[20] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "LibriSpeech: An ASR corpus based on public domain audio books," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5206-5210.

[21] N. Zeghidour, Q. Xu, V. Liptchinsky, N. Usunier, G. Synnaeve, and R. Collobert, "Fully convolutional speech recognition," arXiv preprint arXiv:1812.06864, 2018.
[22] E. Yairi and N. G. Ambrose, "Early childhood stuttering I: Persistency and recovery rates," Journal of Speech, Language, and Hearing Research, vol. 42, 1999.
[23] F. S. Juste and C. R. F. de Andrade, "Speech disfluency types of fluent and stuttering individuals: Age effects," International Journal of Phoniatrics, Speech Therapy and Communication Pathology, vol. 63, 2011.
[24] Stuttering Foundation. (2020) Differential diagnosis. [Online]. Available: https://www.stutteringhelp.org/differential-diagnosis
[25] K. Ravikumar, S. Kudva, R. Rajagopal, and H. Nagaraj, "Development of a procedure for the automatic recognition of disfluencies in the speech of people who stutter," in International Conference on Advanced Computing Technologies, Hyderabad, India, 2008, pp. 514-519.
[26] H. K. M. Ravikumar and R. Rajagopal, "An approach for objective assessment of stuttered speech using MFCC features," Digital Signal Processing Journal, vol. 9, pp. 19-24, 2019.
[27] L. S. Chee, O. C. Ai, and S. Yaacob, "Overview of automatic stuttering recognition system," in Proc. International Conference on Man-Machine Systems, Batu Ferringhi, Penang, Malaysia, 2009, pp. 1-6.
[28] V. Zayats, M. Ostendorf, and H. Hajishirzi, "Disfluency detection using a bidirectional LSTM," INTERSPEECH, pp. 2523-2527, 2016.
[29] B. Villegas, K. M. Flores, K. José Acuña, K. Pacheco-Barrios, and D. Elias, "A novel stuttering disfluency classification system based on respiratory biosignals," in 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), 2019, pp. 4660-4663.
[30] J. Santoso, T. Yamada, and S. Makino, "Classification of causes of speech recognition errors using attention-based bidirectional long short-term memory and modulation spectrum," in 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2019, pp. 302-306.
[31] Q. Chen, M. Chen, B. Li, and W. Wang, "Controllable time-delay transformer for real-time punctuation prediction and disfluency detection," in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 8069-8073.
[32] K. Georgila, "Using integer linear programming for detecting speech disfluencies," in Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, 2009, pp. 109-112.
[33] T. Kourkounakis, A. Hajavi, and A. Etemad, "Detecting multiple speech disfluencies using a deep residual network with bidirectional long short-term memory," in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6089-6093.
[34] S. Khara, S. Singh, and D. Vir, "A comparative study of the techniques for feature extraction and classification in stuttering," in 2018 Second International Conference on Inventive Communication and Computational Technologies (ICICCT), 2018, pp. 887-893.
[35] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.
[36] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," CoRR, vol. abs/1709.01507, 2017. [Online]. Available: http://arxiv.org/abs/1709.01507
[37] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi, "Inception-v4, Inception-ResNet and the impact of residual connections on learning," arXiv preprint arXiv:1602.07261, 2016.
[38] A. G. Roy, N. Navab, and C. Wachinger, "Concurrent spatial and channel squeeze & excitation in fully convolutional networks," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2018, pp. 421-429.
[39] S. Bell, C. Lawrence Zitnick, K. Bala, and R. Girshick, "Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2874-2883.
[40] A. Newell, K. Yang, and J. Deng, "Stacked hourglass networks for human pose estimation," in European Conference on Computer Vision. Springer, 2016, pp. 483-499.
[41] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[42] Y. Ma, H. Peng, and E. Cambria, "Targeted aspect-based sentiment analysis via embedding commonsense knowledge into an attentive LSTM," in AAAI, 2018, pp. 5876-5883.
[43] H. Y. Kim and C. H. Won, "Forecasting the volatility of stock price index: A hybrid model integrating LSTM with multiple GARCH-type models," Expert Systems with Applications, vol. 103, pp. 25-37, 2018.
[44] P. Li, M. Abdel-Aty, and J. Yuan, "Real-time crash risk prediction on arterials based on LSTM-CNN," Accident Analysis & Prevention, vol. 135, p. 105371, 2020.
[45] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673-2681, 1997.
[46] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929-1958, 2014.
[47] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014.
[48] A. Hajavi and A. Etemad, "Knowing what to listen to: Early attention for deep speech representation learning," arXiv preprint arXiv:2009.01822, 2020.
[49] S. Mirsamadi, E. Barsoum, and C. Zhang, "Automatic speech emotion recognition using recurrent neural networks with local attention," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 2227-2231.
[50] T. Sun and A. A. Wu, "Sparse autoencoder with attention mechanism for speech emotion recognition," in 2019 IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS), 2019, pp. 146-149.
[51] M.-T. Luong, H. Pham, and C. D. Manning, "Effective approaches to attention-based neural machine translation," arXiv preprint arXiv:1508.04025, 2015.
[52] F. Chollet et al. (2015) Keras. [Online]. Available: https://keras.io
[53] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., "TensorFlow: A system for large-scale machine learning," in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265-283.
[54] B. McFee, C. Raffel, D. Liang, D. P. Ellis, M. McVicar, E. Battenberg, and O. Nieto, "librosa: Audio and music signal analysis in Python," in Proceedings of the 14th Python in Science Conference, vol. 8, 2015.
[55] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett, "DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1," STIN, vol. 93, p. 27403, 1993.
[56] A. Nagrani, J. S. Chung, and A. Zisserman, "VoxCeleb: A large-scale speaker identification dataset," arXiv preprint arXiv:1706.08612, 2017.
[57] L. S. Chee, O. C. Ai, M. Hariharan, and S. Yaacob, "MFCC based recognition of repetitions and prolongations in stuttered speech using k-NN and LDA," in 2009 IEEE Student Conference on Research and Development (SCOReD), 2009, pp. 146-149.
[58] O. C. Ai, M. Hariharan, S. Yaacob, and L. S. Chee, "Classification of speech dysfluencies with MFCC and LPCC features," Expert Systems with Applications, vol. 39, no. 2, pp. 2157-2165, 2012.
[59] (2020) Google Cloud Speech-to-Text. [Online]. Available: https://cloud.google.com/speech-to-text/
[60] J. Van Borsel, E. Geirnaert, and R. Van Coster, "Another case of word-final disfluencies," Folia Phoniatrica et Logopaedica, vol. 57, no. 3, pp. 148-162, 2005.
[61] J. Han, M. Kamber, and J. Pei, Data Mining, 3rd ed. Elsevier Inc., 2012.
[62] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
