Classification and Recognition of Stuttered Speech
Manu Chopra
Stanford University
([email protected])
Kevin Khieu
Stanford University
([email protected])
Thomas Liu
Stanford University
([email protected])
ABSTRACT

According to the National Institute on Deafness and Other Communication Disorders, over 3 million Americans stutter when they speak [3]. This speech impediment can significantly degrade the performance of speech recognizers, impairing the ability of individuals who stutter to use speech-related tools. Many of the voice interfaces that exist ubiquitously within today's consumer technology, from smart TVs to car systems, often neglect populations with speech ailments. As an example, Apple's Siri, when tested against various speech disorders including stuttering and slurred speech, reported accuracies ranging from as low as 18.2% to only as high as 73%. The significance of this issue is further substantiated when these accuracies are juxtaposed with the industry standard of 92% and above for users without impediments [10]. Inaccuracies of this degree render voice control tools difficult, if not impossible, for affected individuals to use reliably. Thus, our project seeks to improve the performance of automatic speech recognizers on speech containing stuttering, specifically by developing classifiers that can better detect stuttering in speech signals and by studying techniques for applying these classifiers to ASR models so that stuttered speech can be parsed out more effectively before the speech signals are processed.

1. INTRODUCTION

The lack of effective speech recognition for stuttered speech, especially among devices at the top of the market today, is an unfair consequence for the more than 70 million people in the world who stutter [14]. Assistants such as Apple's Siri, Microsoft Cortana or Google Now cater specifically to the majority of the population that does not stutter, meaning that these systems are less focused on detecting and accounting for speech impediments. The weakness of these speech recognizers is that they assume every sound a user makes is an intended sound, which is not the case for users who stutter, and this assumption makes these platforms ineffective for people with speech impediments.

Our approach tackles the problem on two levels: the classifier level and the ASR level. For our classifier, we worked on creating a model that could best identify whether a one-second clip of audio contained stuttering or not. Classification techniques commonly used in existing literature are Artificial Neural Networks (ANNs), Hidden Markov Models (HMMs) and Support Vector Machines (SVMs), with the most common best features being Mel-Frequency Cepstral Coefficients and spectral measures. We chose to investigate more thoroughly the effectiveness of neural networks on the problem, given the tools we were familiar with as well as the surprising lack of features tested on neural network classifiers in existing literature. Ultimately, we decided to employ two tools, TensorFlow and MATLAB, to conduct our studies, testing five different audio features as well as five different feature extraction mechanisms.

For our ASR studies, we experimented with applying our classifier at varying degrees to speech that contains stuttering. We then take the classified speech and, after removing the parts that have been identified as stuttering, pass it through IBM Watson's UK English Broadband Model to test word error rate (WER) and accuracy. The figure below illustrates both of these parts, and the sketch that follows outlines the same pipeline in code.
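To make the pipeline concrete, the sketch below walks through the ASR-side flow in Python: segment a recording into one-second windows, drop the windows a classifier flags as stuttered, and write out the cleaned audio for transcription. This is only an illustration of the idea, not our implementation; `classify_window` stands in for a trained classifier, the transcription step is left abstract, and the file handling details are assumptions.

```python
import numpy as np
import librosa
import soundfile as sf

WINDOW_SEC = 1.0  # one-second windows, the unit our classifier operates on

def remove_stuttered_windows(wav_path, classify_window, out_path):
    """Keep only the windows the classifier marks as fluent, then write the result.

    classify_window(samples, sr) -> True if the window contains stuttering (placeholder).
    """
    y, sr = librosa.load(wav_path, sr=None)
    hop = int(WINDOW_SEC * sr)
    kept = []
    for start in range(0, len(y), hop):
        window = y[start:start + hop]
        if len(window) == 0:
            break
        if not classify_window(window, sr):  # drop windows flagged as stuttered
            kept.append(window)
    cleaned = np.concatenate(kept) if kept else np.zeros(1)
    sf.write(out_path, cleaned, sr)
    return out_path
```

The cleaned file is then transcribed (in our case by IBM Watson's UK English Broadband Model) and scored against a reference transcript; the word error rate computation itself is sketched next.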
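WER is the standard word-level edit distance between the hypothesis and reference transcripts, normalized by the reference length; a minimal, self-contained version:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words and first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```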
2. BACKGROUND/RELATED WORK

We begin with a review of existing classification techniques currently utilized in automatic stuttering recognition, and a survey of the current literature on the topic. Classification techniques commonly used in existing literature are Artificial Neural Networks (ANNs), Hidden Markov Models (HMMs) and Support Vector Machines (SVMs).

[…] testing. The system yielded 94.35% accuracy, which is higher than their previous work.

The table below, created by Lim Sin Chee and team, summarizes several research works on automatic stuttering recognition systems and details the databases, features and classifiers chosen by the researchers and the accuracy obtained. [8]
3. OVERVIEW OF APPROACHES

We take a two-fold approach in this study of stuttered speech. The first part is studying the effectiveness of neural networks as classifiers for stuttered and non-stuttered speech. Our decision to use neural networks was motivated by the large amount of past work on the topic, as well as by the trends we noticed in the features used in studies involving ANNs. Furthermore, the small amount of data studied in prior work was an incentive for us to try something that would add to this field.

The University College London's releases of stuttered speech data are the most commonly cited recent data sets for speech studies regarding stuttered signals. Both releases contain data in MP3 and WAV formats, and also include a few transcriptions that are helpful in our analysis. The main drawback of this data set is the imbalance between female and male data samples. Because many more samples are from male speakers than female speakers, there is a concern about whether classifiers trained on this data will perform equally well on both genders.

4.2 Formatting Data

For our purposes, and because no labeled data relating to stuttered speech is available on the internet, we needed to transcribe by hand which time windows in each audio file contained stuttering. 28 files, each roughly three minutes long, were labeled for the purposes of our models. For each .WAV file, a corresponding .TXT file was created that sequentially identified which segments in the audio clip were stuttered and which were non-stuttered speech (a sketch of how such labels can be turned into training windows follows at the end of this section).

The goal of our classifier-building approach is to test the effectiveness of neural networks, one of the most commonly used approaches in past literature, on classifying stuttered and non-stuttered speech. More specifically, we wanted to test two sub-approaches: one in which we test the effects of various features other than MFCCs, and one in which we test different processes for extracting features. This portion of our work has three parts: our Baseline (basic TensorFlow implementation), our Advanced TensorFlow implementation, and our MATLAB implementation.
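To make the labeling scheme from Section 4.2 concrete, the sketch below shows one way such per-file segment labels could be turned into labeled one-second training windows. The `start end label` line format, the `parse_labels` helper, and the overlap rule are illustrative assumptions, not our actual transcription format.

```python
import librosa

def parse_labels(txt_path):
    """Read 'start_sec end_sec label' lines, label being 'stutter' or 'fluent' (assumed format)."""
    segments = []
    with open(txt_path) as f:
        for line in f:
            start, end, label = line.split()
            segments.append((float(start), float(end), label == "stutter"))
    return segments

def make_windows(wav_path, txt_path, window_sec=1.0):
    """Slice the recording into one-second windows, each labeled 1 (stutter) or 0 (fluent)."""
    y, sr = librosa.load(wav_path, sr=None)
    segments = parse_labels(txt_path)
    hop = int(window_sec * sr)
    windows = []
    for start in range(0, len(y) - hop + 1, hop):
        t0, t1 = start / sr, (start + hop) / sr
        # A window counts as stuttered if it overlaps any labeled stutter segment.
        label = int(any(s < t1 and e > t0 for s, e, is_stutter in segments if is_stutter))
        windows.append((y[start:start + hop], label))
    return windows
```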
5.1 Baseline

The goal of our Baseline classifier was to build as simple a neural network as possible, using common architectures cited in previous literature in this field. Given our shared experience in this class, our baseline classifier was created using a simple two-layer neural network in TensorFlow. Because the most common feature in previous studies was Mel-Frequency Cepstral Coefficients (MFCCs), we committed to using this sole feature in our baseline studies. To extract this feature from our WAV files, we relied on the Librosa library, an audio processing library in Python with various feature extraction tools for WAV files. [7]

5.2 Advanced TensorFlow Implementation

For our advanced implementation, we used the following features alongside MFCCs:

• Chroma: Chroma relates to the twelve different pitch classes, capturing the harmonic and melodic characteristics of speech. For each sample, we rate the audio signal on each of these twelve pitch classes in terms of intensity, passing along these values as the features we use [4].

• Mel Spectrogram: The Mel spectrogram serves as another acoustic time-frequency representation of a sound: the power spectral density. This feature samples at roughly equally spaced times and frequencies for the given WAV signal. [9]

• Spectral Contrast: Spectral contrast features mainly serve to identify the timbre of audio signals. Timbre is the perceived sound quality of a sound or tone that distinguishes different types of sound production, such as choir voices versus musical instruments. In other words, this is the feature that we use to perceive different "categories" of sound. [1]

• Tonal Centroid Features: Tonal centroid features, also known as tonnetz, form a conceptual lattice diagram representing tonal space, in other words a "tone network". Tonnetz features allow us to study harmonic features and relationships in our audio files. [16]

Regarding the neural network, we kept the same two-layer configuration as before, partially due to time constraints but also because we wanted to focus on the direct impact of these five features on our baseline model. To extract each of these five features, we defaulted to using a mean feature extractor (a sketch of this extraction step appears below).
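A minimal sketch of this extraction step, using the Librosa routines for the five features and averaging each feature matrix over time (the mean feature extractor mentioned above). The frame parameters are left at Librosa's defaults, which is an assumption rather than a record of our exact settings:

```python
import numpy as np
import librosa

def extract_features(wav_path):
    """Return one mean-pooled vector per feature type for a single audio clip."""
    y, sr = librosa.load(wav_path, sr=None)
    feats = {
        "mfcc": librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13),
        "chroma": librosa.feature.chroma_stft(y=y, sr=sr),
        "mel": librosa.feature.melspectrogram(y=y, sr=sr),
        "contrast": librosa.feature.spectral_contrast(y=y, sr=sr),
        "tonnetz": librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr),
    }
    # Mean feature extractor: average each (n_bins x n_frames) matrix over the time axis.
    return {name: np.mean(mat, axis=1) for name, mat in feats.items()}
```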
5.3 MATLAB Implementation

The number of inputs is the size of the feature vector used. The hidden layers are linear weighted subproblems over a sigmoid activation function. If w_i is a signal, then for each hidden unit h_j,

h_j = σ(v_j · φ(w_i)),

with v_j a learned weight and the logistic activation function

σ(z) = (1 + e^(−z))^(−1).

We trained a feedforward network using scaled conjugate gradient backpropagation to update the weights and measured performance using cross entropy. We evaluated the performance for different hidden layer sizes and found that a hidden layer of size 6 maximizes the training and test performance. 85% of the trials were used for training and 15% of the trials were held out for testing. [2]

6. APPROACH: ASR MODEL

The initial goal with our ASR model was to test whether this problem was even worth solving: we wanted to verify for ourselves the accuracy of state-of-the-art automatic speech recognizers on speech that contained stuttered segments. The baseline model against which we compared overall end-to-end performance is a state-of-the-art neural network, IBM Watson's UK English broadband (16 kHz) model. As a pure baseline, our goal was to see how this model fared when fed audio from which the stuttered components had not been removed. To conduct this experiment, we took five data files from the UCLASS data set and ran them through our ASR model. The results can be found below:

7. EXPERIMENTS

The following section details the processes and results of the experiments we conducted.

7.1 Classifier: Baseline

The goal of our baseline classifier was to test the simplest possible solution we could create: a two-layer neural network in TensorFlow that captured only MFCC features of the audio files we passed in. Upon setting up our model, we optimized the number of epochs and the learning rate, eventually settling on 5,000 epochs and a learning rate of 0.01. The TensorFlow model used a held-out subset of 53 randomly selected (half stutter, half non-stutter) audio files as the validation set. This configuration yielded a best accuracy of 66.0%. Detailed results of our experiment are listed below:

From this experiment, we found that a baseline using only MFCC features was enough to get fairly high accuracy on our test set. A sketch of this baseline configuration follows.
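Below is a minimal sketch of that baseline configuration, rendered with the present-day tf.keras API rather than our original TensorFlow code, using the reported hyperparameters (learning rate 0.01, 5,000 epochs). The hidden-layer width, optimizer choice, and input dimensionality are assumptions:

```python
import tensorflow as tf

N_MFCC = 13          # assumed size of the mean-pooled MFCC vector per clip
HIDDEN_UNITS = 32    # assumed hidden layer width

def build_baseline():
    """Two-layer classifier: one hidden layer over MFCC features, one sigmoid output."""
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(HIDDEN_UNITS, activation="sigmoid", input_shape=(N_MFCC,)),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # P(clip contains stuttering)
    ])
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model

# Usage (x_train/x_val: MFCC feature vectors, y_*: 0/1 stutter labels):
# model = build_baseline()
# model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=5000, verbose=0)
```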
7.2 Classifier: MATLAB Implementation

[…] to us. A great deal of our time and energy over the last quarter was spent locating, labeling, and processing data for our purposes, and even then the data distribution was so variable that, after processing 28 audio files from UCLASS (a total of over 1,788 speech samples) to train and test on, our results still did not reach reliable levels.
[6] K. M. Ravikumar, R. Rajagopal and H. C. Nagaraj. "An Approach for Objective Assessment of Stuttered Speech Using MFCC Features". In: ICGST International Journal on Digital Signal Processing 9 (2009), pp. 19–24.

[7] Librosa Feature Extraction. url: https://librosa.github.io/librosa/feature.html.

[8] Lim Sin Chee, Ooi Chia Ai and Sazali Yaacob. "Overview of Automatic Stuttering Recognition System". In: (2009), pp. 1–6.

[9] Mel Frequency Cepstrum. url: https://en.wikipedia.org/wiki/Mel-frequency_cepstrum.

[10] Emily Mullin. Why Siri Won't Listen to Millions of People with Disabilities. url: https://www.scientificamerican.com/article/why-siri-won-t-listen-to-millions-of-people-with-disabilities/.

[11] P. Howell, S. Sackin and K. Glenn. "Development of a two-stage procedure for the automatic recognition of dysfluencies in the speech of children who stutter: II. ANN recognition of repetitions and prolongations with supplied word segment markers". In: Journal of Speech, Language, and Hearing Research 40 (1997), p. 1085.

[12] Kishore Prahallad. Speech Technology: A Practical Introduction, Topic: Spectrogram, Cepstrum and Mel-Frequency Analysis. url: http://www.speech.cs.cmu.edu/15-492/slides/03_mfcc.pdf.

[13] UCL Division of Psychology and Language Sciences. UCLASS Release Two. url: http://www.uclass.psychol.ucl.ac.uk/uclass2.htm.

[14] Stuttering Facts and Information. url: http://www.stutteringhelp.org/faq.

[15] T. Tian-Swee, L. Helbin and S. H. Salleh. "Application of Malay speech technology in Malay Speech Therapy Assistance Tools". In: Intelligent and Advanced Systems (2007), pp. 330–334.

[16] Dmitri Tymoczko. "The Generalized Tonnetz". In: Journal of Music Theory 56 (2012), pp. 1–3.

Acknowledgments

As a team we would like to thank University College London for the work it has put into producing the two large releases of stuttered speech data over the past decade and a half. Without their work, and the efforts of the researchers who came before us in this field, none of our efforts would have been possible. We would also like to thank the CS224S teaching staff, both for teaching us the skills necessary to pursue this project and for providing us guidance.