
Classification and Recognition of Stuttered Speech

Manu Chopra
Stanford University
([email protected])
Kevin Khieu
Stanford University
([email protected])
Thomas Liu
Stanford University
([email protected])
ABSTRACT

According to the National Institute on Deafness and Other Communication Disorders, over 3 million Americans stutter when they speak [3]. This speech impediment can have significant effects on the performance of speech recognizers, thereby hurting the ability of individuals who stutter to use speech-related tools. Many of the voice interfaces that exist ubiquitously within today's consumer technology, from smart TVs to car systems, often neglect populations with speech ailments. As an example, Apple's Siri, when tested against various speech disorders including stuttering and slurred speech, reported accuracies ranging from as low as 18.2% to only as high as 73%. The significance of this issue is further substantiated when these accuracies are juxtaposed with the industry standard of 92% and above for users without impediments [10]. Inaccuracies of this degree render voice control tools difficult, if not impossible, for affected individuals to use reliably. Thus, our project seeks to improve the performance of automatic speech recognizers on speech containing stuttering, specifically by developing classifiers that can better detect stuttering in speech signals, and by studying techniques for applying these classifiers to ASR models so that we can more effectively parse out stuttered speech before processing these speech signals.

1. INTRODUCTION

The lack of effective speech recognition for stuttered speech, especially among devices at the top of the market today, is an unfair consequence for the more than 70 million people in the world who stutter [14]. Assistants such as Apple's Siri, Microsoft Cortana or Google Now specifically cater to the majority of the population that does not stutter, meaning that these systems are less focused on detecting and accounting for speech impediments. The weakness of these speech recognizers is that they assume every sound a user makes is an intended sound, which is not the case for users who stutter, and this leads to the ineffectiveness of these platforms for those with speech impediments.

Our approach tackles the problem on two levels: the classifier level and the ASR level. For our classifier, we worked on creating a model that could best identify whether a 1-second clip of audio contained stuttering or not. Classification techniques commonly used in existing literature are Artificial Neural Networks (ANNs), Hidden Markov Models (HMMs) and Support Vector Machines (SVMs), with the most common best features being Mel-Frequency Cepstral Coefficients and spectral measures. We chose to more thoroughly investigate the effectiveness of neural networks on the problem, given the tools we were familiar with as well as the surprising lack of features tested on neural network classifiers in existing literature. Ultimately, we decided to employ two tools, TensorFlow and MATLAB, to conduct our studies, testing five different audio features as well as five different feature extraction mechanisms.

For our ASR studies, we experimented with applying our classifier at varying degrees to speech that contains stuttering. We then take the classified speech and, after removing the parts that have been identified as stutter, pass it through IBM Watson's UK English Broadband Model to test for WER/accuracy. The figure below shows both of these parts:

[Figure: two-stage system overview, the stutter classifier followed by the ASR model.]

2. BACKGROUND/RELATED WORK

We will begin by conducting a thorough review of existing classification techniques that are currently utilized in automatic stuttering recognition, as well as a survey of the current literature on the topic. Classification techniques commonly used in existing literature are Artificial Neural Networks (ANNs), Hidden Markov Models (HMMs) and Support Vector Machines (SVMs).

2.1 Artificial Neural Networks (ANNs)

Previous literature has shown that ANNs can be used as a tool in speech analysis, for both fluent and non-fluent speakers. Howell et al. have performed several research studies using ANNs as the classification technique. In their work, an ANN model is used to detect stuttered events, in particular repetitions and prolongations, because repetitions and prolongations are ubiquitous in stuttered speech.

In a 1997 study, Peter Howell and his team used ANNs to correctly identify 78.01% of disfluent (a combination of prolongations and repetitions) words [11].

In 2009, Swietlicka et al. presented research concerning the automatic detection of disfluency in stuttered speech. They applied Multilayer Perceptron (MLP) and Radial Basis Function (RBF) networks to recognize and classify fluent and non-fluent speech samples, yielding classification correctness for all networks ranging between 88.1% and 94.9% [5].

2.2 Hidden Markov Models (HMMs)

HMMs are widely used in speech recognition, and in stuttering recognition in particular they are used to detect the previously mentioned speech disfluencies such as repetitions and prolongations.

Tian-Swee et al. presented an automatic stuttering recognition system that utilizes the HMM technique to evaluate speech problems such as stuttering in children. The voice patterns of non-stuttering and stuttering children are used to train the HMM model. The average speech recognition rate was 96% for non-stuttering speakers and 90% for stuttering speakers [15].

2.3 Support Vector Machines (SVM)

SVMs are a powerful machine learning tool and are widely used in the field of pattern recognition. Ravikumar et al. [6] used SVMs to classify between fluent and disfluent speech. The speech samples were collected from 15 adults who stutter; 12 samples were used for training and the remaining three samples were used for testing. The system yielded 94.35% accuracy, which is higher than their previous work.

The table below, created by Lim Sin Chee and team, summarizes several research works on automatic stuttering recognition systems and details the databases, features and classifiers chosen by the researchers and the accuracy obtained [8].

2.4 Summary

To summarize, the existing research literature tells us that we can use ANNs, HMMs and SVMs to classify stuttered and non-stuttered speech with considerable accuracy (greater than 90%). A summary of the existing research is shown below:

[Table: summary of prior automatic stuttering recognition systems (databases, features, classifiers and accuracies), from Lim Sin Chee et al. [8].]

It is, however, important to note that most of the studies in this field were done with a very small amount of training data. Notice how most of the studies in the table above use around 8-15 speakers. Ravikumar's study that employed SVMs to classify speech, for example, used 8 training speech samples and was tested on just 2 speech samples.

Moreover, some of the studies were trained and tested with artificial stuttered speech samples (i.e., computationally created speech samples that are made to "sound" like actual stuttered speech). Tian-Swee et al.'s [15] work with HMMs, for example, reports 90% accuracy for stuttered speech but was only tested on artificial stuttered speech and has not been tested on speech samples taken from actual people.

We plan to focus on neural networks for our studies, in particular looking more into the kinds of features that may add value to classifiers seeking to extract/remove stutter from speech. We will also experiment with the ways we can extract these features.

3. OVERVIEW OF APPROACHES

We take a two-fold approach in this study of stuttered speech. The first part is studying the effectiveness of neural networks as classifiers for stuttered and non-stuttered speech. Our decision to use neural networks was based on the large amount of past work done on the topic, as well as the trends we noticed in the features used for studies involving ANNs. Furthermore, the small amount of data studied in prior work was an incentive for us to try something that would add to this field.

The second part of our approach is to apply our classifier in various ways to a state-of-the-art Automatic Speech Recognizer, to see how best we can apply our algorithms so that the speech we produce is clearer than it was before.

Our Approaches section will also discuss the data-collection process for our project. Due to the lack of readily available databases of stuttered and non-stuttered speech, we invested a lot of time in searching for, labeling, and organizing data to meet our needs.

We go into more depth on these topics below.

4. APPROACH: DATA COLLECTION

4.1 Data Set

The University College London's Archive of Stuttered Speech contains two releases of recordings containing stuttered speech of various levels of severity, the first from 2004 and the second from 2008 [13]. The two releases encompass a range of ages from 5 to 47 years old and include a total of 139 human speakers (split almost equally among males and females). We focused on using data from Release 2 due to its more recent nature as well as the ease of labeling the sounds gathered there.

The University College London's releases of stuttered speech data are the most commonly cited recent data sets for speech studies regarding stuttered signals. Both releases contain data in MP3 and WAV formats, and also include a few transcriptions that were helpful in our analysis. The only drawback of this data set is the imbalance between female and male data samples. Because many more samples come from male speakers than from female speakers, there is a concern about whether classifiers trained on this data will perform equally well on both genders.

4.2 Formatting Data

For our purposes, and due to the lack of any labeled data on the internet relating to stuttered speech, we needed to transcribe by hand which time windows in each audio file contained stuttering. 28 files, each roughly three minutes long, were labeled for the purposes of our models. For each .WAV file, a corresponding .TXT file was created that sequentially identified which segments in the audio clip were stuttered and which were non-stuttered speech.

Afterwards, we used these labeled data files and parsed each one into labeled audio snippets of at most 1 second in length, using a Python script to facilitate this process. The decision to clip these audio files into 1-second snippets was meant to simplify the task of our classifier: the goal of the classifiers we built is simply to look at a 1-second window of speech and identify whether or not that window contains stuttered or non-stuttered speech (versus the alternative, where our classifier has no standard time window for analyzing speech). These labeled WAV files were passed as training data into our classifiers to facilitate this learning.

Once this parsing was completed, we had 1,788 manually labeled data samples with which to train and validate our classifiers.
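The slicing step can be sketched as follows. This is an illustrative version rather than our exact script: it assumes a hypothetical label format in which each line of the .TXT file gives a start time, an end time, and a stutter/fluent tag, and it uses the soundfile package for writing the clips.

    import librosa
    import soundfile as sf

    SR = 16000          # working sample rate (an assumption for this sketch)
    CLIP_SECONDS = 1.0  # the classifier operates on windows of at most 1 second

    def slice_labeled_file(wav_path, label_path, out_dir):
        """Cut one labeled recording into <=1 s snippets tagged stutter / fluent.

        The label file is assumed to hold one segment per line:
            <start_seconds> <end_seconds> <stutter|fluent>
        (a hypothetical format; the parser should be adapted to the real transcriptions).
        """
        y, sr = librosa.load(wav_path, sr=SR)
        clip_len = int(CLIP_SECONDS * sr)

        with open(label_path) as f:
            segments = [line.split() for line in f if line.strip()]

        for i, (start, end, tag) in enumerate(segments):
            s, e = int(float(start) * sr), int(float(end) * sr)
            # Walk through the segment in 1-second hops; the last clip may be shorter.
            for j, offset in enumerate(range(s, e, clip_len)):
                clip = y[offset:min(offset + clip_len, e)]
                sf.write(f"{out_dir}/seg{i:03d}_{j:02d}_{tag}.wav", clip, sr)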

5. APPROACH: CLASSIFIER

The goal of our classifier-building approach is to test the effectiveness of neural networks, one of the most commonly used approaches in past literature, at classifying stuttered and non-stuttered speech. More specifically, we wanted to test two sub-approaches: one where we test the effects of various features other than MFCCs, and one where we test different processes for extracting features. This portion of our work has three parts: our Baseline (basic TensorFlow implementation), our Advanced TensorFlow implementation, and our MATLAB implementation.

5.1 Baseline

The goal of our Baseline classifier was to build as simple a neural net as possible using common architectures cited in previous literature in this field. Given our shared experience in this class, our baseline classifier was created using a simple two-layer neural network in TensorFlow. Because the most common feature in previous studies was Mel-Frequency Cepstral Coefficients (MFCCs), we committed to using this sole feature in our baseline studies. To extract this feature from our WAV files, we relied on the Librosa library, an audio processing library in Python with various feature extraction tools for WAV files [7].
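For illustration, the shape of this baseline is sketched below using Librosa's MFCC extractor and the Keras API of TensorFlow. This is not the exact code we ran; the layer size and the MFCC count are placeholder choices.

    import librosa
    import tensorflow as tf

    def mfcc_features(wav_path, sr=16000, n_mfcc=13):
        """Mean-pooled MFCCs for one clip -> fixed-length feature vector."""
        y, sr = librosa.load(wav_path, sr=sr)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape (n_mfcc, frames)
        return mfcc.mean(axis=1)                                # shape (n_mfcc,)

    def build_baseline(input_dim, learning_rate=0.01):
        """Simple two-layer feedforward classifier: stutter vs. non-stutter."""
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(32, activation="relu", input_shape=(input_dim,)),
            tf.keras.layers.Dense(1, activation="sigmoid"),
        ])
        model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate),
                      loss="binary_crossentropy",
                      metrics=["accuracy"])
        return model

    # Usage sketch: X stacks one feature vector per 1-second clip from Section 4.2,
    # y holds the 0/1 stutter labels, and (X_val, y_val) is a held-out subset.
    # model = build_baseline(X.shape[1])
    # model.fit(X, y, epochs=5000, validation_data=(X_val, y_val))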

5.2 Advanced TensorFlow Implementation

After completing our Baseline, our first goal was to analyze the effects of different audio features on the success of our classifier (as well as to verify the effectiveness of MFCC features for stuttering/non-stuttering classification). Specifically, the features we studied were:

• Mel-Frequency Cepstral Coefficients: A Mel-Frequency Cepstrum is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency. Mel-Frequency Cepstral Coefficients are the coefficients that comprise this representation [12].

• Chroma: Chroma relates to the twelve different pitch classes, capturing the harmonic and melodic characteristics of speech. For each sample, we rate the audio signal on each of these twelve pitch classes in terms of intensity, passing along these values as the features we use [4].

• Mel Spectrogram: The Mel Spectrogram serves as another acoustic time-frequency representation of a sound: the power spectral density. This feature is sampled at approximately equally spaced times and frequencies for the given WAV signal [9].

• Spectral Contrast: Spectral Contrast features mainly serve to identify the timbre of audio signals. Timbre is the perceived sound quality of a sound or tone that distinguishes different types of sound production, such as choir voices versus musical instruments. In other words, this is the feature we use to perceive different "categories" of sound [1].

• Tonal Centroid Features: Tonal Centroid Features, also known as tonnetz, form a conceptual lattice diagram representing tonal space, in other words a "tone network". Tonnetz features allow us to study harmonic features and relationships in our audio files [16].

Regarding the neural network we used, we kept the same two-layer configuration as before, partially due to time constraints but also because we wanted to focus more on the direct impact of these five features on our baseline model. To extract each of these five features, we defaulted to using a mean feature extractor.
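As an illustration of this extraction step, the listing below pulls each of the five features with Librosa and mean-pools it over time frames; concatenating them into one vector per clip is an illustrative choice rather than a fixed part of the method.

    import numpy as np
    import librosa

    def five_feature_vector(wav_path, sr=16000):
        """Extract the five features for one clip and mean-pool each over time frames."""
        y, sr = librosa.load(wav_path, sr=sr)
        feats = [
            librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13),    # Mel-frequency cepstral coefficients
            librosa.feature.chroma_stft(y=y, sr=sr),        # twelve pitch-class intensities
            librosa.feature.melspectrogram(y=y, sr=sr),     # mel-scaled spectrogram
            librosa.feature.spectral_contrast(y=y, sr=sr),  # spectral contrast (timbre)
            librosa.feature.tonnetz(y=y, sr=sr),            # tonal centroid (tonnetz)
        ]
        # Mean feature extractor: average each (bands x frames) matrix over frames,
        # then concatenate into a single fixed-length vector.
        return np.concatenate([f.mean(axis=1) for f in feats])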

5.3 MATLAB Implementation

For our MATLAB implementation, we used 5 different approaches to extract features: RMS (root mean square), standard mean, median frequency, peak-to-peak analysis (the difference between the highest and the lowest frequency peaks in a .wav file), and a specialized mean feature extractor in which we divided the .wav signal into 8 subparts and took the mean value of every single subpart. To conduct these experiments, we used MATLAB's Neural Network Toolbox.
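The specialized mean extractor is simple enough to sketch in a few lines; it is shown here in Python/NumPy for consistency with the other listings, although our implementation used MATLAB (median frequency and the peak-to-peak measure are omitted from this sketch).

    import numpy as np

    def segmented_mean(signal, n_parts=8):
        """Split a clip into n_parts equal subparts and take the mean of each."""
        parts = np.array_split(np.asarray(signal, dtype=float), n_parts)
        return np.array([p.mean() for p in parts])   # length-8 feature vector

    def rms_and_mean(signal):
        """Two of the simpler per-clip statistics: RMS and the standard mean."""
        x = np.asarray(signal, dtype=float)
        return np.array([np.sqrt(np.mean(x ** 2)), x.mean()])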

In artificial neural networks, the burden of making assumptions about the structure of the data is transferred to the training of hidden layers that solve "subproblems" of the given input.

The number of inputs is the size of the feature vector used. The hidden layer applies linearly weighted "subproblems" followed by a sigmoid activation function. If $w_i$ is a signal with feature vector $\phi(w_i)$, then for each hidden unit $h_j$,

$$h_j = \sigma(v_j \cdot \phi(w_i))$$

where $v_j$ is a learned weight vector and $\sigma$ is the logistic activation function

$$\sigma(z) = (1 + e^{-z})^{-1}.$$

We trained a feedforward network using scaled conjugate gradient backpropagation to update the weights and measured performance using cross-entropy. We evaluated performance for different hidden-layer sizes and found that a hidden layer of size 6 maximizes training and test performance. 85% of the trials were used for training and 15% of the trials were held out for testing [2].
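For concreteness, the forward pass described by these equations can be written in a few lines of NumPy. This is a sketch of the computation only; the actual training (scaled conjugate gradient, cross-entropy) was done through MATLAB's Neural Network Toolbox.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward(phi_w, V, U):
        """phi_w: feature vector phi(w_i); V: hidden-layer weights, one row v_j per
        hidden unit (6 in our best configuration); U: output-layer weights."""
        h = sigmoid(V @ phi_w)   # hidden units h_j = sigma(v_j . phi(w_i))
        return sigmoid(U @ h)    # score for the clip containing stuttering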

6. APPROACH: ASR MODEL

The initial goal of our ASR model was to test whether this problem was even worth solving: we wanted to verify for ourselves the accuracy of state-of-the-art Automatic Speech Recognizers on speech that contains stuttered segments. The baseline against which we compared overall end-to-end performance is a state-of-the-art neural network trained on IBM Watson's UK English broadband (16 kHz) model. As a pure baseline, our goal was to see how this model fared when fed audio from which the stuttered components had not been removed. To conduct this experiment, we took five data files from the UCLASS data set and ran them through our ASR model. The results can be found below:

[Table: baseline ASR accuracy on five stuttered UCLASS recordings.]

This small experiment showed us that existing ASR models (which perform considerably well on non-stuttered speech) do not perform very well on stuttered speech. Later in the paper, we feed modified audio signals (with the stuttered components of the signal removed) to the same ASR model, to compare the two approaches.
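Word error rate, the metric behind these accuracy figures, is conventionally computed as a word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. A straightforward implementation is sketched below; our own scoring may have differed in details such as text normalization.

    def word_error_rate(reference, hypothesis):
        """WER = (substitutions + insertions + deletions) / number of reference words."""
        ref, hyp = reference.split(), hypothesis.split()
        # Dynamic-programming edit distance over words.
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution or match
        return d[len(ref)][len(hyp)] / max(len(ref), 1)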
7. EXPERIMENTS

The following section details the processes and results of the experiments we conducted.

7.1 Classifier: Baseline

The goal of our baseline classifier was to test the simplest possible solution we could create: a two-layer neural network in TensorFlow that captured only MFCC features of the audio files we passed in. Upon setting up our model, we optimized the number of epochs and the learning rate, eventually settling on 5,000 epochs and a learning rate of 0.01. The TensorFlow model used a held-out subset of 53 randomly selected (half stutter, half non-stutter) audio files as the validation set. This configuration yielded a best accuracy of 66.0%.

Detailed results of our experiment are listed below:

[Table: baseline TensorFlow classifier results.]

From this experiment, we found that a baseline using only MFCC features was enough to achieve fairly high accuracy on our test set.

7.2 Classifier: MATLAB Implementation

Our MATLAB models outperformed the TensorFlow models. We tried three different implementations: Neural Networks, Naive Bayes and SVM.

After some extensive experiments, we found that Neural Networks coupled with a standard mean feature extractor gave us the best results. We were able to achieve an average accuracy of 85.4% for men and 78% for women (we believe that the disparity is due to pre-existing biases in our labeled database). Some of our results, for both men and women, are shown below, with the best results highlighted in yellow.

[Table: MATLAB classifier results for male and female speakers, best results highlighted.]

These results are indeed very promising. We then used our best model to classify the stuttered and non-stuttered components of 5 audio signals (the same files on which we tested our ASR model in Section 6). We use the results from our model to remove the stuttered components of these 5 files and, finally, we feed the modified audio signals (with the stuttered components removed) to our previously described ASR model (a 6-layer neural network trained on IBM Watson's UK English broadband (16 kHz) model).
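A sketch of this filtering step is shown below: drop the 1-second windows the classifier marks as stuttered and concatenate the remaining audio before sending it to the recognizer. The classifier is passed in as a stand-in callable rather than the exact model object we used.

    import numpy as np
    import librosa
    import soundfile as sf

    def remove_stuttered_windows(wav_path, classify_clip, out_path, sr=16000, win_s=1.0):
        """Drop every 1-second window flagged as stuttered and keep the rest.

        `classify_clip` stands in for the trained classifier: it takes a window of
        samples and returns True if the window is judged to contain stuttering.
        """
        y, sr = librosa.load(wav_path, sr=sr)
        win = int(win_s * sr)
        kept = [y[i:i + win] for i in range(0, len(y), win)
                if not classify_clip(y[i:i + win])]
        cleaned = np.concatenate(kept) if kept else np.zeros(0)
        sf.write(out_path, cleaned, sr)   # the cleaned file is then sent to the ASR model
        return out_path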
The table below compares the accuracies of our vanilla ASR model from Section 6 (which is fed the stuttered audio signal) and our modified ASR model (which is fed the audio signal with all the stuttered components removed).

[Table: vanilla vs. modified ASR accuracy on the five test files.]

As the table above clearly shows, our modified ASR outperforms the vanilla ASR for every single audio signal, with accuracies ranging from a low of 66% to a high of 81.7%. It is interesting to compare this accuracy range to Siri's accuracy range on stuttered speech (from as low as 18.2% to only as high as 73%).

8. CONCLUSION

8.1 Regarding Findings

Perhaps the most important discovery we made was that it is in fact possible to build good ASRs for stuttered speech. With just a six-layer neural network using MATLAB's Neural Network Toolbox, we were able to obtain a best accuracy of over 85% on males and 78% on females, a model which also yielded strong results when used to process speech via IBM Watson's UK English Broadband Model. However, as our report showed, the largest struggle we faced was the lack of stuttering-speech data available to us. A great deal of our time and energy over the last quarter was spent locating, labeling, and processing data for our purposes, and even then, the data distribution was so variable that after processing over 28 audio files from UCLASS (a total of over 1,788 speech samples to train and test on), our results still did not reach reliable levels.

Thus, the biggest impediment to our project (and what will ultimately be the bulk of our future work) was the process of having to both obtain and label stuttering audio data to augment our classifiers. We believe that if more data were collected and made available, our efforts to test and optimize classifier approaches would be significantly improved.

8.2 Future Work

From here, more work will be put into overcoming our data limitations so that we can more effectively train and perfect classifier models for stuttered speech. Preferably, we would like to incorporate the entire UCLASS data set into our work, alongside more stuttered-speech data libraries. Because neural network tools are the most effective approach towards solving this problem, we will continue exploring neural network optimizations alongside the addition of more features (e.g. talking speed). Upon developing a better classification tool, we would then focus on optimizing how we apply our classification model to actual speech, the goal being that we can refactor any stuttered speech fed to us into non-stuttered speech.

References

[1] V. Akkermans and J. Serra. "Shape-Based Spectral Contrast Descriptor". In: SMC Network (2009), pp. 143-145.

[2] Manu Chopra. "Classifying Syllables in Imagined Speech using EEG Data". In: (2015).

[3] National Institute on Deafness and Other Communication Disorders. Stuttering. url: https://www.nidcd.nih.gov/health/stuttering.

[4] Dan Ellis. Chroma Feature Analysis and Synthesis. url: https://labrosa.ee.columbia.edu/matlab/chroma-ansyn/.

[5] I. Swietlicka, W. Kuniszyk-Jozkowiak and E. Smolka. "Artificial Neural Networks in the Disabled Speech Analysis". In: Computer Recognition Systems 3, 57/2009 (2009), pp. 347-354.

[6] K. M. Ravikumar, R. Rajagopal and H. C. Nagaraj. "An Approach for Objective Assessment of Stuttered Speech Using MFCC Features". In: ICGST International Journal on Digital Signal Processing 9 (2009), pp. 19-24.

[7] Librosa Feature Extraction. url: https://librosa.github.io/librosa/feature.html.

[8] Lim Sin Chee, Ooi Chia Ai and Sazali Yaacob. "Overview of Automatic Stuttering Recognition System". In: (2009), pp. 1-6.

[9] Mel Frequency Cepstrum. url: https://en.wikipedia.org/wiki/Mel-frequency_cepstrum.

[10] Emily Mullin. Why Siri Won't Listen to Millions of People with Disabilities. url: https://www.scientificamerican.com/article/why-siri-won-t-listen-to-millions-of-people-with-disabilities/.

[11] P. Howell, S. Sackin and K. Glenn. "Development of a two-stage procedure for the automatic recognition of dysfluencies in the speech of children who stutter: II. ANN recognition of repetitions and prolongations with supplied word segment markers". In: Journal of Speech, Language, and Hearing Research 40 (1997), p. 1085.

[12] Kishore Prahallad. Speech Technology: A Practical Introduction, Topic: Spectrogram, Cepstrum and Mel-Frequency Analysis. url: http://www.speech.cs.cmu.edu/15-492/slides/03_mfcc.pdf.

[13] UCL Division of Psychology and Language Sciences. UCLASS Release Two. url: http://www.uclass.psychol.ucl.ac.uk/uclass2.htm.

[14] Stuttering Facts and Information. url: http://www.stutteringhelp.org/faq.

[15] T. Tian-Swee, L. Helbin and S. H. Salleh. "Application of Malay speech technology in Malay Speech Therapy Assistance Tools". In: Intelligent and Advanced Systems (2007), pp. 330-334.

[16] Dmitri Tymoczko. "The Generalized Tonnetz". In: Journal of Music Theory 56 (2012), pp. 1-3.

Acknowledgments

As a team, we would like to thank University College London for the work it has put into producing the two large releases of data on stuttered speech over the past decade and a half. Without their work, none of our efforts would have been possible (nor would the efforts of the researchers who came before us in this field). We would also like to thank the CS224S teaching staff, both for teaching us the skills necessary to pursue this project and for providing us guidance.
