Voice Disorder Detection Using Long Short Term Memory (LSTM) Model
Vibhuti Gupta
Department of Computer Science
Texas Tech University, Lubbock, TX 79415
Email: [email protected]
Abstract— Automated detection of voice disorders with computational methods is a recent research area in the medical domain, since accurate diagnosis otherwise requires a rigorous endoscopy. Efficient screening methods are required for the diagnosis of voice disorders so as to provide timely medical care with minimal resources. Detecting voice disorders using computational methods is a challenging problem, since audio data is continuous, which makes extracting relevant features and applying machine learning hard and unreliable. This paper proposes a Long Short Term Memory (LSTM) model to detect pathological voice disorders and evaluates its performance on a real set of 400 testing samples without any labels. Different feature extraction methods are used to provide the best set of features before the LSTM model is applied for classification. The paper describes the approach and experiments, which show promising results with 22% sensitivity, 97% specificity and 56% unweighted average recall.
Keywords— Neoplasm; Phonotrauma; Vocal Paralysis; Long Short Term Memory; Mel frequency cepstral coefficient
I. INTRODUCTION
A voice disorder occurs due to a disturbance in the respiratory, laryngeal, or subglottal vocal tract, or a physiological imbalance among these systems, which causes abnormal voice quality, pitch and loudness as compared to the normal voice of a healthy person [21]. Major voice disorders include vocal nodules, polyps, and cysts (collectively referred to as Phonotrauma); glottis neoplasm; and unilateral vocal paralysis. Voice disorders may affect the social, professional and personal aspects of a person's communication, hindering growth in all these aspects [2].
Current approaches for voice disorder detection require rigorous endoscopy (i.e. laryngeal endoscopy), a multistep examination including mirror examination, rigid and flexible laryngoscopy, and videostroboscopy [1][22]. This rigorous examination requires a lot of expensive medical resources and delays the diagnosis of voice disorders; the resulting delay in treatment worsens the severity of the disease. Sometimes voice disorders remain unidentified, since most people consider them normal, owing to inefficient and slow screening methods. Accuracy in diagnosis is also important in order to cure the correct disorder with proper treatment.

Automated detection of voice disorders is crucial to mitigate these problems, since it makes the diagnosis process simpler, cheaper and less time consuming. Recent research on computerized detection of voice disorders has studied various machine learning techniques and a few deep learning techniques [3-13]. The majority of the previous work deals with machine learning techniques for voice disorder detection [3,4]. [3] used rule based analysis of various acoustic measures such as fundamental frequency, jitter, shimmer etc., and then applied the logistic model tree algorithm, instance based learning and SVM algorithms, while [4] used SVM and decision trees for detecting voice disorders. Muhammad et al. [5] used a Gaussian mixture model (GMM) to classify 6 different types of voice disorders.

Deep learning is widely used nowadays for image recognition, music genre classification and various other applications, and has recently been applied to voice disorder detection tasks [6-10]. Most recently, [6] applied deep neural networks (DNN) for voice disorder detection using a dataset of Far Eastern Memorial Hospital (FEMH) with 60 normal voice samples and 402 various voice disorder samples, and achieved the highest accuracy as compared to other machine learning approaches. The authors of [7] discussed the use of deep neural networks (DNN) in acoustic modeling; they applied DNNs to various speech recognition tasks and found that they perform well. Wu et al. [8] used a convolutional neural network (CNN) for vocal cord paralysis, which is a challenging medical classification problem. Alhussein et al. [9] applied deep learning in a mobile healthcare framework to detect voice disorders.

Despite the success of the above mentioned models, recurrent neural networks (RNN) have not been used for voice disorder tasks. Recurrent neural networks are widely used for speech recognition, music genre classification, natural language processing and sequence prediction problems [11-12]. Long short term memory (LSTM) is a special type of recurrent neural network which is widely used for long term dependencies. [11] used LSTM for voice activity detection, which separates incoming speech from noise. Convolutional neural networks are used along with LSTM in [12] to determine dysarthric speech disorder. To the best of our knowledge, none of these studies used LSTM for the voice disorder detection task.

Our major contributions in this paper are: (1) to propose an approach to detect pathological voice disorders using a Long Short Term Memory (LSTM) model; and (2) to evaluate LSTM performance in differentiating normal and pathological voice samples. The rest of the paper is organized as follows. Section II discusses the material and methods. Section III describes our experimental setup along with results, and Section IV concludes the paper.
II. METHOD
This section provides a brief overview of our proposed approach, with a general description of the Long Short Term Memory (LSTM) model used in our experiments and a description of the dataset and preprocessing.
A. Overview of proposed approach
Our proposed approach starts by loading the input voice samples provided by the Far Eastern Memorial Hospital (FEMH) voice disorder detection challenge [16], as shown in Figure 1. The training dataset includes 50 normal voice samples and 150 samples of common voice disorders, including vocal nodules, polyps, and cysts (collectively referred to as Phonotrauma); glottis neoplasm; and unilateral vocal paralysis.

Fig. 1 Overview of proposed approach (pipeline: loading the FEMH voice disorder detection dataset, feature extraction, trained LSTM model, classification)

The feature extraction process is performed after loading the data; it computes Mel-frequency cepstral coefficients (MFCC), spectral centroid, chroma and spectral contrast features, comprising 33 features for each audio sample. Details are provided in the following sections. The LSTM model is then trained on these features and used for classification.
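As an illustration, the following sketch shows one plausible way to extract such a 33-dimensional feature vector with the librosa library. The per-type breakdown (13 MFCC + 1 spectral centroid + 12 chroma + 7 spectral contrast = 33, averaged over time frames) is our assumption; the paper lists the feature types but not their individual counts.

import numpy as np
import librosa

def extract_features(wav_path, sr=22050, duration=4.0):
    """Extract a 33-dimensional feature vector from one voice sample.

    Assumed breakdown (13 MFCC + 1 centroid + 12 chroma + 7 contrast = 33);
    the paper does not specify the exact per-type counts.
    """
    y, sr = librosa.load(wav_path, sr=sr, duration=duration)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)        # shape (13, T)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)  # shape (1, T)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)          # shape (12, T)
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr)  # shape (7, T)
    # Average each feature over time frames and stack into one vector.
    feats = np.concatenate([
        mfcc.mean(axis=1),
        centroid.mean(axis=1),
        chroma.mean(axis=1),
        contrast.mean(axis=1),
    ])
    return feats  # shape: (33,)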
B. Long Short Term Memory (LSTM) Model
Long Short Term Memory (LSTM) networks are a special type of recurrent neural network capable of learning long term dependencies [17]. A typical LSTM network has 4 layers, i.e. an input layer, 2 hidden layers and one output layer. It contains three gates: the forget gate, the input gate and the output gate.

Fig. 2 LSTM Network
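A minimal sketch of such a network in Keras is shown below. The input shaping (feeding the 33-dimensional feature vector as a length-33 sequence), the hidden layer sizes and the four-class output (normal, neoplasm, Phonotrauma, vocal palsy) are our assumptions, not hyperparameters specified in the paper.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

NUM_FEATURES = 33   # per-sample feature vector length (Section II-A)
NUM_CLASSES = 4     # assumed: normal, neoplasm, Phonotrauma, vocal palsy

# An input layer, two hidden LSTM layers and a dense output layer,
# mirroring the architecture described above.
model = keras.Sequential([
    layers.Input(shape=(NUM_FEATURES, 1)),   # treat features as a sequence
    layers.LSTM(64, return_sequences=True),  # first hidden LSTM layer
    layers.LSTM(64),                         # second hidden LSTM layer
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# X: (n_samples, 33, 1) feature array, y: integer class labels
# model.fit(X, y, epochs=50, batch_size=16, validation_split=0.2)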
The forget gate layer decides what information is kept or thrown away from the cell state. It takes ht-1 and xt as input and outputs a number between 0 and 1 using ft, as in Eqn. (1); a value of 0 indicates "completely remove" and a value of 1 "completely keep".

ft = σ(Wf [ht-1, xt] + bf)   (1)

Next we need to decide what information is stored in the cell state. This has two parts: first, the input gate layer decides which values are to be updated, and then a tanh layer generates a vector of new candidate values to be added. it is the function used by the input gate layer and C̃t is the vector of new candidate values from the tanh layer, as shown in Eqns. (2) and (3).

it = σ(Wi [ht-1, xt] + bi)   (2)

C̃t = tanh(WC [ht-1, xt] + bC)   (3)

The old cell state Ct-1 is then updated to the new cell state Ct, as in Eqn. (4).

Ct = ft * Ct-1 + it * C̃t   (4)

Finally, we need to decide the output using the output gate. First we run the sigmoid layer giving ot, as shown in Eqn. (5), and then its output is multiplied by the tanh of the cell state to get the output, as shown in Eqn. (6).

ot = σ(Wo [ht-1, xt] + bo)   (5)

ht = ot * tanh(Ct)   (6)

outputclass = σ(ht * Woutparameter)   (7)

The output class of the LSTM network is determined by Eqn. (7). Wf, Wi, WC, Wo and Woutparameter are the weights; bf, bi, bC and bo are the biases; ht is the output at time t; xt is the input feature vector; and outputclass is the classification output.
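To make Eqns. (1)-(7) concrete, the following sketch steps a single LSTM cell through one time step in NumPy. The dimensions and random weights are illustrative assumptions only; a trained network would learn these parameters.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_hid = 33, 8                      # illustrative sizes, not from the paper
Wf, Wi, WC, Wo = (rng.normal(size=(n_hid, n_hid + n_in)) for _ in range(4))
bf, bi, bC, bo = (np.zeros(n_hid) for _ in range(4))
W_out = rng.normal(size=n_hid)           # output-layer weights (Eqn. 7)

def lstm_step(x_t, h_prev, C_prev):
    z = np.concatenate([h_prev, x_t])    # [h_{t-1}, x_t]
    f_t = sigmoid(Wf @ z + bf)           # Eqn. (1): forget gate
    i_t = sigmoid(Wi @ z + bi)           # Eqn. (2): input gate
    C_tilde = np.tanh(WC @ z + bC)       # Eqn. (3): candidate values
    C_t = f_t * C_prev + i_t * C_tilde   # Eqn. (4): new cell state
    o_t = sigmoid(Wo @ z + bo)           # Eqn. (5): output gate
    h_t = o_t * np.tanh(C_t)             # Eqn. (6): hidden state
    return h_t, C_t

h, C = np.zeros(n_hid), np.zeros(n_hid)
h, C = lstm_step(rng.normal(size=n_in), h, C)
out = sigmoid(h @ W_out)                 # Eqn. (7): classification output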
C. Dataset and Preprocessing
The dataset comprises 200 samples in the training set and 400 samples in the testing set. Of the 150 common voice disorder samples in the training set, 40 are glottis neoplasm, 60 are Phonotrauma and 50 are vocal palsy. The labels of the training dataset include gender, age, whether the speaker is healthy or not, and the corresponding voice disease.
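For instance, assuming the training labels are distributed as a CSV with file name, gender, age and disease columns (a hypothetical layout; the actual FEMH format may differ), the training matrix could be assembled as follows, reusing the extract_features sketch from Section II-A:

import numpy as np
import pandas as pd

# Hypothetical label file layout; the real FEMH format may differ.
labels = pd.read_csv("train_labels.csv")  # columns: file, gender, age, disease

CLASS_IDS = {"normal": 0, "neoplasm": 1, "phonotrauma": 2, "vocal_palsy": 3}

X = np.stack([extract_features(f"train/{name}") for name in labels["file"]])
y = labels["disease"].str.lower().map(CLASS_IDS).to_numpy()

print(X.shape, y.shape)  # expected: (200, 33) (200,)
# For the LSTM sketch above, add a channel axis: X = X[..., np.newaxis]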
Voice samples of a 3-second sustained vowel sound were recorded at a comfortable level of loudness, with a microphone-to-mouth distance of approximately 15–20 cm, using a high-quality microphone (Model: SM58, SHURE, IL) with a digital amplifier (Model: X2u, SHURE), under a background noise level between 40 and 45 dBA. The sampling rate was 44,100 Hz with 16-bit resolution, and data were saved in an uncompressed .wav format as used in [6]. Further dataset information is given in [6][16].
Voice samples are visualized using waveforms, as shown in Figures 3-6; the y-axis represents the amplitude of the voice sample and the x-axis the time duration. We plotted a 4-second duration of each type of voice sample at a sampling rate of 22050 Hz.

Fig. 5 Waveform of Phonotrauma Voice disorder

Fig. 6 Waveform of Vocal Palsy Voice disorder
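A sketch of how such waveform plots can be produced with librosa and matplotlib is given below; the file names are placeholders for the FEMH samples, which are not named in the paper.

import librosa
import librosa.display
import matplotlib.pyplot as plt

# Placeholder file names; the FEMH audio files are not named in the paper.
samples = {
    "Normal": "normal.wav",
    "Neoplasm": "neoplasm.wav",
    "Phonotrauma": "phonotrauma.wav",
    "Vocal Palsy": "vocal_palsy.wav",
}

fig, axes = plt.subplots(len(samples), 1, figsize=(8, 10), sharex=True)
for ax, (label, path) in zip(axes, samples.items()):
    # Load 4 seconds of audio resampled to 22050 Hz, as in the paper.
    y, sr = librosa.load(path, sr=22050, duration=4.0)
    librosa.display.waveshow(y, sr=sr, ax=ax)  # amplitude vs. time
    ax.set_title(f"Waveform of {label} voice sample")
    ax.set_ylabel("Amplitude")
axes[-1].set_xlabel("Time (s)")
plt.tight_layout()
plt.show()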