
Voice Disorder Detection Using Long Short Term Memory (LSTM) Model

Vibhuti Gupta
Department of Computer Science
Texas Tech University, Lubbock, TX 79415
Email: [email protected]

Abstract— Automated detection of voice disorders with computational methods is a recent research area in the medical domain, since accurate diagnosis otherwise requires rigorous endoscopy. Efficient screening methods are required for the diagnosis of voice disorders, so as to provide timely medical care with minimal resources. Detecting voice disorders with computational methods is a challenging problem: audio data is continuous, which makes extracting relevant features and applying machine learning hard and unreliable. This paper proposes a long short term memory (LSTM) model to detect pathological voice disorders and evaluates its performance on 400 real, unlabeled testing samples. Different feature extraction methods are used to provide the best set of features before the LSTM model is applied for classification. The paper describes the approach and experiments, which show promising results with 22% sensitivity, 97% specificity and 56% unweighted average recall.

Keywords— Neoplasm; Phonotrauma; Vocal Paralysis; Long Short Term Memory; Mel frequency cepstral coefficient

I. INTRODUCTION

A voice disorder occurs due to a disturbance in the respiratory, laryngeal, or subglottal vocal tract, or a physiological imbalance among these systems, which causes abnormal voice quality, pitch and loudness compared to the normal voice of a healthy person [21]. Major voice disorders include vocal nodules, polyps, and cysts (collectively referred to as phonotrauma); glottis neoplasm; and unilateral vocal paralysis. Voice disorders may affect the social, professional and personal aspects of a person's communication, hindering growth in all these aspects [2].

Current approaches for voice disorder detection require rigorous endoscopy (i.e. laryngeal endoscopy), a multistep examination including mirror examination, rigid and flexible laryngoscopy, and videostroboscopy [1][22]. This rigorous examination requires expensive medical resources and delays the diagnosis of voice disorders; the resulting delay in treatment worsens the severity of the disease. Sometimes voice disorders remain unidentified, since most people consider them normal due to inefficient and slow screening methods. Accuracy in diagnosis is also important so that the correct disorder is treated properly.

Automated detection of voice disorders is crucial to mitigate these problems, since it makes the diagnosis process simpler, cheaper and less time consuming. Recent research on computerized detection of voice disorders has studied various machine learning techniques and a few deep learning techniques [3-5, 6-10, 11-13]. The majority of previous work deals with machine learning techniques for voice disorder detection [3,4]. [3] used rule-based analysis of various acoustic measures such as fundamental frequency, jitter and shimmer, and then applied the logistic model tree algorithm, instance-based learning and SVM algorithms, while [4] used SVM and decision trees for detecting voice disorders. Muhammad et al. [5] used a Gaussian mixture model (GMM) to classify 6 different types of voice disorders.

Deep learning is widely used nowadays for image recognition, music genre classification and various other applications, and it has recently been used for voice disorder detection tasks [6-10]. Most recently, [6] applied deep neural networks (DNN) for voice disorder detection using a dataset of Far Eastern Memorial Hospital (FEMH) with 60 normal voice samples and 402 voice disorder samples of various types, and achieved the highest accuracy compared to other machine learning approaches. The authors of [7] discussed the use of deep neural networks in acoustic modeling; they applied DNNs to various speech recognition tasks and found that they perform well. Wu et al. [8] used a convolutional neural network (CNN) for vocal cord paralysis, a challenging medical classification problem. Alhussein et al. [9] applied deep learning in a mobile healthcare framework to detect voice disorders.

Despite the success of the above mentioned models, recurrent neural networks (RNN) have not been used for voice disorder tasks. Recurrent neural networks are widely used for speech recognition, music genre classification, natural language processing and sequence prediction problems [11-12]. Long short term memory (LSTM) is a special type of recurrent neural network which is widely used for long term dependencies. [11] used LSTM for voice activity detection, which separates incoming speech from noise. Convolutional neural networks are used along with LSTM in [12] to recognize dysarthric speech. To the best of our knowledge, none of these studies has used LSTM for the voice disorder detection task.

Our major contributions in this paper are: (1) to propose an approach to detect pathological voice disorders using a long short term memory (LSTM) model; and (2) to evaluate LSTM performance in differentiating normal and pathological voice samples. The rest of the paper is organized as follows. Section II discusses the material and methods. Section III describes our experimental setup along with results, and Section IV concludes the paper.

II. METHOD

This section provides a brief overview of our proposed approach, a general description of the long short term memory (LSTM) model used in our experiments, and a description of the dataset and preprocessing.

A. Overview of proposed approach

Our proposed approach starts by loading the input voice samples provided by the Far Eastern Memorial Hospital (FEMH) voice disorder detection challenge [16], as shown in Figure 1. Our training dataset includes 50 normal voice samples and 150 samples of common voice disorders, including vocal nodules, polyps, and cysts (collectively referred to as phonotrauma); glottis neoplasm; and unilateral vocal paralysis.

Fig. 1 Overview of proposed approach: Loading FEMH voice disorder detection dataset → Feature extraction → Long short term memory (LSTM) training → Trained LSTM model → Classification

The feature extraction process is carried out after loading the data and produces Mel-frequency cepstral coefficient (MFCC), spectral centroid, chroma and spectral contrast features, comprising 33 features for each audio sample. Details are provided in the following sections. The LSTM model is then trained and used for classification.

B. Long Short Term Memory (LSTM) Model

Long short term memory (LSTM) networks are a special type of recurrent neural network capable of learning long term dependencies [17]. A typical LSTM network has 4 layers, i.e. an input layer, 2 hidden layers and one output layer. It contains three gates: a forget gate, an input gate and an output gate.

Fig. 2 LSTM Network

The forget gate layer decides what information is kept or thrown away from the cell state. It takes h_{t-1} and x_t as input and outputs a number between 0 and 1 via f_t, as in Eqn (1); a value of 0 means "completely remove this" and a value of 1 means "completely keep this".

f_t = σ(W_f [h_{t-1}, x_t] + b_f)    (1)

Next we decide what information is stored in the cell state. This has two parts: first, the input gate layer decides which values are updated, and then a tanh layer generates a vector of new candidate values to be added. i_t is the function used by the input gate layer and C̃_t is the vector of new candidate values produced by the tanh layer, as shown in Eqns (2) and (3).

i_t = σ(W_i [h_{t-1}, x_t] + b_i)    (2)

C̃_t = tanh(W_C [h_{t-1}, x_t] + b_C)    (3)

The updated cell state is shown in Eqn (4):

C_t = f_t * C_{t-1} + i_t * C̃_t    (4)

Finally, we decide what the output will be using the output gate. First we run a sigmoid layer to obtain o_t, as shown in Eqn (5), and then multiply its output by the tanh of the cell state to get the output, as shown in Eqn (6).

o_t = σ(W_o [h_{t-1}, x_t] + b_o)    (5)

h_t = o_t * tanh(C_t)    (6)

outputclass = σ(h_t * W_outparameter)    (7)

The output class of the LSTM network is determined by Eqn (7). W_f, W_i, W_C, W_o and W_outparameter are the weights; b_f, b_i, b_C and b_o are the biases; h_t is the output at time t; x_t are the input features; and outputclass is the classification output.
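As a concrete rendering of Eqns (1)-(7), the sketch below runs one LSTM cell step in NumPy. The dimensions (33 input features, 128 hidden units, 4 output classes) follow the network described in this paper, but the weights and the input vector are randomly initialized dummy values, not a trained model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM cell step following Eqns (1)-(6); each W maps [h_{t-1}, x_t]."""
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])       # Eqn (1): forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])       # Eqn (2): input gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])   # Eqn (3): candidate values
    c_t = f_t * c_prev + i_t * c_tilde       # Eqn (4): updated cell state
    o_t = sigmoid(W["o"] @ z + b["o"])       # Eqn (5): output gate
    h_t = o_t * np.tanh(c_t)                 # Eqn (6): hidden output
    return h_t, c_t

rng = np.random.default_rng(0)
n_in, n_hid = 33, 128                        # 33 features; first hidden layer size
W = {k: rng.standard_normal((n_hid, n_hid + n_in)) * 0.1 for k in "fico"}
b = {k: np.zeros(n_hid) for k in "fico"}
x_t = rng.standard_normal(n_in)
h_t, c_t = lstm_step(x_t, np.zeros(n_hid), np.zeros(n_hid), W, b)

W_out = rng.standard_normal((n_hid, 4)) * 0.1
out = sigmoid(h_t @ W_out)                   # Eqn (7): 4 output classes
print(h_t.shape, c_t.shape, out.shape)       # → (128,) (128,) (4,)
```

Because o_t lies in (0, 1) and tanh(C_t) in (-1, 1), every entry of h_t is strictly inside (-1, 1), which is what keeps the recurrence numerically stable over long sequences.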
C. Dataset and Preprocessing

The dataset comprises 200 samples in the training set and 400 samples in the testing set. Of the 150 common voice disorder samples in the training set, 40 are glottis neoplasm, 60 are phonotrauma and 50 are vocal palsy. The labels of the training dataset include gender, age, whether the speaker is healthy or not, and the corresponding voice disease.

Voice samples of a 3-second sustained vowel sound were recorded at a comfortable level of loudness, with a microphone-to-mouth distance of approximately 15–20 cm, using a high-quality microphone (Model: SM58, SHURE, IL) with a digital amplifier (Model: X2u, SHURE), under a background noise level between 40 and 45 dBA. The sampling rate was 44,100 Hz with a 16-bit resolution, and data were saved in an uncompressed .wav format as used in [6]. Further dataset information is given in [6][16].

Voice samples are visualized using waveforms, as shown in Figures 3, 4, 5 and 6, where the y-axis represents the amplitude of the voice sample and the x-axis the time duration. We plotted a 4-second duration of each type of voice sample with a sampling rate of 22050 Hz.

Fig. 5 Waveform of Phonotrauma Voice disorder Sample

Fig. 6 Waveform of Vocal Palsy Voice disorder Sample
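For illustration, the recording format described above (44,100 Hz sampling rate, 16-bit PCM, saved as uncompressed .wav) can be inspected with Python's standard-library wave module. The file below is a synthetic stand-in tone, not an FEMH sample, and its name is hypothetical.

```python
import math
import struct
import wave

def describe_wav(path):
    """Return (sample_rate, bit_depth, duration_seconds) of a PCM .wav file."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()              # e.g. 44100 Hz for FEMH recordings
        bits = 8 * w.getsampwidth()          # e.g. 16-bit resolution
        duration = w.getnframes() / rate     # e.g. 3 s sustained vowel
        return rate, bits, duration

# Write a synthetic 3-second, 44.1 kHz, 16-bit mono file matching the format.
rate, secs, freq = 44100, 3, 220.0
with wave.open("demo_vowel.wav", "wb") as w:
    w.setnchannels(1)                        # mono
    w.setsampwidth(2)                        # 2 bytes per sample = 16-bit
    w.setframerate(rate)
    frames = b"".join(
        struct.pack("<h", int(20000 * math.sin(2 * math.pi * freq * n / rate)))
        for n in range(rate * secs))
    w.writeframes(frames)

print(describe_wav("demo_vowel.wav"))        # → (44100, 16, 3.0)
```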

Data preprocessing consists of feature extraction using several methods: Mel-frequency cepstral coefficients (MFCC), spectral centroid, chroma and spectral contrast. For each voice sample we extracted 33 features by combining all of the feature extraction techniques. A brief overview of each technique is provided below.

a) Mel Frequency Cepstral Coefficients (MFCC): MFCC features are widely used in music genre classification, audio classification and speech recognition tasks, so we used them in this work. We extracted 13 MFCC features from each voice sample. More details on extracting MFCC features can be found at [18].

Fig. 3 Waveform of Normal Voice Sample

b) Spectral Centroid: The spectral centroid provides the center of mass of the spectrum; in audio processing it is commonly associated with the perceived brightness of a sound. One feature is extracted from each audio sample using the spectral centroid. More details can be found at [19].
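As a concrete illustration of the spectral centroid, the sketch below computes the center of mass of the magnitude spectrum, sum(f * |X(f)|) / sum(|X(f)|), with NumPy. For a pure sinusoid the centroid falls at the tone's frequency; the signal parameters here are illustrative, not taken from the FEMH data.

```python
import numpy as np

def spectral_centroid(x, sr):
    """Center of mass of the magnitude spectrum of signal x sampled at sr Hz."""
    mag = np.abs(np.fft.rfft(x))                 # one-sided magnitude spectrum
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)  # bin center frequencies in Hz
    return float(np.sum(freqs * mag) / np.sum(mag))

sr = 22050                                   # sampling rate used for the plots
t = np.arange(sr) / sr                       # exactly 1 second of audio
tone = np.sin(2 * np.pi * 440.0 * t)         # pure 440 Hz tone
print(round(spectral_centroid(tone, sr)))    # → 440
```

With a 1-second window the FFT bins are spaced 1 Hz apart, so the 440 Hz tone lands exactly on one bin and the centroid recovers it; for real voice recordings the centroid instead summarizes where the spectral energy is concentrated.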

c) Chroma: Chroma provides a chromagram from a waveform. 12 chroma features are extracted from each audio sample. More details can be found at [20].

d) Spectral Contrast: Spectral contrast represents the spectral characteristics of an audio sample. We extracted 7 spectral contrast features from each audio sample, bringing the combined total to 33. More details can be found at [13].

Fig. 4 Waveform of Neoplasm Voice disorder Sample

As shown in Figures 3 and 4, the amplitude of the normal voice sample fluctuates while that of the neoplasm disorder waveform does not. The phonotrauma and vocal palsy waveforms shown in Figures 5 and 6 show variations compared to the normal and neoplasm voice samples: amplitudes for the phonotrauma disorder fluctuate and increase, while for vocal palsy they decrease. A normal voice sample is easily distinguishable from the phonotrauma and vocal palsy disorders, but not as easily from the neoplasm disorder.
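Putting the four extractors together, the per-sample feature vector concatenates 13 MFCC + 1 spectral centroid + 12 chroma + 7 spectral contrast values, giving the 33 features stated above. The sketch below demonstrates only this bookkeeping; the per-technique vectors are random stand-ins, not real extractor output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in per-technique vectors; real values would come from the extractors
# described above, summarized over the time frames of each recording.
mfcc     = rng.standard_normal(13)   # a) 13 MFCC features
centroid = rng.standard_normal(1)    # b) 1 spectral centroid feature
chroma   = rng.standard_normal(12)   # c) 12 chroma features
contrast = rng.standard_normal(7)    # d) 7 spectral contrast features

features = np.concatenate([mfcc, centroid, chroma, contrast])
print(features.shape)                # → (33,)
```

This 33-dimensional vector is what feeds the input layer of the LSTM network described in the next section.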
III. EXPERIMENTS AND RESULTS

This section presents the experiments and results that evaluate the effectiveness of our approach. The design of our LSTM network is shown in Table I: one input layer taking all 33 features extracted from each voice sample; 2 hidden layers, of which the first has 128 neurons and the second 32; and one output layer predicting whether the voice sample is normal or has a voice disorder. We used the same experimental setup as [14], which was used for music genre classification, as it provided promising results.

Table I: Design of LSTM Network

Input Layer: 33 input features extracted from audio samples
Hidden Layer I: 128 neurons
Hidden Layer II: 32 neurons
Output Layer: 4 outputs corresponding to 3 different voice disorders and 1 normal voice

For training the LSTM model, the optimizer used is Adam [15], while different batch sizes and numbers of epochs were tried to get the best results. Categorical cross entropy is used as the loss function to measure the performance of the classification model at each epoch. Increasing the number of epochs helps improve the performance of the model.

Table II: Results of two phases

Result Phase | Sensitivity | Specificity | UAR
I | 30% | 95.7% | 54%
II | 22% | 97.1% | 56%

Table II shows the results obtained in the two phases of the FEMH Big Data Cup challenge. Specificity is high in both phases but sensitivity is low; sensitivity measures the true positive rate, while specificity measures the true negative rate. The results therefore show that people with normal voices are identified correctly more often than people with a voice disorder. The unweighted average recall (UAR), the mean of the per-class recalls, increases with the number of epochs: in Phase I we ran the experiment for 500 epochs, while in Phase II we ran it for 5000 epochs.

Our results show that our approach works reasonably well but requires more optimization in the future for better results.

IV. CONCLUSION

This study presents a long short term memory (LSTM) approach to detect pathological voice disorders. The results show that it performs reasonably in detecting the disorders, and the different feature extraction techniques show that these features can be beneficial for voice disorder detection. Future work includes more experiments with different hyperparameters, and the use of other feature extraction techniques, to further improve the results.

REFERENCES

[1] Schwartz SR, Cohen SM, Dailey SH, et al. Clinical practice guideline: hoarseness (dysphonia). Otolaryngol Head Neck Surg. 2009;141:S1–S31.
[2] Hegde, S., Shetty, S., Rai, S., & Dodderi, T. (2018). A Survey on Machine Learning Approaches for Automatic Detection of Voice Disorders. Journal of Voice.
[3] Cesari, U., De Pietro, G., Marciano, E., Niri, C., Sannino, G., & Verde, L. (2018). Voice Disorder Detection via an m-Health System: Design and Results of a Clinical Study to Evaluate Vox4Health. BioMed Research International, 2018.
[4] Verde, L., De Pietro, G., & Sannino, G. (2018). Voice Disorder Identification by using Machine Learning Techniques. IEEE Access.
[5] Muhammad G, Mesallam TA, Malki KH, et al. Multidirectional regression (MDR)-based features for automatic voice disorder detection. J Voice. 2012;26:817.e19–817.e27.
[6] Fang, S. H., Tsao, Y., Hsiao, M. J., Chen, J. Y., Lai, Y. H., Lin, F. C., & Wang, C. T. (2018). Detection of Pathological Voice Using Cepstrum Vectors: A Deep Learning Approach. Journal of Voice.
[7] Hinton G, Deng L, Yu D, et al. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine. 2012;29:82–97.
[8] Wu, H., Soraghan, J., Lowit, A., & Di Caterina, G. (2018). A deep learning method for pathological voice detection using convolutional deep belief networks. Interspeech 2018.
[9] Alhussein, M., & Muhammad, G. (2018). Voice Pathology Detection Using Deep Learning on Mobile Healthcare Framework. IEEE Access, 6, 41034–41041.
[10] Harar, P., Alonso-Hernandez, J. B., Mekyska, J., Galaz, Z., Burget, R., & Smekal, Z. (2017, July). Voice pathology detection using deep learning: a preliminary study. In 2017 International Conference and Workshop on Bioinspired Intelligence (IWOBI) (pp. 1–4). IEEE.
[11] Kim, J., Kim, J., Lee, S., Park, J., & Hahn, M. (2016, November). Vowel based voice activity detection with LSTM recurrent neural network. In Proceedings of the 8th International Conference on Signal Processing Systems (pp. 134–137). ACM.
[12] Kim, M., Cao, B., An, K., & Wang, J. (2018). Dysarthric Speech Recognition Using Convolutional LSTM Neural Network. Proc. Interspeech 2018, 2948–2952.
[13] Jiang, D. N., Lu, L., Zhang, H. J., Tao, J. H., & Cai, L. H. (2002). Music type classification by spectral contrast feature. In Proceedings of the 2002 IEEE International Conference on Multimedia and Expo (ICME'02) (Vol. 1, pp. 113–116). IEEE.
[14] Tang, C. P., Chui, K. L., Yu, Y. K., Zeng, Z., & Wong, K. H. (2018). Music Genre classification using a hierarchical Long Short Term Memory (LSTM) model.
[15] Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[16] FEMH challenge, accessed August 2018. URL: https://femh-challenge2018.weebly.com/
[17] Christopher Olah. Understanding LSTM Networks. GitHub blog, posted on August 27, 2015.
[18] https://en.wikipedia.org/wiki/Mel-frequency_cepstrum
[19] https://en.wikipedia.org/wiki/Spectral_centroid
[20] https://labrosa.ee.columbia.edu/matlab/chroma-ansyn/
[21] https://www.asha.org/practice-portal/clinical-topics/voice-disorders/
[22] https://voicefoundation.org/health-science/voice-disorders/overview-of-diagnosis-treatment-prevention/
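The metrics reported in Table II follow the standard definitions: sensitivity is the true positive rate, specificity the true negative rate, and UAR the unweighted mean of per-class recalls (with the paper's four output classes, UAR would average four recalls rather than two). The two-class sketch below uses made-up confusion-matrix counts for illustration, not the paper's actual results.

```python
def sensitivity_specificity_uar(tp, fn, tn, fp):
    """Sensitivity = TP/(TP+FN); specificity = TN/(TN+FP); for a binary
    split, UAR is the unweighted mean of these two per-class recalls."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    uar = (sens + spec) / 2.0
    return sens, spec, uar

# Illustrative counts only; NOT the FEMH challenge confusion matrix.
sens, spec, uar = sensitivity_specificity_uar(tp=22, fn=78, tn=97, fp=3)
print(round(sens, 3), round(spec, 3), round(uar, 3))   # → 0.22 0.97 0.595
```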
