


Speech Communication 99 (2018) 90–100

Contents lists available at ScienceDirect

Speech Communication
journal homepage: www.elsevier.com/locate/specom

Acoustic characterization and perceptual analysis of the relative importance of prosody in speech of people with Down syndrome

Mario Corrales-Astorgano*, David Escudero-Mancebo, César González-Ferreras
Departamento de Informática, Universidad de Valladolid, Valladolid, Spain

A R T I C L E  I N F O

Keywords:
Speech characterization
Prosody
Down syndrome
Intellectual disabilities
Automatic classification
Perceptual test

A B S T R A C T

There are many studies that identify important deficits in the voice production of people with Down syndrome. These deficits affect not only the spectral domain, but also intonation, accent, rhythm and speech rate. The main aim of this work is the identification of the acoustic features that characterize the speech of people with Down syndrome, taking into account the different frequency, energy, temporal and spectral domains. The comparison of the relative weight of these features for the characterization of Down syndrome people's speech is another aim of this study. The openSmile toolkit with the GeMAPS feature set was used to extract acoustic features from a speech corpus of utterances from typically developing individuals and individuals with Down syndrome. Then, the most discriminant features were identified using statistical tests. Moreover, three binary classifiers were trained using these features. The best classification rate is 87.33% using only spectral features, and 91.83% using frequency, energy and temporal features. Finally, a perception test was performed using recordings created with a prosody transfer algorithm: the prosody of utterances from one group of speakers was transferred to utterances of another group. The results of this test show the importance of intonation and rhythm in the identification of a voice as non-typical. In conclusion, the results obtained point to the training of prosody as a way to improve the quality of the speech production of those with Down syndrome.

1. Introduction

Individuals with Down syndrome (DS) have problems in their language development that make their social relationships and their developmental ability more problematic (Cleland et al., 2010; Martin et al., 2009; Chapman, 1997). Many DS individuals have physiological peculiarities that affect their voice production, such as a smaller vocal tract with respect to the tongue size or a soft palatal shape, among others (Guimaraes et al., 2008). Muscular hypotonia also affects their capabilities for performing a correct articulation, degrading the quality of the spectral characteristics of sounds (Markaki and Stylianou, 2011). In addition, hearing loss during childhood (Shott et al., 2001) and fluency deficits (Devenny and Silverman, 1990) influence the frequency, energy and temporal domains of the voice signal.

Although problems derived from physiological peculiarities are permanent (even if surgery (Leshin, 2000) or prostheses (Bhagyalakshmi et al., 2007) could ameliorate them), intonation and fluency deficits can be improved by speech therapy and training. There are tools available for this goal (Saz et al., 2009b; González-Ferreras et al., 2017) based on perception and production activities to be performed with the assistance of therapists who help patients to properly manage their breathing and intonation patterns. Although there is general consensus about the importance of improving prosody by training (see Kent and Vorperian, 2013 for a complete state-of-the-art review), there are very few works that provide empirical evidence of the importance of the prosody-related features (those belonging to the fundamental frequency, energy and duration domains) with respect to other acoustic features belonging to the spectral domain.

The use of the video game described by González-Ferreras et al. (2017) has allowed the formation of a speech corpus, which has been used in this work to analyze and characterize the speech of people with Down syndrome. This corpus, described in Section 3.1, contains recordings of people with Down syndrome and of typically developing people. Both groups recorded the same sentences, so statistical and perceptual tests have been used to compare the acoustic features of the two groups of speakers, so that the most relevant differences could be identified.

This work aims to find the best acoustic features to characterize the speech of people with Down syndrome. To do this, features of the frequency, energy, temporal and spectral domains have been extracted from the recordings of the gathered corpus. In addition, the relative


Corresponding author.
E-mail addresses: [email protected] (M. Corrales-Astorgano), [email protected] (D. Escudero-Mancebo), [email protected] (C. González-Ferreras).

https://doi.org/10.1016/j.specom.2018.03.006
Received 18 December 2017; Received in revised form 21 February 2018; Accepted 13 March 2018
Available online 14 March 2018
0167-6393/ © 2018 Elsevier B.V. All rights reserved.

weight of each domain in the characterization of people with Down syndrome has been included in this paper, especially the comparison between the spectral and the other domains.

The methodology described above was developed to answer two main research questions (RQ):

• RQ1: Which acoustic features best discriminate between the recordings of speakers with Down syndrome and those of typically developing speakers?
  Issue 1.1: Are there statistical differences between these features?
  Issue 1.2: Are these differences in accordance with what is expected or described in the state of the art?
• RQ2: What is the relative weight of the spectral features in comparison with the rest of the domains?
  Issue 2.1: What is the relative weight of the different features when identifying atypical speech using automatic classifiers?
  Issue 2.2: What is the relative weight of the different domains when identifying atypical speech in a perceptual test?

The structure of the article is as follows. Section 2 reviews related works from the state of the art and presents the innovation of our proposal. Section 3 describes the experimental procedure, including the corpus description, the feature extraction process, the automatic classification experiment and the perceptual test. Section 4 presents the statistical test results for the different domain features, the automatic classification results and the perceptual test results. Finally, Section 5 presents the discussion and Section 6 the conclusions.

2. Background and related work

The age of the population selected for the study seems to be important for the results obtained, due to the physiological differences between children and adults. Concerning adults, Lee et al. (2009), Rochet-Capellan and Dohen (2015), Albertini et al. (2010) and Corrales-Astorgano et al. (2016) found significantly higher F0 values in adults with Down syndrome as compared to adults without intellectual disabilities. In addition, Lee et al. (2009) and Seifpanahi et al. (2011) found lower jitter (frequency perturbations) in adult speakers with Down syndrome. As for energy, Albertini et al. (2010) found significantly lower energy values in adults with Down syndrome. Moreover, Saz et al. (2009a) concluded that adults with Down syndrome had poor control over energy in stressed versus unstressed vowels. Albertini et al. (2010) found lower shimmer (amplitude perturbations) in male adults with Down syndrome than in adults without intellectual disabilities. Finally, temporal domain results depend on the unit of analysis employed. Saz et al. (2009a) found that people with cognitive disorders presented an excessive variability in vowel duration, while Rochet-Capellan and Dohen (2015) and Bunton and Leddy (2011) reported longer durations of vowels in adults with Down syndrome. Albertini et al. (2010) discovered a lower duration of words in male adults with Down syndrome. Moreover, people with Down syndrome present some disfluency problems. Although disfluency (stuttering or cluttering) has not been demonstrated to be a universal characteristic of Down syndrome, it is a common problem in this population (Van Borsel and Vandermeulen, 2008; Devenny and Silverman, 1990; Eggers and Van Eerdenbrugh, 2017). These disfluencies can affect the speech rhythm of people with Down syndrome.

On the other hand, Zampini et al. (2016) indicated that children with Down syndrome had lower F0 than children without intellectual disabilities. Moura et al. (2008) found higher jitter in children with Down syndrome than in children without intellectual disabilities. In terms of energy, Moura et al. (2008) indicated higher shimmer in children with Down syndrome than in children without intellectual disabilities.

The units of analysis and the phonation tasks used by these researchers differ. Rochet-Capellan and Dohen (2015) used vowel-consonant-vowel syllables, Saz et al. (2009a) and Albertini et al. (2010) recorded words, Rodger (2009) and Zampini et al. (2016) built their corpora using semi-spontaneous speech, and Corrales-Astorgano et al. (2016) analyzed sentences. Lee et al. (2009) combined words, reading and natural speech. The majority of the studies are focused on the English language (Kent and Vorperian, 2013), but there are others focused on Italian (Zampini et al., 2016; Albertini et al., 2010), Spanish (Corrales-Astorgano et al., 2016; Saz et al., 2009a), French (Rochet-Capellan and Dohen, 2015) or Farsi (Seifpanahi et al., 2011).

The use of spectral features to assess pathological voice has frequently been applied in the literature. Dibazar et al. (2006) used MFCCs and pitch frequency with a hidden Markov model (HMM) classifier for the assessment of normal versus pathological voice using one vowel as the unit of analysis. Markaki and Stylianou (2011) suggested the use of modulation spectra for the detection and classification of voice pathologies. Markaki and Stylianou (2010) created a method for the objective assessment of hoarse voice quality, based on modulation spectra, using a corpus of sustained vowels. Voice quality was evaluated using the long term average spectrum (LTAS) and alpha ratio by Leino (2009). Although these works do not refer to people with Down syndrome, they do address some aspects that appear in this kind of speaker, and we return to them in the discussion section.

Formant frequency and amplitude have also been studied in people with Down syndrome. A larger vowel space in people with Down syndrome was found by Rochet-Capellan and Dohen (2015), while other studies denoted a reduction of the vowel space in children (Moura et al., 2008) and adults (Bunton and Leddy, 2011). Moreover, the voice of people with Down syndrome showed significantly reduced formant amplitude intensity levels (Pentz Jr, 1987).

In order to compare our study with the state of the art, a summary of other similar studies is shown in Table 1. A description of the corpora employed by these studies is shown in Table 2. To the best of our

Table 1
Results of different studies in the state of the art.

Author | Group | Frequency | Duration | Loudness
Rodger (2009) | Adults and children | No differences | |
Zampini et al. (2016) | Children | Lower F0. Good control for linguistics, low for pragmatics. | |
Saz et al. (2009a) | Adults and children | Good control in pronounced vowels | Longer pronounced vowels. Dispersed mispronounced vowels | Low control of intensity in unstressed vowels
Albertini et al. (2010) | Adults | Higher F0 | Lower duration (only for men) | Lower energy. Lower shimmer (only men)
Rochet-Capellan and Dohen (2015) | Adults | Higher F0 | Longer vowels |
Lee et al. (2009) | Adults | Smaller pitch range. Higher F0. Lower jitter. | |
Corrales-Astorgano et al. (2016) | Adults | Higher F0 excursions | More pauses to complete turns | Different range


Table 2
Description of the corpora used in the state of the art.

Author | Group | Down syndrome | Control | Type | Size | Language
Rodger (2009) | Adults and children | 22 | 52 | Semi-spontaneous | 5 picture descriptions per speaker | English
Zampini et al. (2016) | Children | 9 | 12 | Semi-spontaneous | 20 minutes per speaker | Italian
Saz et al. (2009a) | Adults and children | 3 | 168 | Words | 9576 words (6 hours) control; 684 words (38 minutes) Down syndrome | Spanish
Albertini et al. (2010) | Adults | 30 | 60 | Words | NA | Italian
Rochet-Capellan and Dohen (2015) | Adults | 8 | 8 | Vowel-consonant-vowel | 144 per speaker | French
Lee et al. (2009) | Adults | 9 | 9 | Vowels, reading, natural speech | 3 vowels, 1 reading and 1 minute of natural speech per speaker | English
Corrales-Astorgano et al. (2016) | Adults | 18 | 20 | Sentences | 479 utterances | Spanish

knowledge, our study is one of the first to analyze features from the frequency, energy, temporal and spectral domains together. These features were extracted from the same recordings, which can help in the study of the relative importance of each domain in the characterization of the speech of people with Down syndrome. The use of a standard feature set (the extended Geneva Minimalistic Acoustic Parameter Set, eGeMAPS; detailed in Section 3.2 and Appendix A) can reduce the dependence on the extraction methodology, which makes it easier to compare the results of different studies.

Perceptual studies show mixed results. Moura et al. (2008) described the voice of children with Down syndrome as being statistically different from the voice of children without intellectual disabilities in five speech problems: grade, roughness, breathiness, asthenic speech and strained speech. Moran and Gilbert (1982) judged the voice quality of adults with Down syndrome as hoarse. In addition, Rodger (2009) noted discrepancies between perceptual judgments of pitch level and acoustic measures of F0. In our study, we did not want to compare each acoustic measure with a perceptual judgment of the same feature. Our aim is the assessment of the relevance of each domain in the identification of a recording as being from a person with Down syndrome, using automatic classifiers and perceptual tests.

3. Experimental procedure

Fig. 1 shows the experimental methodology that we have followed. Firstly, the speech corpus recorded by people with Down syndrome and by typically developing people was gathered. Secondly, acoustic features were extracted from all the recordings of each corpus and a statistical test to analyze the differences between groups was carried out. Finally, the automatic classification experiment was carried out, in which the features with significant differences were used.

3.1. Corpus collection

We developed a computer video game to improve the prosodic and communication skills of people with Down syndrome (González-Ferreras et al., 2017). This video game is a graphic adventure game where users have to use the computer mouse to interact with the elements on the screen, listen to audio instructions and sentences from the characters of the game, and record utterances using a microphone in different contexts. The video game was designed using an iterative methodology in collaboration with a school of special education located in Valladolid (Spain). The feedback provided by teachers of special education was complemented by research into the difficulties this population has in using information and communication technologies. They have some difficulties, such as attention deficit (Martínez et al., 2011), lack of motivation (Wuang et al., 2011), or problems with short-term memory (Chapman and Hesketh, 2001), that had to be taken into account when developing the video game. The game was developed for the Spanish language.

Inside the narrative of the game, some learning activities were included to practice communication skills. There are three different types of activities: comprehension, production and visual. Firstly, the comprehension activities are focused on lexical-semantic comprehension and on improving prosodic perception in specific contexts. Secondly, production activities are focused on oral production, so the players are encouraged by the game to train their speech, keeping in mind such prosodic aspects as intonation, expression of emotions or syllabic emphasis. At the beginning of these activities, the video game introduces the context where the sentence has to be said. Then, the game plays the sentence and the player must utter the sentence while it is shown on the screen. The production activities include affirmative, exclamatory and interrogative sentences. Finally, visual activities include other activities designed to add variety to the game and to reduce the feeling of

Fig. 1. Scheme of the experimental procedure which includes corpus collection, feature extraction and automatic classification.
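The statistical-comparison step of this pipeline uses the Mann-Whitney test (Section 3.2). As a rough illustration of the rank-sum statistic behind that test, here is a minimal pure-Python sketch; the per-recording feature values are invented toy numbers, and a real analysis would use a statistics package that also supplies the p-value.

```python
from itertools import chain

def mann_whitney_u(xs, ys):
    """Return (U1, U2), the Mann-Whitney U statistics for two samples.

    Ties receive midranks; the smaller of U1/U2 is the value usually
    reported. Computing the p-value is left to a statistics package.
    """
    combined = sorted(chain(((v, 0) for v in xs), ((v, 1) for v in ys)))
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0  # average of 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = midrank
        i = j
    r1 = sum(r for r, (_, g) in zip(ranks, combined) if g == 0)
    u1 = r1 - len(xs) * (len(xs) + 1) / 2.0
    u2 = len(xs) * len(ys) - u1
    return u1, u2

# Invented per-recording loudness means for the two groups (toy values)
td = [1.9, 2.1, 2.0, 1.8]   # typically developing speakers
ds = [3.3, 3.0, 3.5, 2.9]   # speakers with Down syndrome
u1, u2 = mann_whitney_u(td, ds)  # complete separation: U1 = 0, U2 = 16
```

With complete separation between the groups, as in this toy example, the smaller U statistic is 0, the most extreme value possible for these sample sizes.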


monotony while playing.

The video game collected examples of sentences with different modalities (i.e. declarative, interrogative and exclamatory). Usually, the intonation patterns vary depending on the modality. Neutral declarative sentences usually end with a decline to a low tone, while total interrogatives end with a rise to a high pitch. On the other hand, partial interrogative sentences, which are characterized by an interrogative element at the beginning of the sentence, start with a high tone associated with that interrogative element and usually end with a fall. Finally, exclamatory sentences are usually a marked variation of the corresponding declarative, so the variation lies basically in such aspects as the intensity, volume and tonal range used by the speaker.

Moreover, the combination of different sentences allows the inclusion of inflections that indicate a particular segmentation in oral production. Depending on the context and the speed of elocution, these inflections may correspond to a pause, which implies a silence and, normally, the end of the sentence, or to a semi-pause, which implies an intonation change within the same sentence. For instance, one of the examples collected in the corpus includes the three modalities and forces the speaker to make a pause between sentences: ¡Hola! ¿Tienen lupas? Quería comprar una. (Hello! Do you have magnifiers? I wanted to buy one.) In other cases, the tonal inflection corresponds to a semi-pause involving no change of modality or silence: ¡Hasta luego, tío Pau! (See you later, uncle Pau!) Thus, the combination of these types of inflection allows the collection of examples with different segmentations. The sentences recorded can be seen in Table 3.

The recording sessions were carried out in the same facilities of the centers where the players attended their regular classes, to assure the comfort of the players. In addition, a staff member of the centers was always with the players. The players were selected by the staff members because the distinct cognitive abilities of each student limited their possibilities as potential players, as some of them were not able to follow the structured process of the game in a reliable way. Eighteen speakers with Down syndrome participated, 11 males (chronological ages: 16, 16, 18, 20, 21, 21, 23, 24, 25, 26 and 30) and 7 females (chronological ages: 16, 17, 18, 19, 21, 22, 25). All of them were native speakers of Spanish, aged 16 to 30. They were students of two special education schools, located in Valladolid and Barcelona (Spain), and have a moderate or mild intellectual disability. Besides, to reduce the ambient noise in the recording process, the players used a headset with an incorporated microphone (Plantronics USB headset). In addition, players recorded a different number of sentences, depending on their performance in the video game and the number of game sessions they did. It should be noted that, for the production activities, not all speakers with Down syndrome reproduced the target sentence exactly. Some of them had hearing problems, while others had reading difficulties or cluttering derived from their intellectual disability.

To obtain a control sample of the recordings, twenty-two adult speakers without any intellectual disability, 13 males and 9 females, were recorded. Two groups representing different populations were thus obtained: typically developing adults (TD) and people with Down syndrome (DS). Table 4 shows the number of users in each group of speakers, the number of recordings made by them and the total length in seconds of the recordings.

Table 3
Sentences included in the corpus.

Sentence in Spanish | Sentence in English
¡Hasta luego, tío Pau! | See you later, uncle Pau!
¡Muchas gracias, Juan! | Thank you very much, Juan!
¡Hola! ¿Tienen lupas? Quería comprar una. | Hello, do you have magnifiers? I wanted to buy one.
Sí, la necesito. ¿Cuánto vale? | Yes, I need it. How much is it?
¡Hola tío Pau! Ya vuelvo a casa. | Hello uncle Pau! I'll be back home.
Sí, esa es. ¡Hasta luego! | Yes, it is. Bye!
¡Hola, tío Pau! ¿Sabes dónde vive la señora Luna? | Hello uncle Pau! Do you know where Mrs Luna lives?
¡Nos vemos luego, tío Pau! | See you later, uncle Pau!
Has sido muy amable, Juan. ¡Muchas gracias! | You have been very kind, Juan. Thank you very much!
¡Hola! ¿Tienen lupas? Me gustaría comprar una. | Hello, do you have magnifiers? I would like to buy one.
Sí, necesito una sea como sea. ¿Cuánto vale? | Yes, I really need one. How much is it?
Sí, lo es. Vivo allí desde pequeño. ¡Hasta luego! | Yes, it is. I have lived there since I was a child. Bye!
¡Hola, tío Pau! Tengo que encontrar a la señora Luna. ¿Sabes dónde vive? | Hello uncle Pau! I have to find Mrs Luna. Do you know where she lives?

Table 4
Number of users and recordings of each group of the corpus.

User type | #Users | #Recordings | Length (seconds)
Control (TD) | 22 | 250 | 650
Down syndrome (DS) | 18 | 349 | 1442

3.2. Feature extraction

Acoustic low-level descriptors (LLD) and temporal features were automatically extracted from each recording using the openSmile toolkit (Eyben et al., 2013). Two minimalistic feature sets were used. On the one hand, these sets provide enough features to characterize the audio recordings. On the other hand, we avoid the problem of having too many parameters relative to the number of observations. This problem can produce overfitting in the training phase, because the classifier adapts to the concrete set of inputs. This adaptation can produce good classification results for this particular set, but negatively affects the generalization capacity of the classifier. The Geneva Minimalistic Standard Parameter Set (GeMAPS) and the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS), described by Eyben et al. (2016), were selected. The features extracted from each recording are sorted into four groups:

• Frequency related features: fundamental frequency and jitter.
• Energy related features: loudness, shimmer and Harmonics-to-Noise Ratio.
• Spectral features: alpha ratio, Hammarberg index, spectral slope, formant 1, 2, 3 relative energy, harmonic difference H1-H2, harmonic difference H1-A3, formant 1, 2, 3 frequency and formant 1, 2, 3 bandwidth.
• Temporal features: the rate of loudness peaks per second, the mean length and standard deviation of continuous voiced and unvoiced segments, and the rate of voiced segments per second, approximating the pseudo-syllable rate.

In total, there are 25 LLD. The arithmetic mean and the coefficient of variation are calculated on these 25 LLD. Some functionals are applied to fundamental frequency and loudness: the 20th, 50th and 80th percentiles, the range from the 20th to the 80th percentile, and the mean and standard deviation of the slope of rising/falling signal parts. All these functionals are computed by the openSmile toolkit. In addition, the process used by the openSmile toolkit to extract the eGeMAPS features did not differentiate between silences and unvoiced regions, which can produce errors in the functionals applied to each feature. Therefore, the Praat software (Boersma, 2006) was used to extract all silences from each recording, and these silences were excluded from the analysis process.

Furthermore, 4 additional temporal features were added: the silence and sounding percentages, silences per second and the mean silence duration. These new features were added to improve the information about the temporal characterization of the recordings. In this case, the initial and final silence of each recording were excluded from the analysis process


because their lengths were different due to the recording process. To sum up, the acoustic feature set contains 88 features from the eGeMAPS feature set and 4 new features introduced by the research team (92 features in total).

A statistical test was used to detect the significant differences between the features extracted from the recordings of each group. The Mann-Whitney non-parametric test was used. Only the features with a p-value lower than 0.01 were selected for analysis and classification.

3.3. Automatic classification

In order to make an automatic classification of the recordings, the Weka machine learning toolkit (Hall et al., 2009) was used. This toolkit allows a collection of machine learning algorithms to be accessed for data mining tasks. Three different classifiers were used to compare their performance: the C4.5 decision tree (DT), the multilayer perceptron (MLP) and the support vector machine (SVM).

In addition, the 10-fold cross validation technique was used to create the training and testing datasets. To avoid classifier adaptation, all folds were created from recordings of different speakers. Therefore, the recordings of each speaker were joined in the same fold, and each fold was balanced in terms of the number of recordings.

To analyze the performance of the classification, we used the classification rate. The unweighted average recall (UAR) (Schuller et al., 2016) was also used. This metric is the mean of sensitivity (recall of positive instances) and specificity (recall of negative instances). UAR was chosen as the classification metric because it weights each class equally regardless of its number of samples, so it represents more precisely the accuracy of a classification test using unbalanced data.

3.4. Perception test

In order to evaluate the impact of prosody on the perception of listeners, we used prosody transfer techniques. These techniques have previously been used in other studies in the state of the art. For instance, Luo et al. (2017) investigated the role of different prosodic features in the naturalness of English L2 speech. The prosodic modification method was applied to native and L2 learners' speech. Later, they used a perceptual test to evaluate the impact of prosody modification. A similar methodology was used by Escudero et al. (2017), where the characteristic prosodic patterns of the style of different groups of speakers were investigated. After the prosodic modification of the utterances, the characteristic prosodic patterns were validated using a perceptual test. The procedure described in Escudero et al. (2017) for transferring prosody is used in the experiments reported in this paper.

Fig. 2 shows the experimental procedure used to perform the perception test. The sentence ¡Hola tío Pau! ¿Sabes dónde vive la señora Luna? (Hello uncle Pau! Do you know where Mrs Luna lives?), recorded by all the speakers, was selected. This sentence was selected because of its prosodic richness (combining an affirmative and an interrogative sentence), because it was used in another of our studies (González-Ferreras et al., 2017) and because it was the most recorded sentence. To obtain a phonetic segmentation of the recordings, the BAS web services (Schiel, 1999; Kisler et al., 2017) were used. This tool returns the time intervals of each phoneme using the audio file and the transcription as inputs. Manual revision of the segmentation was necessary to correct transcription errors. The sentence was recorded by 22 TD speakers and by 16 speakers with DS. However, each speaker did not have the same number of recordings. In total, there were 62 recordings.

Once the segmentation was corrected, a prosody transfer algorithm implemented in Praat (Boersma, 2006) was executed. This algorithm transfers, phoneme by phoneme, the pitch, energy and duration from one audio to another. Therefore, the new audio file contains the original utterance but with the prosody transferred from another utterance. The algorithm was executed combining the audios of each speaker with the audios of the rest of the speakers, so, in total, 3525 audio files were generated (not all the speakers had the same number of recordings). As a result, there are four types of audio files, as shown in Fig. 2. Five audio files of each type were selected randomly for the perception test, so the test included twenty audio files, balanced in terms of gender.

The perception test was performed using a web application. First, personal information about the evaluator was collected. Then, the twenty audio files selected in the previous phase were presented in random order. The evaluators had to answer the following question for each utterance: keeping in mind the way of speaking, do you think that the person who is speaking has intellectual disabilities? Ignore the audio distortion produced by the non-natural voice synthesis. The possible answers were on a 5-point Likert scale: 1 means "no way" and 5 means "very sure". Thirty evaluators judged each utterance using this scale. People without any specific background in speech therapy were selected for this test, as we were interested in the perception of untrained listeners concerning the importance of prosody in the identification of speech from people with intellectual disability.

4. Results

4.1. Characterization results

Table 5 shows the features with statistically significant differences (Mann–Whitney test with p-value < 0.01) related to frequency, energy

Fig. 2. Experimental procedure followed to perform the perceptual test. The utterances used in the test were: TDutt+TDpro (utterance of a TD person with prosody transferred from an
utterance of another TD person), DSutt+TDpro (utterance of a person with DS with prosody transferred from an utterance of a TD person), TDutt+DSpro (utterance of a TD person with
prosody transferred from an utterance of a person with DS) and DSutt+DSpro (utterance of a person with DS with prosody transferred from an utterance of another person with DS).
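The transfer scheme summarized in the caption above can be illustrated with a toy, frame-level sketch. This is not the authors' Praat-based procedure from Escudero et al. (2017); it only shows the core idea, under our own simplifications: keep the segmental (spectral) frames of one utterance and overwrite its F0 and energy contours with time-aligned values from a donor utterance. All names (`transfer_prosody`, `resample`) are illustrative.

```python
# Toy sketch of contour-level prosody transfer (illustrative only; the
# paper uses the Praat-based procedure of Escudero et al., 2017).
# Idea: keep the target utterance's spectral frames and overwrite its
# F0 and energy contours with time-aligned values from a donor.

def resample(contour, n):
    """Linearly interpolate a contour to n points (crude time alignment)."""
    m = len(contour)
    if n == 1 or m == 1:
        return [contour[0]] * n
    out = []
    for i in range(n):
        pos = i * (m - 1) / (n - 1)
        lo = int(pos)
        hi = min(lo + 1, m - 1)
        frac = pos - lo
        out.append(contour[lo] * (1 - frac) + contour[hi] * frac)
    return out

def transfer_prosody(target_frames, donor_f0, donor_energy):
    """target_frames: dicts with 'spectrum', 'f0', 'energy' per frame.
    Returns frames combining target spectra with donor prosody."""
    n = len(target_frames)
    f0 = resample(donor_f0, n)
    energy = resample(donor_energy, n)
    return [{'spectrum': fr['spectrum'], 'f0': f0[i], 'energy': energy[i]}
            for i, fr in enumerate(target_frames)]
```

In these terms, a DSutt+TDpro stimulus corresponds to calling `transfer_prosody` with frames from a DS utterance and contours from a TD utterance; a complete system also transfers phoneme durations, which this sketch omits.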

M. Corrales-Astorgano et al. Speech Communication 99 (2018) 90–100

Table 5
List of frequency, energy and temporal features with the largest statistically significant differences (Mann–Whitney test with p-value < 0.01), sorted by mean differences. The meaning of the features in the Variable column can be seen in Appendix A. The units are reported in Eyben et al. (2016).

Variable   Control (mean ± SD)   Control (CI 95%)   Down syndrome (mean ± SD)   Down syndrome (CI 95%)

F0 domain
F0_stddevRisingSlope 166.17 ± 231.44 (137.35,195.01) 220.85 ± 273.67 (192.08,249.62)
jitter_stddevNorm 1.15 ± 0.39 (1.11,1.21) 1.46 ± 0.47 (1.42,1.52)
jitter_mean 0.04 ± 0.02 (0.045,0.050) 0.03 ± 0.01 (0.035,0.039)
F0_pctlrange 4.63 ± 1.9 (4.4,4.88) 3.91 ± 2.88 (3.61,4.22)
F0_percentile20 26.89 ± 4.49 (26.33,27.45) 30.32 ± 4.63 (29.84,30.81)
F0_percentile50 29.18 ± 4.22 (28.66,29.71) 32.33 ± 4.28 (31.89,32.79)
F0_mean 29.3 ± 4.11 (28.79,29.82) 32.38 ± 4.14 (31.95,32.82)
F0_stddevNorm 0.13 ± 0.07 (0.129,0.147) 0.12 ± 0.07 (0.116,0.132)
F0_percentile80 31.52 ± 4.34 (30.99,32.07) 34.24 ± 4.67 (33.75,34.73)
Energy domain
loudness_percentile20 0.95 ± 0.38 (0.91,1.01) 1.77 ± 1.03 (1.66,1.88)
loudness_percentile50 1.93 ± 0.73 (1.84,2.02) 3.29 ± 2.22 (3.06,3.53)
loudness_mean 2.09 ± 0.78 (1.99,2.19) 3.37 ± 1.99 (3.17,3.58)
loudness_percentile80 3.15 ± 1.24 (3,3.31) 4.9 ± 2.94 (4.6,5.22)
loudness_pctlrange 2.19 ± 0.96 (2.08,2.32) 3.13 ± 2.06 (2.92,3.35)
loudness_stddevRisingSlope 15.3 ± 7.18 (14.41,16.2) 19.63 ± 14.24 (18.14,21.13)
loudness_stddevNorm 0.57 ± 0.07 (0.57,0.58) 0.49 ± 0.07 (0.48,0.5)
shimmer_mean 1.55 ± 0.38 (1.51,1.61) 1.36 ± 0.37 (1.32,1.4)
shimmer_stddevNorm 0.86 ± 0.14 (0.84,0.88) 0.78 ± 0.16 (0.77,0.8)
Temporal domain
silencePercentage 0.1 ± 0.11 (0.09,0.12) 0.22 ± 0.19 (0.2,0.24)
silencesMean 0.16 ± 0.2 (0.14,0.19) 0.31 ± 0.3 (0.28,0.35)
StddevVoicedSegmentLengthSec 0.15 ± 0.08 (0.14,0.16) 0.25 ± 0.2 (0.23,0.27)
MeanVoicedSegmentLengthSec 0.26 ± 0.15 (0.25,0.29) 0.44 ± 0.39 (0.41,0.49)
silencesPerSecond 0.39 ± 0.38 (0.35,0.44) 0.57 ± 0.4 (0.53,0.62)
VoicedSegmentsPerSec 3.42 ± 1.06 (3.29,3.55) 2.47 ± 1.04 (2.37,2.59)
loudnessPeaksPerSec 5.76 ± 1 (5.64,5.89) 4.39 ± 0.94 (4.29,4.49)
MeanUnvoicedSegmentLength 0.05 ± 0.02 (0.05,0.06) 0.06 ± 0.03 (0.06,0.07)
soundingPercentage 0.89 ± 0.11 (0.88,0.91) 0.77 ± 0.19 (0.76,0.8)

and temporal domains, sorted by mean differences. In the case of frequency, 9 of 12 features present significant differences. The first rows (from F0_stddevRisingSlope to jitter_mean) refer to the temporal evolution of the F0 contour. In all cases, the figures present a higher value for speakers with Down syndrome, whether the stddev value, the rising slope or the jitter is analyzed (the jitter mean is lower because it focuses on the periods, which are the inverse of the F0 values). The last rows (from F0_pctlrange to F0_percentile80) refer to mean values, coefficient of variation, ranges and percentiles of the F0 contour. Speakers with Down syndrome exhibit higher values than the speakers of the control group in all cases, with a lower coefficient of variation in the Down syndrome group. These results seem to indicate that the participants with Down syndrome use higher F0 values with more temporal changes in the F0 contours.

There are 9 of 14 energy features that present statistically significant differences (Mann–Whitney test with p-value < 0.01), as shown in Table 5. The first rows (from loudness_percentile20 to loudness_pctlrange) refer to mean, range and percentile values; these are higher for speakers with Down syndrome in all cases. The last rows refer to the temporal variation of the energy values; in this case, Down syndrome speakers exhibit lower values. These results seem to indicate that participants with Down syndrome speak louder, with less variation in the energy.

With respect to the temporal features displayed in Table 5, 9 of 10 features presented statistically significant differences (Mann–Whitney test with p-value < 0.01). Speakers with Down syndrome use more pauses and these are longer (higher silencePercentage, silencesPerSecond and silencesMean). The length of the voiced segments is also longer, indicating that participants with Down syndrome speak more slowly.

As for the spectral features (Table 6), 34 of 56 features showed statistically significant differences (Mann–Whitney test with p-value < 0.01). The results show that the LTAS could be a useful instrument to detect differences, as clear differences appear when the features related with the slope, Hammarberg and alpha indices are taken into account. Formant 1 and Formant 3 (to a lower degree) also allow differences to be identified. As expected, the MFCC values (the four analyzed) permit both groups to be separated. With respect to the variables related with the harmonic differences, only two variables appear in the list: logRelF0H1A3_stddevNorm and logRelF0H1A3_mean.

4.2. Classification results

Table 7 shows the classification results in the task of identifying the group of the speaker (TD or DS) of each utterance. The classifiers explained in Section 3.3 and the features selected in the previous section were used; only the features with significant differences between the TD and DS groups are included. DT shows the lowest classification results for all feature groups. MLP shows a better performance using the frequency (UAR 0.64), temporal (UAR 0.78), frequency+energy+temporal (UAR 0.91) and all (UAR 0.95) feature sets. SVM works better with the energy features (UAR 0.78). The results using spectral features are the same for the MLP and SVM classifiers (UAR 0.87).

In addition, the best classification results are obtained using all the features, independently of which classifier is used. Frequency features show the worst performance when used alone. Energy and temporal features give similar results, with only 9 features per group. When frequency, energy and temporal features are used together, the performance is noticeably better than when each group is used separately. Finally, spectral features show a slightly worse performance than the all and frequency+energy+temporal feature sets.

4.3. Perception test results

Table 8 shows the results of the perception test and Fig. 3 visually presents the differences between the groups. When the prosody of TD speakers was transferred to utterances of TD speakers, 84% of the answers identified the audios as TD speakers (answer 1 of row TDutt+TDpro). In this case, the doubts in the identification of the audio files


Table 6
List of spectral features with the largest statistically significant differences (Mann–Whitney test with p-value < 0.01), sorted by mean differences. The meaning of the features in the Variable column can be seen in Appendix A. The units are reported in Eyben et al. (2016).

Variable   Control (mean ± SD)   Control (CI 95%)   Down syndrome (mean ± SD)   Down syndrome (CI 95%)

LTAS related features


slopeV0500_mean 0 ± 0.03 (0,0.01) 0.05 ± 0.03 (0.056,0.063)
slopeUV0500_mean −0.06 ± 0.04 (−0.07,−0.06) 0.05 ± 0.03 (0.02,0.03)
slopeV0500_stddevNorm −1.12 ± 13.82 (−2.85,0.6) 0.69 ± 2.64 (0.41,0.97)
alphaRatioUV_mean −12.06 ± 11.37 (−13.48,−10.65) 1.07 ± 6.37 (0.41,1.75)
hammarbergIndexUV_mean 20.79 ± 13.51 (19.11,22.48) 5.4 ± 7.24 (4.64,6.16)
alphaRatioV_mean −11.79 ± 5.52 (−12.49,−11.11) −8.46 ± 5.55 (−9.05,−7.88)
hammarbergIndexV_mean 20.8 ± 7.06 (19.93,21.69) 16.35 ± 7.14 (15.61,17.11)
hammarbergIndexV_stddevNorm 0.48 ± 0.67 (0.4,0.57) 0.57 ± 1.01 (0.47,0.68)
slopeV5001500_mean −0.02 ± 0 (−0.03,−0.02) −0.02 ± 0 (−0.021,−0.020)
spectralFlux_mean 1.96 ± 1.09 (1.83,2.1) 2.94 ± 2.32 (2.7,3.19)
spectralFluxUV_mean 1.4 ± 1.35 (1.23,1.57) 2.1 ± 2.11 (1.88,2.32)
spectralFluxV_mean 2.11 ± 1.12 (1.98,2.26) 3.13 ± 2.53 (2.87,3.4)
spectralFlux_stddevNorm 0.72 ± 0.19 (0.7,0.75) 0.67 ± 0.12 (0.66,0.69)
MFCC related features
mfcc3_stddevNorm 0.25 ± 24.92 (−2.85,3.36) −54.35 ± 1039.94 (−163.68,54.98)
mfcc2V_mean 1.49 ± 7.41 (0.58,2.42) −2.45 ± 6.88 (−3.17,−1.73)
mfcc4_stddevNorm 1.54 ± 44.52 (−4.01,7.09) −2 ± 19.36 (−4.04,0.03)
mfcc2_stddevNorm 1.97 ± 26.17 (−1.29,5.23) −1.16 ± 27.11 (−4.01,1.69)
mfcc2_mean 4.05 ± 7.08 (3.18,4.94) −2.32 ± 6.45 (−3,−1.64)
mfcc4V_stddevNorm −1.23 ± 9.51 (−2.42,−0.05) −0.45 ± 4.73 (−0.96,0.04)
mfcc4_mean −11.17 ± 7.74 (−12.14,−10.21) −17.34 ± 9.91 (−18.39,−16.3)
mfcc3V_stddevNorm −0.78 ± 71.43 (−9.68,8.11) −0.28 ± 21.18 (−2.51,1.94)
mfcc4V_mean −14.75 ± 8.58 (−15.82,−13.68) −18.3 ± 10.83 (−19.44,−17.17)
mfcc1V_mean 26.42 ± 7.31 (25.51,27.34) 20.93 ± 9.61 (19.93,21.95)
mfcc1_mean 22.52 ± 7.73 (21.56,23.49) 18.16 ± 9.95 (17.11,19.21)
Formants related features
F3amplitudeLogRelF0_stddevNorm −1.18 ± 0.25 (−1.22,−1.16) −1.36 ± 0.41 (−1.41,−1.32)
F2amplitudeLogRelF0_mean −49.47 ± 17.65 (−51.68,−47.28) −42.63 ± 20.55 (−44.79,−40.47)
F2amplitudeLogRelF0_stddevNorm −1.35 ± 0.26 (−1.39,−1.32) −1.54 ± 0.61 (−1.61,−1.48)
F1bandwidth_stddevNorm 0.2 ± 0.08 (0.19,0.21) 0.23 ± 0.09 (0.22,0.24)
F1frequency_stddevNorm 0.35 ± 0.09 (0.34,0.37) 0.4 ± 0.09 (0.39,0.41)
F3frequency_stddevNorm 0.09 ± 0.02 (0.095,0.102) 0.1 ± 0.02 (0.1,0.11)
F3frequency_mean 2665.98 ± 145.97 (2647.81,2684.17) 2643.51 ± 203.27 (2622.15,2664.89)
F3amplitudeLogRelF0_mean −53.64 ± 17.44 (−55.82,−51.47) −45.02 ± 19.5 (−47.08,−42.98)
Harmonic differences features
logRelF0H1A3_stddevNorm 1.6 ± 16.02 (−0.39,3.6) 0.18 ± 7.44 (−0.6,0.97)
logRelF0H1A3_mean 18.91 ± 6.26 (18.13,19.69) 15.86 ± 7.09 (15.12,16.61)
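Table 7 reports the unweighted average recall (UAR) alongside the raw classification rate. UAR is the mean of the per-class recalls, so it is not inflated when one class dominates the test set; a minimal sketch (the function name is ours):

```python
def uar(y_true, y_pred):
    """Unweighted average recall: the mean of per-class recalls."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        hits = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(classes)
```

For example, with 8 TD and 2 DS utterances, a classifier that labels almost everything as TD can reach 90% raw accuracy but only 0.75 UAR, which is why both figures are reported.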

Table 7
Classification results for identifying the group of the speaker. Classification rate (C. Rate) and UAR using different feature sets and different classifiers are reported. The features used are those with significant differences between the TD and DS groups. The classifiers are decision tree (DT), support vector machine (SVM) and multilayer perceptron (MLP). # is the number of input features in each set.

Set                          #    SVM             MLP             DT
                                  C. Rate  UAR    C. Rate  UAR    C. Rate  UAR

Frequency                    9    62.67    0.61   64.33    0.64   60.17    0.60
Energy                       9    79.33    0.78   76       0.76   72.5     0.71
Temporal                     9    76.83    0.76   77.83    0.78   74.33    0.75
Frequency+Energy+Temporal    27   90       0.9    91.83    0.91   82       0.82
Spectral                     34   87.33    0.87   87.33    0.87   84.33    0.84
All                          61   94.17    0.94   95.17    0.95   86.5     0.87

Table 8
Number of responses of the perception tests for each type of audio file. A response of 1 means "no way" and 5 means "very sure" in the identification of the audio file as a speaker with Down syndrome. NR means no response. TDutt+TDpro means utterance of a TD person with prosody transferred from an utterance of another TD person; DSutt+TDpro means utterance of a person with DS with prosody transferred from an utterance of a TD person; TDutt+DSpro means utterance of a TD person with prosody transferred from an utterance of a person with DS; and DSutt+DSpro means utterance of a person with DS with prosody transferred from an utterance of another person with DS.

Type          1     2    3    4    5    NR   Total

TDutt+TDpro   124   15   3    1    4    3    150
DSutt+TDpro   42    42   31   18   11   6    150
TDutt+DSpro   17    21   34   43   31   4    150
DSutt+DSpro   1     11   26   49   56   7    150
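The feature screening of Tables 5 and 6 and the pairwise comparison of the perception answers both rely on the Mann–Whitney test. As a reference, a self-contained sketch using the normal approximation is given below; it uses average ranks for ties and omits the tie correction in the variance (a real analysis would use a statistics package such as scipy.stats.mannwhitneyu):

```python
import math

def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test via the normal approximation.
    Simplifications: average ranks for ties, no tie correction."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    values = [v for v, _ in pooled]
    n = len(values)
    avg_rank = [0.0] * n
    i = 0
    while i < n:                       # assign average ranks to tied runs
        j = i
        while j < n and values[j] == values[i]:
            j += 1
        for k in range(i, j):
            avg_rank[k] = (i + 1 + j) / 2
        i = j
    r1 = sum(avg_rank[k] for k in range(n) if pooled[k][1] == 0)
    n1, n2 = len(x), len(y)
    u1 = r1 - n1 * (n1 + 1) / 2        # U statistic for the first sample
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))   # two-sided p-value
    return u1, p
```

The normal approximation is only reasonable for sample sizes like those in this corpus; for very small groups an exact test should be used instead.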
as TD or DS represent only 2% of the answers (answer 3 of row TDutt+TDpro). On the other hand, when the prosody of DS speakers was transferred to utterances of DS speakers, 73% of the answers identified the audios as DS speakers (answers 4 and 5 of row DSutt+DSpro). In this case, the doubts in the identification of the audio files as TD or DS represent 18% of the answers (answer 3 of row DSutt+DSpro), and the identifications as TD are only 8% (answers 1 and 2 of row DSutt+DSpro).

The answers given about the audio files that combined utterances of one group with prosody of the other group present much more variability. However, prosody had more influence on the identification process than the original utterance. When the prosody of TD speakers was transferred to utterances of speakers with DS, 58% of the answers identified the audios as TD speakers (answers 1 and 2) versus only 20% of DS identifications (answers 4 and 5). On the other hand, 51% of the answers identified the audios as speakers with DS (answers 4 and 5) when the prosody of speakers with DS was transferred to an utterance of TD speakers, versus only 26% of TD identifications (answers 1 and 2). In both cases, the number of answers 3 is relevant (22% and 23% of answers 3, respectively).

Moreover, two statistical tests were used to compare the answers obtained. The results of the Kruskal–Wallis non-parametric test showed significant differences (with a p-value < 0.001) between the answers given to the four groups (TDutt+TDpro, DSutt+TDpro, TDutt+DSpro and DSutt+DSpro). Furthermore, the Mann–Whitney non-parametric test was used to compare each group with the others, in groups of two. All the comparisons showed significant differences (p-value < 0.001).

Fig. 3. Results of the perception tests for each type of audio file. TDutt+TDpro means utterance of a TD person with prosody transferred from an utterance of another TD person; DSutt+TDpro means utterance of a person with DS with prosody transferred from an utterance of a TD person; TDutt+DSpro means utterance of a TD person with prosody transferred from an utterance of a person with DS; and DSutt+DSpro means utterance of a person with DS with prosody transferred from an utterance of another person with DS.

5. Discussion

5.1. Characterization of the speech of people with Down syndrome

Fundamental frequency is significantly higher in speakers with Down syndrome. The same results were found by Albertini et al. (2010), Rochet-Capellan and Dohen (2015) and Lee et al. (2009). In addition, the F0 range is lower in speakers with Down syndrome, which can be explained by a less melodious intonation. Continuing with frequency, jitter is significantly lower in the DS group, as found by Lee et al. (2009) and by Seifpanahi et al. (2011).

Concerning temporal features, on the one hand, the number of continuous voiced regions per second is lower in the speakers with Down syndrome, which means that the oral production of speakers with Down syndrome was slower than that of control speakers. The reading difficulties that some people with Down syndrome present may have influenced these results. On the other hand, Van Borsel and Vandermeulen (2008) found disfluencies in Down syndrome speech, such as cluttering and stuttering. These disfluencies can produce the insertion of more silences and the presence of more temporal variety in the speech of people with Down syndrome, as found in this study.

In terms of energy, loudness features were found to be significantly higher in the speakers with Down syndrome, and their range was higher. This result contradicts that reported by Albertini et al. (2010), which showed lower energy values in speakers with Down syndrome. Another study focused on vowels (Saz et al., 2009a) found an increase in the energy of unstressed vowels in Down syndrome speakers. Energy is always a difficult variable in the analysis of prosody, as its values are very dependent on the recording conditions: the dynamic range of the microphone and the distance between the speaker and the microphone. On the other hand, some of the participants have slight hearing problems, which may be another possible explanation for the higher energy values.

Our corpus also permitted the detection of differences related with the spectral features. Table 6 highlights the fact that the LTAS has been proposed in Gauffin and Sundberg (1977) for the identification of breathy and hypokinetic voice. The relative amplitude of the first harmonic was also related with breathy voices by Hillenbrand and Houde (1996). The speech of people with DS is described as breathy by Wold DC (1979) and dysphonic by Moran (1986). MFCC features are commonly used in speaker recognition applications (Martinez et al., 2012), as they are representative of the vocal tract shape (Dusan and Deng, 1998). The relative importance of the MFCC features in the characterization of the speech of people with DS (as shown in Table 6) could thus be justified by the special anatomy of the tongue, palate, jaw, etc. of this type of speaker (Rodger, 2009). MFCC has also been used to identify nasality by Yuan and Liberman (2011), which is another aspect that has been related with the speech of people with DS in many works (Kent and Vorperian, 2013). The relative position of the formants has been associated with the degree of nasality in many works (House and Stevens, 1956; Huffman, 1989), which was also highlighted in our results table.

Finally, people with DS present hypotonia of the muscles and difficulties in motor control, which affect the movement of the lips, tongue and jaw, with the consequent impact on the spectral features already mentioned. The lack of muscular strength could also be another reason justifying the slower speech. As hypotonia could also affect the diaphragm, the energy values should have been lower. We hypothesize that the higher energy values obtained could be due to the extra effort made by the students to correctly complete the activities.

5.2. Relative impact of prosody

The experimental results obtained show that the features concerning the frequency, energy and temporal domains have the same or a greater impact than the spectral domain features when identifying the speech of people with Down syndrome:

• There are a high number of features outside the spectral domain that present significant differences between speakers with Down syndrome and speakers without intellectual disabilities.
• Spectral features achieve high classification rates (up to 87%), but the classification rates of frequency, energy and temporal features together are higher than those of the spectral features (up to 91.83%).
• Utterances of control speakers with transferred frequency, energy and phoneme duration from speakers with Down syndrome are mostly perceived as an anomalous voice. In the same way, utterances of speakers with Down syndrome with transferred frequency, energy and phoneme duration from control speakers are mostly perceived as typical speech.

To the best of our knowledge, there are few studies that assess, in an experimental way, the relative weight of prosody in the perception of the speech of people with Down syndrome as a non-typical voice. The differences between speakers with Down syndrome and control speakers in the spectral domain can be derived from physiological peculiarities in their phonological system. Some could be corrected by surgery, but others are impossible to correct. However, frequency, energy and temporal characteristics can be trained using speech therapy techniques focusing on breathing and repetition of activities. The results obtained in this paper show the potential benefits of prosody training.

The distance between the prosodic features of speakers with Down syndrome and those of control speakers can be used to devise a quality metric to be included in computer assisted pronunciation training applications. Our future work on the implementation of an automatic evaluation module of voice quality is expected to benefit from the results of this paper. This module is to be included in our speech training tools (González-Ferreras et al., 2017), so spectral features will be useful to identify a recording as non-typical speech, while prosody analysis will be necessary for the evaluation of the players' improvement over the different game sessions.

5.3. Limitations

The corpus size in speech analysis studies is very important to


achieve representative results. The recording of a corpus of speech of people with Down syndrome is always challenging because of the special characteristics of these speakers (attention deficit and problems with short-term memory, among others). Our video game has allowed the recording of a speech corpus whose size is bigger than that of the speech corpora used in other studies (see Table 2). Although the corpus size could be larger, the statistical tests carried out guarantee that the corpus has the necessary size to obtain significant results. In addition, new recordings are currently being obtained due to the use of the video game in a school of special education.

The heterogeneity of the population with Down syndrome can have an influence on the correct generalization of the results. However, the methodology presented in this paper can be applied to individuals with the aim of identifying the concrete features that they are using wrongly. Moreover, the relative impact of these features in the identification of their speech as pathological can be analyzed.

6. Conclusions

The speech characterization experiment presented in this article has allowed us to find significant differences between the speech of individuals with Down syndrome and that of the control group, affecting a set of acoustic variables related to the frequency, energy, temporal and spectral domains. The use of these variables in an automatic identification experiment allows very high classification rates (above 95%) to be obtained. If these variables are used independently, the classification rates decrease, the highest being those obtained using the spectral features. However, the importance of the rest of the variables becomes clear, because when only the variables related to the frequency, energy and temporal domains are used, the classification rate can be higher than that obtained using the spectral features.

A perception experiment, based on prosody transfer, allowed us to verify the high relative importance of the prosodic variables of the frequency, energy and temporal domains in the perception of atypical speech. An adequate control of these variables in utterances of speakers with Down syndrome allows us to change the perception of them, even though the voice quality is not modified. Likewise, transferring the prosody of speakers with Down syndrome to speakers of the control group means the utterances will be perceived, to a large degree, as if they were from speakers with Down syndrome. This result encourages the use of methodologies for training prosody as a means of improving the overall quality of the oral production of Down syndrome speakers.

Acknowledgments

The work described in this paper was supported (1/2016-12/2017) by the Fundacion BBVA (project "Pradia: la aventura gráfica de la pragmática y la prosodia" - CF613399). The activities of Down syndrome speech analysis continue (1/2018-12/2020) in the project funded by the Ministerio de Economía, Industria y Competitividad (MINECO) and the European Regional Development Fund FEDER (project "Incorporación de un Módulo de Predicción Automática de la Calidad de la Comunicación Oral de Personas con Síndrome de Down en un Videojuego Educativo" - TIN2017-88858-C2-1-R). The authors would like to thank all the participants who took part in the recording of the corpus. We would also like to thank Lourdes Aguilar, Valle Flores, Yolanda Martín and Ferran Adell. Special thanks to the students of the Fundación Personas (http://www.fundacionpersonas.org) for their motivation during the training sessions.

Appendix A. Description of the features

The tables included in this appendix describe the features used in each of the domains. Frequency features are presented in Table A.9. Energy
features are described in Table A.10. Temporal features are explained in Table A.11. Spectral features are presented in Tables A.12 and A.13.

Table A.9
Frequency features explained. All functionals are applied to voiced regions only. Text in brackets shows the original name of the eGeMAPS features.

Feature Description

F0_stddevRisingSlope (F0semitoneFrom27.5Hz_sma3nz_stddevRisingSlope) Standard deviation of the slope of rising signal parts of F0


jitter_stddevNorm (jitterLocal_sma3nz_stddevNorm) Coefficient of variation of the deviations in individual consecutive F0 period lengths
jitter_mean (jitterLocal_sma3nz_amean) Mean of the deviations in individual consecutive F0 period lengths
F0_pctlrange (F0semitoneFrom27.5Hz_sma3nz_pctlrange0-2) Range of 20-th to 80-th of logarithmic F0 on a semitone frequency scale, starting at 27.5 Hz
F0_percentile20 (F0semitoneFrom27.5Hz_sma3nz_percentile20.0) Percentile 20-th of logarithmic F0 on a semitone frequency scale, starting at 27.5 Hz
F0_percentile50 (F0semitoneFrom27.5Hz_sma3nz_percentile50.0) Percentile 50-th of logarithmic F0 on a semitone frequency scale, starting at 27.5 Hz
F0_mean (F0semitoneFrom27.5Hz_sma3nz_amean) Mean of logarithmic F0 on a semitone frequency scale, starting at 27.5 Hz
F0_stddevNorm (F0semitoneFrom27.5Hz_sma3nz_stddevNorm) Coefficient of variation of logarithmic F0 on a semitone frequency scale, starting at 27.5 Hz
F0_percentile80 (F0semitoneFrom27.5Hz_sma3nz_percentile80.0) Percentile 80-th of logarithmic F0 on a semitone frequency scale, starting at 27.5 Hz
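The F0 features above are expressed on a logarithmic semitone scale relative to 27.5 Hz (12 semitones per octave), which is why the means in Table 5 lie around 29-32 rather than in Hz. The conversion is:

```python
import math

def hz_to_semitones(f0_hz, base=27.5):
    """Logarithmic F0 on a semitone scale starting at 27.5 Hz,
    as used by the eGeMAPS F0 functionals above."""
    return 12.0 * math.log2(f0_hz / base)
```

Under this scale, the control-group mean of about 29 semitones corresponds to roughly 147 Hz, and the Down syndrome mean of about 32 semitones to roughly 175 Hz.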

Table A.10
Energy features explained. All functionals are applied to voiced and unvoiced regions together. Text in brackets shows the original name of the eGeMAPS features.

Feature Description

loudness_percentile20 (loudness_sma3_percentile20.0) Percentile 20-th of estimate of perceived signal intensity from an auditory spectrum
loudness_percentile50 (loudness_sma3_percentile50.0) Percentile 50-th of estimate of perceived signal intensity from an auditory spectrum
loudness_mean (loudness_sma3_amean) Mean of estimate of perceived signal intensity from an auditory spectrum
loudness_percentile80 (loudness_sma3_percentile80.0) Percentile 80-th of estimate of perceived signal intensity from an auditory spectrum
loudness_pctlrange02 (loudness_sma3_pctlrange0-2) Range of 20-th to 80-th of estimate of perceived signal intensity from an auditory spectrum
loudness_stddevRisingSlope (loudness_sma3_stddevRisingSlope) Standard deviation of the slope of rising signal parts of loudness
loudness_stddevNorm (loudness_sma3_stddevNorm) Coefficient of variation of estimate of perceived signal intensity from an auditory spectrum
shimmer_mean (shimmerLocaldB_sma3nz_amean) Mean of difference of the peak amplitudes of consecutive F0 periods
shimmer_stddevNorm (shimmerLocaldB_sma3nz_stddevNorm) Coefficient of variation of difference of the peak amplitudes of consecutive F0 periods
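Several of the functionals above (percentile20, percentile50, percentile80 and pctlrange0-2) are order statistics of the frame-level contour. A minimal sketch with linear-interpolation percentiles follows; openSMILE's exact percentile method may differ slightly, so this is a reference implementation of the definition, not of the extractor:

```python
def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a contour."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def percentile_functionals(contour):
    """The percentile-based functionals of this appendix: 20th, 50th and
    80th percentiles plus their range (pctlrange0-2)."""
    p20 = percentile(contour, 20)
    p50 = percentile(contour, 50)
    p80 = percentile(contour, 80)
    return {'percentile20': p20, 'percentile50': p50,
            'percentile80': p80, 'pctlrange02': p80 - p20}
```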


Table A.11
Temporal features explained.

Feature                        Description

silencePercentage              Duration percentage of unvoiced regions
silencesMean                   Mean duration of silences
StddevVoicedSegmentLengthSec   Standard deviation of continuously voiced regions
MeanUnvoicedSegmentLength      Mean duration of unvoiced regions
silencesPerSecond              The number of silences per second
VoicedSegmentsPerSec           The number of continuous voiced regions per second
loudnessPeaksPerSec            The number of loudness peaks per second
MeanVoicedSegmentLengthSec     Mean duration of continuously voiced regions
soundingPercentage             Duration percentage of voiced regions
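The temporal measures above can be derived from a frame-level voiced/unvoiced decision. The sketch below is our reading of the definitions; the frame period and the segment handling are assumptions, not the exact extractor used in the paper:

```python
def temporal_features(voiced, frame_s=0.01):
    """Sketch of the temporal measures from a frame-level mask
    (True = voiced) and an assumed frame period in seconds."""
    total_s = len(voiced) * frame_s
    segs = []
    run = 0
    for v in voiced + [False]:          # sentinel closes a trailing run
        if v:
            run += 1
        elif run:
            segs.append(run * frame_s)  # one continuously voiced segment
            run = 0
    sounding = sum(segs) / total_s
    return {
        'soundingPercentage': sounding,
        'silencePercentage': 1.0 - sounding,
        'VoicedSegmentsPerSec': len(segs) / total_s,
        'MeanVoicedSegmentLengthSec': sum(segs) / len(segs) if segs else 0.0,
    }
```

In these terms, the slower speech of the speakers with Down syndrome appears as a lower VoicedSegmentsPerSec together with a higher MeanVoicedSegmentLengthSec, as reported in Table 5.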

Table A.12
Spectral features explained (part 1). If nothing is said, the features are applied to voiced and unvoiced regions together. Text in brackets shows the original name of the eGeMAPS features.

Feature Description

mfcc3_stddevNorm (mfcc3_sma3_stddevNorm) Coefficient of variation of Mel-Frequency Cepstral Coefficient 3


slopeV0500_mean (slopeV0-500_sma3nz_amean) Mean of linear regression slope of the logarithmic power spectrum within 0–500 Hz band in voiced regions
mfcc2V_mean (mfcc2V_sma3nz_amean) Mean of Mel-Frequency Cepstral Coefficient 2 in voiced regions
mfcc4_stddevNorm (mfcc4_sma3_stddevNorm) Coefficient of variation of Mel-Frequency Cepstral Coefficient 4
slopeUV0500_mean (slopeUV0-500_sma3nz_amean) Mean of linear regression slope of the logarithmic power spectrum within 0–500 Hz band in unvoiced
regions
slopeV0500_stddevNorm (slopeV0-500_sma3nz_stddevNorm) Coefficient of variation of linear regression slope of the logarithmic power spectrum within 0–500 Hz band
in voiced regions
mfcc2_stddevNorm (mfcc2_sma3_stddevNorm) Coefficient of variation of Mel-Frequency Cepstral Coefficient 2
mfcc2_mean (mfcc2_sma3_amean) Mean of Mel-Frequency Cepstral Coefficient 2
alphaRatioUV_mean (alphaRatioUV_sma3nz_amean) Mean of the ratio of the summed energy from 50 to 1000 Hz and 1–5 kHz in unvoiced regions
logRelF0H1A3_stddevNorm (logRelF0-H1-A3_sma3nz_stddevNorm) Coefficient of variation of the ratio of energy of the first F0 harmonic (H1) to the energy of the highest
harmonic in the third formant range (A3) in voiced regions
hammarbergIndexUV_mean (hammarbergIndexUV_sma3nz_amean) Mean of the ratio of the strongest energy peak in the 0–2 kHz region to the strongest peak in the 2–5 kHz
region in unvoiced regions
mfcc3V_stddevNorm (mfcc3V_sma3nz_stddevNorm) Coefficient of variation of Mel-Frequency Cepstral Coefficient 3 in voiced regions
mfcc4V_stddevNorm (mfcc4V_sma3nz_stddevNorm) Coefficient of variation of Mel-Frequency Cepstral Coefficient 4 in voiced regions
mfcc4_mean (mfcc4_sma3_amean) Mean of Mel-Frequency Cepstral Coefficient 4
spectralFlux_mean (spectralFlux_sma3nz_amean) Mean of the difference of the spectra of two consecutive frames
spectralFluxUV_mean (spectralFluxUV_sma3nz_amean) Mean of the difference of the spectra of two consecutive frames in unvoiced regions
spectralFluxV_mean (spectralFluxV_sma3nz_amean) Mean of the difference of the spectra of two consecutive frames in voiced regions

Table A.13
Spectral features explained (part 2). If nothing is said, the features are applied to voiced and unvoiced regions together. Text in brackets shows the original name of the eGeMAPS features.

Feature Description

alphaRatioV_mean (alphaRatioV_sma3nz_amean) Mean of the ratio of the summed energy from 50 to 1000 Hz and 1–5 kHz in voiced regions
mfcc4V_mean (mfcc4V_sma3nz_amean) Mean of Mel-Frequency Cepstral Coefficient 4 in voiced regions
hammarbergIndexV_mean (hammarbergIndexV_sma3nz_amean) Mean of the ratio of the strongest energy peak in the 0–2 kHz region to the strongest peak in
the 2–5 kHz region in voiced regions
mfcc1V_mean (mfcc1V_sma3nz_amean) Mean of Mel-Frequency Cepstral Coefficient 1 in voiced regions
hammarbergIndexV_stddevNorm (hammarbergIndexV_sma3nz_stddevNorm) Coefficient of variation of the ratio of the strongest energy peak in the 0–2 kHz region to the
strongest peak in the 2–5 kHz region in voiced regions
mfcc1_mean (mfcc1_sma3_amean) Mean of Mel-Frequency Cepstral Coefficient 1
logRelF0H1A3_mean (logRelF0-H1-A3_sma3nz_amean) Mean of the ratio of energy of the first F0 harmonic (H1) to the energy of the highest
harmonic in the third formant range (A3) in voiced regions
F3amplitudeLogRelF0_mean (F3amplitudeLogRelF0_sma3nz_amean) Mean of the ratio of the energy of the spectral harmonic peak at the third formant’s centre
frequency to the energy of the spectral peak at F0 in voiced regions
F3amplitudeLogRelF0_stddevNorm (F3amplitudeLogRelF0_sma3nz_stddevNorm) Coefficient of variation of the ratio of the energy of the spectral harmonic peak at the third
formant’s centre frequency to the energy of the spectral peak at F0 in voiced regions
slopeV5001500_mean (slopeV500-1500_sma3nz_amean) Mean of linear regression slope of the logarithmic power spectrum within 500–1500 Hz band
in voiced regions
F2amplitudeLogRelF0_mean (F2amplitudeLogRelF0_sma3nz_amean) Mean of the ratio of the energy of the spectral harmonic peak at the second formant’s centre
frequency to the energy of the spectral peak at F0 in voiced regions
F2amplitudeLogRelF0_stddevNorm (F2amplitudeLogRelF0_sma3nz_stddevNorm) Coefficient of variation of the ratio of the energy of the spectral harmonic peak at the second
formant’s centre frequency to the energy of the spectral peak at F0 in voiced regions
F1bandwidth_stddevNorm (F1bandwidth_sma3nz_stddevNorm) Coefficient of variation of the bandwidth of first formant in voiced regions
F1frequency_stddevNorm (F1frequency_sma3nz_stddevNorm) Coefficient of variation of the centre frequency of first formant in voiced regions
F3frequency_stddevNorm (F3frequency_sma3nz_stddevNorm) Coefficient of variation of the centre frequency of third formant in voiced regions
spectralFlux_stddevNorm (spectralFlux_sma3_stddevNorm) Coefficient of variation of the difference of the spectra of two consecutive frames
F3frequency_mean (F3frequency_sma3nz_amean) Mean of the centre frequency of third formant in voiced regions
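To make the band-ratio descriptors above concrete, the following is a minimal NumPy sketch of the alpha ratio and the Hammarberg index computed from a single power spectrum, together with the coefficient of variation used by the *_stddevNorm functionals. This is a simplified illustration under assumed conventions (linear power, one spectrum), not the openSMILE/GeMAPS implementation, which operates per frame on voiced regions with its own scaling and smoothing; the toy spectrum and its 500 Hz peak are hypothetical.

```python
import numpy as np

def alpha_ratio(freqs, power):
    """Summed energy in the 50-1000 Hz band divided by that in the 1-5 kHz band."""
    low = power[(freqs >= 50) & (freqs < 1000)].sum()
    high = power[(freqs >= 1000) & (freqs <= 5000)].sum()
    return low / high

def hammarberg_index(freqs, power):
    """Strongest spectral peak in 0-2 kHz divided by the strongest peak in 2-5 kHz."""
    low_peak = power[(freqs >= 0) & (freqs < 2000)].max()
    high_peak = power[(freqs >= 2000) & (freqs <= 5000)].max()
    return low_peak / high_peak

def stddev_norm(values):
    """Coefficient of variation (the *_stddevNorm functionals): stddev / mean."""
    values = np.asarray(values, dtype=float)
    return values.std() / values.mean()

# Toy spectrum: 0-8 kHz in 10 Hz bins, flat power plus one strong low-frequency peak.
freqs = np.arange(0, 8000, 10.0)
power = np.ones_like(freqs)
power[freqs == 500] = 10.0  # hypothetical peak at 500 Hz

print(alpha_ratio(freqs, power))       # > 0: energy concentrated below 1 kHz
print(hammarberg_index(freqs, power))  # 10.0 for this toy spectrum
print(stddev_norm([200.0, 220.0, 210.0]))
```

In GeMAPS the corresponding features are per-frame values summarized over voiced regions by the mean and stddevNorm functionals; the sketch only shows the spectral ratios themselves.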


