Speech Communication 99 (2018) 90–100
A R T I C L E  I N F O

Keywords:
Speech characterization
Prosody
Down syndrome
Intellectual disabilities
Automatic classification
Perceptual test

A B S T R A C T

There are many studies that identify important deficits in the voice production of people with Down syndrome. These deficits affect not only the spectral domain, but also intonation, accent, rhythm and speech rate. The main aim of this work is the identification of the acoustic features that characterize the speech of people with Down syndrome, taking into account the frequency, energy, temporal and spectral domains. The comparison of the relative weight of these features for the characterization of the speech of people with Down syndrome is another aim of this study. The openSmile toolkit with the GeMAPS feature set was used to extract acoustic features from a speech corpus of utterances from typically developing individuals and individuals with Down syndrome. Then, the most discriminant features were identified using statistical tests. Moreover, three binary classifiers were trained using these features. The best classification rate using only spectral features is 87.33%, and using frequency, energy and temporal features it is 91.83%. Finally, a perception test was performed using recordings created with a prosody transfer algorithm: the prosody of utterances from one group of speakers was transferred to utterances of another group. The results of this test show the importance of intonation and rhythm in the identification of a voice as non-typical. In conclusion, the results obtained point to the training of prosody in order to improve the quality of the speech production of people with Down syndrome.
Corresponding author. E-mail addresses: [email protected] (M. Corrales-Astorgano), [email protected] (D. Escudero-Mancebo), [email protected] (C. González-Ferreras).
https://doi.org/10.1016/j.specom.2018.03.006
Received 18 December 2017; Received in revised form 21 February 2018; Accepted 13 March 2018; Available online 14 March 2018
weight of each domain in the characterization of people with Down syndrome has been included in this paper, especially the comparison between the spectral and the other domains.

The methodology described above was developed to answer two main research questions (RQ):

• RQ1: Which are the most discriminative acoustic features between the recordings of speakers with Down syndrome and typically developing speakers?
  • Issue 1.1: Are there statistical differences between these features?
  • Issue 1.2: Are these differences in accordance with what is expected or described in the state of the art?
• RQ2: What is the relative weight of the spectral features in comparison with the rest of the domains?
  • Issue 2.1: What is the relative weight of the different features when identifying atypical speech using automatic classifiers?
  • Issue 2.2: What is the relative weight of the different domains when identifying atypical speech in a perceptual test?

The structure of the article is as follows. Section 2 reviews related work from the state of the art and presents the innovation of our proposal. Section 3 describes the experimental procedure, including the corpus description, the feature extraction process, the automatic classification experiment and the perceptual test. Section 4 shows the statistical test results for the different domain features, the automatic classification results and the perceptual test results. Finally, Section 5 presents the discussion and Section 6 the conclusions.

2. Background and related work

The age of the population selected for the study seems to be important for the results obtained, due to the physiological differences between children and adults. Concerning adults, Lee et al. (2009), Rochet-Capellan and Dohen (2015), Albertini et al. (2010) and Corrales-Astorgano et al. (2016) found significantly higher F0 values in adults with Down syndrome as compared to adults without intellectual disabilities. In addition, Lee et al. (2009) and Seifpanahi et al. (2011) found lower jitter (frequency perturbations) in adult speakers with Down syndrome. As for energy, Albertini et al. (2010) found significantly lower energy values in adults with Down syndrome. Moreover, Saz et al. (2009a) concluded that adults with Down syndrome had poor control over energy in stressed versus unstressed vowels. Albertini et al. (2010) found lower shimmer (amplitude perturbations) in male adults with Down syndrome than in adults without intellectual disabilities. Finally, temporal domain results depend on the unit of analysis employed. Saz et al. (2009a) found that people with cognitive disorders presented an excessive variability in vowel duration, while Rochet-Capellan and Dohen (2015) and Bunton and Leddy (2011) reported longer durations of vowels in adults with Down syndrome. Albertini et al. (2010) discovered a lower duration of words in male adults with Down syndrome. Moreover, people with Down syndrome present some disfluency problems. Although disfluency (stuttering or cluttering) has not been demonstrated to be a universal characteristic of Down syndrome, it is a common problem in this population (Van Borsel and Vandermeulen, 2008; Devenny and Silverman, 1990; Eggers and Van Eerdenbrugh, 2017). These disfluencies can affect the speech rhythm of people with Down syndrome.

On the other hand, Zampini et al. (2016) indicated that children with Down syndrome had lower F0 than children without intellectual disabilities. Moura et al. (2008) found higher jitter in children with Down syndrome than in children without intellectual disabilities. In terms of energy, Moura et al. (2008) also indicated higher shimmer in children with Down syndrome than in children without intellectual disabilities.

The units of analysis and the phonation tasks used by these researchers differ. Rochet-Capellan and Dohen (2015) used vowel–consonant–vowel syllables, Saz et al. (2009a) and Albertini et al. (2010) recorded words, Rodger (2009) and Zampini et al. (2016) built their corpora using semi-spontaneous speech, and Corrales-Astorgano et al. (2016) analyzed sentences. Lee et al. (2009) combined words, reading and natural speech. The majority of the studies focus on the English language (Kent and Vorperian, 2013), but others focus on Italian (Zampini et al., 2016; Albertini et al., 2010), Spanish (Corrales-Astorgano et al., 2016; Saz et al., 2009a), French (Rochet-Capellan and Dohen, 2015) or Farsi (Seifpanahi et al., 2011).

The use of spectral features to assess pathological voice has frequently been applied in the literature. Dibazar et al. (2006) used MFCCs and pitch frequency with a hidden Markov model (HMM) classifier for the assessment of normal versus pathological voice, using one vowel as the unit of analysis. Markaki and Stylianou (2011) suggested the use of modulation spectra for the detection and classification of voice pathologies. Markaki and Stylianou (2010) created a method for the objective assessment of hoarse voice quality, based on modulation spectra, using a corpus of sustained vowels. Voice quality was evaluated using the long term average spectrum (LTAS) and the alpha ratio by Leino (2009). Although these works do not refer to people with Down syndrome, they do refer to some aspects that appear in this kind of speaker, and we return to them in the discussion section.

Formant frequency and amplitude have also been studied in people with Down syndrome. A larger vowel space in people with Down syndrome was found by Rochet-Capellan and Dohen (2015), while other studies denoted a reduction of the vowel space in children (Moura et al., 2008) and adults (Bunton and Leddy, 2011). Moreover, the voice of people with Down syndrome showed significantly reduced formant amplitude intensity levels (Pentz Jr, 1987).
Table 1
Results of different studies in the state of the art.
Table 2
Description of the corpus used in the state of the art.

Study | Age group | DS speakers | Control speakers | Type of recordings | Amount of material | Language
Rodger (2009) | Adults and children | 22 | 52 | Semi-spontaneous | 5 picture descriptions per speaker | English
Zampini et al. (2016) | Children | 9 | 12 | Semi-spontaneous | 20 minutes per speaker | Italian
Saz et al. (2009a) | Adults and children | 3 | 168 | Words | 684 words (38 minutes) Down syndrome; 9576 words (6 hours) control | Spanish
Albertini et al. (2010) | Adults | 30 | 60 | Words | NA | Italian
Rochet-Capellan and Dohen (2015) | Adults | 8 | 8 | Vowel–consonant–vowel | 144 per speaker | French
Lee et al. (2009) | Adults | 9 | 9 | Vowel, reading, natural speech | 3 vowels, 1 reading and 1 minute of natural speech per speaker | English
Corrales-Astorgano et al. (2016) | Adults | 18 | 20 | Sentences | 479 utterances | Spanish
In order to compare our study with the state of the art, a summary of other similar studies is shown in Table 1, and a description of the corpus employed by each of these studies is shown in Table 2. To the best of our knowledge, our study is one of the first to analyze features from the frequency, energy, temporal and spectral domains together. These features were extracted from the same recordings, which can help in the study of the relative importance of each domain in the characterization of the speech of people with Down syndrome. The use of a standard feature set (the extended Geneva Minimalistic Acoustic Parameter Set, eGeMAPS; detailed in Section 3.2 and Appendix A) reduces the dependence on the extraction methodology, which makes it easier to compare the results of different studies.

Perceptual studies show mixed results. Moura et al. (2008) described the voice of children with Down syndrome as being statistically different from the voice of children without intellectual disabilities in five speech problems: grade, roughness, breathiness, asthenic speech and strained speech. Moran and Gilbert (1982) judged the voice quality of adults with Down syndrome as hoarse. In addition, Rodger (2009) noted discrepancies between perceptual judgments of pitch level and acoustic measures of F0. In our study, we did not want to compare each acoustic measure with a perceptual judgment of the same feature. Our aim is the assessment of the relevance of each domain in the identification of a recording as being from a person with Down syndrome, using automatic classifiers and perceptual tests.

3. Experimental procedure

Fig. 1 shows the experimental methodology that we followed. Firstly, the speech corpus recorded by people with Down syndrome and by typically developing people was gathered. Secondly, acoustic features were extracted from all the recordings of the corpus, and a statistical test was carried out to analyze the differences between groups. Finally, the automatic classification experiment was carried out, in which the features with significant differences were used.

Fig. 1. Scheme of the experimental procedure, which includes corpus collection, feature extraction and automatic classification.

3.1. Corpus collection

We developed a computer video game to improve the prosodic and communication skills of people with Down syndrome (González-Ferreras et al., 2017). This video game is a graphic adventure game where users have to use the computer mouse to interact with the elements on the screen, listen to audio instructions and sentences from the characters of the game, and record utterances using a microphone in different contexts. The video game was designed using an iterative methodology in collaboration with a school of special education located in Valladolid (Spain). The feedback provided by the teachers of special education was complemented by research into the difficulties this population has in using information and communication technologies. They have some difficulties, such as attention deficit (Martínez et al., 2011), lack of motivation (Wuang et al., 2011), or problems with short-term memory (Chapman and Hesketh, 2001), that had to be taken into account when developing the video game. The game was developed for the Spanish language.

Inside the narrative of the game, some learning activities were included to practice communication skills. There are three different types of activities: comprehension, production and visual. Firstly, the comprehension activities are focused on lexical-semantic comprehension and on improving prosodic perception in specific contexts. Secondly, production activities are focused on oral production, so the players are encouraged by the game to train their speech, keeping in mind such prosodic aspects as intonation, expression of emotions or syllabic emphasis. At the beginning of these activities, the video game introduces the context where the sentence has to be said. Then, the game plays the sentence and the player must utter the sentence while it is shown on the screen. The production activities include affirmative, exclamatory and interrogative sentences. Finally, visual activities include other activities designed to add variety to the game and to reduce the feeling of
Table 3
Sentences included in the corpus.

Sentence in Spanish | Sentence in English
¡Hasta luego, tío Pau! | See you later, uncle Pau!
¡Muchas gracias, Juan! | Thank you very much, Juan!
¡Hola! ¿Tienen lupas? Quería comprar una. | Hello, do you have magnifiers? I wanted to buy one.
Sí, la necesito. ¿Cuánto vale? | Yes, I need it. How much is it?
¡Hola tío Pau! Ya vuelvo a casa. | Hello uncle Pau! I'll be back home.
Sí, esa es. ¡Hasta luego! | Yes, it is. Bye!
¡Hola, tío Pau! ¿Sabes dónde vive la señora Luna? | Hello uncle Pau! Do you know where Mrs Luna lives?
¡Nos vemos luego, tío Pau! | See you later, uncle Pau!
Has sido muy amable, Juan. ¡Muchas gracias! | You have been very kind, Juan. Thank you very much!
¡Hola! ¿Tienen lupas? Me gustaría comprar una. | Hello, do you have magnifiers? I would like to buy one.
Sí, necesito una sea como sea. ¿Cuánto vale? | Yes, I really need one. How much is it?
Sí, lo es. Vivo allí desde pequeño. ¡Hasta luego! | Yes, it is. I have lived there since I was a child. Bye!
¡Hola, tío Pau! Tengo que encontrar a la señora Luna. ¿Sabes dónde vive? | Hello uncle Pau! I have to find Mrs Luna. Do you know where she lives?

• Temporal features: the rate of loudness peaks per second, the mean length and standard deviation of continuous voiced and unvoiced segments, and the rate of voiced segments per second, approximating the pseudo-syllable rate.

In total, there are 25 LLD. The arithmetic mean and the coefficient of variation are calculated on these 25 LLD. Some additional functionals are applied to fundamental frequency and loudness: the 20-th, 50-th and 80-th percentiles, the range from the 20-th to the 80-th percentile, and the mean and standard deviation of the slope of rising/falling signal parts. All these functionals are computed by the openSmile toolkit. In addition, the process used by the openSmile toolkit to extract the eGeMAPS features did not differentiate between silences and unvoiced regions, which can produce errors in the functionals applied to each feature. Therefore, the Praat software (Boersma, 2006) was used to extract all silences from each recording, and these silences were excluded from the analysis process.

Furthermore, 4 additional temporal features were added: the silence and sounding percentages, the number of silences per second and the mean silence length. These new features were added to improve the information about the temporal characterization of the recordings. In this case, the initial and final silences of each recording were excluded from the analysis process
because their lengths were different due to the recording process. To sum up, the acoustic feature set contains 88 features from the eGeMAPS feature set and 4 new features introduced by the research team (92 features in total).
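This extraction step can be sketched as follows. This is a minimal illustration rather than the authors' exact pipeline: it assumes the audEERING opensmile Python wrapper around the openSMILE toolkit (which exposes the later eGeMAPSv02 revision; the paper used the 2016 eGeMAPS set) and the parselmouth interface to Praat, with a hypothetical file name and placeholder silence-detection thresholds.

```python
# Minimal sketch of the feature extraction, assuming the "opensmile" wrapper
# and parselmouth; file name and silence thresholds are placeholders.
import opensmile
import parselmouth
from parselmouth.praat import call

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,      # 88 functionals
    feature_level=opensmile.FeatureLevel.Functionals,
)
egemaps = smile.process_file("recording.wav")         # one row per recording

# The four extra temporal features need a silence segmentation; here Praat's
# silence detector is used, skipping the first and last intervals to
# approximate the exclusion of initial/final silences described above.
snd = parselmouth.Sound("recording.wav")
tg = call(snd, "To TextGrid (silences)",
          100, 0.0, -25.0, 0.1, 0.05, "silent", "sounding")
n_intervals = call(tg, "Get number of intervals", 1)
silences = []
for i in range(2, n_intervals):                       # intervals are 1-indexed
    if call(tg, "Get label of interval", 1, i) == "silent":
        start = call(tg, "Get starting point", 1, i)
        end = call(tg, "Get end point", 1, i)
        silences.append(end - start)

duration = snd.get_total_duration()
silence_pct = sum(silences) / duration
extra = {
    "silencePercentage": silence_pct,
    "soundingPercentage": 1.0 - silence_pct,
    "silencesPerSecond": len(silences) / duration,
    "silencesMean": sum(silences) / len(silences) if silences else 0.0,
}
```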
A statistical test was used to detect significant differences between the features extracted from the recordings of each group. The Mann–Whitney non-parametric test was used. Only the features with a p-value lower than 0.01 were selected for analysis and classification.
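A minimal sketch of this screening step, assuming the extracted functionals sit in a pandas DataFrame with one row per recording and a "group" column labelling DS/TD (our layout, not the authors' code):

```python
# Feature screening with the Mann-Whitney test, keeping features with p < 0.01.
import pandas as pd
from scipy.stats import mannwhitneyu

def screen_features(df: pd.DataFrame, alpha: float = 0.01) -> list:
    """Return the features whose DS/TD distributions differ significantly."""
    ds = df[df["group"] == "DS"]
    td = df[df["group"] == "TD"]
    selected = []
    for col in df.columns.drop("group"):
        _, p = mannwhitneyu(ds[col], td[col], alternative="two-sided")
        if p < alpha:
            selected.append(col)
    return selected
```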
3.3. Automatic classification

In order to make an automatic classification of the recordings, the Weka machine learning toolkit (Hall et al., 2009) was used. This toolkit permits a collection of machine learning algorithms to be accessed for data mining tasks. Three different classifiers were used to compare their performance: the C4.5 decision tree (DT), the multilayer perceptron (MLP) and the support vector machine (SVM).

In addition, the 10-fold cross validation technique was used to create the training and testing datasets. To avoid classifier adaptation to specific speakers, all folds were created with recordings of different speakers: the recordings of each speaker were joined in the same fold, and each fold was balanced in terms of the number of recordings.

To analyze the performance of the classification, we used the classification rate. The unweighted average recall (UAR) (Schuller et al., 2016) was also used. This metric is the mean of sensitivity (recall of positive instances) and specificity (recall of negative instances). UAR was chosen as the classification metric because it weights each class equally regardless of its number of samples, so it represents more precisely the accuracy of a classification test using unbalanced data.
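The paper's classifiers were trained in Weka; purely as an illustration of the speaker-disjoint 10-fold setup and of the UAR metric, a rough scikit-learn equivalent (our sketch, with X, y and speaker_ids assumed to come from the screening step above) could look like this:

```python
# Analogous sketch in scikit-learn. GroupKFold keeps all recordings of one
# speaker in the same fold, mirroring the speaker-disjoint folds, and
# "balanced_accuracy" equals UAR (mean of sensitivity and specificity) for
# two classes.
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

classifiers = {
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "MLP": make_pipeline(StandardScaler(), MLPClassifier(max_iter=1000)),
    "DT": DecisionTreeClassifier(),  # CART here; Weka's J48 implements C4.5
}

def evaluate(X, y, speaker_ids):
    cv = GroupKFold(n_splits=10)
    for name, clf in classifiers.items():
        uar = cross_val_score(clf, X, y, groups=speaker_ids,
                              cv=cv, scoring="balanced_accuracy")
        print(f"{name}: mean UAR = {uar.mean():.2f}")
```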
3.4. Perception test

In order to evaluate the impact of prosody on the perception of listeners, we used prosody transfer techniques. These techniques have previously been used in other studies in the state of the art. For instance, Luo et al. (2017) investigated the role of different prosodic features in the naturalness of English L2 speech. The prosodic modification method was applied to native and L2 learners' speech. Later, they used a perceptual test to evaluate the impact of the prosody modification. A similar methodology was used by Escudero et al. (2017), where the characteristic prosodic patterns of the style of different groups of speakers were investigated. After the prosodic modification of the utterances, the characteristic prosodic patterns were validated using a perceptual test. The procedure described in Escudero et al. (2017) for transferring prosody is used in the experiments reported in this paper.

Fig. 2 shows the experimental procedure used to perform the perception test. The sentence ¡Hola tío Pau! ¿Sabes dónde vive la señora Luna? (Hello uncle Pau! Do you know where Mrs Luna lives?), recorded by all the speakers, was selected. This sentence was selected because of its prosodic richness (combining an affirmative and an interrogative sentence), because it was used in another of our studies (González-Ferreras et al., 2017) and because it was the most recorded sentence. To obtain a phonetic segmentation of the recordings, the BAS web services (Schiel, 1999; Kisler et al., 2017) were used. This tool returns the time intervals of each phoneme using the audio file and the transcription as inputs. Manual revision of the segmentation was necessary to correct transcription errors. The sentence was recorded by 22 TD speakers and by 16 speakers with DS. However, each speaker did not have the same number of recordings; in total, there were 62 recordings.

Once the segmentation was corrected, a prosody transfer algorithm implemented in Praat (Boersma, 2006) was executed. This algorithm transfers, phoneme by phoneme, the pitch, energy and duration from one audio to another. Therefore, the new audio file contains the original utterance but with the prosody transferred from another utterance. The algorithm was executed combining the audios of each speaker with the audios of the rest of the speakers, so, in total, 3525 audio files were generated (not all the speakers had the same number of recordings). As a result, there are four types of audio files, as shown in Fig. 2. Five audio files of each type were selected randomly for the perception test, so the test included twenty audio files, balanced in terms of gender.
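The actual algorithm operates phoneme by phoneme on pitch, energy and duration; the sketch below shows only the core resynthesis idea — a whole-utterance pitch transfer via Praat's PSOLA manipulation through parselmouth, with hypothetical file names, and with the phoneme alignment and the energy and duration transfer omitted:

```python
# Simplified, whole-utterance pitch transfer (not the full phoneme-level
# algorithm of Escudero et al., 2017); file names are hypothetical.
import parselmouth
from parselmouth.praat import call

target = parselmouth.Sound("utterance_td.wav")   # utterance to be modified
source = parselmouth.Sound("utterance_ds.wav")   # utterance donating prosody

manipulation = call(target, "To Manipulation", 0.01, 75, 600)
source_manip = call(source, "To Manipulation", 0.01, 75, 600)
pitch_tier = call(source_manip, "Extract pitch tier")

# Overwrite the target's pitch tier with the source contour and resynthesize.
call([manipulation, pitch_tier], "Replace pitch tier")
transferred = call(manipulation, "Get resynthesis (overlap-add)")
transferred.save("utterance_td_with_ds_pitch.wav", "WAV")
```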
The perception test was performed using a web application. First, personal information about the evaluator was collected. Then, the twenty audio files selected in the previous phase were presented in random order. The evaluators had to answer the following question for each utterance: keeping in mind the way of speaking, do you think that the person who is speaking has intellectual disabilities? Ignore the audio distortion produced by the non-natural voice synthesis. The possible answers to the question were on a 5-point Likert scale: 1 means "no way" and 5 means "very sure". Thirty evaluators judged each utterance using this scale. People without any specific background in speech therapy were selected for this test, as we were interested in the perception of naive listeners concerning the importance of prosody in the identification of speech from people with intellectual disability.

Fig. 2. Experimental procedure followed to perform the perceptual test. The utterances used in the test were: TDutt+TDpro (utterance of a TD person with prosody transferred from an utterance of another TD person), DSutt+TDpro (utterance of a person with DS with prosody transferred from an utterance of a TD person), TDutt+DSpro (utterance of a TD person with prosody transferred from an utterance of a person with DS) and DSutt+DSpro (utterance of a person with DS with prosody transferred from an utterance of another person with DS).

4. Results

4.1. Characterization results
Table 5
List of frequency, energy and temporal features with statistically significant differences (Mann–Whitney test with p-value < 0.01), sorted by mean differences. The meaning of the features in column Variable can be seen in Appendix A. The units are reported in Eyben et al. (2016).
Variable Control Control (CI 95%) Down syndrome Down syndrome (CI 95%)
F0 domain
F0_stddevRisingSlope 166.17 ± 231.44 (137.35,195.01) 220.85 ± 273.67 (192.08,249.62)
jitter_stddevNorm 1.15 ± 0.39 (1.11,1.21) 1.46 ± 0.47 (1.42,1.52)
jitter_mean 0.04 ± 0.02 (0.045,0.050) 0.03 ± 0.01 (0.035,0.039)
F0_pctlrange 4.63 ± 1.9 (4.4,4.88) 3.91 ± 2.88 (3.61,4.22)
F0_percentile20 26.89 ± 4.49 (26.33,27.45) 30.32 ± 4.63 (29.84,30.81)
F0_percentile50 29.18 ± 4.22 (28.66,29.71) 32.33 ± 4.28 (31.89,32.79)
F0_mean 29.3 ± 4.11 (28.79,29.82) 32.38 ± 4.14 (31.95,32.82)
F0_stddevNorm 0.13 ± 0.07 (0.129,0.147) 0.12 ± 0.07 (0.116,0.132)
F0_percentile80 31.52 ± 4.34 (30.99,32.07) 34.24 ± 4.67 (33.75,34.73)
Energy domain
loudness_percentile20 0.95 ± 0.38 (0.91,1.01) 1.77 ± 1.03 (1.66,1.88)
loudness_percentile50 1.93 ± 0.73 (1.84,2.02) 3.29 ± 2.22 (3.06,3.53)
loudness_mean 2.09 ± 0.78 (1.99,2.19) 3.37 ± 1.99 (3.17,3.58)
loudness_percentile80 3.15 ± 1.24 (3,3.31) 4.9 ± 2.94 (4.6,5.22)
loudness_pctlrange 2.19 ± 0.96 (2.08,2.32) 3.13 ± 2.06 (2.92,3.35)
loudness_stddevRisingSlope 15.3 ± 7.18 (14.41,16.2) 19.63 ± 14.24 (18.14,21.13)
loudness_stddevNorm 0.57 ± 0.07 (0.57,0.58) 0.49 ± 0.07 (0.48,0.5)
shimmer_mean 1.55 ± 0.38 (1.51,1.61) 1.36 ± 0.37 (1.32,1.4)
shimmer_stddevNorm 0.86 ± 0.14 (0.84,0.88) 0.78 ± 0.16 (0.77,0.8)
Temporal domain
silencePercentage 0.1 ± 0.11 (0.09,0.12) 0.22 ± 0.19 (0.2,0.24)
silencesMean 0.16 ± 0.2 (0.14,0.19) 0.31 ± 0.3 (0.28,0.35)
StddevVoicedSegmentLengthSec 0.15 ± 0.08 (0.14,0.16) 0.25 ± 0.2 (0.23,0.27)
MeanVoicedSegmentLengthSec 0.26 ± 0.15 (0.25,0.29) 0.44 ± 0.39 (0.41,0.49)
silencesPerSecond 0.39 ± 0.38 (0.35,0.44) 0.57 ± 0.4 (0.53,0.62)
VoicedSegmentsPerSec 3.42 ± 1.06 (3.29,3.55) 2.47 ± 1.04 (2.37,2.59)
loudnessPeaksPerSec 5.76 ± 1 (5.64,5.89) 4.39 ± 0.94 (4.29,4.49)
MeanUnvoicedSegmentLength 0.05 ± 0.02 (0.05,0.06) 0.06 ± 0.03 (0.06,0.07)
soundingPercentage 0.89 ± 0.11 (0.88,0.91) 0.77 ± 0.19 (0.76,0.8)
Table 5 shows the features with statistically significant differences (Mann–Whitney test with p-value < 0.01) related to the frequency, energy and temporal domains, sorted by mean differences. In the case of frequency, 9 of 12 features present significant differences. The first rows (from F0_stddevRisingSlope to jitter_mean) refer to the temporal evolution of the F0 contour. In all cases, the figures present a higher value for speakers with Down syndrome, both when the stddev value is analyzed and for the rising slope and jitter (the jitter value is lower because it focuses on the periods, which are the inverse of the F0 values). The last rows refer to mean values, coefficients of variation, ranges and percentiles of the F0 contour (from F0_pctlrange to F0_percentile80). Speakers with Down syndrome exhibit higher values than the speakers of the control group in all cases, with a lower coefficient of variation in the Down syndrome group. These results seem to indicate that the participants with Down syndrome use higher F0 values with more temporal changes in the F0 contours.

There are 9 of 14 energy features that present statistically significant differences (Mann–Whitney test with p-value < 0.01), as shown in Table 5. The first rows (from loudness_percentile20 to loudness_pctlrange) refer to mean, range and percentile values. Values are higher for speakers with Down syndrome in all cases. The last rows refer to the temporal variation of the energy values. In this case, Down syndrome speakers exhibit lower values. These results seem to indicate that participants with Down syndrome speak louder, with less variation in the energy.

With respect to the temporal features displayed in Table 5, 9 of 10 features presented statistically significant differences (Mann–Whitney test with p-value < 0.01). Speakers with Down syndrome use more pauses and these are longer (higher silencePercentage, silencesPerSecond and silencesMean). The length of the voiced segments is also longer, indicating that participants with Down syndrome speak more slowly.

As for spectral features (Table 6), 34 of 56 features showed statistically significant differences (Mann–Whitney test with p-value < 0.01). Results show that the LTAS could be a useful instrument to detect differences, as clear differences appear when the features related to slope, the Hammarberg index and the alpha ratio are taken into account. Formant 1 and Formant 3 (to a lesser degree) also allow differences to be identified. As expected, the MFCC values (the four analyzed) permit both groups to be separated. With respect to the variables related to harmonic differences, only two variables appear in the list: logRelF0H1A3_stddevNorm and logRelF0H1A3_mean.

4.2. Classification results

Table 7 shows the classification results in the task of identifying the group of the speaker (TD or DS) for each utterance. The classifiers explained in Section 3.3 were used, with only the features showing significant differences between the TD and DS groups as input. DT shows the lowest classification results for all feature groups. MLP shows a better performance using the frequency (UAR 0.64), temporal (UAR 0.78), frequency+energy+temporal (UAR 0.91) and all (UAR 0.95) feature groups. SVM works better with energy features (UAR 0.78). The results using spectral features are the same for the MLP and SVM classifiers (UAR 0.87).

In addition, the best classification results are obtained using all features, independently of which classifier is used. Frequency features show the worst performance when they are used alone. Energy and temporal features have similar results, with only 9 features per group. When frequency, energy and temporal features are used together, the performance is noticeably better than using each group separately. Finally, spectral features show a slightly worse performance than the all and frequency+energy+temporal feature sets.
Table 6
List of spectral features with statistically significant differences (Mann–Whitney test with p-value < 0.01), sorted by mean differences. The meaning of the features in column Variable can be seen in Appendix A. The units are reported in Eyben et al. (2016).

Variable | Control | Control (CI 95%) | Down syndrome | Down syndrome (CI 95%)

Table 7
Classification results for identifying the group of the speaker. Classification rate (c. rate) and UAR using different feature sets and different classifiers are reported. The features used are those with significant differences between the TD and DS groups. The classifiers are decision tree (DT), support vector machine (SVM) and multilayer perceptron (MLP). # is the number of input features in each set.

Set | # | SVM c. rate | SVM UAR | MLP c. rate | MLP UAR | DT c. rate | DT UAR
Frequency | 9 | 62.67 | 0.61 | 64.33 | 0.64 | 60.17 | 0.60
Energy | 9 | 79.33 | 0.78 | 76.00 | 0.76 | 72.50 | 0.71
Temporal | 9 | 76.83 | 0.76 | 77.83 | 0.78 | 74.33 | 0.75
Frequency+Energy+Temporal | 27 | 90.00 | 0.90 | 91.83 | 0.91 | 82.00 | 0.82
Spectral | 34 | 87.33 | 0.87 | 87.33 | 0.87 | 84.33 | 0.84
All | 61 | 94.17 | 0.94 | 95.17 | 0.95 | 86.50 | 0.87

Table 8
Number of responses of the perception test for each type of audio file. A response of 1 means "no way" and 5 means "very sure" in the identification of the audio file as a speaker with Down syndrome. NR means no response. TDutt+TDpro means utterance of a TD person with prosody transferred from an utterance of another TD person; DSutt+TDpro means utterance of a person with DS with prosody transferred from an utterance of a TD person; TDutt+DSpro means utterance of a TD person with prosody transferred from an utterance of a person with DS; and DSutt+DSpro means utterance of a person with DS with prosody transferred from an utterance of another person with DS.

Type | 1 | 2 | 3 | 4 | 5 | NR | Total
TDutt+TDpro | 124 | 15 | 3 | 1 | 4 | 3 | 150
DSutt+TDpro | 42 | 42 | 31 | 18 | 11 | 6 | 150
TDutt+DSpro | 17 | 21 | 34 | 43 | 31 | 4 | 150
DSutt+DSpro | 1 | 11 | 26 | 49 | 56 | 7 | 150
4.3. Perception test results

Table 8 shows the results of the perception test and Fig. 3 visually presents the differences between the groups. When the prosody of TD speakers was transferred to utterances of TD speakers, 84% of the answers identified the audios as TD speakers (answer 1 of row TDutt+TDpro). In this case, the doubts in the identification of the audio files as TD or DS represent only 2% of the answers (answer 3 of row TDutt+TDpro). On the other hand, when the prosody of DS speakers was transferred to utterances of DS speakers, 73% of the answers identified the audios as DS speakers (answers 4 and 5 of row DSutt+DSpro). In this case, the doubts in the identification of the audio files as TD or DS represent 18% of the answers (answer 3 of row DSutt+DSpro), and the identifications as TD are only 8% (answers 1 and 2 of row DSutt+DSpro).

The answers given about the audio files that combined utterances of one group with prosody of the other group present much more variability. However, prosody had more influence on the identification process than the original utterance. When the prosody of TD speakers was transferred to utterances of speakers with DS, 58% of the answers identified the audios as TD speakers (answers 1 and 2) versus only 20% of DS identifications (answers 4 and 5). On the other hand, 51% of the answers identified the audios as speakers with DS (answers 4 and 5) when the prosody of speakers with DS was transferred to an utterance of TD speakers, versus only 26% of TD identifications (answers 1 and 2). In both cases, the number of answers 3 is relevant (22% and 23%, respectively).

Moreover, two statistical tests were used to compare the answers obtained. The results of the Kruskal–Wallis non-parametric test showed significant differences (with a p-value < 0.001) between the answers given to the four groups (TDutt+TDpro, DSutt+TDpro, TDutt+DSpro and DSutt+DSpro).
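The percentages quoted in this section, and the Kruskal–Wallis result, can be reproduced from the Table 8 counts with non-responses excluded; a sketch using SciPy:

```python
# Reproduce the percentages quoted above and the Kruskal-Wallis test from the
# response counts in Table 8 (answers 1-5, NR excluded).
import numpy as np
from scipy.stats import kruskal

counts = {  # rows of Table 8, columns: answers 1..5
    "TDutt+TDpro": [124, 15, 3, 1, 4],
    "DSutt+TDpro": [42, 42, 31, 18, 11],
    "TDutt+DSpro": [17, 21, 34, 43, 31],
    "DSutt+DSpro": [1, 11, 26, 49, 56],
}
groups = {name: np.repeat([1, 2, 3, 4, 5], c) for name, c in counts.items()}

for name, resp in groups.items():
    td = np.isin(resp, [1, 2]).mean() * 100   # identified as TD
    ds = np.isin(resp, [4, 5]).mean() * 100   # identified as DS
    print(f"{name}: TD {td:.0f}%, DS {ds:.0f}%")

h, p = kruskal(*groups.values())              # p < 0.001, as reported
print(f"Kruskal-Wallis H = {h:.1f}, p = {p:.1e}")
```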
5.1. Characterization of the speech of people with Down syndrome

Fundamental frequency is significantly higher in speakers with Down syndrome. The same results were found by Albertini et al. (2010), Rochet-Capellan and Dohen (2015) and Lee et al. (2009). In addition, the F0 range is lower in speakers with Down syndrome, which can be explained by a less melodious intonation. Continuing with frequency, jitter is significantly lower in the DS group, as found by Lee et al. (2009) and by Seifpanahi et al. (2011).

Concerning temporal features, on the one hand, the number of continuous voiced regions per second is lower in the speakers with Down syndrome, which means that the oral production of speakers with Down syndrome was slower than that of control speakers. The reading difficulties that some people with Down syndrome present may have influenced these results. On the other hand, Van Borsel and Vandermeulen (2008) found disfluencies in Down syndrome speech, such as cluttering and stuttering. These disfluencies can produce the insertion of more silences and the presence of more temporal variety in the speech of people with Down syndrome, as found in this study.

In terms of energy, loudness features were found to be significantly higher in the speakers with Down syndrome and their range was higher. This result contradicts that reported by Albertini et al. (2010), which showed lower energy values in speakers with Down syndrome. Another study focused on vowels (Saz et al., 2009a) found an increase in the energy of unstressed vowels in Down syndrome speakers. Energy is always a difficult variable in the analysis of prosody, as its values are very dependent on the recording conditions: the dynamic range of the microphone and the distance between the speaker and the microphone. On the other hand, some of the participants have slight hearing problems, which may be another possible explanation for the higher energy values.

Our corpus also permitted the detection of differences related to the spectral features. Table 6 highlights the fact that the LTAS was proposed in Gauffin and Sundberg (1977) for the identification of breathy and hypokinetic voice. The relative amplitude of the first harmonic was also related to breathy voices by Hillenbrand and Houde (1996). The speech of people with DS is described as breathy by Wold (1979) and dysphonic by Moran (1986). MFCC features are

• There are a high number of features outside the spectral domain that present significant differences between speakers with Down syndrome and speakers without intellectual disabilities.
• Spectral features achieve high classification rates (up to 87%), but the classification rates of frequency, energy and temporal features together are higher (up to 91.83%).
• Utterances of control speakers with frequency, energy and phoneme duration transferred from speakers with Down syndrome are mostly perceived as an anomalous voice. In the same way, utterances of speakers with Down syndrome with frequency, energy and phoneme duration transferred from control speakers are mostly perceived as typical speech.

To the best of our knowledge, there are few studies that assess, in an experimental way, the relative weight of prosody in the perception of the speech of people with Down syndrome as a non-typical voice. The differences between speakers with Down syndrome and control speakers in the spectral domain can derive from physiological peculiarities of their phonatory system. Some could be corrected by surgery, but others cannot. However, frequency, energy and temporal characteristics can be trained using speech therapy techniques focusing on breathing and the repetition of activities. The results obtained in this paper show the potential benefits of prosody training.

The distance between the prosodic features of speakers with Down syndrome and those of control speakers can be used to devise a quality metric to be included in computer-assisted pronunciation training applications. Our future work on the implementation of an automatic evaluation module of voice quality is expected to benefit from the results of this paper. This module is to be included in our speech training tools (González-Ferreras et al., 2017), so spectral features will be useful to identify a recording as non-typical speech, while prosody analysis will be necessary for the evaluation of the players' improvement over the different game sessions.
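As an illustration of how such a metric could be built — our sketch, not a metric defined in the paper — a recording's selected prosodic features can be scored by their normalized distance from the control-group distribution:

```python
# Illustrative quality score: mean absolute z-score of a recording's selected
# frequency/energy/temporal features w.r.t. the control group. `control_df`
# and `feature_names` are assumed to come from the earlier screening step.
import numpy as np
import pandas as pd

def prosody_distance(sample: pd.Series, control_df: pd.DataFrame,
                     feature_names: list) -> float:
    """Larger values = further from typical prosody."""
    mu = control_df[feature_names].mean()
    sigma = control_df[feature_names].std()
    z = (sample[feature_names] - mu) / sigma
    return float(np.abs(z).mean())
```

A score like this could drive per-session feedback in the video game's training loop, tracking whether a player's prosody moves toward the control distribution over time.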
5.3. Limitations

The corpus size in speech analysis studies is very important to
The tables included in this appendix describe the features used in each of the domains. Frequency features are presented in Table A.9. Energy
features are described in Table A.10. Temporal features are explained in Table A.11. Spectral features are presented in Tables A.12 and A.13.
Table A.9
Frequency features explained. All functionals are applied to voiced regions only. Text in brackets shows the original name of the eGeMAPS features.
Feature Description
Table A.10
Energy features explained. All functionals are applied to voiced and unvoiced regions together. Text in brackets shows the original name of the eGeMAPS features.
Feature Description
loudness_percentile20 (loudness_sma3_percentile20.0) Percentile 20-th of estimate of perceived signal intensity from an auditory spectrum
loudness_percentile50 (loudness_sma3_percentile50.0) Percentile 50-th of estimate of perceived signal intensity from an auditory spectrum
loudness_mean (loudness_sma3_amean) Mean of estimate of perceived signal intensity from an auditory spectrum
loudness_percentile80 (loudness_sma3_percentile80.0) Percentile 80-th of estimate of perceived signal intensity from an auditory spectrum
loudness_pctlrange02 (loudness_sma3_pctlrange0-2) Range of the 20-th to 80-th percentile of the estimate of perceived signal intensity from an auditory spectrum
loudness_stddevRisingSlope (loudness_sma3_stddevRisingSlope) Standard deviation of the slope of rising signal parts of loudness
loudness_stddevNorm (loudness_sma3_stddevNorm) Coefficient of variation of estimate of perceived signal intensity from an auditory spectrum
shimmer_mean (shimmerLocaldB_sma3nz_amean) Mean of difference of the peak amplitudes of consecutive F0 periods
shimmer_stddevNorm (shimmerLocaldB_sma3nz_stddevNorm) Coefficient of variation of difference of the peak amplitudes of consecutive F0 periods
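For reference, the bracketed identifiers are the raw openSMILE output columns; a rename map such as the following (shown for the Table A.10 energy features, and applicable to the DataFrame produced by the extraction sketch in Section 3.2) recovers the short names used in this paper:

```python
# Hypothetical mapping from raw eGeMAPS output names (in brackets above) to
# the short names used in this paper, for the energy features of Table A.10.
import pandas as pd

EGEMAPS_TO_PAPER = {
    "loudness_sma3_percentile20.0": "loudness_percentile20",
    "loudness_sma3_percentile50.0": "loudness_percentile50",
    "loudness_sma3_amean": "loudness_mean",
    "loudness_sma3_percentile80.0": "loudness_percentile80",
    "loudness_sma3_pctlrange0-2": "loudness_pctlrange02",
    "loudness_sma3_stddevRisingSlope": "loudness_stddevRisingSlope",
    "loudness_sma3_stddevNorm": "loudness_stddevNorm",
    "shimmerLocaldB_sma3nz_amean": "shimmer_mean",
    "shimmerLocaldB_sma3nz_stddevNorm": "shimmer_stddevNorm",
}

def to_paper_names(df: pd.DataFrame) -> pd.DataFrame:
    """Rename raw openSMILE columns to the short names used in the tables."""
    return df.rename(columns=EGEMAPS_TO_PAPER)
```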
Table A.11
Temporal features explained.
Feature Description
Table A.12
Spectral features explained (part 1). If nothing is said, the features are applied to voiced and unvoiced regions together. Text in brackets shows the original name of the eGeMAPS features.
Feature Description
Table A.13
Spectral features explained (part 2). If nothing is said, the features are applied to voiced and unvoiced regions together. Text in brackets shows the original name of the eGeMAPS features.
Feature Description
alphaRatioV_mean (alphaRatioV_sma3nz_amean) Mean of the ratio of the summed energy from 50 to 1000 Hz and 1–5 kHz in voiced regions
mfcc4V_mean (mfcc4V_sma3nz_amean) Mean of Mel-Frequency Cepstral Coefficient 4 in voiced regions
hammarbergIndexV_mean (hammarbergIndexV_sma3nz_amean) Mean of the ratio of the strongest energy peak in the 0–2 kHz region to the strongest peak in the 2–5 kHz region in voiced regions
mfcc1V_mean (mfcc1V_sma3nz_amean) Mean of Mel-Frequency Cepstral Coefficient 1 in voiced regions
hammarbergIndexV_stddevNorm (hammarbergIndexV_sma3nz_stddevNorm) Coefficient of variation of the ratio of the strongest energy peak in the 0–2 kHz region to the strongest peak in the 2–5 kHz region in voiced regions
mfcc1_mean (mfcc1_sma3_amean) Mean of Mel-Frequency Cepstral Coefficient 1
logRelF0H1A3_mean (logRelF0-H1-A3_sma3nz_amean) Mean of the ratio of the energy of the first F0 harmonic (H1) to the energy of the highest harmonic in the third formant range (A3) in voiced regions
F3amplitudeLogRelF0_mean (F3amplitudeLogRelF0_sma3nz_amean) Mean of the ratio of the energy of the spectral harmonic peak at the third formant's centre frequency to the energy of the spectral peak at F0 in voiced regions
F3amplitudeLogRelF0_stddevNorm (F3amplitudeLogRelF0_sma3nz_stddevNorm) Coefficient of variation of the ratio of the energy of the spectral harmonic peak at the third formant's centre frequency to the energy of the spectral peak at F0 in voiced regions
slopeV5001500_mean (slopeV500-1500_sma3nz_amean) Mean of the linear regression slope of the logarithmic power spectrum within the 500–1500 Hz band in voiced regions
F2amplitudeLogRelF0_mean (F2amplitudeLogRelF0_sma3nz_amean) Mean of the ratio of the energy of the spectral harmonic peak at the second formant's centre frequency to the energy of the spectral peak at F0 in voiced regions
F2amplitudeLogRelF0_stddevNorm (F2amplitudeLogRelF0_sma3nz_stddevNorm) Coefficient of variation of the ratio of the energy of the spectral harmonic peak at the second formant's centre frequency to the energy of the spectral peak at F0 in voiced regions
F1bandwidth_stddevNorm (F1bandwidth_sma3nz_stddevNorm) Coefficient of variation of the bandwidth of the first formant in voiced regions
F1frequency_stddevNorm (F1frequency_sma3nz_stddevNorm) Coefficient of variation of the centre frequency of the first formant in voiced regions
F3frequency_stddevNorm (F3frequency_sma3nz_stddevNorm) Coefficient of variation of the centre frequency of the third formant in voiced regions
spectralFlux_stddevNorm (spectralFlux_sma3_stddevNorm) Coefficient of variation of the difference of the spectra of two consecutive frames
F3frequency_mean (F3frequency_sma3nz_amean) Mean of the centre frequency of the third formant in voiced regions