2020.alta-1.6


An Automatic Vowel Space Generator for Language Learners' Pronunciation Acquisition and Correction


Xinyuan Chao (1), Charbel El-Khaissi (2), Nicholas Kuo (1), Priscilla Kan John (1), and Hanna Suominen (1, 3, 4)
(1) Research School of Computer Science, The Australian National University, Australia
(2) College of Arts and Social Sciences, The Australian National University, Australia
(3) Data61, Commonwealth Scientific and Industrial Research Organisation, Australia
(4) Department of Future Technologies, University of Turku, Finland

{u6456596, charbel.el-khaissi, nicholas.kuo, priscilla.kanjohn, hanna.suominen}@anu.edu.au

Abstract

Speech visualisations are known to help language learners acquire correct pronunciation and to promote a better study experience. We present a two-step approach, based on two established techniques, to display the tongue tip movements of an acoustic speech signal on a vowel space plot. First, we use the Energy Entropy Ratio to extract vowels; then, we apply the Linear Predictive Coding root method to estimate Formant 1 and Formant 2. We collected acoustic data from one Modern Standard Arabic (MSA) lecturer and four MSA students. Our proof of concept was able to reflect differences between the tongue tip movements of a native MSA speaker and those of an MSA language learner at the vocabulary level. This paper addresses principled methods for generating features that reflect bio-physiological properties of speech and thus facilitates an approach that can be adapted to languages other than MSA.

1 Introduction

Second language (L2) learners have difficulty pronouncing words as well as native speakers do (Burgess and Spencer, 2000), which can create inconveniences in social interactions (Derwing and Munro, 2005). The difficulty language teachers face in providing pronunciation instruction adds further challenges to L2 pronunciation training and correction (Breitkreutz et al., 2001).

One solution for assisting pronunciation acquisition is the adoption of educational software applications (Levis, 2007). Well-designed language-education software can provide straightforward guidance for correcting L2 pronunciation through multiple information sources. One instance of such auxiliary systems is the Pronunciation Learning Aid (PLA), which supports language students towards native-like pronunciation in a target language (Fudholi and Suominen, 2018). A PLA achieves this by evaluating students' produced speech to reflect their pronunciation status. Another instance of auxiliary systems is visual cues, which serve as a friendly and accessible form of feedback to language students (Yoshida, 2018).

By combining language lecturers' teaching with auxiliary systems, our aim is to assist students both in a classroom setting and in their individual practice. We present a prototype system that displays visual feedback on tongue movements to assist language learners in acquiring correct pronunciation during L2 study. We have adopted a human-centred approach to the development of the system from a design-oriented perspective, applying a methodology that draws on Design Science Research (DSR) (Hevner et al., 2004) and Design Thinking (DT) (Plattner et al., 2009). Unlike machine learning methods that train deep neural networks to predict articulatory movements (Yu et al., 2018), our proposed system uses vowel space plots based on bio-physiological features to help visualise tongue movements.

In this work, we introduce a versatile prototype of our vowel space plot generator to address these challenges for students primarily learning MSA. Our design aims to allow beginner L2 learners to quickly visualise their pronunciation status compared with that of their language teachers. We provide a reference vowel space plot adjacent to the students' own plots to reflect clear differences and support self-correction. The envisioned applicability ranges from in-class activities, where the system provides immediate and personalised suggestions, to remote learning; in both cases, glossary files are pre-uploaded by teachers or textbook publishers.

2 Related Work

Traditional acoustic plots, such as waveforms, spectrograms, and other feature plots, are used to visualise speech signals and can provide sufficient information to phoneticians, expert scientists, and engineers (Fouz-González, 2015).
However, these methods fall short of providing straightforward suggestions for improving language students' pronunciation, or otherwise lack an intuitive and user-friendly graphical user interface (Neri et al., 2002). A study by Dibra et al. (2014), which combined waveforms with syllable highlighting to visualise pronunciation in ESL study, shows that using acoustic plots to support pronunciation acquisition is an implementable method.

Distinct from acoustic plots, another line of pronunciation visualisation is based on people's bio-physiological features. A pioneering study of this idea was introduced by Tye-Murray et al. (1993), who discussed the effect of increasing the amount of visible articulatory information, such as normally non-visible articulatory gestures, on speech comprehension. With improvements in equipment, ultrasound imaging, Magnetic Resonance Imaging (MRI), and ElectroMagnetic Articulography (EMA) have become alternative approaches for visualising the movement of articulators, and several case studies on pronunciation visualisation were implemented by Stone (2005), Narayanan et al. (2004), and Katz and Mehta (2015). However, these approaches remain difficult to apply in everyday language study, since the relevant equipment is often not available for in-class activities and self-learning, and the generated images and videos are hard for ordinary learners to understand.

Inspired by the imaging of articulator movement, the idea of the talking head, which uses a 3D mesh model to display both external and internal articulators, was introduced. Some of the fundamental work on talking heads was completed by Odisio et al. (2004) and Serrurier and Badin (2008), combined with techniques for articulatory movement prediction such as the Gaussian Mixture Model (GMM) (Toda et al., 2008), the Hidden Markov Model (HMM) (Ling et al., 2010), and popular deep learning approaches (Yu et al., 2019). Although talking heads are developing swiftly, research on their performance for pronunciation training is still insufficient.

The place and manner of articulation are well-established variables in the study of speech production and perception (e.g., Badin et al., 2010). Early research also realised the potential of using vowel space plots to achieve pronunciation visualisation, such as the studies by Paganus et al. (2006) and Iribe et al. (2012). These studies indicate that, for language learners, vowel space plots are easy to understand, straightforward, and provide the necessary information for understanding their own tongue placement and movement. Therefore, vowel space plots are considered a useful tool for language learners to practice and correct their pronunciation, relative to other pronunciation correction tools such as ultrasound visual feedback or more traditional pedagogical methods like explicit correction and repetition.

3 The Proposed Approach

To visualise tongue movement based on students' pronunciation practice, our proposed system receives the students' pronunciation audio signal as its input. After vowel detection, vowel extraction, and formant estimation, the system automatically generates the corresponding vowel space plot as its output. In this section, we introduce how engineering and linguistics insights inspired our proposed method, and the details of the audio signal processing procedures.

3.1 Design Methodology

To find a reliable solution to the challenges language students face in pronunciation acquisition, we adopted a design-based, human-centred approach, using the Design Thinking framework (Plattner et al., 2009) to identify the students' needs in terms of pronunciation practice and to transform these into requirements. In the Empathy and Define phases of DT, we defined our research question as "Finding an implementable and friendly approach for language learners to help them practice their pronunciation". After this, we participated in an MSA tutorial and observed students' behaviours during the process of pronunciation acquisition. Finally, we distributed an online questionnaire asking students about their in-class pronunciation training experience and their study preferences. The details of this survey were introduced in the thesis by Chao (2019).

Based on our observation of the MSA tutorial, we found that students feel comfortable interacting with other people (the lecturer or classmates) during pronunciation practice. One advantage of such interaction is that other people can provide feedback on the students' pronunciation. Another finding from the observation is that the process of pronunciation acquisition can be seen as a process of imitation: students need a gold standard, such as their teacher's pronunciation, as a reference to acquire new pronunciations and correct mispronunciations. The survey gave us some insights into students' preferences about pronunciation study patterns. One of the most important insights is that students are interested in multi-source feedback during pronunciation training. In ordinary pronunciation training, students can only receive auditory information about their pronunciation. Therefore, if straightforward and easy-to-understand visual feedback can be adopted in our proposed method, students will have a better experience and higher efficiency in pronunciation training.
The DT Empathy and Define phases gave us the insight that an ideal auxiliary pronunciation system should interact with learners, provide a gold-standard pronunciation reference, and display reliable visual feedback to learners. This insight led to ideation discussions and to the selection of vowel space plots as the visualisation tool. We augmented the use of DT with the DSR approach, in the manner of John et al. (2020), to guide the development of the artefact generated from our insights. Using the DSR method introduced by Peffers et al. (2007), we (1) identified our research question, based on a research project about assisting new language learners with pronunciation acquisition through potential educational software, (2) defined our solution according to our observation and survey, (3) designed and developed our prototype vowel space plot generator, (4) demonstrated our prototype to MSA lecturers and students, and (5) evaluated the prototype's performance. The DT and DSR processes underpin all our methods.

3.2 Vowel Space Plot

Our proposed prototype uses vowel space plots as a tool to visualise the acoustic input. This visualisation then forms the basis for subsequent feedback on pronunciation features.

A vowel space plot is generated by plotting vowel formant values on a graph that approximates the human vocal tract (Figures 1(a) and 1(b)). F1 and F2 vowel formant values correlate with the position of the tongue during articulation (Lieberman and Blumstein, 1988). Specifically, F1 is associated with the height of the tongue body (tongue height) and plotted along the vertical axis, while its F2 counterpart is associated with tongue placement in the oral cavity (tongue advancement) and plotted along the horizontal axis.

Figure 1: Vowel space plot and oral cavity. (a) An example of a vowel space plot, showing the location of different vowels in the vowel space. (b) Vowel space plot and oral cavity: the formant-articulation correlation.

The correlation between formant values and the tongue's height and placement is referred to as the formant-articulation relationship (Lee et al., 2015). These F1-F2 formant values can be rendered as x-y coordinates on a 2D plot to visualise the relative height and placement of the tongue in the oral cavity during articulation. When visualised alongside the tongue position of a native speaker's pronunciation, users can see the position of their tongue relative to a standard reference or benchmark of their choice, such as an L2 teacher or a native speaker. This visualisation supports pronunciation feedback and correction, as users can then rectify the placement and/or height of their tongue during articulation to align more closely with its position in an equivalent native-like pronunciation.

3.3 Vowel Detection and Perception

To extract vowels from an input speech signal, we first calculate relevant energy criteria and find speech segments. Once speech segments are confirmed, we then use defined thresholds to detect vowels within these speech segments. This section introduces the energy criteria and the thresholds we adopted in our practice.

Before detecting vowels in a speech signal, detrending and speech-background discrimination are two necessary pre-processing steps. These steps ensure that only the correct speech information is extracted from the original signal, while other possible noise is ignored. In this way, the prototype minimises the possibility of including irrelevant signals during the feature extraction process. Our prototype adopts the spectral subtraction algorithm, as first introduced by Boll (1979), to achieve speech-background discrimination, and the detrending can be achieved with the classic least squares method.
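As an illustration only, a minimal Python sketch of this pre-processing stage could look as follows. It is not the prototype's implementation: the window length is a common default, and the noise spectrum is estimated from the first 0.25 s of the recording, which is an assumption made purely for this example.

```python
import numpy as np
from scipy.signal import detrend, stft, istft

def preprocess(x, fs, noise_seconds=0.25):
    """Least-squares detrending followed by a simple magnitude
    spectral subtraction in the style of Boll (1979).

    Assumes (for illustration) that the first `noise_seconds` of the
    recording contain background noise only."""
    # Remove a linear trend fitted by least squares.
    x = detrend(x, type="linear")

    # Short-time Fourier transform of the detrended signal.
    _, _, X = stft(x, fs=fs, nperseg=512)

    # Estimate the noise magnitude spectrum from the leading frames
    # (hop size is nperseg // 2 = 256 samples by default).
    noise_frames = max(1, int(noise_seconds * fs / 256))
    noise_mag = np.abs(X[:, :noise_frames]).mean(axis=1, keepdims=True)

    # Subtract the noise magnitude and keep the noisy phase.
    mag = np.maximum(np.abs(X) - noise_mag, 0.0)
    X_clean = mag * np.exp(1j * np.angle(X))

    _, x_clean = istft(X_clean, fs=fs, nperseg=512)
    return x_clean
```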
Our approach uses the Energy Entropy Ratio (EER), a feature calculated from the input signal, as the criterion for finding vowels in the input speech signal. The EER can be calculated in the following steps.

The spectral entropy (SE) of a signal describes its spectral power distribution (Shen et al., 1998). SE treats the signal's normalised power distribution in the frequency domain as a probability distribution and calculates its Shannon entropy. To demonstrate the probability distribution of a signal, let a sampled time-domain speech signal be x(n), where the ith frame of x(n) is x_i(k) and the mth component of the power spectrum Y_i(m) is obtained from the Discrete Fourier Transformation (DFT) of x_i(k). If N is the length of the Fast Fourier Transformation (FFT), the probability distribution p_i(m) of the signal can then be expressed as

    p_i(m) = Y_i(m) / \sum_{l=0}^{N/2} Y_i(l).    (1)

The short-time spectral entropy of each frame of the signal can then be defined as

    H_i = -\sum_{k=0}^{N/2} p_i(k) \log p_i(k).    (2)

The spectral entropy reflects the disorder, or randomness, of a signal. The distribution of normalised spectral probability for noise is even, which makes the spectral entropy of noise large. Due to the presence of formants in the spectra of human speech, the distribution of normalised spectral probability is uneven, which makes the spectral entropy small. This phenomenon can be used together with speech-background discrimination to find the endpoints of speech segments.

In practical applications, SE is robust under the influence of noise. However, spectral entropy cannot be applied to signals with a low signal-to-noise ratio (SNR): when the SNR decreases, the time-domain plot of spectral entropy keeps its original shape but with a smaller amplitude. This makes SE insensitive for distinguishing speech segments from background noise. To provide a more reliable method for detecting the beginning and end of speech intervals, we introduce

    EER_i = \sqrt{1 + |E_i / H_i|},    (3)

where E_i is the energy of the ith frame of the speech signal and H_i is the corresponding SE. Speech segments have larger energy and smaller SE than silent segments, so dividing these two short-term factors makes the difference between speech segments and silent segments more obvious.

The first threshold T_1 is used as the criterion for judging whether a segment contains speech. The value of T_1 can be adjusted; in our case we chose T_1 = 0.1, which performs well. Thus, segments with an energy entropy ratio larger than T_1 are classified as speech segments.

Within each extracted speech segment, the maximum energy entropy ratio E_max and a scale factor r_2 are used to set another threshold T_2 for detecting vowel segments:

    T_2 = r_2 E_max.    (4)

Since different speech segments may have different thresholds T_2, the portions of a speech segment with an energy entropy ratio larger than T_2 are used to detect vowels.

Figure 2: Vowel detection and segmentation.

In an example visualisation of vowel detection and segmentation (Figure 2), three vowel phonemes, /a/, /i/, and /u/, are contained in the speech signal. The black dashed horizontal lines show the threshold value T_1 = 0.1 for speech segment detection, while the solid orange lines show the detected speech segments within the speech signal. Similarly, the black vertical lines in bold indicate the dynamic threshold value T_2 for vowel detection across the different speech segments, while the blue dashed lines display the vowel segments.
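A minimal sketch of this detection step, following the definitions in Equations (1)-(4), is given below. The frame length, hop size, and scale factor r_2 are illustrative choices rather than values reported here, and the EER is normalised to [0, 1] so that a small threshold such as T_1 = 0.1 is meaningful; the normalisation is our assumption, since the scaling used in the prototype is not stated.

```python
import numpy as np

def frame_signal(x, frame_len=512, hop=256):
    """Split a 1-D signal into overlapping frames (one frame per row)."""
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n_frames)])

def energy_entropy_ratio(x, frame_len=512, hop=256, eps=1e-12):
    """Per-frame Energy Entropy Ratio, following Eqs. (1)-(3)."""
    frames = frame_signal(x, frame_len, hop)
    energy = np.sum(frames ** 2, axis=1)                    # E_i
    spectrum = np.abs(np.fft.rfft(frames, n=frame_len, axis=1)) ** 2
    p = spectrum / (np.sum(spectrum, axis=1, keepdims=True) + eps)   # Eq. (1)
    entropy = -np.sum(p * np.log(p + eps), axis=1)          # Eq. (2): H_i
    return np.sqrt(1.0 + np.abs(energy / (entropy + eps)))  # Eq. (3)

def detect_vowel_segments(x, t1=0.1, r2=0.7, frame_len=512, hop=256):
    """Speech segments where EER > T1; within each, frames whose EER
    exceeds T2 = r2 * max(EER) (Eq. 4).  Returns a list of
    (start_frame, end_frame, vowel_mask) tuples."""
    eer = energy_entropy_ratio(x, frame_len, hop)
    eer = eer / eer.max()          # normalised to [0, 1] (our assumption)
    speech = eer > t1              # T1 = 0.1 as in the text above
    results, start = [], None
    for i, flag in enumerate(np.append(speech, False)):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            t2 = r2 * eer[start:i].max()       # Eq. (4)
            results.append((start, i, eer[start:i] > t2))
            start = None
    return results
```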
3.4 Formant Estimation

Formant value estimation is the next task after the detection of vowel segments in the input speech signals. Our prototype adopts the Linear Predictive Coding (LPC) root method to estimate the F1 and F2 formant values of the vowels.

A common pre-processing step for linear predictive coding is pre-emphasis (highpass) filtering. We apply a straightforward first-order highpass filter to complete this task.

A simplified speech production model, which we adopted in our work, is represented in Figure 3, following Rabiner and Schafer (2010). As shown in Figure 3, s[n] is the output of the speech production system, u[n] is the excitation from the throat, G is a gain parameter, and H(z) is the vocal tract system function. Let us consider the transfer function H(z) as an Auto-Regression (AR) model

    H(z) = G / A(z) = G / (1 - \sum_{k=1}^{p} a_k z^{-k}),    (5)

where A(z) is the prediction error filter, which is used in the LPC root method below.

Figure 3: A simplified model of speech production.

Decomposing the prediction error filter A(z) into its polynomial roots can be used to estimate the centres of the formants and their bandwidths. This method is known as the LPC root method, which was first introduced by Snell and Milinazzo (1993). Notably, the roots of A(z) mostly occur in complex conjugate pairs.

Let z_i = r_i e^{jθ_i} be any complex root of A(z), so that its conjugate z_i* = r_i e^{-jθ_i} is also a root of A(z). Further, if F_i is the formant frequency corresponding to z_i and B_i is its bandwidth at 3 dB, then we have the relationships 2πT F_i = θ_i and e^{-B_i πT} = r_i, where T is the sampling period. Their solutions are F_i = θ_i/(2πT) and B_i = -ln r_i/(πT).

Since the order p of the prediction error filter is set in advance, the number of complex conjugate root pairs is at most p/2. This makes it straightforward to find which pole belongs to which formant, since extra poles with a bandwidth larger than a formant's typical bandwidth can conveniently be excluded.
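A minimal sketch of this estimation step is shown below. It is illustrative rather than the prototype's implementation: the pre-emphasis coefficient, the LPC order, and the bandwidth cut-off are common textbook choices, and the LPC coefficients are obtained with the autocorrelation method via a Toeplitz solve.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_coefficients(x, order):
    """Autocorrelation-method LPC: returns A(z) = [1, -a_1, ..., -a_p]."""
    x = x * np.hamming(len(x))                        # taper the analysis frame
    r = np.correlate(x, x, mode="full")[len(x) - 1:]  # autocorrelation r[0..]
    a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])
    return np.concatenate(([1.0], -a))

def estimate_f1_f2(frame, fs, order=12, preemph=0.97, max_bw=400.0):
    """Estimate (F1, F2) in Hz from one vowel frame using the LPC root method."""
    # First-order highpass pre-emphasis filter.
    frame = np.append(frame[0], frame[1:] - preemph * frame[:-1])

    # Roots of the prediction error filter A(z); keep one root per conjugate pair.
    roots = np.roots(lpc_coefficients(frame, order))
    roots = roots[np.imag(roots) > 0]

    T = 1.0 / fs
    freqs = np.angle(roots) / (2 * np.pi * T)     # F_i = theta_i / (2 pi T)
    bws = -np.log(np.abs(roots)) / (np.pi * T)    # B_i = -ln(r_i) / (pi T)

    # Exclude poles with implausibly wide bandwidths or very low frequencies.
    keep = (bws < max_bw) & (freqs > 90.0)
    candidates = np.sort(freqs[keep])
    if len(candidates) < 2:
        return None
    return candidates[0], candidates[1]           # F1, F2
```

In the pipeline described above, the input `frame` would be one of the vowel segments detected in Section 3.3, and each resulting (F1, F2) pair becomes one point on the vowel space plot.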
4 Preliminary Evaluation Experiment

We conducted two experiments to evaluate the performance of our prototype. First, we invited a native Arabic speaker, who is a Modern Standard Arabic (MSA) lecturer at The Australian National University (ANU), to provide a glossary of MSA lexical items and their corresponding utterances. These utterances constituted the gold-standard or target pronunciation for users. Then, we invited four MSA language students to use our prototype by pronouncing four MSA words. For each lexical item pronounced, the articulation was visualised on a vowel space plot so that users could compare their pronunciation with the native-like, target pronunciation of their lecturer. Following this visual comparison, users were prompted to pronounce the same word again.

In these experiments, we want to verify the feasibility and accessibility of our prototype. The feasibility of our prototype was determined by whether the interpretation of the comparison plots in the first instance supported improved pronunciation of the same word in subsequent iterations. The accessibility refers to whether our prototype can provide implementable and correct feedback for learners to visualise their pronunciation.

Ethical Approval (2018/520) was obtained from the Human Research Ethics Committee of The Australian National University. Each study participant provided written informed consent.

4.1 Feasibility Test

The functionality of the prototype, including speech detection, vowel segmentation, and plot generation, was first verified by using a series of acoustic signals as input and observing the accuracy of the output vowel space plots. The MSA lecturer's pronunciation of the MSA lexicon was used here to test the veracity of the prototype output. The MSA dataset comprised ten lexical items[1] and their corresponding pronunciations, henceforth referred to as the "standard reference" (see Table 1).

[1] Refer to MSA Vocabulary Selection (Section 8) for the selection criteria for this list.
Table 1: The ten reference vocabulary items.

Vocabulary         MSA     Transliteration  Vowels
clock              ساعة    /sā‘a/           2
eggs               بيض     /bayḍ/           1
mosque             جامع    /jāmi‘/          2
phone              هاتف    /hātif/          2
shark              قرش     /qirš/           1
soap               صابون   /ṣābūn/          2
spring             ربيع    /rabī‘/          2
street             شارع    /šāri‘/          2
student (male)     طالب    /ṭālib/          2
student (female)   طالبة   /ṭāliba/         3
watermelon         بطيخ    /bāṭṭīk/         2

For each vocabulary item and corresponding audio input, we observed the vowel space plot generated by our prototype. The accuracy and accessibility of our prototype's speech and vowel detection functionality were determined by its ability to correctly visualise the tongue positioning for each vowel in a word. This was assessed through a comparison with statistical averages of formant values for the same vowel. We used a Sony Xperia Z5 mobile phone to collect the utterances of the glossary from the MSA lecturer. The utterances were recorded as individual mp3 files, which can be used as input to our prototype; each mp3 file contains one MSA vocabulary item from the glossary. These mp3 files were recorded in the lecturer's office to reduce background noise.

4.2 Accessibility Test

The verification of our prototype's functionality alone is insufficient to prove that the prototype can assist in providing valuable corrective feedback to users. Therefore, we invited two male students and two female students who were enrolled in a beginner MSA course (ARAB1003) at ANU to voluntarily participate in our accessibility test. The success of our prototype's feedback function was determined by whether the language learners could interpret their pronunciation on a vowel space plot against the standard reference in order to produce a more native-like pronunciation of the same word.

Table 2: The student test data of four MSA words.

Vocabulary         MSA     Transliteration  Vowels
shark              قرش     /qirš/           1
soap               صابون   /ṣābūn/          2
student (male)     طالب    /ṭālib/          2
student (female)   طالبة   /ṭāliba/         3

The volunteers were aged between 19 and 22 and had completed an introductory MSA course (ARAB1002), which meant they had basic knowledge of MSA and were familiar with its alphabet and phonetic inventory. Four lexical items from the glossary in the standard reference, shown in Table 2, were selected as test items for the volunteers to pronounce. The volunteers pronounced each of the four vocabulary items independently, and each item was recorded as an audio file. These files were processed by our prototype, and the corresponding vowel space plots were generated to visualise their pronunciation of each word. Then, their vowel space plots were compared to the corresponding vowel space plot of the standard reference. Participants were advised to use this comparison plot as the basis for their pronunciation feedback before repeating the pronunciation of the word. Then, participants pronounced the word a second time and the generated plot was once again compared to the standard reference. This time, the comparison assessed whether the participant's articulation of the vowel was more closely aligned with the standard reference than the first pronunciation. In other words, the second iteration of pronunciation allowed an assessment of whether our prototype provided valuable visualisation information to participants, and whether it helped them immediately correct and improve their pronunciation relative to the standard reference.

We participated in one of the MSA course tutorials and were keen to see the quality of acoustic data collected in a noisy environment, such as a classroom. The collecting device was a MacBook Pro 2017. We wrote a Matlab recorder function with a GUI to collect the utterances provided by the volunteers from this tutorial. The utterances were collected as individual wav files, each containing one word from a volunteer.
Figure 4: The waveform, energy-entropy ratio, and vowel space plot for the standard reference word "soap" (provided by an MSA teacher). (a) Vowel segmentation of the standard reference "soap". (b) Vowel space plot of the standard reference "soap" with the two vowels /ā/ and /ū/.

5 Results and Discussion

We used the collected speech signals to test the feasibility and accessibility of our prototype. To test the feasibility, we fed the standard references to our prototype and verified whether the output vowel space plots reflect the correct tongue motion of the corresponding words. As for accessibility, we used the student test data to generate vowel space plots, then found the corresponding words in the standard reference and compared the two vowel space plots. An ideal result is that the student test data reflect the student's tongue motion, and that the student can work out how to improve the pronunciation by comparing the two vowel space plots. With the vowel space plots of the same words from the student test data and the standard reference, we compared the corresponding plots to see whether the vowel space plots can provide useful feedback for pronunciation correction. In this paper, we display the MSA word "soap" (صابون, /ṣābūn/) as an example of our results.

5.1 Feasibility

To test the feasibility of our prototype, we picked one vocabulary item (the word "soap") from the standard reference and verified whether the output vowel space plot reflects the tongue motion. The waveform, energy-entropy ratio, and vowel space plot for the standard reference word "soap" are shown in Figure 4. In Figure 4(a), two voice segments, between the solid orange lines, were recognised from the input speech signal, and each of the two voice segments contained one vowel, between the dashed blue lines. In Figure 4(b), the two vowels /ā/ and /ū/ were mapped into the vowel space. This vowel space plot was made available to the users so that they can become familiar with their tongue position in the oral cavity and use this visual feedback towards pronouncing the word "soap" correctly (Figure 5).

Figure 5: The tongue motion for the MSA word "soap".

5.2 Accessibility

To test the accessibility of our prototype, we compared the vowel space plot of the standard reference with the vowel space plot of the student test data. We continue to use the word "soap" as an example; the figures below show the results for the MSA word "soap" pronounced by the four anonymous students. Students see two vowel space plots from the prototype: one shows the standard reference, and the other reflects their own pronunciation.

Figure 6: The tongue movement (reference and student1's practice) for the MSA word "soap".

Figure 7: The vowel space plots from the standard reference and student1. (a) Standard reference of "soap". (b) Vowel space plot of user input-1 "soap" with /ā/ and /ū/.

Figure 8: The vowel space plots from the standard reference and student1, with arrows. (a) Standard reference of "soap" with arrow. (b) Vowel space plot of user input-1 "soap" with arrow.
Figure 6 shows the overlaid vowel space plots of the standard reference (blue crosses) and student1's pronunciation practice (red crosses). Since the key information in a vowel space plot is the trend of the tongue movement, it is not strictly necessary to compare the standard reference and the student's pronunciation on the same vowel space plot. From Figure 7, student1's tongue should be drawn back instead of moving to the front of the oral cavity; the vertical down-up movement of the tongue was correct. Figure 8 shows the tongue movement with an arrow, which is more readable and friendly for students and helps them perceive their tongue movement.
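A minimal sketch of how such a comparison plot with arrows could be drawn is given below, assuming the (F1, F2) pairs for the reference and the student have already been estimated; the helper name and the numeric values are illustrative only, not measured data from this study. Both axes are inverted, following the vowel-chart convention that places high front vowels at the top left.

```python
import matplotlib.pyplot as plt

def plot_comparison(reference, student, title="soap"):
    """Overlay reference and student (F1, F2) trajectories, with an arrow
    from the first vowel to the last vowel showing the tongue movement."""
    fig, ax = plt.subplots()
    for label, points, colour in [("reference", reference, "tab:blue"),
                                  ("student", student, "tab:red")]:
        f1 = [p[0] for p in points]
        f2 = [p[1] for p in points]
        ax.plot(f2, f1, marker="x", linestyle="", color=colour, label=label)
        ax.annotate("", xy=(f2[-1], f1[-1]), xytext=(f2[0], f1[0]),
                    arrowprops=dict(arrowstyle="->", color=colour))
    # Vowel-chart convention: invert both axes.
    ax.invert_xaxis()
    ax.invert_yaxis()
    ax.set_xlabel("F2 (Hz), tongue advancement")
    ax.set_ylabel("F1 (Hz), tongue height")
    ax.set_title(f"Vowel space plot: {title}")
    ax.legend()
    return fig

# Illustrative (F1, F2) values in Hz for /aa/ then /uu/; not measured data.
reference = [(700, 1200), (350, 800)]
student = [(680, 1400), (400, 1100)]
plot_comparison(reference, student).savefig("soap_comparison.png")
```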

Figure 9: The vowel space plots of the standard reference and student2. (a) Standard reference "soap". (b) Vowel space plot of user input-2 "soap" with /ā/ and /ū/ following the wrong trajectory.

Student2, on the other hand, should focus on the pronunciation of the second vowel /ū/. According to Figure 9, the pronunciation of "soap" by student2 had the correct tongue motion trajectory when compared with the standard reference in Figure 1. This student's vertical down-up movement of the tongue was correct. A small defect in this practice was an unexpected vowel at the end of the pronunciation. For further practice, the advice for this student targeted pronouncing a clean and neat end of the word "soap".

Figure 10: The vowel space plots of the standard reference and student3. (a) Standard reference "soap". (b) Vowel space plot of user input-3 "soap" with /ā/ and /ū/.

Student3, in turn, had the correct tongue motion, and the pronunciation was good as well. However, the starting point of the first vowel /ā/ was somewhat higher than in the standard reference. Hence, our suggestion for student3 was to lower the starting position of the word "soap".

Figure 11: The waveform, energy-entropy ratio, and vowel space plot of student4. (a) Standard reference "soap". (b) Vowel space plot of user input-4 "soap".

Finally, student4 and student1 made similar mispronunciations: student4 should draw the tongue back instead of moving it forward while pronouncing the second vowel /ū/. Besides this mistake, another point worthy of notice was that an unexpected vowel occurred at the end of this speech signal. According to waveform analysis, this vowel was not pronounced by student4 but originated from background noise, because the data were collected during an in-class activity. This means that sudden background noise can still influence the analysis result even though our prototype had already applied its denoising algorithm to this speech signal. Hence, we suggest adopting a more effective denoising function in the future development of the system, to satisfy the requirement that students can practice their pronunciation anywhere, including in noisy settings.

6 Conclusion

This paper presented an initial proof of concept that uses vowel space plots to enhance pronunciation learning in second languages. The idea of our prototype was based on our early-stage DSR process and an MSA language student survey (Chao, 2019). Our prototype was designed to generate clear visual feedback from speech input, and it was tested on assisting the pronunciation of L2 MSA beginners.

Our main contribution is the vowel space plot generator prototype, which produces easily understandable visual cues by analysing the bio-physiological features of user speech. Our prototype is hence user-friendly for improving language learners' pronunciation.
To gain evidence of our prototype being effective in assisting language learners' pronunciation training, we designed an experiment to test the feasibility and accessibility of the prototype at the vocabulary level and invited language students to provide their audio data for experimental use. Also, according to the students' feedback, we proposed a series of future developments, which are described in the next section. One limitation of our presented work is that there was no re-testing of pronunciation after the students received feedback from the system, to check that their pronunciation improved. We plan to deploy such re-tests in our next-stage experiments.
7 Future Work

In the future, we aim to build on this current work to verify and quantify the pronunciation improvements gained by each user. This will help us to understand the effectiveness of the current design of the prototype and enable us to select appropriate extensions to enhance L2 learning experiences.

We are currently considering building a correction subsystem for pronunciation practice. In addition to the existing vowel space plots, we theorise that it would be helpful to construct a system that could directly compare our users' speech to a set of externally stored standard references. This should enable users to correct their pronunciation with higher precision and efficiency. Such a design could also potentially provide personalised pronunciation assistance by analysing user-specific pronunciation patterns.

Future iterations also intend to test a much more varied selection of MSA words that capture both short and long vowels in word-initial, medial, and final positions, as well as the two MSA diphthongs /aw/ (e.g. ضوء /ḍaw/ 'light') and /aj/ (e.g. بيت /bajt/ 'house') and MSA consonants.

Another potential future direction is to animate the tongue motion. Iribe et al. (2012) showed that such animations could achieve better results than their static counterparts. We expect an animated version of the vowel space plot to display tongue motions while people speak, helping users to better conceptualise pronunciation in real time.

8 Clarification: MSA Vocabulary Selection

The justification for the selection of the above ten words was based on a variety of factors. First, the selected vocabulary items were basic MSA words chosen in consultation with an MSA teacher to ensure students had been explicitly taught or otherwise been exposed to them during the course of their language learning.

Second, the selected words were restricted to words of one to three syllables. This restriction ensured that sentence-level factors affecting the articulation of vowels were excluded (e.g. the /t/-insertion rule in Iḍāfah structures: ساعة /sā‘a/ "clock" vs. ساعة يوسف /sā‘at jusif/ "Joseph's clock"), thus allowing a straightforward assessment of how the prototype detected speech boundaries and extracted the relevant features from vowel segments.

Finally, the ten words selected captured the three cardinal MSA vowels: /a/, /i/, and /u/. Although these vowels exist in the English phonemic inventory and do not theoretically pose a challenge for English-speaking L2 learners of MSA, when they are considered alongside surrounding MSA consonants their articulation becomes more difficult, as in the well-known case of emphatic spreading caused by the presence of pharyngeal or pharyngealised consonants ('emphatics') (e.g. Shosted et al., 2018).

Acknowledgement

The authors express their gratitude to the participants and other contributors of this study. Furthermore, we would like to thank our three anonymous ALTA reviewers for their careful comments, which helped us to improve this present work.

We would also like to thank Ms Leila Kouatly, an MSA lecturer at the Australian National University (ANU), for helping us with the selection of the MSA glossary. She also provided us with a series of opportunities to join her classes and tutorials, and we acquired many valuable observations on her pedagogical methods and skills. Her activity in promoting our study ensured that students actively participated in our student experience survey and preliminary evaluation experiments.

Moreover, we thank Dr Emmaline Louise Lear and Mr Frederick Chow. Dr Lear helped us to acquire ethics approval for our study and provided us with inspiration from an educator's perspective. Mr Chow helped us with communication with the ANU Centre for Arab and Islamic Studies, which was crucial for our study, and commented on engineering details of our project. They also provided insightful suggestions for an early presentation of this study as examiners. We would like to express our sincere appreciation for their help and remarkable work.

Finally, we acknowledge the funding and support by Australian Government Research Training Program Scholarships and ANU for the first three authors' higher degree research studies.
References

Pierre Badin, Yuliya Tarabalka, Frédéric Elisei, and Gérard Bailly. 2010. Can you 'read' tongue movements? Evaluation of the contribution of tongue display to speech understanding. Speech Communication, 52:493–503.

Steven Boll. 1979. Suppression of acoustic noise in speech using spectral subtraction. IEEE Transactions on Acoustics, Speech, and Signal Processing, 27(2):113–120.

Judy Breitkreutz, Tracey M Derwing, and Marian J Rossiter. 2001. Pronunciation teaching practices in Canada. TESL Canada Journal, pages 51–61.

John Burgess and Sheila Spencer. 2000. Phonology and pronunciation in integrated language teaching and teacher education. System, 28(2):191–215.

Xinyuan Chao. 2019. Supporting students' ability to speak a foreign language intelligibly using educational technologies: The case of learning Arabic in the Australian National University. College of Engineering and Computer Science, The Australian National University, Canberra, ACT, Australia.

Tracey M. Derwing and Murray J. Munro. 2005. Second language accent and pronunciation teaching: A research-based approach. TESOL Quarterly, 39(3):379–397.

Dorina Dibra, Nuno Otero, and Oskar Pettersson. 2014. Real-time interactive visualization aiding pronunciation of English as a second language. In 2014 IEEE 14th International Conference on Advanced Learning Technologies, pages 436–440.

Jonás Fouz-González. 2015. Trends and directions in computer-assisted pronunciation training. Investigating English Pronunciation Trends and Directions, pages 314–342.

Dzikri Fudholi and Hanna Suominen. 2018. The importance of recommender and feedback features in a pronunciation learning aid. In Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications, pages 83–87, Melbourne, Australia. Association for Computational Linguistics.

Alan R. Hevner, Salvatore T. March, Jinsoo Park, and Sudha Ram. 2004. Design science in information systems research. MIS Quarterly, 28(1):75–105.

Yurie Iribe, Takurou Mori, Kouichi Katsurada, Goh Kawai, and Tsuneo Nitta. 2012. Real-time visualization of English pronunciation on an IPA chart based on articulatory feature extraction. Interspeech 2012, 2:1270–1273.

Priscilla Kan John, Emmaline Lear, Patrick L'Espoir Decosta, Shirley Gregor, Stephen Dann, and Ruonan Sun. 2020. Designing a visual tool for teaching and learning front-end innovation. Technology Innovation Management Review, 10.

William F. Katz and Sonya Mehta. 2015. Visual feedback of tongue movement for novel speech sound learning. Frontiers in Human Neuroscience, 9:612.

Shao-Hsuan Lee, Jen-Fang Yu, Yu-Hsiang Hsieh, and Guo-She Lee. 2015. Relationships between formant frequencies of sustained vowels and tongue contours measured by ultrasonography. American Journal of Speech-Language Pathology, 24:739–749.

John Levis. 2007. Computer technology in teaching and researching pronunciation. Annual Review of Applied Linguistics, 27:184.

Philip Lieberman and Sheila E. Blumstein. 1988. Speech Physiology, Speech Perception, and Acoustic Phonetics. Cambridge Studies in Speech Science and Communication. Cambridge University Press.

Zhen-Hua Ling, Korin Richmond, and Junichi Yamagishi. 2010. An analysis of HMM-based prediction of articulatory movements. Speech Communication, 52(10):834–846.

Shrikanth Narayanan, Krishna Nayak, Sungbok Lee, Abhinav Sethy, and Dani Byrd. 2004. An approach to real-time magnetic resonance imaging for speech production. The Journal of the Acoustical Society of America, 115(4):1771–1776.

Ambra Neri, Catia Cucchiarini, Helmer Strik, and Lou Boves. 2002. The pedagogy-technology interface in computer assisted pronunciation training. Computer Assisted Language Learning, 15(5):441–467.

Matthias Odisio, Gérard Bailly, and Frédéric Elisei. 2004. Tracking talking faces with shape and appearance models. Speech Communication, 44:63–82.

Annu Paganus, Vesa-Petteri Mikkonen, Tomi Mäntylä, Sami Nuuttila, Jouni Isoaho, Olli Aaltonen, and Tapio Salakoski. 2006. The vowel game: Continuous real-time visualization for pronunciation learning with vowel charts. In Advances in Natural Language Processing, pages 696–703, Berlin, Heidelberg. Springer Berlin Heidelberg.

Ken Peffers, Tuure Tuunanen, Marcus A Rothenberger, and Samir Chatterjee. 2007. A design science research methodology for information systems research. Journal of Management Information Systems, 24(3):45–77.

Hasso Plattner, Christoph Meinel, and Ulrich Weinberg. 2009. Design-Thinking. Springer.

Lawrence Rabiner and Ronald Schafer. 2010. Theory and Applications of Digital Speech Processing, 1st edition. Prentice Hall Press, Upper Saddle River, NJ, USA.

Antoine Serrurier and Pierre Badin. 2008. A three-dimensional articulatory model of the velum and nasopharyngeal wall based on MRI and CT data. The Journal of the Acoustical Society of America, 123:2335–55.
Jia-lin Shen, Jeih-weih Hung, and Lin-shan Lee. 1998.
Robust entropy-based endpoint detection for speech
recognition in noisy environments. In Fifth interna-
tional conference on spoken language processing.
Ryan K Shosted, Maojing Fu, and Zainab Hermes. 2018. Arabic pharyngeal and emphatic consonants, chapter 3. Routledge.
R. C. Snell and F. Milinazzo. 1993. Formant loca-
tion from lpc analysis data. IEEE Transactions on
Speech and Audio Processing, 1(2):129–134.
Maureen Stone. 2005. A guide to analysing tongue mo-
tion from ultrasound images. Clinical Linguistics &
Phonetics, 19(6-7):455–501. PMID: 16206478.
Tomoki Toda, Alan W Black, and Keiichi Tokuda.
2008. Statistical mapping between articulatory
movements and acoustic spectrum using a gaussian
mixture model. Speech Communication, 50(3):215–
227.
Nancy Tye-Murray, Karen Iler Kirk, and Lorianne
Schum. 1993. Making typically obscured articula-
tory activity available to speech readers by means of
videofluoroscopy. In NCVS Status and Progress Re-
port, volume 4, pages 41–63.
Marla Tritch Yoshida. 2018. Choosing technology
tools to meet pronunciation teaching and learning
goals. The CATESOL Journal, 30(1):195–212.
Lingyun Yu, Jun Yu, and Qiang Ling. 2018. Syn-
thesizing 3d acoustic-articulatory mapping trajecto-
ries: Predicting articulatory movements by long-
term recurrent convolutional neural network. In
2018 IEEE Visual Communications and Image Pro-
cessing (VCIP), pages 1–4.
Lingyun Yu, Jun Yu, and Qiang Ling. 2019. BLTRCNN-based 3-D articulatory movement prediction: Learning articulatory synchronicity from both text and audio inputs. IEEE Transactions on Multimedia, 21(7):1621–1632.
