2009 Ninth IEEE International Conference on Advanced Learning Technologies

Context Dependent Feature based Bottom-up rescoring SVM Classifier in Children’s English Stress Mis-pronunciation detection
Shen Huang, Hongyan Li, Shijin Wang, Jiaen Liang, Bo Xu
Digital Content Tech Research Center, Institute of Automation, Chinese Academy of Sciences, China
{Shenhuang,hyli,sjwang,jeliang,xubo}@hitic.ia.ac.cn

Abstract

Automatic assessment of word stress errors is an integral part of an oral language grading system. However, two problems remain unsolved: the properties of a vowel depend on its context, and the data for different vowel classes are sparse. This paper briefly introduces a hybrid method that combines traditional prosodic features with the proposed context dependent strategies. In classification, word stress is determined by weighting a bottom-up group tree with a modified distributed probability score. In the experiments, the proposed system achieves an overall equal error rate of 9.41%, a relative reduction that shows its suitability for use in a stress error detection system.

1. Introduction

With the development of computer aided language learning (CALL), automatic speech rating systems have become a second guide for children and a fine-grained analysis tool for teachers. Among the many qualities of oral speech to be judged, stress is considered one of the basic units, because its rhythmic production and perception convey different meanings. In practice, children and L2 learners tend to mispronounce the vowel stress in a word. However, even for a trained listener, detecting mispronounced word stress in oral reading is a challenging task. In a computer aided language tutor in particular, constant false alarms can seriously affect the subjects' feelings, causing impatience and complaints. Extensive work on stress pronunciation [1,2] explores prosodic features such as duration, amplitude, and f0, which are perceived to be the primary acoustic correlates of vowel stress. Most other studies investigate vowel quality through either spectral features [4] or the posterior probability output of a speech recognizer [1]. All of these systems adopt different pattern classifiers or structures [3] to identify the stress property of each vowel.

Unfortunately, such prosodic features neglect both the wide gap in sample counts between different vowel classes and the relative prosodic position within the whole sentence. From a linguistic point of view, the lack of phrase context does not fit well with how stress is generated. Stress spans broad activity across a vowel's context, and it requires considering different scoring strategies to reach a more reasonable structure.

TRAP based features with neural network posterior outputs have been extensively explored in speech recognition [5]. Since stress is strongly affected by acoustic quality, they are used in our work. Next, the prosodic posterior score, which models vowels with GMMs, is highly correlated with goodness of stress. Here we use a tri-gram based score that takes the distinction between neighboring syllables into account. Further, apart from a data-driven classifier, we propose a modified method for score probabilities and a bottom-up tree scheme for rescoring. Section 2 proposes the hybrid approach for detecting stress mispronunciation, and Section 3 describes the database and two experiments.

2. Proposed method for feature and classifier

2.1 TRAP-NN confidence and GMM posterior score.

Phoneme posteriors play an important role in speech recognition, especially the TRAP based posterior probability, which is reported to outperform the traditional GMM based recognition system. We are interested in how it represents stress properties. Stress detection is a context dependent task: emphasis or slackening of stress in any part affects the stress properties of the other parts. We therefore use this technique, which is based on long temporal context. The temporal spectral bank is divided into left and right parts, which are then converted to a Mel-bank representation of features. The feature vector fed to the NN is the combination of the vectors over all critical bands. A cascade structure of two parts and one merger produces the final posterior probability [5].

Another confidence measure is a GMM based prosodic model. For linguistic reasons, different tendencies of meaning or pronunciation errors may lead to variations in energy, duration and pitch, for each of which a posterior probability is computed. The corresponding score model of N vowels in a window is as follows:

$$\mathrm{Score}_{Pro} = \frac{1}{N}\sum_{i=1}^{N} \log p\big(f(d_i) \mid \mathrm{phone}_{prev_i,cur_i,next_i}\big)$$

where $f(d_i)$ is the prosodic feature of the $i$-th vowel. Here, normalization is applied at the first level to reduce variation in speaker and recording conditions [1].

978-0-7695-3711-5/09 $25.00 © 2009 IEEE
DOI 10.1109/ICALT.2009.157
The difference is that $\mathrm{phone}_{prev_i,cur_i,next_i}$ is the tri-gram vowel of the $i$-th segment, which considers both the previous and the next phoneme group. The total score is obtained by averaging the log GMM probabilities of the segmental prosodic scores.

2.2 Ranking order strategy and Teager operator.

Another feature technique is the importance order of the vowels in a window range. When a word is pronounced, the stressed vowel will, over a broader range, weaken the stress quality of the neighboring vowels. Moreover, a word has only one primary stress. Some words do have a secondary stress, but this minor stress is much weaker than the primary one, so the feature is determined by a dynamic range of values.

First, the basic prosodic features of the GMM models are extracted for each vowel, as described in [1]. Pitch related features of the syllabic nuclei appear to play a less important role than amplitude and duration, so the ranking order strategy is not applied to pitch.

A straightforward feature is the rank order of the vowel features in a window range of a word. Given N vowels in a window range, the ranking feature of the $i$-th vowel is computed in sigmoid form as follows:

$$\mathrm{RankFea}_i = \frac{2}{1 + e^{\alpha (\mathrm{Rank}_i - 1)/N}}$$

where $\mathrm{Rank}_i$ is the feature rank of the $i$-th vowel and $\alpha$ is an empirically set variable: the bigger $\alpha$ is, the steeper the function. Here we set it to 0.6.

Another context feature is the Teager operator, which extracts the signal energy on the basis of mechanical and physical considerations [7]. For a discrete band limited signal x(n), the operator is computed as follows:

$$\psi[x(n)] = [x(n)]^2 - x(n+1)\,x(n-1)$$

Since only vowels, not consonants, can be stressed, the computation is carried out on vowels only. Consonants are used as a reference in normalization and to assign stress to vowels that are not labeled by humans.

2.3 Bottom up fashioned rescore classifier.

A data shortage problem remains in training, because for some vowels the numbers of positive and negative samples differ greatly. Here we apply a bottom-up phoneme grouping method in classification. As shown in Fig. 1, an individual phoneme may occur sparsely in some leaves, but the hierarchical BT framework offers a convenient back-off technique for sharing information from parent nodes. Each parent is a pool of its children. The recursive smoothing used to calculate the probability of a given phoneme is as follows:

1. Determine the leaf l using the training set, and set a node variable: n = l.
2. Calculate the symbol probability by linear interpolation:

$$P_{smooth}(a_j) = s_j P_n(a_j) + (1 - s_j) P_{Par(n)}(a_j)$$

where Par(n) is the parent node of node n, and $s_j$ is the sparseness parameter between levels, determined by the number of samples in each class:

$$s_j = 1 - \frac{|N^+ - N^-|}{\max(N^+, N^-)}$$

If the two classes have the same number of phoneme samples, then $s_j = 1$; the more unbalanced they are, the closer $s_j$ is to 0. $P_{Par(n)}$ is given by repeating this step with n = Par(n) until n = Root.
3. Move n to another leaf node l.

In the experiments we use an SVM as the feature classifier. A kernel method with C=100 and gamma=1.0 is adopted after the features are normalized to between 0 and 1.

Figure 1. Bottom-up fashioned rescore tree (levels from the top: Root; Diphthongs and monophthong; Group; leaf phonemes such as ae, aa, ao. Node labels: Ppar(Ppar(aj)), Ppar(aj), Pn(aj), Psmooth(aj))

2.4 Word stress score computation.

Stress pronunciation is scored against a stress-tagged vocabulary. Some words contain minor stressed vowels. According to the results of Silipo [6], it is difficult to distinguish minor stressed vowels from the others, so we treat a minor stressed vowel as an unstressed one. In the word based task, the scoring tactic is based on both the stressed and the unstressed parts. As a word contains only one stress, the final goodness of stress (GOS) at word level can be computed as follows:

$$GOS_w = Score_S - \max_{i \in TU} Score_{U_i}$$

where TU represents the tagged unstressed vowels and Score is the stress score of each vowel. The bigger the value, the more stressed the vowel is pronounced. All the scores are SVM probability outputs: if the syllable is classified as stressed, the score is close to 1, otherwise it is close to 0.

For the word stress score, it is better to soften the maxima to obtain a modified uniform distributed phoneme score (MProb). The transformation contains two logarithms:

$$T_{(I,L,R)}(p) = \begin{cases} \log_L\left(\dfrac{p}{thre}\right), & p < thre; \\ \log_R\left(\dfrac{p}{1 - thre}\right), & p \ge thre. \end{cases}$$

3. Performance Evaluation

The database for evaluating GOS performance was collected from a group of elite middle school students
in China. The training part of the corpus consists of 155 males and 270 females from random grades and schools. The testing corpus is derived from the same database, with 38 males and 60 females. Each person reads 20 scripts of differing difficulty. All scripts are common words from teaching materials. In the training stage, since the reading script is known, the speech is first force-aligned to the target sentences to generate the boundaries of each vowel in a word, from which the prosodic and context dependent features are extracted. For testing, each utterance is recognized by an ASR system, which yields the boundary of each phoneme.

Our study of stress detection investigates two aspects of the proposed structural feature schemes: 1) whether the applied context based feature schemes and their combinations are effective in the single-vowel stress detection task; 2) the detection of word stress flaws in a sentence. Here, system performance verifies the impact of the proposed feature strategies, the modified probability and the bottom-up rescoring in classification, as well as their reliability in tackling the problem of data sparseness.

Exp.1 The first task is to classify vowels as stressed or unstressed. Since the vowels have different numbers of positive and negative samples, we use grouped vowel classification accuracy as the measure. This tests the performance of the proposed method at the vowel level.

Tab.1 shows the accuracy for each vowel category with both prosodic and context features. The first row, “PRO”, is the baseline prosodic features [1]. The final row, “Total”, is the combination of the prosodic features, Trap-NN and GMM. The “Total” system achieves the best performance in most vowel groups. Further, the results reveal that monophthongs (ah, er) outperform diphthongs (oy, ow). An interesting finding is the degraded result for the vowels ih and iy, which illustrates that it is still hard to distinguish some kinds of vowels, especially vowels with the same property but different durations.

Table.1 Accuracy of classifier in phoneme group (%)
Vowel group   ah     er     ih     iy     oy/ow/ay/aw   aa/ae/ao   eh/ey   uh/uw
PRO           88.0   85.2   79.5   75.8   78.3          85.3       78.9    79.5
+TrapNN       89.5   88.7   74.1   76.7   81.1          86.4       82.2    79.7
+GMM          87.6   84.2   79.4   77.3   80.5          83.8       80.8    75.4
Total         89.7   90.9   81.4   77.9   83.3          86.3       85.6    81.7

Exp.2 The second, word level task is directed at the application in a stress mispronunciation detection system. The performance of the different structures in the methods proposed above is evaluated in terms of EER (Equal Error Rate) curves, which describe the performance of a two-class discrimination system as an adjustable threshold varies. Fig.2 depicts the EER curve of word level stress detection on the described dataset. The X-axis represents the False Rejection Rate (FRR) of the well pronounced vowels, and the Y-axis represents the False Accept Rate (FAR) of the mispronounced words. The baseline system, drawn as the blue line, is based on the “Total” features of Exp.1. Systems labeled “Context” use the two context based strategies: the ranking strategy and the Teager feature operator. The context dependent stress scores bring a significant decrease in EER, as does the modified uniform distributed probability (MProb). Note that the first three systems use a vowel classifier without rescoring within their vowel groups. The final system outputs vowel scores based on the bottom-up rescore tree structure (Rescore) and achieves the best performance, an EER of 9.41%.

Figure 2. EER curves for various approaches

4. Conclusion

This paper examined whether the proposed context dependent features and the rescoring method in classification can better reflect the inherent nature of rhythmic stress. The final results reveal that the hybrid method can significantly improve performance, reducing the EER to 9.41%. Future work will focus on exploiting diverse features that better reflect rhythmic stress at the emotion level.

5. References
[1] Huayang Xie, et al. “Detecting stress in Spoken English using Decision Trees and Support Vector Machines”. PDMISI. 2004
[2] Joseph Tepperman, Shrikanth N. “Automatic syllable stress detection using prosodic features for pronunciation evaluation of language learners”. ICASSP. 2005
[3] Min Lai, et al. “A hierarchical approach to automatic stress detection in English sentences”. ICASSP. 2006
[4] Nan Chen, Qianhua He. “Using nonlinear features in automatic English lexical stress detection”. ICCISW. 2007
[5] Szoke, et al. “Comparison of keyword spotting approaches for informal continuous speech”. Eurospeech. 2005
[6] Rosaria Silipo, Steven Greenberg. “Automatic Detection of Prosodic Stress in American English Discourse”. ICSI. 2000
[7] J. F. Kaiser, et al. “On a Simple Algorithm to Calculate the ‘Energy’ of a Signal”. ICASSP 90, pp. 381-384.
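
The rank-order feature and the Teager operator of Section 2.2 can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it assumes that rank 1 goes to the vowel with the largest feature value, that ties are broken by position, and that the Teager operator is evaluated on interior samples only. The function names are hypothetical.

```python
import math

def rank_features(values, alpha=0.6):
    """RankFea_i = 2 / (1 + exp(alpha * (Rank_i - 1) / N)).

    Assumption: rank 1 is the vowel with the largest value, so the
    top-ranked vowel gets a feature of exactly 1.0.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i], reverse=True)
    ranks = [0] * n
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return [2.0 / (1.0 + math.exp(alpha * (r - 1) / n)) for r in ranks]

def teager_energy(x):
    """psi[x(n)] = x(n)^2 - x(n+1) * x(n-1) for the interior samples of x."""
    return [x[n] ** 2 - x[n + 1] * x[n - 1] for n in range(1, len(x) - 1)]
```

Calling `rank_features([0.2, 0.9, 0.5])` gives the second vowel the largest feature, and `teager_energy` returns a list two samples shorter than its input because the first and last samples have no neighbor on one side.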

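The back-off smoothing of Section 2.3 can be read as a recursion from a leaf up to the root. The sketch below is one possible reading under stated assumptions: the tree is given as a child-to-parent dictionary, each node already has a classifier probability and a pair of class counts, the absolute difference is used in the sparseness parameter, and a node with no samples defers entirely to its parent. All names are hypothetical.

```python
def sparseness(n_pos, n_neg):
    """s_j = 1 - |N+ - N-| / max(N+, N-): 1 when balanced, 0 if one class is empty."""
    largest = max(n_pos, n_neg)
    if largest == 0:
        return 0.0  # assumption: no samples at this node, rely on the parent
    return 1.0 - abs(n_pos - n_neg) / largest

def smoothed_probability(node, parent, prob, counts):
    """Interpolate a node's probability with its ancestors', leaf to root.

    parent: dict child -> parent name (the root has no entry)
    prob:   dict node -> P_n(a_j) from the classifier at that node
    counts: dict node -> (N+, N-) training counts at that node
    """
    if node not in parent:  # reached the root, nothing left to back off to
        return prob[node]
    s = sparseness(*counts[node])
    backed_off = smoothed_probability(parent[node], parent, prob, counts)
    return s * prob[node] + (1.0 - s) * backed_off
```

With a leaf whose classes are split 10 to 5, the leaf and its smoothed parent each contribute half of the final probability.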