Context Dependent Feature Based Bottom-Up Rescoring SVM Classifier in Children's English Stress Mis-Pronunciation Detection
Abstract
Automatic assessment of word stress errors is an integral part of an oral language grading system. However, two problems are yet to be solved: the properties of a vowel depend on its context information, and the data of the different vowel classes are sparse. This paper briefly introduces a hybrid method consisting of both traditional prosodic features and the proposed context dependent strategies. In classification, word stress is determined by weighting a bottom-up fashioned group tree with a modified distributed probability score. In the experiments, the overall equal error rate of the proposed system reaches 9.41%, which exhibits a relative reduction and shows its suitability for use in a stress error detection system.

1. Introduction
With the development of computer aided language learning (CALL), automatic speech rating systems have become a second guide for children and a fine-grained analysis tool for teachers. Among the many oral speech qualities to be judged, stress is considered one of the basic units, given its role in conveying different meanings through its rhythmic production and perception. In practice, children and L2 learners tend to mispronounce vowel stress in a word. However, even for a trained listener, detecting mispronounced word stress in oral reading is a challenging task. In a computer aided language tutor especially, constant false alarms can seriously affect how subjects feel, leading to impatience and complaints. Extensive work on stress pronunciation [1,2] explores prosodic features such as duration, amplitude, and f0, which are perceived to be the primary acoustic correlates of vowel stress. Most other studies investigate vowel quality through either spectral features [4] or the posterior probability output of a speech recognizer [1]. All of the above systems adopt different pattern classifiers or structures [3] to identify the stress property of each vowel.

Unfortunately, such prosodic features neglect the wide gap in the amount of data between different vowel classes, as well as the relative prosodic position within the whole sentence. From a linguistic view, the lack of phrase context does not fit well with the constellation of stress generation, which tends to span broad activities in the vowel's context information and requires consideration of different score strategies towards a more reasonable structure. TRAP based features with neural network posterior outputs have been extensively explored in speech recognition [5]. Since stress is strongly affected by acoustic quality, they are used in our work. Next, the prosodic posterior score, which models vowels with a GMM, is highly correlated with goodness of stress. Here we take a tri-gram based score that captures the distinction between neighboring syllables. Further, apart from a data-driven classifier, we propose a modified method of score probabilities and a bottom-up tree scheme for rescoring. In Section 2, the hybrid approach for detecting stress mispronunciation is proposed, followed by the database description and two experiments in Section 3.

2. Proposed method for feature and classifier
2.1 TRAP-NN confidence and GMM posterior score.
Phoneme posteriors play an important role in speech recognition, especially TRAP based posterior probabilities, which are reported to outperform traditional GMM based recognition systems. We are interested in how they represent the stress property. Stress detection is a context dependent task: any emphasis or slackening of stress in one part affects the stress properties of the other parts. This technique, which is based on long temporal context, is therefore used. The temporal spectral bank is divided into left and right parts, which are then converted to a Mel-bank representation of features. The feature vector fed to the NN is formed by combining the vectors over all critical bands. A cascade structure of two parts and one merger produces the final posterior probability [5].

Another confidence is a GMM based prosodic model. For linguistic reasons, different tendencies of meaning or erroneous pronunciation may lead to variations in energy, duration and pitch, for each of which a posterior probability is computed. The corresponding score model of N vowels in a window is as follows:

$$\mathrm{Score}_{Pro} = \frac{1}{N}\sum_{i=1}^{N} \log\left( p\left( f(d_i) \mid \mathrm{phone}_{prev_i,\,cur_i,\,next_i} \right) \right)$$

where f(d_i) is the prosodic feature of the i-th vowel. Here, normalization is applied at the first level to reduce variation in speaker and recording condition [1].
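As a minimal sketch of the Score_Pro computation in Section 2.1, the following Python fragment averages the log probability of each vowel's prosodic feature under a model selected by its (previous, current, next) phone context. It assumes a scalar, already normalized feature per vowel and a one-dimensional Gaussian mixture per context. The table MODELS, its parameters, and the function names are hypothetical illustrations and are not taken from the paper.

```python
import math


def gmm_log_density(x, weights, means, stds):
    """Log density of a scalar x under a 1-D Gaussian mixture."""
    parts = []
    for w, mu, sd in zip(weights, means, stds):
        log_norm = -0.5 * math.log(2.0 * math.pi * sd * sd)
        parts.append(math.log(w) + log_norm - 0.5 * ((x - mu) / sd) ** 2)
    m = max(parts)  # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(p - m) for p in parts))


def score_pro(features, contexts, models):
    """Average log probability over the N vowels of a window."""
    total = 0.0
    for x, ctx in zip(features, contexts):
        weights, means, stds = models[ctx]
        total += gmm_log_density(x, weights, means, stds)
    return total / len(features)


# Hypothetical models keyed by (previous, current, next) phone:
# each entry is (mixture weights, means, standard deviations).
MODELS = {
    ("k", "ae", "t"): ([1.0], [0.0], [1.0]),
    ("t", "ah", "n"): ([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]),
}

window_features = [0.0, 0.0]  # one normalized prosodic feature per vowel
window_contexts = [("k", "ae", "t"), ("t", "ah", "n")]
print(score_pro(window_features, window_contexts, MODELS))
```

In this sketch a separate table of the same shape would be kept for each of energy, duration and pitch, so that each prosodic cue yields its own score.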
RankFea = 2 (1 + e N
) Psmooth(aj)
2.3 Bottom up fashioned rescore classifier.
Data sparseness remains a problem in training, because for some vowels the amounts of data in the positive and negative classes differ greatly. Here we apply a bottom-up fashioned phoneme grouping method in classification. In Fig 1, an individual phoneme may occur only sparsely in some leaves, but the hierarchical BT framework offers a convenient way to provide a backing-off technique that shares information from parent nodes. Each parent is a pool of its children. The recursive smoothing used to calculate the probability of a given phoneme is as follows:
1. Determine the leaf l using the training set, and set a node variable: n = l
2. Calculate the symbol probability by linear interpolation:

$$P_{smooth}(\alpha_j) = s_j P_n(\alpha_j) + (1 - s_j) P_{Par(n)}(\alpha_j)$$

TU represents tagged unstressed vowels, and Score is the stress score of each vowel. The larger the value, the more strongly stressed the vowel is pronounced. All scores are output as SVM probabilities. If the syllable is classified as stressed, the score is close to 1; otherwise it is close to 0.
For the word stress score, it is better to soften the maxima to obtain a modified uniform distributed phoneme score (MProb). The transformation contains two logarithms:

$$T_{(I,L,R)}(p) = \begin{cases} \log_L\left(\dfrac{p}{thre}\right), & p < thre; \\ \log_R\left(\dfrac{p}{1 - thre}\right), & p \ge thre. \end{cases}$$

3. Performance Evaluation
The database used to evaluate GOS performance is taken from a collection of a group of elite middle school students
in China. The training part of the corpus consists of 155 males and 270 females from random grades and schools. The testing corpus is derived from the same database, with 38 males and 60 females. Each person reads 20 scripts of different difficulty levels. All scripts consist of common words from teaching materials. In the training stage, since the reading script is known, speech is first force-aligned to the target sentences to generate the boundaries of each vowel of a word, from which the prosodic and context dependent features are extracted. For testing, each utterance is recognized by an ASR system, which yields the boundary of each phoneme.

Our study of stress detection aims to investigate two aspects of the proposed structural feature schemes: 1) whether the applied context based feature schemes and their different combinations are effective in the single-vowel stress detection task; 2) the detection of word stress flaws in a sentence. Here system performance verifies the impact of the proposed feature strategies, the modified probability, and the bottom-up fashioned rescoring in classification, as well as their reliability in tackling the problem of data sparseness.

Exp.1 The first task is to classify vowels as stressed or unstressed. Since the numbers of positive and negative samples differ between vowels, we use grouped vowel classification accuracy as the measure. This tests the performance of the proposed method at the vowel level. Tab.1 shows the accuracy of each vowel category with both prosodic and context features. The first row, “PRO”, is the baseline prosodic features [1]. The final row, “Total”, is the combination of prosodic features, Trap-NN, and GMM. The “Total” system achieves the best performance in most vowel groups. Further, the results reveal that monophthongs (ah, er) outperform diphthongs (oy, ow). An interesting finding is the degraded result for the vowels ih and iy, which illustrates that it is still hard to distinguish some kinds of vowels, especially vowels with the same property but different durations.

Table.1 Accuracy of classifier in phoneme group (%)
Vowel group  ah    er    ih    iy    oy/ow/ay/aw  eh/ae/ey  aa/ao  uh/uw
PRO          88.0  85.2  79.5  75.8  78.3         85.3      78.9   79.5
+TrapNN      89.5  88.7  74.1  76.7  81.1         86.4      82.2   79.7
+GMM         87.6  84.2  79.4  77.3  80.5         83.8      80.8   75.4
Total        89.7  90.9  81.4  77.9  83.3         86.3      85.6   81.7

Exp.2 The second, word level task targets the application of a stress mispronunciation detection system. The performance of the different structures in the methods proposed above is evaluated in terms of EER (Equal Error Rate) curves, which describe the performance of a system for two-class discrimination as an adjustable threshold varies. Fig.2 depicts the EER curves of word level stress detection on the described dataset. The X-axis represents the False Rejection Rate (FRR) of the well pronounced vowels, while the Y-axis represents the False Accept Rate (FAR) of the mispronounced words. The baseline system, shown as the blue line, is based on the “Total” features of Exp.1. Systems with the label “Context” represent the two context based strategies: the ranking strategy and the Teager feature operator. Context dependent stress scores have a significant impact on decreasing the EER, as does the modified uniform distributed probability (MProb). Note that the first three systems adopt a vowel classifier without rescoring within their vowel groups. The final system outputs vowel scores based on the bottom-up rescore tree structure (Rescore), and achieves the best performance, an EER of 9.41%.

Figure 2. EER curves for various approaches

4. Conclusion
This paper examined whether the proposed context dependent features and the rescoring method in classification can better reflect the inherent nature of rhythmic stress. The final results show that the hybrid method significantly improves performance, reaching an EER of 9.41%. Future work will focus on exploiting diverse features that can better reflect rhythmic stress at the emotion level.

5. References
[1] Huayang Xie, et al. “Detecting stress in Spoken English using Decision Trees and Support Vector Machines”. PDMISI. 2004
[2] Joseph Tepperman, Shrikanth N. “Automatic syllable stress detection using prosodic features for pronunciation evaluation of language learners”. ICASSP. 2005
[3] Min Lai, et al. “A hierarchical approach to automatic stress detection in English sentences”. ICASSP. 2006
[4] Nan Chen, Qianhua He. “Using nonlinear features in automatic English lexical stress detection”. ICCISW. 2007
[5] Szoke, et al. “Comparison of keyword spotting approaches for informal continuous speech”. Eurospeech. 2005
[6] Rosaria Silipo, Steven Greenberg. “Automatic Detection of Prosodic Stress in American English Discourse”. ICSI. 2000
[7] J. F. Kaiser, et al. “On a Simple Algorithm to Calculate the ‘Energy’ of a Signal”. ICASSP 90, pp. 381-384.
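The recursive back-off smoothing of Section 2.3 can be illustrated with a small Python sketch. The paper gives only the leaf selection step and the interpolation formula, so several points here are assumptions: the recursion is taken to continue up to the root, the root uses its plain relative frequency, and the interpolation weight is a hypothetical count-based choice s = c / (c + k), where c is the number of samples at the node. The class Node, the function p_smooth, the symbol labels and the example counts are likewise illustrative only.

```python
class Node:
    """A node of the grouping tree; a parent pools the counts of its children."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.counts = {}  # symbol -> number of training samples

    def add(self, symbol, n=1):
        # Add counts to this node and to every ancestor, so each parent is a pool.
        node = self
        while node is not None:
            node.counts[symbol] = node.counts.get(symbol, 0) + n
            node = node.parent

    def rel_freq(self, symbol):
        total = sum(self.counts.values())
        return self.counts.get(symbol, 0) / total if total else 0.0


def p_smooth(node, symbol, k=5.0):
    """P_smooth = s * P_n + (1 - s) * P_smooth(parent), recursing up to the root."""
    if node.parent is None:
        return node.rel_freq(symbol)  # assumed base case: unsmoothed root estimate
    total = sum(node.counts.values())
    s = total / (total + k)  # assumed weight: trust well-populated nodes more
    return s * node.rel_freq(symbol) + (1.0 - s) * p_smooth(node.parent, symbol, k)


# Hypothetical three-level tree: root -> vowel group -> individual vowel.
root = Node("vowels")
diph = Node("diphthongs", root)
oy = Node("oy", diph)
ow = Node("ow", diph)

oy.add("stressed", 1)      # a sparse leaf with a single sample
ow.add("stressed", 6)
ow.add("unstressed", 4)

print(p_smooth(oy, "stressed", k=1.0))
print(p_smooth(oy, "unstressed", k=1.0))
```

With these counts the sparse leaf oy no longer assigns probability 1 to "stressed": part of the mass comes from the pooled diphthong group, which is the information sharing that the back-off scheme is meant to provide.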
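The equal error rate criterion used in Exp.2 can be sketched as follows. The fragment sweeps a decision threshold over the observed scores, computes the False Accept Rate (mispronounced items accepted as correct) and the False Rejection Rate (well pronounced items rejected), and reports the operating point where the two are closest. It assumes that a higher score means a better stressed item and that an item is accepted when its score is at least the threshold. The function names and the toy scores are hypothetical and are not data from the paper.

```python
def far_frr(good_scores, bad_scores, threshold):
    """Accept an item as correctly stressed when score >= threshold."""
    frr = sum(s < threshold for s in good_scores) / len(good_scores)
    far = sum(s >= threshold for s in bad_scores) / len(bad_scores)
    return far, frr


def equal_error_rate(good_scores, bad_scores):
    """Return (EER, threshold) at the point where FAR and FRR are closest."""
    candidates = sorted(set(good_scores) | set(bad_scores))
    candidates.append(candidates[-1] + 1.0)  # a threshold that rejects everything
    best = None
    for t in candidates:
        far, frr = far_frr(good_scores, bad_scores, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, t)
    return best[1], best[2]


# Toy scores: well pronounced items versus mispronounced items.
good = [0.9, 0.8, 0.7, 0.4]
bad = [0.6, 0.3, 0.2, 0.1]

eer, thr = equal_error_rate(good, bad)
print(eer, thr)
```

On finite data FAR and FRR rarely coincide exactly, so the sketch reports their average at the closest point; interpolating between neighboring thresholds is a common alternative.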