Some Recent Research Work at LIUM Based On The Use of CMU Sphinx
Yannick Estève, Paul Deléglise, Sylvain Meignier, Simon Petit-Renaud, Holger Schwenk,
Loic Barrault, Fethi Bougares, Richard Dufour, Vincent Jousse, Antoine Laurent, Anthony Rousseau
tion, and transcribed by the Sphinx decoder, before being annotated. Later, segments were labeled into three classes (prepared, low spontaneity and high spontaneity) depending on their level of spontaneity. In this work, we particularly focused on the detection of the high spontaneity class of speech.

In parallel to the subjective annotation of the corpus, we introduced the features used to describe speech segments. We chose features that are relevant to characterize the spontaneity of speech segments, and on which an automatic classification process can be trained on our annotated corpus. Three sets of features were used [14, 15]: acoustic features related to prosody, linguistic features related to the lexical and syntactic content of the segments, and confidence measures output by the ASR system. The features were evaluated on our labeled corpus with a classification task: labeling speech segments according to the three classes of spontaneity.

Intuitively, we can feel that it should be rare to observe a

[Figure 1: MT system combination. The 1-best outputs of systems 1 to M are aligned with TERp, merged into confusion networks (CN), combined into a lattice, and decoded to produce the best hypothesis and an n-best list.]

10.1. Hypotheses alignment and confusion network generation

For each segment, the best hypotheses of M − 1 systems are aligned against the last one, used as backbone. A modified version of the TERp tool [18] is used to generate a confusion network. This is done by incrementally adding the hypotheses to the CN. These hypotheses are added to the backbone beginning with the nearest (in terms of TERp) and ending with the most distant ones. This differs from the approach of [17], where the nearest hypothesis is computed at each step. M confusion networks are generated in this way. Then all the confusion networks are connected into a single lattice by adding a first and a last node. The probabilities of the first arcs (named priors) must reflect the capacity of each system to provide a well-structured hypothesis.

10.2. Decoding

The decoder is based on the token pass decoding algorithm. The principle of this decoder is to propagate tokens over the lattice and to accumulate various scores into a global score for each hypothesis. The scores used to evaluate the hypotheses are the following:

• System score: this replaces the score of the translation model. Until now, the words given by all systems have the same probability, which is equal to their prior, but any confidence measure can be used at this step.

• Language model (LM) probability.

• A fudge factor to balance the probabilities provided in the lattice with regard to those given by the language model.

• A null-arc penalty: this penalty avoids always going through null-arcs encountered in the lattice.

• A length penalty: this score helps to generate well-sized hypotheses.

The probabilities computed in the decoder can be expressed as follows:

  log(P_W) = Σ_{n=0}^{Len(W)} [ log(P_ws(n)) + α · P_lm(n) ] + Len_pen(W) + Null_pen(W)    (1)

where Len(W) is the length of the hypothesis, P_ws(n) is the score of the nth word in the lattice, P_lm(n) is its LM probability, α is the fudge factor, Len_pen(W) is the length penalty of the word sequence and Null_pen(W) is the penalty associated with the number of null-arcs crossed to obtain the hypothesis.

At the beginning, one token is created at the first node of the lattice. Then this token is spread over consecutive nodes, accumulating the score on the arcs it crosses, the language model probability of the word sequence generated so far, and the null or length penalty when applicable. The number of tokens can increase very quickly to cover the whole lattice; in order to keep decoding tractable, only the Nmax best tokens are kept (the others are discarded), where Nmax can be set in the configuration file.

10.2.1. Technical details about the token pass decoder

This software is based on the Sphinx4 library and is highly configurable. The maximum number of tokens considered during decoding, the fudge factor, the null-arc penalty and the length penalty can all be set within an XML configuration file. This is really useful for tuning.

This decoder uses a language model (LM), which is described in section 10.2.2.

10.2.2. Language Model

There are two ways of loading a LM with this software. The first solution is to use the LargeTrigramModel class but, as its name tells us, at most a 3-gram model can be loaded with this class.

The second and easiest way is to use a language model hosted on a lm-server. This kind of LM can be accessed via the LanguageModelOnServer class, which is based on the generic LanguageModel class from the Sphinx4 library. This allows us to load an n-gram LM with n higher than 3, which is not possible with a standard LM class in Sphinx4 at this time. A new generic class for handling k-gram LMs (whatever k is) is being developed at LIUM and will soon be integrated into this software.

In addition, the Dictionary interface has been extended in order to be able to load a simple dictionary containing all the words known by the LM (there is no need to know the different pronunciations of each word in this case).

11. Experimental Evaluation of SMT System Combination

We used our combination system, called MTSyscomb [19], for the IWSLT'09 evaluation campaign [20]. Table 3 presents the results obtained with this approach. The SMT system is based on MOSES [21], the SPE system corresponds to a rule-based system from SYSTRAN whose outputs have been corrected by an SMT system, and the Hierarchical system is based on Joshua [22].

  Systems        |  Arabic/English   |  Chinese/English
                 |  Dev7    Test09   |  Dev7    Test09
  SMT            |  54.75   50.35    |  41.71   36.04
  SPE            |  48.13   -        |  41.23   38.53
  Hierarchical   |  54.00   49.06    |  39.78   31.89
  SMT + SPE      |                   |  42.55   40.14
    + tuning     |                   |  43.06   39.46
  SMT + Hier.    |  55.89   50.86    |
    + tuning     |  57.01   51.74    |

Table 3: Results of system combination on Dev7 (development) corpus and Test09, the official test corpus of the IWSLT'09 evaluation campaign.

In these tasks, the system combination approach yielded +1.39 BLEU on Ar/En and +1.7 BLEU on Zh/En. One observation is that tuning the parameters did not provide better results for Zh/En.

12. Conclusion

This paper presents some recent research works made at LIUM. We started using CMU Sphinx tools in 2004 in order to develop an entire ASR system for the French language. We have added some improvements and have made our system the best open source system participating in French evaluation campaigns. This system is now at the center of the research works of the LIUM Speech Team. In the framework of speech processing, these works include grapheme-to-phoneme conversion, a special correction strategy to process frequent specific errors, and the detection and characterization of spontaneous speech in large audio databases. We also used CMU Sphinx tools in our research work on statistical machine translation, to combine SMT systems.
Other research works, not presented in this paper, are made using CMU Sphinx, for example named speaker identification, which exploits the outputs of our speaker diarization system and the output of our ASR system.

We already share some of our resources (French acoustic and language models, for example) and we try to integrate our add-ons into the canonical source code of the CMU Sphinx project. This porting is very hard because it needs a lot of development time, and because CMU Sphinx progresses and its source code frequently changes. In the future, we will try to integrate our work into the canonical source code as soon as possible in order to make this integration easier. Moreover, it would be really interesting to have access to a global view of the current and planned developments made by the other CMU Sphinx developers.

13. References

[1] M. J. F. Gales, "Maximum likelihood linear transformations for HMM-based speech recognition," Computer Speech and Language, vol. 12, pp. 75-98, 1998.

[2] L. Mangu, E. Brill, and A. Stolcke, "Finding consensus in speech recognition: Word error minimization and other applications of confusion networks," Computer Speech and Language, vol. 14, no. 4, pp. 373-400, 2000.

[3] D. Povey and P. Woodland, "Minimum phone error and I-smoothing for improved discriminative training," in ICASSP, Florida, USA, 2002, pp. 105-108.

[4] D. Povey, "Discriminative training for large vocabulary speech recognition," Ph.D. dissertation, Department of Engineering, University of Cambridge, United Kingdom, 2004.

[5] J. Durand, B. Laks, and C. Lyche, "La phonologie du français contemporain : usages, variétés et structure," Romance Corpus Linguistics - Corpora and Spoken Language, pp. 93-106, 2002.

[6] F. Béchet, "LIA PHON : un système complet de phonétisation de texte," in Traitement Automatique des Langues, Hermès, 2001, vol. 42, pp. 47-68.

[7] H. Jiang, "Confidence measures for speech recognition: a survey," Speech Communication, vol. 45, pp. 455-470, 2005.

[8] G. Evermann and P. Woodland, "Large vocabulary decoding and confidence estimation using word posterior probabilities," in ICASSP, Istanbul, Turkey, June 2000.

[9] M. De Calmes and G. Perennou, "BDLEX: a lexicon for spoken and written French," in Proc. of LREC, International Conference on Language Resources and Evaluation, 1998, pp. 1129-1136.

[10] M. Bisani and H. Ney, "Joint-sequence models for grapheme-to-phoneme conversion," Speech Communication, vol. 50, no. 5, pp. 434-451, 2008.

[11] A. Laurent, P. Deléglise, and S. Meignier, "Grapheme to phoneme conversion using an SMT system," in Proceedings of Interspeech 2009, 2009.

[12] T. Bazillon, Y. Estève, and D. Luzzati, "Manual vs assisted transcription of prepared and spontaneous speech," in LREC 2008, Marrakech, Morocco, May 2008.

[13] R. Dufour and Y. Estève, "Correcting ASR outputs: specific solutions to specific errors in French," in SLT 2008, Goa, India, December 2008.

[14] R. Dufour, V. Jousse, Y. Estève, F. Béchet, and G. Linarès, "Spontaneous speech characterization and detection in large audio database," in SPECOM 2009, St Petersburg, Russia, June 2009.

[15] R. Dufour, Y. Estève, P. Deléglise, and F. Béchet, "Local and global models for spontaneous speech segment detection and characterization," in ASRU 2009, Merano, Italy, December 2009.

[16] W. Shen, B. Delaney, T. Anderson, and R. Slyh, "The MIT-LL/AFRL IWSLT-2008 MT system," in IWSLT, Hawaii, U.S.A., 2008, pp. 69-76.

[17] A.-V. Rosti, S. Matsoukas, and R. Schwartz, "Improved word-level system combination for machine translation," in ACL, 2007, pp. 312-319.

[18] M. Snover, B. Dorr, R. Schwartz, L. Micciulla, and J. Makhoul, "A study of translation edit rate with targeted human annotation," in AMTA, 2006.

[19] L. Barrault, "MANY: Open source machine translation system combination," Prague Bulletin of Mathematical Linguistics, Special Issue on Open Source Tools for Machine Translation, vol. 93, pp. 147-155, 2010.

[20] H. Schwenk, L. Barrault, Y. Estève, and P. Lambert, "LIUM's Statistical Machine Translation Systems for IWSLT 2009," in Proc. of the International Workshop on Spoken Language Translation, Tokyo, Japan, 2009, pp. 65-70.

[21] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst, "Moses: Open source toolkit for statistical machine translation," in ACL, The Association for Computational Linguistics, 2007.

[22] Z. Li, C. Callison-Burch, C. Dyer, J. Ganitkevitch, S. Khudanpur, L. Schwartz, W. N. G. Thornton, J. Weese, and O. F. Zaidan, "Joshua: an open source toolkit for parsing-based machine translation," in StatMT '09: Proceedings of the Fourth Workshop on Statistical Machine Translation, Morristown, NJ, USA: Association for Computational Linguistics, 2009, pp. 135-139.