
Some recent research work at LIUM based on the use of CMU Sphinx

Yannick Estève, Paul Deléglise, Sylvain Meignier, Simon Petit-Renaud, Holger Schwenk,
Loic Barrault, Fethi Bougares, Richard Dufour, Vincent Jousse, Antoine Laurent, Anthony Rousseau

LIUM, University of Le Mans, France


[email protected]

Abstract

This paper presents an overview of the recent research work developed at LIUM using the CMU Sphinx tools. First, it describes the LIUM ASR system, which reached very competitive results in French evaluation campaigns. Then, different research works using the LIUM ASR system are described: detection and characterization of spontaneous speech in large audio databases, language modeling to detect and correct errors in automatic transcripts, and system combination in the framework of statistical machine translation. Last, we discuss the benefits of the availability of CMU Sphinx under a permissive open source license and, as we would like to share some parts of our work with the CMU Sphinx community, the difficulties we encountered in participating in the development of CMU Sphinx.

1. Introduction

The LIUM automatic speech recognition system is based on the CMU Sphinx system. The tools distributed in the CMU Sphinx open-source package, although already of high quality, can be supplemented or improved to integrate state-of-the-art technologies. This is the solution LIUM adopted to develop its own ASR system: building on this base and gradually extending it to reach new performance levels, officially evaluated during the two ESTER evaluation campaigns on French broadcast news.

2. The ESTER evaluation campaigns

2.1. ESTER 1 and ESTER 2

The ESTER 1 evaluation campaign was organized within the framework of the TECHNOLANGUE project funded by the French government, under the scientific supervision of AFCP (Association Francophone de la Communication Parlée) with DGA (Délégation Générale de l'Armement) and ELDA. About 100 hours of transcribed data make up the corpus, recorded between 1998 and 2004 from six French-speaking radio stations: France Inter, France Info, RFI, RTM, France Culture and Radio Classique. Shows last from 10 minutes up to 60 minutes. They consist mostly of prepared speech, such as news reports, with a little conversational speech (such as interviews). The corpus of articles from the French newspaper "Le Monde" from 1987 to 2003 can be used in addition to the transcriptions of the broadcast news to train language models.

The ESTER 2 campaign is the continuation of ESTER 1. It was organized by DGA and AFCP during 2007 and 2008. The new campaign builds on the previous edition by reusing its corpus and extending it to cover new types of data. In particular, it includes more programs with foreign accents, as well as more spontaneous speech: in addition to French national broadcast news, ESTER 2 includes talk shows and African programs (from the station Radio Africa No 1). ESTER 2 supplements the ESTER 1 corpus with about 100 hours of transcribed broadcast news recorded from 1998 to 2004, with an additional 6 hours for development and 6 hours for test from 2007-2008. Fast transcriptions of 40 hours of African broadcast news are also available. Textual resources are extended by articles from the newspaper "Le Monde" from 2004 to 2006.

3. The LIUM ASR system

The system described below was the best open source system during the ESTER 2 evaluation campaign (an older version of this system was also the best open source system during ESTER 1).

3.1. Diarization

The diarization system uses the Sphinx toolkit to compute the feature vectors. It is composed of an acoustic BIC-based segmentation followed by a BIC-based hierarchical clustering. Each cluster represents a speaker and is modeled with a full-covariance Gaussian. Viterbi decoding is used to adjust the segment boundaries, using GMMs for each cluster. Music and jingle regions are removed using Viterbi decoding with 8 GMMs, for music, jingle, silence, and speech (with wide/narrow band variants for the latter two, and clean/noisy/musical background variants for wide-band speech). Gender and bandwidth are then detected using 4 gender- and bandwidth-dependent GMMs. Speech segments are then limited to 20 s by splitting over-long segments using a GMM-based silence detector.

This system, completed by a CLR-based clustering phase, obtained the best diarization error rate during the ESTER 2 campaign.

3.2. Speech recognition system

3.2.1. Features

The transcription decoding process is based on multi-pass decoding using 39-dimensional features (PLP with energy, delta, and double-delta). Two sets of features are computed for each show, corresponding to broadband and narrowband analysis.
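The ΔBIC criterion that drives the segmentation and hierarchical clustering of section 3.1 can be sketched as follows. This is a minimal illustration of the standard full-covariance ΔBIC merge test, not LIUM's actual implementation; the penalty weight `lam` is an assumed tunable parameter.

```python
import numpy as np

def delta_bic(x, y, lam=1.0):
    """Standard full-covariance Delta-BIC between two segments.

    x, y: (n_frames, n_dims) feature arrays, each modeled by a single
    full-covariance Gaussian. A negative value means the merged model
    fits better once the complexity penalty is paid: merge the clusters.
    """
    n1, n2, d = len(x), len(y), x.shape[1]
    logdet = lambda m: np.linalg.slogdet(np.cov(m, rowvar=False))[1]
    # Penalty: the extra parameters spent by keeping two Gaussians
    penalty = 0.5 * lam * (d + d * (d + 1) / 2) * np.log(n1 + n2)
    return (0.5 * (n1 + n2) * logdet(np.vstack([x, y]))
            - 0.5 * n1 * logdet(x)
            - 0.5 * n2 * logdet(y)
            - penalty)
```

Segmentation slides a candidate boundary and applies this test to the two adjacent windows; clustering repeatedly merges the pair of clusters with the lowest ΔBIC until every remaining value is positive.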
3.2.2. Decoding

After speaker diarization, a first decoding pass allows us to compute a CMLLR transformation for each speaker [1]. The decoding strategy involves 5 passes. The other passes are as follows:

# 2 The best hypotheses generated by pass # 1 are used to compute a CMLLR transformation for each speaker, using SAT and Minimum Phone Error (MPE) acoustic models and CMLLR transformations. This pass generates word-graphs.

# 3 In the third pass, the word-graphs are used to drive a graph decoding with full 3-phone context and better acoustic precision, particularly in inter-word areas. This pass generates new word-graphs.

# 4 The fourth pass recomputes the linguistic scores of the updated word-graphs of the third pass with a quadrigram language model.

# 5 The last pass generates a confusion network from the word-graphs and applies the consensus method to extract the final one-best hypothesis [2].

3.2.3. Acoustic models

Acoustic models for 35 phonemes and 5 kinds of fillers are trained on a set of 280 hours from ESTER 1 & 2. Models for pass # 1 are composed of 6500 tied states. Models for passes # 2 to # 5 are composed of 7500 states and are trained in an MPE [3, 4] framework applied over the SAT-CMLLR models. Both decoding passes employ tied-state word-position 3-phone acoustic models which are made gender- and bandwidth-dependent through MAP adaptation of means, covariances and weights. The CMLLR technique for SAT in the second decoding pass generates a full 39x39 matrix for each speaker.

3.2.4. Vocabulary and language models

Data used to build the linguistic models are of three kinds:

1. Manual transcriptions of broadcast news. They correspond to the transcriptions of the data used to train the acoustic models. We have also used manual transcriptions of conversations from the PFC corpus [5];

2. Newspaper articles: in addition to 19 years of the "Le Monde" newspaper corpus, we also use articles from another French newspaper, "L'Humanité", from 1990 to 2007, and the French Gigaword corpus;

3. Web resources drawn from "L'Internaute", "Libération", "Rue89", and "Afrik.com".

To build the vocabulary, we generate a unigram model as a linear interpolation of unigram models trained on the various training data sources listed above. The linear interpolation was optimized on the ESTER 2 development corpus in order to minimize the perplexity of the interpolated unigram model. Then, we extract the 122k most probable words from this language model. Phonetic transcriptions for the vocabulary are taken from the BDLEX database, or generated by the rule-based grapheme-to-phoneme tool LIA_PHON [6] for words not in the database.

Using this vocabulary, all the textual data of the training corpus is used to train trigram and quadrigram language models. To estimate and interpolate these models, SRILM is employed with the modified Kneser-Ney discounting method. No cut-off is applied on unigrams, bigrams, trigrams and quadrigrams. The models are composed of 121k unigrams, 29M bigrams, 162M trigrams, and 376M quadrigrams.

4. LIUM system and CMU Sphinx tools

We have added large extensions to the SphinxTrain toolkit: MAP adaptation of means, but also of weights and covariances of the models, as well as SAT based on CMLLR and MPE, are the most remarkable. Passes # 1 and # 2 use version 3.7 of the Sphinx decoder, slightly modified to apply the CMLLR transformation to the features. Pass # 4 is based on sphinx3_astar, which we extended to handle quadrigram LMs. Passes # 3 and # 5 are based on Sphinx version 4, which we heavily modified to develop the acoustic graph decoder and the confusion network generation. Other parts, such as the computation of PLP features and the diarization system, do not rely on Sphinx and are entirely in-house developments.

5. Experiments on the LIUM system

The experiments are carried out using the official test corpus of the ESTER 2 campaign. This corpus consists of 6 hours (26 shows) recorded between December 2007 and February 2008.

5.1. Global results

The WER over the test data for the LIUM ASR baseline system is 19.2 %. Our system is based on a multi-pass architecture: table 1 shows the WER after each pass of the decoding process.

Table 1: Word error rates for each pass of LIUM'08

    Pass                                        Word error rate
    # 1 (general acoustic models, trigram)      27.1 %
    # 2 (acoustic adaptation)                   22.5 %
    # 3 (word-graph acoustic rescoring)         20.4 %
    # 4 (word-graph quadrigram rescoring)       19.4 %
    # 5 (consensus)                             19.2 %
    + pronunciation variant probability         18.8 %
    + specialization of linguistic models       18.1 %

We can observe that adaptation of the acoustic models brings a large gain in pass # 2, as does the better acoustic precision given by the full 3-phone search algorithm used to rescore a word-graph in pass # 3 (the acoustic models used in these two passes were trained using the MPE method). Rescoring this word-graph with a quadrigram model in pass # 4 lowers the WER by one extra point. The last pass does not have a significant impact on WER, but it allows the ASR system to provide confidence measures.

Two improvements were integrated into our ASR system. The first one consists in assigning a score to each pronunciation variant in the dictionary. The score is computed by observing the frequency of the variant in the training corpus. Table 1 shows that this brings a gain of 0.4 point in terms of WER.

The last improvement is based on the presence of two kinds of francophone radio stations in the ESTER 2 campaign: French and African ones. We decided to build two sets of linguistic knowledge bases (lexicon and n-gram models), specific to each of these two kinds of stations.
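Both the vocabulary construction of section 3.2.4 and the station-specific models above rely on linearly interpolating source-specific models, with weights tuned on development data to minimize perplexity. A toy sketch of that recipe over hypothetical word distributions (LIUM used SRILM on real counts; names and data here are illustrative only):

```python
def em_interpolation_weights(dev_tokens, models, iters=50, floor=1e-10):
    """EM estimation of linear-interpolation weights for unigram models.

    models: list of dicts mapping word -> probability. Maximizing
    development-set likelihood is equivalent to minimizing perplexity.
    """
    k = len(models)
    w = [1.0 / k] * k
    for _ in range(iters):
        counts = [0.0] * k
        for token in dev_tokens:
            post = [w[i] * models[i].get(token, floor) for i in range(k)]
            z = sum(post)
            for i in range(k):
                counts[i] += post[i] / z   # posterior of component i
        w = [c / len(dev_tokens) for c in counts]
    return w

def top_k_vocab(models, w, k):
    """Keep the k most probable words under the interpolated model."""
    words = set().union(*models)
    p = {v: sum(wi * m.get(v, 0.0) for wi, m in zip(w, models)) for v in words}
    return sorted(p, key=p.get, reverse=True)[:k]
```

With the real corpora, `top_k_vocab` with k = 122000 would correspond to the vocabulary-selection step described in section 3.2.4.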
The MPE method used to train acoustic models was also adapted for the African radio stations. Table 1 shows that this yields the best word error rate of all our experiments: 18.1%.

5.2. Confidence measures

In order to provide additional information for applications which could use it, the LIUM system uses the a posteriori probabilities computed during the generation of the confusion networks to provide confidence measures [7]. However, as seen in table 2, which presents an evaluation of these confidence measures in terms of normalized cross entropy (NCE), with no specific treatment these a posteriori probabilities are not very good predictors of the word error rate.

So, a mapping method is applied which consists in splitting the a posteriori probabilities into 15 classes of values: each confidence measure is linearly transformed using the coefficient associated with its class. These coefficients have been optimized on the ESTER 2 development corpus to maximize the NCE. Such a mapping approach was presented in [8]. Table 2 shows that this method makes the confidence measures provided by the LIUM ASR system very competitive, with an NCE of 0.329 on the ESTER 2 test corpus.

Table 2: Contribution of the mapping method applied to the confidence measures of our ASR system

    Confidence measures      NCE
    without mapping          0.064
    with mapping             0.329

6. Acoustic-based phonetic transcription method for proper nouns

One of our recent research works focuses on an approach to enhancing the automatic phonetic transcription of proper nouns. Proper nouns constitute a special case when it comes to phonetic transcription. Indeed, there is much less predictability in how proper nouns may be pronounced than for regular words. This is partly due to the fact that, in French, pronunciation rules are much less normalized for proper nouns than for other categories of words: a given sequence of letters is not guaranteed to be pronounced the same way in two different proper nouns.

Common approaches to the problem of automatic grapheme-to-phoneme (G2P) conversion have been proposed in the literature; the most popular are the dictionary look-up strategy [9], the rule-based approach [6], and the knowledge-based approach [10]. In order to enrich the set of phonetic transcriptions of proper nouns with some less predictable variants, we used an Acoustic Phonetic Decoding (APD) system on speech segments that correspond to utterances of proper nouns.

In the manually transcribed utterances, start and end times of individual words were not available. Therefore, the boundaries of each word of the transcription had to be determined by aligning the words with the signal, using a speech recognition system. In order to do the first forced alignment, we used three different grapheme-to-phoneme conversion methods:

• A method we proposed in [11], based on the use of a statistical machine translation (SMT) system;

• A data-driven conversion system proposed in [10], based on the use of joint-sequence models (JSM);

• A rule-based G2P method, LIA_PHON [6], which relies on the spelling of words to generate the possible corresponding chains of phones.

Once the start and end times of segments that contain proper nouns are determined, they are fed to the APD system to obtain their phonetic transcription. A filtering step is then used to remove the phonetic variants of proper nouns that are the most likely to generate confusion with other words. We propose to decode our training corpus using the proper noun phonetic dictionary that we want to filter, completed by a separate phonetic dictionary for the other words. Each phonetic transcription which allows the decoder to recognize the corresponding proper noun in the right place is added to the filtered dictionary. The whole decoding and filtering process is repeated until no more phonetic transcriptions get removed from the dictionary.

The metrics used are based on the Word Error Rate (WER) and on the Proper Noun Error Rate (PNER). The PNER is computed the same way as the WER, but only over proper nouns: PNER = (I + S + E) / N, where I is the number of wrong insertions of proper nouns, S the number of substitutions of proper nouns with other words, E the number of elisions of proper nouns, and N the total number of proper nouns.

On the ESTER 1 test corpus, the best results were obtained by using SMT to initialize the process. Using the ASR system developed for the ESTER 1 campaign, the WER decreased from 24.7% to 23.9% on segments that contain proper nouns. The PNER decreased from 26.2% to 20.5%.

7. Improving French ASR by targeting specific errors

Another of our recent research works focuses on the correction of specific errors. Some errors, which do not prevent understanding, are often neglected because they are not critical for the correct operation of most applications: for example, in French, errors of agreement in number or gender. For some applications, such as subtitling for hearing-impaired people or assisted transcription [12], these errors matter more: in the former case, repetition of errors, even if they do not modify the meaning of a sentence, is very tiring for the final user; in the latter case, where the goal is to produce an entirely correct transcription, these errors reduce the productivity gain provided by the use of an ASR system. Thus, the final user might limit his use of such transcription systems, feeling that ASR systems are not reliable enough because some errors could easily be corrected by a human. Agreement in gender and number (and/or person) is one of the most difficult aspects of the French language: French is an inflected language, containing many homophonous inflected forms.

In this work, we wanted to repair some errors by post-processing the ASR output obtained with the Sphinx decoder. We proposed an approach [13] consisting in building a specific correction solution for each specific error. Indeed, some complex grammatical rules cannot be modeled with an n-gram language model. The method must correct homophonous errors, should not be domain- or system-specific, and should handle a large vocabulary. We particularly focused on the errors caused by the homophonous inflected forms of past participles, as well as the errors concerning the words 'vingt/vingts' (twenty) and 'cent/cents' (hundred).
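As an illustration of the formal-rule strategy of [13], here is a toy corrector for the 'vingt/vingts' and 'cent/cents' agreement: in French, these words take a final 's' when multiplied (quatre-vingts, deux cents) but not when followed by another numeral (quatre-vingt-trois, deux cent un). This is a deliberately simplified sketch, not the rules actually used at LIUM:

```python
NUMERALS = {"un", "deux", "trois", "quatre", "cinq", "six", "sept", "huit",
            "neuf", "dix", "vingt", "vingts", "trente", "quarante",
            "cinquante", "soixante", "cent", "cents", "mille"}
# Words that can multiply "vingt" or "cent" (quatre-vingts, deux cents, ...)
MULTIPLIERS = {"deux", "trois", "quatre", "cinq", "six", "sept", "huit", "neuf"}

def correct_number_agreement(tokens):
    """Toy fix of 'vingt/vingts' and 'cent/cents' agreement in ASR output."""
    out = list(tokens)
    for i, tok in enumerate(out):
        base = tok.rstrip("s")
        if base not in ("vingt", "cent"):
            continue
        prev = out[i - 1] if i > 0 else ""
        nxt = out[i + 1] if i + 1 < len(out) else ""
        # plural only when multiplied and not followed by another numeral
        plural = prev in MULTIPLIERS and nxt not in NUMERALS
        out[i] = base + ("s" if plural else "")
    return out
```

In the real system, such a rule would only fire when the lexical context can be trusted; otherwise the statistical classifier described below takes over.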
These errors are among the most frequent errors produced by our ASR system, according to the analysis of confusion pairs.

To repair errors, we sought to use formal rules whenever possible. But this approach could not be the only one: in particular, formal rules are not very robust to errors existing in the lexical context of a targeted word. Thus, when it was possible to establish a formal rule, we did. If not, we tried a statistical method, based on a statistical classifier, in order to correct a hypothesis word being a past participle. The statistical method, presented in [13], used various knowledge bases: lexical information, acoustic information, part-of-speech (POS) tags, syntactic chunk categories, and other information levels. Moreover, acoustic information given by the ASR system is used to filter some potential corrections: a correction is valid only if one of its pronunciation variants matches the pronunciation variant of the targeted word.

The method using formal rules, presented in [13], allowed us to reduce the error rate for homophonous errors on the words "cent" and "vingt" by 86.4% on our test corpus. The stochastic method, which must repair errors due to the homophonous inflected forms of past participles, allowed us to reduce the error rate for this kind of error by 11%.

8. Spontaneous speech characterization and detection in large audio database

We were also interested in detecting spontaneous speech in a large audio database. Spontaneous speech, in opposition to prepared speech, occurs in Broadcast News (BN) data in several forms: interviews, debates, dialogues, etc. The main evidence characterizing spontaneous speech is disfluencies (filled pauses, repetitions, repairs and false starts), ungrammaticality and a language register different from the one found in written texts. Depending on the speaker, the emotional state and the context, the language used can be very different. Processing spontaneous speech is one of the many challenges that ASR systems have to deal with. Indeed, automatically transcribing spontaneous speech is a more difficult task than automatically transcribing prepared speech (the WER is higher).

In [14], we proposed a set of features for characterizing spontaneous speech. The relevance of these features was estimated on an 11-hour corpus (French Broadcast News) manually labelled by two human judges according to a level of spontaneity on a scale from 1 (clean, prepared speech) to 10 (highly disfluent speech, almost not understandable). The corpus was cut into segments thanks to automatic segmentation and diarization, and transcribed by the Sphinx decoder, before being annotated. Later, segments were labeled into three classes (prepared, low spontaneity and high spontaneity) depending on their level of spontaneity. In this work, we particularly focused on the detection of the high spontaneity class of speech.

In parallel to the subjective annotation of the corpus, we introduced the features used to describe speech segments. We chose features that are relevant to characterizing the spontaneity of those segments, and on which an automatic classification process can be trained using our annotated corpus. Three sets of features were used [14, 15]: acoustic features related to prosody, linguistic features related to the lexical and syntactic content of the segments, and confidence measures output by the ASR system. The features were evaluated on our labeled corpus with a classification task: labeling speech segments according to the three classes of spontaneity.

Intuitively, it should be rare to observe a high-spontaneity speech segment surrounded by two prepared speech segments. Our previous approach, presented in [14], takes into consideration only the descriptors extracted from within the targeted segment, without any information about surrounding segments. In order to improve our approach, we proposed in [15] to take into account the nature of the contiguous neighboring speech segments. This implies that the categorization of each speech segment in an audio file has an impact on the categorization of the other segments: the decision process becomes a global process. We chose a classical statistical approach, using a maximum likelihood method. With all these improvements, our method achieved a precision of 69.3% for highly spontaneous speech detection, with a recall of 74.6%.

9. Using CMU Sphinx tools for Statistical Machine Translation

The LIUM laboratory works on speech processing and machine translation. The speech team has used the Sphinx library for several years. Since last year, the machine translation team has developed system combination tools based on decoding lattices made of several confusion networks provided by different statistical machine translation (SMT) systems. In order to decode these lattices, a token passing decoder has been developed. This decoder uses the Sphinx 4 library, which is written in Java. The following sections describe the approach for system combination, the alignment of hypotheses and the token pass decoder.

10. SMT System combination

The system combination approach is based on confusion network decoding as described in [16, 17] and shown in Figure 1. The protocol can be decomposed into three steps:

1. The 1-best hypotheses from all M systems are aligned and confusion networks are built.

2. All confusion networks are connected into a single lattice, with empirically estimated prior probabilities on the first arcs.

3. The resulting lattice is decoded, and the 1-best hypothesis and/or an n-best list of hypotheses is generated.

Figure 1: MT system combination (the 1-best output of each system is TERp-aligned into a confusion network; the confusion networks are merged into a lattice which is decoded, with an LM, into the best hypothesis and an n-best list).
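The three steps above can be sketched structurally. The code below builds the merged lattice of step 2 from M confusion networks; the data layout (slots of (word, probability) arcs) and the function name are illustrative assumptions, not the actual LIUM combination API:

```python
def merge_confusion_networks(cns, priors):
    """Connect M confusion networks into a single lattice (step 2).

    cns: list of confusion networks; each CN is a list of slots, and
    each slot is a list of (word, prob) parallel arcs.
    priors: per-system prior probabilities, carried by the first arcs.
    Returns (arcs, start, end), where arcs maps node -> [(next, word, weight)].
    """
    assert abs(sum(priors) - 1.0) < 1e-9
    arcs, start, end, next_node = {}, 0, 1, 2
    for cn, prior in zip(cns, priors):
        node = start
        for i, slot in enumerate(cn):
            last = i == len(cn) - 1
            nxt = end if last else next_node
            next_node += 0 if last else 1
            for word, p in slot:
                # the system prior weights only the first arcs of its CN
                weight = p * prior if i == 0 else p
                arcs.setdefault(node, []).append((nxt, word, weight))
            node = nxt
    return arcs, start, end
```

Every sub-network shares the added first and last nodes, so a decoder choosing a path through the lattice implicitly chooses which system's confusion network to follow.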
10.1. Hypotheses alignment and confusion network generation

For each segment, the best hypotheses of M − 1 systems are aligned against the remaining one, used as the backbone. A modified version of the TERp tool [18] is used to generate a confusion network. This is done by incrementally adding the hypotheses to the CN. These hypotheses are added to the backbone beginning with the nearest (in terms of TERp) and ending with the most distant ones. This differs from [17], where the nearest hypothesis is recomputed at each step. M confusion networks are generated in this way. Then all the confusion networks are connected into a single lattice by adding a first and a last node. The probability of the first arcs (named priors) must reflect the capacity of each system to provide a well-structured hypothesis.

10.2. Decoding

The decoder is based on the token pass decoding algorithm. The principle of this decoder is to propagate tokens over the lattice and accumulate various scores into a global score for each hypothesis. The scores used to evaluate the hypotheses are the following:

• System score: this replaces the score of the translation model. Until now, the words given by all systems have the same probability, equal to their prior, but any confidence measure can be used at this step.

• Language model (LM) probability.

• A fudge factor to balance the probabilities provided in the lattice against those given by the language model.

• A null-arc penalty: this penalty avoids always going through null-arcs encountered in the lattice.

• A length penalty: this score helps to generate well-sized hypotheses.

The probabilities computed in the decoder can be expressed as follows:

    log(P_W) = Σ_{n=0}^{Len(W)} [ log(P_ws(n)) + α · P_lm(n) ]
               + Len_pen(W) + Null_pen(W)                          (1)

where Len(W) is the length of the hypothesis, P_ws(n) is the score of the nth word in the lattice, P_lm(n) is its LM probability, α is the fudge factor, Len_pen(W) is the length penalty of the word sequence and Null_pen(W) is the penalty associated with the number of null-arcs crossed to obtain the hypothesis.

At the beginning, one token is created at the first node of the lattice. Then this token is spread over consecutive nodes, accumulating the score of the arcs it crosses, the language model probability of the word sequence generated so far, and the null or length penalty when applicable. The number of tokens can increase very quickly to cover the whole lattice; in order to keep decoding tractable, only the Nmax best tokens are kept (the others are discarded), where Nmax can be configured in the configuration file.

10.2.1. Technical details about the token pass decoder

This software is based on the Sphinx4 library and is highly configurable. The maximum number of tokens considered during decoding, the fudge factor, the null-arc penalty and the length penalty can all be set within an XML configuration file. This is really useful for tuning. This decoder uses a language model (LM), which is described in section 10.2.2.

10.2.2. Language Model

There are two ways of loading a LM with this software. The first solution is to use the LargeTrigramModel class but, as its name tells us, at most a 3-gram model can be loaded with this class. The second and easiest way is to use a language model hosted on an lm-server. This kind of LM can be accessed via the LanguageModelOnServer class, which is based on the generic LanguageModel class from the Sphinx4 library. This allows us to load an n-gram LM with n higher than 3, which is not possible with a standard LM class in Sphinx4 at this time. A new generic class for handling k-gram LMs (whatever k is) is being developed at LIUM and will soon be integrated into this software. In addition, the Dictionary interface has been extended in order to load a simple dictionary containing all the words known by the LM (there is no need to know the different pronunciations of each word in this case).

11. Experimental Evaluation of SMT System Combination

We used our combination system, called MTSyscomb [19], for the IWSLT'09 evaluation campaign [20]. Table 3 presents the results obtained with this approach. The SMT system is based on MOSES [21], the SPE system corresponds to a rule-based system from SYSTRAN whose outputs have been corrected by an SMT system, and the Hierarchical system is based on Joshua [22].

Table 3: Results of system combination on Dev7 (development) corpus and Test09, the official test corpus of the IWSLT'09 evaluation campaign.

    Systems         Arabic/English      Chinese/English
                    Dev7     Test09     Dev7     Test09
    SMT             54.75    50.35      41.71    36.04
    SPE             48.13    -          41.23    38.53
    Hierarchical    54.00    49.06      39.78    31.89
    SMT + SPE       -        -          42.55    40.14
      + tuning      -        -          43.06    39.46
    SMT + Hier.     55.89    50.86      -        -
      + tuning      57.01    51.74      -        -

In these tasks, the system combination approach yielded +1.39 BLEU on Ar/En and +1.7 BLEU on Zh/En. One observation is that tuning the parameters did not provide better results for Zh/En.
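The token-pass search described in section 10.2 can be condensed into a small sketch implementing the log-domain terms of eq. (1). The lattice format, the `lm` callback and all constants are assumptions for illustration, not the Sphinx4-based decoder:

```python
import heapq
import math

def token_pass(arcs, start, end, lm, fudge=1.0, null_pen=-0.5,
               len_pen=0.1, nmax=100):
    """Token-pass lattice decoding in the spirit of eq. (1).

    arcs: node -> [(next_node, word_or_None, prob)]; None marks a null arc.
    lm(history, word): log LM score, weighted by the fudge factor.
    Only the nmax best tokens survive each propagation step.
    """
    tokens = [(0.0, start, ())]        # (score, node, word sequence)
    finished = []
    while tokens:
        new = []
        for score, node, words in tokens:
            if node == end:            # apply the length penalty at the end
                finished.append((score + len_pen * len(words), words))
                continue
            for nxt, word, p in arcs.get(node, []):
                if word is None:                       # null-arc penalty
                    new.append((score + null_pen, nxt, words))
                else:
                    s = math.log(p) + fudge * lm(words, word)
                    new.append((score + s, nxt, words + (word,)))
        tokens = heapq.nlargest(nmax, new)             # beam pruning
    score, words = max(finished)
    return words, score
```

With `nmax` large enough this is an exact search over the (acyclic) merged lattice; lowering `nmax` trades accuracy for speed, exactly the role of the Nmax parameter in the XML configuration described above.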
12. Conclusion

This paper presents some recent research works carried out at LIUM. We started using the CMU Sphinx tools in 2004 in order to develop an entire ASR system for the French language. We have added improvements and made our system the best open source system participating in French evaluation campaigns. This system is now at the center of the research works of the LIUM Speech Team. In the framework of speech processing, these works include grapheme-to-phoneme conversion, specific correction strategies for frequent specific errors, and the detection and characterization of spontaneous speech in large audio databases. We also used the CMU Sphinx tools in our research work on statistical machine translation, to combine SMT systems. Other research works, not presented in this paper, are carried out using CMU Sphinx, for example named speaker identification, which exploits the outputs of our speaker diarization system and the output of our ASR system.

We already share some of our resources (French acoustic and language models, for example) and we try to integrate our add-ons into the canonical source code of the CMU Sphinx project. This porting is very hard, because it needs a lot of development time and because CMU Sphinx progresses and its source code frequently changes. In the future, we will try to integrate our work into the canonical source code as soon as possible in order to make this integration easier. Moreover, it would be really interesting to have access to a global view of the current and planned developments made by the other CMU Sphinx developers.

13. References

[1] M. J. F. Gales, "Maximum likelihood linear transformations for HMM-based speech recognition," Computer Speech and Language, vol. 12, pp. 75–98, 1998.

[2] L. Mangu, E. Brill, and A. Stolcke, "Finding consensus in speech recognition: Word error minimization and other applications of confusion networks," Computer Speech and Language, vol. 14, no. 4, pp. 373–400, 2000.

[3] D. Povey and P. Woodland, "Minimum phone error and I-smoothing for improved discriminative training," in ICASSP, Florida, USA, 2002, pp. 105–108.

[4] D. Povey, "Discriminative training for large vocabulary speech recognition," Ph.D. dissertation, Department of Engineering, University of Cambridge, United Kingdom, 2004.

[5] J. Durand, B. Laks, and C. Lyche, "La phonologie du français contemporain : usages, variétés et structure," in Romance Corpus Linguistics – Corpora and Spoken Language, pp. 93–106, 2002.

[6] F. Béchet, "LIA_PHON : un système complet de phonétisation de texte," Traitement Automatique des Langues, vol. 42, pp. 47–68, Hermès, 2001.

[7] H. Jiang, "Confidence measures for speech recognition: a survey," Speech Communication, vol. 45, pp. 455–470, 2005.

[8] G. Evermann and P. Woodland, "Large vocabulary decoding and confidence estimation using word posterior probabilities," in ICASSP, Istanbul, Turkey, June 2000.

[9] M. De Calmes and G. Perennou, "BDLEX: a lexicon for spoken and written French," in Proc. of LREC, International Conference on Language Resources and Evaluation, 1998, pp. 1129–1136.

[10] M. Bisani and H. Ney, "Joint-sequence models for grapheme-to-phoneme conversion," Speech Communication, vol. 50, no. 5, pp. 434–451, 2008.

[11] A. Laurent, P. Deléglise, and S. Meignier, "Grapheme to phoneme conversion using an SMT system," in Proceedings of Interspeech 2009, 2009.

[12] T. Bazillon, Y. Estève, and D. Luzzati, "Manual vs assisted transcription of prepared and spontaneous speech," in LREC 2008, Marrakech, Morocco, May 2008.

[13] R. Dufour and Y. Estève, "Correcting ASR outputs: specific solutions to specific errors in French," in SLT 2008, Goa, India, December 2008.

[14] R. Dufour, V. Jousse, Y. Estève, F. Béchet, and G. Linarès, "Spontaneous speech characterization and detection in large audio database," in SPECOM 2009, St. Petersburg, Russia, June 2009.

[15] R. Dufour, Y. Estève, P. Deléglise, and F. Béchet, "Local and global models for spontaneous speech segment detection and characterization," in ASRU 2009, Merano, Italy, December 2009.

[16] W. Shen, B. Delaney, T. Anderson, and R. Slyh, "The MIT-LL/AFRL IWSLT-2008 MT system," in IWSLT, Hawaii, USA, 2008, pp. 69–76.

[17] A.-V. Rosti, S. Matsoukas, and R. Schwartz, "Improved word-level system combination for machine translation," in ACL, 2007, pp. 312–319.

[18] M. Snover, B. Dorr, R. Schwartz, L. Micciulla, and J. Makhoul, "A study of translation edit rate with targeted human annotation," in ACL, 2006.

[19] L. Barrault, "MANY: Open source machine translation system combination," Prague Bulletin of Mathematical Linguistics, Special Issue on Open Source Tools for Machine Translation, vol. 93, pp. 147–155, 2010.

[20] H. Schwenk, L. Barrault, Y. Estève, and P. Lambert, "LIUM's statistical machine translation systems for IWSLT 2009," in Proc. of the International Workshop on Spoken Language Translation, Tokyo, Japan, 2009, pp. 65–70.

[21] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst, "Moses: Open source toolkit for statistical machine translation," in ACL, Association for Computational Linguistics, 2007.

[22] Z. Li, C. Callison-Burch, C. Dyer, J. Ganitkevitch, S. Khudanpur, L. Schwartz, W. N. G. Thornton, J. Weese, and O. F. Zaidan, "Joshua: an open source toolkit for parsing-based machine translation," in StatMT '09: Proceedings of the Fourth Workshop on Statistical Machine Translation, Morristown, NJ, USA: Association for Computational Linguistics, 2009, pp. 135–139.
