Language Model Bootstrapping Using Neural Machine Translation For Conversational Speech Recognition
2. MACHINE TRANSLATION FOR DATA AUGMENTATION

2.1. Building translation model

Neural machine translation (NMT) is the dominant paradigm for current MT research. We assess the popular neural architectures for the task of building an EN→HI translation model. The sequence-to-sequence framework with attention, proposed in [4], comprises two recurrent neural networks: an Encoder, which reads the source sentence tokens (x1, x2, ..., xt) to generate continuous representations (h1, h2, ..., ht), and a Decoder, which outputs symbols (y1, y2, ..., yn), conditioned on the previous outputs as well as a context vector c, derived as the weighted sum of the encoder hidden states h.

Transformer networks, proposed in [7], eliminate recurrence in favor of parallelism and rely solely on attention. Here the encoder and decoder comprise stacked self-attention and fully connected layers. These networks represent the current state-of-the-art for NMT.

For training the translation models, we use a corpus of 8.4M parallel (EN, HI) sentence pairs prepared by crawling different web sources. We employ the BLEU (Bilingual Evaluation Understudy) score [14] to assess translation quality. Fig. 1 captures the performance of these models for different configurations. The best performing recurrent encoder-decoder with attention model achieves a BLEU score of 43.8, as compared to 46.4 achieved by the transformer architecture.

Fig. 1. BLEU score over epochs. Transformer and recurrent architectures are represented by labels T and R respectively.
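For illustration, the context vector computation described above reduces to a few lines of numpy. This is a minimal sketch rather than our actual implementation: a dot-product scorer stands in for the MLP-based alignment model of [4], and all names are illustrative.

```python
import numpy as np

def attention_context(decoder_state, encoder_states):
    """Context vector c as the attention-weighted sum of encoder states
    h_1..h_t (Section 2.1). Dot-product scoring is used for brevity in
    place of the MLP alignment model of [4]."""
    scores = encoder_states @ decoder_state      # one score per source token
    weights = np.exp(scores - scores.max())      # numerically stable softmax
    weights /= weights.sum()
    context = weights @ encoder_states           # weighted sum of h_1..h_t
    return context, weights

# Toy usage: 5 source tokens, hidden dimension 8.
h = np.random.randn(5, 8)
s = np.random.randn(8)
c, alpha = attention_context(s, h)
```

The attention weights returned here are the same quantities we later reuse to approximate word alignments (Section 2.4).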
2.2. Extracting alignments

Deriving alignments is known to be more challenging for transformer networks with self-attention and multiple attention heads. There has been some recent work on alleviating this issue by explicitly adding an alignment head to the base architecture [15]. Owing to the relative ease of alignment extraction, we make the modeling design decision of using the recurrent encoder-decoder networks with attention for our NMT experiments.

2.3. Incorporating raw translations: An initial study

As an initial experiment, we translated user interaction sentences from US English collections to Hindi and directly ingested the raw translations for LM training. However, this strategy resulted in a very high perplexity LM. Upon further analysis, we found three key explanatory factors. First, typical user interactions with voice-controlled agents, e.g. song requests, contain several named entities. The general purpose EN→HI MT system generates translations for those entities as well, which is not desirable. The second factor is the absence of code-switching in the translations, which are purely in Hindi, owing to the nature of the training data. Given the extent of intra-sentential code mixing in conversational Hindi, it seems imperative for the translations to capture it as well, in order to add value to the downstream language modeling task. Finally, the most challenging nuance is the out-of-domain nature of the MT training data (news items, wiki articles, etc.), which results in a lack of informal interaction style in the generated translations, an inherent attribute of user-device interactions. This domain mismatch issue has been observed in other machine translation settings as well.
2.4. Post-editing translations
We use the attention weights derived while decoding to approximate alignments. Each target token generated by the decoder is considered to be aligned with the source token corresponding to the encoder position with maximum attention weight. Metadata in the source text annotations, e.g. song name, artist name etc., is used to identify named entities (NE). Using these alignments and annotations, the following post-editing steps are performed: 1) NE copy-over: source tokens corresponding to named entities are simply retained as-is in the output; 2) NE resampling: named entities are resampled with local Hindi catalogs, since the trending entities vary by geography; and 3) Code mixing: code-switching is simulated in the translations by probabilistically copying over source English tokens. The probability of retaining a token is set to be directly proportional to its smoothed relative frequency in the transcribed Hindi collections. Fig. 2 provides examples of raw translations and the outputs of the post-editing step.
Fig. 2. Examples of raw translations and postprocessed outputs. NE is a shorthand for named entities. Aligned source and target tokens are indicated by similar highlighting. Underlined tokens indicate English tokens in the simulated code mixing.
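The three post-editing steps can be summarized in the sketch below. The inputs (per-utterance attention matrix, NE source indices, local entity catalog, token retention probabilities) are assumed to be available from the decoding and annotation stages; the function is a simplified, token-level illustration, not the production pipeline.

```python
import random

def post_edit(src_tokens, tgt_tokens, attn, ne_src_idx,
              hindi_catalog=None, en_keep_prob=None):
    """Post-edit one raw translation (Section 2.4).
    attn[i][j]: attention weight of target token i on source token j.
    ne_src_idx: source positions annotated as named entities.
    en_keep_prob: token -> probability of copying the English token,
    proportional to its smoothed relative frequency in the transcripts."""
    out = []
    for i, tgt in enumerate(tgt_tokens):
        # Approximate alignment: source position with maximum attention.
        j = max(range(len(src_tokens)), key=lambda k: attn[i][k])
        if j in ne_src_idx:
            # 1) NE copy-over: retain the source entity as-is, or
            # 2) NE resampling: swap in an entity from a local catalog.
            out.append(random.choice(hindi_catalog) if hindi_catalog
                       else src_tokens[j])
        elif en_keep_prob and random.random() < en_keep_prob.get(src_tokens[j], 0.0):
            # 3) Code mixing: probabilistically copy the English token.
            out.append(src_tokens[j])
        else:
            out.append(tgt)
    return out
```

Real named entities may span multiple tokens; the sketch treats them as single tokens for brevity.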
2.5. Domain adaptation

Domain mismatch between the training and inference data presents a more subtle challenge, with the translations not reflecting colloquial usage. Most of the prominent approaches for MT model adaptation, like backtranslation [16], shallow fusion [17], etc., assume the presence of a large target-side monolingual corpus for boosting the fluency of translations. Such an assumption cannot be made during model bootstrapping, where only limited target in-domain data (the transcribed collection) is available for adaptation. With this constraint in place, we experiment with four broad classes of techniques for adaptation: 1) data selection for MT training, 2) model finetuning, 3) rescoring, and 4) filtering translations.
2.5.1. Data selection for MT training

Selecting NMT training samples similar to the in-domain data from the out-of-domain parallel corpus has been explored in [13]. The central idea of this work is to use both in-domain and out-of-domain parallel corpora to train an NMT system. Similarity between the learnt encoder representations of these sentences is used to define their semantic closeness. Sentence selection is based on the relative similarity to the in-domain versus out-of-domain vector centers.

We adopt a similar data-centric approach for adaptation. In our setting, however, an in-domain parallel corpus is not available. Hence, representing sentences via NMT encoder embeddings is not feasible. In order to facilitate data selection, we resort to learning unsupervised sentence embeddings for quantifying the closeness of target-side MT training sentences with the available in-domain transcriptions.

We compare the following techniques for generating unsupervised sentence representations: (1) unweighted averaging of word vectors; (2) smooth inverse frequency (SIF) [18], where the sentence embeddings are computed as a weighted average of word vectors followed by removal of the projection on the first singular vector; and (3) language agnostic sentence representations (LASER) [19], an open source pre-trained biLSTM encoder for generating multilingual sentence embeddings that generalize across languages and NLP tasks. The first approach is appealing owing to its simplicity. In the second technique, taking word frequency into account is the distinguishing factor. Potential cross-lingual generalization is the advantage offered by the third approach. We use FastText [20] for learning word vector representations.

Using each of these approaches, we generate sentence embedding vectors for in-domain (Fin) and out-of-domain (Fout) target side sentences. Along similar lines as [13], data selection is based on the relative distance δs of a sentence vector vs w.r.t. the in-domain and out-of-domain centroids CFin and CFout respectively, as indicated by Eq. (2). A lower value of δs implies higher resemblance with in-domain data.

δs = d(vs, CFin) − d(vs, CFout)    (2)

where CFin = (Σf∈Fin vf) / |Fin|, CFout = (Σf∈Fout vf) / |Fout|, and d(x, y) = ||x − y||2.
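Given any of the three sentence encoders, the selection rule of Eq. (2) is straightforward to implement. In the sketch below, embed() is a stand-in for whichever encoder is used (averaged FastText vectors, SIF, or LASER), and the 25% retention fraction anticipates the setting used in Section 3.2.

```python
import numpy as np

def select_in_domain(cand_sents, in_domain_sents, embed, keep_frac=0.25):
    """Rank target-side MT training sentences by Eq. (2) and keep the
    fraction closest to the in-domain centroid. embed() maps a list of
    sentences to an (n, d) embedding matrix."""
    v_in, v_out = embed(in_domain_sents), embed(cand_sents)
    c_in, c_out = v_in.mean(axis=0), v_out.mean(axis=0)  # CFin, CFout
    # delta_s = d(v_s, CFin) - d(v_s, CFout); lower means more in-domain.
    delta = (np.linalg.norm(v_out - c_in, axis=1)
             - np.linalg.norm(v_out - c_out, axis=1))
    keep = np.argsort(delta)[: int(keep_frac * len(cand_sents))]
    return [cand_sents[i] for i in keep]
```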
2.5.2. Model finetuning

Backtranslation [16] is a popular approach for adaptation, where a target-to-source translation model is learnt on the parallel corpus and is used to translate the unpaired target-side monolingual data. The resulting synthetic (source, target) pairs are leveraged for the original source-to-target model training. In the absence of a large target monolingual corpus, we resort to an alternate approach for synthetic corpus generation. We generate pseudo pairs by translating a portion of US English utterances using an initially trained NMT model, and perform post-editing to retain named entities. With this additional parallel data, the model is further trained for a certain number of epochs. As we discuss in Section 3.2, this type of finetuning is susceptible to overfitting.
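The synthetic-pair construction can be outlined as follows. The translate() callable and the post-editing hook are placeholders for the initially trained NMT model and the procedure of Section 2.4; the comment on epochs reflects the overfitting behaviour reported in Section 3.2.

```python
def make_pseudo_pairs(en_utterances, translate, post_edit_fn, portion=0.1):
    """Build synthetic (source, target) pairs for finetuning (Section 2.5.2):
    translate a portion of US English utterances with the initially trained
    NMT model, then post-edit the outputs to retain named entities."""
    subset = en_utterances[: int(portion * len(en_utterances))]
    return [(src, post_edit_fn(translate(src))) for src in subset]

# The model is then trained for only a few further epochs on these pairs
# (3 worked better than 10 in our experiments), since finetuning on
# self-generated data lets the model reinforce its own errors.
```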
2.5.3. Rescoring with in-domain LM

In order to further boost the fluency of translations, the hypotheses obtained after beam search decoding are rescored using an n-gram LM built from the in-domain transcribed data. The score of a translation hypothesis is computed as a weighted sum of the MT decoding score and the LM score. The choice of an n-gram LM for rescoring is motivated by its robustness under low-resource conditions as compared to an RNNLM.
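A sketch of the rescoring rule follows; lm_logprob stands in for the in-domain n-gram LM, and the convex weighting with the 0.3 LM weight of Section 3.2 is one plausible instantiation of the weighted sum.

```python
def rescore(beam, lm_logprob, lm_weight=0.3):
    """Re-rank beam search hypotheses (Section 2.5.3). beam is a list of
    (hypothesis, mt_logprob) pairs; the combined score is a weighted sum
    of the MT decoding score and the in-domain LM score."""
    return max(beam, key=lambda h: (1.0 - lm_weight) * h[1]
                                   + lm_weight * lm_logprob(h[0]))
```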
2.5.4. Filtering translations

As a final step, we attempt to remove spurious translations by filtering based on a quality measure. The main challenge here is to define the “goodness” of a translated output. A potential candidate is the score assigned by the MT model: the product of the conditional probabilities of the output tokens generated by the MT decoder can be used as a proxy for well-formedness. We also consider using the statistical LM built from the transcribed data to assess the quality of the translation data. Using each of these scores, we retain the top-x percentile of the translation output.
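The percentile-based filtering can be sketched as below, where score_fn is either the MT decoder's sequence log-probability or the in-domain LM score; the default mirrors the top-75% setting that performs best in Section 3.2.

```python
import numpy as np

def filter_translations(translations, score_fn, keep_percentile=75):
    """Keep the top-x percentile of translations by a quality score
    (Section 2.5.4)."""
    scores = np.array([score_fn(t) for t in translations])
    cutoff = np.percentile(scores, 100 - keep_percentile)
    return [t for t, s in zip(translations, scores) if s >= cutoff]
```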
An overview of the adaptation and postprocessing pipeline is provided in Fig. 3, with section numbers indicated against each component.
Fig. 3. Overview of the model training, adaptation and postprocessing pipeline. (.) indicates the corresponding section describing the approach.
3. RESULTS AND DISCUSSION

3.1. Experimental setup

We conduct the experimental evaluation using up to 180 hours of Hindi-English code-switched speech for training. This dataset comprises 200K Hindi utterances collected using Cleo, an interactive skill that enables users to teach local languages to voice assistants via prompts. These prompts cover use cases like song requests or knowledge related questions. These natural utterances represent the transcribed in-domain component in our experiments. We follow the factored ASR architecture, and the AM is a hybrid DNN-HMM model, trained on log filter bank energy (LFBE) features extracted at 10 ms intervals for a 25 ms analysis window. The LM is built by learning a 4-gram model with Katz smoothing on the training data.
The baseline LM is built using the in-domain transcriptions only. The translation component is obtained by translating 9.8M US English utterance transcripts using the trained NMT models, followed by adaptation and post-editing. For the evaluation candidates, the LM is built by a linear interpolation of the transcribed and translated components. The interpolation weights are tuned to minimize the perplexity of a held-out in-domain dataset. We assign a floor interpolation weight of 0.25 to the translation component to ensure that it receives sufficient representation. For all the MT experiments except rescoring, we employ greedy decoding, i.e. a beam size of one. The test set comprises 37K utterances. We report the relative word error rate reduction (WERR) w.r.t. models built using the baseline LM.
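The interpolation with a floor weight amounts to the following; p_transcribed and p_translated are placeholders for the two component LMs, and the two-component formulation keeps the weights normalized.

```python
def interpolated_prob(ngram, p_transcribed, p_translated,
                      w_trans, floor_w=0.25):
    """Linearly interpolate the two LM components (Section 3.1). w_trans
    is the translation-component weight found by perplexity tuning on
    held-out transcriptions; the floor prevents the translation component
    from being tuned away due to domain mismatch."""
    w = max(w_trans, floor_w)
    return (1.0 - w) * p_transcribed(ngram) + w * p_translated(ngram)
```

Sweeping floor_w over the values in Table 4 reproduces the floor-weight experiment of Section 3.4.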
3.2. Results

Table 1 captures the performance of the post-editing techniques. The approach of ingesting raw translations without any postprocessing yields a high perplexity translation component, resulting in a negative WERR. We observe consistent improvements by introducing attention weight based post-editing. NE copy-over alone reduces the perplexity significantly. This, coupled with NE resampling and code mixing, results in a 5.83% WERR.

Postprocessing    Approach                       PPL        Relative WERR %
None              Raw translations               11941.08   -1.81
Post-editing      NE copy-over                   2889.45    2.36
Post-editing      NE resampling                  1241.52    4.62
Post-editing      Code mixing + NE resampling    936.64     5.83

Table 1. Relative WERRs (%) with different post-editing techniques. Perplexity (PPL) is evaluated on a held-out in-domain dataset. Relative WERR captures the WER reduction w.r.t. the baseline trained on transcribed data only.

In Table 2, we assess the impact of each adaptation approach followed by post-editing. For MT training data selection, we retain only the top-25% (out of 8.4M) sentences w.r.t. their relative similarity with in-domain data. This reduction in training data impacts the BLEU score adversely. Amongst the sentence representation techniques, LASER and SIF embeddings outperform the unweighted averaging approach in terms of BLEU score. Interestingly, while the unweighted averaging achieves the lowest perplexity on a held-out in-domain dataset, the gains don't carry over when measuring overall ASR performance. SIF embedding based selection achieves the highest WERR of 7.23%, followed closely by the LASER encoder representation.

Rescoring the decoded beams using the transcription based LM yields a WERR of 6.28% for a beam size of 5. In these experiments, a relative weight of 0.3 is assigned to the LM for the overall score computation. Increasing the beam size from 5 to 20 leads to a drop in WERR, suggesting that during decoding, the head portion of the translation output contains hypotheses helpful for improving naturalness, while increasing the beam width can result in higher confusability.

For the model finetuning approach, the number of additional training epochs is an important parameter. We observe a WERR of 6.84% when this parameter is set to 3, as compared to 5.23% for 10 epochs. Increasing the number of passes on the synthetic data generated using an initially trained model perpetuates the effect of the model reinforcing its own errors. This potential overfitting makes early stopping imperative.

In the experiments focusing on translation output filtering, the MT score did not turn out to be an effective metric for quality evaluation, as indicated both by perplexity and WERR. We obtain interesting insights by ranking translations using in-domain LM scores. A WERR of 6.82% is observed by retaining only the top-75% translations. Making the filtering more conservative beyond this point degrades performance. One caveat of the LM guided filtering approach is that patterns which are underrepresented in the initial collections receive low LM scores. This could explain the drop in WERR when moving to the top-65% translations; since the transcribed volume used for LM training is itself small, some of the discarded patterns could have been complementary for the overall ASR performance.

Combining the SIF selection, finetuning, rescoring and LM based filtering approaches results in a relative WERR of 7.86%.

3.3. Impact on different interaction scenarios

Cleo prompts cover multiple interaction use cases. In order to derive fine grained insights into the effect of translations, we study WERR on test utterances manually categorized into scenarios. Nearly 70% of the test utterances fall into one of the nine interaction scenarios mentioned in Table 3. In order to isolate the gains obtained from post-editing and adaptation, we study both the post-editing and the combined WERR. We also analyse the proportion of named entities for utterances belonging to each of these scenarios. The following observations can be made from this analysis.
Table 3. Relative WERR % by interaction scenario, captured along with the extent of coverage in the transcribed collections. The named entity proportion in the utterances is given by NE %. Adaptation contribution % captures the relative contribution of adaptation towards WERR. For example, with a 5.75% post-editing WERR, adaptation yields an additional 1.51% WERR towards a combined WERR of 7.26%, i.e. 20.80%.
• Higher WERR is observed for scenarios such as shopping and knowledge related queries, which are not well represented in the transcribed collections as compared to their popular counterparts like song requests and notifications, suggesting that translations can effectively complement the transcribed data.

• WERR shows a positive correlation with the percentage of named entities (Pearson correlation: 0.647, p-value: 0.059). This observation is consistent with the results in the previous section, where postprocessing alone demonstrates a significant WERR.

• The relative contribution of adaptation towards WERR is higher for scenarios with a smaller named entity footprint, e.g. weather. Hence, the gains from postprocessing and adaptation seem to be complementary across conversation scenarios.

3.4. Impact of floor weight for interpolation

The final augmented LM is an interpolated n-gram model, with the probability of an n-gram computed as a weighted sum of the probabilities assigned by the transcribed and translation components. Since the tuning data for determining the interpolation weights comprises transcribed utterances only, the translation component may receive a low weight owing to domain mismatch.

The purpose of this investigation is to observe the effect of changing the floor weight parameter for the translation component, which provides a lever to override its relative importance in the interpolated LM. As seen from Table 4, the overall PPL increases as we increase the floor weight. However, WERR fluctuates with varying floor weights: a low weight renders the translation component ineffective, whereas a high value undermines the transcription component. A floor weight sweep can provide empirical guidance for adjusting this parameter.
Floor weight    Interpolated PPL    WERR %
0.1             50.28               5.78
0.15            51.24               7.04
0.25            52.36               7.86
0.3             53.37               7.49
0.4             56.34               6.58

Table 4. PPL and relative WERRs (%) with varying floor interpolation weights for the translation component in the 180 hour setup.
3.5. Impact of in-domain data volume

We now attempt to address the following question: what are the relative gains provided by the translation data during different phases of bootstrapping? In particular, we measure the WERR between the baseline and translation-augmented LMs, varying the in-domain transcribed utterances from 10K to 200K. We observe that the combined WERR after post-editing and adaptation increases from 7.86% to 15.65% as the amount of in-domain data reduces. Note that in this experiment, we use the same AM trained on 180 hours of data, in order to precisely study the effect of data augmentation for the LM. The WERR we report is hence an underestimate, and would probably be much higher if the AM were trained using similar levels of transcribed data. These findings, summarized in Table 5, suggest that the neural MT supplements can especially aid the initial stages of model development.
Transcribed Volume    WERR %
10K                   15.65
20K                   13.18
50K                   9.42
100K                  8.98
200K                  7.86

Table 5. Relative WERRs (%) with varying levels of in-domain transcribed data.
4. CONCLUSION

In this work, we explored the key challenges associated with using NMT for LM data augmentation in a conversational, code-switched setting. Using a combination of post-editing and domain adaptation techniques, we demonstrated a relative WERR of 7.8% for 180 hours of transcribed data. We examined the performance trajectory along different bootstrapping phases, and observed a relative WERR of up to 15.6% with reduced transcription volumes. A further drilldown of WERR by interaction scenario provided interesting insights into the gains derived from translation as a function of the proportion of named entities and the relative representation in the transcribed data. This experimental evidence establishes the efficacy of using translations for supplementing transcribed collections in the early stages of model development, a strategy which could be instrumental for rapid language expansion. Though Hindi is used as the experimental testbed in this work, the techniques presented are generic and can be leveraged for bootstrapping other languages as well. Exploring semi-supervised and unsupervised translation is a promising future direction, especially for low resource languages.
5. REFERENCES

[1] Arseniy Gorin, Rasa Lileikytė, Guangpu Huang, Lori Lamel, Jean-Luc Gauvain, and Antoine Laurent, “Language model data augmentation for keyword spotting in low-resourced training conditions,” in Interspeech 2016, 2016, pp. 775–779.

[2] Saurabh Garg, Tanmay Parekh, and Preethi Jyothi, “Code-switched language models using dual RNNs and same-source pretraining,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, Oct.–Nov. 2018, pp. 3078–3083, Association for Computational Linguistics.

[3] Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio, “On the properties of neural machine translation: Encoder–decoder approaches,” in Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, Doha, Qatar, Oct. 2014, pp. 103–111, Association for Computational Linguistics.

[4] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, “Neural machine translation by jointly learning to align and translate,” CoRR, vol. abs/1409.0473, 2015.

[5] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean, “Google’s neural machine translation system: Bridging the gap between human and machine translation,” CoRR, vol. abs/1609.08144, 2016.

[6] Philipp Koehn, Franz Josef Och, and Daniel Marcu, “Statistical phrase-based translation,” in Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, Stroudsburg, PA, USA, 2003, NAACL ’03, pp. 48–54, Association for Computational Linguistics.

[7] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” CoRR, vol. abs/1706.03762, 2017.

[8] ArnarThor Jensson, Koji Iwano, and Sadaoki Furui, “Language model adaptation using machine-translated text for resource-deficient languages,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2008, no. 1, pp. 573832, Jan. 2009.

[9] Thang Vu, Dau-Cheng Lyu, Jochen Weiner, Dominic Telaar, Tim Schlippe, Fabian Blaicher, Eng Chng, Tanja Schultz, and Haizhou Li, “A first speech recognition system for Mandarin-English code-switch conversational speech,” in ICASSP, Mar. 2012.

[10] Chenhui Chu and Rui Wang, “A survey of domain adaptation for neural machine translation,” CoRR, vol. abs/1806.00258, 2018.

[11] Horia Cucu, Andi Buzo, Laurent Besacier, and Corneliu Burileanu, “SMT-based ASR domain adaptation methods for under-resourced languages: Application to Romanian,” Speech Commun., vol. 56, pp. 195–212, Jan. 2014.

[12] Judith Gaspers, Penny Karanasou, and Rajen Chatterjee, “Selecting machine-translated data for quick bootstrapping of a natural language understanding system,” CoRR, vol. abs/1805.09119, 2018.

[13] Rui Wang, Andrew Finch, Masao Utiyama, and Eiichiro Sumita, “Sentence embedding for neural machine translation domain adaptation,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Vancouver, Canada, July 2017, pp. 560–566, Association for Computational Linguistics.

[14] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu, “BLEU: A method for automatic evaluation of machine translation,” in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, Stroudsburg, PA, USA, 2002, ACL ’02, pp. 311–318, Association for Computational Linguistics.

[15] Tamer Alkhouli, Gabriel Bretschner, and Hermann Ney, “On the alignment problem in multi-head attention-based neural machine translation,” in WMT, 2018.

[16] Rico Sennrich, Barry Haddow, and Alexandra Birch, “Improving neural machine translation models with monolingual data,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, Aug. 2016, pp. 86–96, Association for Computational Linguistics.

[17] Çaglar Gülçehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loïc Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio, “On using monolingual corpora in neural machine translation,” CoRR, vol. abs/1503.03535, 2015.

[18] Jonas Mueller and Aditya Thyagarajan, “Siamese recurrent architectures for learning sentence similarity,” in Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 2016, AAAI’16, pp. 2786–2792, AAAI Press.

[19] Mikel Artetxe and Holger Schwenk, “Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond,” CoRR, vol. abs/1812.10464, 2018.

[20] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov, “Enriching word vectors with subword information,” Transactions of the Association for Computational Linguistics, vol. 5, pp. 135–146, 2017.