
LANGUAGE MODEL BOOTSTRAPPING USING NEURAL MACHINE TRANSLATION FOR CONVERSATIONAL SPEECH RECOGNITION

Surabhi Punjabi, Harish Arsikere, Sri Garimella

Alexa Machine Learning, Amazon, Bangalore, India


{spunjabi, arsikere, srigar}@amazon.com

ABSTRACT

Building conversational speech recognition systems for new languages is constrained by the availability of utterances capturing user-device interactions. Data collection is expensive and limited by the speed of manual transcription. To address this, we advocate the use of neural machine translation as a data augmentation technique for bootstrapping language models. Machine translation (MT) offers a systematic way of incorporating collections from mature, resource-rich conversational systems that may be available for a different language. However, ingesting raw translations from a general-purpose MT system may not be effective owing to the presence of named entities, intra-sentential code-switching, and the domain mismatch between the conversational data being translated and the parallel text used for MT training. To circumvent this, we explore the following domain adaptation techniques: (a) sentence embedding based data selection for MT training, (b) model finetuning, and (c) rescoring and filtering of translated hypotheses. Using Hindi as the experimental testbed, we supplement transcribed collections with translated US English utterances. We observe a relative word error rate reduction of 7.8-15.6%, depending on the bootstrapping phase. Fine-grained analysis reveals that translation particularly aids the interaction scenarios underrepresented in the transcribed data.

Index Terms— speech recognition, neural machine translation, domain adaptation, code-switching

1. INTRODUCTION

Bootstrapping an automatic speech recognition (ASR) system for a new language involves significant data collection and transcription overhead. For factored ASR systems, where the acoustic model (AM) and language model (LM) are trained independently, the LM can be trained with additional text-only corpora to boost performance. This is especially helpful during the initial stages of model development. For a new language, the typical supplemental LM sources include Wikipedia, news portals, blogs, etc., which can be incorporated along with the limited transcribed data to circumvent the cold-start issue. However, for conversational agents like Alexa and Siri, the utterances are usually short, goal-directed, and contain several named entities such as song titles and artist names, e.g. Play Moonlight Sonata by Beethoven. This informal interaction style, characteristic of conversational data, is absent in online text sources, rendering them less effective for this task. As a result, LM building relies mostly on transcribed data, and its performance is restricted by the speed of manual transcription and annotation.

There has been growing interest in data augmentation for ASR language modeling. Previous studies include training a recurrent neural network (RNN) based LM on transcriptions and using it to generate synthetic samples for augmentation [1]. SeqGAN, a generative adversarial model for sequences, has been employed for pretraining a code-switched LM [2]. However, a precondition for the successful generalization of these neural generative models is the availability of a substantial amount of in-domain utterance text for training, which is itself the bottleneck during the bootstrapping phase.

Utterances from mature conversational systems, for example in English, provide a rich source of information. They are both in-domain, since they capture actual user interaction patterns of varying complexity, and large-scale, owing to prolonged usage. Translation offers an elegant and cost-effective solution for leveraging this existing data. Devising techniques for systematically incorporating translated data can be instrumental in achieving the rapid language expansion goal for ASR, by alleviating the prohibitively high data collection requirements during bootstrapping.

The area of machine translation has witnessed sustained research efforts [3, 4, 5]. It is also amongst the first success stories of the end-to-end neural paradigm for sequence modeling. Conventional phrase-based statistical machine translation (SMT) [6] has been shown to be outperformed by attention-based recurrent encoder-decoder models [5] and by transformer networks comprising self-attention and feed-forward network blocks [7].

Data augmentation via SMT has been explored in the past for keyword spotting [1] and ASR [8, 9]. These studies primarily focus on incorporating raw translation output as a component in the LM. However, in our initial experiments we observed that directly ingesting translations generated from off-the-shelf MT models results in suboptimal performance for conversational data. This can be attributed, in part, to the domain mismatch between the MT training data, which comprises parallel text from web sources, and the informal-style interaction data being translated. This observation, that MT output is sensitive to mismatch between training and inference data distributions, is consistent with previous studies on MT adaptation [10]. Statistical post-editing for improving the quality of SMT outputs has been investigated [11]. In a recent work on bootstrapping natural language understanding systems using translations [12], SMT is employed for generating initial translations, followed by the use of source-target alignments to retain and resample named entities. These post-editing approaches can minimize the undesired conversion of named entities in SMT, yet the bigger issue of domain mismatch remains open.

In this work, we explore the synergies between neural machine translation and speech recognition for data augmentation. We work towards bootstrapping a Hindi ASR system. Along with the limited availability of representative transcribed data, an additional challenge in this setting is code-switching: in typical Hindi utterances, people often code-mix with English within a sentence. The techniques explored in this work are, however, generic, and Hindi is chosen as a testbed for its complexity.
We evaluate different architectures for building English to Hindi (EN→HI) translation models and elaborate on the pitfalls associated with using off-the-shelf translations. Some initial gains are observed by inferring alignments from attention weights, an approach that enables preserving and resampling the named entities. This technique is further extended to simulate code-switching in the translated data. We then delve into the deeper issue of domain inconsistency. To this end, we develop a data selection strategy for MT model training based on in-domain similarity, extending [13] to the fully unsupervised setting. We also assess a model finetuning approach based on adding parallel in-domain synthetic pairs. For further adaptation, a statistical LM built using transcribed data is used to rescore the decoded translation beams. Finally, different quality metrics are compared for retaining only the high-quality translations in the final translation component.

A comparative evaluation of the translation-augmented LM is performed against baselines built from only transcribed data at various stages of bootstrapping. To the best of our knowledge, this work is the first investigation of the efficacy and challenges associated with neural machine translation for conversational speech recognition.

2. MACHINE TRANSLATION FOR DATA AUGMENTATION

2.1. Building translation model

Neural machine translation (NMT) is the dominant paradigm in current MT research. We assess the popular neural architectures for the task of building an EN→HI translation model. The sequence-to-sequence framework with attention, proposed in [4], comprises two recurrent neural networks: an encoder, which reads the source sentence tokens $(x_1, x_2, \ldots, x_t)$ to generate continuous representations $(h_1, h_2, \ldots, h_t)$, and a decoder, which outputs symbols $(y_1, y_2, \ldots, y_n)$ conditioned on the previous outputs as well as a context vector $c$, derived as the weighted sum of the encoder hidden states $h$. Transformer networks, proposed in [7], eliminate recurrence in favor of parallelism and rely solely on attention. Here the encoder and decoder comprise stacked self-attention and fully connected layers. These networks represent the current state of the art for NMT.

For training the translation models, we use a corpus of 8.4M parallel (EN, HI) sentence pairs prepared by crawling different web sources. We employ the BLEU (Bilingual Evaluation Understudy) score [14] to assess translation quality. Fig. 1 captures the performance of these models for different configurations. The best performing recurrent encoder-decoder with attention model achieves a BLEU score of 43.8, as compared to 46.4 achieved by the transformer architecture.

Fig. 1. BLEU score over epochs. Transformer and recurrent architectures are represented by labels T and R respectively.

2.2. Choice of NMT architecture

Postprocessing the translations generated by an MT model can be facilitated by information about source-target alignments. In the case of SMT this is straightforward, owing to the fact that the alignment model is learnt explicitly. In NMT models, the separate components of conventional SMT are folded into an all-neural architecture. With this simplification, however, assessing which source token is responsible for generating a target token becomes tricky.

The attention mechanism in the recurrent encoder-decoder architecture serves as an implicit alignment model, allowing the decoder to focus on relevant source segments. Eq. (1) captures the decoder state $s_i$ as a function of the previous state $s_{i-1}$, the previous output $y_{i-1}$, and the context vector $c_i$. Here $T_x$ represents the number of tokens in the source sequence, and $f$ denotes some nonlinear function. Notice that the attention weight $\alpha_{ij}$ determines the weight assigned by the decoder at time step $i$ to encoder hidden state $h_j$.

$$s_i = f(s_{i-1}, y_{i-1}, c_i) \quad \text{where} \quad c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j \qquad (1)$$

Deriving alignments is known to be more challenging for transformer networks with self-attention and multiple attention heads. There has been some recent work on alleviating this issue by explicitly adding an alignment head to the base architecture [15]. Owing to the relative ease of alignment extraction, we make the modeling design decision of using recurrent encoder-decoder networks with attention for our NMT experiments.
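To make the alignment extraction concrete, the following is a minimal sketch (not the authors' code) of deriving hard alignments from a decoder attention matrix: each target position is mapped to the source position receiving the maximum attention weight. The `attention` array and its values are hypothetical.

```python
import numpy as np

def hard_alignments(attention: np.ndarray) -> list:
    """Map each target position i to the source position j with the
    highest attention weight alpha[i, j] (one row per decoding step)."""
    return attention.argmax(axis=1).tolist()

# Hypothetical example: 3 target tokens attending over 4 source tokens.
attention = np.array([
    [0.70, 0.10, 0.10, 0.10],   # target token 0 -> source token 0
    [0.05, 0.80, 0.10, 0.05],   # target token 1 -> source token 1
    [0.10, 0.10, 0.20, 0.60],   # target token 2 -> source token 3
])
print(hard_alignments(attention))  # [0, 1, 3]
```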
2.3. Incorporating raw translations: An initial study

As an initial experiment, we translated user interaction sentences from US English collections to Hindi and directly ingested the raw translations for LM training. However, this strategy resulted in a very high perplexity LM. Upon further analysis, we found three key explanatory factors. First, typical user interactions with voice controlled agents, e.g. song requests, contain several named entities. The general-purpose EN→HI MT system generates translations for those entities as well, which is not desirable. The second factor is the absence of code-switching in the translations, which are purely in Hindi, owing to the nature of the training data. Given the extent of intra-sentential code mixing in conversational Hindi, it seems imperative for the translations to capture it as well, in order to add value to the downstream language modeling task. Finally, the most challenging nuance is the out-of-domain nature of the MT training data (news items, wiki articles, etc.), which results in a lack of informal interaction style in the generated translations, an inherent attribute of user-device interactions. This domain mismatch issue has been observed in other machine translation settings as well.

2.4. Post-editing translations

We use the attention weights derived while decoding to approximate alignments. Each target token generated by the decoder is considered to be aligned with the source token corresponding to the encoder position with maximum attention weight. Metadata in the source text annotations, e.g. song name, artist name, etc., is used to identify named entities (NE). Using these alignments and annotations, the following post-editing steps are performed: 1) NE copy-over: source tokens corresponding to named entities are simply retained as-is in the output; 2) NE resampling: named entities are resampled from local Hindi catalogs, since the trending entities vary by geography; and 3) code mixing: code-switching is simulated in the translations by probabilistically copying over source English tokens. The probability of retaining a token is set to be directly proportional to its smoothed relative frequency in the transcribed Hindi collections. Fig. 2 provides examples of raw translations and the outputs of the post-editing step.

Fig. 2. Examples of raw translations and postprocessed outputs. NE is a shorthand for named entities. Aligned source and target tokens are indicated by similar highlighting. Underlined tokens indicate English tokens in the simulated code mixing.
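As an illustration, here is a minimal sketch of the three post-editing steps under stated assumptions: `alignments` maps target positions to source positions (as in the previous sketch), `ne_positions` maps NE source positions to entity types, and `hi_catalog` and `en_token_freq` are hypothetical stand-ins for the local entity catalogs and smoothed token frequencies described above. This is not the authors' implementation.

```python
import random

def post_edit(src_tokens, tgt_tokens, alignments, ne_positions,
              hi_catalog, en_token_freq, mix_scale=1.0):
    """Apply NE copy-over, NE resampling, and probabilistic code mixing."""
    edited = []
    for i, tgt_tok in enumerate(tgt_tokens):
        j = alignments[i]  # source position this target token aligns to
        if j in ne_positions:
            # NE copy-over, then resampling from a local Hindi catalog
            # keyed by entity type; fall back to the source token itself.
            candidates = hi_catalog.get(ne_positions[j], [src_tokens[j]])
            edited.append(random.choice(candidates))
        else:
            # Code mixing: copy over the source English token with a
            # probability proportional to its smoothed relative frequency
            # in the transcribed Hindi collections.
            p = min(1.0, mix_scale * en_token_freq.get(src_tokens[j], 0.0))
            edited.append(src_tokens[j] if random.random() < p else tgt_tok)
    return edited
```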
2.5. Domain adaptation

Domain mismatch between the training and inference data presents a more subtle challenge, with the translations not reflecting colloquial usage. Most of the prominent approaches for MT model adaptation, like backtranslation [16], shallow fusion [17], etc., assume the presence of a large target-side monolingual corpus for boosting the fluency of translations. Such an assumption cannot be made during model bootstrapping, where only limited target in-domain data (the transcribed collection) is available for adaptation. With this constraint in place, we experiment with four broad classes of techniques for adaptation: 1) data selection for MT training, 2) model finetuning, 3) rescoring, and 4) filtering translations.

2.5.1. Data selection for MT training

Selecting NMT training samples similar to the in-domain data from the out-of-domain parallel corpus has been explored in [13]. The central idea of that work is to use both in-domain and out-of-domain parallel corpora to train an NMT system. Similarity between the learnt encoder representations of these sentences is used to define their semantic closeness, and sentence selection is based on the relative similarity to the in-domain versus out-of-domain vector centers.

We adopt a similar data-centric approach for adaptation. In our setting, however, an in-domain parallel corpus is not available; hence, representing sentences via NMT encoder embeddings is not feasible. To facilitate data selection, we resort to learning unsupervised sentence embeddings for quantifying the closeness of the target-side MT training sentences to the available in-domain transcriptions.

We compare the following techniques for generating unsupervised sentence representations: (1) unweighted averaging of word vectors; (2) smooth inverse frequency (SIF) [18], where the sentence embeddings are computed as a weighted average of word vectors followed by removal of the projection on the first singular vector; and (3) language agnostic sentence representations (LASER) [19], an open source pre-trained biLSTM encoder for generating multilingual sentence embeddings that generalize across languages and NLP tasks. The first approach is appealing owing to its simplicity. In the second technique, taking word frequency into account is the distinguishing factor. Potential cross-lingual generalization is the advantage offered by the third approach. We use FastText [20] for learning word vector representations.

Using each of these approaches, we generate sentence embedding vectors for the in-domain ($F_{in}$) and out-of-domain ($F_{out}$) target-side sentences. Along similar lines as [13], data selection is based on the relative distance $\delta_s$ of a sentence vector $v_s$ w.r.t. the in-domain and out-of-domain centroids $C_{F_{in}}$ and $C_{F_{out}}$ respectively, as indicated by Eq. (2). A lower value of $\delta_s$ implies higher resemblance to the in-domain data.

$$\delta_s = d(v_s, C_{F_{in}}) - d(v_s, C_{F_{out}}) \qquad (2)$$

$$\text{where } C_{F_{in}} = \frac{\sum_{f \in F_{in}} v_f}{|F_{in}|}, \quad C_{F_{out}} = \frac{\sum_{f \in F_{out}} v_f}{|F_{out}|}, \quad \text{and } d(x, y) = \|x - y\|_2$$
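A minimal sketch of this selection procedure, assuming precomputed sentence vectors (e.g. averaged FastText word vectors, one of the three options above); the array names and the 25% retention fraction mirror Section 3.2, but the code itself is illustrative rather than the authors' implementation.

```python
import numpy as np

def select_in_domain(v_train: np.ndarray, v_in: np.ndarray,
                     v_out: np.ndarray, keep_frac: float = 0.25):
    """Rank training sentences by delta_s (Eq. 2) and keep the fraction
    whose vectors lie closest to the in-domain centroid."""
    c_in = v_in.mean(axis=0)    # C_{F_in}: in-domain centroid
    c_out = v_out.mean(axis=0)  # C_{F_out}: out-of-domain centroid
    # delta_s = ||v_s - c_in||_2 - ||v_s - c_out||_2 for each sentence
    delta = (np.linalg.norm(v_train - c_in, axis=1)
             - np.linalg.norm(v_train - c_out, axis=1))
    n_keep = int(keep_frac * len(v_train))
    return np.argsort(delta)[:n_keep]  # indices of retained sentences
```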
2.5.2. Model finetuning

Backtranslation [16] is a popular approach for adaptation, where a target-to-source translation model is learnt on the parallel corpus and used to translate the unpaired target-side monolingual data. The resulting synthetic (source, target) pairs are leveraged for the original source-to-target model training. In the absence of a large target monolingual corpus, we resort to an alternate approach for synthetic corpus generation. We generate pseudo pairs by translating a portion of the US English utterances using an initially trained NMT model, and perform post-editing to retain named entities. With this additional parallel data, the model is further trained for a certain number of epochs. As we discuss in Section 3.2, this type of finetuning is susceptible to overfitting.
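The pseudo-pair generation can be sketched as follows; `translate` and `post_edit` are caller-supplied callables standing in for the NMT decoder and the NE-preserving post-editor of Section 2.4, and the sampling fraction is a hypothetical parameter, not a value from the paper.

```python
import random

def build_pseudo_pairs(en_utterances, translate, post_edit, sample_frac=0.1):
    """Translate a sampled portion of English utterances with the initially
    trained NMT model and pair each source with its post-edited output.
    The resulting synthetic pairs are used for a few further training
    epochs; early stopping limits the model reinforcing its own errors."""
    sample = random.sample(en_utterances,
                           int(sample_frac * len(en_utterances)))
    return [(src, post_edit(src, translate(src))) for src in sample]
```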
2.5.3. Rescoring with in-domain LM

To further boost the fluency of the translations, the hypotheses obtained after beam search decoding are rescored using an n-gram LM built from the in-domain transcribed data. The score of a translation hypothesis is computed as a weighted sum of the MT decoding score and the LM score. The choice of an n-gram LM for rescoring is motivated by its robustness under low-resource conditions compared to an RNNLM.
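A minimal sketch of the rescoring step; the 0.3 LM weight matches the setting reported in Section 3.2, while the score representations (log-domain scores, hypothesis tuples) are assumptions rather than the authors' interfaces.

```python
def rescore_beam(hypotheses, lm_score, lm_weight=0.3):
    """Re-rank beam-search outputs by a weighted sum of the MT decoding
    score and an in-domain n-gram LM score (both assumed log-domain).

    `hypotheses` is a list of (text, mt_score) pairs; `lm_score` maps a
    hypothesis text to its LM log-probability.
    """
    def combined(hyp):
        text, mt_score = hyp
        return (1.0 - lm_weight) * mt_score + lm_weight * lm_score(text)
    return max(hypotheses, key=combined)
```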
2.5.4. Filtering translations

As a final step, we attempt to remove spurious translations by filtering based on a quality measure. The main challenge here is to define the "goodness" of a translated output. One candidate is the score assigned by the MT model: the product of conditional probabilities of the output tokens generated by the MT decoder can be used as a proxy for well-formedness. We also consider using the statistical LM built from the transcribed data to assess the quality of the translation data. Using each of these scores, we retain the top-x percentile of the translation output.

An overview of the adaptation and postprocessing pipeline is provided in Fig. 3, with section numbers indicated against each component.

Fig. 3. Overview of the model training, adaptation and postprocessing pipeline. (.) indicates the corresponding section describing the approach.
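Percentile-based filtering reduces to a short cut over whichever score is chosen; a sketch, assuming log-domain scores and using the top-75% cut that worked best with the in-domain LM score in Section 3.2.

```python
import numpy as np

def filter_top_percentile(translations, scores, keep_pct=75):
    """Keep translations whose score falls in the top `keep_pct` percent,
    where `scores` holds either the MT model score or the in-domain
    statistical LM score for each translation."""
    cutoff = np.percentile(scores, 100 - keep_pct)
    return [t for t, s in zip(translations, scores) if s >= cutoff]
```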

3. RESULTS AND DISCUSSION

3.1. Experimental setup

We conduct the experimental evaluation using up to 180 hours of Hindi-English code-switched speech for training. This dataset comprises 200K Hindi utterances collected using Cleo, an interactive skill that enables users to teach local languages to voice assistants via prompts. These prompts cover use cases like song requests or knowledge related questions. These natural utterances represent the transcribed in-domain component in our experiments. We follow the factored ASR architecture: the AM is a hybrid DNN-HMM model, trained on log filter bank energy (LFBE) features extracted at 10 ms intervals with a 25 ms analysis window, and the LM is a 4-gram model with Katz smoothing learnt on the training data.

The baseline LM is built using the in-domain transcriptions only. The translation component is procured by translating 9.8M US English utterance transcripts using the trained NMT models, followed by adaptation and post-editing. For the evaluation candidates, the LM is built by linear interpolation of the transcribed and translated components. The interpolation weights are tuned to minimize the perplexity of a held-out in-domain dataset. We assign a floor interpolation weight of 0.25 to the translation component to ensure that it receives sufficient representation. For all the MT experiments except rescoring, we employ greedy decoding, i.e. a beam size of one. The test set comprises 37K utterances. We report relative word error rate reduction (WERR) w.r.t. models built using the baseline LM.
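To illustrate the interpolation with a floor weight, below is a minimal sketch of the per-n-gram probability combination for the two-component LM; the function signature is an assumption, and in practice the weight would come from perplexity-based tuning with a standard LM toolkit rather than this toy form.

```python
def interpolated_prob(p_transcribed, p_translated, lam, floor=0.25):
    """P(w|h) as a weighted sum of the two component probabilities,
    with the translation-component weight floored at `floor`."""
    lam = max(lam, floor)  # enforce the floor interpolation weight
    return lam * p_translated + (1.0 - lam) * p_transcribed
```

The floor matters because the tuning data comprises transcribed utterances only (see Section 3.4), which would otherwise push the translation-component weight toward zero.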
3.2. Results

Table 1 captures the performance of the post-editing techniques. Ingesting raw translations without any postprocessing yields a high perplexity translation component, resulting in negative WERR. We observe consistent improvements by introducing attention weight based post-editing. NE copy-over alone reduces the perplexity significantly. This, coupled with NE resampling and code mixing, results in a 5.83% WERR.

Postprocessing | Approach | PPL | Relative WERR %
None | Raw translations | 11941.08 | -1.81
Post-editing | NE copy-over | 2889.45 | 2.36
Post-editing | NE resampling | 1241.52 | 4.62
Post-editing | Code mixing + NE resampling | 936.64 | 5.83

Table 1. Relative WERRs (%) with different post-editing techniques. Perplexity (PPL) is evaluated on a held-out in-domain dataset. Relative WERR captures the WER reduction w.r.t. the baseline trained on transcribed data only.

In Table 2, we assess the impact of each adaptation approach followed by post-editing. For MT training data selection, we retain only the top-25% (out of 8.4M) sentences w.r.t. their relative similarity with the in-domain data. This reduction in training data impacts the BLEU score adversely. Amongst the sentence representation techniques, LASER and SIF embeddings outperform the unweighted averaging approach in terms of BLEU score. Interestingly, while unweighted averaging achieves the lowest perplexity on the held-out in-domain dataset, the gains do not carry over when measuring overall ASR performance. SIF embedding based selection achieves the highest WERR of 7.23%, followed closely by the LASER encoder representation.

Rescoring the decoded beams using the transcription based LM yields a WERR of 6.28% for a beam size of 5. In these experiments, a relative weight of 0.3 is assigned to the LM for overall score computation. Increasing the beam size from 5 to 20 leads to a drop in WERR, suggesting that during decoding, the head portion of the translation output contains hypotheses helpful for improving naturalness, while increasing the beam width can result in higher confusability.

For the model finetuning approach, the number of additional training epochs is an important parameter. We observe a WERR of 6.84% when this parameter is set to 3, as compared to 5.23% for 10 epochs. Increasing the number of passes on the synthetic data generated using an initially trained model perpetuates the effect of the model reinforcing its own errors. This potential overfitting makes early stopping imperative.

In the experiments focusing on translation output filtering, the MT score did not turn out to be an effective metric for quality evaluation, as indicated by both perplexity and WERR. We obtain interesting insights by ranking translations using in-domain LM scores: a WERR of 6.82% is observed by retaining only the top-75% translations, while making the filtering more conservative beyond this point degrades performance. One caveat of the LM guided filtering approach is that patterns which are underrepresented in the initial collections receive low LM scores. This could explain the drop in WERR when moving to the top-65% translations: since the transcribed volume used for LM training is itself small, some of the discarded patterns could have been complementary for the overall ASR performance.

Combining the SIF selection, finetuning, rescoring and LM based filtering approaches results in a relative WERR of 7.86%.

Adaptation | Approach | PPL | Relative WERR %
Data selection (BLEU original model: 43.8) | Unweighted avg. (BLEU: 29.1) | 662.33 | 6.94
Data selection | SIF (BLEU: 37.8) | 686.97 | 7.23
Data selection | LASER (BLEU: 37.4) | 704.12 | 7.14
Rescoring | beam-size=5 | 792.92 | 6.28
Rescoring | beam-size=20 | 852.16 | 5.88
Model finetuning | n-epochs=3 | 726.62 | 6.84
Model finetuning | n-epochs=10 | 983.64 | 5.23
Filtering translations | MT score - top 85% | 1109.44 | 4.82
Filtering translations | MT score - top 75% | 1327.56 | 3.37
Filtering translations | MT score - top 65% | 1426.18 | 2.16
Filtering translations | SLM score - top 85% | 793.73 | 6.33
Filtering translations | SLM score - top 75% | 892.92 | 6.82
Filtering translations | SLM score - top 65% | 878.16 | 5.94
Combined | (i) SIF selection + Rescoring + SLM score - top 75% | 584.24 | 7.62
Combined | (i) + Model finetuning | 564.06 | 7.86

Table 2. Relative WERRs (%) with different NMT adaptation strategies. Note that these results include the effect of the NE resampling and code mixing techniques.

3.3. Impact on different interaction scenarios

Cleo prompts cover multiple interaction use cases. To derive fine-grained insights into the effect of translations, we study WERR on test utterances manually categorized into scenarios. Nearly 70% of the test utterances fall into one of the nine interaction scenarios listed in Table 3. To isolate the gains obtained from post-editing and adaptation, we study both the post-editing and the combined WERR. We also analyse the proportion of named entities in the utterances belonging to each of these scenarios. The following observations can be made from this analysis.

Coverage (in transcribed collections) | Interaction scenario | Post-editing WERR % | Combined WERR % | Adaptation contribution % | NE %
Low | Books | 5.75 | 7.26 | 20.80 | 34.74
Low | Communication | 3.82 | 5.98 | 36.12 | 11.19
Low | Weather | 3.23 | 6.85 | 52.84 | 7.63
Low | Shopping | 7.86 | 10.84 | 27.49 | 52.94
Moderate | Knowledge | 6.36 | 9.54 | 33.34 | 31.60
Moderate | Video | 6.44 | 8.52 | 24.41 | 39.36
Moderate | Home Automation | 5.68 | 7.94 | 22.63 | 5.81
High | Notifications | 4.65 | 7.06 | 34.14 | 5.66
High | Music | 5.74 | 7.48 | 23.26 | 47.62

Table 3. Relative WERR % by interaction scenario, along with the extent of coverage in the transcribed collections. The named entity proportion in the utterances is given by NE %. Adaptation contribution % captures the relative contribution of adaptation towards the combined WERR; e.g., with a 5.75% post-editing WERR, adaptation yields an additional 1.51% WERR towards a combined WERR of 7.26%, i.e. 20.80%.

• Higher WERR is observed for scenarios such as shopping and knowledge related queries, which are not well represented in the transcribed collections as compared to their popular counterparts like song requests and notifications, suggesting that translations can effectively complement the transcribed data.

• WERR shows a positive correlation with the percentage of named entities (Pearson correlation: 0.647, p-value: 0.059). This observation is consistent with the results in the previous section, where postprocessing alone demonstrates a significant WERR.

• The relative contribution of adaptation towards WERR is higher for scenarios with a smaller named entity footprint, e.g. weather. Hence, the gains from postprocessing and adaptation seem to be complementary across conversation scenarios.

3.4. Impact of floor weight for interpolation

The final augmented LM is an interpolated n-gram model, with the probability of an n-gram computed as a weighted sum of the probabilities assigned by the transcribed and translation components. Since the tuning data for determining the interpolation weights comprises transcribed utterances only, the translation component may receive a low weight owing to domain mismatch.

The purpose of this investigation is to observe the effect of changing the floor weight parameter for the translation component, which provides a lever to override its relative importance in the interpolated LM. As seen from Table 4, the overall PPL increases as we increase the floor weight. WERR, however, fluctuates with varying floor weights: a low weight renders the translation component ineffective, whereas a high value undermines the transcription component. A floor weight sweep can provide empirical guidance for adjusting this parameter.

Floor weight | Interpolated PPL | WERR %
0.1 | 50.28 | 5.78
0.15 | 51.24 | 7.04
0.25 | 52.36 | 7.86
0.3 | 53.37 | 7.49
0.4 | 56.34 | 6.58

Table 4. PPL and relative WERRs (%) with varying floor interpolation weights for the translation component in the 180 hour setup.
and Yoshua Bengio, “On the properties of neural machine
3.5. Impact of in-domain data volume

We now address the following question: what are the relative gains provided by the translation data during different phases of bootstrapping? In particular, we measure the WERR between the baseline and translation-augmented LMs while varying the in-domain transcribed utterances from 10K to 200K. We observe that the combined WERR after post-editing and adaptation increases from 7.86% to 15.65% as the amount of in-domain data reduces. Note that in this experiment we use the same AM, trained on 180 hours of data, in order to precisely study the effect of data augmentation on the LM. The WERR we report is hence an underestimate, and would likely be much higher if the AM were trained on similarly reduced levels of transcribed data. These findings, summarized in Table 5, suggest that the neural MT supplements can especially aid the initial stages of model development.

Transcribed volume | WERR %
10K | 15.65
20K | 13.18
50K | 9.42
100K | 8.98
200K | 7.86

Table 5. Relative WERRs (%) with varying levels of in-domain transcribed data.
lia Polosukhin, “Attention is all you need,” CoRR, vol.
4. CONCLUSION

In this work, we explored the key challenges associated with using NMT for LM data augmentation in a conversational, code-switched setting. Using a combination of post-editing and domain adaptation techniques, we demonstrated a relative WERR of 7.8% with 180 hours of transcribed data. We examined the performance trajectory along different bootstrapping phases, and observed a relative WERR of up to 15.6% with reduced transcription volumes. A further drilldown of WERR by interaction scenario provided insights into the gains derived from translation as a function of the proportion of named entities and the relative representation in the transcribed data. This experimental evidence establishes the efficacy of using translations to supplement transcribed collections in the early stages of model development, a strategy which could be instrumental for rapid language expansion. Though Hindi is used as the experimental testbed in this work, the techniques presented are generic and can be leveraged for bootstrapping other languages as well. Exploring semi-supervised and unsupervised translation is a promising future direction, especially for low resource languages.
5. REFERENCES

[1] Arseniy Gorin, Rasa Lileikytė, Guangpu Huang, Lori Lamel, Jean-Luc Gauvain, and Antoine Laurent, "Language model data augmentation for keyword spotting in low-resourced training conditions," in Interspeech 2016, 2016, pp. 775–779.

[2] Saurabh Garg, Tanmay Parekh, and Preethi Jyothi, "Code-switched language models using dual RNNs and same-source pretraining," in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, Oct.-Nov. 2018, pp. 3078–3083, Association for Computational Linguistics.

[3] Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio, "On the properties of neural machine translation: Encoder–decoder approaches," in Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, Doha, Qatar, Oct. 2014, pp. 103–111, Association for Computational Linguistics.

[4] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, "Neural machine translation by jointly learning to align and translate," CoRR, vol. abs/1409.0473, 2015.

[5] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean, "Google's neural machine translation system: Bridging the gap between human and machine translation," CoRR, vol. abs/1609.08144, 2016.

[6] Philipp Koehn, Franz Josef Och, and Daniel Marcu, "Statistical phrase-based translation," in Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, Stroudsburg, PA, USA, 2003, NAACL '03, pp. 48–54, Association for Computational Linguistics.

[7] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin, "Attention is all you need," CoRR, vol. abs/1706.03762, 2017.

[8] ArnarThor Jensson, Koji Iwano, and Sadaoki Furui, "Language model adaptation using machine-translated text for resource-deficient languages," EURASIP Journal on Audio, Speech, and Music Processing, vol. 2008, no. 1, pp. 573832, Jan. 2009.

[9] Thang Vu, Dau-Cheng Lyu, Jochen Weiner, Dominic Telaar, Tim Schlippe, Fabian Blaicher, Eng Chng, Tanja Schultz, and Haizhou Li, "A first speech recognition system for Mandarin-English code-switch conversational speech," in ICASSP, Mar. 2012.

[10] Chenhui Chu and Rui Wang, "A survey of domain adaptation for neural machine translation," CoRR, vol. abs/1806.00258, 2018.

[11] Horia Cucu, Andi Buzo, Laurent Besacier, and Corneliu Burileanu, "SMT-based ASR domain adaptation methods for under-resourced languages: Application to Romanian," Speech Commun., vol. 56, pp. 195–212, Jan. 2014.

[12] Judith Gaspers, Penny Karanasou, and Rajen Chatterjee, "Selecting machine-translated data for quick bootstrapping of a natural language understanding system," CoRR, vol. abs/1805.09119, 2018.

[13] Rui Wang, Andrew Finch, Masao Utiyama, and Eiichiro Sumita, "Sentence embedding for neural machine translation domain adaptation," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Vancouver, Canada, July 2017, pp. 560–566, Association for Computational Linguistics.

[14] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu, "Bleu: A method for automatic evaluation of machine translation," in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, Stroudsburg, PA, USA, 2002, ACL '02, pp. 311–318, Association for Computational Linguistics.

[15] Tamer Alkhouli, Gabriel Bretschner, and Hermann Ney, "On the alignment problem in multi-head attention-based neural machine translation," in WMT, 2018.

[16] Rico Sennrich, Barry Haddow, and Alexandra Birch, "Improving neural machine translation models with monolingual data," in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, Aug. 2016, pp. 86–96, Association for Computational Linguistics.

[17] Çaglar Gülçehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loïc Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio, "On using monolingual corpora in neural machine translation," CoRR, vol. abs/1503.03535, 2015.

[18] Jonas Mueller and Aditya Thyagarajan, "Siamese recurrent architectures for learning sentence similarity," in Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 2016, AAAI'16, pp. 2786–2792, AAAI Press.

[19] Mikel Artetxe and Holger Schwenk, "Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond," CoRR, vol. abs/1812.10464, 2018.

[20] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov, "Enriching word vectors with subword information," Transactions of the Association for Computational Linguistics, vol. 5, pp. 135–146, 2017.