Meta-Learning for Low-Resource Neural Machine Translation

Jiatao Gu*†, Yong Wang*†, Yun Chen†, Kyunghyun Cho‡ and Victor O.K. Li†
†The University of Hong Kong
‡New York University, CIFAR Azrieli Global Scholar
†{jiataogu, wangyong, vli}@eee.hku.hk, †[email protected]
[email protected]
approach across all the target language pairs, and the gap grows as the number of training examples decreases.

2 Background

Neural Machine Translation (NMT)  Given a source sentence X = {x_1, ..., x_{T'}}, a neural machine translation model factors the distribution over possible output sentences Y = {y_1, ..., y_T} into a chain of conditional probabilities with a left-to-right causal structure:

    p(Y \mid X; \theta) = \prod_{t=1}^{T+1} p(y_t \mid y_{0:t-1}, x_{1:T'}; \theta),    (1)

where special tokens y_0 (⟨bos⟩) and y_{T+1} (⟨eos⟩) are used to represent the beginning and the end of a target sentence. These conditional probabilities are parameterized using a neural network. Typically, an encoder-decoder architecture (Sutskever et al., 2014; Cho et al., 2014; Bahdanau et al., 2015) with an RNN-based decoder is used. More recently, architectures without any recurrent structures (Gehring et al., 2017; Vaswani et al., 2017) have been proposed and shown to speed up training while achieving state-of-the-art performance.
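To make the factorization in Eq. (1) concrete, the sketch below scores a target sentence by summing the per-token conditional log-probabilities under teacher forcing. The `model` callable, the id tensors and the special-token ids are illustrative placeholders, not part of the paper's implementation.

```python
# A minimal sketch of Eq. (1): log p(Y|X) as a sum of left-to-right conditionals.
# `model(src_ids, prefix_ids)` is a hypothetical callable returning next-token
# logits over the target vocabulary; any encoder-decoder NMT model fits here.
import torch
import torch.nn.functional as F

def sequence_log_prob(model, src_ids, tgt_ids, bos_id, eos_id):
    # y_0 = <bos> and y_{T+1} = <eos> delimit the target sentence.
    y = torch.cat([torch.tensor([bos_id]), tgt_ids, torch.tensor([eos_id])])
    total = 0.0
    for t in range(1, y.size(0)):
        logits = model(src_ids, y[:t])                     # p(y_t | y_{0:t-1}, x; theta)
        total = total + F.log_softmax(logits, dim=-1)[y[t]]
    return total
```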
Low Resource Translation  NMT is known to easily over-fit and result in an inferior performance when the training data is limited (Koehn and Knowles, 2017). In general, there are two ways for handling the problem of low resource translation: (1) utilizing the resource of unlabeled monolingual data, and (2) sharing the knowledge between low- and high-resource language pairs. Many research efforts have been spent on incorporating the monolingual corpora into machine translation, such as multi-task learning (Gulcehre et al., 2015; Zhang and Zong, 2016), back-translation (Sennrich et al., 2015), dual learning (He et al., 2016) and unsupervised machine translation with monolingual corpora only for both sides (Artetxe et al., 2017b; Lample et al., 2017; Yang et al., 2018).

For the second approach, prior research has worked on methods to exploit the knowledge of auxiliary translations, or even auxiliary tasks. For instance, Cheng et al. (2016); Chen et al. (2017); Lee et al. (2017); Chen et al. (2018) investigate the use of a pivot to build a translation path between two languages even without any direct resource. The pivot can be a third language or even an image in multimodal domains. When pivots are not easy to obtain, Firat et al. (2016a); Lee et al. (2016); Johnson et al. (2016) have shown that the structure of NMT is suitable for multilingual machine translation. Gu et al. (2018b) also showed that such a multilingual NMT system could improve the performance of low resource translation by using a universal lexical representation to share embedding information across languages.

All the previous work for multilingual NMT assumes that the joint training of multiple high-resource languages naturally results in a universal space (for both the input representation and the model), which, however, is not necessarily true, especially for very low resource cases.

Meta Learning  In the machine learning community, meta-learning, or learning-to-learn, has recently received interest. Meta-learning tries to solve the problem of "fast adaptation on new training data." One of the most successful applications of meta-learning has been on few-shot (or one-shot) learning (Lake et al., 2015), where a neural network is trained to readily learn to classify inputs based on only one or a few training examples. There are two categories of meta-learning:

1. learning a meta-policy for updating model parameters (see, e.g., Andrychowicz et al., 2016; Ha et al., 2016a; Mishra et al., 2017);

2. learning a good parameter initialization for fast adaptation (see, e.g., Finn et al., 2017; Vinyals et al., 2016; Snell et al., 2017).

In this paper, we propose to use a meta-learning algorithm for low-resource neural machine translation based on the second category. More specifically, we extend the idea of model-agnostic meta-learning (MAML, Finn et al., 2017) to the multilingual scenario.

3 Meta Learning for Low-Resource Neural Machine Translation

The underlying idea of MAML is to use a set of source tasks T^1, ..., T^K to find the initialization of parameters θ^0 from which learning a target task T^0 would require only a small number of training examples. In the context of machine translation, this amounts to using many high-resource language pairs to find good initial parameters and training a new translation model on a low-resource language starting from the found initial parameters.
Figure 1: The graphical illustration of the training process of the proposed MetaNMT. For each episode, one task (language pair) is sampled for meta-learning. The boxes and arrows in blue are mainly involved in language-specific learning (§3.1), and those in purple in meta-learning (§3.2).
This process can be understood as

    \theta^{*} = \text{Learn}(\mathcal{T}^0; \text{MetaLearn}(\mathcal{T}^1, \ldots, \mathcal{T}^K)).

That is, we meta-learn the initialization from auxiliary tasks and continue to learn the target task. We refer to the proposed meta-learning method for NMT as MetaNMT. See Fig. 1 for the overall illustration.

3.1 Learn: language-specific learning

Given any initial parameters θ^0 (which can be either random or meta-learned), the prior distribution of the parameters of a desired NMT model can be defined as an isotropic Gaussian:

    \theta_i \sim \mathcal{N}(\theta_i^0, 1/\beta),

where 1/β is a variance. With this prior distribution, we formulate the language-specific learning process Learn(D_T; θ^0) as maximizing the log-posterior of the model parameters given data D_T:

    \text{Learn}(D_T; \theta^0) = \arg\max_{\theta} \mathcal{L}^{D_T}(\theta)
                                = \arg\max_{\theta} \sum_{(X,Y) \in D_T} \log p(Y \mid X, \theta) - \beta \|\theta - \theta^0\|^2,

where we assume p(X|θ) to be uniform. The first term above corresponds to the maximum likelihood criterion often used for training a usual NMT system. The second term discourages the newly learned model from deviating too much from the initial parameters, alleviating the issue of overfitting when there is not enough training data. In practice, we solve the problem above by maximizing the first term with gradient-based optimization and early-stopping after only a few update steps. Thus, in the low-resource scenario, finding a good initialization θ^0 strongly correlates with the final performance of the resulting model.
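As a rough illustration of Learn(D_T; θ^0), the sketch below adapts a copy of the model from the given initialization for only a few gradient steps, which plays the role of the early stopping described above (the explicit proximity term β‖θ − θ^0‖² is dropped, as in the practical procedure). The helpers `batches` and `neg_log_likelihood` are assumptions for illustration, not the authors' code.

```python
# A sketch of language-specific learning from an initialization theta^0.
# `model.neg_log_likelihood(src, tgt)` is a hypothetical method returning
# -log p(Y|X; theta) for one mini-batch drawn from the task data D_T.
import copy
import torch

def language_specific_learn(model, batches, lr=1e-4, max_steps=100):
    adapted = copy.deepcopy(model)                         # start from theta^0
    optimizer = torch.optim.Adam(adapted.parameters(), lr=lr)
    for step, (src, tgt) in enumerate(batches):
        if step >= max_steps:                              # early-stop after a few updates
            break
        loss = adapted.neg_log_likelihood(src, tgt)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return adapted
```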
3.2 MetaLearn

We find the initialization θ^0 by repeatedly simulating low-resource translation scenarios using auxiliary, high-resource language pairs. Following Finn et al. (2017), we achieve this goal by defining the meta-objective function as

    \mathcal{L}(\theta) = \mathbb{E}_k \, \mathbb{E}_{D_{\mathcal{T}^k}, D'_{\mathcal{T}^k}} \Bigg[ \sum_{(X,Y) \in D'_{\mathcal{T}^k}} \log p\big(Y \mid X; \text{Learn}(D_{\mathcal{T}^k}; \theta)\big) \Bigg],    (2)

where k ∼ U({1, ..., K}) refers to one meta-learning episode, and D_{T^k}, D'_{T^k} follow the uniform distribution over T^k's data.

We maximize the meta-objective function using stochastic approximation (Robbins and Monro, 1951) with gradient descent. For each episode, we uniformly sample one source task at random, T^k. We then sample two subsets of training examples independently from the chosen task, D_{T^k} and D'_{T^k}. We use the former to simulate language-specific learning and the latter to evaluate its outcome. Assuming a single gradient step is taken with learning rate η, the simulation is:

    \theta'_k = \text{Learn}(D_{\mathcal{T}^k}; \theta) = \theta - \eta \nabla_\theta \mathcal{L}^{D_{\mathcal{T}^k}}(\theta).

Once the simulation of learning is done, we evaluate the updated parameters θ'_k on D'_{T^k}. The gradient computed from this evaluation, which we refer to as the meta-gradient, is used to update the meta model θ.
Figure 2: An intuitive illustration in which we use solid lines to represent the learning of initialization, and dashed lines to show the path of fine-tuning. (a) Transfer Learning; (b) Multilingual Transfer Learning; (c) Meta Learning.
It is possible to aggregate multiple episodes of source tasks before updating θ:

    \theta \leftarrow \theta - \eta' \nabla_\theta \sum_k \mathcal{L}^{D'_{\mathcal{T}^k}}(\theta'_k),

where η' is the meta learning rate.

Unlike a usual learning scenario, the resulting model θ^0 from this meta-learning procedure is not necessarily a good model on its own. It is however a good starting point for training a good model using only a few steps of learning. In the context of machine translation, this procedure can be understood as finding the initialization of a neural machine translation system that could quickly adapt to a new language pair by simulating such a fast adaptation scenario using many high-resource language pairs.

Meta-Gradient  We use the following approximation property

    H(x) v \approx \frac{\nabla(x + \nu v) - \nabla(x)}{\nu}

to approximate the meta-gradient:¹

    \nabla_\theta \mathcal{L}^{D'}(\theta') = \nabla_{\theta'} \mathcal{L}^{D'}(\theta') \, \nabla_\theta \big(\theta - \eta \nabla_\theta \mathcal{L}^{D}(\theta)\big)
                                            = \nabla_{\theta'} \mathcal{L}^{D'}(\theta') - \eta \, \nabla_{\theta'} \mathcal{L}^{D'}(\theta') \, \mathrm{H}_\theta\big(\mathcal{L}^{D}(\theta)\big)
                                            \approx \nabla_{\theta'} \mathcal{L}^{D'}(\theta') - \frac{\eta}{\nu} \Big[ \nabla_\theta \mathcal{L}^{D}(\theta)\big|_{\hat{\theta}} - \nabla_\theta \mathcal{L}^{D}(\theta)\big|_{\theta} \Big],

where ν is a small constant and

    \hat{\theta} = \theta + \nu \nabla_{\theta'} \mathcal{L}^{D'}(\theta').

In practice, we find that it is also possible to ignore the second-order term, ending up with the following simplified update rule:

    \nabla_\theta \mathcal{L}^{D'}(\theta') \approx \nabla_{\theta'} \mathcal{L}^{D'}(\theta').    (3)

¹ We omit the subscript k for simplicity.
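Putting §3.2 together, the sketch below runs a single meta-learning episode using the simplified first-order rule of Eq. (3): the gradient of the D'-loss with respect to the adapted parameters θ'_k is applied directly to the meta-parameters θ. Computing the full meta-gradient would additionally require the Hessian-vector term approximated above. The helpers `sample_two_subsets` and `neg_log_likelihood` are hypothetical, not the paper's actual code.

```python
# One meta-learning episode with the first-order approximation of Eq. (3).
import copy
import random
import torch

def meta_episode(meta_model, tasks, inner_lr=1e-4, meta_lr=1e-4):
    task = random.choice(tasks)                            # k ~ U({1, ..., K})
    D, D_prime = task.sample_two_subsets()                 # D_{T^k} and D'_{T^k}

    # Simulated language-specific learning: theta'_k = theta - eta * grad L^{D}(theta).
    adapted = copy.deepcopy(meta_model)
    inner_opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
    inner_opt.zero_grad()
    adapted.neg_log_likelihood(*D).backward()
    inner_opt.step()

    # Evaluate theta'_k on D' and use that gradient directly as the meta-gradient.
    adapted.zero_grad()
    loss_meta = adapted.neg_log_likelihood(*D_prime)
    loss_meta.backward()
    with torch.no_grad():
        for p_meta, p_adapt in zip(meta_model.parameters(), adapted.parameters()):
            if p_adapt.grad is not None:
                p_meta -= meta_lr * p_adapt.grad           # theta <- theta - eta' * meta-gradient
    return loss_meta.item()
```

Several such episodes can be accumulated before applying the meta-update, matching the aggregated rule above.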
Related Work: Multilingual Transfer Learning  The proposed MetaNMT differs from the existing framework of multilingual translation (Lee et al., 2016; Johnson et al., 2016; Gu et al., 2018b) or transfer learning (Zoph et al., 2016). The latter can be thought of as solving the following problem:

    \max_\theta \; \mathcal{L}^{\text{multi}}(\theta) = \mathbb{E}_k \Bigg[ \sum_{(X,Y) \in D_k} \log p(Y \mid X; \theta) \Bigg],

where D_k is the training set of the k-th task, or language pair. The target low-resource language pair could either be a part of joint training or be trained separately starting from the solution θ^0 found from solving the above problem.

The major difference between the proposed MetaNMT and these multilingual transfer approaches is that the latter do not consider how learning happens with the target, low-resource language pair. The former explicitly incorporates the learning process within the framework by simulating it repeatedly in Eq. (2). As we will see later in the experiments, this results in a substantial gap in the final performance on the low-resource task.

Illustration  In Fig. 2, we contrast transfer learning, multilingual learning and meta-learning using three source language pairs (Fr-En, Es-En and Pt-En) and two target pairs (Ro-En and Lv-En). Transfer learning trains an NMT system specifically for a source language pair (Es-En) and fine-tunes the system for each target language pair (Ro-En, Lv-En). Multilingual learning often trains a single NMT system that can handle many different language pairs (Fr-En, Pt-En, Es-En), which may or may not include the target pairs (Ro-En, Lv-En). If not, it fine-tunes the system for each target pair, similarly to transfer learning. Both of these however aim at directly solving the source tasks. On the other hand, meta-learning trains the NMT system to be useful for fine-tuning on various tasks including the source and target tasks. This is done by repeatedly simulating the learning process on low-resource languages using many high-resource language pairs (Fr-En, Pt-En, Es-En).
3.3 Unified Lexical Representation

I/O mismatch across language pairs  One major challenge that limits applying meta-learning for low resource machine translation is that the approach outlined above assumes the input and output spaces are shared across all the source and target tasks. This, however, does not apply to machine translation in general due to the vocabulary mismatch across different languages. In multilingual translation, this issue has been tackled by using a vocabulary of sub-words (Sennrich et al., 2015) or characters (Lee et al., 2016) shared across multiple languages. This surface-level sharing is however limited, as it cannot be applied to languages exhibiting distinct orthography (e.g., Indo-European languages vs. Korean).

Universal Lexical Representation (ULR)  We tackle this issue by dynamically building a vocabulary specific to each language using a key-value memory network (Miller et al., 2016; Gulcehre et al., 2018), as was done successfully for low-resource machine translation recently by Gu et al. (2018b). We start with multilingual word embedding matrices ε^k_query ∈ R^{|V_k|×d} pretrained on large monolingual corpora, where V_k is the vocabulary of the k-th language. These embedding vectors can be obtained with small dictionaries of seed word pairs (Artetxe et al., 2017a; Smith et al., 2017) or in a fully unsupervised manner (Zhang et al., 2017; Conneau et al., 2018). We take one of these languages k' to build a universal lexical representation consisting of a universal embedding matrix ε_u ∈ R^{M×d} and a corresponding key matrix ε_key ∈ R^{M×d}, where M < |V_{k'}|. Both ε^k_query and ε_key are fixed during meta-learning. We then compute the language-specific embedding of token x from the language k as the convex sum of the universal embedding vectors by

    \epsilon^0[x] = \sum_{i=1}^{M} \alpha_i \, \epsilon_u[i],

where α_i ∝ exp( (1/τ) ε_key[i]^⊤ A ε^k_query[x] ) and τ is set to 0.05. This approach allows us to handle languages with different vocabularies using a fixed number of shared parameters (ε_u, ε_key and A).

Learning of ULR  It is not desirable to update the universal embedding matrix ε_u when fine-tuning on a small corpus which contains a limited set of unique tokens in the target language, as it could adversely influence the other tokens' embedding vectors. We thus estimate the change to each embedding vector induced by language-specific learning by a separate parameter Δε^k[x]:

    \epsilon^k[x] = \epsilon^0[x] + \Delta\epsilon^k[x].

During language-specific learning, the ULR ε^0[x] is held constant, while only Δε^k[x] is updated, starting from an all-zero vector. On the other hand, we hold the Δε^k[x]'s constant while updating ε_u and A during the meta-learning stage.
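A minimal sketch of this lookup and of the language-specific correction is given below. The tensor names and shapes (eps_query: |V_k| × d, A: d × d, eps_key and eps_u: M × d, delta_k: |V_k| × d) follow the description above but are otherwise illustrative, not the authors' implementation.

```python
# Universal lexical representation: eps^k[x] = eps^0[x] + Delta eps^k[x],
# with eps^0[x] a convex combination of the universal embedding rows.
import torch
import torch.nn.functional as F

def ulr_embedding(x_ids, eps_query, A, eps_key, eps_u, delta_k, tau=0.05):
    q = eps_query[x_ids] @ A                      # A eps^k_query[x], shape (batch, d)
    scores = (q @ eps_key.t()) / tau              # (1/tau) eps_key[i]^T A eps^k_query[x]
    alpha = F.softmax(scores, dim=-1)             # convex weights over the M universal slots
    eps0 = alpha @ eps_u                          # eps^0[x] = sum_i alpha_i eps_u[i]
    return eps0 + delta_k[x_ids]                  # language-specific change Delta eps^k[x]
```

Consistent with the description above, only delta_k would receive gradients during fine-tuning, whereas eps_u and A are updated (with delta_k held at zero) during meta-learning.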
4 Experimental Settings

4.1 Dataset

Target Tasks  We show the effectiveness of the proposed meta-learning method for low resource NMT with extremely limited training examples on five diverse target languages: Romanian (Ro) from WMT'16,² Latvian (Lv), Finnish (Fi), Turkish (Tr) from WMT'17,³ and Korean (Ko) from the Korean Parallel Dataset.⁴ We use the officially provided train, dev and test splits for all these languages. The statistics of these languages are presented in Table 1. We simulate the low-resource translation scenarios by randomly sub-sampling the training set with different sizes.

             # of sents.   # of En tokens     Dev     Test
    Ro-En       0.61 M         16.66 M               31.76
    Lv-En       4.46 M         67.24 M       20.24   15.15
    Fi-En       2.63 M         64.50 M       17.38   20.20
    Tr-En       0.21 M          5.58 M       15.45   13.74
    Ko-En       0.09 M          2.33 M        6.88    5.97

Table 1: Statistics of full datasets of the target language pairs. BLEU scores on the dev and test sets are reported from a supervised Transformer model with the same architecture.

Source Tasks  We use the following languages from Europarl⁵: Bulgarian (Bg), Czech (Cs), Danish (Da), German (De), Greek (El), Spanish (Es), Estonian (Et), French (Fr), Hungarian (Hu), Italian (It), Lithuanian (Lt), Dutch (Nl), Polish (Pl), Portuguese (Pt), Slovak (Sk), Slovene (Sl) and Swedish (Sv), in addition to Russian (Ru),⁶ to learn the initialization for fine-tuning. In our experiments, different combinations of source tasks are explored to see the effects from the source tasks.

² http://www.statmt.org/wmt16/translation-task.html
³ http://www.statmt.org/wmt17/translation-task.html
⁴ https://sites.google.com/site/koreanparalleldata/
⁵ http://www.statmt.org/europarl/
⁶ A subsample of approximately 2M pairs from WMT'17.
Figure 3: BLEU scores reported on test sets for {Ro, Lv, Fi, Tr} to En, where each model is first learned from 6 source tasks (Es, Fr, It, Pt, De, Ru) and then fine-tuned on randomly sampled training sets with around 16,000 English tokens per run. The error bars show the standard deviation calculated from 5 runs. (a) Ro-En; (b) Lv-En.
Validation  We pick either Ro-En or Lv-En as a validation set for meta-learning and test the generalization capability on the remaining target tasks. This allows us to study the strict form of meta-learning, in which target tasks are unknown during both training and model selection.

Preprocessing and ULR Initialization  As described in §3.3, we initialize the query embedding vectors ε^k_query of all the languages. For each language, we use the monolingual corpora built from Wikipedia⁷ and the parallel corpus. The concatenated corpus is first tokenized and segmented using byte-pair encoding (BPE, Sennrich et al., 2016), resulting in 40,000 subwords for each language. We then estimate word vectors using fastText (Bojanowski et al., 2016) and align them across all the languages in an unsupervised way using MUSE (Conneau et al., 2018) to get multilingual word vectors. We use the multilingual word vectors of the 20,000 most frequent words in English to form the universal embedding matrix ε_u.

4.2 Model and Learning

Model  We utilize the recently proposed Transformer (Vaswani et al., 2017) as an underlying NMT system. We implement Transformer in this paper based on (Gu et al., 2018a)⁸ and modify it to use the universal lexical representation from §3.3. We use the default set of hyperparameters (d_model = d_hidden = 512, n_layer = 6, n_head = 8, n_batch = 4000, t_warmup = 16000) for all the language pairs and across all the experimental settings. We refer the readers to (Vaswani et al., 2017; Gu et al., 2018a) for the details of the model. However, since the proposed meta-learning method is model-agnostic, it can be easily extended to any other NMT architectures, e.g. RNN-based sequence-to-sequence models with attention (Bahdanau et al., 2015).

⁷ We use the most recent Wikipedia dump (2018.5) from https://dumps.wikimedia.org/backup-index.html.
⁸ https://github.com/salesforce/nonauto-nmt
Meta-Train                   Ro-En                Lv-En                 Fi-En                Tr-En                Ko-En
                        zero   finetune      zero   finetune      zero   finetune      zero   finetune      zero   finetune
                        0.00   0.00 ± .00    0.00   0.00 ± .00    0.00   0.00 ± .00    0.00   0.00 ± .00    0.00   0.00 ± .00
Es                      9.20   15.71 ± .22   2.23   4.65 ± .12    2.73   5.55 ± .08    1.56   4.14 ± .03    0.63   1.40 ± .09
Es Fr                  12.35   17.46 ± .41   2.86   5.05 ± .04    3.71   6.08 ± .01    2.17   4.56 ± .20    0.61   1.70 ± .14
Es Fr It Pt            13.88   18.54 ± .19   3.88   5.63 ± .11    4.93   6.80 ± .04    2.49   4.82 ± .10    0.82   1.90 ± .07
De Ru                  10.60   16.05 ± .31   5.15   7.19 ± .17    6.62   7.98 ± .22    3.20   6.02 ± .11    1.19   2.16 ± .09
Es Fr It Pt De Ru      15.93   20.00 ± .27   6.33   7.88 ± .14    7.89   9.14 ± .05    3.72   6.02 ± .13    1.28   2.44 ± .11
All                    18.12   22.04 ± .23   9.58   10.44 ± .17  11.39   12.63 ± .22   5.34   8.97 ± .08    1.96   3.97 ± .10
Full Supervised               31.76                15.15                20.20                13.74                 5.97

Table 2: BLEU scores w.r.t. the source task set for all five target tasks.
Learning  We meta-learn using various sets of source languages to investigate the effect of source task choice. For each episode, by default, we use a single gradient step of language-specific learning with Adam (Kingma and Ba, 2014) per meta-gradient computation; the meta-gradient itself is computed by the first-order approximation in Eq. (3).

For each target task, we sample training examples to form a low-resource task. We build tasks of 4k, 16k, 40k and 160k English tokens for each language. We randomly sample the training set five times for each experiment and report the average score and its standard deviation. Each fine-tuning is done on a training set, early-stopped on a validation set and evaluated on a test set. Unless otherwise noted, datasets of 16k tokens are used.

Fine-tuning Strategies  The Transformer consists of three modules: embedding, encoder and decoder. We update all three modules during meta-learning, but during fine-tuning, we can selectively tune only a subset of these modules. Following (Zoph et al., 2016), we consider three fine-tuning strategies: tuning the embedding only (emb), the embedding and the encoder (emb+enc), or all three modules (all).
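A sketch of such selective fine-tuning is shown below; it assumes the model's parameter names begin with "emb", "encoder" or "decoder", which is an illustrative naming convention rather than the actual implementation's.

```python
# Freeze everything except the modules selected by the fine-tuning strategy.
def set_finetuning_strategy(model, strategy="emb+enc"):
    prefixes = {
        "emb":     ("emb",),
        "emb+enc": ("emb", "encoder"),
        "all":     ("emb", "encoder", "decoder"),
    }[strategy]
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(prefixes)
```

The fine-tuning optimizer would then be constructed only over the parameters left with requires_grad=True.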
5 Results

vs. Multilingual Transfer Learning  We meta-learn the initial models on all the source tasks using either Ro-En or Lv-En as a validation task. We also train the initial models to be multilingual translation systems. We fine-tune them using the four target tasks (Ro-En, Lv-En, Fi-En and Tr-En; 16k tokens each) and compare the proposed meta-learning strategy and the multilingual, transfer learning strategy. As presented in Fig. 3, the proposed learning approach significantly outperforms the multilingual, transfer learning strategy across all the target tasks regardless of which target task was used for early stopping. We also notice that the emb+enc strategy is most effective for both the meta-learning and transfer learning approaches. With the proposed meta-learning and emb+enc fine-tuning, the final NMT systems trained using only a fraction of all available training examples achieve 2/3 (Ro-En) and 1/2 (Lv-En, Fi-En and Tr-En) of the BLEU score achieved by the models trained with the full training sets.

vs. Statistical Machine Translation  We also test the same Ro-En datasets with 16,000 target tokens using the default setting of phrase-based MT (Moses), with the dev set for adjusting the parameters and the test set for calculating the final performance. We obtain 4.79 (±0.234) BLEU points, which is higher than the standard NMT performance (0 BLEU). It is however still lower than both the multi-NMT and the meta-NMT.

Impact of Validation Tasks  Similarly to training any other neural network, meta-learning still requires early-stopping to avoid overfitting to a specific set of source tasks.
In doing so, we observe that the choice of a validation task has non-negligible impact on the final performance. For instance, as shown in Fig. 3, Fi-En benefits more when Ro-En is used for validation, while the opposite happens with Tr-En. The relationship between the task similarity and the impact of a validation task must be investigated further in the future.

Figure 4: BLEU scores w.r.t. the size of the target task's training set.

Training Set Size  We vary the size of the target task's training set and compare the proposed meta-learning strategy and the multilingual, transfer learning strategy. We use the emb+enc fine-tuning on Ro-En and Fi-En. Fig. 4 demonstrates that the meta-learning approach is more robust to the drop in the size of the target task's training set. The gap between meta-learning and transfer learning grows as the size shrinks, confirming the effectiveness of the proposed approach on extremely low-resource language pairs.

Figure 5: The learning curves of BLEU scores on the validation task (Ro-En).

Impact of Source Tasks  In Table 2, we present the results on all five target tasks obtained while varying the source task set. We first see that it is always beneficial to use more source tasks. Although the impact of adding more source tasks varies from one language to another, there is up to a 2× improvement going from one source task to 18 source tasks (Lv-En, Fi-En, Tr-En and Ko-En). The same trend can be observed even without any fine-tuning (i.e., unsupervised translation (Lample et al., 2017; Artetxe et al., 2017b)). In addition, the choice of source languages has different implications for different target languages. For instance, Ro-En benefits more from {Es, Fr, It, Pt} than from {De, Ru}, while the opposite effect is observed with all the other target tasks.

Training Curves  The benefit of meta-learning over multilingual translation is clearly demonstrated when we look at the training curves in Fig. 5. With the multilingual, transfer learning approach, we observe that training rapidly saturates and eventually degrades, as the model overfits to the source tasks. MetaNMT, on the other hand, continues to improve and never degrades, as the meta-objective ensures that the model is adequate for fine-tuning on target tasks rather than for solving the source tasks.

Sample Translations  We present some sample translations from the tested models in Table 3. Inspecting these examples provides insight into the proposed meta-learning algorithm. For instance, we observe that the meta-learned model without any fine-tuning produces a word-by-word translation in the first example (Tr-En), which is due to the successful use of the universal lexical representation and the meta-learned initialization. The system however cannot reorder tokens from Turkish to English, as it has not seen any training example of Tr-En. After seeing around 600 sentence pairs (16K English tokens), the model rapidly learns to correctly reorder tokens to form a better translation. A similar phenomenon is observed in the Ko-En example. These cases could be found across different language pairs.

6 Conclusion

In this paper, we proposed a meta-learning algorithm for low-resource neural machine translation that exploits the availability of high-resource language pairs. We based the proposed algorithm on the recently proposed model-agnostic meta-learning and adapted it to work with multiple languages that do not share a common vocabulary using the technique of universal lexical representation, resulting in MetaNMT. Our extensive evaluation, using 18 high-resource source tasks and 5 low-resource target tasks, has shown that the proposed MetaNMT significantly outperforms the existing approach of multilingual, transfer learning in low-resource neural machine translation across all the language pairs considered.

The proposed approach opens new opportunities for neural machine translation. First, it is a principled framework for incorporating various extra sources of data, such as source- and target-side monolingual corpora. Second, it is a generic framework that can easily accommodate existing and future neural machine translation systems.
Source (Tr)   google mülteciler için 11 milyon dolar toplamak üzere bağış eşleştirme kampanyasını başlattı .
Target        google launches donation-matching campaign to raise $ 11 million for refugees .
Meta-0        google refugee fund for usd 11 million has launched a campaign for donation .
Meta-16k      google has launched a campaign to collect $ 11 million for refugees .

Source (Ko)   tà– ¥Ï⇠¥ 0å⌧ ¨å‰ ⌘–î ÙÌ\ p ‡⌅ ¨ , ∏`x , Xx , Ω⌧x Òt Ïh⇣‰
Target        among the suspects are retired military officials , journalists , politicians , businessmen and others .
Meta-0        last year , convicted people , among other people , of a high-ranking army of journalists in economic and economic policies , were included .
Meta-16k      the arrested persons were included in the charge , including the military officials , journalists , politicians and economists .

Table 3: Sample translations for Tr-En and Ko-En highlight the impact of fine-tuning which results in syntactically better formed translations. We highlight tokens of interest in terms of reordering.
Acknowledgement

This research was supported in part by the Facebook Low Resource Neural Machine Translation Award. This work was also partly supported by Samsung Advanced Institute of Technology (Next Generation Deep Learning: from pattern recognition to AI) and Samsung Electronics (Improving Deep Learning using Latent Structure). KC thanks support by eBay, TenCent, NVIDIA and CIFAR.

References

Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, and Nando de Freitas. 2016. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pages 3981–3989.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2017a. Learning bilingual word embeddings with (almost) no bilingual data. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 451–462.

Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2017b. Unsupervised neural machine translation. arXiv preprint arXiv:1710.11041.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

Yun Chen, Yang Liu, Yong Cheng, and Victor OK Li. 2017. A teacher-student framework for zero-resource neural machine translation. arXiv preprint arXiv:1705.00753.

Yun Chen, Yang Liu, and Victor OK Li. 2018. Zero-resource neural machine translation with multi-agent communication game. arXiv preprint arXiv:1802.03116.

Yong Cheng, Yang Liu, Qian Yang, Maosong Sun, and Wei Xu. 2016. Neural machine translation with pivot languages. arXiv preprint arXiv:1611.04928.

Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder–Decoder approaches. In Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation.

Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2018. Word translation without parallel data. International Conference on Learning Representations.

Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400.

Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. 2016a. Multi-way, multilingual neural machine translation with a shared attention mechanism. In NAACL.

Orhan Firat, Baskaran Sankaran, Yaser Al-Onaizan, Fatos T Yarman Vural, and Kyunghyun Cho. 2016b. Zero-resource translation with multi-lingual neural machine translation. In EMNLP.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann Dauphin. 2017. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122.

Jiatao Gu, James Bradbury, Caiming Xiong, Victor O. K. Li, and Richard Socher. 2018a. Non-autoregressive neural machine translation. ICLR.

Jiatao Gu, Hany Hassan, Jacob Devlin, and Victor OK Li. 2018b. Universal neural machine translation for extremely low resource languages. arXiv preprint arXiv:1802.05368.

Caglar Gulcehre, Sarath Chandar, Kyunghyun Cho, and Yoshua Bengio. 2018. Dynamic neural Turing machine with continuous and discrete addressing schemes. Neural Computation, 30(4):857–884.

Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loic Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2015. On using monolingual corpora in neural machine translation. arXiv preprint arXiv:1503.03535.

David Ha, Andrew Dai, and Quoc V Le. 2016a. HyperNetworks. arXiv preprint arXiv:1609.09106.

Thanh-Le Ha, Jan Niehues, and Alexander Waibel. 2016b. Toward multilingual neural machine translation with universal encoder and decoder. arXiv preprint arXiv:1611.04798.

Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tieyan Liu, and Wei-Ying Ma. 2016. Dual learning for machine translation. In Advances in Neural Information Processing Systems, pages 820–828.

Melvin Johnson, Mike Schuster, Quoc V Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, et al. 2016. Google's multilingual neural machine translation system: enabling zero-shot translation. arXiv preprint arXiv:1611.04558.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. arXiv preprint arXiv:1706.03872.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 48–54. Association for Computational Linguistics.

Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. 2015. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338.

Guillaume Lample, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2017. Unsupervised machine translation using monolingual corpora only. arXiv preprint arXiv:1711.00043.

Jason Lee, Kyunghyun Cho, and Thomas Hofmann. 2016. Fully character-level neural machine translation without explicit segmentation. arXiv preprint arXiv:1610.03017.

Jason Lee, Kyunghyun Cho, Jason Weston, and Douwe Kiela. 2017. Emergent translation in multi-agent communication. arXiv preprint arXiv:1710.06922.

Alexander Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. 2016. Key-value memory networks for directly reading documents. arXiv preprint arXiv:1606.03126.

Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. 2017. Meta-learning with temporal convolutions. arXiv preprint arXiv:1707.03141.

Herbert Robbins and Sutton Monro. 1951. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Improving neural machine translation models with monolingual data. arXiv preprint arXiv:1511.06709.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Edinburgh neural machine translation systems for WMT 16. arXiv preprint arXiv:1606.02891.

Samuel L Smith, David HP Turban, Steven Hamblin, and Nils Y Hammerla. 2017. Offline bilingual word vectors, orthogonal transformations and the inverted softmax. arXiv preprint arXiv:1702.03859.

Jake Snell, Kevin Swersky, and Richard Zemel. 2017. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4080–4090.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. arXiv preprint arXiv:1706.03762.

Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, et al. 2016. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638.

Zhen Yang, Wei Chen, Feng Wang, and Bo Xu. 2018. Unsupervised neural machine translation with weight sharing. arXiv preprint arXiv:1804.09057.

Jiajun Zhang and Chengqing Zong. 2016. Exploiting source-side monolingual data in neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1535–1545.

Meng Zhang, Yang Liu, Huanbo Luan, and Maosong Sun. 2017. Earth mover's distance minimization for unsupervised bilingual lexicon induction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1934–1945. Association for Computational Linguistics.

Barret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight. 2016. Transfer learning for low-resource neural machine translation. arXiv preprint arXiv:1604.02201.