Meta-Learning for Low-Resource Neural Machine Translation

Jiatao Gu*†, Yong Wang*†, Yun Chen†, Kyunghyun Cho‡ and Victor O.K. Li†
†The University of Hong Kong
‡New York University, CIFAR Azrieli Global Scholar
†{jiataogu, wangyong, vli}@eee.hku.hk, †[email protected]
[email protected]
approach across all the target language pairs, and the gap grows as the number of training examples decreases.

2 Background

Neural Machine Translation (NMT)  Given a source sentence X = {x_1, ..., x_{T'}}, a neural machine translation model factors the distribution over possible output sentences Y = {y_1, ..., y_T} into a chain of conditional probabilities with a left-to-right causal structure:

    p(Y \mid X; \theta) = \prod_{t=1}^{T+1} p(y_t \mid y_{0:t-1}, x_{1:T'}; \theta),    (1)

where special tokens y_0 (⟨bos⟩) and y_{T+1} (⟨eos⟩) are used to represent the beginning and the end of a target sentence. These conditional probabilities are parameterized using a neural network. Typically, an encoder-decoder architecture (Sutskever et al., 2014; Cho et al., 2014; Bahdanau et al., 2015) with an RNN-based decoder is used. More recently, architectures without any recurrent structures (Gehring et al., 2017; Vaswani et al., 2017) have been proposed and shown to speed up training while achieving state-of-the-art performance.
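To make the factorization in Eq. (1) concrete, the sketch below scores a target sentence by summing the per-token conditional log-probabilities under teacher forcing. The `model` callable, the id tensors and the special-token ids are illustrative placeholders, not part of the paper's implementation.

```python
# A minimal sketch of Eq. (1): log p(Y|X) as a sum of left-to-right conditionals.
# `model(src_ids, prefix_ids)` is a hypothetical callable returning next-token
# logits over the target vocabulary; any encoder-decoder NMT model fits here.
import torch
import torch.nn.functional as F

def sequence_log_prob(model, src_ids, tgt_ids, bos_id, eos_id):
    # y_0 = <bos> and y_{T+1} = <eos> delimit the target sentence.
    y = torch.cat([torch.tensor([bos_id]), tgt_ids, torch.tensor([eos_id])])
    total = 0.0
    for t in range(1, y.size(0)):
        logits = model(src_ids, y[:t])                     # p(y_t | y_{0:t-1}, x; theta)
        total = total + F.log_softmax(logits, dim=-1)[y[t]]
    return total
```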
Low Resource Translation  NMT is known to easily over-fit and result in an inferior performance when the training data is limited (Koehn and Knowles, 2017). In general, there are two ways for handling the problem of low resource translation: (1) utilizing the resource of unlabeled monolingual data, and (2) sharing the knowledge between low- and high-resource language pairs. Many research efforts have been spent on incorporating the monolingual corpora into machine translation, such as multi-task learning (Gulcehre et al., 2015; Zhang and Zong, 2016), back-translation (Sennrich et al., 2015), dual learning (He et al., 2016) and unsupervised machine translation with monolingual corpora only for both sides (Artetxe et al., 2017b; Lample et al., 2017; Yang et al., 2018).

For the second approach, prior research has worked on methods to exploit the knowledge of auxiliary translations, or even auxiliary tasks. For instance, Cheng et al. (2016); Chen et al. (2017); Lee et al. (2017); Chen et al. (2018) investigate the use of a pivot to build a translation path between two languages even without any direct resource. The pivot can be a third language or even an image in multimodal domains. When pivots are not easy to obtain, Firat et al. (2016a); Lee et al. (2016); Johnson et al. (2016) have shown that the structure of NMT is suitable for multilingual machine translation. Gu et al. (2018b) also showed that such a multilingual NMT system could improve the performance of low resource translation by using a universal lexical representation to share embedding information across languages.

All the previous work for multilingual NMT assumes that the joint training of multiple high-resource languages naturally results in a universal space (for both the input representation and the model), which, however, is not necessarily true, especially for very low resource cases.

Meta Learning  In the machine learning community, meta-learning, or learning-to-learn, has recently received interest. Meta-learning tries to solve the problem of "fast adaptation on new training data." One of the most successful applications of meta-learning has been on few-shot (or one-shot) learning (Lake et al., 2015), where a neural network is trained to readily learn to classify inputs based on only one or a few training examples. There are two categories of meta-learning:

1. learning a meta-policy for updating model parameters (see, e.g., Andrychowicz et al., 2016; Ha et al., 2016a; Mishra et al., 2017);

2. learning a good parameter initialization for fast adaptation (see, e.g., Finn et al., 2017; Vinyals et al., 2016; Snell et al., 2017).

In this paper, we propose to use a meta-learning algorithm for low-resource neural machine translation based on the second category. More specifically, we extend the idea of model-agnostic meta-learning (MAML, Finn et al., 2017) to the multilingual scenario.

3 Meta Learning for Low-Resource Neural Machine Translation

The underlying idea of MAML is to use a set of source tasks T^1, ..., T^K to find the initialization of parameters θ^0 from which learning a target task T^0 would require only a small number of training examples. In the context of machine translation, this amounts to using many high-resource language pairs to find good initial parameters and training a new translation model on a low-resource language starting from the found initial parameters.
Figure 1: The graphical illustration of the training process of the proposed MetaNMT. For each episode, one task (language pair) is sampled for meta-learning. The boxes and arrows in blue are mainly involved in language-specific learning (§3.1), and those in purple in meta-learning (§3.2).
This process can be understood as

    \theta^{*} = \text{Learn}(\mathcal{T}^0; \text{MetaLearn}(\mathcal{T}^1, \ldots, \mathcal{T}^K)).

That is, we meta-learn the initialization from auxiliary tasks and continue to learn the target task. We refer to the proposed meta-learning method for NMT as MetaNMT. See Fig. 1 for the overall illustration.

3.1 Learn: language-specific learning

Given any initial parameters θ^0 (which can be either random or meta-learned), the prior distribution of the parameters of a desired NMT model can be defined as an isotropic Gaussian:

    \theta_i \sim \mathcal{N}(\theta_i^0, 1/\beta),

where 1/β is a variance. With this prior distribution, we formulate the language-specific learning process Learn(D_T; θ^0) as maximizing the log-posterior of the model parameters given data D_T:

    \text{Learn}(D_T; \theta^0) = \arg\max_{\theta} \mathcal{L}^{D_T}(\theta)
                                = \arg\max_{\theta} \sum_{(X,Y) \in D_T} \log p(Y \mid X, \theta) - \beta \|\theta - \theta^0\|^2,

where we assume p(X|θ) to be uniform. The first term above corresponds to the maximum likelihood criterion often used for training a usual NMT system. The second term discourages the newly learned model from deviating too much from the initial parameters, alleviating the issue of overfitting when there is not enough training data. In practice, we solve the problem above by maximizing the first term with gradient-based optimization and early-stopping after only a few update steps. Thus, in the low-resource scenario, finding a good initialization θ^0 strongly correlates with the final performance of the resulting model.
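As a rough illustration of Learn(D_T; θ^0), the sketch below adapts a copy of the model from the given initialization for only a few gradient steps, which plays the role of the early stopping described above (the explicit proximity term β‖θ − θ^0‖² is dropped, as in the practical procedure). The helpers `batches` and `neg_log_likelihood` are assumptions for illustration, not the authors' code.

```python
# A sketch of language-specific learning from an initialization theta^0.
# `model.neg_log_likelihood(src, tgt)` is a hypothetical method returning
# -log p(Y|X; theta) for one mini-batch drawn from the task data D_T.
import copy
import torch

def language_specific_learn(model, batches, lr=1e-4, max_steps=100):
    adapted = copy.deepcopy(model)                         # start from theta^0
    optimizer = torch.optim.Adam(adapted.parameters(), lr=lr)
    for step, (src, tgt) in enumerate(batches):
        if step >= max_steps:                              # early-stop after a few updates
            break
        loss = adapted.neg_log_likelihood(src, tgt)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return adapted
```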
3.2 MetaLearn

We find the initialization θ^0 by repeatedly simulating low-resource translation scenarios using auxiliary, high-resource language pairs. Following Finn et al. (2017), we achieve this goal by defining the meta-objective function as

    \mathcal{L}(\theta) = \mathbb{E}_k \, \mathbb{E}_{D_{\mathcal{T}^k}, D'_{\mathcal{T}^k}} \Bigg[ \sum_{(X,Y) \in D'_{\mathcal{T}^k}} \log p\big(Y \mid X; \text{Learn}(D_{\mathcal{T}^k}; \theta)\big) \Bigg],    (2)

where k ∼ U({1, ..., K}) refers to one meta-learning episode, and D_{T^k}, D'_{T^k} follow the uniform distribution over T^k's data.

We maximize the meta-objective function using stochastic approximation (Robbins and Monro, 1951) with gradient descent. For each episode, we uniformly sample one source task at random, T^k. We then sample two subsets of training examples independently from the chosen task, D_{T^k} and D'_{T^k}. We use the former to simulate language-specific learning and the latter to evaluate its outcome. Assuming a single gradient step is taken with learning rate η, the simulation is:

    \theta'_k = \text{Learn}(D_{\mathcal{T}^k}; \theta) = \theta - \eta \nabla_\theta \mathcal{L}^{D_{\mathcal{T}^k}}(\theta).

Once the simulation of learning is done, we evaluate the updated parameters θ'_k on D'_{T^k}. The gradient computed from this evaluation, which we refer to as the meta-gradient, is used to update the meta model θ.
Figure 2: An intuitive illustration in which we use solid lines to represent the learning of initialization, and dashed lines to show the path of fine-tuning. (a) Transfer Learning; (b) Multilingual Transfer Learning; (c) Meta Learning.
It is possible to aggregate multiple episodes of source tasks before updating θ:

    \theta \leftarrow \theta - \eta' \nabla_\theta \sum_k \mathcal{L}^{D'_{\mathcal{T}^k}}(\theta'_k),

where η' is the meta learning rate.

Unlike a usual learning scenario, the resulting model θ^0 from this meta-learning procedure is not necessarily a good model on its own. It is however a good starting point for training a good model using only a few steps of learning. In the context of machine translation, this procedure can be understood as finding the initialization of a neural machine translation system that could quickly adapt to a new language pair by simulating such a fast adaptation scenario using many high-resource language pairs.

Meta-Gradient  We use the following approximation property

    H(x) v \approx \frac{\nabla(x + \nu v) - \nabla(x)}{\nu}

to approximate the meta-gradient:¹

    \nabla_\theta \mathcal{L}^{D'}(\theta') = \nabla_{\theta'} \mathcal{L}^{D'}(\theta') \, \nabla_\theta \big(\theta - \eta \nabla_\theta \mathcal{L}^{D}(\theta)\big)
                                            = \nabla_{\theta'} \mathcal{L}^{D'}(\theta') - \eta \, \nabla_{\theta'} \mathcal{L}^{D'}(\theta') \, \mathrm{H}_\theta\big(\mathcal{L}^{D}(\theta)\big)
                                            \approx \nabla_{\theta'} \mathcal{L}^{D'}(\theta') - \frac{\eta}{\nu} \Big[ \nabla_\theta \mathcal{L}^{D}(\theta)\big|_{\hat{\theta}} - \nabla_\theta \mathcal{L}^{D}(\theta)\big|_{\theta} \Big],

where ν is a small constant and

    \hat{\theta} = \theta + \nu \nabla_{\theta'} \mathcal{L}^{D'}(\theta').

In practice, we find that it is also possible to ignore the second-order term, ending up with the following simplified update rule:

    \nabla_\theta \mathcal{L}^{D'}(\theta') \approx \nabla_{\theta'} \mathcal{L}^{D'}(\theta').    (3)

¹ We omit the subscript k for simplicity.
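Putting §3.2 together, the sketch below runs a single meta-learning episode using the simplified first-order rule of Eq. (3): the gradient of the D'-loss with respect to the adapted parameters θ'_k is applied directly to the meta-parameters θ. Computing the full meta-gradient would additionally require the Hessian-vector term approximated above. The helpers `sample_two_subsets` and `neg_log_likelihood` are hypothetical, not the paper's actual code.

```python
# One meta-learning episode with the first-order approximation of Eq. (3).
import copy
import random
import torch

def meta_episode(meta_model, tasks, inner_lr=1e-4, meta_lr=1e-4):
    task = random.choice(tasks)                            # k ~ U({1, ..., K})
    D, D_prime = task.sample_two_subsets()                 # D_{T^k} and D'_{T^k}

    # Simulated language-specific learning: theta'_k = theta - eta * grad L^{D}(theta).
    adapted = copy.deepcopy(meta_model)
    inner_opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
    inner_opt.zero_grad()
    adapted.neg_log_likelihood(*D).backward()
    inner_opt.step()

    # Evaluate theta'_k on D' and use that gradient directly as the meta-gradient.
    adapted.zero_grad()
    loss_meta = adapted.neg_log_likelihood(*D_prime)
    loss_meta.backward()
    with torch.no_grad():
        for p_meta, p_adapt in zip(meta_model.parameters(), adapted.parameters()):
            if p_adapt.grad is not None:
                p_meta -= meta_lr * p_adapt.grad           # theta <- theta - eta' * meta-gradient
    return loss_meta.item()
```

Several such episodes can be accumulated before applying the meta-update, matching the aggregated rule above.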
Related Work: Multilingual Transfer Learning  The proposed MetaNMT differs from the existing framework of multilingual translation (Lee et al., 2016; Johnson et al., 2016; Gu et al., 2018b) or transfer learning (Zoph et al., 2016). The latter can be thought of as solving the following problem:

    \max_\theta \; \mathcal{L}^{\text{multi}}(\theta) = \mathbb{E}_k \Bigg[ \sum_{(X,Y) \in D_k} \log p(Y \mid X; \theta) \Bigg],

where D_k is the training set of the k-th task, or language pair. The target low-resource language pair could either be a part of joint training or be trained separately starting from the solution θ^0 found from solving the above problem.

The major difference between the proposed MetaNMT and these multilingual transfer approaches is that the latter do not consider how learning happens with the target, low-resource language pair. The former explicitly incorporates the learning process within the framework by simulating it repeatedly in Eq. (2). As we will see later in the experiments, this results in a substantial gap in the final performance on the low-resource task.

Illustration  In Fig. 2, we contrast transfer learning, multilingual learning and meta-learning using three source language pairs (Fr-En, Es-En and Pt-En) and two target pairs (Ro-En and Lv-En). Transfer learning trains an NMT system specifically for a source language pair (Es-En) and fine-tunes the system for each target language pair (Ro-En, Lv-En). Multilingual learning often trains a single NMT system that can handle many different language pairs (Fr-En, Pt-En, Es-En), which may or may not include the target pairs (Ro-En, Lv-En). If not, it fine-tunes the system for each target pair, similarly to transfer learning. Both of these however aim at directly solving the source tasks. On the other hand, meta-learning trains the NMT system to be useful for fine-tuning on various tasks including the source and target tasks. This is done by repeatedly simulating the learning process on low-resource languages using many high-resource language pairs (Fr-En, Pt-En, Es-En).
3.3 Unified Lexical Representation

I/O mismatch across language pairs  One major challenge that limits applying meta-learning for low resource machine translation is that the approach outlined above assumes the input and output spaces are shared across all the source and target tasks. This, however, does not apply to machine translation in general due to the vocabulary mismatch across different languages. In multilingual translation, this issue has been tackled by using a vocabulary of sub-words (Sennrich et al., 2015) or characters (Lee et al., 2016) shared across multiple languages. This surface-level sharing is however limited, as it cannot be applied to languages exhibiting distinct orthography (e.g., Indo-European languages vs. Korean).

Universal Lexical Representation (ULR)  We tackle this issue by dynamically building a vocabulary specific to each language using a key-value memory network (Miller et al., 2016; Gulcehre et al., 2018), as was done successfully for low-resource machine translation recently by Gu et al. (2018b). We start with multilingual word embedding matrices ε^k_query ∈ R^{|V_k|×d} pretrained on large monolingual corpora, where V_k is the vocabulary of the k-th language. These embedding vectors can be obtained with small dictionaries of seed word pairs (Artetxe et al., 2017a; Smith et al., 2017) or in a fully unsupervised manner (Zhang et al., 2017; Conneau et al., 2018). We take one of these languages k' to build a universal lexical representation consisting of a universal embedding matrix ε_u ∈ R^{M×d} and a corresponding key matrix ε_key ∈ R^{M×d}, where M < |V_{k'}|. Both ε^k_query and ε_key are fixed during meta-learning. We then compute the language-specific embedding of token x from the language k as the convex sum of the universal embedding vectors by

    \epsilon^0[x] = \sum_{i=1}^{M} \alpha_i \, \epsilon_u[i],

where α_i ∝ exp( (1/τ) ε_key[i]^⊤ A ε^k_query[x] ) and τ is set to 0.05. This approach allows us to handle languages with different vocabularies using a fixed number of shared parameters (ε_u, ε_key and A).

Learning of ULR  It is not desirable to update the universal embedding matrix ε_u when fine-tuning on a small corpus which contains a limited set of unique tokens in the target language, as it could adversely influence the other tokens' embedding vectors. We thus estimate the change to each embedding vector induced by language-specific learning by a separate parameter Δε^k[x]:

    \epsilon^k[x] = \epsilon^0[x] + \Delta\epsilon^k[x].

During language-specific learning, the ULR ε^0[x] is held constant, while only Δε^k[x] is updated, starting from an all-zero vector. On the other hand, we hold the Δε^k[x]'s constant while updating ε_u and A during the meta-learning stage.
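A minimal sketch of this lookup and of the language-specific correction is given below. The tensor names and shapes (eps_query: |V_k| × d, A: d × d, eps_key and eps_u: M × d, delta_k: |V_k| × d) follow the description above but are otherwise illustrative, not the authors' implementation.

```python
# Universal lexical representation: eps^k[x] = eps^0[x] + Delta eps^k[x],
# with eps^0[x] a convex combination of the universal embedding rows.
import torch
import torch.nn.functional as F

def ulr_embedding(x_ids, eps_query, A, eps_key, eps_u, delta_k, tau=0.05):
    q = eps_query[x_ids] @ A                      # A eps^k_query[x], shape (batch, d)
    scores = (q @ eps_key.t()) / tau              # (1/tau) eps_key[i]^T A eps^k_query[x]
    alpha = F.softmax(scores, dim=-1)             # convex weights over the M universal slots
    eps0 = alpha @ eps_u                          # eps^0[x] = sum_i alpha_i eps_u[i]
    return eps0 + delta_k[x_ids]                  # language-specific change Delta eps^k[x]
```

Consistent with the description above, only delta_k would receive gradients during fine-tuning, whereas eps_u and A are updated (with delta_k held at zero) during meta-learning.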
4 Experimental Settings

4.1 Dataset

Target Tasks  We show the effectiveness of the proposed meta-learning method for low resource NMT with extremely limited training examples on five diverse target languages: Romanian (Ro) from WMT'16,² Latvian (Lv), Finnish (Fi), Turkish (Tr) from WMT'17,³ and Korean (Ko) from the Korean Parallel Dataset.⁴ We use the officially provided train, dev and test splits for all these languages. The statistics of these languages are presented in Table 1. We simulate the low-resource translation scenarios by randomly sub-sampling the training set with different sizes.

             # of sents.   # of En tokens     Dev     Test
    Ro-En       0.61 M         16.66 M               31.76
    Lv-En       4.46 M         67.24 M       20.24   15.15
    Fi-En       2.63 M         64.50 M       17.38   20.20
    Tr-En       0.21 M          5.58 M       15.45   13.74
    Ko-En       0.09 M          2.33 M        6.88    5.97

Table 1: Statistics of full datasets of the target language pairs. BLEU scores on the dev and test sets are reported from a supervised Transformer model with the same architecture.

Source Tasks  We use the following languages from Europarl⁵: Bulgarian (Bg), Czech (Cs), Danish (Da), German (De), Greek (El), Spanish (Es), Estonian (Et), French (Fr), Hungarian (Hu), Italian (It), Lithuanian (Lt), Dutch (Nl), Polish (Pl), Portuguese (Pt), Slovak (Sk), Slovene (Sl) and Swedish (Sv), in addition to Russian (Ru),⁶ to learn the initialization for fine-tuning. In our experiments, different combinations of source tasks are explored to see the effects from the source tasks.

² http://www.statmt.org/wmt16/translation-task.html
³ http://www.statmt.org/wmt17/translation-task.html
⁴ https://sites.google.com/site/koreanparalleldata/
⁵ http://www.statmt.org/europarl/
⁶ A subsample of approximately 2M pairs from WMT'17.
Figure 3: BLEU scores reported on test sets for {Ro, Lv, Fi, Tr} to En, where each model is first learned from 6 source tasks (Es, Fr, It, Pt, De, Ru) and then fine-tuned on randomly sampled training sets with around 16,000 English tokens per run. The error bars show the standard deviation calculated from 5 runs. (a) Ro-En; (b) Lv-En.
Validation  We pick either Ro-En or Lv-En as a validation set for meta-learning and test the generalization capability on the remaining target tasks. This allows us to study the strict form of meta-learning, in which target tasks are unknown during both training and model selection.

Preprocessing and ULR Initialization  As described in §3.3, we initialize the query embedding vectors ε^k_query of all the languages. For each language, we use the monolingual corpora built from Wikipedia⁷ and the parallel corpus. The concatenated corpus is first tokenized and segmented using byte-pair encoding (BPE, Sennrich et al., 2016), resulting in 40,000 subwords for each language. We then estimate word vectors using fastText (Bojanowski et al., 2016) and align them across all the languages in an unsupervised way using MUSE (Conneau et al., 2018) to get multilingual word vectors. We use the multilingual word vectors of the 20,000 most frequent words in English to form the universal embedding matrix ε_u.

4.2 Model and Learning

Model  We utilize the recently proposed Transformer (Vaswani et al., 2017) as an underlying NMT system. We implement Transformer in this paper based on (Gu et al., 2018a)⁸ and modify it to use the universal lexical representation from §3.3. We use the default set of hyperparameters (d_model = d_hidden = 512, n_layer = 6, n_head = 8, n_batch = 4000, t_warmup = 16000) for all the language pairs and across all the experimental settings. We refer the readers to (Vaswani et al., 2017; Gu et al., 2018a) for the details of the model. However, since the proposed meta-learning method is model-agnostic, it can be easily extended to any other NMT architectures, e.g. RNN-based sequence-to-sequence models with attention (Bahdanau et al., 2015).

⁷ We use the most recent Wikipedia dump (2018.5) from https://dumps.wikimedia.org/backup-index.html.
⁸ https://github.com/salesforce/nonauto-nmt
Meta-Train                   Ro-En                Lv-En                 Fi-En                Tr-En                Ko-En
                        zero   finetune      zero   finetune      zero   finetune      zero   finetune      zero   finetune
                        0.00   0.00 ± .00    0.00   0.00 ± .00    0.00   0.00 ± .00    0.00   0.00 ± .00    0.00   0.00 ± .00
Es                      9.20   15.71 ± .22   2.23   4.65 ± .12    2.73   5.55 ± .08    1.56   4.14 ± .03    0.63   1.40 ± .09
Es Fr                  12.35   17.46 ± .41   2.86   5.05 ± .04    3.71   6.08 ± .01    2.17   4.56 ± .20    0.61   1.70 ± .14
Es Fr It Pt            13.88   18.54 ± .19   3.88   5.63 ± .11    4.93   6.80 ± .04    2.49   4.82 ± .10    0.82   1.90 ± .07
De Ru                  10.60   16.05 ± .31   5.15   7.19 ± .17    6.62   7.98 ± .22    3.20   6.02 ± .11    1.19   2.16 ± .09
Es Fr It Pt De Ru      15.93   20.00 ± .27   6.33   7.88 ± .14    7.89   9.14 ± .05    3.72   6.02 ± .13    1.28   2.44 ± .11
All                    18.12   22.04 ± .23   9.58   10.44 ± .17  11.39   12.63 ± .22   5.34   8.97 ± .08    1.96   3.97 ± .10
Full Supervised               31.76                15.15                20.20                13.74                 5.97

Table 2: BLEU scores w.r.t. the source task set for all five target tasks.
Learning  We meta-learn using various sets of source languages to investigate the effect of source task choice. For each episode, by default, we use a single gradient step of language-specific learning with Adam (Kingma and Ba, 2014) per meta-gradient computation; the meta-gradient itself is computed by the first-order approximation in Eq. (3).

For each target task, we sample training examples to form a low-resource task. We build tasks of 4k, 16k, 40k and 160k English tokens for each language. We randomly sample the training set five times for each experiment and report the average score and its standard deviation. Each fine-tuning is done on a training set, early-stopped on a validation set and evaluated on a test set. Unless otherwise noted, datasets of 16k tokens are used.

Fine-tuning Strategies  The Transformer consists of three modules: embedding, encoder and decoder. We update all three modules during meta-learning, but during fine-tuning, we can selectively tune only a subset of these modules. Following (Zoph et al., 2016), we consider three fine-tuning strategies: tuning the embedding only (emb), the embedding and the encoder (emb+enc), or all three modules (all).
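A sketch of such selective fine-tuning is shown below; it assumes the model's parameter names begin with "emb", "encoder" or "decoder", which is an illustrative naming convention rather than the actual implementation's.

```python
# Freeze everything except the modules selected by the fine-tuning strategy.
def set_finetuning_strategy(model, strategy="emb+enc"):
    prefixes = {
        "emb":     ("emb",),
        "emb+enc": ("emb", "encoder"),
        "all":     ("emb", "encoder", "decoder"),
    }[strategy]
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(prefixes)
```

The fine-tuning optimizer would then be constructed only over the parameters left with requires_grad=True.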
5 Results

vs. Multilingual Transfer Learning  We meta-learn the initial models on all the source tasks using either Ro-En or Lv-En as a validation task. We also train the initial models to be multilingual translation systems. We fine-tune them using the four target tasks (Ro-En, Lv-En, Fi-En and Tr-En; 16k tokens each) and compare the proposed meta-learning strategy and the multilingual, transfer learning strategy. As presented in Fig. 3, the proposed learning approach significantly outperforms the multilingual, transfer learning strategy across all the target tasks regardless of which target task was used for early stopping. We also notice that the emb+enc strategy is most effective for both the meta-learning and transfer learning approaches. With the proposed meta-learning and emb+enc fine-tuning, the final NMT systems trained using only a fraction of all available training examples achieve 2/3 (Ro-En) and 1/2 (Lv-En, Fi-En and Tr-En) of the BLEU score achieved by the models trained with the full training sets.

vs. Statistical Machine Translation  We also test the same Ro-En datasets with 16,000 target tokens using the default setting of phrase-based MT (Moses), with the dev set for adjusting the parameters and the test set for calculating the final performance. We obtain 4.79 (±0.234) BLEU points, which is higher than the standard NMT performance (0 BLEU). It is however still lower than both the multi-NMT and the meta-NMT.

Impact of Validation Tasks  Similarly to training any other neural network, meta-learning still requires early-stopping to avoid overfitting to a specific set of source tasks.
In doing so, we observe that the choice of a validation task has non-negligible impact on the final performance. For instance, as shown in Fig. 3, Fi-En benefits more when Ro-En is used for validation, while the opposite happens with Tr-En. The relationship between the task similarity and the impact of a validation task must be investigated further in the future.

Figure 4: BLEU scores w.r.t. the size of the target task's training set.

Training Set Size  We vary the size of the target task's training set and compare the proposed meta-learning strategy and the multilingual, transfer learning strategy. We use the emb+enc fine-tuning on Ro-En and Fi-En. Fig. 4 demonstrates that the meta-learning approach is more robust to the drop in the size of the target task's training set. The gap between meta-learning and transfer learning grows as the size shrinks, confirming the effectiveness of the proposed approach on extremely low-resource language pairs.

Figure 5: The learning curves of BLEU scores on the validation task (Ro-En).

Impact of Source Tasks  In Table 2, we present the results on all five target tasks obtained while varying the source task set. We first see that it is always beneficial to use more source tasks. Although the impact of adding more source tasks varies from one language to another, there is up to a 2× improvement going from one source task to 18 source tasks (Lv-En, Fi-En, Tr-En and Ko-En). The same trend can be observed even without any fine-tuning (i.e., unsupervised translation (Lample et al., 2017; Artetxe et al., 2017b)). In addition, the choice of source languages has different implications for different target languages. For instance, Ro-En benefits more from {Es, Fr, It, Pt} than from {De, Ru}, while the opposite effect is observed with all the other target tasks.

Training Curves  The benefit of meta-learning over multilingual translation is clearly demonstrated when we look at the training curves in Fig. 5. With the multilingual, transfer learning approach, we observe that training rapidly saturates and eventually degrades, as the model overfits to the source tasks. MetaNMT, on the other hand, continues to improve and never degrades, as the meta-objective ensures that the model is adequate for fine-tuning on target tasks rather than for solving the source tasks.

Sample Translations  We present some sample translations from the tested models in Table 3. Inspecting these examples provides insight into the proposed meta-learning algorithm. For instance, we observe that the meta-learned model without any fine-tuning produces a word-by-word translation in the first example (Tr-En), which is due to the successful use of the universal lexical representation and the meta-learned initialization. The system however cannot reorder tokens from Turkish to English, as it has not seen any training example of Tr-En. After seeing around 600 sentence pairs (16K English tokens), the model rapidly learns to correctly reorder tokens to form a better translation. A similar phenomenon is observed in the Ko-En example. These cases could be found across different language pairs.

6 Conclusion

In this paper, we proposed a meta-learning algorithm for low-resource neural machine translation that exploits the availability of high-resource language pairs. We based the proposed algorithm on the recently proposed model-agnostic meta-learning and adapted it to work with multiple languages that do not share a common vocabulary using the technique of universal lexical representation, resulting in MetaNMT. Our extensive evaluation, using 18 high-resource source tasks and 5 low-resource target tasks, has shown that the proposed MetaNMT significantly outperforms the existing approach of multilingual, transfer learning in low-resource neural machine translation across all the language pairs considered.

The proposed approach opens new opportunities for neural machine translation. First, it is a principled framework for incorporating various extra sources of data, such as source- and target-side monolingual corpora. Second, it is a generic framework that can easily accommodate existing and future neural machine translation systems.
Source (Tr)   google mülteciler için 11 milyon dolar toplamak üzere bağış eşleştirme kampanyasını başlattı .
Target        google launches donation-matching campaign to raise $ 11 million for refugees .
Meta-0        google refugee fund for usd 11 million has launched a campaign for donation .
Meta-16k      google has launched a campaign to collect $ 11 million for refugees .

Source (Ko)   tà– ¥Ï⇠¥ 0å⌧ ¨å‰ ⌘–î ÙÌ\ p ‡⌅ ¨ , ∏`x , Xx , Ω⌧x Òt Ïh⇣‰
Target        among the suspects are retired military officials , journalists , politicians , businessmen and others .
Meta-0        last year , convicted people , among other people , of a high-ranking army of journalists in economic and economic policies , were included .
Meta-16k      the arrested persons were included in the charge , including the military officials , journalists , politicians and economists .

Table 3: Sample translations for Tr-En and Ko-En highlight the impact of fine-tuning which results in syntactically better formed translations. We highlight tokens of interest in terms of reordering.
Acknowledgement

This research was supported in part by the Facebook Low Resource Neural Machine Translation Award. This work was also partly supported by Samsung Advanced Institute of Technology (Next Generation Deep Learning: from pattern recognition to AI) and Samsung Electronics (Improving Deep Learning using Latent Structure). KC thanks support by eBay, TenCent, NVIDIA and CIFAR.

References

Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, and Nando de Freitas. 2016. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pages 3981–3989.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2017a. Learning bilingual word embeddings with (almost) no bilingual data. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 451–462.

Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2017b. Unsupervised neural machine translation. arXiv preprint arXiv:1710.11041.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

Yun Chen, Yang Liu, Yong Cheng, and Victor OK Li. 2017. A teacher-student framework for zero-resource neural machine translation. arXiv preprint arXiv:1705.00753.

Yun Chen, Yang Liu, and Victor OK Li. 2018. Zero-resource neural machine translation with multi-agent communication game. arXiv preprint arXiv:1802.03116.

Yong Cheng, Yang Liu, Qian Yang, Maosong Sun, and Wei Xu. 2016. Neural machine translation with pivot languages. arXiv preprint arXiv:1611.04928.

Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder–Decoder approaches. In Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation.

Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2018. Word translation without parallel data. International Conference on Learning Representations.

Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400.

Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. 2016a. Multi-way, multilingual neural machine translation with a shared attention mechanism. In NAACL.

Orhan Firat, Baskaran Sankaran, Yaser Al-Onaizan, Fatos T Yarman Vural, and Kyunghyun Cho. 2016b. Zero-resource translation with multi-lingual neural machine translation. In EMNLP.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann Dauphin. 2017. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122.

Jiatao Gu, James Bradbury, Caiming Xiong, Victor O. K. Li, and Richard Socher. 2018a. Non-autoregressive neural machine translation. ICLR.

Jiatao Gu, Hany Hassan, Jacob Devlin, and Victor OK Li. 2018b. Universal neural machine translation for extremely low resource languages. arXiv preprint arXiv:1802.05368.

Caglar Gulcehre, Sarath Chandar, Kyunghyun Cho, and Yoshua Bengio. 2018. Dynamic neural Turing machine with continuous and discrete addressing schemes. Neural Computation, 30(4):857–884.

Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loic Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2015. On using monolingual corpora in neural machine translation. arXiv preprint arXiv:1503.03535.

David Ha, Andrew Dai, and Quoc V Le. 2016a. HyperNetworks. arXiv preprint arXiv:1609.09106.

Thanh-Le Ha, Jan Niehues, and Alexander Waibel. 2016b. Toward multilingual neural machine translation with universal encoder and decoder. arXiv preprint arXiv:1611.04798.

Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tieyan Liu, and Wei-Ying Ma. 2016. Dual learning for machine translation. In Advances in Neural Information Processing Systems, pages 820–828.

Melvin Johnson, Mike Schuster, Quoc V Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, et al. 2016. Google's multilingual neural machine translation system: enabling zero-shot translation. arXiv preprint arXiv:1611.04558.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. arXiv preprint arXiv:1706.03872.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 48–54. Association for Computational Linguistics.

Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. 2015. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338.

Guillaume Lample, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2017. Unsupervised machine translation using monolingual corpora only. arXiv preprint arXiv:1711.00043.

Jason Lee, Kyunghyun Cho, and Thomas Hofmann. 2016. Fully character-level neural machine translation without explicit segmentation. arXiv preprint arXiv:1610.03017.

Jason Lee, Kyunghyun Cho, Jason Weston, and Douwe Kiela. 2017. Emergent translation in multi-agent communication. arXiv preprint arXiv:1710.06922.

Alexander Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. 2016. Key-value memory networks for directly reading documents. arXiv preprint arXiv:1606.03126.

Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. 2017. Meta-learning with temporal convolutions. arXiv preprint arXiv:1707.03141.

Herbert Robbins and Sutton Monro. 1951. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Improving neural machine translation models with monolingual data. arXiv preprint arXiv:1511.06709.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Edinburgh neural machine translation systems for WMT 16. arXiv preprint arXiv:1606.02891.

Samuel L Smith, David HP Turban, Steven Hamblin, and Nils Y Hammerla. 2017. Offline bilingual word vectors, orthogonal transformations and the inverted softmax. arXiv preprint arXiv:1702.03859.

Jake Snell, Kevin Swersky, and Richard Zemel. 2017. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4080–4090.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. arXiv preprint arXiv:1706.03762.

Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, et al. 2016. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638.

Zhen Yang, Wei Chen, Feng Wang, and Bo Xu. 2018. Unsupervised neural machine translation with weight sharing. arXiv preprint arXiv:1804.09057.

Jiajun Zhang and Chengqing Zong. 2016. Exploiting source-side monolingual data in neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1535–1545.

Meng Zhang, Yang Liu, Huanbo Luan, and Maosong Sun. 2017. Earth mover's distance minimization for unsupervised bilingual lexicon induction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1934–1945. Association for Computational Linguistics.

Barret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight. 2016. Transfer learning for low-resource neural machine translation. arXiv preprint arXiv:1604.02201.