
Competence-based Curriculum Learning for Neural Machine Translation

Emmanouil Antonios Platanios†, Otilia Stretcu†, Graham Neubig‡, Barnabas Poczos†, Tom M. Mitchell†
†Machine Learning Department, ‡Language Technologies Institute
Carnegie Mellon University
{e.a.platanios,ostretcu,gneubig,bpoczos,tom.mitchell}@cs.cmu.edu

Abstract

Current state-of-the-art NMT systems use large neural networks that are not only slow to train, but also often require many heuristics and optimization tricks, such as specialized learning rate schedules and large batch sizes. This is undesirable as it requires extensive hyperparameter tuning. In this paper, we propose a curriculum learning framework for NMT that reduces training time, reduces the need for specialized heuristics or large batch sizes, and results in overall better performance. Our framework consists of a principled way of deciding which training samples are shown to the model at different times during training, based on the estimated difficulty of a sample and the current competence of the model. Filtering training samples in this manner prevents the model from getting stuck in bad local optima, making it converge faster and reach a better solution than the common approach of uniformly sampling training examples. Furthermore, the proposed method can be easily applied to existing NMT models by simply modifying their input data pipelines. We show that our framework can help improve the training time and the performance of both recurrent neural network models and Transformers, achieving up to a 70% decrease in training time, while at the same time obtaining accuracy improvements of up to 2.2 BLEU.

[Figure 1: block diagram. Each training sample is assigned a difficulty and the model state determines a competence; a sample is used only if difficulty(sample) ≤ competence(model), and the resulting data is passed to the model trainer.]
Figure 1: Overview of the proposed curriculum learning framework. During training, the difficulty of each training sample is estimated and a decision whether to use it is made based on the current competence of the model.

1 Introduction

Neural Machine Translation (NMT; Kalchbrenner and Blunsom (2013); Bahdanau et al. (2015)) now represents the state of the art adopted in most machine translation systems (Wu et al., 2016; Crego et al., 2016; Bojar et al., 2017a), largely due to its ability to benefit from end-to-end training on massive amounts of data. In particular, recently-introduced self-attentional Transformer architectures (Vaswani et al., 2017) are rapidly becoming the de-facto standard in NMT, having demonstrated both superior performance and training speed compared to previous architectures using recurrent neural networks (RNNs; Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014). However, large-scale NMT systems are often hard to train, requiring complicated heuristics which can be both time-consuming and expensive to tune. This is especially true for Transformers which, when carefully tuned, have been shown to consistently outperform RNNs (Popel and Bojar, 2018), but which, on the other hand, also rely on a number of heuristics such as specialized learning rates and large-batch training.

In this paper, we attempt to tackle this problem by proposing a curriculum learning framework for training NMT systems that reduces training time, reduces the need for specialized heuristics or large batch sizes, and results in overall better performance. It allows us to train both RNNs and, perhaps more importantly, Transformers, with relative ease. Our proposed method is based on the idea of teaching algorithms in a similar manner as humans, from easy concepts to more difficult ones. This idea can be traced back to the work of Elman (1993) and Krueger and Dayan (2009).

The main motivation is that training algorithms can perform better if training data is presented in a specific order, starting from easy examples and moving on to more difficult ones, as the learner becomes more competent. In the case of machine learning, it can also be thought of as a means to avoid getting stuck in bad local optima early on in training. An overview of the proposed framework is shown in Figure 1.

Notably, we are not the first to examine curriculum learning for NMT, although other related works have met with mixed success. Kocmi and Bojar (2017) explore the impact of several curriculum heuristics on training a translation system for a single epoch, presenting the training examples in an easy-to-hard order based on sentence length and vocabulary frequency. However, their strategy introduces all training samples during the first epoch, and how this affects learning in following epochs is not clear, with official evaluation results (Bojar et al., 2017b) indicating that final performance may indeed be hurt by this strategy. Contemporaneously to our work, Zhang et al. (2018) further propose to split the training samples into a predefined number of bins (5, in their case), based on various difficulty metrics. A manually designed curriculum schedule then specifies the bins from which the model samples training examples. Their experiments demonstrate that the benefits of curriculum learning are highly sensitive to several hyperparameters (e.g., learning rate, number of iterations spent in each phase, etc.), and largely appear as gains in convergence speed as opposed to final model accuracy.

In contrast to these previous approaches, we define a continuous curriculum learning method (instead of a discretized regime) with only one tunable hyperparameter (the duration of curriculum learning). Furthermore, as opposed to previous work which only focuses on RNNs, we also experiment with Transformers, which are notoriously hard to train (Popel and Bojar, 2018). Finally, unlike any of the work described above, we show that our curriculum approach helps not only in terms of convergence speed, but also in terms of the learned model performance. In summary, our method has the following desirable features:

1. Abstract: It is a novel, generic, and extensible formulation of curriculum learning. A number of previous heuristic-based approaches, such as that of Kocmi and Bojar (2017), can be formulated as special cases of our framework.
2. Simple: It can be applied to existing NMT systems with only a small modification to their training data pipelines.
3. Automatic: It does not require any tuning other than picking the value of a single parameter, which is the length of the curriculum (i.e., for how many steps to use curriculum learning before easing into normal training).
4. Efficient: It reduces training time by up to 70%, whereas the contemporaneous work of Zhang et al. (2018) reports reductions of up to 46%.
5. Improved Performance: It improves the performance of the learned models by up to 2.2 BLEU points, whereas the best setting reported by Zhang et al. (2018) achieves gains of up to 1.55 BLEU after careful tuning.

In the next section, we introduce our proposed curriculum learning framework.

2 Proposed Method

We propose competence-based curriculum learning, a training framework based on the idea that training algorithms can perform better if training data is presented in a way that picks examples appropriate for the model's current competence. More specifically, we define the following two concepts that are central to our framework:

Difficulty: A value that represents the difficulty of a training sample and that may depend on the current state of the learner. For example, sentence length is an intuitive difficulty metric for natural language processing tasks. The only constraint is that difficulty scores are comparable across different training samples (i.e., the training samples can be ranked according to their difficulty).

Competence: A value between 0 and 1 that represents the progress of a learner during its training. It is defined as a function of the learner's state. More specifically, we define the competence c(t) at time t (measured in terms of training steps) of a learner as the proportion of training data it is allowed to use at that time. The training examples are ranked according to their difficulty, and the learner is only allowed to use the top c(t) portion of them at time t.

Using these two concepts, we propose Algorithm 1 (a high-level overview is shown in Figure 1, an example visualization of the first two steps is shown in Figure 2, and an example of the interaction between difficulty and competence is shown in Figure 3).

[Figure 2: left, a histogram of sentence lengths together with its empirical CDF; right, example sentences mapped from raw length to difficulty, e.g., "Thank you very much!" (length 4 → difficulty 0.01), "My name is ..." (6 → 0.03), "Barack Obama loves ..." (13 → 0.15), "What did she say ..." (123 → 0.95).]
Figure 2: Example visualization of the preprocessing sequence used in the proposed algorithm. The histogram shown is that of sentence lengths from the WMT-16 En→De dataset used in our experiments. Here sentence lengths represent an example difficulty scoring function, d. "CDF" stands for the empirical "cumulative density function" obtained from the histogram on the left plot.

Algorithm 1: Competence-based curriculum learning algorithm.
Input: Dataset $D = \{s_i\}_{i=1}^{M}$ consisting of M samples; model trainer T that takes as input batches of training data to use at each step; difficulty scoring function d; and competence function c.
1: Compute the difficulty, $d(s_i)$, for each $s_i \in D$.
2: Compute the cumulative density function (CDF) of the difficulty scores. This results in one difficulty CDF score per sample, $\bar{d}(s_i) \in [0, 1]$. Illustrated in Figure 2.
3: for training step t = 1, ... do
4:   Compute the model competence, c(t).
5:   Sample a data batch, $B_t$, uniformly from all $s_i \in D$ such that $\bar{d}(s_i) \le c(t)$. Illustrated in Figure 3.
6:   Invoke the trainer, T, using $B_t$ as input.
Output: Trained model.

[Figure 3: two snapshots of the difficulty histogram (at steps 1,000 and 10,000); at each step, a batch is sampled uniformly from the region whose difficulty lies below the competence at the current step.]
Figure 3: Example illustration of the training data "filtering" performed by our curriculum learning algorithm. At each training step: (i) the current competence of the model is computed, and (ii) a batch of training examples is sampled uniformly from all training examples whose difficulty is lower than that competence. In this example, we are using the sentence length difficulty heuristic shown in Equation 1, along with the square root competence model shown in Equation 7.
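To make the procedure concrete, the following is a minimal Python sketch of Algorithm 1. It is only an illustration (the implementation released with this paper is written in TensorFlow Scala); the names `difficulty`, `competence`, and `train_step` are placeholders for a difficulty scoring function d, a competence function c, and the trainer T.

```python
import random

def curriculum_train(samples, difficulty, competence, train_step,
                     num_steps, batch_size):
    """Sketch of Algorithm 1: competence-based curriculum training."""
    # Steps 1-2: score every sample and sort by difficulty (easiest first).
    # After sorting, the CDF score of the i-th sample is (i + 1) / M, so the
    # condition "CDF score <= c(t)" simply selects a prefix of the sorted list.
    scored = sorted(samples, key=difficulty)
    m = len(scored)
    for t in range(1, num_steps + 1):          # Step 3
        c_t = competence(t)                    # Step 4: competence in (0, 1]
        allowed = max(1, int(c_t * m))         # size of the usable pool
        batch = random.choices(scored[:allowed], k=batch_size)  # Step 5: uniform
        train_step(batch)                      # Step 6: invoke the trainer T
```

Sampling from a prefix of the difficulty-sorted data is equivalent, up to ties in the difficulty scores, to the CDF-based filter $\bar{d}(s_i) \le c(t)$ used in the algorithm.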
Note that, at each training step, we are not changing the relative probability of each training sample under the input data distribution, but we are rather constraining the domain of that distribution, based on the current competence of the learner. Eventually, once the competence becomes 1, the training process becomes equivalent to training without a curriculum, with the main difference that the learner should now be more capable of learning from the more difficult examples. Given the dependence of this algorithm on the specific choices of the difficulty scoring function, d, and the competence function, c, we now describe our instantiations for training NMT models.

2.1 Difficulty Metrics

There are many possible ways of defining the difficulty of translating a sentence. We consider two heuristics inspired by what we, as humans, may consider difficult when translating, and by factors which can negatively impact the optimization algorithms used when training NMT models. In the rest of this section we denote our training corpus as a collection of M sentences, $\{s_i\}_{i=1}^{M}$, where each sentence is a sequence of words, $s_i = \{w_0^i, \dots, w_{N_i}^i\}$.

Sentence Length: We argue that it is harder to translate longer sentences, as longer sentences require being able to translate their component parts, which often consist of short sentences. Furthermore, longer sentences are intuitively harder to translate due to the propagation of errors made early on when generating the target language sentence. Therefore, a simple way to define the difficulty of a sentence $s_i = \{w_0^i, \dots, w_{N_i}^i\}$ is as follows:

$d_{\text{length}}(s_i) \triangleq N_i$.  (1)

Note that we can compute this difficulty metric on either the source language sentence or the target language sentence. We only consider the source sentence in this paper.¹ (Footnote 1: NMT models typically first pick up information about producing sentences of correct length. It can be argued that presenting only short sentences first may lead to learning a strong bias for the sentence lengths. In our experiments, we did not observe this to be an issue, as the models kept improving and predicting sentences of correct length throughout training.)

Word Rarity: Another aspect of language that can affect the difficulty of translation is the frequency with which words appear. For example, humans may find rare words hard to translate because we rarely ever see them and it may be hard to recall their meaning. The same can be true for NMT models, where: (i) the statistical strength of the training examples containing rare words is low and thus the model needs to keep revisiting such words in order to learn robust representations for them, and (ii) the gradients of the rare word embeddings tend to have high variance; they are overestimates of the true gradients in the few occasions where they are non-zero, and underestimates otherwise. This suggests that using word frequencies may be a helpful difficulty heuristic. Given a corpus of sentences, $\{s_i\}_{i=1}^{M}$, we define relative word frequencies as:

$\hat{p}(w_j) \triangleq \frac{1}{N_{\text{total}}} \sum_{i=1}^{M} \sum_{k=1}^{N_i} \mathbb{1}_{w_k^i = w_j}$,  (2)

where $j = 1, \dots, \#\{\text{unique words in corpus}\}$ and $\mathbb{1}_{\text{condition}}$ is the indicator function, which is equal to 1 if its condition is satisfied and 0 otherwise. Next, we need to decide how to aggregate the relative word frequencies of all words in a sentence to obtain a single difficulty score for that sentence. Previous research has proposed various pooling operations, such as minimum, maximum, and average (Zhang et al., 2018), but they show that these do not work well in practice. We propose a different approach. Ultimately, what might be most important is the overall likelihood of a sentence, as that contains information about both word frequency and, implicitly, sentence length. An approximation to this likelihood is the product of the unigram probabilities, which is related to previous work in the area of active learning (Settles and Craven, 2008). This product can be thought of as an approximate language model (assuming words are sampled independently) and also implicitly incorporates information about the sentence length that was proposed earlier (longer sentence scores are products over more terms in [0, 1] and are thus likely to be smaller). We thus propose the following difficulty heuristic:

$d_{\text{rarity}}(s_i) \triangleq -\sum_{k=1}^{N_i} \log \hat{p}(w_k^i)$,  (3)

where we use logarithms of word probabilities to prevent numerical errors. Note that negation is used because we define less likely (i.e., more rare) sentences as more difficult.
These are just two examples of difficulty metrics, and it is easy to conceive of other metrics such as the occurrence of homographs (Liu et al., 2018) or context-sensitive words (Bawden et al., 2018), the examination of which we leave for future work.

2.2 Competence Functions

For this paper, we propose two simple functional forms for c(t) and justify them with some intuition. More sophisticated strategies that depend on the loss function, the loss gradient, or on the learner's performance on held-out data are possible, but we do not consider them in this paper.

Linear: This is a simple way to define c(t). Given an initial value $c_0 \triangleq c(0) \ge 0$ and a slope parameter r, we define:

$c(t) \triangleq \min(1,\, t\,r + c_0)$.  (4)

In this case, new training examples are constantly being introduced during the training process, at a constant rate r (as a proportion of the total number of available training examples). Note that we can also define $r = (1 - c_0)/T$, where T denotes the time after which the learner is fully competent, which results in:

$c_{\text{linear}}(t) \triangleq \min\!\left(1,\; t\,\frac{1 - c_0}{T} + c_0\right)$.  (5)

Root: In the case of the linear form, the same number of new and more difficult examples is added to the training set at all times t. However, as the training data grows in size, it gets less likely that any single data example will be sampled in a training batch. Thus, given that the newly added examples are less likely to be sampled, we propose to reduce the number of new training examples added per unit time as training progresses, to give the learner sufficient time to assimilate their information content.

More specifically, we define the rate at which new examples are added as inversely proportional to the current training data size:

$\frac{dc(t)}{dt} = \frac{P}{c(t)}$,  (6)

for some constant P ≥ 0. Solving this simple differential equation, we obtain:

$\int c(t)\,dc(t) = \int P\,dt \;\Rightarrow\; c(t) = \sqrt{2Pt + D}$,

for some constants P and D. Then, we consider the following constraint: $c_0 \triangleq c(0) = \sqrt{D} \Rightarrow D = c_0^2$. Finally, we also have that $c(T) = 1 \Rightarrow P = (1 - c_0^2)/2T$, where T denotes the time after which the learner is fully competent. This, along with the constraint that $c(t) \in [0, 1]$ for all t ≥ 0, results in the following definition:

$c_{\text{sqrt}}(t) \triangleq \min\!\left(1,\; \sqrt{t\,\frac{1 - c_0^2}{T} + c_0^2}\right)$.  (7)

In our experiments, we refer to this specific formulation as the "square root" competence model. If we want to make the curve sharper, meaning that even more time is spent per sample added later on in training, then we can consider the following more general form, for p ≥ 1:

$c_{\text{root-}p}(t) \triangleq \min\!\left(1,\; \sqrt[p]{t\,\frac{1 - c_0^p}{T} + c_0^p}\right)$.  (8)

We observed that best performance is obtained when p = 2 and that, as we increase p, performance converges to that obtained when training without a curriculum. Plots of the competence functions we presented are shown in Figure 4.

[Figure 4: competence as a function of time for c_linear, c_sqrt, c_root-3, c_root-5, and c_root-10.]
Figure 4: Plots of various competence functions with c0 = 0.01 (initial competence value) and T = 1,000 (total duration of the curriculum learning phase).
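Purely as an illustration, the competence functions above take only a few lines of Python; the parameter names c0, T, and p follow the notation of Equations 5, 7, and 8.

```python
def linear_competence(t, c0=0.01, T=1000):
    """Equation 5: competence grows linearly from c0 to 1 over T steps."""
    return min(1.0, t * (1.0 - c0) / T + c0)

def sqrt_competence(t, c0=0.01, T=1000):
    """Equation 7: the 'square root' competence model (root-p with p = 2)."""
    return min(1.0, (t * (1.0 - c0 ** 2) / T + c0 ** 2) ** 0.5)

def root_p_competence(t, c0=0.01, T=1000, p=2):
    """Equation 8: general root-p competence; p = 2 recovers sqrt_competence."""
    return min(1.0, (t * (1.0 - c0 ** p) / T + c0 ** p) ** (1.0 / p))
```

For example, with c0 = 0.01 and T = 1,000 (the setting plotted in Figure 4), sqrt_competence(250) ≈ 0.50 while linear_competence(250) ≈ 0.26, which illustrates how the square root model front-loads the introduction of new examples.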
The multi-head attention keys and values depth is
2.3 Scalability set to the word embedding size. The word embed-
Our method can be easily used in large-scale NMT ding size is 512 for all experiments. Furthermore,
systems. This is because it mainly consists of a for the Transformer experiments on the two smaller
preprocessing step of the training data that com- datasets we do not use any learning rate schedule,
putes the difficulty scores. The implementation we and for the experiments on the largest dataset we
are releasing with this paper computes these scores use the default Transformer schedule. A detailed
in an efficient manner by building a graph describ- discussion on learning rate schedules for Trans-
ing their dependencies, as well as whether they formers is provided near the end of this section.
are sentence-level scores (e.g., sentence length), All of our experiments were conducted on a ma-
or corpus-level (e.g., CDF), and using that graph chine with a single Nvidia V100 GPU, and 24 GBs
to optimize their execution. Using only 8GB of of system memory.
memory, we can process up to 20k sentences per During training, we use a label smoothing factor
second when computing sentence rarity scores, and of 0.1 (Wu et al., 2016) and the AMSGrad opti-
up to 150k sentences per second when computing mizer (Reddi et al., 2018) with its default parame-
sentence length scores. ters in TensorFlow, and a batch size of 5,120 tokens

Dataset           # Train   # Dev   # Test
IWSLT-15 En→Vi    133k      768     1268
IWSLT-16 Fr→En    224k      1080    1133
WMT-16 En→De      4.5m      3003    2999

Table 1: Number of parallel sentences in each dataset. "k" stands for "thousand" and "m" stands for "million".

Curriculum Hyperparameters. We set the initial competence c0 to 0.01, in all experiments. This means that all models start training using the 1% easiest training examples. The curriculum length T is effectively the only hyperparameter that we need to set for our curriculum methods. In each experiment, we set T in the following manner: we train the baseline model without using any curriculum and we compute the number of training steps it takes to reach approximately 90% of its final BLEU score. We then set T to this value (a small sketch of this procedure follows the notation list below). This results in T being set to 5,000 for the RNN experiments on the IWSLT datasets, and 20,000 for the corresponding Transformer experiments. For WMT, we set T to 20,000 and 50,000 for RNNs and Transformers, respectively. Furthermore, we use the following notation and abbreviations when presenting our results:
– Plain: Trained without using any curriculum.
– SL: Curriculum with sentence length difficulty.
– SR: Curriculum with sentence rarity difficulty.
– Linear: Curriculum with the linear competence shown in Equation 5.
– Sqrt: Curriculum with the square root competence shown in Equation 7.
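Returning to the choice of the curriculum length T described above, the following is a small sketch that reads T off a baseline learning curve. The `history` argument (a list of (step, BLEU) evaluations from a run trained without any curriculum) is a hypothetical input format, not part of the released code.

```python
def curriculum_length(history, fraction=0.9):
    """Pick T as the first step at which the baseline reaches `fraction`
    of its final BLEU score. `history` is a list of (step, bleu) pairs."""
    final_bleu = history[-1][1]
    target = fraction * final_bleu
    for step, bleu in history:
        if bleu >= target:
            return step
    return history[-1][0]
```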
Data Preprocessing. Our experiments are performed using the machine translation library released by Platanios et al. (2018). We use the same data preprocessing approach the authors used in their experiments. While training, we consider sentences up to length 200. Similar to them, for the IWSLT-15 experiments we use a per-language vocabulary which contains the 20,000 most frequently occurring words, while ignoring words that appear less than 5 times in the whole corpus. For the IWSLT-16 and WMT-16 experiments we use a byte-pair encoding (BPE) vocabulary (Sennrich et al., 2016) trained using 32,000 merge operations, similar to the original Transformer paper by Vaswani et al. (2017).

[Figure 5: learning curves (test BLEU vs. training step) for RNN and Transformer models on IWSLT-15 En→Vi, IWSLT-16 Fr→En, and WMT-16 En→De.]
Figure 5: Plots illustrating the performance of various models on the test set, as training progresses. Blue lines represent the baseline methods when no curriculum is used, and red lines represent the same models when different versions of our curriculum learning framework are used to train them. The vertical lines represent the step at which the models attain the BLEU score that the baseline models attain at convergence.

Results. We present a summary of our results in Table 2 and we also show complete learning curves for all methods in Figure 5. The evaluation metrics we use are the test set BLEU score and the time it takes for the models using curriculum learning to obtain the BLEU score that the baseline models attain at convergence. We observe that Transformers consistently benefit from our curriculum learning approach, achieving gains of up to 2 BLEU, and reductions in training time of up to 70%. RNNs also benefit, but to a lesser extent. This is consistent with our motivation for this paper, which stems from the observation that training RNNs is easier and more robust than training Transformers.

BLEU            RNN                                           Transformer
                Plain   SL-lin  SL-sqrt SR-lin  SR-sqrt       Plain   Plain*  SL-lin  SL-sqrt SR-lin  SR-sqrt
En→Vi           26.27   26.57   27.23   26.72   26.87         28.06   29.77   29.14   29.57   29.03   29.81
Fr→En           31.15   31.88   31.92   31.39   31.57         34.05   34.88   34.98   35.47   35.30   35.83
En→De           26.53   26.55   26.54   26.62   26.62         –       27.95   28.71   29.28   29.93   30.16

Time            RNN                                           Transformer
                Plain   SL-lin  SL-sqrt SR-lin  SR-sqrt       Plain   Plain*  SL-lin  SL-sqrt SR-lin  SR-sqrt
En→Vi           1.00    0.64    0.61    0.71    0.57          1.00    1.00    0.44    0.33    0.35    0.31
Fr→En           1.00    1.00    0.93    1.10    0.73          1.00    1.00    0.49    0.44    0.42    0.39
En→De           1.00    0.86    0.89    1.00    0.83          –       1.00    0.58    0.55    0.55    0.55

Table 2: Summary of experimental results. "SL-lin"/"SL-sqrt" denote the SL curriculum with the clinear/csqrt competence models, and similarly for SR. For each method and dataset, we present the test set BLEU score of the best model based on validation set performance. We also show the relative time required to obtain the BLEU score of the best performing baseline model. For example, if an RNN gets to 26.27 BLEU in 10,000 steps and the SL curriculum gets to the same BLEU in 3,000 steps, then the plain model gets a score of 1.0 and the SL curriculum receives a score of 3,000/10,000 = 0.3. "Plain" stands for the model trained without a curriculum and, for Transformers, "Plain*" stands for the model trained using the learning rate schedule shown in Equation 9.

Furthermore, the square root competence model consistently outperforms the linear model, which fits well with our intuition and motivation for introducing it. Regarding the difficulty heuristics, sentence length and sentence rarity both result in similar performance.

We also observe that, for the two small datasets, RNNs converge faster than Transformers in terms of both the number of training iterations and the overall training time. This is contrary to other results in the machine translation community (e.g., Vaswani et al., 2017), but could be explained by the fact that we are not using any learning rate schedule for training Transformers. However, they never manage to outperform Transformers in terms of test BLEU score of the final model. Furthermore, to the best of our knowledge, for IWSLT-15 we achieve state-of-the-art performance. The highest previously reported result was 29.03 BLEU (Platanios et al., 2018), in a multi-lingual setting. Using our curriculum learning approach we are able to achieve a BLEU score of 29.81 for this dataset.

Overall, we have shown that our curriculum learning approach consistently outperforms models trained without any curriculum, in both limited data settings and large-scale settings.

Learning Rate Schedule. In all of our IWSLT experiments so far, we use the default AMSGrad learning rate of 0.001 and intentionally avoid using any learning rate schedules. However, Transformers are not generally trained without a learning rate schedule, due to their instability. Such schedules typically use a warm-up phase, which means that the learning rate starts at a very low value and keeps increasing until the end of the warm-up period, after which a decay rate is typically used. In order to show that our curriculum learning approach can act as a principled alternative to such highly tuned learning rate schedules, we now present the results we obtain when training our Transformers using the following learning rate schedule:

$\text{lr}(t) \triangleq d_{\text{embedding}}^{-0.5} \cdot \min\!\left(t^{-0.5},\; t \cdot T_{\text{warmup}}^{-1.5}\right)$,  (9)

where t is the current training step, $d_{\text{embedding}}$ is the word embedding size, and $T_{\text{warmup}}$ is the number of warmup steps and is set to 10,000 in these experiments. This schedule was proposed in the original Transformer paper (Vaswani et al., 2017), and was tuned for the WMT dataset.

The results obtained when using this learning rate schedule are also shown in Table 2, under the name "Plain*". In both cases, our curriculum learning approach obtains a better model in about 70% less training time. This is very important, especially when applying Transformers to new datasets, because such learning rate heuristics often require careful tuning. This tuning can be both very expensive and time consuming, often resulting in very complex mathematical expressions, with no clear motivation or intuitive explanation (Chen et al., 2018). Our curriculum learning approach achieves better results, in significantly less time, while only requiring one parameter (the length of the curriculum).

Note that even without using any learning rate schedule, our curriculum methods were able to achieve performance comparable to the "Plain*" model in about twice as many training steps. "Plain" was not able to achieve a BLEU score above 2.00 even after five times as many training steps, at which point we stopped these experiments.
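For reference, the schedule of Equation 9 is straightforward to write down in code. This is the standard warm-up/decay schedule of Vaswani et al. (2017), shown here only to contrast its interacting parameters with the single curriculum-length parameter T; the function name is ours.

```python
def transformer_lr(step, d_embedding=512, warmup_steps=10000):
    """Equation 9: inverse-square-root decay with a linear warm-up phase."""
    step = max(step, 1)  # avoid division by zero at step 0
    return d_embedding ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```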
Implementation and Reproducibility. We are releasing an implementation of our proposed method and experiments built on top of the machine translation library released by Platanios et al. (2018), using TensorFlow Scala (Platanios, 2018); it is available at https://github.com/eaplatanios/symphony-mt. Furthermore, all experiments can be run on a machine with a single Nvidia V100 GPU and 24 GBs of system memory. Our most expensive experiments — the ones using Transformers on the WMT-16 dataset — take about 2 days to complete, which would cost about $125 on a cloud computing service such as Google Cloud or Amazon Web Services, thus making our results reproducible, even by independent researchers.

4 Related Work

The idea of teaching algorithms in a similar manner as humans, from easy concepts to more difficult ones, has existed for a long time (Elman, 1993; Krueger and Dayan, 2009). Machine learning models are typically trained using stochastic gradient descent methods, by uniformly sampling mini-batches from the pool of training examples and using them to compute updates for the model parameters. Deep neural networks, such as RNNs and Transformers, have highly non-convex loss functions. This makes them prone to getting stuck in saddle points or bad local minima during training, often resulting in long training times and bad generalization performance. Bengio et al. (2009) propose a curriculum learning approach that aims to address these issues by changing the mini-batch sampling strategy. They propose starting with a distribution that puts more weight on easy samples, and gradually increasing the probability of more difficult samples as training progresses, eventually converging to a uniform distribution. They demonstrate empirically that such curriculum approaches indeed help decrease training times and sometimes even improve generalization.

Perhaps the earliest attempt to apply curriculum learning in MT was made by Zou et al. (2013). The authors employed a curriculum learning method to learn Chinese-English bilingual word embeddings, which were subsequently used in the context of phrase-based machine translation. They split the word vocabulary into 5 separate groups based on word frequency, and learned separate word embeddings for each of these groups in parallel. Then, they merged the 5 different learned embeddings and continued training using the full vocabulary. While this approach makes use of some of the ideas behind curriculum learning, it does not directly follow the original definition introduced by Bengio et al. (2009). Moreover, their model required 19 days to train. There have also been a couple of attempts to apply curriculum learning in NMT, which were discussed in Section 1.

There also exists some relevant work in areas other than curriculum learning. Zhang et al. (2016) propose training neural networks for NMT by focusing on hard examples, rather than easy ones. They report improvements in BLEU score, while only using the hardest 80% of the training examples in their corpus. This approach is more similar to boosting (Schapire, 1999) than to curriculum learning, and it does not help speed up the training process; it rather focuses on improving the performance of the trained model. The fact that hard examples are used instead of easy ones is interesting because it is somewhat contradictory to curriculum learning. Also, in contrast to curriculum learning, no ordering of the training examples is considered.

Perhaps another related area is that of active learning, where the goal is to develop methods that request specific training examples. Haffari et al. (2009), Bloodgood and Callison-Burch (2010), and Ambati (2012) all propose methods to solicit training examples for MT systems, based on the occurrence frequency of n-grams in the training corpus. The main idea is that if an n-gram is very rare in the training corpus, then it is difficult to learn to translate sentences in which it appears. This is related to our sentence rarity difficulty metric and points out an interesting connection between curriculum learning and active learning.

Regarding training Transformer networks, Shazeer and Stern (2018) perform a thorough experimental evaluation of Transformers when using different optimization configurations. They show that a significantly higher level of performance can be reached by not using momentum during optimization, as long as a carefully chosen learning rate schedule is used. Such learning rate schedules are often hard to tune because of the multiple seemingly arbitrary terms they often contain. Furthermore, Popel and Bojar (2018) show that, when using Transformers, increasing the batch size results in a better model at convergence. We believe this is indicative of very noisy gradients when starting to train Transformers and that higher batch sizes help increase the signal-to-noise ratio. We show that our proposed curriculum learning method offers a more principled and robust way to tackle this problem. Using our approach, we are able to train Transformers to state-of-the-art performance, using small batch sizes and without the need for peculiar learning rate schedules, which are typically necessary.

5 Conclusion and Future Work

We have presented a novel competence-based curriculum learning approach for training neural machine translation models. Our resulting framework is able to boost the performance of existing NMT systems, while at the same time significantly reducing their training time. It differs from previous approaches in that it does not depend on multiple hyperparameters that can be hard to tune, and it does not depend on a manually designed discretized training regime. We define the notions of competence, for a learner, and difficulty, for the training examples, and propose a way to filter training data based on these two quantities. Perhaps most interestingly, we show that our method makes training Transformers faster and more reliable, but has a much smaller effect in training RNNs.

In the future, we are mainly interested in: (i) exploring more difficulty heuristics, such as measures of alignment between the source and target sentences (Kocmi and Bojar, 2017), sentence length discrepancies, or even using a pre-trained language model to score sentences, which would act as a more robust replacement of our sentence rarity heuristic, and (ii) exploring more sophisticated competence metrics that may depend on the loss function, the loss gradient, or on the learner's performance on held-out data. Furthermore, it would be interesting to explore applications of curriculum learning to multilingual machine translation (e.g., it may be easier to start with high-resource languages and move to low-resource ones later on). We would also like to explore the usefulness of our framework in more general machine learning tasks, outside of NMT.

Acknowledgments

We would like to thank Maruan Al-Shedivat and Dan Schwartz for the useful feedback they provided on early versions of this paper. This research was supported in part by AFOSR under grant FA95501710218.
References

Vamshi Ambati. 2012. Active Learning and Crowdsourcing for Machine Translation in Low Resource Scenarios. Ph.D. thesis, Pittsburgh, PA, USA. AAI3528171.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In International Conference on Learning Representations.

Rachel Bawden, Rico Sennrich, Alexandra Birch, and Barry Haddow. 2018. Evaluating discourse phenomena in neural machine translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1304–1313, New Orleans, Louisiana. Association for Computational Linguistics.

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pages 41–48, New York, NY, USA. ACM.

Michael Bloodgood and Chris Callison-Burch. 2010. Bucking the trend: Large-scale cost-focused active learning for statistical machine translation. In ACL.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Shujian Huang, Matthias Huck, Philipp Koehn, Qun Liu, Varvara Logacheva, et al. 2017a. Findings of the 2017 Conference on Machine Translation (WMT17). In Proceedings of the Second Conference on Machine Translation, pages 169–214.

Ondřej Bojar, Jindřich Helcl, Tom Kocmi, Jindřich Libovický, and Tomáš Musil. 2017b. Results of the WMT17 Neural MT Training Task. In Proceedings of the Second Conference on Machine Translation, Volume 2: Shared Task Papers, pages 525–533. Association for Computational Linguistics.

Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Mike Schuster, Noam Shazeer, Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Zhifeng Chen, Yonghui Wu, and Macduff Hughes. 2018. The Best of Both Worlds: Combining Recent Advances in Neural Machine Translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Long Papers), pages 76–86. Association for Computational Linguistics.

Josep Maria Crego, Jungi Kim, Guillaume Klein, Anabel Rebollo, Kathy Yang, Jean Senellart, Egor Akhanov, Patrice Brunelle, Aurelien Coquard, Yongchao Deng, Satoshi Enoue, Chiyo Geiss, Joshua Johanson, Ardas Khalsa, Raoum Khiari, Byeongil Ko, Catherine Kobus, Jean Lorieux, Leidiana Martins, Dang-Chuan Nguyen, Alexandra Priori, Thomas Riccardi, Natalia Segal, Christophe Servan, Cyril Tiquet, Bo Wang, Jin Yang, Dakun Zhang, Jing Zhou, and Peter Zoldan. 2016. SYSTRAN's Pure Neural Machine Translation Systems. CoRR, abs/1610.05540.

Jeffrey L. Elman. 1993. Learning and development in neural networks: The importance of starting small. Cognition, 48(1):71–99.

Gholamreza Haffari, Maxim Roy, and Anoop Sarkar. 2009. Active learning for statistical phrase-based machine translation. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, NAACL '09, pages 415–423, Stroudsburg, PA, USA. Association for Computational Linguistics.

Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1700–1709.

Tom Kocmi and Ondřej Bojar. 2017. Curriculum Learning and Minibatch Bucketing in Neural Machine Translation. In Proceedings of the International Conference Recent Advances in Natural Language Processing, pages 379–386.

Kai A. Krueger and Peter Dayan. 2009. Flexible shaping: How learning in small steps helps. Cognition, 110(3):380–394.

Frederick Liu, Han Lu, and Graham Neubig. 2018. Handling homographs in neural machine translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1336–1345, New Orleans, Louisiana. Association for Computational Linguistics.

Emmanouil A. Platanios. 2018. TensorFlow Scala. https://github.com/eaplatanios/tensorflow_scala.

Emmanouil Antonios Platanios, Mrinmaya Sachan, Graham Neubig, and Tom Mitchell. 2018. Contextual Parameter Generation for Universal Neural Machine Translation. In Conference on Empirical Methods in Natural Language Processing (EMNLP), Brussels, Belgium.

Martin Popel and Ondřej Bojar. 2018. Training Tips for the Transformer Model. The Prague Bulletin of Mathematical Linguistics, 110(1):43–70.

Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. 2018. On the Convergence of Adam and Beyond. In International Conference on Learning Representations.

Robert E. Schapire. 1999. A brief introduction to boosting. In Proceedings of the 16th International Joint Conference on Artificial Intelligence - Volume 2, IJCAI'99, pages 1401–1406, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 1715–1725.

Burr Settles and Mark Craven. 2008. An analysis of active learning strategies for sequence labeling tasks. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 1070–1079, Honolulu, Hawaii. Association for Computational Linguistics.

Noam Shazeer and Mitchell Stern. 2018. Adafactor: Adaptive Learning Rates with Sublinear Memory Cost. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4596–4604, Stockholmsmässan, Stockholm, Sweden. PMLR.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. CoRR, abs/1609.08144.

Dakun Zhang, Jungi Kim, Josep Crego, and Jean Senellart. 2016. Boosting neural machine translation. arXiv preprint arXiv:1612.06138.

Xuan Zhang, Gaurav Kumar, Huda Khayrallah, Kenton Murray, Jeremy Gwinnup, Marianna J. Martindale, Paul McNamee, Kevin Duh, and Marine Carpuat. 2018. An Empirical Exploration of Curriculum Learning for Neural Machine Translation. CoRR, abs/1811.00739.

Will Y. Zou, Richard Socher, Daniel Cer, and Christopher D. Manning. 2013. Bilingual Word Embeddings for Phrase-Based Machine Translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1393–1398. Association for Computational Linguistics.

