Thesis Amended
Poorya Zaremoodi
Thesis
Submitted for the fulfilment of the requirements for the degree of
Doctor of Philosophy
January, 2020
To my parents.
© Copyright by
Poorya Zaremoodi
2020
Except as provided in the Copyright Act 1968, this thesis may not be reproduced in
any form without the written permission of the author.
I certify that I have made all reasonable efforts to secure copyright permissions for
third-party content included in this thesis and have not knowingly added copyright
content to my work without the owner’s permission.
Declaration
I hereby declare that this thesis contains no material which has been accepted for
the award of any other degree or diploma at any university or equivalent institution
and that, to the best of my knowledge and belief, this thesis contains no material
previously published or written by another person, except where due reference is
made in the text of the thesis.
Poorya Zaremoodi
January 30, 2020
Abstract
Acknowledgements
Contents
List of Abbreviations 1
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Transduction of Complex Structures for Incorporating Linguis-
tic Annotations . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.2 Injecting Linguistic Inductive Biases via Multi-Task Learning . 4
1.2 Research Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Thesis Outline and Contributions . . . . . . . . . . . . . . . . . . . . 7
2 Background 10
2.1 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.1 Multi-Layer Perceptron (MLP) . . . . . . . . . . . . . . . . . 11
2.1.2 Recurrent Neural Network (RNN) . . . . . . . . . . . . . . . . 12
2.1.3 Backpropagation . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.4 Long Short-Term Memory (LSTM) . . . . . . . . . . . . . . . 15
2.1.4.1 Tree-LSTM . . . . . . . . . . . . . . . . . . . . . . . 17
2.1.5 Regularisation . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.1.5.1 Early Stopping . . . . . . . . . . . . . . . . . . . . . 19
2.1.5.2 Dropout . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.1.6 Deep Learning for NLP . . . . . . . . . . . . . . . . . . . . . . 21
2.1.6.1 Word Embedding . . . . . . . . . . . . . . . . . . . . 21
2.1.6.2 Statistical Language Modelling . . . . . . . . . . . . 22
2.2 Neural Machine Translation . . . . . . . . . . . . . . . . . . . . . . . 24
2.2.1 Seq2Seq model . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2.2 Attentional Seq2Seq model . . . . . . . . . . . . . . . . . . . 26
2.2.3 Convolutional Seq2Seq model . . . . . . . . . . . . . . . . . 28
2.2.4 Self-attention Seq2Seq model . . . . . . . . . . . . . . . . . . 30
2.2.5 Evaluation metrics . . . . . . . . . . . . . . . . . . . . . . . . 31
2.3 Low-Resource Neural Machine Translation . . . . . . . . . . . . . . . 33
2.3.1 Incorporate Linguistic Annotation by Transduction of Complex
Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.3.2 Multi-Task Learning for Directly Injecting Auxiliary Knowledge 39
2.3.2.1 Multi-Task Learning . . . . . . . . . . . . . . . . . . 40
2.3.2.2 Multi-Task Learning and Transfer Learning . . . . . 41
2.3.2.3 Multi-Task Learning in Practice . . . . . . . . . . . . 42
2.3.2.4 Multi-task learning for injecting linguistic knowledge
in NMT . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.3.3 Other Approaches . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.3.3.1 Active Learning . . . . . . . . . . . . . . . . . . . . . 45
2.3.3.2 Back-translation and Dual Learning . . . . . . . . . 46
2.3.3.3 Adversarial training . . . . . . . . . . . . . . . . . . 47
2.3.3.4 Zero/Few Shot Learning . . . . . . . . . . . . . . . . 47
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
II Multi-Task Learning: Architectural Design 61
7 Learning to Multi-Task Learn 100
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
7.2 MTL training schedule as a Markov Decision Process . . . . . . . . . 102
7.3 An Oracle Policy for MTL-MDP . . . . . . . . . . . . . . . . . . . . . 105
7.4 Learning to Multi-Task Learn . . . . . . . . . . . . . . . . . . . . . . 107
7.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
7.5.1 MTL architectures . . . . . . . . . . . . . . . . . . . . . . . . 110
7.5.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
7.6 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
7.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
Bibliography 120
List of Figures
1.1 BLEU scores (higher is better) for PBSMT and NMT models on Ger-
man → English translation task using the TED data from the IWSLT
2014 shared translation task. The number of words in parallel train-
ing data varies from 0.1 to 3.2 million. Image source: (Sennrich and
Zhang, 2019). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.13 A two-layer syntactic GCN on top of the embeddings, updating them
concerning a dependency parse tree. To simplify image, gates and some
labels are removed. Image source: (Bastings et al., 2017) . . . . . . . 37
2.14 An example of a source sentence, and its translation in the form of
linearised lexicalised constituency tree (Aharoni and Goldberg, 2017). 39
4.1 Percentage of more correct n-grams generated by the deep MTL models
compared to the single-task model (only MT) for En→Vi translation. 73
4.2 BLEU scores for different numbers of shared layers in encoder (from
top) and decoder (from bottom). The vocabulary is shared among
tasks while each task has its own attention mechanism. . . . . . . . 74
4.3 Percentage of more correct n-grams with at least one noun generated
by the MT+NER model compared with the single-task model (only MT) for
En→Vi language pair. . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.4 Weights assigned to the training pairs of different tasks (averaged over
200 update iteration chunks). Y-axis shows the average weight and
X-axis shows the number of update iteration. In the top figures, the
main translation task is English→Spanish while in the bottom ones it
is English→Turkish. . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.5 The number of words in the gold English→Spanish translation which
are missed in the generated translations (lower is better). Missed words
are categorised by their tags (Part-of-Speech and named-entity types). 98
List of Notations
Other Symbols
w_i^{(k)}  The weight of the i-th sentence pair of task k
W , U Parameter matrices
List of Abbreviations
Abbreviation Definition
AI Artificial Intelligence
AL Active Learning
GRU Gated Recurrent Unit
IL Imitation Learning
LM Language Model
LSTM Long Short-Term Memory
MDP Markov Decision Process
MT Machine Translation
MTL Multi-Task Learning
NLP Natural Language Processing
NMT Neural Machine Translation
RL Reinforcement Learning
RNN Recurrent Neural Network
Chapter 1
Introduction
1.1 Motivation
Deep learning is a comparatively new player in artificial intelligence (AI), but with an old history: its origins go back decades, to when the subject of artificial neural networks started (McCulloch and Pitts, 1943; Rosenblatt, 1958; Minsky and Papert, 1969). Neural networks are, in fact, representation learning methods based on a multi-layer hierarchy of data abstraction. The power of neural networks is rooted in the fact that the representations are learned automatically, in a data-driven manner, via a general-purpose algorithm, removing the need for hand-engineered features. Thanks to the recent advances in parallel computing and the abundance of big data, deep neural networks, aka deep learning, have started to transform AI. The deep learning tsunami1 has lapped the shores of many areas in AI, including computer vision, self-driving vehicles, robotics, natural language processing, and more specifically Machine Translation (MT).
Machine Translation is the task of automatically translating a text from one nat-
ural language to another. The translation should preserve the meaning of the source
sentence while being fluent in the target language. Statistical Machine Translation
(SMT) models dominated this area for decades. The most successful SMT model is
phrase-based (Koehn et al., 2003; Brown et al., 1990), which segments the source sen-
tence into phrases, translates these phrases, and re-orders them. The SMT systems
consist of several sub-components which should be tuned separately, some of them
requiring hand-engineered features along with large amounts of memory to store in-
formation in the form of phrase tables/dictionaries.
1 A term coined by Manning (2015).
The deep learning revolution has now reached MT, leading to the state-of-the-
art for many language pairs (Luong et al., 2015b,a; Wu et al., 2016; Edunov et al.,
2018; Wu et al., 2019). This approach, called Neural Machine Translation (NMT),
translates via a large neural network which reads the input sentence and outputs its
corresponding translation.
The advantage of NMT models is their ability to learn end-to-end, without the many brittle design choices and hand-engineered features of SMT. However, as stated by Andrew Ng, one of the deep learning pioneers, deep learning is a powerful yet data-hungry technology.2 Unfortunately, we do not have the luxury of large parallel datasets for many languages, a setting referred to as the bilingually low-resource scenario. Koehn and Knowles (2017) and Lample et al. (2018b) have reported that NMT needs large amounts of bilingual training data to achieve reasonable translation quality; furthermore, it underperforms phrase-based SMT (PBSMT) models in bilingually low-resource scenarios. These findings have motivated research on improving low-resource NMT, mostly by incorporating linguistic and monolingual resources.
More recently, Sennrich and Zhang (2019) have re-visited the case of low-resource NMT and shown that low-resource NMT with an optimised practice3 is, in fact, a realistic option for low-resource regimes as well; see Figure 1.1. Further, they argue that “best practices differ between high-resource and low-resource settings”. To summarise, there are two key messages in these findings: (1) low-resource NMT is very sensitive, and its best practice is different from that of high-resource NMT; (2) the performance of an NMT model trained with a small amount of training data is still much lower than that of one trained with millions of sentence pairs, and there is great room for improvement to take full advantage of NMT models in small bilingual training data conditions. These key findings show the importance of, and need for, investigating NMT specifically for bilingually low-resource scenarios.
To compensate for the lack of bilingual data, a practical approach is to use auxiliary data in the form of curated monolingual linguistic resources, monolingual sentences or multilingual sentence pairs (a mixture of high- and low-resource). In this thesis, the focus is on using curated monolingual resources; however, this work is orthogonal to the other approaches and could be combined with them, e.g., combining linguistic resources and monolingual sentences. More specifically, we assume that rich
2 https://www.wired.com/brandlab/2015/05/andrew-ng-deep-learning-mandate-humans-not-just-machines/
3 In terms of architectural design, techniques and hyperparameters such as dropout, BPE (and its vocabulary size), tying parameters, and others.
Figure 1.1: BLEU scores (higher is better) for PBSMT and NMT models on German → English translation task using the TED data from the IWSLT 2014 shared translation task. The number of words in parallel training data varies from 0.1 to 3.2 million. Image source: (Sennrich and Zhang, 2019).
linguistic annotation is available on the source language. The linguistic resources are available in the form of annotated datasets, e.g., treebanks for syntactic parsing or part-of-speech tagged sentences. There are two main approaches for incorporating linguistic knowledge in MT:

1. Use these monolingual resources to build tools which can then be used to annotate bilingual sentences. These annotations convey linguistic information in the form of combinatorial structures, and modelling them changes the translation task from a sequence-to-sequence task to the transduction of more complex combinatorial structures, e.g., tree-to-sequence (Eriguchi et al., 2016; Chen et al., 2017a).

2. Directly injecting linguistic knowledge into the translation model to improve its performance. Multi-Task Learning (MTL) is an effective approach to inject knowledge into a task, which is learned from other related tasks, aka auxiliary tasks.
1.1.1 Transduction of Complex Structures for Incorporating
Linguistic Annotations
One of the central premises about natural language is that words of a sentence are
interrelated according to a (latent) hierarchical structure (Chomsky, 1957), i.e., the
syntactic tree. Therefore, it is expected that modelling the syntactic structure should
improve the performance of NMT, especially in low-resource or linguistically divergent
scenarios, such as English-Farsi, by learning better reorderings. Recently, Eriguchi
et al. (2016); Chen et al. (2017a) have proposed methods to incorporate the hier-
archical syntactic constituency information of the source sentence. They propose
tree-to-sequence NMT models, which use the top-1 parse tree of the source sentence
generated by a parser. They showed that the syntax-aware NMT model outperforms
the vanilla sequence-to-sequence NMT model.
This approach relies on the accuracy of the top-1 parse tree, while automatically generated annotations are error-prone. As mentioned, curated monolingual linguistic resources can be used to build tools which then can provide the NMT model with the linguistic annotations of the source sentences. One should note that the automatically generated annotations incorporate errors and uncertainties of the tools, e.g., generated top-1 syntactic constituency trees are prone to parser errors; moreover, they cannot capture semantic ambiguities of the source sentence.
Although this is a practical approach for one type of linguistic information, adding more types of linguistic annotation requires the model to be re-designed and makes it increasingly complicated.
et al., 2018). In addition, Dalvi et al. (2017) shows that explicitly injecting morphology can improve translation quality. Hence, incorporating linguistically based inductive biases is a promising direction to improve the generalisation of NMT models in low-resource data regimes.
Multi-Task Learning is a practical approach to acquire inductive biases from re-
lated tasks, and its primary goal is to improve the generalisation performance (Caru-
ana, 1997). MTL provides a general framework to incorporate different kinds of
linguistic information into the translation model, and multiple different types of lin-
guistic knowledge can be used (or added) without changing the model architecture
or the need to replace the model with a more complicated one.
MTL is effective yet challenging. Various recent works have attempted to im-
prove NMT with an MTL approach (Domhan and Hieber, 2017; Zhang and Zong,
2016; Niehues and Cho, 2017; Kiperwasser and Ballesteros, 2018); although their re-
sults are promising, these methods usually suffer from shortcomings which prevent
them from maximally exploiting the knowledge in auxiliary tasks. These shortcom-
ings are mainly rooted in the fact that injecting knowledge from auxiliary tasks is
a double-edged sword. While “positive transfer” may help to improve the perfor-
mance of the main translation task, “negative transfer” (i.e. task interference) may
have unfavourable effects and degrade the translation quality. Addressing the negative transfer phenomenon is challenging and needs to be tackled at different levels: architectural design and the training schedule.
the training process, in order to make the best use of knowledge provided by auxiliary
tasks.
1.3 Thesis Outline and Contributions
Chapter 2: Background
In Chapter 2, we review the foundations and related work for the research in this
thesis. We start by discussing the foundations of deep learning and its use in NLP.
Then, we review NMT and focus on the case of bilingually low-resource scenarios.
Part I: This part is dedicated to the transduction of complex structures where the
model uses automatically generated annotations.
Part II: This part focuses on MTL in order to directly inject the linguistic knowl-
edge into the translation model. We contribute to the architectural design aspect of
MTL:
We make use of curated monolingual linguistic resources on the source side to improve NMT in bilingually scarce scenarios. We scaffold the machine translation task on auxiliary tasks including syntactic parsing, named-entity recognition and semantic parsing. To the best of our knowledge, our work is the first to inject semantic
knowledge into a neural translation model using MTL. This is achieved by casting
the auxiliary tasks as sequence-to-sequence (Seq2Seq) transduction tasks and tying
the parameters of these neural models with those of the main translation task. Our
MTL architecture makes use of deep stacked models, where the parameters of the
top layers are shared across the tasks. We further prevent the contamination of com-
mon knowledge with task-specific information using a technique so-called adversarial
training. Our extensive empirical results demonstrate the effectiveness of our MTL
approach in improving the translation quality.
We propose an MTL model, which learns how to control the amount of sharing
among all tasks dynamically. We propose a new neural recurrent unit by extending
existing ones to process multiple information flows through time. The proposed unit
is equipped with a trainable routing mechanism that enables adaptive collaboration
by dynamic sharing of information flows conditioned on the task at hand, input, and
model state. Our experimental results and analyses show the effectiveness of the proposed approach in leveraging commonalities among subsets of tasks without the need to search over parameter-sharing strategies.
Part III: This part is focused on the effective training of an MTL model for making
the best use of auxiliary linguistic knowledge in translation.
We propose a rigorous general approach for adaptively changing the training schedule
in MTL to make the best use of auxiliary syntactic and semantic tasks. To balance
the importance of the auxiliary tasks versus the main translation task, we re-weight
training data of the tasks based on their contributions to the generalisation capabilities of the resulting translation model. Our main idea is to learn these importance weights automatically by maximising the performance of the main task on a validation set, separate from the training set, in each parameter update step. In addition, our analysis sheds light on the relative importance of the auxiliary linguistic tasks throughout the training of an MTL model, showing a shift from syntax towards semantics.
Chapter 2
Background
In this chapter, the foundations and prior related works for the research in this the-
sis are overviewed. As the aim of this thesis is to improve bilingually low-resource
Neural Machine Translation (NMT) by injecting linguistic knowledge, we review the
recent progress in: deep learning fundamentals, Neural Machine Translation and low-
resource NMT.
In Section 2.1, we start by overviewing fundamental concepts and methods in
deep learning by discussing multi-layer perceptron (MLP), recurrent neural network
(RNN) and its variants along with two popular regularisation techniques. Then, we
cover the use of deep learning in the fundamentals of NLP: word embeddings and statistical language modelling.
In Section 2.2, we review the recent progress in NMT and discuss the leading
families of architectures based on RNN, convolution and self-attention. Moreover, we
describe how to evaluate the performance of an NMT model using different metrics.
In Section 2.3, we outline recent progress in bilingually low-resource NMT with the
focus on describing how monolingual resources are used to compensate for the lack of
bilingual data. Also, we narrow down on the usage of linguistic knowledge and explain
two commonly used techniques for doing so: (1) transduction of complex structures
for incorporating linguistic annotations; (2) Multi-Task Learning for directly injecting
the linguistic knowledge into the NMT system.
Figure 2.1: (a) A single perceptron, computing φ(Σ_i w_i x_i + b) from the inputs x_1, …, x_n; (b) a multi-layer perceptron with an input layer (x), n hidden layers (h_1, …, h_n) and an output layer (y).
A single layer of neurons (perceptrons) can learn simple mappings but is not able to learn data that is not linearly separable. The MLP (Rumelhart et al., 1985) addresses this issue by adding multiple intermediate layers where the inputs to neurons in a layer are the outputs of neurons in previous layers, as shown in Figure 2.1b. The output calculated by this network is equivalent to the following equation:

y = φ_o(U φ_n(W^{(n)} φ_{n−1}(… φ_1(W^{(1)} x) …))),

where W^{(i)} is the weight matrix of the i-th hidden layer, U is the weight matrix connecting the last hidden layer to the output layer, and φ_i is the activation function. Generally, element-wise functions like the sigmoid and hyperbolic tangent are used in the hidden layers, and normalisation functions like the softmax are used for the output layer. If we define each layer i as a non-linear module m_i, with m_i(h) = φ_i(W^{(i)} h), then

y = φ_o(U (m_n ◦ m_{n−1} ◦ … ◦ m_1(x))),   (2.5)

where ◦ denotes function composition. As seen, the MLP is, in fact, composed of multiple layers of parametrised non-linear modules (Bengio et al., 2007). Each layer transforms the representation of its previous layer to a higher, slightly more abstract representation using a non-linear module: h_i = m_i(h_{i−1}).

The MLP is a simple yet powerful computational model, and an MLP with as little as one hidden layer is proven to be a universal function approximator under certain conditions (Hornik, 1991).
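To make the layer composition in eqn. (2.5) concrete, the following is a minimal NumPy sketch (not code from this thesis); the layer sizes, the tanh hidden activations and the softmax output are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()               # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def mlp_forward(x, hidden_weights, U):
    """Compute y = phi_o(U (m_n o ... o m_1(x))) with tanh hidden layers."""
    h = x
    for W in hidden_weights:      # each module m_i(h) = phi_i(W^(i) h)
        h = np.tanh(W @ h)
    return softmax(U @ h)

# Toy example: 4-dimensional input, two hidden layers, 3-way output.
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(5, 4)), rng.normal(size=(5, 5))]
U = rng.normal(size=(3, 5))
x = rng.normal(size=4)
print(mlp_forward(x, Ws, U))      # a probability distribution over 3 classes
```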
(a) Elman RNN (b) unfolded Elman RNN (c) unfolded generative RNN
Figure 2.2: Elman RNN, unfolded Elman RNN and unfolded generative RNN.
MLP; see Figure 2.2b. Here, the representation in the hidden layer, aka the “state” of the model at time t, is calculated as:

h_t = φ_h(W_hh h_{t−1} + W_hi x_t),   (2.7)

where W_hh and W_hi are the weight matrices corresponding to the feedback loop and the input-to-hidden connection, respectively, φ_h is the activation function (e.g. sigmoid), and x_t is the input vector at time t. The output at time t is:

y_t = φ_o(W_oh h_t),   (2.8)

where W_oh is the weight matrix of the connection between the hidden and output layers, and φ_o is a normalisation function, e.g. softmax.
The generative RNN is another kind of RNN architecture, proposed to generate a sequence. In this model, the output is fed back as the next input (Figure 2.2c). In an NLP task, we normally use a normalisation function as the activation function of the output layer (φ_o), so the output vector y_i is a probability distribution over symbols (e.g., words) at time i. At each time step, we draw a sample symbol (word) from this distribution, and this symbol becomes the input of the network at the next time step.
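The following small sketch illustrates eqns. (2.7)–(2.8) and the generative feedback loop; the dimensions, the sigmoid/softmax choices and the random weights are my own illustrative assumptions, not settings from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
n_hidden, n_symbols = 8, 5
W_hh = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
W_hi = rng.normal(scale=0.5, size=(n_hidden, n_symbols))
W_oh = rng.normal(scale=0.5, size=(n_symbols, n_hidden))

def one_hot(i, n):
    v = np.zeros(n); v[i] = 1.0
    return v

def rnn_step(h_prev, x_t):
    h_t = 1.0 / (1.0 + np.exp(-(W_hh @ h_prev + W_hi @ x_t)))   # eqn (2.7), sigmoid
    p_t = np.exp(W_oh @ h_t); p_t /= p_t.sum()                  # eqn (2.8), softmax
    return h_t, p_t

# Generative RNN: feed the sampled symbol back as the next input (Figure 2.2c).
h, symbol = np.zeros(n_hidden), 0
for t in range(10):
    h, p = rnn_step(h, one_hot(symbol, n_symbols))
    symbol = rng.choice(n_symbols, p=p)
    print(symbol, end=" ")
```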
2.1.3 Backpropagation
Backpropagation (BP) is an algorithm for the supervised learning of neural models. It calculates the gradients in an MLP, which in turn is essential for training the model, and consists of two phases: forward and backward. In the forward phase, the input is fed to the model and the output is calculated. In the backward phase, BP calculates the error between the output y and the target output t with respect to an objective function, for example:

J(Θ) = (1/2) (t − y)^2,   (2.9)

where Θ denotes all weights and biases in the network. For each node i in layer l, BP calculates a local gradient δ_i^{(l)}, a measure of how much the node was responsible for the error in the output. The gradients are then fed into an optimisation algorithm like Stochastic Gradient Descent (SGD) to update the weights:

W_{ij}^{(l)} = W_{ij}^{(l)} − η ∂J(Θ)/∂W_{ij}^{(l)},   (2.10)

where W_{ij}^{(l)} refers to the weight connecting the j-th node of layer (l − 1) to the i-th node of layer l, and η is the step size (aka learning rate).
We can calculate the derivative of the error with respect to each weight in the network using the chain rule. There is a systematic way to perform it layer-by-layer: we calculate the local gradients of the nodes layer-by-layer (in descending order) and then use them to calculate the gradients of the weights. The local gradient of a neuron at the output layer is calculated as:

δ_j^{(o)} = (t_j − y_j) φ'_o(v_j^{(o)}),   (2.11)

where v_j^{(o)} is the input of the neuron (the inner product of the previous layer's signals with the corresponding weights). The error is backpropagated to the other layers with the chain rule, which results in the following equation for the local gradient of the j-th node in the l-th layer:

δ_j^{(l)} = φ'(v_j^{(l)}) Σ_k δ_k^{(l+1)} W_{kj}^{(l+1)}.   (2.12)

Now we calculate the desired partial derivatives as follows:

∂J(Θ)/∂W_{ij}^{(l+1)} = φ_j(v_j^{(l)}) δ_i^{(l+1)},   (2.13)

where v_j^{(l)} is the input of the j-th neuron in the l-th layer.
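The recursion above can be checked numerically. Below is a sketch (my own toy setup, not code from the thesis) of backpropagation for a one-hidden-layer sigmoid network with the squared error of eqn. (2.9), verified against a finite-difference gradient.

```python
import numpy as np

rng = np.random.default_rng(1)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))   # two weight layers
x, t = rng.normal(size=3), rng.normal(size=2)               # input and target
sigma = lambda v: 1.0 / (1.0 + np.exp(-v))

def loss(W1, W2):
    y = sigma(W2 @ sigma(W1 @ x))
    return 0.5 * np.sum((t - y) ** 2)                        # eqn (2.9)

# Backward pass: local gradients dJ/dv per layer, then weight gradients
# (cf. eqns 2.11-2.13, up to the sign convention used there).
v1 = W1 @ x;  a1 = sigma(v1)
v2 = W2 @ a1; y  = sigma(v2)
delta2 = (y - t) * y * (1 - y)            # output-layer local gradient
delta1 = (W2.T @ delta2) * a1 * (1 - a1)  # backpropagated via the chain rule
grad_W2 = np.outer(delta2, a1)            # dJ/dW2[i, j] = delta2[i] * a1[j]
grad_W1 = np.outer(delta1, x)

# Finite-difference check on one weight.
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
numeric = (loss(W1p, W2) - loss(W1, W2)) / eps
print(grad_W1[0, 0], numeric)             # the two values should be close
```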
Backpropagation is designed for networks without a loop, e.g. the MLP. There is a modified version of it, called Backpropagation Through Time (BPTT), which is used for RNNs. It simply unrolls the RNN and applies the standard BP algorithm. In BPTT, for each weight, we sum its gradients at the different time steps.
Hochreiter (1991) first discovered the vanishing and exploding gradient problem in training RNNs using backpropagation. Here we analyse this problem in the one-dimensional scenario as it is easier to understand; the extension to higher dimensions can be found in (Pascanu et al., 2013). The hidden state of the RNN at time t can be calculated by eqn. 2.7. If we simplify it by assuming only one neuron in the hidden layer and also removing the input, then we have:

h_t = φ(w_hh h_{t−1}).
In order to show the problem more clearly, we directly use the chain rule instead of
the systematic layer-by-layer approach. Thus, by taking the derivative of the hidden
layer at time t + l with respect to it at time t we have:
∂h_{t+l}/∂h_t = ∏_{k=1}^{l} w_hh φ'(w_hh h_{t+(l−k)}) = w_hh^l ∏_{k=1}^{l} φ'(w_hh h_{t+(l−k)}).   (2.15)

The factor w_hh^l is what causes the problem: if the weight is not equal to 1, this term either decays to zero or grows exponentially fast. This vanishing or exploding of the gradient makes the RNN unable to learn long-term dependencies efficiently. One way to avoid this problem is to use the ReLU activation function:

ReLU(x) = max(0, x).

The derivative of this function is either 0 or 1; thus, it is less likely to suffer from the vanishing or exploding problem. A more popular approach is the Long Short-Term Memory, which is explained in the following.
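The effect of the w_hh^l factor in eqn. (2.15) can be observed numerically. The sketch below (toy weights and a tanh non-linearity chosen by me for illustration) multiplies the per-step derivative factors of the simplified one-neuron recurrence.

```python
import numpy as np

def grad_product(w_hh, steps=50, h0=0.1):
    """Product from eqn (2.15): prod_k w_hh * phi'(w_hh * h), with phi = tanh."""
    h, grad = h0, 1.0
    for _ in range(steps):
        pre = w_hh * h
        h = np.tanh(pre)
        grad *= w_hh * (1.0 - h ** 2)     # w_hh * tanh'(pre)
    return grad

for w in (0.5, 0.9, 1.5):
    # The bare factor w_hh**steps decays or explodes exponentially with l;
    # the saturating non-linearity can only shrink the actual product further.
    print(f"w_hh={w}: product={grad_product(w):.3e}, w_hh^l={w ** 50:.3e}")
```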
2.1.4 Long Short-Term Memory (LSTM)

Figure 2.3: The gates of an LSTM unit (forget, input and output), each computed from the concatenation [x_t; h_{t−1}].
forget the old one. In the LSTM model, the forget gate is fed with the previous hidden state and the current input:

f_t = σ(W_f [x_t; h_{t−1}] + b_f).

Now, for a higher-level regulation of the added information, we calculate an input gate, which is very similar to the forget gate but with a different set of weights:

i_t = σ(W_i [x_t; h_{t−1}] + b_i).

Now everything is ready to update the cell state. In terms of the language modelling example, we want to drop the information related to the old subject and replace it with the new one, using the candidate content and the updated cell state:

c̃_t = tanh(W_c [x_t; h_{t−1}] + b_c),
c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t.
Figure 2.4: Tree and sequence structured LSTMs. Here we used “A” to show an
LSTM unit and emphasise that the unit and weights remain the same while traversing
the structure.
The output of the LSTM unit, h_t, is the hidden representation of the LSTM, which can be used by exterior units. The output is a filtered version of the cell, and the output gate is responsible for gating the information:

o_t = σ(W_o [x_t; h_{t−1}] + b_o).

A hyperbolic tangent is applied to the cell to push its values into the [−1, 1] range, so the output of the unit is:

h_t = o_t ⊙ tanh(c_t).   (2.22)

As seen, the cell is a linear unit with a fixed-weight self-connection (Hochreiter and Schmidhuber, 1997a), and it avoids the vanishing or exploding gradient problem. Inspired by the LSTM, Cho et al. (2014b) proposed the Gated Recurrent Unit (GRU), which modulates the flow of information using two gates, without having a separate memory cell.
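Putting the gates together, a single LSTM step can be sketched as follows. This is a minimal NumPy illustration with assumed dimensions and random parameters; the models used later in the thesis rely on library implementations.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid = 4, 6
d_cat = d_in + d_hid                                   # size of [x_t; h_{t-1}]
W_f, W_i, W_c, W_o = (rng.normal(scale=0.3, size=(d_hid, d_cat)) for _ in range(4))
b_f, b_i, b_c, b_o = (np.zeros(d_hid) for _ in range(4))
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([x_t, h_prev])                  # [x_t; h_{t-1}]
    f_t = sigmoid(W_f @ z + b_f)                       # forget gate
    i_t = sigmoid(W_i @ z + b_i)                       # input gate
    c_tilde = np.tanh(W_c @ z + b_c)                   # candidate cell content
    c_t = f_t * c_prev + i_t * c_tilde                 # cell update
    o_t = sigmoid(W_o @ z + b_o)                       # output gate
    h_t = o_t * np.tanh(c_t)                           # eqn (2.22)
    return h_t, c_t

h, c = np.zeros(d_hid), np.zeros(d_hid)
for x in rng.normal(size=(3, d_in)):                   # run over a short sequence
    h, c = lstm_step(x, h, c)
print(h)
```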
2.1.4.1 Tree-LSTM
The LSTM has the ability to preserve long-term information over time. This ability makes the LSTM a powerful tool for chain-structured topologies like temporal sequences. However, in some tasks, we deal with tree-structured data. Tai et al. (2015) proposed two extensions to the original LSTM architecture for tree-structured network topologies: the Child-sum Tree-LSTM and the N-ary Tree-LSTM.
In the standard LSTM, the unit updates the cell with respect to the current input and the state of the cell at the previous time step t − 1. We can think about this temporal sequence as a linear tree; see Figure 2.4a. Thus, an LSTM at position t is dependent on the current input and its child (the unit state at time t − 1). Now, a tree-structured topology (Figure 2.4b) can be considered as a generalisation of the linear tree structure. Here each node may have more than one child. Hence, the Tree-LSTM has a forget gate for each child.
Child-sum Tree-LSTM: If we are given a tree and C(j) denotes the children of node j, the transition functions for this unit are as follows:

h̃_j = Σ_{k∈C(j)} h_k,   (2.23)
i_j = σ(W^{(i)} x_j + U^{(i)} h̃_j + b^{(i)}),   (2.24)
f_jk = σ(W^{(f)} x_j + U^{(f)} h_k + b^{(f)}),   (2.25)
o_j = σ(W^{(o)} x_j + U^{(o)} h̃_j + b^{(o)}),   (2.26)
c̃_j = tanh(W^{(c̃)} x_j + U^{(c̃)} h̃_j + b^{(c̃)}),   (2.27)
c_j = Σ_{k∈C(j)} f_jk ⊙ c_k + i_j ⊙ c̃_j,   (2.28)
h_j = o_j ⊙ tanh(c_j).   (2.29)

Here, for each child k we calculate a forget gate f_jk. In this architecture, each component is conditioned on the sum of the children's hidden states h_k. Thus, it is a good option when the branching factor is high and the order of children is not important.
N-ary Tree-LSTM: This kind of Tree-LSTM is designed for the cases where the branching factor is at most N and the order of children is important. The mathematical equations of this model are as follows:

i_j = σ(W^{(i)} x_j + Σ_{l=1}^{N} U_l^{(i)} h̃_{jl} + b^{(i)}),   (2.30)
f_jk = σ(W^{(f)} x_j + Σ_{l=1}^{N} U_{kl}^{(f)} h̃_{jl} + b^{(f)}),   (2.31)
o_j = σ(W^{(o)} x_j + Σ_{l=1}^{N} U_l^{(o)} h̃_{jl} + b^{(o)}),   (2.32)
c̃_j = tanh(W^{(c̃)} x_j + Σ_{l=1}^{N} U_l^{(c̃)} h̃_{jl} + b^{(c̃)}),   (2.33)
c_j = Σ_{l=1}^{N} f_jl ⊙ c_{jl} + i_j ⊙ c̃_j,   (2.34)
h_j = o_j ⊙ tanh(c_j).   (2.35)

In contrast with the Child-sum Tree-LSTM, which shares a set of weights among children, this model keeps a different set of weights for each child.

Figure 2.5: An example of learning curves that show the behaviour of the negative log-likelihood loss for the training set and validation set; Image source: (Goodfellow et al., 2016).
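As an illustration of the Child-sum Tree-LSTM transition (eqns 2.23–2.29), the following sketch computes one node update; the dimensions and random parameters are illustrative assumptions of mine.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid = 3, 5
W = {g: rng.normal(scale=0.3, size=(d_hid, d_in)) for g in "ifoc"}
U = {g: rng.normal(scale=0.3, size=(d_hid, d_hid)) for g in "ifoc"}
b = {g: np.zeros(d_hid) for g in "ifoc"}
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def child_sum_node(x_j, child_h, child_c):
    """child_h, child_c: lists of the children's hidden and cell states."""
    h_sum = np.sum(child_h, axis=0) if child_h else np.zeros(d_hid)   # eqn (2.23)
    i = sigmoid(W["i"] @ x_j + U["i"] @ h_sum + b["i"])
    o = sigmoid(W["o"] @ x_j + U["o"] @ h_sum + b["o"])
    c_tilde = np.tanh(W["c"] @ x_j + U["c"] @ h_sum + b["c"])
    # One forget gate per child, conditioned on that child's own hidden state.
    f = [sigmoid(W["f"] @ x_j + U["f"] @ h_k + b["f"]) for h_k in child_h]
    c = i * c_tilde + sum(f_k * c_k for f_k, c_k in zip(f, child_c))
    h = o * np.tanh(c)                                                 # eqn (2.29)
    return h, c

# Two leaf children feed one parent node.
leaf1 = child_sum_node(rng.normal(size=d_in), [], [])
leaf2 = child_sum_node(rng.normal(size=d_in), [], [])
h_root, c_root = child_sum_node(rng.normal(size=d_in),
                                [leaf1[0], leaf2[0]], [leaf1[1], leaf2[1]])
print(h_root)
```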
2.1.5 Regularisation
A major challenge in deep learning and general machine learning is to develop a
method that performs well not just on the training data but also on new inputs.
Regularisation is an umbrella name for strategies that help to prevent a machine
learning model from overfitting and reduce test errors, possibly at the expense of an
increase in the training error (Goodfellow et al., 2016). In the following, we discuss
two popular methods for regularisation in deep models.
2.1.5.1 Early Stopping

When we train a model with a large set of parameters, the model may overfit the task. We often observe that the error on the training set steadily decreases while, at some point, the error on the validation set starts to increase (Figure 2.5).
In this approach, we save a copy of the model parameters whenever the error on the validation set decreases. We run an optimisation algorithm until the error on the validation set has not decreased for some amount of time. After termination, the last saved model is returned. By using this approach, we obtain a model with a better validation set error, which hopefully results in a better test set error.

Figure 2.6: The dropout mechanism. Left: a standard two-layer network. Right: a thinned version of the network after applying dropout and dropping the crossed units (Srivastava et al., 2014).
2.1.5.2 Dropout
Large neural networks are usually slow to run and train; thus, it is challenging to overcome overfitting with techniques like bagging (Breiman, 1996) that combine many different models. Dropout was proposed to deal with this problem (Srivastava et al., 2014). The idea is to randomly drop some neurons and their corresponding connections, which prevents the co-adaptation of neurons.

The standard dropout randomly drops neurons from the non-output layers at training time. It uses an independent Bernoulli random variable for each neuron to decide whether it should be retained (with probability p) or dropped. This process can be thought of as sampling from an exponential number of different “thinned” networks, as depicted in Figure 2.6. If the network has n neurons in the input and hidden layers, then this process samples from a set of 2^n different networks. As averaging the predictions of exponentially many “thinned” models is not feasible at test time, the authors proposed a simple approximate averaging method: at test time, they use the original network without dropping neurons, and the outgoing weights of each neuron are multiplied by p to obtain a scaled-down version of the trained weights.
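The train/test asymmetry described above can be sketched as follows; this is a simplified single-layer illustration (not the thesis's implementation), with p denoting the retention probability.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.8                                   # probability of retaining a neuron
W = rng.normal(size=(3, 10))              # weights out of a 10-unit hidden layer
h = rng.normal(size=10)                   # hidden-layer activations

# Training: sample a "thinned" network by dropping units independently.
mask = rng.binomial(1, p, size=h.shape)
y_train = W @ (h * mask)

# Test: keep all units but scale the (outgoing) weights by p,
# approximating the average over all 2^n thinned networks.
y_test = (W * p) @ h
print(y_train, y_test)
```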
Dropout is widely used in modern deep learning models and has been shown empirically to significantly lower generalisation error. Some works investigate a Bayesian perspective of dropout and propose different variants of it. Wang and Manning (2013) shows that applying dropout in training can be seen as a Monte Carlo approximation; they propose Gaussian dropout to do fast dropout training by sampling from, or integrating, a Gaussian approximation instead. Kingma et al. (2015) proposes Variational dropout, which is a generalisation of Gaussian dropout in which the optimal dropout rates are inferred from the data. Gal and Ghahramani (2016) proposes Monte Carlo dropout, which applies dropout at test time as well: it performs T stochastic forward passes through the network and averages the results.
embedding table of words can be learned end-to-end, it can also be pre-trained and used for transfer learning. There are different approaches for training the word embeddings; however, the idea behind all of them is that semantically similar words tend to appear within the same contexts. There are two main types of pre-training methods: (1) prediction-based methods, which learn the word embeddings in such a way as to improve the prediction of the target word given its context words, e.g. Word2Vec (Mikolov et al., 2013) for local context and Global Vectors (GloVe) (Pennington et al., 2014) for global context; (2) count-based methods, which construct a co-occurrence count matrix with the aim of capturing global statistics and apply dimensionality reduction techniques to the matrix (Church and Hanks, 1989; Deerwester et al., 1990; Turney and Pantel, 2010; Levy et al., 2015).
2.1.6.2 Statistical Language Modelling

Statistical Language Models (LMs) are one of the key building blocks in most natural language processing tasks, including machine translation. A statistical language model predicts the probability of a sequence of tokens, e.g., words (w_1 w_2 … w_T). It is used to predict the next token (e.g., character, word, sentence) considering the preceding context. At time t, the probability of token w_t given its preceding tokens is p(w_t | w_1, …, w_{t−1}), and by the chain rule the probability of the whole sequence is

p(w_1, …, w_T) = ∏_{t=1}^{T} p(w_t | w_1, …, w_{t−1}).
Figure 2.7: The architecture of the Neural Probabilistic Language Model (Bengio et al., 2003): a look-up into a word-embedding matrix C (shared across word positions), a tanh hidden layer, and a softmax output layer.
Neural Probabilistic Language Model. Bengio et al. (2003) proposed the first Neural Probabilistic Language Model (NPLM) to address the sparsity and curse-of-dimensionality issues of n-gram LMs by learning a distributed representation for words. As depicted in Figure 2.7, the core idea is to convert words to distributed embeddings and use a feed-forward neural network to predict the next word. As the feed-forward neural network requires a fixed-size input, each word is conditioned on a fixed-size window of its context, p(w_i | w_{i−1}, …, w_{i−n+1}).
Recurrent neural network language models remove this fixed-window limitation by summarising the preceding context in a hidden state:

h_i = RNN(h_{i−1}, E[w_i]),

where E is the distributed embedding table. The hidden state of the RNN captures the context of all the previous words (w_{i−1}, …, w_1), and the probability of the i-th word can be determined as follows:

w_i ∼ softmax(W_oh h_{i−1}).
2.2 Neural Machine Translation

2.2.1 Seq2Seq model

Figure 2.8: The Seq2Seq model, with encoder hidden states h_1, …, h_5 and decoder states s_1, …, s_5 for the example sentence “Das ist ein Test <EOS>”.
used by the decoder as the context to generate the target sentence word by word. As sentences can have arbitrary lengths, a special end-of-sentence (EOS) symbol is used to signal the end of the sentence to the encoder and decoder.
Encoder. The encoder is a uni-directional RNN whose hidden states represent the tokens of the input sequence. These representations capture information not only about the corresponding token but also about the other tokens in the sequence, in order to leverage the context. The RNN runs in the left-to-right direction over the input sequence:

h_i = RNN(h_{i−1}, E^S[x_i]),   (2.39)

where E^S[x_i] is the embedding of the token x_i from the embedding table E^S of the input (source) space, and h_i is the hidden state of the RNN, which can be based on LSTM or GRU units. The hidden state after reading the last symbol is the fixed-length vector representation of the source sentence, which is called the context (c = h_T).
Decoder. The decoder generates the target sentence token by token, conditioned on the context vector c:

y_j ∼ softmax(W_y · r_j + b_r),   (2.40)
r_j = tanh(s_j + W_rc · c + W_rj · E^T[y_{j−1}]),   (2.41)
s_j = tanh(W_s · s_{j−1} + W_sj · E^T[y_{j−1}] + W_sc · c),   (2.42)

where E^T[y_j] is the embedding of the token y_j from the embedding table E^T of the output (target) space, and the W matrices and the b_r vector are the parameters.
Training and Decoding. Suppose we are given a training set D := {(x_i, y_i)}_{i=0}^{N}; the model parameters are trained end-to-end by maximising the (regularised) log-likelihood of the training data:

arg max_Θ Σ_{(x,y)∈D} log P_Θ(y|x).
At decoding time, the best output sequence for a given input sequence is produced by

arg max_y P_Θ(y|x) = arg max_y ∏_j P_Θ(y_j | y_{<j}, x).
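In practice this maximisation is only approximated; the sketch below shows a greedy decoding loop, with a made-up next_token_distribution function standing in for the trained decoder (names and the toy vocabulary are my own assumptions).

```python
import numpy as np

EOS, VOCAB = 0, 20
rng = np.random.default_rng(0)

def next_token_distribution(source, prefix):
    """Stand-in for P(y_j | y_<j, x); a real system would run the trained decoder."""
    logits = rng.normal(size=VOCAB)
    return np.exp(logits) / np.exp(logits).sum()

def greedy_decode(source, max_len=20):
    prefix, log_prob = [], 0.0
    for _ in range(max_len):
        p = next_token_distribution(source, prefix)
        y_j = int(np.argmax(p))           # pick the locally most probable token
        log_prob += float(np.log(p[y_j]))
        prefix.append(y_j)
        if y_j == EOS:                    # stop once the end-of-sentence symbol is emitted
            break
    return prefix, log_prob

print(greedy_decode(source="ein Beispiel"))
```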
2.2.2 Attentional Seq2Seq model

Figure 2.9: The attentional Seq2Seq model: attention weights α_1, …, α_n over the encoder states h_1, …, h_n form a dynamic context vector c_j, which is used together with the decoder state s_j to generate y_j.
The encoder is a bidirectional RNN:

\overrightarrow{h}_i = RNN(\overrightarrow{h}_{i−1}, E^S[x_i]),   (2.43)
\overleftarrow{h}_i = RNN(\overleftarrow{h}_{i+1}, E^S[x_i]),   (2.44)

where \overrightarrow{h}_i and \overleftarrow{h}_i are the hidden states of the forward and backward RNNs, which can be based on LSTM or GRU units. Each source token is then represented by the concatenation of the corresponding bidirectional hidden states, h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i].
Decoder. The decoder generates the target tokens as before, but conditions on a dynamic, position-specific context vector c_j:

y_j ∼ softmax(W_y · r_j + b_r),   (2.45)
r_j = tanh(s_j + W_rc · c_j + W_rj · E^T[y_{j−1}]),   (2.46)
s_j = tanh(W_s · s_{j−1} + W_sj · E^T[y_{j−1}] + W_sc · c_j),   (2.47)

where E^T[y_j] is the embedding of the token y_j from the embedding table E^T of the output (target) space, and the W matrices and the b_r vector are the parameters.
A crucial element of the decoder is the attention mechanism, which dynamically attends to the relevant parts of the input sequence necessary for generating the next token in the output sequence. Before generating the next token y_j, the decoder computes the attention vector α_j over the input tokens:

α_j = softmax(a_j),
a_{ji} = v · tanh(W_ae · h_i + W_at · s_{j−1}),

and the dynamic context is the attention-weighted sum of the source representations, c_j = Σ_i α_{ji} h_i.
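The attention computation amounts to a few matrix-vector products; below is a minimal sketch with assumed dimensions and random parameters (not the thesis's code).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_h, d_s, d_a = 6, 8, 8, 5            # source length, encoder/decoder/attention dims
H = rng.normal(size=(n, d_h))            # encoder states h_1..h_n
s_prev = rng.normal(size=d_s)            # previous decoder state s_{j-1}
W_ae = rng.normal(size=(d_a, d_h))
W_at = rng.normal(size=(d_a, d_s))
v = rng.normal(size=d_a)

# Additive (tanh) scores a_{ji}, softmax-normalised into alpha_j.
scores = np.array([v @ np.tanh(W_ae @ h_i + W_at @ s_prev) for h_i in H])
alpha = np.exp(scores - scores.max()); alpha /= alpha.sum()
c_j = alpha @ H                          # context: attention-weighted sum of the h_i
print(alpha, c_j)
```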
2.2.3 Convolutional Seq2Seq model

Figure 2.10: The high-level architecture of the ConvS2S model. Image source: (Gehring et al., 2017).

Bengio, 1998) are proposed, e.g., ByteNet (Kalchbrenner et al., 2017) and ConvS2S (Gehring et al., 2017). The high-level architecture of the ConvS2S model is depicted in Figure 2.10. As seen, convolutional layers operate over a fixed-size window of input tokens and therefore can be operated in parallel. Although the generated representations are for a fixed-size context, stacking convolutional layers leads to a larger effective window size. A multi-layer convolutional neural network creates a hierarchical representation over the input sentence where lower layers capture nearby dependencies and higher layers capture long-term dependencies. Assume we are given a sentence with n tokens; a convolutional network with a kernel size of k can capture dependencies by applying O(n/k) operations, whereas the number of operations for a recurrent unit is linear in the size of the input, O(n).

Figure 2.11: The general architecture of the Transformer model. Image source: (Vaswani et al., 2017).
2.2.4 Self-attention Seq2Seq model
Although convolutional neural network-based models are able to parallelise the training of Seq2Seq models and capture longer dependencies than RNN-based models, they still struggle to capture long-term dependencies. Vaswani et al. (2017) proposed the Transformer, an architecture solely based on self-attention. The Transformer can be seen as an adaptive weighted convolutional kernel with a window length equal to the sequence length. In a convolution kernel, the weights depend on the position and not the content; therefore, when we slide the window through the sequence, the weights are fixed. In the Transformer, however, the weights are determined with respect to the representation of the current position and the representations of the other positions. In other words, the idea is that position should not be a limitation: if two tokens are related, they should have higher weights for each other even if they are distant.

As depicted in Figure 2.11, each layer consists of a multi-head attention sublayer followed by a feed-forward neural network. The difference between the encoder and the decoder is two-fold: (1) the decoder also attends to the encoder representations; (2) the decoder does not attend to future positions.
different kinds of information, and the result can be considered as an ensemble of attentions:

MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O,   head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V),

where the W_i matrices are learned parameters. The Transformer applies the mentioned attention mechanism in three different ways: (1) self-attention in the encoder, where queries, keys and values all come from the previous encoder layer; (2) masked self-attention in the decoder, where each position attends only to earlier positions; and (3) encoder–decoder attention, where the queries come from the decoder and the keys and values come from the encoder output.
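For completeness, the sketch below shows the scaled dot-product attention used inside each head, softmax(Q Kᵀ/√d_k) V, following Vaswani et al. (2017); the dimensions and random projections are illustrative assumptions of mine.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, applied row-wise over the queries."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)            # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 16, 4
d_k = d_model // n_heads
X = rng.normal(size=(seq_len, d_model))                      # token representations
heads = []
for _ in range(n_heads):                                     # one projection set per head
    W_q, W_k, W_v = (rng.normal(scale=0.3, size=(d_model, d_k)) for _ in range(3))
    heads.append(scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v))
W_o = rng.normal(scale=0.3, size=(n_heads * d_k, d_model))
output = np.concatenate(heads, axis=-1) @ W_o                # Concat(head_1..head_h) W^O
print(output.shape)                                          # (seq_len, d_model)
```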
Positional embedding Unlike RNN, the Transformer does not process input to-
kens with respect to their temporal position. Therefore, it needs to be informed about
the relative or absolute position of the tokens in the sequence in order to make use of
sequence order. To encode the position, Vaswani et al. (2017) used sinusoidal func-
tions with different frequencies to extrapolate to sequence lengths even larger than
those observed during the training. Finally, the position embedding is added to the
word embeddings to enrich them with positional information.
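The sinusoidal position encoding can be sketched as follows, following the formulation in Vaswani et al. (2017); the function and variable names are mine, and the toy shapes are illustrative.

```python
import numpy as np

def sinusoidal_positions(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    pos = np.arange(max_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Added to the word embeddings to inject order information.
word_embeddings = np.zeros((10, 8))          # e.g. 10 tokens, model dimension 8
enriched = word_embeddings + sinusoidal_positions(10, 8)
print(enriched.shape)
```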
2.2.5 Evaluation metrics

Perplexity (PPL) is the inverse probability of the translation sentence, normalised by the number of words:

PPL(y) = ( ∏_{i=1}^{T} p(y_i | y_{<i}, x) )^{−1/T}.   (2.52)

Intuitively, PPL is a measure to determine “how confused is the model about its decision?” (Neubig, 2017).
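In log-space, eqn. (2.52) is simply the exponentiated average negative log-probability; a small numerical sketch (token probabilities invented for illustration):

```python
import numpy as np

token_probs = [0.4, 0.1, 0.25, 0.05]       # p(y_i | y_<i, x) for a 4-token output
ppl = np.exp(-np.mean(np.log(token_probs)))
# Equivalent to (prod p_i)^(-1/T): lower perplexity = the model is less "confused".
print(ppl)
```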
TER (Snover et al., 2006): Translation Edit Rate (TER) is a machine translation measure that determines the amount of post-editing required for a generated translation to exactly match one of the references. It counts the minimum number of possible edits, including insertions, deletions, shifts of sequences and substitutions of words, and normalises it by the average number of words in the references:

TER = (number of edits) / (average number of reference words).
similarity scores by explicitly aligning words between the generated translation and
a given reference translation. The matching of two words is done by applying the
following matchers: Exact (identical surface form), Stem (identical stem), Synonym
(common synonym). It then calculates an F-score:
F_mean = (P · R) / (α · P + (1 − α) · R),   (2.57)
where P and R are precision and recall based on single-word matches. In the next step, METEOR penalises the translation with respect to the order of the matched words in the generated translation and the reference. The penalty coefficient is Pen ∝ c/m, where c is the smallest number of “chunks” of matched words such that the words in each chunk are adjacent (in the generated translation and reference) and in the same word order, and m is the number of matched words. The final score scales the F-score by this penalty, METEOR = (1 − Pen) · F_mean.
2.3 Low-Resource Neural Machine Translation

The difference between low- and high-resource NMT is more than bilingual data availability! It has been shown that in bilingually low-resource scenarios, Phrase-Based Statistical Machine Translation (PBSMT) models outperform NMT models, while for high-resource scenarios the situation is reversed (Koehn and Knowles, 2017; Lample et al., 2018b). Other works have reported that the improvement brought by a model or technique differs across data-condition regimes (Qi et al., 2018; Kiperwasser and Ballesteros, 2018). Recently, Sennrich and Zhang (2019) re-visited low-resource NMT and showed that it is very sensitive to hyperparameters, architectural design and other design choices. They have shown that the typical settings for high-resource scenarios are not the best for low-resource ones, and that NMT can be unleashed and outperform PBSMT in low-resource scenarios with a proper setting. However, note that the performance of an NMT model trained with hundreds of bilingual pairs is still much less than that of one trained with millions.

3 https://github.com/jhclark/multeval
A practical approach to compensate for the lack of bilingual data in low-resource NMT is to use auxiliary data. The auxiliary data could be in the form of curated monolingual linguistic resources, monolingual sentences or multilingual sentence pairs, i.e., a mixture of high- and low-resource. In this thesis, we will focus on the first type and leverage curated monolingual resources to compensate for the shortage of bilingual training data. These resources are mostly available in the form of annotated datasets, e.g., treebanks for syntactic parsing or part-of-speech tagged sentences. There are two main approaches to incorporate these resources into the training of an NMT model: (1) using these monolingual resources to build tools which can then be used to annotate source sentences (covered in Section 2.3.1); (2) directly injecting linguistic knowledge into the translation model (covered in Section 2.3.2). There are other approaches for improving low-resource NMT that mainly focus on using auxiliary monolingual and multilingual data, and we will cover them in Section 2.3.3.
2.3.1 Incorporate Linguistic Annotation by Transduction of Complex Structures

Figure 2.12: The tree-to-sequence NMT model: phrase representations h^{(phr)}_{n+1}, …, h^{(phr)}_{2n−1} are built on top of the sequential word representations h_1, …, h_n, and the decoder states g_j attend over both.
parse tree of the source sentence. As generating gold-standard parse trees is not possible in real-world scenarios, the authors proposed to use binary constituency parse trees4 generated by an off-the-shelf parser.
Tree Encoder. It consists of sequential and recursive parts. The sequential part is the vanilla sequence encoder discussed in Section 2.2.1, which calculates the embeddings of words. Then, the embeddings of phrases are calculated from the embeddings of their constituent words in a recursive bottom-up fashion:

h^{(phr)}_k = TreeLSTM(h_{l_k}, h_{r_k}),
where h_{l_k} and h_{r_k} are the hidden states of the left and right children, respectively. This method uses TreeLSTM units (Tai et al., 2015) to calculate the embedding of a parent node from its two children as follows:

i = σ(U_l^{(i)} h_l + U_r^{(i)} h_r + b^{(i)}),
f_l = σ(U_l^{(f_l)} h_l + U_r^{(f_l)} h_r + b^{(f_l)}),
f_r = σ(U_l^{(f_r)} h_l + U_r^{(f_r)} h_r + b^{(f_r)}),
o = σ(U_l^{(o)} h_l + U_r^{(o)} h_r + b^{(o)}),
c̃ = tanh(U_l^{(c̃)} h_l + U_r^{(c̃)} h_r + b^{(c̃)}),
c^{(phr)} = i ⊙ c̃ + f_l ⊙ c_l + f_r ⊙ c_r,
h^{(phr)} = o ⊙ tanh(c^{(phr)}),

where i, f_l, f_r, o and c̃ are the input gate, the left and right forget gates, the output gate, and the candidate state for updating the memory cell; c_l and c_r are the memory cells of the left and right children.

4 A constituency parse tree breaks a sentence into sub-phrases.
Sequential Decoder. Eriguchi et al. set the initial state of the decoder by combining the final states of the sequential and tree encoders as follows:

g_0 = TreeLSTM(h_n, h^{(phr)}_{root}).

The rest of the decoder is similar to the vanilla attentional decoder discussed in Section 2.2.2. The difference is that, in this model, the attention mechanism makes use of phrases as well as words. Thus, the dynamic context is calculated as follows:

c_j = Σ_{i=1}^{n} α_{ji} h_i + Σ_{i'=n+1}^{2n−1} α_{ji'} h^{(phr)}_{i'}.
Chen et al. (2017a) extended this work by proposing a bidirectional tree encoder. Though the results for Tree2Seq models are promising, the top-1 trees are prone to parser errors and cannot capture semantic ambiguities of the source sentence. In Chapter 3, we address the issues mentioned above by using the combinatorially many trees encoded in a forest instead of a single top-1 parse tree. We capture the parser uncertainty by considering many parse trees along with their probabilities using our ForestLSTM architecture. Following our work, Ma et al. (2018) has also proposed a forest-based NMT model, with a different approach of linearising the forest and using a Seq2Seq model.
Figure 2.13: A two-layer syntactic GCN on top of the embeddings, updating them concerning a dependency parse tree. To simplify the image, gates and some labels are removed. Image source: (Bastings et al., 2017).
set of tuples (w_i, H_{w_i}, l_{w_i}), where w_i is the i-th word in the sentence, its parent node (head) is H_{w_i} ∈ {w_1, …, w_n, Root}, and l_{w_i} is a dependency tag (label). The encoding process has two phases: first, they calculate initial embeddings of the words; second, they refine and enrich these embeddings concerning the dependency tree.

Initial embeddings can be obtained from an embedding weight matrix or calculated using an encoder to incorporate contextual information. In the second phase, a syntactic Graph Convolutional Network (Marcheggiani and Titov, 2017) is used to update the initial embeddings. As shown in Figure 2.13, each layer updates the embeddings of the nodes with respect to their neighbours in the graph (in this case, the dependency tree). This means that with k layers, nodes will receive information from neighbours at most k hops away. The H-dimensional embedding of word v in the (j + 1)-th layer is updated as:

h_v^{(j+1)} = ρ( Σ_{u∈N(v)} g_{u,v}^{(j)} ( W_{dir(u,v)}^{(j)} h_u^{(j)} + b_{lab(u,v)}^{(j)} ) ),   (2.59)
where N(v) denotes the neighbours of node v, dir(u, v) and lab(u, v) are the direction and label of the edge between u and v, and ρ is an activation function. The edge gate g_{u,v}^{(j)} is calculated as follows:

g_{u,v}^{(j)} = σ( h_u^{(j)} · ŵ_{dir(u,v)}^{(j)} + b̂_{lab(u,v)}^{(j)} ).   (2.60)
p(H_{w_i} = w_j | w_i) = exp(m(i, j)) / Σ_{k≠i} exp(m(i, k)),   (2.61)

where m(i, j) = h_j^{(2)⊤} W_dp h_i^{(2)} is a scoring function with weight matrix W_dp. To predict the probability distribution p(l_{w_i} | w_i), they feed [h_i^{(2)}; z(H_{w_i})] into a softmax function, where z(H_{w_i}) is the weighted average of the candidate parents' hidden states, Σ_{j≠i} p(H_{w_i} = w_j | w_i) h_j^{(2)}. The parameters of this parsing model are learned by propagating errors from the translation objective function.
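One syntactic GCN layer implementing eqns (2.59)–(2.60) can be sketched as follows over a small dependency graph; the edge inventory, the choice of ρ as ReLU, and the dimensions are toy assumptions, not the configuration used by Bastings et al. (2017).

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, H = 4, 8
edges = [(1, 0, "nsubj"), (1, 2, "det"), (1, 3, "obj")]      # (head, dependent, label)
h = rng.normal(size=(n_words, H))                            # embeddings from layer j

# Direction-specific weights and gate parameters; label-specific biases created on demand.
W = {d: rng.normal(scale=0.3, size=(H, H)) for d in ("in", "out", "self")}
w_gate = {d: rng.normal(scale=0.3, size=H) for d in ("in", "out", "self")}
b, b_gate = {}, {}
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def neighbours(v):
    """Incoming/outgoing dependency edges of v, plus a self-loop."""
    out = [(dep, "out", lab) for head, dep, lab in edges if head == v]
    inc = [(head, "in", lab) for head, dep, lab in edges if dep == v]
    return out + inc + [(v, "self", "self")]

h_next = np.zeros_like(h)
for v in range(n_words):
    acc = np.zeros(H)
    for u, direction, label in neighbours(v):
        b.setdefault((direction, label), np.zeros(H))
        b_gate.setdefault((direction, label), 0.0)
        gate = sigmoid(h[u] @ w_gate[direction] + b_gate[(direction, label)])  # eqn (2.60)
        acc += gate * (W[direction] @ h[u] + b[(direction, label)])            # eqn (2.59)
    h_next[v] = np.maximum(acc, 0.0)       # rho taken as ReLU here
print(h_next.shape)
```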
Jane hatte eine Katze . (ROOT (S (NP Jane )NP (VP had (NP a cat )NP )VP . )S )ROOT
Figure 2.14: An example of a source sentence, and its translation in the form of
linearised lexicalised constituency tree (Aharoni and Goldberg, 2017).
After calculating the latent graph, this model uses a modified version of the attentional Seq2Seq model to translate. In the encoder, a sequential encoder generates the embeddings of words e_i. Then, for each word, another embedding is calculated with respect to the latent graph.

2.3.2 Multi-Task Learning for Directly Injecting Auxiliary Knowledge
2.3.2.1 Multi-Task Learning
MTL architectures differ in how parameters are shared among the tasks. Two common schemes are:

• Full sharing: In this case, a single model performs all of the tasks. To inform the model of the desired task, a special tag can be added to the corresponding input.

• Partial sharing: The idea behind this model is that not all of the representations need to be shared across tasks. Each task has its own set of parameters to learn task-specific representations, while some of the parameters are tied among tasks to make knowledge sharing possible.
Training Schedule. The training schedule is the beating heart of MTL and has a critical role in the performance of the resulting model. It is responsible for balancing the importance (participation rate) of the different tasks throughout the training process, in order to make the best use of the knowledge they provide. As more than one task is involved in the training of an MTL model, we categorise MTL approaches based on the flavour of MTL they consider:

• General-MTL, where the goal is to improve all of the tasks (Chen et al., 2018b; Guo et al., 2018);

• Biased-MTL, where the aim is to improve one of the tasks, referred to as the main task, the most (Kiperwasser and Ballesteros, 2018; Guo et al., 2019). As the main aim of this thesis is to improve the translation task, we use this flavour of MTL.
Note that prior works solely use “MTL” to refer to either of these categories; however, we distinguish between them to make the comparison easier.5
2.3.2.2 Multi-Task Learning and Transfer Learning

In this section, we want to show the relation of MTL to other learning frameworks. Note that there is no universal definition for some of these learning frameworks. For example, researchers use different definitions of transfer learning; some put MTL under the umbrella of transfer learning (Pan and Yang, 2009) while others distinguish between them (Torrey and Shavlik, 2010). We stick with the definition of Pan and Yang (2009). Assume we are given a source set D_S := {(x_i^{(S)}, y_i^{(S)})}_{i=0}^{N_S} drawn from the source joint distribution P_S(X, Y), and a target set D_T := {(x_i^{(T)}, y_i^{(T)})}_{i=0}^{N_T} drawn from the target joint distribution P_T(X, Y). Transfer learning aims to improve the target learned hypothesis (model) h_T on P_T(X) by using D_S. Please note that with this definition, General-MTL is not necessarily a transfer learning method, for the following reasons: (1) in General-MTL we cannot define source and target tasks, as all of the tasks are regarded the same; (2) the goal is to improve (in expectation) all of the tasks, even at the cost of degradation in the performance of some of them. On the other hand, Biased-MTL can be seen as inductive transfer learning, where the source and target tasks are different and trained together; we can consider the auxiliary tasks as source tasks and the main task as the target.
41
Assuming $\ell$ is the loss for the main task, and $P_0(X, Y)$ is the joint distribution of the main task, then the standard expected risk is defined as follows:
$$R_{P_0}[h] := \mathbb{E}_{(x,y)\sim P_0(X,Y)}\big[\ell(h(x), y)\big],$$
where $h$ is the learned hypothesis. Note that the expected risk should be measured on a separate test set. Then, we define the negative transfer condition for any Biased-MTL approach as:
$$R_{P_0}\big[\mathrm{MTL}(\{D_0, D_1\})\big] > R_{P_0}\big[\mathrm{MTL}(\{D_0\})\big],$$
where $\mathrm{MTL}(\{D_0\})$ refers to the hypothesis generated for the single main task. This definition can also be extended to examine whether a training schedule and/or architecture $A$ is better than $B$ for capturing positive transfer and avoiding negative transfer, as follows:
$$R_{P_0}\big[\mathrm{MTL}_B(\{D_0, D_1\})\big] > R_{P_0}\big[\mathrm{MTL}_A(\{D_0, D_1\})\big].$$
MTL has been applied in different areas of machine learning. For non-deep learning methods, MTL has been applied in different settings including support vector machines (Evgeniou and Pontil, 2004), multiple linear regression (Jalali et al., 2010), image classification (Yuan et al., 2012), relation extraction (Jiang, 2009), and linguistic annotation (Reichart et al., 2008). Recently, with the rising wave of deep learning, deep MTL has become popular for different applications, including visual representation learning (Doersch and Zisserman, 2017), face detection (Ranjan et al., 2017), scene understanding (Kendall et al., 2018), medical image segmentation (Moeskops et al., 2016), and speaker-role adaptation in conversational AI (Luan et al., 2017).
MTL and NLP. Deep MTL has also been used for various NLP problems, including graph-based parsing (Chen and Ye, 2011) and key-phrase boundary classification (Augenstein and Søgaard, 2017). Chen et al. (2017b) applied Multi-Task Learning to Chinese word segmentation, and Liu et al. (2017) applied it to the text classification problem. Both of these works used adversarial training to make sure the shared layer extracts only common knowledge.
MTL has been used effectively to learn from multimodal data. Luong et al. (2016) proposed MTL architectures for neural Seq2Seq transduction for tasks including MT, image caption generation, and parsing. They fully share the encoders (many-to-one), the decoders (one-to-many), or some of the encoders and decoders (many-to-many). Pasunuru and Bansal (2017) made use of an MTL approach to improve video captioning with auxiliary tasks, including video prediction and logical language entailment, based on a many-to-many architecture.
MTL for NMT. Multi-task learning has attracted attention for improving NMT in recent works. Zhang and Zong (2016) made use of monolingual sentences in the source language in a Multi-Task Learning framework by sharing the encoder in the attentional encoder-decoder model. Their auxiliary task is to reorder the source text to make it close to the target language word order. Domhan and Hieber (2017) proposed a two-layer stacked decoder, in which the bottom layer is trained on language modelling on the target language text. The next word is jointly predicted by the bottom-layer language model and the top-layer attentional RNN decoder. They reported only moderate improvements over the baseline, falling short of using synthetic parallel data.
The current research on MTL is focused on encouraging positive transfer and preventing the negative transfer phenomenon in two lines of research: (1) Architecture design:
works in this area, including part II of this thesis, try to learn effective parameter
sharing among tasks; (2) Training schedule: works in this area, including part III of
this thesis, focus on setting the importance of tasks throughout the training.
Niehues and Cho (2017) made use of part-of-speech tagging and named-entity recognition tasks to improve NMT. They used the attentional encoder-decoder with a shallow architecture, and shared different parts, e.g. the encoder, decoder, and attention. They report the best performance when fully sharing the encoder.
In Chapter 4, for the first time to the best of our knowledge, we use semantic parsing as an auxiliary task along with others, i.e. syntactic parsing and named-entity recognition. We propose an architecture that uses partial sharing on deep stacked encoder and decoder components, and the results show that it is critical for NMT improvement in MTL. Furthermore, we propose adversarial training to prevent contamination of the shared knowledge with task-specific details. Then, in Chapter 5, we propose a model that learns how to control the amount of sharing among tasks dynamically.
Training schedule. Since there is more than one task involved in MTL, training schedules are designed specifically for each flavour of MTL: General-MTL and Biased-MTL. Training schedules designed for General-MTL are focused on co-evolving easy and difficult tasks uniformly. These methods are designed to achieve competitive performance with existing single-task models of each task (Chen et al., 2018b; Guo et al., 2018). On the other hand, training schedules for Biased-MTL focus on achieving
greater improvements on the main task, and our method belongs to this category.
Since this thesis aims to improve the translation task the most, we will focus on this
flavour.
Training schedules can be fixed/dynamic throughout the training and be hand-
engineered/adaptive. In Chapter 4, we propose a fixed hand-engineered schedule for
improving low-resource NMT with auxiliary linguistic tasks. Recently, Guo et al.
(2019) has proposed an adaptive way to calculate the importance weights of tasks.
Instead of manual tuning of importance weights via an extensive grid search, they
model the performance of each set of weights as a sample from a Gaussian Process
(GP) and search for optimal values. Their method is not entirely adaptive, as a strong prior needs to be set for the main task. This method can be seen as a guided yet computationally expensive trial-and-error, where in each trial, MTL models need to be re-trained (from scratch) with the sampled weights. Moreover, the weights of the tasks are fixed throughout the training. At least for the case of low-resource NMT, Kiperwasser and Ballesteros (2018) show that dynamically changing the weights throughout the training is essential to make better use of auxiliary tasks. They have proposed hand-engineered training schedules for MTL in NMT, where they dynamically change the importance of the main task vs. the auxiliary tasks throughout the training process. Their method relies on hand-engineered schedules which should be tuned by trial-and-error. In Part III, we will introduce a learning-to-multi-task-learn
method to adaptively and dynamically set the importance of the tasks and learn the
MTL model in the course of a single training run.
Zhang et al. (2018a) proposed a learning-to-MTL framework in order to learn effective MTL architectures for generalising to new tasks. This is achieved by collecting historical multi-task experience, represented by tuples consisting of the MTL problem, the MTL architecture, and its relative error. In contrast, our learning-to-MTL framework tackles the problem of learning effective training schedules to use auxiliary tasks in such a way as to improve the main translation task the most.
The key idea behind active learning (AL) is that we can achieve better accuracy with
fewer labels if the model is able to choose the training data it learns from (Settles,
2009). AL approaches are based on the assumption that there is a pool or stream of
unlabelled data and a limited budget for the annotation. Traditional AL methods are
mostly based on heuristics to guide the selection of unlabelled data for the annotation.
Tong and Chang (2001) proposed policies to select data considering the confidence of
the classifier while Gilad-Bachrach et al. (2006) introduced an approach based on a
query on a committee of classifiers. Yang et al. (2015) proposed an approach based on
the heuristic of making the selected data as diverse as possible. There are many more heuristics in the literature, which can also be mixed and matched. Although these heuristics have been shown to be beneficial, their effectiveness is limited, and their performance varies between datasets. Recently, deep Reinforcement Learning (RL) has been used to learn the active learning algorithm itself. Woodward and Finn (2017) combined RL with one-shot learning to learn how and when to request labels in stream-based AL. Bachman et al. (2017) used policy gradients to select the unlabelled data, and also the order of selection, for pool-based AL. Learning-to-active-learn methods have also been successfully applied to some NLP classification tasks (Fang et al., 2017; Liu et al., 2018; Vu et al., 2019).
For low-resource machine translation, we assume small bilingual corpora and a
pool of monolingual sentences along with an annotation budget. The active learner aims to select the unlabelled data in such a way as to make the best use of the annotation budget in terms of the performance of the resulting translation model. There are several heuristics in statistical MT for selecting the untranslated sentences, e.g. considering the entropy of the potential translation, similarity-based selection, and feature decay to increase the diversity (Haffari et al., 2009; Haffari and Sarkar, 2009; Biçici
and Yuret, 2011). Recently, Liu et al. (2018) proposed an approach for learning active
learning policies for low-resource NMT by formulating the training as a hierarchical
MDP.
when training the other dual/primal model, and vice versa. As a result, they were able to apply their fully-differentiable model to unsupervised NMT effectively. Artetxe
et al. (2018) also proposed an approach based on dual learning to train an NMT model
in a completely unsupervised manner. They use a shared encoder for both primal
and dual models and use pre-trained unsupervised cross-lingual embeddings (Artetxe
et al., 2017). Therefore, the distributed representation of words would be language-
independent, and the model only needs to learn how to compose representations for
larger phrases. The last two approaches can be combined for a further improvement
(Lample et al., 2018b).
Johnson et al. (2017) and Ha et al. (2016) were the first to show that multilingual NMT models are somewhat capable of translating between untrained language pairs, a setting referred to as zero-shot learning. Gu et al. (2018a) proposed a
new multilingual NMT model specifically to improve the translation of low-resource
languages. It shares the universal lexical and sentence level representations across
multiple source languages into one target language. This model enables the sharing
of resources between high-resource languages and low-resource ones. Further, Gu
et al. (2018b) proposed a meta-learning approach for low-resource NMT by using
the universal lexical representations. Inspired by the idea of model-agnostic meta-
learning algorithm (Finn et al., 2017), they view training the NMT model on different low-resource language pairs as separate tasks. Then, they train the NMT model in such a way as to rapidly adapt to new language pairs with a minimal amount of bilingual sentence pairs. In a different direction, Neubig and Hu (2018) proposed
to use a pre-trained multilingual NMT model and fine-tune it on new low-resource
language pairs for rapid adaptation of NMT on new languages.
Recently, Arivazhagan et al. (2019) have investigated why multilingual NMT models do not perform well on unseen language pairs. They found that the issue arises when the model is not able to learn language-invariant features, and they propose auxiliary losses to impose language-invariant representations across languages. Gu et al. (2019) investigated the issue as arising from capturing spurious correlations in the training data. They then proposed to use language model pre-training and back-translation to help the model disregard such correlations.
2.4 Summary
In this chapter, we covered the foundations and prior works related to this thesis. We gave an overview of deep learning fundamentals and their usage in NLP and machine translation. Then, we reviewed the case of bilingually low-resource NMT and described how linguistic resources could be used to compensate for the lack of bilingual data.
This thesis aims to explore further and extend this line of research. In Part I, we incorporate uncertainty in machine-generated linguistic annotations in Neural Machine Translation by proposing a forest-to-sequence model. In Parts II and III, we focus on directly injecting the linguistic knowledge using Multi-Task Learning, where Part II is mainly focused on the architectural design aspect and Part III is dedicated to the
training schedule.
Part I
Transduction of Complex Structures
Chapter 3
An Attentional Forest-To-Sequence Model
The main aim of this thesis is to improve the performance of Neural Machine Trans-
lation in bilingually low-resource scenarios by incorporating linguistic knowledge. As
mentioned in Section 2.3, one approach is to train or use pre-trained parsers to provide
annotations of the source sentences. Incorporating syntactic information in Neural
Machine Translation (NMT) can lead to better reorderings (Eriguchi et al., 2016),
particularly useful when the language pairs are syntactically highly divergent or when
the training bitext is not large. Previous work on using syntactic information, pro-
vided by top-1 parse trees generated by (inevitably error-prone) parsers, has been
promising (Eriguchi et al., 2016; Chen et al., 2017a). In this chapter, we propose a
forest-to-sequence NMT model to make use of combinatorially many parse trees of
the source sentence to compensate for the parser errors. Our method represents the
collection of parse trees as a packed forest and learns a neural transducer to translate
from the input forest to the target sentence. Experiments on English to German,
Chinese and Farsi translation tasks show the superiority of our approach over the
sequence-to-sequence and tree-to-sequence neural translation models.
3.1 Introduction
One of the main premises about natural language is that words of a sentence are
inter-related according to a (latent) hierarchical structure (Chomsky, 1957), i.e. a
syntactic tree. Therefore, it is expected that modelling the hierarchical syntactic
structure should improve the performance of NMT, especially in low-resource or lin-
guistically divergent scenarios, such as English-Farsi. In this direction, Li et al. (2017)
uses a sequence-to-sequence model, making use of linearised parse trees. Chen et al.
(2018a) has proposed a model which uses syntax to constrain the dynamic encod-
ing of the source sentence via structurally constrained attention. Bastings et al.
(2017); Shuangzhi Wu (2017); Beck et al. (2018); Ding and Tao (2019) have incorpo-
rated syntactic information provided by the dependency tree of the source sentence.
Marcheggiani et al. (2018) has proposed a model to inject semantic bias into the
encoder of NMT model.
More related to our work, Eriguchi et al. (2016); Chen et al. (2017a) have proposed
methods to incorporate the hierarchical syntactic constituency information of the
source sentence. In addition to the embedding of words, calculated using the vanilla
sequential encoder, they calculate the embeddings of phrases recursively, directed by
the top-1 parse tree of the source sentence generated by a parser. Though the results
are promising, the top-1 trees are prone to parser error, and furthermore cannot
capture semantic ambiguities of the source sentence.
In this chapter, we address the issues mentioned above by using exponentially
many trees encoded in a forest instead of a single top-1 parse tree. We capture
the parser uncertainty by considering many parse trees and their probabilities. The
encoding of each source sentence is guided by the forest and includes the forest nodes
whose representations are calculated in a bottom-up fashion using our ForestLSTM
architecture (Section 3.2). Thus, in the encoding stage of this approach, different
ways of constructing a phrase are taken into consideration, along with the probability
of rules in the corresponding trees. We evaluate our approach on English to Chinese,
Farsi and German translation tasks, showing that forests lead to better performance
compared to top-1 trees and sequential encoders (Section 3.4).
Following our work, Ma et al. (2018) has also proposed a forest-based NMT model.
Instead of modelling the hierarchical structure of the forest, their model is based on
linearising the forest and using a Seq2Seq model.
Figure 3.1: An example of generating a phrase from two different parse trees
$$h_i = \mathrm{SeqLSTM}(h_{i-1}, E_x[x_i])$$
where $E_x[x_i]$ is the embedding of the word $x_i$ in the embedding table $E_x$ of the source language, and $h_i$ is the context-dependent embedding of $x_i$.
$$\gamma^{l} = \tanh\Big( U^{\gamma} \sum_{l'=1}^{N} \mathbb{1}_{l \neq l'}\, h^{l'} + W^{\gamma} h^{l} + v^{\gamma} p^{l} + b^{\gamma} \Big)$$
$$f^{l} = \sigma\Big( U^{f} \sum_{l'=1}^{N} \mathbb{1}_{l \neq l'}\, [h^{l'}; \gamma^{l'}] + W^{f} [h^{l}; \gamma^{l}] + b^{f} \Big)$$
$$i = \sigma\Big( U^{i} \sum_{l=1}^{N} [h^{l}; \gamma^{l}] + b^{i} \Big)$$
$$o = \sigma\Big( U^{o} \sum_{l=1}^{N} [h^{l}; \gamma^{l}] + b^{o} \Big)$$
where $N$ is the number of incoming hyper-edges, $h^{l}$ is the embedding of the head of the $l$-th incoming hyper-edge, $p^{l}$ is its probability, and $v^{\gamma}$ is the learned weight for the probability. $\gamma^{l}$ is a probability-sensitive intermediate representation for the $l$-th incoming hyper-edge, which is then used in the computations of the forget gate $f^{l}$, the input gate $i$, and the output gate $o$. The representation of the phrase $h^{phr}$ is then calculated as
$$\tilde{c} = \tanh\Big( U^{\tilde{c}} \sum_{l=1}^{N} [h^{l}; \gamma^{l}] + b^{\tilde{c}} \Big)$$
$$c^{phr} = i \odot \tilde{c} + \sum_{l=1}^{N} f^{l} \odot c^{l}$$
$$h^{phr} = o \odot \tanh(c^{phr})$$
where $c^{l}$ is the memory cell of the TreeLSTM unit used to calculate the representation of the head of the $l$-th hyper-edge from its tail nodes.
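For concreteness, the following is a minimal PyTorch sketch of this probability-sensitive aggregation over the incoming hyper-edges of a forest node; the class name, dimensions, and the placement of biases inside nn.Linear are illustrative assumptions, not the implementation used in the experiments.

```python
import torch
import torch.nn as nn

class ForestLSTMNode(nn.Module):
    """Sketch of aggregating the N incoming hyper-edges of a forest node into a
    single phrase representation (h_phr, c_phr), following the equations above."""

    def __init__(self, d):
        super().__init__()
        self.U_g, self.W_g = nn.Linear(d, d, bias=False), nn.Linear(d, d)
        self.v_g = nn.Parameter(torch.zeros(d))               # weight for the edge probability p^l
        self.U_f, self.W_f = nn.Linear(2 * d, d, bias=False), nn.Linear(2 * d, d)
        self.U_i, self.U_o, self.U_c = (nn.Linear(2 * d, d) for _ in range(3))

    def forward(self, h, c, p):
        # h, c: (N, d) head embeddings / memory cells of the N incoming hyper-edges
        # p:    (N,)  probabilities of the hyper-edges
        sum_others = h.sum(0, keepdim=True) - h               # sum over l' != l of h^{l'}
        gamma = torch.tanh(self.U_g(sum_others) + self.W_g(h)
                           + self.v_g * p.unsqueeze(1))       # probability-sensitive gamma^l
        hg = torch.cat([h, gamma], dim=-1)                    # [h^l ; gamma^l]
        sum_hg_others = hg.sum(0, keepdim=True) - hg
        f = torch.sigmoid(self.U_f(sum_hg_others) + self.W_f(hg))   # per-edge forget gates f^l
        i = torch.sigmoid(self.U_i(hg.sum(0)))                # input gate
        o = torch.sigmoid(self.U_o(hg.sum(0)))                # output gate
        c_tilde = torch.tanh(self.U_c(hg.sum(0)))
        c_phr = i * c_tilde + (f * c).sum(0)
        return o * torch.tanh(c_phr), c_phr                   # (h_phr, c_phr)
```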
3.2.2 Sequential Decoder
We use a sequential attentional decoder similar to that of the Tree2Seq model,
where the attention mechanism attends to both words and phrases in the forest:
$$c_j = \sum_{i=1}^{n} \alpha_{ji}\, h_i \;+\; \sum_{i'=n+1}^{n+n_p} \alpha_{ji'}\, h^{phr}_{i'}$$
where $n$ is the length of the input sentence, and $n_p$ is the number of forest nodes.
We initialise the decoder’s first state by using a TreeLSTM unit (Section 2.1.4.1)
to combine the embeddings of the last word in the source sentence and the root of
the forest:
$$g_0 = \mathrm{TreeLSTM}(h_n, h^{phr}_{root}).$$
This provides a summary of phrases and words in the source sentence to the decoder.
3.2.3 Training
Suppose we are given a training set $D := \{(x_i, y_i, F_{x_i})\}_{i=0}^{N}$; the model parameters are trained end-to-end by maximising the (regularised) log-likelihood of the training data:
$$\arg\max_{\Theta} \sum_{(x,y,F_x)\in D} \log P_{\Theta}(y\,|\,x, F_x),$$
where $D$ is the set of triples consisting of the bilingual training sentences $(x, y)$ paired with the parse forests $F_x$ of the source sentences.
$$O(2 W_s |x| + W_r N)$$
Sentence Length Avg. tree nodes Avg. forest nodes Avg. # of trees in forests
<10 7.94 9.77 6.13E+4
10-19 12.3 18.99 2.62E+16
20-29 21.18 41.79 2.76E+22
>30 31 78.72 2.21E+15
all 10.33 14.84 1.41E+20
Table 3.1: The average number of nodes in trees and forests along with average
number of trees in forests for En→Fa bucketed dataset.
where the first term shows the computational complexity of a bidirectional sequential
encoder to calculate the embeddings of words, and the latter one is the time for
computing the embeddings of phrases with respect to the corresponding tree/forest.
For generating each word in the target sentence, the attention mechanism performs
soft attention on the words and phrases of the source sentence. If $O(W_t)$ is the time for updating the decoder state and generating the next target word, then for a target sentence of length $|y|$ the computational complexity of the decoding phase would be:
The difference among these three methods is N . For the Seq2Seq model N is 0.
For the Tree2Seq model, the number of nodes in the tree is a constant function
of the input size: N = |x| − 1. Since we used pruned forests obtained from the
parser in (Huang, 2008), the number of nodes in the forest is variable. Table 3.1
shows the average value of N for trees/forests for different source lengths for one of
the datasets we used in experiments. As seen, while forests contain combinatorially
many trees, on average, the number of nodes in parse forests is less than twice the
number of nodes in the corresponding top-1 parse trees. This shows that our method considers combinatorially many trees instead of the top-1 tree at only a small linear overhead.
3.4 Experiments
3.4.1 The Setup
Datasets. We make use of three different language pairs, translating from English
(En) to Farsi (Fa), Chinese (Ch), and German (De). Our research focus is to tackle NMT issues for bilingually low-resource scenarios, and En→Fa is intrinsically a low-resource language pair. Moreover, we used small datasets for the En→Ch and En→De language pairs to simulate low-resource scenarios, where the source and target languages are linguistically divergent and close, respectively. For En→Fa, we use the TEP corpus (Tiedemann, 2009), which is extracted from movie subtitles. For En→Ch, we use BTEC, and for En→De, we use the first 100K sentences of Europarl1 . The statistics of the datasets are summarised in Table 3.2.

          Train   Dev            Test
En → Fa   337K    RND. 2k        RND. 2k
En → Ch   44k     devset1 2      devset 3
En → De   100K    newstest2013   newstest2014

Table 3.2: Statistics of the bilingual corpora.
We lowercase and tokenise the corpora using Moses scripts (Koehn et al., 2007).
Sentences longer than 50 words are removed, and words with a frequency of less than 5 are replaced with <Unk>. Compact forests and trees for English source sentences
are obtained from the parser in (Huang, 2008), where the forests are binarised, i.e.
hyper-edges with more than two tail nodes are converted to multiple hyper-edges
with two tail nodes. This is to ensure a fair comparison between our model and the
Tree2Seq model (Eriguchi et al., 2016) where they use binary HPSG parse trees.
Furthermore, we prune the forests by removing low probability hyper-edges, which
significantly reduces the size of the forests. In all experiments, we use the development
sets for setting the hyper-parameters, and the test sets for evaluation.
3.4.2 Results
The perplexity and BLEU scores of different models for all translation tasks are pre-
sented in Table 3.4. In all translation tasks, Forest2Seq outperforms Tree2Seq
1
http://www.statmt.org/wmt14/translation-task.html
Figure 3.2: (a) BLEU scores for bucketed En→Ch dataset. (b) Percentage of more
correct n-grams generated by the Tree2Seq and Forest2Seq models compared to
Seq2Seq model for En→Ch dataset.
Perplexity BLEU
Forest encoder W/ sequential part 16.66 12.38
Forest encoder W/O sequential part 17.48 11.97
Table 3.3: The effect of the sequential part in the forest encoder (En → Fa).
as it reduces syntactic errors by using forests instead of top-1 parse trees. Our re-
sults confirm those in (Eriguchi et al., 2016) and show that using syntactic trees in
Tree2Seq improves the translation quality compared to the vanilla Seq2Seq. Com-
paring the BLEU scores of the forest-based and tree-based models, the largest increase is observed for the English to Farsi pair. This can be attributed to the syntactic divergence between English and Farsi (SVO vs. SOV) as well as the reduction of significant errors in the top-1 parse trees for this translation task, resulting from the domain mismatch between the parser's training data (i.e. Penn Tree Bank) and the English source (i.e. informal movie subtitles).
3.4.3 Analysis
The effect of the sequential part in the forest encoder. The forest encoder consists of sequential and recursive parts, where the former is the vanilla sequence
encoder. The attention mechanism attends to the embeddings of both sequential and
recursive parts. We investigate the effect of the sequential part in the proposed forest
encoder. Table 3.3 shows the results on the test set of En → Fa dataset. The results
show that the sequential part in the forest encoder leads to an improvement in results. Speculatively, the sequential part helps the forest encoder by providing context-aware embeddings for words, which are then used to construct phrase embeddings.
English → German English → Chinese English → Farsi
Method H Perplexity BLEU Perplexity BLEU Perplexity BLEU
Seq2Seq 256 33.07 11.98 6.48 25.43 19.21 10.17
(Luong et al., 2015a) 512 32.61 12.21 6.12 26.77 18.4 10.93
Tree2Seq 256 30.13 13 6.17 26.85 17.94 11.32
(Eriguchi et al., 2016) 512 31.86 13.05 5.71 28 16.28 11.71
our 256 30.83 13.54 6.16 27.08 17.62 11.91
Forest2Seq 512 29.25 13.43 5.49 28.39 16.66 12.38
Table 3.4: Comparison of the methods together with different hidden dimension size (H) for all datasets.
Figure 3.3: (a) Attention ratios for the bucketed En→Fa dataset. (b) Inference time (seconds) required for the test set of the En→Fa dataset using trained models.
How much attention do the forest-based and tree-based models pay to the syntactic information? We next analyse the extent to which the syntactic information is used by the Tree2Seq and Forest2Seq models. We compute the
ratio of attention on phrases to words for both of the syntax-aware models in En→Fa
translation task, where the source and target languages are highly syntactically di-
vergent. For each triple in the test set, we calculate the sum of attention on words
and phrases during decoding. Then, the ratio of attention on phrases to words is
computed and averaged over all triples. Figure 3.3a shows these attention ratios for the bucketed En→Fa dataset. It shows that for all sentence lengths, the Forest2Seq
model provides richer phrase embeddings compared to the Tree2Seq model, leading
to more usage of the syntactic information.
3.5 Summary
We have proposed a forest-to-sequence attentional NMT model, which uses a packed
forest instead of the top-1 parse tree in the encoder. Using a forest of parse trees,
our method efficiently considers combinatorially many constituency trees in order to
take into account parser uncertainties and errors. Experimental results show that our method is superior to the attentional tree-to-sequence model, which is more prone to parsing errors.
Part II
Chapter 4
The main aim of this thesis is to improve bilingually low-resource NMT by incorpo-
rating linguistic knowledge. In Part I, we incorporated linguistic knowledge provided
in the form of annotations of source sentences. In this part and the next one, we
will focus on directly injecting linguistic knowledge. More specifically, we scaffold
the machine translation task on auxiliary tasks, including semantic parsing, syntac-
tic parsing, and named-entity recognition. As discussed in Section 2.3.2, a practical
approach for injecting knowledge from a task to others is Multi-Task Learning. It is
based on tying parameters among different tasks to share statistical strength. In this
chapter, we start by casting the auxiliary linguistic tasks as Seq2Seq transduction
tasks to make all tasks structurally homogeneous. Then, we propose a multitask ar-
chitecture that enables an effective sharing strategy between tasks by tying a fraction
of their parameters with those of the main translation task. Further, we make use
of adversarial training to protect shared representations from being contaminated by
task-specific features. Our extensive experiments and analyses show the effectiveness
of the proposed approach for improving bilingually low-resource NMT by incorporat-
ing linguistic knowledge in several language pairs.
4.1 Introduction
NMT with attentional encoder-decoder architectures (Luong et al., 2015c; Bahdanau
et al., 2015) has revolutionised machine translation, and achieved state-of-the-art for
several language pairs. However, NMT is notorious for its need for large amounts of
bilingual data (Koehn and Knowles, 2017) to achieve reasonable translation quality.
Leveraging existing monolingual resources is a potential approach for compensating
for this requirement in bilingually scarce scenarios. Ideally, semantic and syntactic knowl-
edge learned from existing linguistic resources provides NMT with proper inductive
biases, leading to increased generalisation and better translation quality.
Multi-task learning is an effective approach to inject knowledge into a task from
other related tasks. Various recent works have attempted to improve NMT with an
MTL approach (Peng et al., 2017; Liu et al., 2017; Zhang and Zong, 2016); however,
they either do not make use of curated linguistic resources (Domhan and Hieber, 2017;
Zhang and Zong, 2016), or their MTL architectures are restrictive, yielding mediocre
improvements (Niehues and Cho, 2017). The current research leaves open how to
best leverage curated linguistic resources in a suitable MTL framework to improve
NMT.
In this chapter, we make use of curated monolingual linguistic resources in the
source side to improve NMT in bilingually scarce scenarios. More specifically, we
scaffold the machine translation task on auxiliary tasks, including semantic parsing,
syntactic parsing, and named-entity recognition. This is achieved by casting the aux-
iliary tasks as Seq2Seq transduction tasks and tying the parameters of their encoders
and/or decoders with those of the main translation task. Our MTL architectures
make use of deep stacked encoders and decoders, where the parameters of the top
layers are shared across the tasks. We further make use of adversarial training to
prevent contamination of shared knowledge with task-specific information.
We present empirical results on translating from English into Vietnamese, Turkish,
Farsi and Spanish; four target languages with varying degrees of divergence from English. Empirical results demonstrate the effectiveness of the proposed approach
in improving the translation quality for these four translation tasks in bilingually
scarce scenarios.
In summary, the contributions of this chapter can be categorised as follows:
• We propose a partial sharing strategy for Seq2Seq MTL models and show it
is crucial for improving low-resource NMT using curated linguistic resources.
• We further improve the partial sharing strategy by adding adversarial training
to prevent contamination of the shared representation space.
Deep Stacked Encoder. The deep encoder consists of multiple layers, where the hidden states in layer $\ell - 1$ are the inputs to the hidden states at the next layer $\ell$. That is,
$$\overrightarrow{h}^{\ell}_{i} = \overrightarrow{\mathrm{RNN}}_{\theta_{\ell,enc}}\big(\overrightarrow{h}^{\ell}_{i-1},\, h^{\ell-1}_{i}\big)$$
$$\overleftarrow{h}^{\ell}_{i} = \overleftarrow{\mathrm{RNN}}_{\theta_{\ell,enc}}\big(\overleftarrow{h}^{\ell}_{i+1},\, h^{\ell-1}_{i}\big)$$
where $h^{\ell}_{i} = [\overrightarrow{h}^{\ell}_{i}; \overleftarrow{h}^{\ell}_{i}]$ is the hidden state of the $\ell$-th layer RNN encoder for the $i$-th
source sentence word. The inputs to the first-layer forward/backward RNNs are the source word embeddings $E_S[x_i]$. The representation of the $i$-th source word is then the concatenation of the hidden states of all layers, $h_i = [h^{1}_{i}; \ldots; h^{L}_{i}]$, which is then used by the decoder.
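For illustration, a minimal PyTorch sketch of such a deep stacked bidirectional encoder is given below; the class name and hyper-parameters are placeholders, not the exact configuration used in the experiments.

```python
import torch
import torch.nn as nn

class DeepBiEncoder(nn.Module):
    """Sketch of the deep stacked bidirectional encoder: layer l consumes the hidden
    states of layer l-1, and the word representation concatenates all layers' states."""

    def __init__(self, vocab_size, d_emb=512, d_hid=512, n_layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_emb)
        self.layers = nn.ModuleList([
            nn.LSTM(d_emb if l == 0 else 2 * d_hid, d_hid,
                    batch_first=True, bidirectional=True)
            for l in range(n_layers)
        ])

    def forward(self, src):                      # src: (batch, src_len) word ids
        x = self.embed(src)                      # first-layer input: word embeddings
        all_layers = []
        for rnn in self.layers:
            x, _ = rnn(x)                        # (batch, src_len, 2*d_hid) = [fwd; bwd]
            all_layers.append(x)
        # h_i = [h_i^1; ...; h_i^L], later consumed by the decoder's attention
        return torch.cat(all_layers, dim=-1)
```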
Deep Stacked Decoder. Similar to the multi-layer RNN encoder, the decoder
RNN has multiple layers:
in which $c_j$ is the dynamic source context, as defined in eqn. 2.48. The state of the decoder is then the concatenation of the hidden states of all layers, $s_j = [s^{1}_{j}; \ldots; s^{L}_{j}]$, which is then used in eqn. 2.41 as part of the “output generation module”.
Shared Layer MTL. We share the deep layer RNNs in the encoders and/or de-
coders across the tasks, as a mechanism to share abstract knowledge and increase
model generalisation.
Suppose we have a total of $M + 1$ tasks, consisting of the main task plus $M$ auxiliary tasks. Let $\Theta^{m}_{enc} = \{\theta^{m}_{\ell,enc}\}_{\ell=1}^{L}$ and $\Theta^{m}_{dec} = \{\theta^{m}_{\ell',dec}\}_{\ell'=1}^{L'}$ be the parameters of the multi-layer encoder and decoder for task $m$. Let $\{\Theta^{m}_{enc}, \Theta^{m}_{dec}\}_{m=1}^{M}$ and $\{\Theta_{enc}, \Theta_{dec}\}$ be the RNN parameters for the auxiliary tasks and the main task, respectively. We share the parameters of the deep-level encoders and decoders of the auxiliary tasks with those of the main task. That is,
$$\forall m \in [1,..,M]\ \ \forall \ell \in [1,..,L^{m}_{enc}] :\ \theta^{m}_{\ell,enc} = \theta_{\ell,enc}$$
$$\forall m \in [1,..,M]\ \ \forall \ell' \in [1,..,L'^{m}_{dec}] :\ \theta^{m}_{\ell',dec} = \theta_{\ell',dec}$$
where $L^{m}_{enc}$ and $L'^{m}_{dec}$ specify the deep-layer RNNs whose parameters need to be shared.
Other parameters to share across the tasks include those of the attention module, the
source/target embedding tables, and the output generation module. As an extreme
case, we can share all the parameters of Seq2Seq architectures across the tasks.
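As a concrete illustration, the following minimal sketch ties the top encoder layer(s) across tasks by re-using the same module objects; the layout and helper name are assumptions for illustration, not the exact implementation.

```python
import torch.nn as nn

def build_mtl_encoders(n_tasks, d=512, n_layers=3, n_shared_top=1):
    """Sketch: per-task stacked bidirectional LSTM encoders whose top `n_shared_top`
    layers are the very same module objects for every task (tied parameters),
    while the lower layers stay private to each task."""
    shared_top = [nn.LSTM(2 * d, d, batch_first=True, bidirectional=True)
                  for _ in range(n_shared_top)]
    encoders = []
    for _ in range(n_tasks):
        # private bottom layers (assumes n_shared_top < n_layers)
        private = [nn.LSTM(d if l == 0 else 2 * d, d,
                           batch_first=True, bidirectional=True)
                   for l in range(n_layers - n_shared_top)]
        encoders.append(nn.ModuleList(private + shared_top))
    return encoders
```

Gradients from every task then update the shared top layer(s), while each task's private layers are free to learn task-specific representations.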
Training Objective. The parameters of the MTL architecture are learned by maximising the following objective:
$$\mathcal{L}_{mtl}(\Theta_{mtl}) := \sum_{k=0}^{K} \sum_{(x,y)\in D_k} w^{(k)} \log P_{\Theta_{mtl}}(y\,|\,x). \qquad (4.1)$$
where $\Theta_{mtl}$ denotes all the parameters of the MTL architecture and $w^{(k)}$ denotes the importance weight of task $k$, which we set to
$$w^{(k)} = \frac{\gamma^{(k)}}{N_k}$$
where $\gamma^{(k)}$ balances out the influence of task $k$ in the training objective. In Chapter 6, we will come back to this training objective by proposing adaptive importance weights.
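A minimal sketch of computing this weighted objective is shown below; the function names are illustrative, and $N_k$ is taken here to be the size of task $k$'s training set, which is an assumption.

```python
def mtl_loss(batches_by_task, log_likelihood_fn, gammas, dataset_sizes):
    """Sketch of eqn. 4.1: each task k contributes its log-likelihood scaled by
    w(k) = gamma(k) / N_k (N_k assumed to be the size of task k's training set)."""
    total = 0.0
    for k, batch in batches_by_task.items():
        w_k = gammas[k] / dataset_sizes[k]                  # importance weight w(k)
        total = total + w_k * log_likelihood_fn(k, batch)   # w(k) * sum log P(y|x)
    return -total                                           # negate for gradient descent
```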
Training Schedule. Variants of stochastic gradient descent (SGD) can be used
to optimise the objective in order to learn the parameters. Making the best use of
tasks with different objective geometries is challenging, e.g. due to the scale of their
gradients. One strategy for making an SGD update is to select the tasks from which
the next data items should be chosen. In our training schedule, we randomly select
a training data item from the main task and pair it with a data item selected from a
randomly selected auxiliary task for making the next SGD update. This ensures the
presence of a training signal from the main task in all SGD updates and avoids the
training signal being washed out by the auxiliary tasks. In Chapter 7, we will go back
to the scheduling of Biased-MTL and replace this predefined schedule by learning an
adaptive schedule.
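A minimal sketch of this random pairing schedule is given below; the batching and update interfaces are illustrative assumptions.

```python
import random

def mtl_training_schedule(main_batches, aux_batches_by_task, update_fn, n_steps):
    """Sketch of the schedule above: every SGD update combines one batch from the
    main translation task with one batch from a randomly chosen auxiliary task,
    so the main-task signal is present in every update."""
    aux_tasks = list(aux_batches_by_task.keys())
    for _ in range(n_steps):
        main_batch = random.choice(main_batches)
        aux_task = random.choice(aux_tasks)
        aux_batch = random.choice(aux_batches_by_task[aux_task])
        # one SGD update on the combined loss of the two selected batches
        update_fn([("translation", main_batch), (aux_task, aux_batch)])
```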
Task Discriminator. The goal of the task discriminator is to predict the identity
of a task for a data item based on the representations of the share layers. More
specifically, our task discriminator consists of two RNNs with LSTM units, each of
which encodes the sequence of hidden states in the shared layers of the encoder and
the decoder.1 The last hidden states of these two RNNs are then concatenated, giving
rise to a fixed dimensional vector summarising the representations in the shared layers.
The summary vector is passed through a fully connected layer, followed by a softmax
to predict the probability distribution over the tasks:
$$P_{\Theta_d}(\cdot \,|\, x, y) = \mathrm{softmax}\big(W_d\, \mathrm{disLSTMs}(\mathrm{shrRep}_{\Theta_{mtl}}(x, y)) + b_d\big)$$
where disLSTMs denotes the discriminator LSTMs, shrRepΘmtl (x, y) denotes the
representations in the shared layer of deep encoders and decoders in the MTL archi-
tecture, and Θd includes the disLSTMs parameters as well as {Wd , bd }.
1
When multiple layers are shared, we concatenate their hidden states at each time step, which is
then inputted to the task discriminator’s LSTMs.
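For illustration, a minimal sketch of such a discriminator is given below; the module sizes and the exact packing of the shared states are assumptions.

```python
import torch
import torch.nn as nn

class TaskDiscriminator(nn.Module):
    """Sketch of the task discriminator: two LSTMs summarise the shared-layer states
    of the encoder and decoder; their last states are concatenated and classified."""

    def __init__(self, d_shared, d_hid, n_tasks):
        super().__init__()
        self.enc_lstm = nn.LSTM(d_shared, d_hid, batch_first=True)
        self.dec_lstm = nn.LSTM(d_shared, d_hid, batch_first=True)
        self.out = nn.Linear(2 * d_hid, n_tasks)     # fully connected layer {W_d, b_d}

    def forward(self, enc_shared, dec_shared):
        # enc_shared/dec_shared: (batch, len, d_shared) hidden states of shared layers
        _, (h_enc, _) = self.enc_lstm(enc_shared)
        _, (h_dec, _) = self.dec_lstm(dec_shared)
        summary = torch.cat([h_enc[-1], h_dec[-1]], dim=-1)   # fixed-size summary vector
        return torch.log_softmax(self.out(summary), dim=-1)   # log P(task | shared reps)
```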
Adversarial Objective. Inspired by (Chen et al., 2017b), we add two additional
terms to the MTL training objective in eqn. 4.1. The first term is Ladv1 (Θd ) defined
as:
$$\sum_{k=0}^{K} \sum_{(x,y)\in D_k} \log P_{\Theta_d}\big(k \,\big|\, \mathrm{disLSTMs}(\mathrm{shrRep}_{\Theta_{mtl}}(x, y))\big).$$
Maximising the above objective over Θd ensures proper training of the discriminator
to predict the identity of the task. The second term ensures that the parameters of
the shared layers are trained so that they confuse the discriminator by maximising
the entropy of its predicted distribution over the task identities. That is, we add the
term Ladv2 (Θmtl ) to the training objective defined as:
$$\sum_{k=0}^{K} \sum_{(x,y)\in D_k} \mathbb{H}\big[ P_{\Theta_d}\big(\cdot \,\big|\, \mathrm{disLSTMs}(\mathrm{shrRep}_{\Theta_{mtl}}(x, y))\big)\big]$$
where $\mathbb{H}[\cdot]$ is the entropy of a distribution. In summary, the adversarial training leads to the following optimisation:
$$\max_{\Theta_{mtl},\,\Theta_d}\ \mathcal{L}_{mtl}(\Theta_{mtl}) + \lambda\, \mathcal{L}_{adv_2}(\Theta_{mtl}) + \mathcal{L}_{adv_1}(\Theta_d).$$
We maximise the above objective by SGD, and update the parameters by alternating between optimising $\mathcal{L}_{mtl}(\Theta_{mtl}) + \lambda\, \mathcal{L}_{adv_2}(\Theta_{mtl})$ and $\mathcal{L}_{adv_1}(\Theta_d)$.
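A minimal sketch of the alternating updates is given below; the model/discriminator interfaces (e.g. shared_representations and task_loss), the batch format, and the assumption that the discriminator returns log-probabilities are all illustrative, not the actual implementation.

```python
def adversarial_mtl_step(batch, mtl_model, discriminator, opt_mtl, opt_d, lam=0.1):
    """Sketch of one round of the alternating optimisation: (1) update the
    discriminator on L_adv1; (2) update the MTL model on L_mtl + lambda * L_adv2."""
    # --- step 1: train the discriminator to identify the task from shared states ---
    shared = mtl_model.shared_representations(batch).detach()   # no grad into MTL model
    logp = discriminator(shared)                                 # (batch, n_tasks) log-probs
    loss_d = -logp.gather(1, batch["task_id"].unsqueeze(1)).mean()
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # --- step 2: train the MTL model; the entropy term confuses the discriminator ---
    loss_mtl = mtl_model.task_loss(batch)                        # -log P(y|x) for this task
    logp = discriminator(mtl_model.shared_representations(batch))
    entropy = -(logp.exp() * logp).sum(dim=1).mean()
    loss = loss_mtl - lam * entropy                              # maximising entropy
    opt_mtl.zero_grad(); loss.backward(); opt_mtl.step()
```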
4.4 Experiments
4.4.1 Bilingual Corpora
We use four language-pairs, translating from English to Vietnamese, Turkish, Farsi
and Spanish. We have chosen these languages to analyse the effect of Multi-Task
Learning on languages with different underlying linguistic structures.2 The sentences
are segmented using BPE (Sennrich et al., 2016b) on the union of source and tar-
get vocabularies with a 40k vocabulary size for English to Vietnamese, Spanish and
Turkish. For English-Farsi, BPE is performed using separate vocabularies due to the
disjoint alphabets with 30k vocabulary sizes.
Table 4.1 shows some statistics about the bilingual corpora. Further details about the corpora and their pre-processing are as follows:
2
English, Vietnamese and Spanish are SVO while Farsi and Turkish are SOV.
          Train   Dev     Test
En → Vi   133K    1,553   1,268
En → Tr   200K    2,910   2,898
En → Es   150K    2,928   2,898
En → Fa   98K     3,000   4,000

Table 4.1: Statistics of the bilingual corpora.
• The English-Vietnamese is from the translation task in IWSLT 2015, and we use
the preprocessed version provided by (Luong and Manning, 2015). The sentence
pairs in which at least one of their sentences had more than 300 units (after ap-
plying BPE) are removed. “tst2012” and “tst2013” parts are used for validation
and test sets, respectively.
• The English-Farsi corpus is assembled from all the parallel news text in LDC2016E93
Farsi Representative Language Pack from the Linguistic Data Consortium, com-
bined with English-Farsi parallel subtitles from the TED corpus (Tiedemann,
2012). Since the TED subtitles are user-contributed, this text contained con-
siderable variation in the encoding of its Perso-Arabic characters. To address this
issue, we have normalised the corpus using the Hazm toolkit.3 Sentence pairs in
which one of the sentences has more than 80 tokens (before applying BPE) are removed,
and BPE is performed with a 30k vocabulary size. Random subsets of this corpus
(3k and 4k sentences each) are held out as validation and test sets, respectively.
• The English-Turkish is from WMT parallel corpus (Bojar et al., 2016) with about
200K training pairs gathered from news articles. We use the Moses toolkit (Koehn
et al., 2007) to filter out pairs where the number of tokens is more than 250
(after applying BPE) and pairs with a source/target length ratio higher than 1.5.
“newstest2016” and “newstest2018” parts are used as validation and test set.
• For the English-Spanish pair, we have used the first 150K training pairs of the Europarl corpus (Koehn, 2005). We have also applied a filtering process similar to the one used for English-Turkish. “newstest2011” and “newstest2013” parts are used
as validation and test set, respectively.
3
www.sobhe.ir/hazm
4.4.2 Auxiliary Tasks
We have chosen the following auxiliary tasks to provide the NMT model with syntactic
and/or semantic knowledge, in order to enhance the quality of translation:
Syntactic Parsing. By learning the phrase structure of the input sentence, the model would be able to learn better reordering, especially in the case of language pairs with a high level of syntactic divergence (e.g. English-Farsi). We have used the Penn Tree Bank parsing data with the standard split for training, development, and
test (Marcus et al., 1993). We cast syntactic parsing to a Seq2Seq transduction task
by linearising constituency trees (Vinyals et al., 2015).
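For illustration, a minimal sketch of such a linearisation is given below; the nested-tuple tree format is an assumption, and the example reproduces the bracketing shown in Figure 2.14.

```python
def linearise(tree):
    """Turn a constituency tree, given as nested tuples (label, children...),
    into a bracketed token sequence suitable for Seq2Seq transduction."""
    if isinstance(tree, str):          # leaf: a word
        return [tree]
    label, *children = tree
    tokens = [f"({label}"]
    for child in children:
        tokens += linearise(child)
    tokens.append(f"){label}")
    return tokens

tree = ("ROOT", ("S", ("NP", "Jane"), ("VP", "had", ("NP", "a", "cat")), "."))
print(" ".join(linearise(tree)))
# (ROOT (S (NP Jane )NP (VP had (NP a cat )NP )VP . )S )ROOT
```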
                 English→Vietnamese       English→Turkish            English→Spanish          English→Farsi
                 Dev          Test         Dev           Test          Dev          Test         Dev           Test
Method           TER    BLEU  TER    BLEU  TER     BLEU   TER    BLEU  TER    BLEU  TER    BLEU  TER     BLEU   TER     BLEU
NMT              58.4   22.83 55.7   24.15 104.2   8.55   101.2  8.5   73.1   14.49 73.6   13.44 96.1    12.16  96.7    11.95
MTL (Full)
 + All Tasks     57.2†  22.71 55.1   24.71 90†     9.12†  88.8†  8.84  71.2†  14.63 71†    13.75 76.5†   12.67† 76.6†   12.45†
MTL (Partial)
 + Semantic      56.8†  23.26 54.52† 25.09† 88.24† 9.24†  87.06† 8.25  71.1†  14.68 70.9†  13.95† 74.8†  12.73† 75.2    12.27
 + NER           57.7†  22.54 54.72† 25†   83.14†  9.3†   81.47† 9†    73.7   14.23 73.8   13.28 74.71†  13.34† 73.43†  13.17†
 + Syntactic     57.3†  23.02 54.72† 24.74 80.39†  9.72†  79.9†  9.01† 71.8†  14.39 71.6†  13.59 73.04†  13.43† 73.82†  13.43†
 + All Tasks     57.4†  23.42† 54.32† 25.22† 79.12† 10.06† 79.61† 9.53† 70.4†  15.14† 70.2† 14.11† 72.84† 13.53† 73.43†  13.47†
 + All+Adv.      57†    23.49† 54.22† 25.56† 77.84† 10.44† 78.43† 9.98† 70.2†  15.2†  70†   14.28† 73.33† 13.23† 73.43†  13.17†

Table 4.2: BLEU and TER scores of the baselines vs. our partial parameter sharing MTL architecture with various auxiliary
tasks on the bilingual datasets. †: Statistically significantly better than the NMT baseline (p < 0.05).
are shared among the encoders of the tasks. Moreover, we share the parameters of
the bottom layers of stacked decoders among the tasks. As we will show in Section
4.4.5, different sharing scenarios (i.e. the number of shared layers) lead to the best
performance for different language pairs. In the experiments, we have used the best
sharing scenario for each of the language pairs reported in Section 4.4.5. Additionally,
source and target embedding tables are shared among the tasks, while the attention
component is task-specific.6 We compare against the following baselines:
The configuration of models is as follows. The encoders and decoders make use of
LSTM units with 512 hidden dimensions. For training, we used the Adam algorithm (Kingma and Ba, 2014) with an initial learning rate of 0.001 for all of the tasks. Learning rates are halved when the performance on the corresponding dev set decreases. In order to speed up the training, we use mini-batching with a size of 32.
Dropout rates for both encoder and decoder are set to 0.3, and models are trained for
25 epochs where the best models are selected based on the perplexity on the dev set.
λ for the adversarial training is set to 0.1. Once trained, the NMT model translates using greedy search. We use BLEU (Papineni et al., 2002) and TER (Snover et al., 2006) to measure translation quality, and measure statistical significance (p < 0.05) using approximate randomisation (Clark et al., 2011).
4.4.4 Results
Table 4.2 reports the BLEU and TER scores for the baselines and our proposed method
on the four translation tasks mentioned above. It can be seen that the performance of the Multi-Task Learning models is better than that of Baseline 1 (only the MT task). This confirms that adding auxiliary tasks helps to increase the performance of the machine translation task.

6 In our experiments, models with task-specific attention components achieved better results than those sharing them.
7 We have used their reported best performing architecture (shared encoder) and changed the training schedule to ours.

           W/O Adaptation                   W/ Adaptation
           Partial  Part.+Adv.  Full        Partial  Part.+Adv.  Full
En → Vi    25.22    25.56       24.71       25.45    25.86       24.83
En → Tr    9.53     9.98        8.84        9.63     10.09       9.18
En → Sp    14.11    14.28       13.75       14.05    14.2        13.46
En → Fa    13.47    13.17       12.45       13.23    12.99       12.21

Table 4.3: Our method (partial parameter sharing) against Baseline 2 (full parameter sharing). Part.+Adv. means partial parameter sharing and adversarial training have been employed.
As expected, the effect of different tasks is not similar across the language pairs,
possibly due to the following reasons: (i) the datasets of these translation tasks come from different domains, so they have varying degrees of domain relatedness to the auxiliary tasks, and (ii) the BLEU and TER scores of Baseline 1 show that the four translation models are at different quality levels, which may entail that they benefit from auxiliary knowledge to different degrees. In order to improve a model with low-quality
translations due to language divergence, syntactic knowledge can be more helpful, as it helps to perform better reorderings. In a higher-quality model, however, semantic knowledge can be more useful as a higher-level form of linguistic knowledge. This pattern can
be seen in the reported results: syntactic parsing leads to more improvement on Farsi
translation which has a low BLEU score and high language divergence to English, and
semantic parsing yields more improvement on the Vietnamese translation task which
already has a high BLEU score. The NER task has led to a steady improvement in
all the translation tasks, as it leads to better handling of named entities.
We have further added adversarial training to ensure that the shared representation learned by the encoder is not contaminated by task-specific information, as the MTL model should focus on extracting common and task-invariant features. The results are in the last row of Table 4.2. The experiments show that adversarial training
leads to further gains in MTL translation quality, except when translating into Farsi.
We speculate this is due to the low quality of NMT for Farsi, where updating shared
parameters with respect to the entropy of discriminator’s predicted distribution may
negatively affect the model.
Figure 4.1: Percentage of more correct n-grams generated by the deep MTL models compared to the single-task model (only MT) for En→Vi translation.

Table 4.3 compares our Multi-Task Learning approach to Baseline 2. As shown in Table 4.3, our partial parameter sharing mechanism is more effective than fully sharing
the parameters (Baseline 2), due to its flexibility in allowing access to private task-
specific knowledge. We also applied the adaptation technique (Niehues and Cho,
2017) as follows: upon finishing MTL training, we continue to train the best saved
model only on the MT task for another 20 epochs and choose the best model based
on perplexity on dev set. Adaptation has led to gains in the performance of our MTL
architecture and Baseline 2 on two language pairs.
4.4.5 Analysis
How many layers of encoder/decoder to share? In this analysis, we want to
see the effect of the number of shared layers in encoders and decoders on the perfor-
mance of the MTL model. Figure 4.2 shows the results on the En→Vi translation
task. The results confirm that the partial sharing of stacked layers is better than
full sharing. Intuitively, partial sharing provides the model with an opportunity to
learn task-specific skills via the private layers, while leveraging the knowledge learned
from other tasks via shared layers. The figure shows the best sharing strategy for
En→Vi translation task is 1-2 where 1 and 2 layers are shared among encoders and
decoders, respectively. We have found that this strategy works very well for En→Tr
and En→Fa as well. However, this sharing strategy does not work well for the En→Sp task, and the performance of the resulting MTL model is even worse than that of the single-task NMT model. Interestingly, we have found it is better for the En→Sp task to use the 2-1 sharing strategy, although this strategy does not work well for the other language pairs.
Therefore, there is a need for tuning over sharing strategies for each language pair. To speed up, the tuning process can be performed only on the 1-2 and 2-1 sharing strategies.

Figure 4.2: BLEU scores for different numbers of shared layers in the encoder (from top) and decoder (from bottom). The vocabulary is shared among tasks while each task has its own attention mechanism.
Effect of the NER task. The NMT model has difficulty translating rarely occurring named-entities, particularly when the bilingual parallel data is scarce. We expect learning from the NER task to lead the MTL model to recognise named-entities and learn underlying patterns for translating them. The top part of Table 4.4 shows an example of such a situation. As seen, the MTL model is able to recognise all of the named-entities in the sentence and translate them, while the single-task model missed “1981”.
English AIDS was discovered 1981 ; the virus , 1983 .
Reference AIDS disease was discovered 1981 ; its virus on 1983 .
MT only model AIDS was discovered 1998 ; virus , 1983 .
MT+NER model AIDS was discovered 1981 ; virus , 1983 .
English of course , you couldn’t have done that all alone .
Reference of course, you couldn’t all that alone have done that .
MT only model of course, you cannot have done that .
MT+semantic model of course, you couldn’t all that alone have done that .
English              in hospitals , for new medical instruments ; in streets for traffic control .
Reference            in hospitals , for instruments medical new ; in streets for control traffic .
MT only model        in hospitals , for equipments medical new , in streets for control instruments new .
MT+syntactic model   in hospitals , for instruments medical new ; in streets for control traffic .
Table 4.4: Examples of translations on the Farsi test set. In these examples, each Farsi word
is replaced with its English translation, and the order of words is reversed (Farsi is
written right-to-left). The structure of Farsi is Subject-Object-Verb (SOV), leading
to different word orders in English and Reference sentences.
For further analysis, we have applied a Farsi POS tagger (Feely et al., 2014) to the gold translations. Then, we extracted n-grams with at least one noun in them, and report the statistics of correct such n-grams, similar to what is reported in Figure 4.1. The resulting statistics are depicted in Figure 4.3. As seen, the MTL model trained on the MT and NER tasks leads to the generation of more correct noun n-grams relative to the vanilla NMT, especially as n increases.
Effect of the syntactic parsing task. Recognising the syntactic structure of the
source sentence helps NMT to translate phrases better. The bottom part of Table
4.4 shows an example of a translation demonstrating such a case. The source sentence is talking about “new medical instruments” and “traffic control”, where the second term is correctly translated by the MTL model while the vanilla NMT has mistakenly translated it to “control instruments new”.
Figure 4.3: Percentage of more correct n-grams with at least one noun generated by the MT+NER model compared with the single-task model (only MT) for the En→Vi language pair.
4.5 Summary
We have presented an approach to improve NMT in bilingually scarce scenarios,
by leveraging curated linguistic resources in the source, including semantic parsing,
syntactic parsing, and named entity recognition. This is achieved via a powerful
MTL architecture, based on deep stacked encoders and decoders, to share common
knowledge among the MT and auxiliary tasks. Our experimental results show the
effectiveness of the proposed approach on improving the translation quality when
translating from English to Vietnamese, Turkish, Spanish and Farsi in bilingually
scarce scenarios.
Chapter 5
In the previous chapter, we proposed an MTL architecture based on deep stacked lay-
ers in encoder/decoder and sharing the layers partially. The hypothesis behind partial
sharing is that although similar tasks share some commonalities, they have some un-
derlying differences which should be captured. Otherwise, the MTL models would
not be able to make the best use of auxiliary tasks. Empirical results and analyses
support the hypothesis by showing that dedicating task-specific parameters is crucial
and provides the model with the capacity to learn task-specific representations. How-
ever, the approach has two limitations: (1) the shared components (layers) are shared
among all of the tasks, causing this MTL approach to suffer from task interference; (2) finding the optimal sharing policy is a computationally expensive search.
In this chapter, we address both of these limitations by learning how to adaptively
control the amount of sharing. We extend the recurrent units with multiple blocks
along with a routing network to dynamically control sharing of blocks conditioned on
the task at hand, the input, and model state. Empirical results and analyses show
the effectiveness of the proposed approach on four bilingually low-resource scenarios.
5.1 Introduction
MTL has been used for various NLP problems, e.g. dependency parsing (Peng et al., 2017), video captioning (Pasunuru and Bansal, 2017), key-phrase boundary classification (Augenstein and Søgaard, 2017), Chinese word segmentation, and text classification (Liu et al., 2017). More specifically, as seen in the previous chapter,
MTL is an effective approach for injecting knowledge obtained from linguistic tasks
to improve the performance of the underlying translation model. However, injecting
knowledge from auxiliary tasks is a double-edged sword. While positive transfer may
help to improve the performance of the main translation task, negative transfer may have unfavourable effects and degrade the translation quality. This can be
seen in the observations of the previous chapter that show dedicating task-specific
parameters is crucial, as it provides the model with the capacity to learn task-specific
representations. Although the proposed approach encouraged a more positive trans-
fer compared to the full sharing approach, it still has two limitations: (1) Since the
shared components (layers) are shared among all of the tasks, the approach still suf-
fers from task interference, and it is not able to fully capture commonalities among
subsets of tasks. (2) Experiments have shown that different language pairs have dif-
ferent optimal sharing strategy, and we need to perform a computationally expensive
search to find the optimal strategy. Recently, Ruder et al. (2017) tried to address the
task interference issue; however, their method is restrictive for Seq2Seq scenarios
and does not consider the input at each time step to modulate parameter sharing.
In this chapter, we address the aforementioned issues by learning how to control
the amount of sharing among the tasks “dynamically”. We propose a novel recurrent
unit which is an extension of conventional recurrent units. It has multiple flows of
information, controlled by multiple blocks, and is equipped with a “routing network”
to dynamically control sharing of blocks conditioning on the task at hand, the input,
and model state. Empirical results on four low-resource translation scenarios, English
to Vietnamese, Turkish, Spanish and Farsi, show the effectiveness of the proposed
model.
In summary, the contributions of this chapter are as follows:
• We propose a novel recurrent unit with multiple flows of information along with
a router network to dynamically control the amount of sharing among all tasks.
• We present empirical results and analyses that show the effectiveness of the
proposed approach on leveraging the commonalities among subsets of tasks.
5.2 Routing Networks for Deep Neural Networks
Our work is related to the Mixture-of-Expert (MoE) architectures, where multiple
experts (sub-networks) are learned to cover different subspaces of the problem. The
beating heart of an MoE model is its routing network (aka gating network) responsible
for modulating the input and output of these experts. MoE was introduced two decades ago (Jacobs et al., 1991; Jordan and Jacobs, 1994), and has gained attention in deep
learning recently. Shazeer et al. (2017) uses MoEs (feed-forward sub-networks) be-
tween stacked layers of recurrent units, to adaptively gate state information vertically.
This is in contrast to our approach where the horizontal information flow is adaptively
modulated, as we would like to minimise the task interference in MTL. Rosenbaum
et al. (2018) has proposed a routing network for adaptive selection of non-linear func-
tions for MTL. However, it is for fixed-size inputs based on a feed-forward architecture
and is not applicable to Seq2Seq scenarios such as MT.
There are two variants of routing networks with respect to the type of their decisions: hard versus soft decisions. In this chapter, we employ soft decision making. Hard decision making is not differentiable, and can be handled by techniques like Gumbel-Softmax (Jang et al., 2016) or Reinforcement Learning (RL). It becomes more challenging as the environment is non-stationary, since both the router and the experts are trained simultaneously. After the publication of our work, it was followed by Cases et al. (2019), who propose a hard version of the routing mechanism using RL.
Figure 5.1: High-level architecture of the proposed recurrent unit with 3 shared blocks
and 1 task-specific block.
direct the input at each time step to these blocks (see Figure 5.1). Each block mimics
an expert in handling different kinds of information, coordinated by the router. In
MTL, the tasks can use different subsets of these shared experts.
Assuming there are n blocks in a recurrent unit, we share n − 1 blocks among the
tasks, and let the last one be task-specific.1 The task-specific block receives the input
of the unit directly while shared blocks are fed with modulated input by the routing
network. The state of the unit at each time-step would be the aggregation of blocks’
states.
where the $W$'s and $b$'s are the parameters. Then, the $i$-th shared block is fed with the input of the unit modulated by the corresponding output of the routing network, $\tilde{x}_t^{(i)} = \tau_t[i]\, x_t$, where $\tau_t[i]$ is the scalar output of the routing network for the $i$-th block. The hidden state of the unit is the concatenation of the hidden states of the shared and task-specific parts, $h_t = [h_t^{(shared)}; h_t^{(task)}]$. The state of the task-specific part is the state of the corresponding block, $h_t^{(task)} = h_t^{(n)}$, and the state of the shared part is the sum of the states of the shared blocks weighted by the outputs of the routing network, $h_t^{(shared)} = \sum_{i=1}^{n-1} \tau_t[i]\, h_t^{(i)}$.
¹ Multiple recurrent units can be stacked on top of each other to form a multi-layer component.
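To make the description concrete, the following is a minimal PyTorch sketch of one possible implementation of such a unit, assuming LSTM cells as the blocks and a softmax routing network conditioned on a task embedding, the input, and the previous shared states; the class name and the exact router parametrisation are our illustrative assumptions, not the thesis implementation.

```python
import torch
import torch.nn as nn

class RoutedRecurrentUnit(nn.Module):
    """A recurrent unit with n blocks: n-1 shared blocks whose inputs are
    modulated by a routing network, plus one task-specific block."""

    def __init__(self, input_size, block_size, n_blocks, n_tasks):
        super().__init__()
        self.n_shared = n_blocks - 1
        # Each block acts as an expert; LSTM cells are used here as an example.
        self.shared_blocks = nn.ModuleList(
            [nn.LSTMCell(input_size, block_size) for _ in range(self.n_shared)])
        self.task_block = nn.LSTMCell(input_size, block_size)
        # The router conditions on the task, the input, and the previous shared states.
        self.task_emb = nn.Embedding(n_tasks, block_size)
        router_in = input_size + block_size + self.n_shared * block_size
        self.router = nn.Sequential(nn.Linear(router_in, self.n_shared),
                                    nn.Softmax(dim=-1))

    def forward(self, x_t, task_id, shared_states, task_state):
        # shared_states: list of (h, c) pairs for the shared blocks; task_state: (h, c).
        prev_h = torch.cat([h for h, _ in shared_states], dim=-1)
        tau = self.router(torch.cat([x_t, self.task_emb(task_id), prev_h], dim=-1))
        new_shared, weighted = [], []
        for i, block in enumerate(self.shared_blocks):
            x_mod = tau[:, i:i + 1] * x_t                  # modulated input for block i
            h_i, c_i = block(x_mod, shared_states[i])
            new_shared.append((h_i, c_i))
            weighted.append(tau[:, i:i + 1] * h_i)
        h_shared = torch.stack(weighted).sum(dim=0)        # weighted sum of shared states
        h_task, c_task = self.task_block(x_t, task_state)  # task-specific block gets the raw input
        h_t = torch.cat([h_shared, h_task], dim=-1)        # unit state = [shared ; task-specific]
        return h_t, new_shared, (h_task, c_task)
```

At each time step the unit returns the concatenated state together with the updated block states, which a Seq2Seq encoder or decoder layer can carry across time steps.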
5.4 Experiments
For the experiments of this chapter, we have used the bilingual corpora and auxiliary
tasks discussed in Sections 4.4.1 and 4.4.2, respectively.
              English → Vietnamese        English → Turkish           English → Spanish          English → Farsi
              Dev          Test           Dev          Test           Dev          Test           Dev          Test
              TER   BLEU   TER   BLEU     TER   BLEU   TER   BLEU     TER   BLEU   TER   BLEU     TER   BLEU   TER   BLEU
NMT           58.4  22.83  55.7  24.15    104.2 8.55   101.2 8.5      73.1  14.49  73.6  13.44    96.1  12.16  96.7  11.95
MTL (Full)    57.2  22.71  55.1  24.71    90    9.12   88.8  8.84     71.2  14.63  71    13.75    76.5  12.67  76.6  12.45
MTL (Partial) 57.4  23.42  54.32 25.22    79.12 10.06  79.61 9.53     70.4  15.14  70.2  14.11    72.84 13.53  73.43 13.47
MTL (Routing) 55.69† 24.17† 52.94† 26.39† 74.43† 11.09† 74.62† 10.62† 68.43† 16.23† 68.04† 15.1†  68.61† 14.9† 69.07† 14.64†
Table 5.1: The performance of the baselines vs. our MTL architecture on the bilingual datasets. †: Statistically significantly better (p < 0.05) than MTL (Partial).
5.4.1 Models and Baselines
We have implemented the proposed MTL architecture along with the baselines in
PyTorch on top of OpenNMT. For our MTL architecture, we used the proposed
recurrent unit with 3 blocks in the encoder and decoder. For a fair comparison in terms of the number of parameters, we used 3 stacked layers in both the encoder and decoder components of the baselines. We compare against the following baselines:
• Baseline 1: The vanilla Seq2Seq model (Luong et al., 2015a) without any
auxiliary task.
• Baseline 2: The MTL architecture proposed in (Niehues and Cho, 2017) which
fully shares parameters in components. We have used their best performing
architecture with our training schedule. We have extended their work with
deep stacked layers for the sake of comparison.
For the proposed MTL, we use recurrent units with 512 hidden dimensions for
each block. The encoders and decoders of the baselines use GRU units with 400
hidden dimensions. The attention component has 512 dimensions. We use Adam
optimiser (Kingma and Ba, 2014) with an initial learning rate of 0.001 for all the tasks. Learning rates are halved whenever the performance on the dev set of the corresponding task decreases. The mini-batch size is set to 32, and the dropout rate is 0.3.
All models are trained for 25 epochs, and the best models are saved based on the
perplexity on the dev set of the translation task.
For each task, we add special tokens to the beginning of the source sequence (similar to Johnson et al. (2017)) to indicate which task the sequence pair comes from. We
have used the same data for all models and baselines.
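As an illustration, marking each source sequence with its task could look like the minimal sketch below; the tag strings (e.g. "<mt>", "<syn>") are hypothetical placeholders, not the exact tokens used in our implementation.

```python
# Hypothetical task tags; the exact token strings are an assumption for illustration.
TASK_TOKENS = {"translation": "<mt>", "syntactic": "<syn>",
               "semantic": "<sem>", "ner": "<ner>"}

def tag_source(src_tokens, task):
    """Prepend the task token so the shared model knows which task the pair belongs to."""
    return [TASK_TOKENS[task]] + src_tokens

print(tag_source(["a", "bird", "flies"], "syntactic"))   # ['<syn>', 'a', 'bird', 'flies']
```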
We used greedy decoding to generate translations. In order to measure the trans-
lation quality, we use BLEU (Papineni et al., 2002) and TER (Snover et al., 2006)
scores, and measure the statistical significance (p < 0.05) using the approximate
randomisation (Clark et al., 2011).
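For reference, a minimal sketch of the paired approximate randomisation test is given below, assuming a corpus-level metric function (e.g. BLEU or TER) is available as corpus_metric; the number of trials and the tie handling are our assumptions, not the exact settings of Clark et al. (2011).

```python
import random

def approximate_randomization(sys_a, sys_b, refs, corpus_metric, trials=10000, seed=0):
    """Paired approximate randomization: randomly swap the two systems' outputs per
    sentence and count how often the shuffled score difference is at least as large
    as the observed one. Returns an estimated p-value."""
    rng = random.Random(seed)
    observed = abs(corpus_metric(sys_a, refs) - corpus_metric(sys_b, refs))
    count = 0
    for _ in range(trials):
        shuf_a, shuf_b = [], []
        for a, b in zip(sys_a, sys_b):
            if rng.random() < 0.5:
                a, b = b, a
            shuf_a.append(a)
            shuf_b.append(b)
        diff = abs(corpus_metric(shuf_a, refs) - corpus_metric(shuf_b, refs))
        if diff >= observed:
            count += 1
    return (count + 1) / (trials + 1)   # reject the null hypothesis at p < 0.05
```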
Model Number of Parameters
NMT 41M
MTL (Full) 51M
MTL (Partial) 100M
MTL (Routing) 105M
Table 5.2: The number of parameters for different models (for En→Vi language pair).
For MTL models, we report the total number of parameters for all of the tasks (shared
parameters are counted once).
Figure 5.2: Average percentage of block usage (between 0.2 and 0.5) of Block 1, Block 2 and Block 3 for the MT, semantic parsing, syntactic parsing and NER tasks (encoder recurrent units, English-Farsi test set).
Figure 5.2 shows the average percentage of block usage for each task in an MTL
model with 3 shared blocks, on the English-Farsi test set. We have aggregated the
output of the routing network for the blocks in the encoder recurrent units over all
the input tokens. Then, it is normalised by dividing by the total number of input tokens.
Based on Figure 5.2, the first and third blocks are more specialised (based on their usage) in the translation and NER tasks, respectively. The second block is mostly used by the semantic and syntactic parsing tasks, so it is specialised in them. This confirms that our model leverages the commonalities among subsets of tasks by dedicating common blocks to them, reducing task interference.
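A minimal sketch of how such block-usage statistics could be computed is shown below, assuming the routing outputs have been collected for every input token of each task; the function name and data layout are illustrative assumptions.

```python
import torch

def average_block_usage(router_outputs_per_task):
    """router_outputs_per_task maps a task name to a list of routing vectors tau_t,
    one per input token. Returns the average usage of each shared block per task,
    i.e. the aggregated routing outputs divided by the total number of tokens."""
    usage = {}
    for task, taus in router_outputs_per_task.items():
        taus = torch.stack(taus)                  # (num_tokens, n_shared_blocks)
        usage[task] = (taus.sum(dim=0) / taus.size(0)).tolist()
    return usage
```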
5.5 Summary
In this chapter, we address two of the main issues in previous MTL models: (1) task interference, which causes the inability to fully capture the commonalities among subsets of tasks; and (2) the need for a computationally expensive search to find the optimal sharing strategy. We address these issues by extending conventional recurrent units
with multiple blocks along with a trainable routing network. Each block mimics an
expert in handling a different kind of information, and the routing network guides the
input to these blocks conditioning on the task at hand, the input, and the model state.
Our experimental results on low-resource English to Vietnamese, Turkish, Spanish
and Farsi datasets, show the effectiveness of the proposed approach compared to the
full and partial sharing MTL models.
Part III
Chapter 6
Scarcity of parallel sentence pairs is a major challenge for training high-quality Neural
Machine Translation (NMT) models in bilingually low-resource scenarios, as NMT is
data-hungry. As seen in part II, Multi-Task Learning can alleviate this issue by
injecting inductive biases into NMT, using auxiliary syntactic and semantic tasks.
However, an effective training schedule is required to balance the importance of tasks
to get the best use of the training signal. The role of the training schedule becomes even more crucial in Biased-MTL, where the goal is to improve one of the tasks the most, e.g. the translation quality in our setting. Current approaches for Biased-MTL are based on brittle hand-engineered heuristics that require trial and error, and should be (re-)designed for each learning scenario. To the best of our knowledge, ours is the first work on adaptively and dynamically changing the training schedule in Biased-MTL. We propose a rigorous approach for automatically reweighting the training data of the main and auxiliary tasks throughout the training process based on their contributions to the generalisability of the main NMT task. Empirical results and analyses show the effectiveness of the proposed approach on four bilingually low-resource scenarios. Additionally, our analyses shed light on the dynamics of needs throughout the training
of NMT: from syntax to semantics.
Figure 6.1: The dynamics of the relative importance (average weight over training iterations) of named entity recognition, syntactic parsing, and semantic parsing as the auxiliary tasks for the main machine translation task (based on our experiments in Section 6.3). The plot compares our proposed adaptive scheduling (Adaptive-NER, Adaptive-Syntactic, Adaptive-Semantic) against the fixed scheduling of Kiperwasser and Ballesteros (2018), scaled down for better illustration.
6.1 Introduction
The majority of the MTL literature, including part II of this thesis, has focused on
investigating how to share common knowledge among the tasks through tying their
parameters and joint training using standard algorithms. However, a big challenge
of MTL is how to get the best signal from the tasks by changing their importance in
the training process, aka the training schedule; see Figure 6.1.
Crucially, as discussed in Section 2.3.2, a proper training schedule would encourage
positive transfer and prevent negative transfer, as the inductive biases of the auxiliary tasks may interfere with those of the main task, leading to a degradation of generalisation capabilities. Most of the works on the training schedule focus on general MTL, where the goal is to improve the performance of all tasks. They are based on addressing the imbalance in task difficulties so that easy and difficult tasks co-evolve uniformly (performance-wise). These methods achieve performance competitive with existing single-task models of each task, and not necessarily much better performance
(Chen et al., 2018b; Guo et al., 2018). On the other hand, Biased-MTL focuses on
the main task to achieve higher improvements in it. In Chapter 4, we have proposed
a fixed training schedule to balance out the importance of the main NMT task vs.
the auxiliary tasks to improve NMT the most. Kiperwasser and Ballesteros (2018) have shown the effectiveness of a changing training schedule throughout the MTL process.
However, their approach is based on hand-engineered heuristics, and should be (re-
)designed and fine-tuned for every change in tasks or even training data.
In this chapter, for the first time to the best of our knowledge, we propose a method
to adaptively and dynamically set the importance weights of tasks for Biased-MTL.
By using influence functions from robust statistics (Cook and Weisberg, 1980; Koh
and Liang, 2017), we adaptively examine the influence of training instances inside
mini-batches of the tasks on the generalisation capabilities on the main task. The
generalisation is measured as the performance of the main task on a validation set,
separated from the training set, in each parameter update step dynamically. As our
method is general and does not rely on hand-engineered heuristics, it can be used for
effective learning of multi-task architectures beyond NMT.
We evaluate our method on translating from English to Vietnamese, Turkish,
Spanish and Farsi, with auxiliary tasks including syntactic parsing, semantic parsing,
and named entity recognition. Compared to the strong training schedule baselines,
our method achieves considerable improvements in terms of BLEU score. Addition-
ally, our analyses of the weights assigned by the proposed training schedule show that although the dynamics of the weights differ across language pairs, the underlying pattern is to gradually shift the importance from syntactic to semantic-related tasks.
In summary, our main contributions to MTL and low-resource NMT are as follows:
• We extensively evaluate four language pairs, and experimental results show that
our model outperforms the hand-engineered heuristics.
parameters are learned by maximising the log-likelihood objective:
$$\arg\max_{\Theta_{mtl}} \sum_{k=0}^{K} \sum_{i=0}^{N_k} w_i^{(k)} \log P_{\Theta_{mtl}}(y_i^{(k)} \,|\, x_i^{(k)}).$$
$$\arg\min_{\hat{w}} \sum_{(x,y)\in D_{val}} - \log P_{\hat{\Theta}_{mtl}(\hat{w})}(y \,|\, x) \qquad (6.1)$$
$$\hat{\Theta}_{mtl}(\hat{w}) := \Theta_{mtl}^{(t)} + \eta \sum_{k=0}^{K} \sum_{i=0}^{|b^{(k)}|-1} \hat{w}_i^{(k)} \nabla \log P_{\Theta_{mtl}^{(t)}}(y_i^{(k)} \,|\, x_i^{(k)}) \qquad (6.2)$$
Figure 6.2: High-level idea for training an MTL architecture using adaptive impor-
tance weights (AIWs). Here, translation is the main task along with syntactic and
semantic parsing as auxiliary linguistic tasks.
where $\hat{w}_i^{(k)}$ is the raw weight of the $i$th training instance in the mini-batch $b^{(k)}$ of the $k$th task, and $\hat{\Theta}_{mtl}$ is the resulting parameters in case the SGD update rule is applied on the current parameters $\Theta_{mtl}^{(t)}$ using instances weighted by $\hat{w}$. Following (Ren et al., 2018), we zero out negative raw weights, and then normalise them with respect to the other instances in the MTL training mini-batch to obtain the AIWs: $w_i^{(k)} = \tilde{w}_i^{(k)} / \sum_{k'} \sum_{i'} \tilde{w}_{i'}^{(k')}$, where $\tilde{w}_i^{(k)} = \mathrm{ReLU}(\hat{w}_i^{(k)})$.
In the preliminary experiments, we observed that using $w_i^{(k)}$ as the AIW does not perform well. We speculate that a small validation set does not provide a good estimation of the generalisation, and hence of the influence of the training instances. This is exacerbated as we approximate the validation set by only one of its mini-batches for computational efficiency. Therefore, we hypothesise that the computed weights should not be regarded as the final verdict on the usefulness of the training instances. Instead, we regard them as rewards for enhancing the training signals of instances that lead to a lower loss on the validation set. Hence, we use $1 + w_i^{(k)}$ as our AIWs in the experiments. The full algorithm is given in Algorithm 1.
Algorithm 1 Adaptively Scheduled Multitask Learning
1:  while t = 0 ... T-1 do
2:    b^(1), .., b^(K) ← SampleMB(D^(1), .., D^(K))
3:    b^(val) ← SampleMB(D^(val))
      . Step 1: Update model with initialised weights
4:    ℓ_i^(k) ← − log P_{Θ^t_mtl}(y_i^(k) | x_i^(k))                       . Forward
5:    ŵ_{i,0}^(k) ← 0                                                      . Initialise weights
6:    L_trn ← Σ_{k=1}^{K} Σ_{i=1}^{|b^(k)|} ŵ_{i,0}^(k) ℓ_i^(k)
7:    g_trn ← Backward(L_trn, Θ^t_mtl)
8:    Θ̂^t_mtl = Θ^t_mtl + η g_trn
      . Step 2: Calculate loss of the updated model on validation MB
9:    L_val = − Σ_{i=1}^{|b^val|} log P_{Θ̂^t_mtl}(y_i | x_i)
      . Step 3: Calculate raw weights
10:   g_val ← Backward(L_val, ŵ_0^(k))
11:   ŵ^(k) = g_val
      . Step 4: Normalise weights to get AIWs
12:   w̃_i^(k) = ReLU(ŵ_i^(k))
13:   w_i^(k) = w̃_i^(k) / (Σ_{k'} Σ_{i'} w̃_{i'}^(k')) + 1
      . Step 5: Update MTL with AIWs
14:   L̂_trn ← Σ_{k=1}^{K} Σ_{i=1}^{|b^(k)|} w_i^(k) ℓ_i^(k)
15:   ĝ_trn ← Backward(L̂_trn, Θ^t_mtl)
16:   Θ^{t+1}_mtl = Θ^t_mtl + η ĝ_trn
17: end while
Our approach for computing the raw weights is based on the classic notion of influence functions from robust statistics (Cook and Weisberg, 1980; Koh and Liang, 2017).
More concretely, let us define the loss function $L(\hat{\Theta}_{mtl}) := -\sum_{i=0}^{|b^{val}|-1} \log P_{\hat{\Theta}_{mtl}}(y_i \,|\, x_i)$, where $b^{val}$ is a mini-batch from the validation set. The training instances' raw weights, i.e. influences, are then calculated using the chain rule:
The last term, $\nabla_{\hat{w}_0} \hat{\Theta}_{mtl}$, involves backpropagation through $\hat{\Theta}_{mtl}$ wrt $\hat{w}_0$, which, according to eqn. 6.2, involves an inner backpropagation wrt $\Theta_{mtl}$. The computation graph is depicted in Figure 6.3.
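A minimal PyTorch sketch of this computation is given below, following the convention of Ren et al. (2018) with a virtual one-step gradient-descent update and a functional validation loss; the helper names, the functional val_loss_fn interface, and the small epsilon constant are our assumptions rather than the exact implementation.

```python
import torch

def raw_instance_weights(per_instance_losses, val_loss_fn, params, lr):
    """per_instance_losses: 1-D tensor of per-instance negative log-likelihoods for the
    current multi-task mini-batch; params: list of model parameters; val_loss_fn(params)
    evaluates the validation mini-batch loss under the given (virtually updated) parameters."""
    eps = torch.zeros_like(per_instance_losses, requires_grad=True)    # weights initialised to 0
    weighted_loss = (eps * per_instance_losses).sum()
    grads = torch.autograd.grad(weighted_loss, params, create_graph=True)
    # Virtual one-step update; create_graph keeps the path from eps to the new parameters.
    updated = [p - lr * g for p, g in zip(params, grads)]
    val_loss = val_loss_fn(updated)
    # Backward over backward: gradient of the validation loss w.r.t. the instance weights
    # (negated, following Ren et al. (2018)'s gradient-descent convention).
    return -torch.autograd.grad(val_loss, eps)[0]

def adaptive_importance_weights(raw_w):
    w = torch.relu(raw_w)                 # zero out negative raw weights
    w = w / (w.sum() + 1e-12)             # normalise over the whole MTL mini-batch
    return 1.0 + w                        # 1 + w_i is used as the final AIW
```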
Figure 6.3: Computation graph of the proposed method for adaptively determining weights: (1) a forward pass, (2) a backward pass, and (3) a backward-over-backward pass through the Multi-Task NMT model.
6.3 Experiments
6.3.1 Bilingual Corpora and Auxiliary Tasks
For the experiments of this chapter, we have used the bilingual corpora and auxiliary
tasks discussed in Sections 4.4.1 and 4.4.2, respectively. In addition, we use sepa-
rate Val (meta-validation) sets for the proposed AIW-based approach. For English-
Vietnamese, “tst2012” is divided and used as Dev and Val sets (with the ratio 2 to 1).
For English-Turkish and English-Spanish, we use “newstest2017” and “newstest2012”
parts as Val set, respectively. For English-Farsi, a random 3K subset of the corpus
is held out as Val set. For a fair comparison, we add the meta-validation set used in
the AIW-based approach to the training set of the competing baselines.
• Biased: It is introduced in Section 4.2 and selects a random mini-batch from the
translation task (bias towards the main task) and another one for a randomly
selected task.
In each of these schedules, the rest of the probability is uniformly divided among the remaining tasks. Under these schedules, a mini-batch can contain training pairs from different tasks, which makes them inefficient for partially shared MTL models. Hence, we modified these schedules to instead select the source task of the next training mini-batch.
                         English→Vietnamese   English→Turkish      English→Spanish                     English→Farsi
                         BLEU                  BLEU                 BLEU            METEOR              BLEU
                         Dev     Test          Dev     Test         Dev     Test    Dev     Test        Dev     Test
MT only                  22.83   24.15         8.55    8.5          14.49   13.44   31.3    31.1        12.16   11.95
MTL with Fixed Schedule
+ Uniform                23.10   24.81         9.14    8.94         12.81   12.12   29.6    29.5        13.51   13.22
+ Biased (Constant)†‡    23.42   25.22         10.06   9.53         15.14   14.11   31.8    31.3        13.53   13.47
+ Exponential‡           23.45   25.65         9.62    9.12         12.25   11.62   28.0    28.1        14.23   13.99
+ Sigmoid‡               23.35   25.36         9.55    9.01         11.55   11.34   26.6    26.9        14.05   13.88
MTL with Adaptive Schedule
+ Biased + AIW           23.95♦  25.75         10.67♦  10.25♦       11.23   10.66   27.5    27.4        14.41   14.2
+ Uniform + AIW          24.38♦  26.68♦        11.03♦  10.81♦       16.05♦  14.95♦  33.0♦   32.5♦       15.24♦  15.08♦
Table 6.1: Results for the four language pairs. “+ AIW” indicates that Adaptive Importance Weighting is used in training. †: Proposed in Chapter 4. ‡: Proposed in (Kiperwasser and Ballesteros, 2018). ♦: Statistically significantly better (p < 0.05) than the best MTL with a fixed schedule.
The reason is likely that the hand-engineered strategies do not consider the state of the model, and they do not distinguish among the auxiliary tasks.
It is interesting to see that the Biased schedule is beneficial for standard MTL,
while it is harmful when combined with the AIWs. The standard MTL is not able to
select training signals on-demand, and using a biased heuristic strategy improves it.
However, our weighting method can selectively filter out training signals; hence, it is
better to provide all of the training signals and leave the selection to the AIWs.
How and when have the auxiliary tasks been used? This analysis aims to shed light on how the AIWs control the contribution of each task throughout the training.
As seen, our method has the best result when it is combined with the Uniform MTL
schedule. In this schedule, at each update iteration, we have one mini-batch from
each of the tasks, and AIWs are determined for all of the training pairs in these mini-
batches. For this analysis, we divide the training into 200 update iteration chunks.
In each chunk, we calculate the average weights assigned to the training pairs of each
task.
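A minimal sketch of this chunking, assuming the AIW assigned to every training pair has been logged as (iteration, task, weight) records, is shown below.

```python
from collections import defaultdict

def average_weights_per_chunk(weight_log, chunk_size=200):
    """weight_log: list of (iteration, task, weight) records, one per training pair.
    Returns {task: [average weight in chunk 0, chunk 1, ...]} for plotting."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for it, task, w in weight_log:
        chunk = it // chunk_size
        sums[task][chunk] += w
        counts[task][chunk] += 1
    n_chunks = 1 + max(c for task in sums for c in sums[task])
    return {task: [sums[task][c] / counts[task][c] if counts[task][c] else 0.0
                   for c in range(n_chunks)]
            for task in sums}
```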
Figure 6.1 shows the results of this analysis for the MTL model trained with En→Vi as the main task, and Figure 6.4 shows the results for En→Es and En→Tr. It can be seen that at the beginning of the training, the Adaptive Importance Weighting mechanism gradually increases the training signals coming from the auxiliary tasks. However, after reaching a certain point in the training, it gradually reduces the auxiliary training signals to concentrate more on the adaptation to the main task. The weighting mechanism also distinguishes the importance of the auxiliary tasks. More interestingly, for En→Tr, the contribution of the NER task is greater than that of syntactic parsing, while for the other languages we observe the reverse. This shows that our method can adaptively determine the contribution of the tasks by considering the demands of the main translation task.
As seen, the method gives more weight to the syntactic tasks at the beginning of the training, while it gradually reduces their contribution and increases the involvement of the semantics-related task. We speculate that the reason is that at the beginning of the training the model requires more lower-level linguistic knowledge (e.g. syntactic parsing and NER), while over time the needs of the model gradually change towards higher-level linguistic knowledge (e.g. semantic parsing).
Figure 6.4: Weights assigned to the training pairs of the different tasks (translation, semantic parsing, syntactic parsing, and named-entity recognition), averaged over 200 update-iteration chunks. The Y-axis shows the average weight and the X-axis shows the number of update iterations. (a) Translation task (En→Es) vs. the auxiliary tasks; (b) the auxiliary tasks vs. each other; (c) translation task (En→Tr) vs. the auxiliary tasks; (d) the auxiliary tasks vs. each other. In the top figures, the main translation task is English→Spanish, while in the bottom ones it is English→Turkish.
Figure 6.5: The number of words in the gold English→Spanish translation which are missed in the generated translations of the MT-only, MTL-Biased and MTL-Uniform + AIW models (lower is better). Missed words are categorised by their tags (Part-of-Speech and named-entity types: OTROS, ORG, PERS, LUG).
6.4 Summary
This chapter presents a rigorous approach for adaptively and dynamically changing
the training schedule in Biased-MTL to make the best use of auxiliary tasks. To
balance the importance of the auxiliary tasks versus the main task, we re-weight the training data of the tasks to adjust their contributions to the generalisation capabilities of the resulting model on the main task. Our experimental results on English to Vietnamese/Turkish/Spanish/Farsi show up to +1.2 BLEU score improvement compared to strong baselines. Additionally, the analyses show that the proposed method automatically finds a schedule which places more importance on the auxiliary syntactic tasks at the beginning while gradually shifting the importance toward the auxiliary semantic task.
Chapter 7
As discussed in the previous chapter, one of the main challenges in MTL is to devise
effective training schedules, prescribing when to make use of the auxiliary tasks during
the training process to fill the knowledge gaps of the main task, a setting referred
to as Biased-MTL. In that chapter, we presented a rigorous approach for adaptively and dynamically changing the training schedule in Biased-MTL by re-weighting the training instances of the tasks to adjust their contributions to the generalisation capabilities of the resulting model on the main task. We have shown that the proposed method is effective and has revealed interesting weighting patterns. One major limitation, however, is the computational complexity of the method. This chapter builds upon the lessons learned in the previous chapter and introduces a generalised framework to extend our previous work and make it more efficient.
We propose a novel framework for learning the training schedule, i.e., learning to
multi-task learn, for a given MTL setting. We formulate the training schedule as a
Markov Decision Process which paves the way to employ policy learning methods to
learn the scheduling policy. We effectively and efficiently learn the training sched-
ule policy within the Imitation Learning framework using an oracle policy algorithm
that dynamically sets the importance weights of auxiliary tasks based on their con-
tributions to the generalisability of the main task. Experiments on both low-resource
and standard NMT settings show the resulting automatically learned training sched-
ulers are competitive with the best heuristics and lead to up to +1.1 BLEU score
improvements.
7.1 Introduction
The training schedule, the beating heart of MTL, is responsible for balancing out the
importance (participation rate) of different tasks throughout the training process, in
order to make the best use of the knowledge provided by the tasks. In the previous
chapter, we have proposed an approach to automatically re-weight training instances
of the tasks to adjust their contributions to the generalisation capabilities of the resulting model on the main task. Although the proposed method is effective and has found interesting patterns, it is computationally expensive. This chapter is built upon the hypotheses derived from the lessons learnt in the previous chapter: (1) As seen, although we have calculated weights for individual training instances, interesting task-related patterns have also emerged, e.g., shifting the importance from syntax- to semantic-related tasks. Therefore, we may approximate the instance-weights by task-weights. (2) We have seen that the weights of the auxiliary tasks are small relative to the main translation task, which means they have smaller contributions. However, the computational complexity of processing an instance/mini-batch is almost agnostic to the weight coefficient in its loss objective. As the task-weights determine the contribution of the tasks, we can use them as “sampling ratios” instead of coefficients in the loss. Previously, at each iteration, we used a mini-batch from all of the tasks and used the weights as coefficients in the loss objective. Now, we only select one of the tasks wrt the task-weights (i.e. a higher weight means a higher chance of selection). The expected training objective would be the same while we save computation time. (3) On top of all the above-mentioned hypotheses, the training schedule is a sequential decision-making process, so we can learn it.
To the best of our knowledge, there is no approach for automatically learning
a dynamic training schedule for Biased-MTL. In this chapter, we propose a novel
framework for automatically learning how to multi-task learn to maximally improve
the main translation task. This is achieved by formulating the problem as a Markov
Decision Process (MDP), enabling us to treat the training scheduler as the policy. We
solve the MDP by proposing an effective yet computationally expensive oracle policy (inspired by the previous chapter) that sets the participation rates of the auxiliary tasks with respect to their contributions to the generalisation capability of the main
translation task. In order to scale up the decision making, we use the oracle policy
as a teacher to train a scheduler network within the Imitation Learning framework
using Dagger (Ross et al., 2011). Thanks to the ∼8X speedup, we jointly train
and use the scheduler along with the MTL model in the course of a single run. Our
experimental results on low-resource (English to Vietnamese/Turkish/Spanish/Farsi)
settings show up to +1.1 BLEU score improvements over heuristic training schedules and results comparable to the approach of the previous chapter. Additionally, we investigate
the effectiveness of the proposed approach on a high-resource setting for WMT17
English to German to show the scalability of the approach further. Our analyses
of the automatically learned scheduling policies show interesting patterns among the auxiliary tasks for the high-resource setting, e.g. gradually shifting importance from syntactic to semantic auxiliary tasks (similar to what we have seen for the low-resource setting in the previous chapter).
To summarise, this chapter's contributions are as follows:
shared across all the tasks. The MTL parameters are then learned by maximising the
log-likelihood objective:
$$\arg\max_{\Theta_{mtl}} \sum_{k=0}^{K} \sum_{i=0}^{N_k} w^{(k)} \log P_{\Theta}(y_i^{(k)} \,|\, x_i^{(k)}). \qquad (7.1)$$
Typically, the tasks are assumed to have a predefined importance, e.g. their weights are set to the uniform distribution $w^{(k)} = \frac{1}{K+1}$, or set proportional to the sizes of the tasks' training datasets. As mentioned, we consider task-level weights instead of instance-level ones.
Assuming an iterative learning algorithm, the standard steps in training the MTL architecture at time step $t$ are as follows: (i) a collection of training mini-batches $b_t := \{b_t^k\}_{k=0}^{K}$ is provided, where $b_t^k$ comes from the $k$-th task, and (ii) the MTL parameters are re-trained by combining information from the mini-batches according to the tasks' importance weights. For example, the mini-batch gradients can be combined according to the tasks' weights, which can then be used to update the parameters using a gradient-based optimisation algorithm.
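As an illustration, step (ii) could be realised by weighting the per-task mini-batch losses (which is equivalent to combining their gradients) before a single optimiser step; the sketch below assumes a generic loss_fn(model, task, batch) interface.

```python
import torch

def mtl_update(model, optimizer, minibatches, task_weights, loss_fn):
    """minibatches: {task: batch}; task_weights: {task: w^(k)} summing to one.
    Combines the tasks' mini-batch losses according to their weights, which is
    equivalent to combining their gradients, then takes one optimiser step."""
    optimizer.zero_grad()
    total = 0.0
    for task, batch in minibatches.items():
        total = total + task_weights[task] * loss_fn(model, task, batch)
    total.backward()
    optimizer.step()
    return float(total)
```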
[Figure: High-level overview of the proposed approach, in which a stochastic Scheduler Network selects among the tasks (Translation, Semantic Parsing, ..., NER) for training the Multi-Task NMT model, guided by a validation (Val) set of the main task.]
The optimal policy is found by maximising the following objective function:
$$\arg\max_{\phi} \; \mathbb{E}_{\pi_\phi}\Big[\sum_{t=0}^{T} R(s_t, a_t, s_{t+1})\Big] \qquad (7.2)$$
where $T$ corresponds to the maximum number of training steps for the MTL architecture. Crucially, maximising the above long-term reward objective corresponds to finding a policy under which the validation loss of the resulting MTL model (at the end of the training trajectory) is minimised. In this chapter, we will present methods to provide such optimal/reasonable stochastic policies for training MTL architectures.
$$w^{opt} := \arg\min_{\hat{w} \in \triangle^{K}} \; \underbrace{\sum_{(x,y)\in b^{val}} -\log P_{\hat{\Theta}(\hat{w})}(y \,|\, x)}_{L(\hat{\Theta}(\hat{w}))}, \qquad (7.3)$$
such that
$$\hat{\Theta}(\hat{w}) := \Theta_t + \eta \sum_{k=0}^{K} \sum_{i=0}^{|b^{k}|-1} \hat{w}^{(k)} \nabla \log P_{\Theta_t}(y_i^{(k)} \,|\, x_i^{(k)}), \qquad (7.4)$$
where we use a mini-batch $b^{val}$ from the validation set for computational efficiency. To find the optimal importance weights, we do one-step projected gradient descent on the objective function $L(\hat{\Theta}(\hat{w}))$ with respect to $\hat{w}$, starting from zero. That is, we first set $\hat{w}^{opt} := \nabla_{\hat{w}_0} L(\hat{\Theta}(\hat{w}_0)) \big|_{\hat{w}_0 = 0}$, then zero out the negative elements of $\hat{w}^{opt}$, and finally normalise the resulting vector to project it back onto the simplex and produce $w^{opt} \in \triangle^{K}$. Our approach to defining $\hat{w}$ is based on the classic notion of influence functions from robust statistics (Koh and Liang, 2017; Ren et al., 2018; Cook and Weisberg, 1980).
Computing $\nabla_{\hat{w}_0} L(\hat{\Theta}(\hat{w}_0))$ is computationally expensive and complicated, as it involves backpropagation wrt $\hat{w}_0$ through a function which itself includes backpropagation wrt $\Theta$. Interestingly, as proved at the end of this section, the component $\hat{w}^{(k)}_{opt}$ is proportional to the inner product of the gradients of the loss over the mini-batches from the validation set and the $k$-th task. This approach is also computationally expensive and limits the scalability of the oracle, as it requires backpropagation over the mini-batches from all of the tasks and the validation set.
Proof for calculating the weights efficiently. We prove that the raw weight for the $k$-th task ($\hat{w}^{(k)}$) is proportional to the inner product of the gradients of the loss over the mini-batches from the validation set and the task.
$$\hat{w}^{(k)}_{opt} = \nabla_{\hat{w}^{(k)}_0} L(\hat{\Theta}(\hat{w}_0)) \Big|_{\hat{w}_0 = 0} \qquad (7.5)$$
where the first term is the gradient of the loss of the updated model, $L(\hat{\Theta}(\hat{w}_0))$, on the Val set with respect to the parameters before the update ($\Theta_t$). Since $\hat{w}_0 = 0$, the values of the parameters before and after the update remain the same, i.e. $\hat{\Theta}(\hat{w}_0) = \Theta_t$. Therefore:
$$\hat{w}^{(k)}_{opt} = \underbrace{\nabla_{\Theta_t} L(\Theta_t)}_{g^{val}} \cdot \nabla_{\hat{w}^{(k)}_0} \hat{\Theta}(\hat{w}_0) \Big|_{\hat{w}_0 = 0}$$
$$\nabla_{\hat{w}^{(k)}_0} \hat{\Theta}(\hat{w}_0) \Big|_{\hat{w}_0 = 0} = \nabla_{\hat{w}^{(k)}_0} \Big[\Theta_t + \eta \sum_{m=0}^{K} \sum_{i=0}^{|b^{m}|-1} \hat{w}^{(m)}_0 \nabla \log P_{\Theta_t}(y_i^{(m)} \,|\, x_i^{(m)})\Big] \Big|_{\hat{w}_0 = 0}$$
$$= \eta\, \nabla_{\hat{w}^{(k)}_0} \sum_{m=0}^{K} \sum_{i=0}^{|b^{m}|-1} \hat{w}^{(m)}_0 \nabla \log P_{\Theta_t}(y_i^{(m)} \,|\, x_i^{(m)}) \Big|_{\hat{w}_0 = 0} = \eta \underbrace{\sum_{i=0}^{|b^{k}|-1} \nabla \log P_{\Theta_t}(y_i^{(k)} \,|\, x_i^{(k)})}_{g^{(k)}}$$
Therefore, the weight for the task is as follows:
$$\hat{w}^{(k)}_{opt} = \eta\, (g^{val} \cdot g^{(k)}) = \eta \sum_{j=0}^{|b^{val}|-1} \nabla \log P_{\Theta_t}(y_j^{val} \,|\, x_j^{val}) \cdot \sum_{i=0}^{|b^{k}|-1} \nabla \log P_{\Theta_t}(y_i^{(k)} \,|\, x_i^{(k)}). \qquad (7.6)$$
Unlike eqn. 7.5, eqn. 7.6 does not require backpropagation through a backpropagation, and deep learning frameworks, e.g. PyTorch, can compute it more efficiently.
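A minimal PyTorch sketch of eqn. 7.6 is given below: the raw weight of each task is the inner product of the flattened validation-batch gradient and the task-batch gradient, followed by the projection onto the simplex; the helper names and the uniform fallback when all raw weights are non-positive are our assumptions.

```python
import torch

def flat_grad(log_lik, params):
    """Flatten the gradient of the summed log-likelihood w.r.t. the model parameters."""
    grads = torch.autograd.grad(log_lik, params, retain_graph=True, allow_unused=True)
    return torch.cat([g.reshape(-1) if g is not None else torch.zeros_like(p).reshape(-1)
                      for g, p in zip(grads, params)])

def oracle_task_weights(task_log_liks, val_log_lik, params, eta=1.0):
    """task_log_liks: {task: summed log-likelihood over that task's mini-batch};
    val_log_lik: summed log-likelihood over the validation mini-batch.
    Implements eqn. 7.6: raw weight = eta * (g_val . g_k), then projects onto the simplex."""
    g_val = flat_grad(val_log_lik, params)
    raw = {k: eta * torch.dot(g_val, flat_grad(ll, params)) for k, ll in task_log_liks.items()}
    pos = {k: torch.clamp(v, min=0.0) for k, v in raw.items()}   # zero out negative elements
    z = sum(pos.values())
    n = len(pos)
    return {k: (v / z).item() if z > 0 else 1.0 / n for k, v in pos.items()}
```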
[Figure: The scheduler network, a feed-forward network consisting of a fully-connected Tanh layer followed by a fully-connected Softmax layer that maps the input state to a distribution over the tasks.]
The input features to the scheduler network are the moving averages of the tasks' losses, $l^{k}_{ma} \leftarrow (1-\gamma)\, l^{k}_{ma} + \gamma \cdot loss$, where $\gamma \in [0, 1]$. Importantly, the loss value is already computed when updating the MTL parameters, so this feature does not impose an additional burden on the computational graph. These features are inspired by those used in MentorNet (Jiang et al., 2018), and by our investigations with the oracle policy of Section 7.3 on finding informative features predictive of the selected tasks.
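The following is a minimal sketch of such a scheduler network and its loss-moving-average features, consistent with the Tanh/Softmax feed-forward structure described above; the hidden size and the value of gamma are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SchedulerNetwork(nn.Module):
    """Maps the state (moving averages of the K+1 tasks' losses) to a distribution
    over tasks: a small MLP with a Tanh hidden layer and a Softmax output layer."""
    def __init__(self, n_tasks, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_tasks, hidden), nn.Tanh(),
                                 nn.Linear(hidden, n_tasks), nn.Softmax(dim=-1))

    def forward(self, loss_moving_averages):
        return self.net(loss_moving_averages)

def update_moving_average(l_ma, task_id, loss, gamma=0.1):
    """l_ma: tensor of per-task loss moving averages, updated for the selected task."""
    l_ma[task_id] = (1.0 - gamma) * l_ma[task_id] + gamma * loss
    return l_ma
```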
Learning the Policy with IL. Inspired by the Dataset Aggregation (Dagger )
algorithm (Ross et al., 2011), we learn the scheduler network jointly with the MTL
architecture, in the course of a single training run. Algorithm 2 depicts the high-level procedure of training and using the scheduler network.
At each update iteration, we decide between using or training the scheduler network with probability β¹ (line 9). In the case of training the scheduler network (lines 12-14), we use the oracle policy π_oracle of Section 7.3 to generate the optimal weights. This creates a new training instance for the policy network, where the input is the current state and the output is the optimal weights. We add this new training instance to the memory-replay buffer M, which is then used to re-train the policy/scheduler network. In the case of using the scheduler network (line 10), we simply feed the state to the network and receive the predicted weights.
¹ For efficiency, we use the scheduler network at least 90% of the time.
Algorithm 2 Learning the scheduler and MTL model
1:  Input: Train sets for the tasks D^0 .. D^K, validation set of the main task D^val, scheduler usage ratio β
2:  Init Θ_0 randomly                                 . MTL architecture parameters
3:  Init φ randomly                                   . scheduler network parameters
4:  M ← {}                                            . memory-replay buffer
5:  l^k_ma ← 0  ∀k ∈ {0, .., K}
6:  t ← 0
7:  while the stopping condition is not met do
8:    b^0_t, .., b^K_t, b^val ← sampleMB(D^0, .., D^K, D^val)
9:    if Rand(0, 1) < β then
        . Use the scheduler policy
10:     w_t ← π_φ(l^0_ma, .., l^K_ma)
11:   else
        . Train the scheduler policy
12:     w_t ← π_oracle(b^0_t, .., b^K_t, b^val, Θ_t)
13:     M ← M + {((l^0_ma, .., l^K_ma), w_t)}
14:     φ ← retrainScheduler(φ, M)
15:   end if
16:   k_t ← sampleTask(w_t)
17:   Θ_{t+1}, loss ← retrainModel(Θ_t, b^{k_t}_t)
18:   l^{k_t}_ma ← (1 − γ) l^{k_t}_ma + γ · loss
19:   t ← t + 1
20: end while
After getting the tasks’ importance weights, the algorithm samples a task (accord-
ing to the distribution wt ) to re-train the MTL architecture and update the moving
average of the selected task (lines 16-18).
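A minimal sketch of one such decision step (lines 9-16 of Algorithm 2) is shown below, assuming the oracle is wrapped in oracle_fn and returns a probability vector over the tasks as a tensor; retraining the scheduler with a single pass over the replay buffer and a cross-entropy loss to the oracle weights are our simplifying assumptions.

```python
import random
import torch

def scheduler_step(scheduler, sched_opt, oracle_fn, replay, l_ma, beta=0.9):
    """One decision step: use the scheduler with probability beta, otherwise query the
    oracle, aggregate the (state, oracle weights) pair, and re-fit the scheduler.
    Returns the index of the next task to train on."""
    state = l_ma.detach().clone()
    if random.random() < beta:                       # use the scheduler policy
        with torch.no_grad():
            w = scheduler(state)
    else:                                            # train the scheduler policy
        w = oracle_fn()                              # oracle importance weights over tasks
        replay.append((state, w))
        for s, target in replay:                     # re-fit on the replay buffer
            sched_opt.zero_grad()
            pred = scheduler(s)
            loss = -(target * torch.log(pred + 1e-9)).sum()   # cross-entropy to oracle weights
            loss.backward()
            sched_opt.step()
    return int(torch.multinomial(w, num_samples=1))  # sample the next task wrt the weights
```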
7.5 Experiments
We analyse the effectiveness of our scheduling method on MTL models learned on
languages with different underlying linguistic structures, and under different data
availability regimes. For the low-resource setting, we have used the bilingual corpora and auxiliary tasks discussed in Section 6.3. Further, we experiment with the WMT17 English to German translation task with ∼4.6M training pairs to show the scalability of the proposed approach in a high-resource scenario. We use “newstest2013” as the Dev set for early stopping, “newstest2017” as the Val (meta-validation) set for training the policy/scheduler network, and “newstest2014” for testing.
their combination. In this schedule, the probability of selection of the main task is
determined by a hand-engineered schedule (Biased for low-resource setting and Ex-
ponential for high-resource setting). However, instead of uniformly distributing the
remaining probability among the auxiliary tasks, we use the scheduler network to
assign a probability to each of the auxiliary tasks based on their contribution to the
generalisation of the MTL model.
7.5.2 Results
Table 7.1 reports the results for our proposed method and the baselines for the bilin-
gually low-resource conditions, i.e. translation from English into Vietnamese, Turk-
ish, Spanish and Farsi. As seen, the NMT models trained with our scheduler network
perform the best across different language pairs. More specifically, the three MTL
training heuristics are effective in producing models which outperform the MT-only
baseline. Among the three heuristics, the Biased training strategy is more effective
than Uniform and Exponential, and leads to trained models with substantially better
translation quality than others. Although our policy learning is agnostic to this MTL
setup, it has automatically learned effective training strategies, leading to further im-
provements compared to Uniform as the best heuristic training strategy. We further
considered learning a training strategy which is a combination of the best heuristic
(i.e., Biased) and the scheduler network, as described before. As seen, this combined
policy is not as effective as the pure scheduler network, although it is still better than
the best heuristic training strategy.
Table 7.2 shows the results for the English to German translation task, as the high-resource data condition. Among the three heuristics, Exponential is the most effective
training strategy compared to Biased and Uniform. Our automatically learned sched-
uler network leads to an NMT model which is competitive with the model trained
with the best heuristic. Furthermore, learning a combined policy resulted from the
scheduler network and the Exponential heuristic leads to the most effective training
strategy, where the trained NMT model outperforms all the other models wrt the
three translation quality metrics.
In the following section, we will elaborate more on the reasons behind the success of
the scheduler network by shedding light on how the scheduler controls the contribution
of each task throughout training.
                        BLEU ↑   METEOR ↑   TER ↓
MT only                 24.32    43.5       58.1
Hand-engineered
+ Uniform               24.04    43.0       58.2
+ Biased (Constant)†‡   23.37    42.8       59.0
+ Exponential‡          25.06    44.0       57.1
Scheduler network
+ SN                    24.6     43.5       57.4
+ Exponential + SN      25.3     44.2       56.6
Table 7.2: BLEU, METEOR and TER scores for the English-German language pair. “+ SN” indicates that the Scheduler Network is used in training. †: Proposed in Chapter 4. ‡: Proposed in (Kiperwasser and Ballesteros, 2018).
Figure 7.3: Average seconds per step for the different MTL models in the Transformer setting (En→De): MT only 0.265, Exponential schedule 0.266, MTL + SN (β=0.99) 0.295, and MTL with the oracle policy 2.449. We achieve an ∼8.3X speed up in the training of MTL by simultaneously training and using the scheduler network.
7.6 Analysis
Scalability of training with the scheduler network As discussed in Section
7.4, we introduce the scheduler to make the oracle policy scalable. To analyse the
speed up, we calculated the average time of each step in the high-resource regime with the Transformer setting. As mentioned, for the Transformer setting we train all the models for a fixed time of 3 days (this time includes measuring perplexity and saving the model after each epoch). We divide this time by the total number of training steps and depict the result in Figure 7.3. As seen, the oracle policy is computationally expensive, and interestingly using the scheduler network leads to an ∼8.3X speed up in the MTL training. Although we both train and use the scheduler network in a single
                                 English→Vietnamese   English→Turkish      English→Spanish                     English→Farsi
                                 BLEU                  BLEU                 BLEU            METEOR              BLEU
                                 Dev     Test          Dev     Test         Dev     Test    Dev     Test        Dev     Test
MT only                          22.83   24.15         8.55    8.5          14.49   13.44   31.3    31.1        12.16   11.95
MTL with Fixed Schedule
+ Uniform                        23.10   24.81         9.14    8.94         12.81   12.12   29.6    29.5        13.51   13.22
+ Biased (Constant)†‡            23.42   25.22         10.06   9.53         15.14   14.11   31.8    31.3        13.53   13.47
+ Exponential‡                   23.45   25.65         9.62    9.12         12.25   11.62   28.0    28.1        14.23   13.99
+ Sigmoid‡                       23.35   25.36         9.55    9.01         11.55   11.34   26.6    26.9        14.05   13.88
MTL with Adaptive Schedule (Chapter 6)
+ Biased + AIW                   23.95♦  25.75         10.67♦  10.25♦       11.23   10.66   27.5    27.4        14.41   14.2
+ Uniform + AIW                  24.38♦  26.68♦        11.03♦  10.81♦       16.05♦  14.95♦  33.0♦   32.5♦       15.24♦  15.08♦
MTL with Scheduler Network
+ SN + Biased                    23.86   25.70         10.53♦  10.18♦       13.20   12.38   29.9    29.7        14.37   14.51♦
+ SN + Uniform                   24.21♦  26.45♦        10.92♦  10.62♦       16.14♦  15.12♦  33.1♦   32.7♦       15.14♦  14.95♦
Table 7.1: Results for the four language pairs. “+ SN” indicates that the Scheduler Network is used in training. †: Proposed in Chapter 4. ‡: Proposed in (Kiperwasser and Ballesteros, 2018). ♦: Statistically significantly better (p < 0.05) than the best MTL with a fixed schedule.
Figure 7.4: Average weights of the auxiliary tasks during training on the English to German language pair. Weights are averaged over 100-step chunks.
run, thanks to using Imitation Learning the procedure is efficient enough to be comparable to the hand-engineered heuristics. On the other hand, it is not feasible (at least for us) to train an MTL model for the high-resource scenario by only using the oracle policy, as we would need to train the model for 25 days (instead of 3) to process the same number of steps.
Figure 7.5: The number of missed words in the generated translations of the English-to-German language pair (lower is better) for the MT-only, MTL + best heuristic, and MTL + best SN models. Words are categorised based on their POS tags (CONJ, AUX, NUM, PUNCT, PART, ADJ, DET, VERB, NOUN, PROPN, ADP).
7.7 Summary
We introduce a novel approach for automatically learning effective training schedules
for MTL. We formulate MTL training as a Markov Decision Process, paving the
way to treat the training scheduler as the policy. We then introduce an effective
oracle policy and use it to train a policy/scheduler network within the Imitation
Learning framework using Dagger in an on-policy manner. Our results on low-
resource (English to Vietnamese/Turkish/Spanish/Farsi) and high-resource (English
to German) settings using LSTM and Transformer architectures show up to +1.1
BLEU score improvement compared to the strong hand-engineered heuristics.
Chapter 8
This thesis presents a comprehensive study on using auxiliary linguistic data for
improving Neural Machine Translation (NMT) in bilingually low-resource scenarios.
We have explored two main approaches for this purpose: (1) using automatically
generated annotations; (2) directly injecting linguistic knowledge. For the first case,
we have shown that considering the errors and uncertainties of annotations could
help to improve the translation quality. For the second case, we have used Multi-
Task Learning (MTL) to inject linguistic inductive biases into the translation model.
We have shown that both the architecture and training schedule of MTL should be
carefully crafted, as they play crucial roles in making the best use of linguistic tasks
for improving the underlying translation task.
The contributions of this thesis can be categorised into three main directions: (1) handling the uncertainty of automatically generated annotations in the translation process (Part I); (2) effective MTL architectural design for injecting both semantic and syntactic knowledge into the underlying translation model (Part II); (3) effective strategies to train an MTL model to maximally improve the translation task (Part III).
In Chapter 3, we looked at the case of using pre-trained parsers for generating
the syntactic constituency trees of source sentences. We argued that considering only
the top-1 tree may lead to an inability to capture parser uncertainties as well as se-
mantic ambiguities in the sentences. Then, we proposed a novel architecture for the
transduction of a collection of trees, aka a forest, to a sequence. Our proposed forest-to-sequence model considers combinatorially many parse trees of the source sentence along
with their probabilities in an efficient bottom-up fashion. Interestingly, the analysis
of computational complexity showed that our method processes a forest with com-
binatorially many trees with merely a small linear time overhead. The experimental
results and analyses backed our hypothesis and demonstrated the effectiveness and
efficiency of the proposed method in handling parsing errors and uncertainties.
Chapter 4 was our starting point for injecting linguistic inductive biases into the
translation task using MTL. We scaffolded the machine translation task on auxiliary
linguistic tasks, including named-entity recognition, syntactic parsing and for the first
time¹ semantic parsing. This was done by casting the auxiliary tasks as sequence-to-sequence
transduction tasks and proposing a partial sharing strategy for Seq2Seq MTL mod-
els. We have further improved the sharing strategy by incorporating adversarial
training to ensure the shared parameters are not dominated by the representations of
a minority of tasks (a potential cause for task interference). The experiments, com-
parisons and analyses on four language pairs led to interesting findings. First, the effect of different linguistic tasks varies from one language pair to another. Second, partial sharing is a better strategy than full sharing for making better use of the auxiliary tasks, and it can be further enhanced by the proposed adversarial training approach. Third, the best partial-sharing practice can differ from one language pair to another. Therefore, there is a need for tuning the amount of sharing.
Inspired by the findings mentioned above, Chapter 5 proposed a novel approach
for adaptive knowledge sharing in deep Seq2Seq MTL. We extended the conven-
tional recurrent units to keep multiple flows of information, controlled by multiple
blocks. The amount of sharing among tasks is adaptively tuned via a routing network
responsible for modulating the input and output of each block conditioning on the
task at hand, the input, and model state. Empirical results showed the effectiveness of
the proposed approach for different language pairs. In addition, the analysis unveiled
that each block is mostly used by a subset of tasks, leveraging the commonalities
among subgroups of tasks.
In Chapter 6, we shifted our focus towards the beating heart of MTL, the training schedule. The training schedule is highly dependent on the aim of the MTL, which could be to improve all the tasks or only one of them. While in the literature the term “MTL” is used to refer to both of these scenarios, we made the distinction clearer and categorised them as General-MTL and Biased-MTL, to the best of our knowledge for the first time. This thesis was focused on the
latter as our goal was to improve the translation task. We proposed an approach for
adaptively and dynamically setting the importance of tasks throughout the training
in Biased-MTL. We showed that the proposed method is able to automatically find
interesting schedules that shed light on the dynamics of needs throughout the training of bilingually low-resource NMT: from syntax to semantics.
¹ To the best of our knowledge.
Chapter 7 was built upon the findings of its preceding chapter and proposed a novel
framework for learning to multi-task learn. We formulated the training schedule of
MTL as a Markov Decision Process (MDP), enabling us to treat the training scheduler
as the policy. We solved the MDP by proposing an oracle policy inspired by the
approach proposed in Chapter 6, with several techniques to scale it up. Further,
we introduced a scheduler policy trained and used simultaneously using Imitation
Learning to mimic the oracle policy. The interesting point regarding the proposed
approach is its ability to jointly train and use the scheduler along with the MTL
model in the course of a single run. Moreover, thanks to the ∼8X speedup, we
showed that the proposed approach is highly efficient and can even be used in high-
resource scenarios without the need for extra training time in comparison to hand-
crafted training schedules. Finally, we showed that the interesting scheduling pattern
of “from syntax to semantics” is also valid in the high-resource cases.
• Many of the findings and proposed approaches in the MTL-related parts of this
thesis are general and can be applied to other areas within NLP or other domains
like computer vision. More specifically, we are keen to see the application of the
adaptive training schedules to multi-task image classification. We expect this will unveil interesting scheduling patterns similar to the “from syntax to semantics” pattern that we have seen in this thesis.
rates. Recently, it has been shown that the large variance of the adaptive
learning rate could be problematic (Liu et al., 2019a). Therefore, applying
variance reduction techniques is a natural direction as switching the tasks during
the training of MTL may lead to a higher variance of learning rates.
Returning to the bigger picture, as mentioned, the deep learning tsunami is mainly driven by the abundance of big data as well as computational power. The shortage of labelled data in some areas makes it difficult for deep learning to fully unleash its capabilities. Therefore, this thesis is a step towards deep learning with less labelled data for the target task of interest. This is achieved by employing labelled data from related tasks, with the aim of lapping the waves of deep learning onto the less potent areas.
Bibliography
Roee Aharoni and Yoav Goldberg. 2017. Towards string-to-tree neural machine trans-
lation. Proceedings of the 55th Annual Meeting of the Association for Computational
Linguistics, pages 132–140.
Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Roee Aharoni, Melvin Johnson,
and Wolfgang Macherey. 2019. The missing ingredient in zero-shot neural machine
translation. arXiv preprint arXiv:1903.07091.
Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2017. Learning bilingual word
embeddings with (almost) no bilingual data. In Proceedings of the 55th Annual
Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),
pages 451–462.
Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2018. Unsuper-
vised neural machine translation. In International Conference on Learning Repre-
sentations.
Philip Bachman, Alessandro Sordoni, and Adam Trischler. 2017. Learning algorithms
for active learning. In Proceedings of the 34th International Conference on Machine
Learning-Volume 70, pages 301–310.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine
translation by jointly learning to align and translate. International Conference on
Learning Representations.
Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT
evaluation with improved correlation with human judgments. In Proceedings of
the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine
Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan.
Joost Bastings, Ivan Titov, Wilker Aziz, Diego Marcheggiani, and Khalil Sima’an.
2017. Graph convolutional encoders for syntax-aware neural machine translation.
arXiv preprint arXiv:1704.04675.
Daniel Beck, Gholamreza Haffari, and Trevor Cohn. 2018. Graph-to-sequence learning
using gated graph neural networks. In Proceedings of the 56th Annual Meeting of
the Association for Computational Linguistics (Volume 1: Long Papers), pages
273–283, Melbourne, Australia.
Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James Glass.
2017. What do neural machine translation models learn about morphology? In
Proceedings of the Annual Meeting of the Association for Computational Linguis-
tics, pages 861–872.
Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003.
A neural probabilistic language model. Journal of machine learning research,
3(Feb):1137–1155.
Yoshua Bengio, Yann LeCun, et al. 2007. Scaling learning algorithms towards AI.
Large-scale kernel machines, 34(5):1–41.
Ergun Biçici and Deniz Yuret. 2011. Instance selection for machine translation us-
ing feature decay algorithms. In Proceedings of the Sixth Workshop on Statistical
Machine Translation, pages 272–283.
Ondrej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Had-
dow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva,
Christof Monz, et al. 2016. Findings of the 2016 conference on machine translation.
In ACL 2016 First Conference On Machine Translation (WMT16), pages 131–198.
Denny Britz, Anna Goldie, Minh-Thang Luong, and Quoc Le. 2017. Massive ex-
ploration of neural machine translation architectures. In Proceedings of the 2017
Conference on Empirical Methods in Natural Language Processing, pages 1442–
1451.
Peter F Brown, John Cocke, Stephen A Della Pietra, Vincent J Della Pietra, Fredrick
Jelinek, John D Lafferty, Robert L Mercer, and Paul S Roossin. 1990. A statistical
approach to machine translation. Computational linguistics, 16(2):79–85.
Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L.
Mercer. 1993. The mathematics of statistical machine translation: Parameter esti-
mation. Computational Linguistics, 19(2):263–311.
Ignacio Cases, Clemens Rosenbaum, Matthew Riemer, Atticus Geiger, Tim Klinger,
Alex Tamkin, Olivia Li, Sandhini Agarwal, Joshua D Greene, Dan Jurafsky, et al.
2019. Recursive routing networks: Learning to compose modules for language un-
derstanding. In Proceedings of the 2019 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language Technologies,
Volume 1 (Long and Short Papers), pages 3631–3648.
Huadong Chen, Shujian Huang, David Chiang, and Jiajun Chen. 2017a. Improved
neural machine translation with a syntax-aware encoder and decoder. In Proceed-
ings of the 55th Annual Meeting of the Association for Computational Linguistics
(Volume 1: Long Papers), volume 1, pages 1936–1945.
Kehai Chen, Rui Wang, Masao Utiyama, Eiichiro Sumita, and Tiejun Zhao. 2018a.
Syntax-directed attention for neural machine translation. In Thirty-Second AAAI
Conference on Artificial Intelligence.
Xinchi Chen, Zhan Shi, Xipeng Qiu, and Xuanjing Huang. 2017b. Adversarial multi-
criteria learning for chinese word segmentation. In Proceedings of the Annual Meet-
ing of the Association for Computational Linguistics, pages 1193–1203.
Zhao Chen, Vijay Badrinarayanan, Chen-Yu Lee, and Andrew Rabinovich. 2018b.
Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask
networks. In International Conference on Machine Learning, pages 793–802.
Yong Cheng, Wei Xu, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu.
2016. Semi-supervised learning for neural machine translation. In Proceedings of
the 54th Annual Meeting of the Association for Computational Linguistics (Volume
1: Long Papers), pages 1965–1974.
Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio.
2014a. On the properties of neural machine translation: Encoder-decoder ap-
proaches. arXiv preprint arXiv:1409.1259.
Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi
Bougares, Holger Schwenk, and Yoshua Bengio. 2014b. Learning phrase repre-
sentations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734.
Noam Chomsky. 1957. Syntactic Structures. Mouton and Co., The Hague.
Kenneth Ward Church and Patrick Hanks. 1989. Word association norms, mutual
information, and lexicography. In 27th Annual Meeting of the Association for
Computational Linguistics, pages 76–83, Vancouver, British Columbia, Canada.
Jonathan H Clark, Chris Dyer, Alon Lavie, and Noah A Smith. 2011. Better hypoth-
esis testing for statistical machine translation: Controlling for optimizer instability.
In Proceedings of the 49th Annual Meeting of the Association for Computational
Linguistics: Human Language Technologies: short papers-Volume 2, pages 176–181.
Trevor Cohn, Cong Duy Vu Hoang, Ekaterina Vymolova, Kaisheng Yao, Chris Dyer,
and Gholamreza Haffari. 2016. Incorporating structural alignment biases into an
attentional neural translation model. Proceedings of the 2016 Conference of the
North American Chapter of the Association for Computational Linguistics.
Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu,
and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal
of Machine Learning Research, 12(Aug):2493–2537.
Balázs Csanád Csáji. 2001. Approximation with artificial neural networks. Faculty of Sciences, Eötvös Loránd University, Hungary, 24:48.
Anna Currey, Antonio Valerio Miceli Barone, and Kenneth Heafield. 2017. Copied
monolingual data improves low-resource neural machine translation. In Proceedings
of the Second Conference on Machine Translation, pages 148–156.
Fahim Dalvi, Nadir Durrani, Hassan Sajjad, Yonatan Belinkov, and Stephan Vogel.
2017. Understanding and improving morphological learning in the neural machine
translation decoder. In Proceedings of the International Joint Conference on Nat-
ural Language Processing, pages 142–151.
Liang Ding and Dacheng Tao. 2019. Recurrent graph syntax encoder for neural
machine translation. arXiv preprint arXiv:1908.06559.
Carl Doersch and Andrew Zisserman. 2017. Multi-task self-supervised visual learning.
In Proceedings of the IEEE International Conference on Computer Vision, pages
2051–2060.
Tobias Domhan and Felix Hieber. 2017. Using target-side monolingual data for neural
machine translation through multi-task learning. In Proceedings of the Conference
on Empirical Methods in Natural Language Processing, pages 1501–1506.
Long Duong, Trevor Cohn, Steven Bird, and Paul Cook. 2015. Low resource de-
pendency parsing: Cross-lingual parameter sharing in a neural network parser. In
Proceedings of the 53rd Annual Meeting of the Association for Computational Lin-
guistics and the 7th International Joint Conference on Natural Language Processing
(Volume 2: Short Papers), pages 845–850, Beijing, China.
Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A Smith. 2016. Re-
current neural network grammars. In Proceedings of the 2016 Conference of the
North American Chapter of the Association for Computational Linguistics, pages
199–209.
Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. 2018. Understanding
back-translation at scale. In Proceedings of the 2018 Conference on Empirical
Methods in Natural Language Processing, pages 489–500.
Akiko Eriguchi, Yoshimasa Tsuruoka, and Kyunghyun Cho. 2017. Learning
to parse and translate improves neural machine translation. arXiv preprint
arXiv:1702.03525.
Meng Fang, Yuan Li, and Trevor Cohn. 2017. Learning how to active learn: A
deep reinforcement learning approach. In Proceedings of the 2017 Conference on
Empirical Methods in Natural Language Processing, pages 595–605.
Weston Feely, Mehdi Manshadi, Robert E Frederking, and Lori S Levin. 2014. The
CMU METAL Farsi NLP Approach. In LREC, pages 4052–4055.
Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating
non-local information into information extraction systems by Gibbs sampling. In
Proceedings of the 43rd Annual Meeting of the Association for Computational Lin-
guistics, pages 363–370.
Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning
for fast adaptation of deep networks. In Proceedings of the 34th International
Conference on Machine Learning-Volume 70, pages 1126–1135. JMLR.org.
Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin.
2017. Convolutional sequence to sequence learning. In Proceedings of the 34th In-
ternational Conference on Machine Learning-Volume 70, pages 1243–1252. JMLR.org.
Ran Gilad-Bachrach, Amir Navot, and Naftali Tishby. 2006. Query by committee
made real. In Advances in Neural Information Processing Systems, pages 443–450.
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT
Press.
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley,
Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial
nets. In Advances in Neural Information Processing Systems, pages 2672–2680.
Jiatao Gu, Hany Hassan, Jacob Devlin, and Victor OK Li. 2018a. Universal neural
machine translation for extremely low resource languages. In Proceedings of the
2018 Conference of the North American Chapter of the Association for Computa-
tional Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages
344–354.
Jiatao Gu, Yong Wang, Yun Chen, Victor OK Li, and Kyunghyun Cho. 2018b. Meta-
learning for low-resource neural machine translation. In Proceedings of the 2018
Conference on Empirical Methods in Natural Language Processing, pages 3622–
3631.
Jiatao Gu, Yong Wang, Kyunghyun Cho, and Victor OK Li. 2019. Improved zero-
shot neural machine translation via ignoring spurious correlations. arXiv preprint
arXiv:1906.01181.
Han Guo, Ramakanth Pasunuru, and Mohit Bansal. 2019. AutoSeM: Automatic task
selection and mixing in multi-task learning. arXiv preprint arXiv:1904.04153.
Michelle Guo, Albert Haque, De-An Huang, Serena Yeung, and Li Fei-Fei. 2018.
Dynamic task prioritization for multitask learning. In Proceedings of the European
Conference on Computer Vision (ECCV), pages 270–287.
Thanh-Le Ha, Jan Niehues, and Alexander Waibel. 2016. Toward multilingual
neural machine translation with universal encoder and decoder. arXiv preprint
arXiv:1611.04798.
Gholamreza Haffari, Maxim Roy, and Anoop Sarkar. 2009. Active learning for statis-
tical phrase-based machine translation. In Proceedings of Human Language Tech-
nologies: The 2009 Annual Conference of the North American Chapter of the As-
sociation for Computational Linguistics, pages 415–423.
Gholamreza Haffari and Anoop Sarkar. 2009. Active learning for multilingual sta-
tistical machine translation. In Proceedings of the Joint Conference of the 47th
Annual Meeting of the ACL and the 4th International Joint Conference on Natural
Language Processing of the AFNLP: Volume 1-Volume 1, pages 181–189.
Kazuma Hashimoto and Yoshimasa Tsuruoka. 2017. Neural machine translation with
source-side latent graph parsing. arXiv preprint arXiv:1702.02265.
Kazuma Hashimoto, Caiming Xiong, Yoshimasa Tsuruoka, and Richard Socher. 2016.
A joint many-task model: Growing a neural network for multiple NLP tasks.
Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tie-Yan Liu, and Wei-Ying Ma.
2016. Dual learning for machine translation. In Advances in Neural Information
Processing Systems, pages 820–828.
Vu Cong Duy Hoang, Philipp Koehn, Gholamreza Haffari, and Trevor Cohn. 2018.
Iterative back-translation for neural machine translation. In Proceedings of the 2nd
Workshop on Neural Machine Translation and Generation, pages 18–24.
Sepp Hochreiter and Jürgen Schmidhuber. 1997a. LSTM can solve hard long time
lag problems. In Advances in Neural Information Processing Systems: Proceedings of
the 1996 Conference, pages 473–479.
Sepp Hochreiter and Jürgen Schmidhuber. 1997b. Long short-term memory. Neural
computation, 9(8):1735–1780.
Liang Huang. 2008. Forest reranking: Discriminative parsing with non-local features.
In ACL, pages 586–594.
Robert A Jacobs, Michael I Jordan, Steven J Nowlan, Geoffrey E Hinton, et al. 1991.
Adaptive mixtures of local experts. Neural computation, 3(1):79–87.
Ali Jalali, Sujay Sanghavi, Chao Ruan, and Pradeep K Ravikumar. 2010. A dirty
model for multi-task learning. In Advances in Neural Information Processing Sys-
tems, pages 964–972.
Eric Jang, Shixiang Gu, and Ben Poole. 2016. Categorical reparameterization with
Gumbel-Softmax. arXiv preprint arXiv:1611.01144.
Jing Jiang. 2009. Multi-task transfer learning for weakly-supervised relation extrac-
tion. In Proceedings of the Joint Conference of the 47th Annual Meeting of the
ACL and the 4th International Joint Conference on Natural Language Processing
of the AFNLP, pages 1012–1020, Suntec, Singapore.
Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, and Li Fei-Fei. 2018. Mentornet:
Learning data-driven curriculum for very deep neural networks on corrupted labels.
In International Conference on Machine Learning, pages 2309–2318.
Melvin Johnson, Mike Schuster, Quoc V Le, Maxim Krikun, Yonghui Wu, Zhifeng
Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, et al.
2017. Google's multilingual neural machine translation system: Enabling zero-shot
translation. Transactions of the Association for Computational Linguistics, 5:339–
351.
Michael I Jordan and Robert A Jacobs. 1994. Hierarchical mixtures of experts and
the EM algorithm. Neural computation, 6(2):181–214.
Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex
Graves, and Koray Kavukcuoglu. 2017. Neural machine translation in linear time.
arXiv preprint arXiv:1610.10099.
Alex Kendall, Yarin Gal, and Roberto Cipolla. 2018. Multi-task learning using un-
certainty to weigh losses for scene geometry and semantics. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, pages 7482–7491.
Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization.
arXiv preprint arXiv:1412.6980.
Durk P Kingma, Tim Salimans, and Max Welling. 2015. Variational dropout and
the local reparameterization trick. In Advances in Neural Information Processing
Systems, pages 2575–2583.
Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander Rush.
2017. OpenNMT: Open-source toolkit for neural machine translation. In Proceedings
of ACL 2017, System Demonstrations, pages 67–72.
Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation.
In MT Summit, volume 5, pages 79–86.
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Fed-
erico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens,
Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses:
Open source toolkit for statistical machine translation. In Proceedings of the 45th
Annual Meeting of the Association for Computational Linguistics on Interactive
Poster and Demonstration Sessions, pages 177–180.
Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine trans-
lation. In Proceedings of the First Workshop on Neural Machine Translation, pages
28–39.
Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based
translation. In Proceedings of the 2003 Conference of the North American Chapter
of the Association for Computational Linguistics on Human Language Technology-
Volume 1, pages 48–54.
Pang Wei Koh and Percy Liang. 2017. Understanding black-box predictions via
influence functions. In Proceedings of the 34th International Conference on Machine
Learning-Volume 70, pages 1885–1894. JMLR.org.
Ioannis Konstas, Srinivasan Iyer, Mark Yatskar, Yejin Choi, and Luke Zettlemoyer.
2017. Neural AMR: Sequence-to-sequence models for parsing and generation. In Pro-
ceedings of the Annual Meeting of the Association for Computational Linguistics,
pages 146–157.
Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, and Marc’Aurelio
Ranzato. 2018b. Phrase-based & neural unsupervised machine translation. arXiv
preprint arXiv:1804.07755.
Yann LeCun and Yoshua Bengio. 1998. Convolutional networks for images, speech,
and time series. In The Handbook of Brain Theory and Neural Networks, pages
255–258. MIT Press, Cambridge, MA, USA.
Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature,
521(7553):436–444.
Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity
with lessons learned from word embeddings. Transactions of the Association for
Computational Linguistics, 3:211–225.
Junhui Li, Deyi Xiong, Zhaopeng Tu, Muhua Zhu, Min Zhang, and Guodong Zhou.
2017. Modeling source syntax for neural machine translation. In Proceedings of the
55th Annual Meeting of the Association for Computational Linguistics (Volume 1:
Long Papers), pages 688–697.
Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. 2015. Gated graph
sequence neural networks. arXiv preprint arXiv:1511.05493.
Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng
Gao, and Jiawei Han. 2019a. On the variance of the adaptive learning rate and
beyond. arXiv preprint arXiv:1908.03265.
Ming Liu, Wray Buntine, and Gholamreza Haffari. 2018. Learning to actively learn
neural machine translation. In Proceedings of the 22nd Conference on Computa-
tional Natural Language Learning, pages 334–344.
Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2017. Adversarial multi-task learning
for text classification. In Proceedings of the 55th Annual Meeting of the Association
for Computational Linguistics, pages 1–10.
Shikun Liu, Andrew J Davison, and Edward Johns. 2019b. Self-supervised generali-
sation with meta auxiliary learning. arXiv preprint arXiv:1901.08933.
Yi Luan, Chris Brockett, Bill Dolan, Jianfeng Gao, and Michel Galley. 2017. Multi-
task learning for speaker-role adaptation in neural conversation models. In Proceed-
ings of the Eighth International Joint Conference on Natural Language Processing
(Volume 1: Long Papers), pages 605–614.
Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015a. Effective ap-
proaches to attention-based neural machine translation. In Proceedings of the 2015
Conference on Empirical Methods in Natural Language Processing (EMNLP). As-
sociation for Computational Linguistics.
Minh-Thang Luong, Ilya Sutskever, Quoc V Le, Oriol Vinyals, and Wojciech Zaremba.
2015b. Addressing the rare word problem in neural machine translation.
Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2016.
Multi-task sequence to sequence learning. In International Conference on Learning
Representations.
Thang Luong, Hieu Pham, and Christopher D. Manning. 2015c. Effective Approaches
to Attention-based Neural Machine Translation. In Proceedings of the 2015 Con-
ference on Empirical Methods in Natural Language Processing, pages 1412–1421,
Lisbon, Portugal.
Chunpeng Ma, Akihiro Tamura, Masao Utiyama, Tiejun Zhao, and Eiichiro Sumita.
2018. Forest-based neural machine translation. In Proceedings of the 56th Annual
Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),
pages 1253–1263, Melbourne, Australia.
Diego Marcheggiani, Joost Bastings, and Ivan Titov. 2018. Exploiting semantics in
neural machine translation with graph convolutional networks. In Proceedings of
the 2018 Conference of the North American Chapter of the Association for Com-
putational Linguistics: Human Language Technologies, Volume 2 (Short Papers),
pages 486–492, New Orleans, Louisiana.
Diego Marcheggiani and Ivan Titov. 2017. Encoding sentences with graph convolu-
tional networks for semantic role labeling. arXiv preprint arXiv:1703.04826.
Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building
a large annotated corpus of English: The Penn Treebank. Computational Linguistics,
19(2):313–330.
Warren S McCulloch and Walter Pitts. 1943. A logical calculus of the ideas immanent
in nervous activity. The bulletin of mathematical biophysics, 5(4):115–133.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation
of word representations in vector space. arXiv preprint arXiv:1301.3781.
Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudan-
pur. 2010. Recurrent neural network based language model. In Eleventh Annual
Conference of the International Speech Communication Association.
Tomáš Mikolov, Stefan Kombrink, Lukáš Burget, Jan Černocký, and Sanjeev Khu-
danpur. 2011. Extensions of recurrent neural network language model. In
2011 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP), pages 5528–5531. IEEE.
Tom M Mitchell. 1980. The need for biases in learning generalizations. Department
of Computer Science, Laboratory for Computer Science Research.
Pim Moeskops, Jelmer M Wolterink, Bas HM van der Velden, Kenneth GA Gilhuijs,
Tim Leiner, Max A Viergever, and Ivana Išgum. 2016. Deep learning for multi-task
medical image segmentation in multiple modalities. In International Conference
on Medical Image Computing and Computer-Assisted Intervention, pages 478–486.
Springer.
Graham Neubig, Chris Dyer, Yoav Goldberg, Austin Matthews, Waleed Ammar,
Antonios Anastasopoulos, Miguel Ballesteros, David Chiang, Daniel Clothiaux,
Trevor Cohn, Kevin Duh, Manaal Faruqui, Cynthia Gan, Dan Garrette, Yangfeng
Ji, Lingpeng Kong, Adhiguna Kuncoro, Gaurav Kumar, Chaitanya Malaviya, Paul
Michel, Yusuke Oda, Matthew Richardson, Naomi Saphra, Swabha Swayamdipta,
and Pengcheng Yin. 2017. DyNet: The dynamic neural network toolkit. arXiv
preprint arXiv:1701.03980.
Graham Neubig and Junjie Hu. 2018. Rapid adaptation of neural machine translation
to new languages. arXiv preprint arXiv:1808.04189.
Weili Nie, Nina Narodytska, and Ankit Patel. 2019. RelGAN: Relational generative
adversarial networks for text generation. In International Conference on Learning
Representations.
Jan Niehues and Eunah Cho. 2017. Exploiting linguistic resources for neural machine
translation using multi-task learning. In Proceedings of the Second Conference on
Machine Translation, pages 80–89.
Xing Niu, Michael Denkowski, and Marine Carpuat. 2018. Bi-directional neural ma-
chine translation with synthetic parallel data. In Proceedings of the 2nd Workshop
on Neural Machine Translation and Generation, pages 84–91.
Sinno Jialin Pan and Qiang Yang. 2009. A survey on transfer learning. IEEE Trans-
actions on Knowledge and Data Engineering, 22(10):1345–1359.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a
method for automatic evaluation of machine translation. In Proceedings of the 40th
Annual Meeting of the Association for Computational Linguistics, pages 311–318.
Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of
training recurrent neural networks. ICML (3), 28:1310–1318.
Ramakanth Pasunuru and Mohit Bansal. 2017. Multi-task video captioning with
video and entailment generation. In Proceedings of ACL.
Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang,
Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer.
2017. Automatic differentiation in PyTorch. In NIPS-W.
Hao Peng, Sam Thomson, and Noah A. Smith. 2017. Deep multitask learning for
semantic dependency parsing. In Proceedings of the 55th Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long Papers), pages 2037–
2048.
Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global
vectors for word representation. In Proceedings of the 2014 Conference on Empirical
Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar.
Adam Poliak, Yonatan Belinkov, James Glass, and Benjamin Van Durme. 2018. On
the evaluation of semantic phenomena in neural machine translation using natural
language inference. arXiv preprint arXiv:1804.09779.
Ye Qi, Devendra Sachan, Matthieu Felix, Sarguna Padmanabhan, and Graham Neu-
big. 2018. When and why are pre-trained word embeddings useful for neural ma-
chine translation? In Proceedings of the 2018 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Tech-
nologies, Volume 2 (Short Papers), pages 529–535, New Orleans, Louisiana.
Rajeev Ranjan, Vishal M Patel, and Rama Chellappa. 2017. Hyperface: A deep multi-
task learning framework for face detection, landmark localization, pose estimation,
and gender recognition. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 41(1):121–135.
Roi Reichart, Katrin Tomanek, Udo Hahn, and Ari Rappoport. 2008. Multi-task
active learning for linguistic annotations. In Proceedings of ACL-08: HLT, pages
861–869, Columbus, Ohio.
Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. 2018. Learning to
reweight examples for robust deep learning. In International Conference on Ma-
chine Learning, pages 4331–4340.
Clemens Rosenbaum, Tim Klinger, and Matthew Riemer. 2018. Routing networks:
Adaptive selection of non-linear functions for multi-task learning. In International
Conference on Learning Representations.
Frank Rosenblatt. 1958. The perceptron: A probabilistic model for information stor-
age and organization in the brain. Psychological review, 65(6):386.
Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. 2011. A reduction of imitation
learning and structured prediction to no-regret online learning. In Proceedings of
the Fourteenth International Conference on Artificial Intelligence and Statistics,
pages 627–635.
Sebastian Ruder, Joachim Bingel, Isabelle Augenstein, and Anders Søgaard. 2017.
Sluice networks: Learning what to share between loosely related tasks. arXiv
preprint arXiv:1705.08142.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Edinburgh neural ma-
chine translation systems for WMT 16. In Proceedings of the First Conference on
Machine Translation: Volume 2, Shared Task Papers, pages 371–376.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural Machine Trans-
lation of Rare Words with Subword Units. In Proceedings of the Annual Meeting
of the Association for Computational Linguistics, pages 1715–1725.
Rico Sennrich and Biao Zhang. 2019. Revisiting low-resource neural machine trans-
lation: A case study. In Proceedings of the 57th Annual Meeting of the Association
for Computational Linguistics, pages 211–221, Florence, Italy.
Burr Settles. 2009. Active learning literature survey. Technical report, University of
Wisconsin-Madison Department of Computer Sciences.
Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey
Hinton, and Jeff Dean. 2017. Outrageously large neural networks: The sparsely-
gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.
Xing Shi, Inkit Padhi, and Kevin Knight. 2016. Does string-based neural MT learn
source syntax? In Proceedings of the Conference on Empirical Methods in Natural
Language Processing, pages 1526–1534.
Shuangzhi Wu, Ming Zhou, and Dongdong Zhang. 2017. Improved neural machine
translation with source syntax. In Proceedings of the Twenty-Sixth International Joint
Conference on Artificial Intelligence, IJCAI-17, pages 4179–4185.
Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John
Makhoul. 2006. A study of translation edit rate with targeted human annotation.
In Proceedings of Association for Machine Translation in the Americas.
Anders Søgaard and Yoav Goldberg. 2016. Deep multi-task learning with low level
tasks supervised at lower layers. In Proceedings of the Annual Meeting of the
Association for Computational Linguistics, pages 231–235.
Linfeng Song, Daniel Gildea, Yue Zhang, Zhiguo Wang, and Jinsong Su. 2019. Se-
mantic neural machine translation using AMR. Transactions of the Association for
Computational Linguistics, 7:19–31.
Linfeng Song, Yue Zhang, Zhiguo Wang, and Daniel Gildea. 2018. A graph-to-
sequence model for AMR-to-text generation. In Proceedings of the 56th Annual
Meeting of the Association for Computational Linguistics (Volume 1: Long Pa-
pers), pages 1616–1626.
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan
Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from
overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.
Wen Sun, Arun Venkatraman, Geoffrey J Gordon, Byron Boots, and J Andrew Bag-
nell. 2017. Deeply aggrevated: Differentiable imitation learning for sequential pre-
diction. In Proceedings of the 34th International Conference on Machine Learning-
Volume 70, pages 3309–3318. JMLR.org.
Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning
with neural networks. In Advances in Neural Information Processing Systems, pages
3104–3112.
Kai Sheng Tai, Richard Socher, and Christopher D Manning. 2015. Improved seman-
tic representations from tree-structured long short-term memory networks. arXiv
preprint arXiv:1503.00075.
Jörg Tiedemann. 2009. News from OPUS-A collection of multilingual parallel cor-
pora with tools and interfaces. In Recent Advances in Natural Language Processing,
volume 5, pages 237–248.
Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of
the International Conference on Language Resources and Evaluation, pages 2214–
2218.
Simon Tong and Edward Chang. 2001. Support vector machine active learning for
image retrieval. In Proceedings of the Ninth ACM International Conference on Mul-
timedia, pages 107–118. ACM.
Lisa Torrey and Jude Shavlik. 2010. Transfer learning. In Handbook of Research on
Machine Learning Applications and Trends: Algorithms, Methods, and Techniques,
pages 242–264.
Kristina Toutanova, Dan Klein, Christopher D Manning, and Yoram Singer. 2003.
Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceed-
ings of the 2003 Conference of the North American Chapter of the Association for
Computational Linguistics on Human Language Technology-Volume 1, pages 173–180.
Zhaopeng Tu, Yang Liu, Lifeng Shang, Xiaohua Liu, and Hang Li. 2017. Neural
machine translation with reconstruction. In Thirty-First AAAI Conference on Ar-
tificial Intelligence.
Peter D Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space
models of semantics. Journal of Artificial Intelligence Research, 37:141–188.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N
Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In
Advances in Neural Information Processing Systems, pages 5998–6008.
Oriol Vinyals, Łukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey
Hinton. 2015. Grammar as a foreign language. In Advances in Neural Information
Processing Systems, pages 2773–2781.
Thuy-Trang Vu, Ming Liu, Dinh Phung, and Gholamreza Haffari. 2019. Learning
how to active learn by dreaming. In Proceedings of the 57th Annual Meeting of the
Association for Computational Linguistics, pages 4091–4101, Florence, Italy.
Sida Wang and Christopher Manning. 2013. Fast dropout training. In International
Conference on Machine Learning, pages 118–126.
Zirui Wang, Zihang Dai, Barnabás Póczos, and Jaime Carbonell. 2019. Characterizing
and avoiding negative transfer. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 11293–11302.
Mark Woodward and Chelsea Finn. 2017. Active one-shot learning. arXiv preprint
arXiv:1702.06559.
Felix Wu, Angela Fan, Alexei Baevski, Yann Dauphin, and Michael Auli. 2019. Pay
less attention with lightweight and dynamic convolutions. In International Confer-
ence on Learning Representations.
Lijun Wu, Yingce Xia, Li Zhao, Fei Tian, Tao Qin, Jianhuang Lai, and Tie-Yan Liu.
2017. Adversarial neural machine translation. arXiv preprint arXiv:1704.06933.
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolf-
gang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016.
Google’s neural machine translation system: Bridging the gap between human and
machine translation. arXiv preprint arXiv:1609.08144.
Yi Yang, Zhigang Ma, Feiping Nie, Xiaojun Chang, and Alexander G Hauptmann.
2015. Multi-class active learning by uncertainty sampling with diversity maximiza-
tion. International Journal of Computer Vision, 113(2):113–127.
Yongxin Yang and Timothy M Hospedales. 2016. Trace norm regularised deep multi-
task learning. arXiv preprint arXiv:1606.04038.
Zhen Yang, Wei Chen, Feng Wang, and Bo Xu. 2018. Improving neural machine
translation with conditional sequence generative adversarial nets. In Proceedings of
the 2018 Conference of the North American Chapter of the Association for Com-
putational Linguistics: Human Language Technologies, Volume 1 (Long Papers),
pages 1346–1355.
Xiao-Tong Yuan, Xiaobai Liu, and Shuicheng Yan. 2012. Visual classification with
multitask joint sparse representation. IEEE Transactions on Image Processing,
21(10):4349–4360.
Jiajun Zhang and Chengqing Zong. 2016. Exploiting source-side monolingual data in
neural machine translation. In Proceedings of the Conference on Empirical Methods
in Natural Language Processing, pages 1535–1545.
Yu Zhang, Ying Wei, and Qiang Yang. 2018a. Learning to multitask. In S. Bengio,
H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors,
Advances in Neural Information Processing Systems 31, pages 5771–5782.
Zhirui Zhang, Shujie Liu, Mu Li, Ming Zhou, and Enhong Chen. 2018b. Joint training
for neural machine translation models with monolingual data. In Thirty-Second
AAAI Conference on Artificial Intelligence.