Deep Neural Networks in Machine Translation: An Overview
Jiajun Zhang and Chengqing Zong, Institute of Automation, Chinese Academy of Sciences
Deep neural networks (DNNs) are increasingly popular in machine translation.

Due to the powerful capacity of feature learning and representation, deep neural networks (DNNs) have made great progress in speech recognition and image processing. Following recent success in signal variable processing, researchers want to figure out whether DNNs can achieve similar progress in symbol variable processing, such as natural language processing (NLP). As one of the more challenging NLP tasks, machine translation (MT) has become a testing ground for researchers who want to evaluate various kinds of DNNs.

MT aims to find for the source language sentence the most probable target language sentence that shares the most similar meaning. Essentially, MT is a sequence-to-sequence prediction task. This article gives a comprehensive overview of applications of DNNs in MT from two views: indirect application, which attempts to improve standard MT systems, and direct application, which adopts DNNs to design a purely neural MT model. We can elaborate further:

• Indirect application designs new features with DNNs in the framework of standard MT systems, which consist of multiple submodels (such as translation selection and language models). For example, DNNs can be leveraged to represent the source language context's semantics and better predict translation candidates.
• Direct application regards MT as a sequence-to-sequence prediction task and, without using any information from standard MT systems, designs two deep neural networks: an encoder, which learns continuous representations of source language sentences, and a decoder, which generates the target language sentence from the source sentence representation.

Let's start by examining DNNs themselves.

Deep Neural Networks
Researchers have designed many kinds of DNNs, including deep belief networks (DBNs), deep stack networks (DSNs), convolutional neural networks (CNNs), and recurrent neural networks (RNNs). In NLP, all these DNNs aim to learn syntactic and semantic representations for the discrete words, phrases, structures, and sentences in the real-valued continuous space, so that similar words (phrases or structures) are near each other. We briefly introduce five popular neural networks after giving some notation: we use w_i to denote the ith word of a T-word sentence and x_i as its corresponding distributed real-valued vector. Vectors of all the words in the vocabulary form the embedding matrix L ∈ R^{k×|V|}, where k is the embedding dimension and |V| is the vocabulary size. Additionally, U and W are parameter matrices of a neural network.

Figure 1. The feed-forward neural network (FNN) architecture. Taking the language model as an example, the FNN attempts to predict the conditional probability of the next word given the fixed-window history words.

The RecurrentNN is appealing for language modeling due to its capability of representing all the history words rather than a fixed-length context as in the FNN. Figure 2 depicts the RecurrentNN architecture. Given the history representation h_{t−1} encoding all the preceding words, we can obtain the new history representation h_t with the formula h_t = Ux_t + Wh_{t−1}. With h_t, we can calculate the probability of the next word using the softmax function:

p(y_t) = e^{y_t} / Σ_i e^{y_i}   (1)
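To make the recurrence and the softmax in Equation 1 concrete, the following minimal Python/NumPy sketch steps a RecurrentNN language model over a toy word sequence. The tanh nonlinearity, the output matrix V, the random initialization, and all dimensions are illustrative assumptions rather than details given above.

```python
import numpy as np

# Minimal sketch of the RecurrentNN language-model step:
#   h_t = U x_t + W h_{t-1}, followed by the softmax of Eq. 1.
# Sizes, the tanh nonlinearity, and the output matrix V are assumptions.
k, hidden, vocab = 8, 16, 100                    # embedding size, hidden size, |V|
rng = np.random.default_rng(0)
L = rng.normal(scale=0.1, size=(k, vocab))       # embedding matrix L in R^{k x |V|}
U = rng.normal(scale=0.1, size=(hidden, k))      # input-to-hidden parameters
W = rng.normal(scale=0.1, size=(hidden, hidden)) # hidden-to-hidden parameters
V = rng.normal(scale=0.1, size=(vocab, hidden))  # hidden-to-output parameters (assumed)

def step(word_id, h_prev):
    """Advance the history representation and return the next-word distribution."""
    x_t = L[:, word_id]                  # distributed vector of the current word
    h_t = np.tanh(U @ x_t + W @ h_prev)  # new history representation
    y = V @ h_t                          # unnormalized scores over the vocabulary
    p = np.exp(y - y.max())              # softmax of Eq. 1, shifted for stability
    return h_t, p / p.sum()

h = np.zeros(hidden)
for w_id in [3, 17, 42]:                 # a toy word-id sequence
    h, p_next = step(w_id, h)
print(p_next.argmax(), p_next.max())     # most probable next word and its probability
```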
Among the other popular networks, the RecursiveNN doesn't have to reconstruct the inputs and can apply different composition matrices, while the convolution layer of the CNN involves several filters W ∈ R^{h×k} that summarize the meaning of each sliding window of h words (see Figure 5).

Figure 5. The convolutional neural network (CNN) architecture. The CNN model takes as input the sequence of word embeddings, summarizes the sentence meaning by convolving the sliding window and pooling the saliency through the sentence, and yields the fixed-length distributed vector with other layers, such as dropout and fully connected layers.

Statistical Machine Translation
Standard SMT scores a candidate translation e of the source sentence f with a log-linear combination of feature functions h_i(f, e) weighted by λ_i:

P(e|f) = exp(Σ_i λ_i h_i(f, e)) / Σ_{e′} exp(Σ_i λ_i h_i(f, e′))

The translation process can be divided into three steps: partition the source sentence (or syntactic tree) into a sequence of words or phrases (or a set of subtrees); perform word/phrase or subtree translation; and composite the fragment translations to obtain the final target sentence.
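Returning to the log-linear model above, a minimal sketch of how weighted feature functions score and normalize over candidate translations might look as follows; the feature functions, weights, candidate strings, and romanized source tokens are illustrative stand-ins, not parts of a real SMT system.

```python
import math

# Minimal sketch of the log-linear model: feature functions h_i(f, e) weighted by
# lambda_i, normalized over a candidate set (the sum over e' in the denominator).
def loglinear_probs(f, candidates, features, weights):
    """Return P(e|f) for each candidate e in `candidates`."""
    scores = [sum(lam * h(f, e) for lam, h in zip(weights, features))
              for e in candidates]
    z = sum(math.exp(s) for s in scores)
    return [math.exp(s) / z for s in scores]

# Hypothetical features: a length-difference penalty and a crude fluency cue.
features = [
    lambda f, e: -abs(len(f.split()) - len(e.split())),
    lambda f, e: 1.0 if "relations between the two countries" in e else 0.0,
]
weights = [0.3, 0.7]

print(loglinear_probs(
    "fazhan liangguo guanxi",            # hypothetical romanized source sentence
    ["developing the relations between the two countries",
     "developing the two countries the relations between"],
    features, weights))
```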
If the translation unit is the word, it's the word-based model. If the phrase is the basic translation unit, it's the popular phrase-based model. In this article, we mainly take the phrase-based SMT7 as an example.
In the training stage, we first per-
form word alignment to find word
correspondence between the bilingual
sentences. Then, based on the word-
aligned bilingual sentences, we extract
phrase-based translation rules (such as
the a–e translation rules in Figure 6)
and learn their probabilities. Mean-
while, the phrase reordering model
can be trained from the word-aligned
bilingual text. In addition, the lan-
guage model can be trained with the
large-scale target monolingual data.
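As a sketch of the rule-extraction step just described, the snippet below collects the phrase pairs that are consistent with the word alignment of a single sentence pair. The alignment, the romanized source tokens, and max_len are illustrative assumptions, and the usual extension that also absorbs unaligned boundary words is omitted.

```python
def extract_phrase_pairs(src_words, tgt_words, alignment, max_len=4):
    """alignment: set of (i, j) pairs meaning src_words[i] aligns to tgt_words[j]."""
    pairs = []
    for i1 in range(len(src_words)):
        for i2 in range(i1, min(i1 + max_len, len(src_words))):
            # Target positions aligned to the source span [i1, i2].
            tgt_points = [j for (i, j) in alignment if i1 <= i <= i2]
            if not tgt_points:
                continue
            j1, j2 = min(tgt_points), max(tgt_points)
            if j2 - j1 + 1 > max_len:
                continue
            # Consistency: no target word inside [j1, j2] may be aligned to a
            # source word outside [i1, i2].
            if all(i1 <= i <= i2 for (i, j) in alignment if j1 <= j <= j2):
                pairs.append((" ".join(src_words[i1:i2 + 1]),
                              " ".join(tgt_words[j1:j2 + 1])))
    return pairs

src = "fazhan liangguo guanxi".split()                              # hypothetical source
tgt = "developing the relations between the two countries".split()
alignment = {(0, 0), (1, 5), (1, 6), (2, 2)}                        # illustrative alignment
for rule in extract_phrase_pairs(src, tgt, alignment):
    print(rule)
```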
During decoding, the phrase-based
model finds the best phrase partition
of the source sentence, searches for the
best phrase translations, and figures
out the best composition of the target
phrases. Figure 6 shows an example
for a Chinese-to-English translation.
Phrasal rules (a–e) are first utilized to get the partial translations, and then reordering rules (f–i) are employed to arrange the translation positions. Rule g denotes that “the two countries” and “the relations between” should be swapped. Rules f, h, and i just composite the target phrases monotonously. Finally, the language model measures which translation is more accurate.

Figure 6. An example of translation derivation structure prediction for a Chinese-to-English sentence; the competing candidates are “developing the relations between the two countries” and “developing the two countries the relations between.” Phrasal rules (a–e) first produce the partial translations, reordering rules (f–i) then arrange the translation positions, and the language model measures which translation is more accurate.

Obviously, from training and decoding, we can see the difficulties in SMT:

• It's difficult to obtain accurate word alignment because we have no knowledge besides the parallel data.
• It's difficult to determine which target phrase is the best candidate for a source phrase because a source phrase can have many translations, and different contexts lead to different translations.
• It's tough work to predict the translation derivation structure because phrase partition and phrase reordering for a source sentence can be arbitrary.
• It's difficult to learn a good language model due to the data sparseness problem.
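To make the decoding procedure described earlier more concrete (partition the source, translate each phrase, and compose the target), here is a deliberately simplified monotone decoder. The phrase table, its log probabilities, the bigram language-model stub, and the romanized source tokens are illustrative assumptions; real decoders also search over reorderings (as in Figure 6) and prune with beam search.

```python
# Toy phrase table: source phrase -> list of (target phrase, log probability).
phrase_table = {
    "fazhan": [("developing", -0.2), ("develop", -0.9)],
    "liangguo guanxi": [("the relations between the two countries", -0.4)],
    "liangguo": [("the two countries", -0.3)],
    "guanxi": [("the relations between", -0.5)],
}

def lm_score(words):
    """Language-model stub: reward word pairs from a tiny bigram set."""
    bigrams = {("developing", "the"), ("the", "relations"), ("two", "countries")}
    return sum(-0.1 if b in bigrams else -1.0 for b in zip(words, words[1:]))

def decode(src_words):
    """Monotone dynamic program: best[i] = (score, target words) covering src[:i]."""
    best = {0: (0.0, [])}
    for i in range(len(src_words)):
        if i not in best:
            continue
        for j in range(i + 1, len(src_words) + 1):
            src_phrase = " ".join(src_words[i:j])
            for tgt_phrase, tm in phrase_table.get(src_phrase, []):
                score, prefix = best[i]
                cand = prefix + tgt_phrase.split()
                total = score + tm + lm_score(cand[-2:])  # score the last transition
                if j not in best or total > best[j][0]:
                    best[j] = (total, cand)
    return best.get(len(src_words), (float("-inf"), []))[1]

print(" ".join(decode("fazhan liangguo guanxi".split())))
```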
DNNs have also been used to improve the language model, which is trained with large-scale monolingual data. Ashish Vaswani and colleagues19 employed two hidden layers in an FNN that's similar to Bengio's FNN. The n-gram model assumes that the word depends only on the previous n − 1 words. The RecurrentNN doesn't use this assumption and models the probability of a sentence as follows:

p(w_1, w_2, ..., w_{T′}) = Π_{t=1}^{T′} p(w_t | w_1, ..., w_{t−1})
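The contrast with the n-gram assumption can be illustrated with a few lines of Python: the same chain-rule decomposition is evaluated once with the full history and once with the history truncated to the previous n − 1 words. The next-word probability function here is a toy stand-in, not a trained model.

```python
import math

def sentence_logprob(words, next_word_prob, n=None):
    """Sum log p(w_t | history); if n is given, truncate history to n-1 words."""
    total = 0.0
    for t, w in enumerate(words):
        history = words[:t] if n is None else words[max(0, t - (n - 1)):t]
        total += math.log(next_word_prob(w, tuple(history)))
    return total

def next_word_prob(word, history):
    # Toy stand-in: slightly prefer words that already occurred in the history.
    return 0.2 if word in history else 0.1

s = "the relations between the two countries".split()
print(sentence_logprob(s, next_word_prob))       # full history, as in the RecurrentNN
print(sentence_logprob(s, next_word_prob, n=2))  # bigram-style truncation
```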
Remaining issues include how to make full use of large-scale monolingual data and how to utilize more syntactic/semantic knowledge in addition to source sentences.

For both direct and indirect applications, DNNs boost translation performance. Naturally, we're interested in the following questions:

• Why can DNNs improve translation quality?
• Can DNNs lead to a big breakthrough?
• What aspects should DNNs improve if they're to become an MT panacea?

For the first question, DNNs represent and operate language units in the continuous vector space that facilitates the computation of semantic distance. For example, several algorithms such as Euclidean distance and cosine distance can be applied to calculate the similarity between phrases or sentences. But they also capture much more contextual information than standard SMT systems, and data sparseness isn't a big problem. For example, the RecurrentNN can utilize all the history information before the currently predicted target word; this is impossible with standard SMT systems.

For the second question, DNNs haven't achieved huge success with MT until recently. We've conducted some analysis and propose some key problems for SMT with DNNs:

• Computational complexity. Because the network structure is complicated, and normalization over the entire vocabulary is usually required, DNN training is a time-consuming task. Training a standard SMT system on millions of sentence pairs only requires about two or three days, whereas training a similar NMT system can take several weeks, even with powerful GPUs.
• Error analysis. Because the DNN-based subcomponent (or NMT) deals with variables in the real-valued continuous space and there are no effective approaches to show a meaningful and explainable trace from input to output, it's difficult to understand why it leads to better translation performance or why it fails.
• Remembering and reasoning. For current DNNs, the continuous vector representation (even using LSTM in the RecurrentNN) can't remember full information for the source sentence. It's quite difficult to obtain the correct target translation by decoding from this representation. Furthermore, unlike other sequence-to-sequence NLP tasks, MT is a more complicated problem that requires rich reasoning operations (such as coreference resolution). Current DNNs can't perform this kind of reasoning with simple vector or matrix operations.

These problems tell us that DNNs have a long way to go in MT. Nevertheless, due to their effective representations of languages, they could be a good solution eventually. To achieve this goal, we should pay attention to the path ahead.

First, DNNs are good at handling continuous variables, but natural language is composed of abstract discrete symbols. If they completely abandon discrete symbols, DNNs won't fully control the language generation process: sentences are discrete, not continuous. Representing and handling both discrete and continuous variables in DNNs is a big challenge.

Second, DNNs represent words, phrases, and sentences in continuous space, but what if they could mine deeper knowledge, such as parts of speech, syntactic parse trees, and knowledge graphs? What about exploring wider knowledge beyond the sentence, such as paragraphs and discourse? Unfortunately, representation, computation, and reasoning of such information in DNNs remain a difficult problem.

Third, effectively integrating DNNs into standard SMT is still worth trying. In the multicomponent system, we can study which subcomponent is indispensable and which can be completely replaced by DNN-based features. Instead of the log-linear model, we need a better mathematical model to combine multiple subcomponents.

Fourth, it's interesting and imperative to investigate more efficient algorithms for parameter learning of the complicated neural network architectures. Moreover, new network architectures can be explored in addition to existing neural networks. We believe that the best network architectures for MT must be equipped with representation, remembering, computation, and reasoning simultaneously.

Acknowledgments
This research work was partially funded by the Natural Science Foundation of China under grant numbers 61333018 and 61303181, the International Science and Technology Cooperation Program of China under grant number 2014DFA11350, and the High New Technology Research and Development Program of Xinjiang Uyghur Autonomous Region under grant number 201312103.

References
1. Y. Bengio et al., “A Neural Probabilistic Language Model,” J. Machine Learning Research, vol. 3, 2003, pp. 1137–1155; www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf.
2. J.L. Elman, “Distributed Representations, Simple Recurrent Networks, and Grammatical Structure,” Machine Learning, vol. 7, 1991, pp. 195–225; http://crl.ucsd.edu/~elman/Papers/machine.learning.pdf.
3. R. Socher et al., “Semi-supervised Recursive Autoencoders for Predicting Sentiment Distributions,” Proc. Empirical Methods