
Natural Language Processing

Deep Neural Networks in Machine Translation: An Overview

Jiajun Zhang and Chengqing Zong, Institute of Automation, Chinese Academy of Sciences

Deep neural networks (DNNs) are increasingly popular in machine translation.

Due to the powerful capacity of feature learning and representation, deep neural networks (DNNs) have made big breakthroughs in speech recognition and image processing. Following recent success in signal variable processing, researchers want to figure out whether DNNs can achieve similar progress in symbol variable processing, such as natural language processing (NLP). As one of the more challenging NLP tasks, machine translation (MT) has become a testing ground for researchers who want to evaluate various kinds of DNNs.

MT aims to find for the source language sentence the most probable target language sentence that shares the most similar meaning. Essentially, MT is a sequence-to-sequence prediction task. This article gives a comprehensive overview of applications of DNNs in MT from two views: indirect application, which attempts to improve standard MT systems, and direct application, which adopts DNNs to design a purely neural MT model. We can elaborate further:

• Indirect application designs new features with DNNs in the framework of standard MT systems, which consist of multiple submodels (such as translation selection and language models). For example, DNNs can be leveraged to represent the source language context's semantics and better predict translation candidates.
• Direct application regards MT as a sequence-to-sequence prediction task and, without using any information from standard MT systems, designs two deep neural networks: an encoder, which learns continuous representations of source language sentences, and a decoder, which generates the target language sentence from the source sentence representation.

Let's start by examining DNNs themselves.

Deep Neural Networks

Researchers have designed many kinds of DNNs, including deep belief networks (DBNs), deep stack networks (DSNs), convolutional neural networks (CNNs), and recurrent neural networks (RNNs). In NLP, all these DNNs aim to learn the syntactic and semantic representations for the discrete words, phrases, structures, and sentences in the real-valued continuous space, so that similar words (phrases or structures) are near each other.


We briefly introduce five popular neural networks after giving some notations: we use wi to denote the ith word of a T-word sentence, and xi as the corresponding distributed real-valued vector. Vectors of all the words in the vocabulary form the embedding matrix L ∈ R^(k×|V|), where k is the embedding dimension and |V| is the vocabulary size. Additionally, U and W are parameter matrices of a neural network, and b is the bias; f and e indicate the source and target sentence, respectively.

Feed-Forward Neural Network

The feed-forward neural network (FNN) is one of the simplest multilayer networks.1 Figure 1 shows an FNN architecture with hidden layers as well as input and output layers. Taking the language model as an example, the FNN attempts to predict the conditional probability of the next word given the fixed-window history words. Suppose we have a T-word sentence, w1, w2, ..., wt, ..., wT; our task is to estimate the four-gram conditional probability of wt given the trigram history wt−3, wt−2, wt−1. The FNN first maps each history word into a real-valued vector xt−3, xt−2, xt−1 using the embedding matrix L ∈ R^(k×|V|); xt−3, xt−2, xt−1 are then concatenated to form a single input vector xt_history. The hidden layers then extract the abstract representation of the history words through a linear transformation W × xt_history and a nonlinear projection f(W × xt_history + b), such as f = tanh(x). The softmax layer is usually adopted in the output to predict each word's probability in the vocabulary.

Figure 1. The feed-forward neural network (FNN) architecture. Taking the language model as an example, the FNN attempts to predict the conditional probability of the next word given the fixed-window history words.

Recurrent Neural Network

The recurrent neural network (RecurrentNN)2 is theoretically more powerful than the FNN in language modeling due to its capability of representing all the history words rather than a fixed-length context as in the FNN. Figure 2 depicts the RecurrentNN architecture. Given the history representation ht−1 encoding all the preceding words, we can obtain the new history representation ht with the formula ht = Uxt + Wht−1. With ht, we can calculate the probability of the next word using the softmax function:

p(yt) = exp(yt) / Σi exp(yi), (1)

where i traverses all the words in the vocabulary. Similarly, the new history representation ht and the next word will be utilized to get the history representation ht+1 at time t + 1.

Figure 2. The recurrent neural network (RecurrentNN) architecture. Theoretically, it's more powerful than the FNN in language modeling due to its capability of representing all the history words rather than a fixed-length context.

Recursive Auto-encoder

The recursive auto-encoder (RAE)3 provides a good way to embed a phrase or a sentence in continuous space with an unsupervised or semisupervised method. Figure 3 shows an RAE architecture that learns a vector representation of a four-word phrase by recursively combining two children vectors in a bottom-up manner. By convention, the four words w1, w2, w3, and w4 are first projected into real-valued vectors x1, x2, x3, and x4. In RAE, a standard auto-encoder (in box) is reused at each node. For two children c1 = x1 and c2 = x2, the auto-encoder computes the parent vector y1 as follows:

y1 = f(W(1)[c1; c2] + b(1)). (2)

To assess how well the parent's vector represents its children, the standard auto-encoder reconstructs the children in a reconstruction layer:

[c′1; c′2] = f′(W(2)y1 + b(2)). (3)

The standard auto-encoder tries to minimize the reconstruction errors between the inputs and the reconstructions during training:

Erec([c1; c2]) = ½ ||[c1; c2] − [c′1; c′2]||². (4)

The same auto-encoder is reused until the whole phrase's vector is generated. For unsupervised learning, the objective is to minimize the sum of reconstruction errors at each node in the optimal binary tree:

RAEθ(x) = argmin_{y ∈ A(x)} Σ_{s ∈ y} Erec([c1; c2]_s), (5)

where A(x) denotes all the possible binary trees that can be built from input x.

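To make Equations 2 through 4 concrete, here is a minimal sketch in Python with NumPy. The dimensionality, the tanh nonlinearity, the random untrained parameters, and the greedy merging loop (a greedy approximation to the tree search in Equation 5) are illustrative assumptions, not the implementation from reference 3:

```python
import numpy as np

k = 4                                  # assumed embedding dimension
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((k, 2 * k)) * 0.1, np.zeros(k)      # composition weights, Eq. 2
W2, b2 = rng.standard_normal((2 * k, k)) * 0.1, np.zeros(2 * k)  # reconstruction weights, Eq. 3

def compose(c1, c2):
    """Equation 2: parent vector built from two children."""
    return np.tanh(W1 @ np.concatenate([c1, c2]) + b1)

def reconstruction_error(c1, c2):
    """Equations 3 and 4: reconstruct the children and score the parent."""
    p = compose(c1, c2)
    recon = np.tanh(W2 @ p + b2)                      # [c1'; c2']
    diff = np.concatenate([c1, c2]) - recon
    return p, 0.5 * np.dot(diff, diff)                # E_rec([c1; c2])

# Greedy bottom-up pass over a four-word phrase (x1..x4), as in Figure 3:
# repeatedly merge the adjacent pair with the lowest reconstruction error.
nodes = [rng.standard_normal(k) for _ in range(4)]    # stand-ins for the word vectors
total_error = 0.0
while len(nodes) > 1:
    errors = [reconstruction_error(nodes[i], nodes[i + 1]) for i in range(len(nodes) - 1)]
    i_best = min(range(len(errors)), key=lambda i: errors[i][1])
    parent, err = errors[i_best]
    total_error += err
    nodes[i_best:i_best + 2] = [parent]               # replace the pair with its parent
print("phrase vector:", nodes[0], "summed reconstruction error:", total_error)
```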


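Likewise, a minimal sketch of the RecurrentNN step of Figure 2 and Equation 1. The toy vocabulary, the output projection matrix O, and the random untrained parameters are assumptions made only for illustration; the article itself specifies just the recurrence ht = Uxt + Wht−1 and the softmax:

```python
import numpy as np

vocab = ["<s>", "the", "relations", "between", "countries", "</s>"]   # toy vocabulary
V, k = len(vocab), 8                       # assumed sizes
rng = np.random.default_rng(1)
L = rng.standard_normal((k, V)) * 0.1      # embedding matrix L (k x |V|)
U = rng.standard_normal((k, k)) * 0.1      # input-to-hidden weights
W = rng.standard_normal((k, k)) * 0.1      # hidden-to-hidden weights
O = rng.standard_normal((V, k)) * 0.1      # output projection (an assumption for this sketch)

def softmax(y):
    e = np.exp(y - y.max())                # Equation 1, numerically stabilized
    return e / e.sum()

def step(h_prev, word_id):
    """One RecurrentNN step: h_t = U x_t + W h_{t-1}, then p(y_t) by softmax."""
    x_t = L[:, word_id]
    h_t = U @ x_t + W @ h_prev
    return h_t, softmax(O @ h_t)

# Walk through a sentence, reusing h_t as the growing history representation.
h = np.zeros(k)
for w in ["<s>", "the", "relations", "between"]:
    h, p_next = step(h, vocab.index(w))
print("p(next word):", dict(zip(vocab, p_next.round(3))))
```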

Figure 3. The recursive auto-encoder (RAE) architecture. It learns a vector representation of a four-word phrase by recursively combining two children vectors in a bottom-up manner.

Recursive Neural Network

The recursive neural network (RecursiveNN)4 performs structure prediction and representation learning in a bottom-up fashion similar to that of RAE. However, RecursiveNN differs from RAE in four points: RecursiveNN is optimized with supervised learning; the tree structure is usually fixed before training; RecursiveNN doesn't have to reconstruct the inputs; and different matrices can be used at different nodes. Figure 4 illustrates an example that applies three different matrices. The structure, representation, and parameter matrices W(1), W(2), and W(3) have been learned to optimize the label-related supervised objective function.

Figure 4. The recursive neural network (RecursiveNN) architecture. The structure, representation, and parameter matrices W(1), W(2), and W(3) have been learned to optimize the label-related supervised objective function.

Convolutional Neural Network

The convolutional neural network (CNN)5 consists of convolution and pooling layers and provides a standard architecture that maps variable-length sentences into fixed-size distributed vectors. Figure 5 shows the architecture. The CNN model takes as input the sequence of word embeddings, summarizes the sentence meaning by convolving a sliding window and pooling the saliency through the sentence, and yields the fixed-length distributed vector with other layers, such as dropout and fully connected layers.

Given a sentence w1, w2, ..., wt, ..., wT, each word wt is first projected into a vector xt. Then, we concatenate all the vectors to form the input X = [x1, x2, ..., xt, ..., xT].

The convolution layer involves several filters W ∈ R^(h×k) that summarize the information of an h-word window and produce a new feature. For the window of h words Xt:t+h−1, a filter Fl (1 ≤ l ≤ L) generates the feature yt^l as follows:

yt^l = f(W Xt:t+h−1 + b). (6)

When a filter traverses each window from X1:h to XT−h+1:T, we get the feature map's output y1^l, y2^l, ..., yT−h+1^l (y^l ∈ R^(T−h+1)). Note that sentences differ from each other in length T, and y^l has different dimensions for different sentences. A key question becomes: how do we transform the variable-length vector y^l into a fixed-size vector?

The pooling layer is designed to perform this task. In most cases, we apply a standard max-over-time pooling operation over y^l and choose the maximum value ŷ^l = max{y^l}. With L filters, the dimension of the pooling layer output will be L. Using other layers, such as fully connected linear layers, we can finally obtain a fixed-length output representation.

Figure 5. The convolutional neural network (CNN) architecture. The CNN model takes as input the sequence of word embeddings, summarizes the sentence meaning by convolving the sliding window and pooling the saliency through the sentence, and yields the fixed-length distributed vector with other layers, such as dropout and fully connected layers.

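To make Equation 6 and the max-over-time pooling step concrete, here is a small sketch in NumPy. The filter count, window size, tanh nonlinearity, and random weights are assumptions chosen only for illustration:

```python
import numpy as np

k, h, num_filters = 8, 3, 5                 # embedding size, window width, filters (assumed)
rng = np.random.default_rng(2)
filters = rng.standard_normal((num_filters, h * k)) * 0.1   # one row per filter W
bias = np.zeros(num_filters)

def sentence_vector(X):
    """X: T x k matrix of word embeddings -> fixed-size vector of length num_filters."""
    T = X.shape[0]
    # Equation 6: each filter scores every h-word window X_{t:t+h-1}.
    windows = np.stack([X[t:t + h].reshape(-1) for t in range(T - h + 1)])  # (T-h+1, h*k)
    feature_maps = np.tanh(windows @ filters.T + bias)                      # (T-h+1, num_filters)
    # Max-over-time pooling: one value per filter, independent of T.
    return feature_maps.max(axis=0)

short = rng.standard_normal((5, k))          # a 5-word sentence
long = rng.standard_normal((12, k))          # a 12-word sentence
print(sentence_vector(short).shape, sentence_vector(long).shape)   # both (5,)
```

Note how sentences of different lengths T yield vectors of the same size, which is exactly the property the pooling layer is introduced for.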


Machine Translation

Statistical models dominate the MT community today. Given a source language sentence f, statistical machine translation (SMT) searches through all the target language sentences e and finds the one with the highest probability:

e′ = argmax_e p(e | f). (7)

Usually, p(e | f) is decomposed using the log-linear model6:

e′ = argmax_e exp(Σi λi hi(f, e)) / Σ_{e′} exp(Σi λi hi(f, e′)), (8)

where hi(f, e) can be any translation feature and λi is the corresponding weight.

The translation process can be divided into three steps: partition the source sentence (or syntactic tree) into a sequence of words or phrases (or a set of subtrees); perform word/phrase or subtree translation; and composite the fragment translations to obtain the final target sentence. If the translation unit is the word, it's the word-based model. If the phrase is the basic translation unit, it's the popular phrase-based model. In this article, we mainly take phrase-based SMT7 as an example.

In the training stage, we first perform word alignment to find word correspondence between the bilingual sentences. Then, based on the word-aligned bilingual sentences, we extract phrase-based translation rules (such as the a–e translation rules in Figure 6) and learn their probabilities. Meanwhile, the phrase reordering model can be trained from the word-aligned bilingual text. In addition, the language model can be trained with large-scale target monolingual data.

During decoding, the phrase-based model finds the best phrase partition of the source sentence, searches for the best phrase translations, and figures out the best composition of the target phrases. Figure 6 shows an example for a Chinese-to-English translation. Phrasal rules (a–e) are first utilized to get the partial translations, and then reordering rules (f–i) are employed to arrange the translation positions. Rule g denotes that "the two countries" and "the relations between" should be swapped. Rules f, g, and i just composite the target phrases monotonously. Finally, the language model measures which translation is more accurate.

Figure 6. An example of translation derivation structure prediction. Phrasal rules (a–e) are first utilized to get the partial translations, and then reordering rules (f–i) are employed to arrange the translation positions. Rule g denotes that "the two countries" and "the relations between" should be swapped. Rules f, g, and i just composite the target phrases monotonously. Finally, the language model measures which translation is more accurate.

Obviously, from training and decoding, we can see the difficulties in SMT:

• It's difficult to obtain accurate word alignment because we have no knowledge besides the parallel data.
• It's difficult to determine which target phrase is the best candidate for a source phrase because a source phrase can have many translations, and different contexts lead to different translations.
• It's tough work to predict the translation derivation structure because phrase partition and phrase reordering for a source sentence can be arbitrary.
• It's difficult to learn a good language model due to the data sparseness problem.

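As a toy illustration of the log-linear model in Equation 8: the feature names, weights, and candidate list below are invented for the example; a real system uses trained feature functions, tuned weights (for instance via minimum error rate training), and a full decoder.

```python
import math

# Hypothetical feature scores h_i(f, e) for three candidate translations of one source sentence.
candidates = {
    "the relations between the two countries": {"tm": -1.2, "lm": -3.1, "reorder": -0.4},
    "the two countries the relations between": {"tm": -1.2, "lm": -6.8, "reorder": -0.1},
    "relations the two countries between":     {"tm": -2.5, "lm": -7.9, "reorder": -0.9},
}
weights = {"tm": 1.0, "lm": 0.6, "reorder": 0.8}      # the lambda_i of Equation 8

def linear_score(features):
    return sum(weights[name] * value for name, value in features.items())

# Equation 8: softmax-normalized scores; the argmax does not depend on the normalizer.
scores = {e: linear_score(h) for e, h in candidates.items()}
z = sum(math.exp(s) for s in scores.values())
posteriors = {e: math.exp(s) / z for e, s in scores.items()}
best = max(scores, key=scores.get)
print(best, round(posteriors[best], 3))
```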



The core issues lie in two areas: data sparseness (when considering additional contexts) and the lack of semantic modeling of words (phrases and sentences). Fortunately, DNNs are good at learning semantic representations and modeling wide context without severe data sparseness.

DNNs in Standard SMT Frameworks

The indirect application of DNNs in SMT aims to solve one difficult problem in an SMT system with more accurate context modeling and syntactic/semantic representation. Table 1 gives an overview of SMT problems and their various DNN solutions.

Table 1. Statistical machine translation difficulties and their corresponding deep neural network solutions.

Word alignment: FNN, RecurrentNN
Translation rule selection: FNN, RAE, CNN
Reordering and structure prediction: RAE, RecurrentNN, RecursiveNN
Language model: FNN, RecurrentNN
Joint translation prediction: FNN, RecurrentNN, CNN

DNNs for Word Alignment

Word alignment attempts to identify the word correspondence between parallel sentence pairs. Given a source sentence f = f1, f2, ..., ft, ..., fT and its target translation e = e1, e2, ..., et, ..., eT′, the word alignment is to find the set A = {(i, j), 1 ≤ i ≤ T, 1 ≤ j ≤ T′}, in which (i, j) denotes that fi and ej are translations of each other. Figure 7 shows an example.

Figure 7. Word alignment example. Each line connecting a Chinese word to an English word indicates they are translation pairs.

In SMT, the generative model is a popular solution for word alignment. Generative approaches use the statistics of word occurrences and learn their parameters to maximize the likelihood of the bilingual training data. They have two disadvantages: discrete symbol representation can't capture the similarity between words, and contextual information surrounding the word isn't fully explored.

Nan Yang and colleagues8 extended the HMM word alignment model and adapted each subcomponent with an FNN. The HMM word alignment takes the following form:

p(a, e | f) = Π_{j=1}^{T′} plex(ej | f_aj) pd(aj − aj−1), (9)

where plex is the lexical translation probability and pd is the distortion probability. Both components are modeled with an FNN. For the lexical translation score, the authors employed the following formula:

slex(ej | fi, e, f) = f3 ∘ f2 ∘ f1 ∘ L(window(ej), window(fi)). (10)

The FNN-based approach considers the bilingual contexts (window(ej) and window(fi)). All the source and target words in the window are mapped into vectors using L ∈ R^(k×|V|) and concatenated to feed to hidden layers f1 and f2. Finally, the output layer f3 generates a translation score. A similar FNN is applied to model the distortion score sd(aj − aj−1). This DNN-based method not only can learn the bilingual word embedding that captures the similarity between words, but can also make use of wide contextual information.

Akihiro Tamura and colleagues9 adopted RecurrentNN to extend the FNN-based model. Because the FNN-based approach can only explore the context in a window, the RecurrentNN predicts the jth alignment aj by conditioning on all the preceding alignments a1, ..., aj−1.

The reported experimental results indicate that RecurrentNN outperforms FNN in word alignment quality on the same test set. It also implies that RecurrentNN can capture long dependency by trying to memorize all the history.

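The windowed lexical score of Equation 10 can be sketched as follows. The window size, layer widths, tanh activations, padding convention, and random parameters are assumptions made for illustration, not the settings of Yang and colleagues:

```python
import numpy as np

k, win, hidden = 8, 3, 16            # embedding size, one-sided window, hidden width (assumed)
V_src, V_tgt = 1000, 1200            # toy vocabulary sizes
rng = np.random.default_rng(3)
L_src = rng.standard_normal((k, V_src)) * 0.1    # source embedding matrix
L_tgt = rng.standard_normal((k, V_tgt)) * 0.1    # target embedding matrix
in_dim = 2 * (2 * win + 1) * k                   # concatenated source and target windows
W1 = rng.standard_normal((hidden, in_dim)) * 0.1  # hidden layer f1
W2 = rng.standard_normal((hidden, hidden)) * 0.1  # hidden layer f2
w3 = rng.standard_normal(hidden) * 0.1            # output layer f3 -> scalar score

def window(ids, pos, pad):
    """Word ids in a fixed window around pos; index 0 doubles as padding (an assumption)."""
    return [ids[p] if 0 <= p < len(ids) else pad for p in range(pos - win, pos + win + 1)]

def s_lex(src_ids, tgt_ids, i, j):
    """Equation 10: score the link between source word f_i and target word e_j."""
    x = np.concatenate([L_src[:, window(src_ids, i, 0)].T.reshape(-1),
                        L_tgt[:, window(tgt_ids, j, 0)].T.reshape(-1)])
    h1 = np.tanh(W1 @ x)      # f1
    h2 = np.tanh(W2 @ h1)     # f2
    return float(w3 @ h2)     # f3

print(s_lex([5, 17, 42, 9], [3, 8, 11, 27, 2], i=2, j=3))
```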


DNNs for Translation Rule Selection

With word-aligned bilingual text, we can extract a huge number of translation rules. In phrase-based SMT, we can extract many phrase translation rules for a given source phrase. It becomes a key issue to choose the most appropriate translation rules during decoding. Traditionally, translation rule selection is usually performed according to co-occurrence statistics in the bilingual training data rather than by exploring the large context and its semantics.

Will Zou and colleagues10 used two FNNs (one for the source language and the other for the target language) to learn bilingual word embeddings so as to make sure that a source word is close to its correct translation in the joint embedding space. The FNN used for the source or target language takes as input the concatenation of the context words, applies one hidden layer, and finally generates a score in the output layer. For source language embedding, the objective function is as follows:

Jsrc + λ Jtgt→src, (11)

where Jsrc is the contrastive objective function with monolingual data, and Jtgt→src is a bilingual constraint:

Jtgt→src = ||Lsrc − Atgt→src Ltgt||². (12)

Equation 12 says that after the word alignment projection Atgt→src, the target word embeddings Ltgt should be close to the source embeddings Lsrc. This method has shown that it can cluster bilingual words with similar meanings. The bilingual word embeddings are adopted to calculate the semantic similarity between the source and target phrases in a phrasal rule, effectively improving the performance of translation rule selection.

Jianfeng Gao and colleagues11 attempted to predict the similarity between a source and a target phrase using two FNNs with the objective of maximizing translation quality on a validation set. For a phrasal rule (f1,...,i, e1,...,j), the FNN (for the source or target language) is employed first to abstract the vector representations of f1,...,i and e1,...,j, respectively. The similarity score is then score(f1,...,i, e1,...,j) = y_{f1,...,i}^T y_{e1,...,j}. The FNN parameters are trained to optimize the score of the phrase pairs that can lead to better translation quality in the validation set.

Because word order isn't considered in the above approach, Jianjun Zhang and colleagues12 proposed a bilingually constrained RAE (BRAE) to learn semantic phrase embeddings. As shown in Figure 3, unsupervised RAE can get the vector representation for each phrase. In contrast, the BRAE model not only tries to minimize the reconstruction error but also attempts to minimize the semantic distance between phrasal translation equivalents. By fine-tuning BRAE's parameters, the model can learn the semantic vector representation for each source and target phrase. Using BRAE, each phrase translation rule can be associated with a semantic similarity. With the help of semantic similarities, translation rule selection is much more accurate.

Lei Cui and colleagues13 applied the auto-encoder to learn the topic representation for each sentence in the parallel training data. By associating each translation rule with topic information, topic-related rules can be selected according to the distributed similarity with the source language text.

Although these methods adopt different DNNs, they all achieve better rule prediction by addressing different aspects such as phrase similarity and topic similarity. FNN as used in the first two approaches is simple and learns much of the semantics of words and phrases with bilingual or BLEU (Bilingual Evaluation Understudy) objectives. In contrast, RAE is capable of capturing a phrase's word order information.

DNNs for Reordering and Structure Prediction

After translation rule selection, we can obtain the partial translation candidates for the source phrases (see the branches in Figure 6). The next task is to perform derivation structure prediction, which includes two subtasks: determining which two neighboring candidates should be composed first, and deciding how to compose the two candidates. The first subtask hasn't been explicitly modeled to date. The second subtask is usually done via the reordering model. In SMT, the phrase reordering model is formalized as a classification problem, and discrete word features are employed, although data sparseness is a big issue and similar phrases can't share similar reordering patterns with each other.

Peng Li and colleagues14,15 adopted the semisupervised RAE to learn phrase representations that are sensitive to reordering patterns. For two neighboring translation candidates (f1, e1) and (f2, e2), the objective function is

E = α Erec(f1, e1, f2, e2) + (1 − α) Ereorder((f1, e1), (f2, e2)), (13)

where Erec(f1, e1, f2, e2) is the sum of reconstruction errors, and Ereorder((f1, e1), (f2, e2)) is the reordering loss computed with the cross-entropy error function. The semisupervised RAE shows that it can group the phrases sharing similar reordering patterns.

Feifei Zhai and colleagues16 and Jianjun Zhang and colleagues17 explicitly modeled the translation process of the derivation structure prediction. A type-dependent RecursiveNN17 jointly determines which two partial translation candidates should be composed together and how that should be done. Figure 8 shows a training example. For a parallel sentence pair (f, e), the correct derivation exactly leads to e, as Figure 8a illustrates. Meanwhile, we have other wrong derivation trees in the search space (Figure 8b gives one incorrect derivation). Using RecursiveNN, we can get scores SRecursiveNN(cTree) and SRecursiveNN(wTree) for the correct and incorrect derivations. We train the model by making sure that the score of the correct derivation is better than that of the incorrect one by a margin:

SRecursiveNN(cTree) ≥ SRecursiveNN(wTree) + Δ(SRecursiveNN(cTree), SRecursiveNN(wTree)), (14)

where Δ(SRecursiveNN(cTree), SRecursiveNN(wTree)) is a structure margin.

As RecursiveNN can only explore the children information, Shujie Liu and colleagues18 designed a model combining RecursiveNN and RecurrentNN together. This not only retains the capacity of RecursiveNN but also takes advantage of the history.

Compared to RAE, RecursiveNN applies different weight matrices according to different composition types. Other work17 has shown via experiments that RecursiveNN can outperform RAE on the same test data.

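A minimal sketch of the margin constraint in Equation 14, written as the usual hinge loss. The linear derivation scorer and the feature-count margin below are stand-ins invented for the example; in the cited work the scores come from the type-dependent RecursiveNN and the margin is defined over derivation structures:

```python
import numpy as np

rng = np.random.default_rng(4)
w = rng.standard_normal(6) * 0.1           # parameters of a toy derivation scorer

def score(derivation_features, params):
    """Stand-in for S_RecursiveNN(tree): here just a linear score over tree features."""
    return float(params @ derivation_features)

def structure_margin(correct_feats, wrong_feats):
    """Stand-in for the structure margin: count how many features differ."""
    return float(np.sum(correct_feats != wrong_feats))

def hinge_loss(correct_feats, wrong_feats, params):
    """Penalize violations of Equation 14: S(cTree) >= S(wTree) + margin."""
    margin = structure_margin(correct_feats, wrong_feats)
    return max(0.0, score(wrong_feats, params) + margin - score(correct_feats, params))

c = np.array([1, 0, 1, 1, 0, 1], dtype=float)    # features of the correct derivation (toy)
v = np.array([1, 1, 0, 1, 0, 1], dtype=float)    # features of a wrong derivation (toy)
print(hinge_loss(c, v, w))
```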



Figure 8. Type-dependent RecursiveNN: (a) correct derivation vs. (b) incorrect derivation. The correct derivation is obtained by performing forced decoding on the bilingual sentence pair; the derivation structure leads directly to the correct translation. The incorrect derivation is obtained by decoding the source sentence with the trained SMT model; it results in a wrong translation.

DNNs for Language Models in SMT

During derivation prediction, any composition of two partial translations leads to a bigger partial translation. The language model performs the task of measuring whether the translation hypothesis is fluent. In SMT, the most popular language model is the count-based n-gram model. One big issue here is that data sparseness becomes severe as n grows. To alleviate this problem, researchers have tried to design a neural network-based language model in the continuous vector space.

Yoshua Bengio and colleagues1 designed an FNN as Figure 1 shows to learn the n-gram model in the continuous space. For an n-gram e1, ..., en, each word in e1, ..., en−1 is mapped onto a vector, and the concatenation of these vectors feeds into the input layer, followed by one hidden layer and one softmax layer that outputs the probability p(en | e1, ..., en−1). The network parameters are optimized to maximize the likelihood of the large-scale monolingual data. Ashish Vaswani and colleagues19 employed two hidden layers in an FNN that's similar to Bengio's FNN.

The n-gram model assumes that the word depends on the previous n − 1 words. RecurrentNN doesn't use this assumption and models the probability of a sentence as follows:

p(e1, ..., eT′) = Π_{j=1}^{T′} p(ej | e1, ..., ej−1). (15)

All the history words are applied to predict the next word.

Tomas Mikolov20 designed the RecurrentNN language model (see Figure 2). A sentence start symbol <s> is first mapped to a real-valued vector as h0 and then employed to predict the probability of e1; h0 and e1 are used to form the new history h1 to predict e2, h1 and e2 generate h2, and so on. When predicting eT′, all the history e1, ..., eT′−1 can be used. The RecurrentNN language model is employed to rescore the n-best translation candidates. Michael Auli and Jianfeng Gao21 integrated the RecurrentNN language model into the decoding stage, and further improvements can still be obtained compared with just rescoring the final n-best translation candidates.

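The rescoring use of such a language model can be sketched as follows, applying the chain rule of Equation 15 to each candidate with a toy, untrained RecurrentNN. The candidate list and the parameters are illustrative assumptions; real systems combine this score with the other SMT features rather than using it alone:

```python
import numpy as np

vocab = ["<s>", "</s>", "the", "relations", "between", "two", "countries"]
V, k = len(vocab), 8
rng = np.random.default_rng(5)
L, U, W, O = (rng.standard_normal(s) * 0.1 for s in [(k, V), (k, k), (k, k), (V, k)])

def log_prob(sentence):
    """Equation 15: sum of log p(e_j | e_1..e_{j-1}) under a toy RecurrentNN LM."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    h, total = np.zeros(k), 0.0
    for prev, nxt in zip(words[:-1], words[1:]):
        h = U @ L[:, vocab.index(prev)] + W @ h          # h_t = U x_t + W h_{t-1}
        y = O @ h
        log_p = y - np.log(np.exp(y - y.max()).sum()) - y.max()   # log softmax
        total += log_p[vocab.index(nxt)]
    return total

# Rescore an n-best list from the SMT decoder and keep the most fluent candidate.
n_best = ["the relations between the two countries",
          "the two countries the relations between",
          "relations between the two countries the"]
print(max(n_best, key=log_prob))
```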


DNNs for Joint Translation Prediction

The joint model predicts the target translation by using both the source sentence's information and the target-side history.

Yuening Hu and colleagues22 and Youzheng Wu and colleagues23 cast the translation process as a language model prediction over minimum translation units (the smallest bilingual phrase pairs satisfying the word alignments). They adopted RecurrentNN to model the process.

Michael Auli and colleagues24 adapted the RecurrentNN language model and added a vector representation for the source sentence as the input along with the target history. Jacob Devlin and colleagues25 proposed a neural network joint model (NNJM) that adapts the FNN to take as input both the n − 1 target word history and an h-window source context. They reported promising improvements over a strong baseline. Because no global information is employed in NNJM, Fandong Meng and colleagues26 and Jiajun Zhang and colleagues27 presented an augmented NNJM model: a CNN is designed to learn the vector representation for each source sentence; then, the sentence representation augments the NNJM model's input to predict the target word generation. This approach further improves the translation quality over the NNJM model.

The RecurrentNN joint model just fits the phrase-based SMT due to the assumption that the translation is generated from left to right or right to left. In contrast, FNN and CNN can benefit all the translation models because they focus only on applying DNNs to learn the distributed representations of local and global contexts.

Purely Neural MT

Purely neural machine translation (NMT) is the new MT paradigm. The standard SMT system consists of several subcomponents that are separately optimized. In contrast, NMT employs only one neural network that's trained to maximize the conditional likelihood on the bilingual training data. The basic architecture includes two networks: one encodes the variable-length source sentence into a real-valued vector, and the other decodes the vector into a variable-length target sentence.

Kyunghyun Cho and colleagues,28 Ilya Sutskever and colleagues,29 and Dzmitry Bahdanau and colleagues30 follow a similar RecurrentNN encoder-decoder architecture (see Figure 9). Given a source sentence in vector sequence X = (x1, ..., xT), the encoder applies RecurrentNN to obtain a vector C = q(h1, ..., hT) in which ht (1 ≤ t ≤ T) is calculated as follows:

ht = f(ht−1, xt), (16)

where f and q are nonlinear functions. Sutskever and colleagues simplified the vector to be a fixed-length vector C = q(h1, ..., hT) = hT, whereas Bahdanau and colleagues directly applied the variable-length vector (h1, ..., hT) when predicting each target word.

Figure 9. Neural machine translation (NMT) architecture. The model reads a source sentence abc and produces a target sentence wxyz.

The decoder also applies RecurrentNN to predict the target sentence Y = (y1, ..., yT′), where T′ usually differs from T. Each target word yt depends on the source context C and all the predicted target words {y1, ..., yt−1}; the probability of Y will be

p(Y) = Π_{t=1}^{T′} p(yt | {y1, ..., yt−1}, Ct). (17)

Sutskever and colleagues chose Ct = C = hT, and Bahdanau and colleagues set Ct = Σ_{j=1}^{T} αtj hj.

All the network parameters are trained to maximize Π p(Y) over the bilingual training data. For a specific network structure, Sutskever and colleagues employed a deep LSTM to calculate each hidden state, whereas Bahdanau and colleagues applied a bidirectional RecurrentNN to compute the source-side hidden state hj. Both report similar or superior performance in English-to-French translation compared to the standard phrase-based SMT system.

The NMT architecture is simple, but it has many shortcomings. For example, it restricts the vocabulary to tens of thousands of words for both languages to make it workable in real applications, meaning that many unknown words appear. Furthermore, this architecture can't make use of large-scale target monolingual data. Recently, Minh-Thang Luong and colleagues31 and Sebastien Jean and colleagues32 attempted to solve the vocabulary problem, but their approaches are heuristic. For example, they used a dictionary in a post-processor to translate the unknown words.
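A bare-bones sketch of the encoder-decoder computation in Equations 16 and 17. Greedy decoding, tanh recurrences, and random untrained parameters are all simplifying assumptions; Sutskever and colleagues use deep LSTMs, and Bahdanau and colleagues replace the single vector C with attention over h1, ..., hT:

```python
import numpy as np

src_vocab = ["a", "b", "c", "<EOS>"]
tgt_vocab = ["w", "x", "y", "z", "<EOS>"]
k = 16
rng = np.random.default_rng(6)
E_src = rng.standard_normal((k, len(src_vocab))) * 0.1   # source embeddings
E_tgt = rng.standard_normal((k, len(tgt_vocab))) * 0.1   # target embeddings
W_enc, U_enc = rng.standard_normal((k, k)) * 0.1, rng.standard_normal((k, k)) * 0.1
W_dec, U_dec, C_dec = (rng.standard_normal((k, k)) * 0.1 for _ in range(3))
O = rng.standard_normal((len(tgt_vocab), k)) * 0.1

def encode(src_words):
    """Equation 16: h_t = f(h_{t-1}, x_t); return the last state as C (Sutskever-style)."""
    h = np.zeros(k)
    for w in src_words:
        h = np.tanh(W_enc @ h + U_enc @ E_src[:, src_vocab.index(w)])
    return h

def decode(C, max_len=10):
    """Equation 17: emit target words one by one, conditioning on C and the history."""
    s, prev, out = np.tanh(C), "<EOS>", []          # start state from the source summary
    for _ in range(max_len):
        s = np.tanh(W_dec @ s + U_dec @ E_tgt[:, tgt_vocab.index(prev)] + C_dec @ C)
        y = O @ s
        p = np.exp(y - y.max()); p /= p.sum()       # softmax over the target vocabulary
        prev = tgt_vocab[int(p.argmax())]           # greedy choice of y_t
        if prev == "<EOS>":
            break
        out.append(prev)
    return out

print(decode(encode(["a", "b", "c", "<EOS>"])))
```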



Discussion and Future Directions

Applying DNNs to MT is a hot research topic. Indirect application is a relatively conservative attempt because it retains the standard SMT system's strength, and the log-linear SMT model facilitates the integration of DNN-based translation features that can employ different kinds of DNNs to deal with different tasks. However, indirect application makes the SMT system much more complicated.

In contrast, direct application is simple in terms of model architecture: a network encodes the source sentence and another network decodes to the target sentence. Translation quality is improving, but this new MT architecture is far from perfect. There are still open questions of how to efficiently cover most of the vocabulary, how to make use of large-scale target monolingual data, and how to utilize more syntactic/semantic knowledge in addition to source sentences.

For both direct and indirect applications, DNNs boost translation performance. Naturally, we're interested in the following questions:

• Why can DNNs improve translation quality?
• Can DNNs lead to a big breakthrough?
• What aspects should DNNs improve if they're to become an MT panacea?

For the first question, DNNs represent and operate language units in a continuous vector space that facilitates the computation of semantic distance. For example, several algorithms such as Euclidean distance and cosine distance can be applied to calculate the similarity between phrases or sentences. But they also capture much more contextual information than standard SMT systems, and data sparseness isn't a big problem. For example, the RecurrentNN can utilize all the history information before the currently predicted target word; this is impossible with standard SMT systems.

For the second question, DNNs haven't achieved huge success with MT until recently. We've conducted some analysis and propose some key problems for SMT with DNNs:

• Computational complexity. Because the network structure is complicated, and normalization over the entire vocabulary is usually required, DNN training is a time-consuming task. Training a standard SMT system on millions of sentence pairs only requires about two or three days, whereas training a similar NMT system can take several weeks, even with powerful GPUs.
• Error analysis. Because the DNN-based subcomponent (or NMT) deals with variables in the real-valued continuous space and there are no effective approaches to show a meaningful and explainable trace from input to output, it's difficult to understand why it leads to better translation performance or why it fails.
• Remembering and reasoning. For current DNNs, the continuous vector representation (even using LSTM in RecurrentNN) can't remember full information for the source sentence. It's quite difficult to obtain the correct target translation by decoding from this representation. Furthermore, unlike other sequence-to-sequence NLP tasks, MT is a more complicated problem that requires rich reasoning operations (such as coreference resolution). Current DNNs can't perform this kind of reasoning with simple vector or matrix operations.

These problems tell us that DNNs have a long way to go in MT. Nevertheless, due to their effective representations of languages, they could be a good solution eventually. To achieve this goal, we should pay attention to the path ahead.

First, DNNs are good at handling continuous variables, but natural language is composed of abstract discrete symbols. If they completely abandon discrete symbols, DNNs won't fully control the language generation process: sentences are discrete, not continuous. Representing and handling both discrete and continuous variables in DNNs is a big challenge.

Second, DNNs represent words, phrases, and sentences in continuous space, but what if they could mine deeper knowledge, such as parts of speech, syntactic parse trees, and knowledge graphs? What about exploring wider knowledge beyond the sentence, such as paragraphs and discourse? Unfortunately, representation, computation, and reasoning over such information in DNNs remain difficult problems.

Third, effectively integrating DNNs into standard SMT is still worth trying. In the multicomponent system, we can study which subcomponent is indispensable and which can be completely replaced by DNN-based features. Instead of the log-linear model, we need a better mathematical model to combine multiple subcomponents.

Fourth, it's interesting and imperative to investigate more efficient algorithms for parameter learning of the complicated neural network architectures. Moreover, new network architectures can be explored in addition to existing neural networks. We believe that the best network architectures for MT must be equipped with representation, remembering, computation, and reasoning, simultaneously.

Acknowledgments

This research work was partially funded by the Natural Science Foundation of China under grant numbers 61333018 and 61303181, the International Science and Technology Cooperation Program of China under grant number 2014DFA11350, and the High New Technology Research and Development Program of Xinjiang Uyghur Autonomous Region, grant number 201312103.


References

1. Y. Bengio et al., "A Neural Probabilistic Language Model," J. Machine Learning Research, vol. 3, 2003, pp. 1137–1155; www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf.
2. J.L. Elman, "Distributed Representations, Simple Recurrent Networks, and Grammatical Structure," Machine Learning, vol. 7, 1991, pp. 195–225; http://crl.ucsd.edu/~elman/Papers/machine.learning.pdf.
3. R. Socher et al., "Semi-supervised Recursive Autoencoders for Predicting Sentiment Distributions," Proc. Empirical Methods and Natural Language Process, 2011; http://nlp.stanford.edu/pubs/SocherPenningtonHuangNgManning_EMNLP2011.pdf.
4. J.B. Pollack, "Recursive Distributed Representations," Artificial Intelligence, vol. 46, no. 1, 1990, pp. 77–105.
5. Y. LeCun et al., "Gradient-Based Learning Applied to Document Recognition," Proc. IEEE, vol. 86, no. 11, 1998, pp. 2278–2324.
6. F.J. Och and H. Ney, "Discriminative Training and Maximum Entropy Models for Statistical Machine Translation," Proc. ACL, 2002, pp. 295–302.
7. D. Xiong, Q. Liu, and S. Lin, "Maximum Entropy Based Phrase Reordering Model for Statistical Machine Translation," Proc. ACL, 2006, pp. 521–528.
8. N. Yang et al., "Word Alignment Modeling with Context Dependent Deep Neural Network," Proc. ACL, 2013, pp. 41–46.
9. A. Tamura, T. Watanabe, and E. Sumita, "Recurrent Neural Networks for Word Alignment Model," to be published in Proc. ACL, 2015.
10. W.Y. Zou et al., "Bilingual Word Embeddings for Phrase-Based Machine Translation," Proc. Empirical Methods and Natural Language Process, 2013, pp. 1393–1398.
11. J. Gao et al., "Learning Continuous Phrase Representations for Translation Modeling," Proc. ACL, 2014; www.aclweb.org/anthology/P14-1066.pdf.
12. J. Zhang et al., "Bilingually-Constrained Phrase Embeddings for Machine Translation," Proc. ACL, 2014, pp. 111–121.
13. L. Cui et al., "Learning Topic Representation for SMT with Neural Networks," Proc. ACL, 2014; http://aclweb.org/anthology/P/P14/P14-1000.pdf.
14. P. Li, Y. Liu, and M. Sun, "Recursive Autoencoders for ITG-Based Translation," Proc. Empirical Methods and Natural Language Process, 2013, pp. 151–161.
15. P. Li et al., "A Neural Reordering Model for Phrase-Based Translation," Proc. Conf. Computational Linguistics (COLING), 2014, pp. 1897–1907.
16. F. Zhai et al., "RNN-Based Derivation Structure Prediction for SMT," Proc. ACL, 2014, pp. 779–784.
17. J. Zhang et al., "Mind the Gap: Machine Translation by Minimizing the Semantic Gap in Embedding Space," Proc. AAAI, 2014, pp. 1657–1664.
18. S. Liu et al., "A Recursive Recurrent Neural Network for Statistical Machine Translation," Proc. ACL, 2014, pp. 1491–1500.
19. A. Vaswani et al., "Decoding with Large-Scale Neural Language Models Improves Translation," Proc. Empirical Methods and Natural Language Process, 2013; https://aclweb.org/anthology/D/D13/D13-1.pdf.
20. T. Mikolov, "Statistical Language Models Based on Neural Networks," presentation at Google, 2012; www.fit.vutbr.cz/~imikolov/rnnlm/google.pdf.
21. M. Auli and J. Gao, "Decoder Integration and Expected BLEU Training for Recurrent Neural Network Language Models," Proc. ACL, 2014; http://research.microsoft.com/pubs/217163/acl2014_expbleu_rnn.pdf.
22. Y. Hu et al., "Minimum Translation Modeling with Recurrent Neural Networks," Proc. European ACL, 2014; www.cs.umd.edu/~ynhu/publications/eacl2014_rnn_mtu.pdf.
23. Y. Wu, T. Watanabe, and C. Hori, "Recurrent Neural Network-Based Tuple Sequence Model for Machine Translation," Proc. Conf. Computational Linguistics (COLING), 2014; http://anthology.aclweb.org/C/C14/C14-1180.pdf.
24. M. Auli et al., "Joint Language and Translation Modeling with Recurrent Neural Networks," Proc. Empirical Methods and Natural Language Process, 2013; http://research.microsoft.com/pubs/201107/emnlp2013rnnmt.pdf.
25. J. Devlin et al., "Fast and Robust Neural Network Joint Models for Statistical Machine Translation," Proc. ACL, 2014; http://aclweb.org/anthology/P/P14/P14-1000.pdf.
26. F. Meng, "Encoding Source Language with Convolutional Neural Network for Machine Translation," arXiv preprint arXiv:1503.01838, 2015.
27. J. Zhang, D. Zhang, and J. Hao, "Local Translation Prediction with Global Sentence Representation," to be published in Proc. Int'l J. Conf. Artificial Intelligence, 2015.
28. K. Cho et al., "Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation," Proc. Empirical Methods and Natural Language Processing, 2014, pp. 355–362.
29. I. Sutskever, O. Vinyals, and Q.V. Le, "Sequence to Sequence Learning with Neural Networks," Proc. Neural Information Processing Systems (NIPS), 2014; http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf.
30. D. Bahdanau, K. Cho, and Y. Bengio, "Neural Machine Translation by Jointly Learning to Align and Translate," arXiv preprint arXiv:1409.0473, 2014.
31. T. Luong et al., "Addressing the Rare Word Problem in Neural Machine Translation," arXiv preprint arXiv:1410.8206, 2014.
32. S. Jean et al., "On Using Very Large Target Vocabulary for Neural Machine Translation," arXiv preprint arXiv:1412.2007, 2014.

The Authors

Jiajun Zhang is an associate professor at the National Laboratory of Pattern Recognition at the Institute of Automation, Chinese Academy of Sciences. His research interests include machine translation, multilingual natural language processing, and statistical learning. Zhang has a PhD in computer science from the Institute of Automation, Chinese Academy of Sciences. Contact him at [email protected].

Chengqing Zong is a professor at the National Laboratory of Pattern Recognition at the Institute of Automation, Chinese Academy of Sciences. His research interests include machine translation, natural language processing, and sentiment classification. Zong has a PhD in computer science from the Institute of Computing Technology, Chinese Academy of Sciences. He's a member of the International Committee on Computational Linguistics (ICCL). Contact him at [email protected].

