
Machine Translation

WiSe 2016/2017

Neural Machine Translation


Dr. Mariana Neves January 30th, 2017
Overview

● Introduction
● Neural networks
● Neural language models
● Attentional encoder-decoder
● Google NMT

2
Overview


● Introduction
● Neural networks
● Neural language models
● Attentional encoder-decoder
● Google NMT

3
Neural MT


"Neural MT went from a fringe research activity in 2014 to the
widely-adopted leading way to do MT in 2016." (NMT ACL'16)


Google Scholar hits:

– Since 2012: 28,600
– Since 2015: 22,500
– Since 2016: 16,100

4
Neural MT

[Picture from NMT ACL16 slides]


5
Neural MT

● "Neural Machine Translation is the approach of modeling the
entire MT process via one big artificial neural network" (NMT ACL'16)

[Picture from NMT ACL16 slides]


6
Overview

● Introduction
● Neural networks
● Neural language models
● Attentional encoder-decoder
● Google NMT

7
Artificial neuron


Inputs correspond to the dendrites; the output corresponds to the axon

Activation occurs (the message is passed on) if the sum of the
weighted inputs is higher than a threshold

(http://www.innoarchitech.com/artificial-intelligence-deep-learning-neural-networks-explained/)
8
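To make the weighted-sum-and-threshold behaviour concrete, here is a minimal Python sketch of such a neuron; the weights and the threshold are made-up numbers, not taken from the slides.

```python
import numpy as np

def neuron(inputs, weights, threshold):
    """Fires (outputs 1) if the weighted sum of the inputs exceeds the threshold."""
    weighted_sum = np.dot(inputs, weights)       # sum of weighted inputs ("dendrites")
    return 1 if weighted_sum > threshold else 0  # output ("axon")

# Hypothetical example: two inputs with hand-picked weights and threshold.
print(neuron(np.array([1.0, 0.5]), np.array([0.8, 0.4]), threshold=0.9))  # -> 1
```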
Artificial neural networks (ANN)

● Statistical models inspired by biological neural networks


● They model and process nonlinear relationships between input
and output
● They are based on adaptive weights and a cost function
● They are trained with optimization techniques, e.g., gradient descent
and stochastic gradient descent

9
Basic architecture of ANNs


Layers of artificial neurons

Input layer, hidden layer, output layer

Overfitting can occur with increasing model complexity

(http://www.innoarchitech.com/artificial-intelligence-deep-learning-neural-networks-explained/)
10
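As a minimal illustration of the layered architecture (input layer, hidden layer, output layer), the sketch below passes a vector through one hidden layer with a nonlinearity and a linear output layer; all sizes and weights are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary layer sizes: 4 inputs, 3 hidden units, 2 outputs.
W1, b1 = rng.normal(size=(4, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 2)), np.zeros(2)

def forward(x):
    hidden = np.tanh(x @ W1 + b1)   # hidden layer with a nonlinear activation
    return hidden @ W2 + b2         # output layer (raw scores)

print(forward(rng.normal(size=4)))
```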
Deep learning

● Certain types of neural networks that consume raw input data

● Data is processed through many layers of nonlinear transformations

11
Deep learning – feature learning


Deep learning excels at unsupervised feature extraction, i.e., the
automatic derivation of meaningful features from the input data


Deep networks learn which features are important for a task


This is in contrast to feature selection and engineering, the usual
tasks in classical machine learning approaches

12
Deep learning - architectures


Feed-forward neural networks

Recurrent neural network

Multi-layer perceptrons

Convolutional neural networks

Recursive neural networks

Deep belief networks

Convolutional deep belief networks

Self-Organizing Maps

Deep Boltzmann machines

Stacked de-noising auto-encoders

13
Overview


● Introduction
● Neural networks
● Neural language models
● Attentional encoder-decoder
● Google NMT

14
Language Models (LM) for MT

15
LM for MT

16
N-gram Neural LM with feed-forward NN

● Input: one-hot representations of the words in the context u of
size (n−1), where n is the order of the language model

(https://www3.nd.edu/~dchiang/papers/vaswani-emnlp13.pdf)
17
N-gram Neural LM with
feed-forward NN

● Input: context of n-1 previous words


● Output: probability distribution for
next word
● Size of input/output: vocabulary size
● One or many hidden layers
● Embedding layer is lower
dimensional and dense
– Smaller weight matrices
– Learns to map similar words to
similar points in the vector
space

(https://www3.nd.edu/~dchiang/papers/vaswani-emnlp13.pdf)
18
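The forward pass of such a feed-forward n-gram LM can be sketched in a few lines of numpy: one-hot context words are mapped through a shared embedding matrix, concatenated, passed through a hidden layer, and turned into a distribution over the vocabulary with a softmax. This is only an untrained toy version (random weights, the four-word vocabulary of the next slide), not the model of Vaswani et al.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["man", "runs", "the", "."]          # toy vocabulary from the next slide
V, d, h, n = len(vocab), 8, 16, 3            # vocab size, embedding dim, hidden dim, order

E  = rng.normal(scale=0.1, size=(V, d))          # embedding matrix (shared by all positions)
W1 = rng.normal(scale=0.1, size=((n - 1) * d, h))
W2 = rng.normal(scale=0.1, size=(h, V))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def next_word_distribution(context):
    """P(next word | n-1 context words)."""
    onehots = np.eye(V)[[vocab.index(w) for w in context]]   # (n-1) one-hot rows
    embedded = onehots @ E                                   # look up dense embeddings
    hidden = np.tanh(embedded.reshape(-1) @ W1)              # concatenate and project
    return softmax(hidden @ W2)                              # distribution over vocabulary

print(next_word_distribution(["the", "man"]))   # untrained, so roughly uniform
```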
One-hot representation

● Corpus: "the man runs."


● Vocabulary = {man,runs,the,.}
● Input/output for p(runs|the man)

x0 = (0, 0, 1, 0)      ("the")
x1 = (1, 0, 0, 0)      ("man")
ytrue = (0, 1, 0, 0)   ("runs")

19
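The vectors above can be built mechanically from the vocabulary; a small sketch:

```python
import numpy as np

vocab = ["man", "runs", "the", "."]          # vocabulary from the slide

def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

# Inputs and target for p(runs | the man)
x0, x1, y_true = one_hot("the"), one_hot("man"), one_hot("runs")
print(x0, x1, y_true)   # [0 0 1 0] [1 0 0 0] [0 1 0 0]
```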
Softmax function

● It normalizes the output vector into a probability distribution
(values sum to 1)
● Its computational cost is linear in the vocabulary size
● When combined with stochastic gradient descent, training minimizes
cross-entropy (perplexity)

$$p(y = j \mid x) = \frac{e^{x^{T} w_{j}}}{\sum_{k=1}^{K} e^{x^{T} w_{k}}}$$

($x^{T} w$ is the inner product of x (sample vector) and w (weight vector))

20
Softmax function

● Example:
– input = [1,2,3,4,1,2,3]
– softmax = [0.024, 0.064, 0.175, 0.475, 0.024, 0.064, 0.175]

● The output has most of its weight where the '4' was in the original
input.
● The function highlights the largest values and suppresses values
that are significantly below the maximum value.

(https://en.wikipedia.org/wiki/Softmax_function)
21
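The example can be reproduced with a few lines of numpy (subtracting the maximum before exponentiating is a standard trick for numerical stability and does not change the result):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

print(np.round(softmax([1, 2, 3, 4, 1, 2, 3]), 3))
# -> [0.024 0.064 0.175 0.475 0.024 0.064 0.175]
```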
(http://sebastianruder.com/word-embeddings-1/)

Classical neural language model


(Bengio et al. 2003)

22
Feed-forward neural language model (FFNLM) in SMT

● One more feature in the log-linear phrase-based model

23
(http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/)

Recurrent neural networks language model (RNNLM)


Recurrent neural networks (RNN) are a class of neural networks in
which connections between the units form a directed cycle

They make use of sequential information

They do not assume that inputs (and outputs) are independent of each other

24
(http://karpathy.github.io/2015/05/21/rnn-effectiveness/)

RNNLM

● Conditions on arbitrarily long contexts


● No Markov assumption
● It reads one word at a time and updates the network state incrementally

25
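A minimal numpy sketch of this incremental behaviour: the hidden state is updated one word at a time and is the only thing carrying the (arbitrarily long) context. Weights are random and untrained, and the toy vocabulary is the one used earlier.

```python
import numpy as np

rng = np.random.default_rng(2)
vocab = ["man", "runs", "the", "."]
V, h = len(vocab), 16

E  = rng.normal(scale=0.1, size=(V, h))    # input embeddings
Wh = rng.normal(scale=0.1, size=(h, h))    # recurrent weights
Wo = rng.normal(scale=0.1, size=(h, V))    # output projection

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

state = np.zeros(h)
for word in ["the", "man"]:                              # read one word at a time
    state = np.tanh(E[vocab.index(word)] + state @ Wh)   # incremental state update
    p_next = softmax(state @ Wo)                         # distribution over the next word
    print(word, "->", np.round(p_next, 3))
```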
Overview

● Introduction
● Neural networks
● Neural language models
● Attentional encoder-decoder
● Google NMT

26
Translation modelling

● Source sentence S of length m: x1, . . . , xm


● Target sentence T of length n: y1, . . . , yn

$$T^{*} = \arg\max_{T} P(T \mid S)$$

$$P(T \mid S) = P(y_1, \ldots, y_n \mid x_1, \ldots, x_m)$$

$$P(T \mid S) = \prod_{i=1}^{n} P(y_i \mid y_0, \ldots, y_{i-1}, x_1, \ldots, x_m)$$

27
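The product over i simply chains per-word conditional probabilities; with made-up numbers:

```python
import math

# Hypothetical per-word probabilities P(y_i | y_<i, x_1..x_m) produced by a decoder.
word_probs = [0.4, 0.7, 0.9, 0.95]

p_sentence = math.prod(word_probs)                 # P(T | S) as the product
log_p = sum(math.log(p) for p in word_probs)       # usually accumulated in log space
print(p_sentence, math.exp(log_p))
```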
Encoder-Decoder


Two RNNs (usually LSTM):
– encoder reads input and produces hidden state
representations
– decoder produces output, based on last encoder hidden
state

[Picture from NMT ACL16 slides]


28
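A heavily simplified sketch of the two-RNN idea, using plain tanh RNN cells instead of LSTMs, random untrained weights, toy vocabularies, and greedy decoding; it only shows the data flow (encoder summary vector → decoder initial state), not a working translator.

```python
import numpy as np

rng = np.random.default_rng(3)
src_vocab, tgt_vocab = ["das", "haus", "</s>"], ["the", "house", "</s>"]
h = 16

Es, We = rng.normal(scale=0.1, size=(len(src_vocab), h)), rng.normal(scale=0.1, size=(h, h))
Et, Wd = rng.normal(scale=0.1, size=(len(tgt_vocab), h)), rng.normal(scale=0.1, size=(h, h))
Wo = rng.normal(scale=0.1, size=(h, len(tgt_vocab)))

def encode(source):
    state = np.zeros(h)
    for w in source:                                   # encoder reads the input...
        state = np.tanh(Es[src_vocab.index(w)] + state @ We)
    return state                                       # ...last hidden state = summary vector

def decode(state, max_len=5):
    out, prev = [], "</s>"                             # end symbol doubles as start symbol here
    for _ in range(max_len):                           # decoder generates from the summary
        state = np.tanh(Et[tgt_vocab.index(prev)] + state @ Wd)
        prev = tgt_vocab[int(np.argmax(state @ Wo))]   # greedy choice (untrained = arbitrary)
        if prev == "</s>":
            break
        out.append(prev)
    return out

print(decode(encode(["das", "haus", "</s>"])))
```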
Long short-term memory (LSTM)

● It is a special kind of RNN


● It connects previous information to the present task
● It is capable of learning long-term dependencies

(http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
29
Long short-term memory (LSTM)

● LSTMs have four interacting layers


● But there are many variations of the architecture

(http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
30
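The "four interacting layers" are commonly written as the forget, input, and output gates plus a candidate cell update; below is a single step of the standard LSTM equations in numpy (random weights, arbitrary sizes).

```python
import numpy as np

rng = np.random.default_rng(4)
d, h = 8, 16                                    # input and hidden sizes (arbitrary)
W = rng.normal(scale=0.1, size=(4, d + h, h))   # parameters of the four gates/layers
b = np.zeros((4, h))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, hidden, cell):
    z = np.concatenate([x, hidden])
    f = sigmoid(z @ W[0] + b[0])          # forget gate
    i = sigmoid(z @ W[1] + b[1])          # input gate
    o = sigmoid(z @ W[2] + b[2])          # output gate
    c_tilde = np.tanh(z @ W[3] + b[3])    # candidate cell state
    cell = f * cell + i * c_tilde         # keep long-term memory, add new information
    hidden = o * np.tanh(cell)
    return hidden, cell

hdn, cel = lstm_step(rng.normal(size=d), np.zeros(h), np.zeros(h))
print(hdn.shape, cel.shape)
```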
Encoder-Decoder

● Encoder and decoder are learned jointly

● The supervision signal from parallel corpora is backpropagated

(https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-2/)
31
(https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-2/)

The Encoder (continuous-space representation)

● The encoder linearly projects the 1-of-K coded vector wi with a
matrix E, which has as many columns as there are words in the
source vocabulary and as many rows as you want (typically 100–500)

32
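Because wi is 1-of-K coded, this linear projection is equivalent to picking one column of E, i.e., an embedding lookup; a sketch with toy sizes:

```python
import numpy as np

rng = np.random.default_rng(5)
K, d = 4, 100                            # toy vocabulary size, chosen embedding dimension

E = rng.normal(scale=0.1, size=(d, K))   # as many columns as source words, d rows

w_i = np.zeros(K); w_i[2] = 1.0          # 1-of-K coded source word
s_i = E @ w_i                            # linear projection = picking column 2 of E
print(np.allclose(s_i, E[:, 2]))         # True: equivalent to an embedding lookup
```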
The Encoder (summary vector)

● Last encoder hidden state summarizes source sentence


● But quality decreases for long sentences (fixed-size vector)

33 (https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-2/)
The Encoder (summary vector)

● Projection to 2D using Principal Component Analysis (PCA)

34 (https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-2/)
(https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-2/)

The Decoder

● The inverse of the encoder


● Based on the softmax function

35
Problem with simple E-D architectures

● A fixed-size vector from which the decoder needs to generate a
full translation

● The context vector must contain every single detail of the
source sentence

● The dimensionality of the context vector must be large enough
that a sentence of any length can be compressed into it

(https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/)
36
Problem with simple E-D architectures

● Large models are necessary to cope with long sentences

(experiments with small models)

(https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/)
37
(https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-2/)

Bidirectional recurrent neural network (BRNN)

● Use a memory with as many banks as there are source words, instead
of a fixed-size context vector
● BRNN = forward RNN + backward RNN

38
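A sketch of the idea using plain tanh RNN cells (random weights): one RNN reads the sentence left to right, another right to left, and concatenating their states gives one "memory bank" per source word.

```python
import numpy as np

rng = np.random.default_rng(6)
vocab, h = ["das", "haus", "ist", "gross", "</s>"], 8
E  = rng.normal(scale=0.1, size=(len(vocab), h))
Wf = rng.normal(scale=0.1, size=(h, h))   # forward RNN weights
Wb = rng.normal(scale=0.1, size=(h, h))   # backward RNN weights

def run_rnn(words, W):
    states, s = [], np.zeros(h)
    for w in words:
        s = np.tanh(E[vocab.index(w)] + s @ W)
        states.append(s)
    return states

sentence = ["das", "haus", "ist", "gross", "</s>"]
fwd = run_rnn(sentence, Wf)                                  # forward RNN
bwd = run_rnn(sentence[::-1], Wb)[::-1]                      # backward RNN, re-aligned
banks = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]   # one vector per source word
print(len(banks), banks[0].shape)                            # 5 annotation vectors of size 2h
```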
Bidirectional recurrent neural network (BRNN)

● At any position, the forward and backward vectors together summarize
the whole input sentence

(https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/)
39
Bidirectional recurrent neural network (BRNN)

● This mechanism allows storage of a source sentence as a


variable-length representation

(https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/)
40
Soft Attention mechanism

● It is a small NN that takes as input the previous decoder‘s hidden


state (what has been translated)

(https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/)
41
Soft Attention mechanism

● It contains one hidden layer and outputs a scalar


● Normalization (so that the scores sum to 1) is done with the softmax function

(https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/)
42
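A numpy sketch of this scoring network as described above: a one-hidden-layer net maps (previous decoder state, source annotation) to a scalar, the scores are normalized with a softmax, and the context vector is the weighted sum of the annotations; sizes and weights are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(7)
h_dec, h_enc, h_att = 16, 16, 32     # arbitrary sizes
Wa = rng.normal(scale=0.1, size=(h_dec + h_enc, h_att))
va = rng.normal(scale=0.1, size=h_att)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(decoder_state, annotations):
    # one hidden layer -> one scalar score per source position
    scores = [np.tanh(np.concatenate([decoder_state, a]) @ Wa) @ va for a in annotations]
    weights = softmax(np.array(scores))          # normalized to sum to 1
    context = weights @ np.stack(annotations)    # weighted sum of source annotations
    return context, weights

annotations = [rng.normal(size=h_enc) for _ in range(5)]   # one per source word
context, weights = attend(rng.normal(size=h_dec), annotations)
print(np.round(weights, 3), context.shape)
```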
Soft Attention mechanism

● The model learns attention (alignment) between the two languages

(https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/)
43
Soft Attention mechanism

● With this mechanism, the quality of the translation does not drop
as the sentence length increases

(https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/)
44
Overview

● Introduction
● Neural networks
● Neural language models
● Attentional encoder-decoder
● Google NMT

45
Google Multilingual NMT system (Nov/16)


Simplicity:
– A single NMT model to translate between multiple languages,
instead of many pairwise models (100²)


Low-resource language improvement:
– Improve low-resource language pairs by mixing them with
high-resource languages in a single model


Zero-shot translation:
– The model learns to perform implicit bridging between language
pairs never seen explicitly during training

46 (https://arxiv.org/abs/1611.04558)
(https://arxiv.org/abs/1609.08144)

Google NMT system (Sep-Oct/16)

● Deep LSTM network with 8 encoder and 8 decoder layers

47
Google NMT system (Sep-Oct/16)

● Normal LSTM (left) vs. stacked LSTM (right) with residual


connections

48 (https://arxiv.org/abs/1609.08144)
Google NMT system (Sep-Oct/16)

● The outputs of LSTMf and LSTMb are first concatenated and then
fed to the next LSTM layer, LSTM1

49 (https://arxiv.org/abs/1609.08144)
Google NMT system (Sep-Oct/16)

● Wordpiece model (WPM) implementation initially developed to


solve a Japanese/Korean segmentation problem

● Data-driven approach to maximize the language-model


likelihood of the training data

(“_” is a special character added to mark the beginning of a word.)

50 (https://arxiv.org/abs/1609.08144)
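The wordpiece vocabulary itself is learned with the likelihood criterion above; the sketch below only illustrates how a learned vocabulary might be applied to segment a word, using a simple greedy longest-match rule and a hypothetical toy vocabulary (an illustration, not Google's actual WPM implementation).

```python
def segment(word, pieces):
    """Greedy longest-match segmentation into wordpieces ("_" marks a word start)."""
    out, rest = [], "_" + word
    while rest:
        for end in range(len(rest), 0, -1):      # try the longest matching piece first
            if rest[:end] in pieces:
                out.append(rest[:end])
                rest = rest[end:]
                break
        else:
            return None                          # unsegmentable with this vocabulary
    return out

# Hypothetical toy wordpiece vocabulary.
pieces = {"_trans", "_", "lat", "ion", "s", "_the"}
print(segment("translations", pieces))   # ['_trans', 'lat', 'ion', 's']
```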
Google Multilingual NMT system (Nov/16)


51 (https://arxiv.org/abs/1611.04558)
Google Multilingual NMT system (Nov/16)

● Introduction of an artificial token at the beginning of the input


sentence to indicate the target language the model should
translate to.

52 (https://arxiv.org/abs/1611.04558)
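Mechanically this is just source-side preprocessing; a minimal sketch, assuming the "<2xx>" token format used in the paper's examples.

```python
def add_target_token(source_sentence, target_lang):
    """Prepend an artificial token telling the model which language to translate into."""
    return f"<2{target_lang}> {source_sentence}"

# E.g. ask the single multilingual model for a Spanish translation:
print(add_target_token("How are you?", "es"))   # "<2es> How are you?"
```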
Google Multilingual NMT system (Nov/16)

● Experiments: Many to one

53 (https://arxiv.org/abs/1611.04558)
Google Multilingual NMT system (Nov/16)

● Experiments: One to many

54 (https://arxiv.org/abs/1611.04558)
Google Multilingual NMT system (Nov/16)

● Experiments: Many to many

55 (https://arxiv.org/abs/1611.04558)
Google Multilingual NMT system (Nov/16)

● Experiments: Zero-Shot translation

56 (https://arxiv.org/abs/1611.04558)
Summary


● Very brief introduction to neural networks
● Neural language models
– One-hot representations (1-of-K coded vector)
– Softmax function
● Neural machine translation
– Recurrent NN; LSTM
– Encoder and Decoder
– Soft attention mechanism (BRNN)
● Google MT
– Architecture and multilingual experiments

57
Suggested reading


Artificial Intelligence, Deep Learning, and Neural Networks Explained:
http://www.innoarchitech.com/artificial-intelligence-deep-learning-neural-networks-explained/

Introduction to Neural Machine Translation with GPUs:
https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-with-gpus/

Neural Machine Translation slides, ACL'2016:
https://sites.google.com/site/acl16nmt/

Neural Machine Translation slides (Univ. Edinburgh):
http://statmt.org/mtma16/uploads/mtma16-neural.pdf

58
