
Machine Translation

WiSe 2016/2017

Neural Machine Translation


Dr. Mariana Neves January 30th, 2017
Overview

● Introduction
● Neural networks
● Neural language models
● Attentional encoder-decoder
● Google NMT

2
Overview


● Introduction
● Neural networks
● Neural language models
● Attentional encoder-decoder
● Google NMT

3
Neural MT


"Neural MT went from a fringe research activity in 2014 to the
widely-adopted leading way to do MT in 2016." (NMT ACL'16)


Google Scholar hits:

– Since 2012: 28,600
– Since 2015: 22,500
– Since 2016: 16,100

4
Neural MT

[Picture from NMT ACL16 slides]


5
Neural MT

● "Neural Machine Translation is the approach of modeling the
entire MT process via one big artificial neural network" (NMT ACL'16)

[Picture from NMT ACL16 slides]


6
Overview

● Introduction
● Neural networks
● Neural language models
● Attentional encoder-decoder
● Google NMT

7
Artificial neuron


Inputs correspond to the dendrites; the output corresponds to the axon

Activation occurs (the message is passed on) if the sum of the
weighted inputs is higher than a threshold

(http://www.innoarchitech.com/artificial-intelligence-deep-learning-neural-networks-explained/)
8
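To make the weighted-sum-and-threshold behaviour concrete, here is a minimal Python sketch of such a neuron; the weights and the threshold are made-up numbers, not taken from the slides.

```python
import numpy as np

def neuron(inputs, weights, threshold):
    """Fires (outputs 1) if the weighted sum of the inputs exceeds the threshold."""
    weighted_sum = np.dot(inputs, weights)       # sum of weighted inputs ("dendrites")
    return 1 if weighted_sum > threshold else 0  # output ("axon")

# Hypothetical example: two inputs with hand-picked weights and threshold.
print(neuron(np.array([1.0, 0.5]), np.array([0.8, 0.4]), threshold=0.9))  # -> 1
```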
Artificial neural networks (ANN)

● Statistical models inspired by biological neural networks


● They model and process nonlinear relationships between input
and output
● They are based on adaptive weights and a cost function
● They are trained with optimization techniques, e.g., gradient descent
and stochastic gradient descent

9
Basic architecture of ANNs


Layers of artificial neurons

Input layer, hidden layer, output layer

Overfitting can occur with increasing model complexity

(http://www.innoarchitech.com/artificial-intelligence-deep-learning-neural-networks-explained/)
10
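As a minimal illustration of the layered architecture (input layer, hidden layer, output layer), the sketch below passes a vector through one hidden layer with a nonlinearity and a linear output layer; all sizes and weights are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary layer sizes: 4 inputs, 3 hidden units, 2 outputs.
W1, b1 = rng.normal(size=(4, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 2)), np.zeros(2)

def forward(x):
    hidden = np.tanh(x @ W1 + b1)   # hidden layer with a nonlinear activation
    return hidden @ W2 + b2         # output layer (raw scores)

print(forward(rng.normal(size=4)))
```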
Deep learning

● Certain types of neural networks that consume raw input data

● Data is processed through many layers of nonlinear transformations

11
Deep learning – feature learning


Deep learning excels at unsupervised feature extraction, i.e., the
automatic derivation of meaningful features from the input data


Deep networks learn which features are important for a task


This is in contrast to feature selection and engineering, the usual
tasks in classical machine learning approaches

12
Deep learning - architectures


Feed-forward neural networks

Recurrent neural network

Multi-layer perceptrons

Convolutional neural networks

Recursive neural networks

Deep belief networks

Convolutional deep belief networks

Self-Organizing Maps

Deep Boltzmann machines

Stacked de-noising auto-encoders

13
Overview


● Introduction
● Neural networks
● Neural language models
● Attentional encoder-decoder
● Google NMT

14
Language Models (LM) for MT

15
LM for MT

16
N-gram Neural LM with feed-forward NN

● Input: one-hot representations of the words in the context u of
size (n−1), where n is the order of the language model

(https://www3.nd.edu/~dchiang/papers/vaswani-emnlp13.pdf)
17
N-gram Neural LM with
feed-forward NN

● Input: context of n-1 previous words


● Output: probability distribution for
next word
● Size of input/output: vocabulary size
● One or many hidden layers
● Embedding layer is lower
dimensional and dense
– Smaller weight matrices
– Learns to map similar words to
similar points in the vector
space

(https://www3.nd.edu/~dchiang/papers/vaswani-emnlp13.pdf)
18
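The forward pass of such a feed-forward n-gram LM can be sketched in a few lines of numpy: one-hot context words are mapped through a shared embedding matrix, concatenated, passed through a hidden layer, and turned into a distribution over the vocabulary with a softmax. This is only an untrained toy version (random weights, the four-word vocabulary of the next slide), not the model of Vaswani et al.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["man", "runs", "the", "."]          # toy vocabulary from the next slide
V, d, h, n = len(vocab), 8, 16, 3            # vocab size, embedding dim, hidden dim, order

E  = rng.normal(scale=0.1, size=(V, d))          # embedding matrix (shared by all positions)
W1 = rng.normal(scale=0.1, size=((n - 1) * d, h))
W2 = rng.normal(scale=0.1, size=(h, V))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def next_word_distribution(context):
    """P(next word | n-1 context words)."""
    onehots = np.eye(V)[[vocab.index(w) for w in context]]   # (n-1) one-hot rows
    embedded = onehots @ E                                   # look up dense embeddings
    hidden = np.tanh(embedded.reshape(-1) @ W1)              # concatenate and project
    return softmax(hidden @ W2)                              # distribution over vocabulary

print(next_word_distribution(["the", "man"]))   # untrained, so roughly uniform
```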
One-hot representation

● Corpus: "the man runs."


● Vocabulary = {man,runs,the,.}
● Input/output for p(runs|the man)

x0 = (0, 0, 1, 0)      ("the")
x1 = (1, 0, 0, 0)      ("man")
ytrue = (0, 1, 0, 0)   ("runs")

19
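The vectors above can be built mechanically from the vocabulary; a small sketch:

```python
import numpy as np

vocab = ["man", "runs", "the", "."]          # vocabulary from the slide

def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

# Inputs and target for p(runs | the man)
x0, x1, y_true = one_hot("the"), one_hot("man"), one_hot("runs")
print(x0, x1, y_true)   # [0 0 1 0] [1 0 0 0] [0 1 0 0]
```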
Softmax function

● It normalizes the output vector into a probability distribution
(values sum to 1)
● Its computational cost is linear in the vocabulary size
● When combined with stochastic gradient descent, training minimizes
cross-entropy (perplexity)

$$p(y = j \mid x) = \frac{e^{x^{T} w_{j}}}{\sum_{k=1}^{K} e^{x^{T} w_{k}}}$$

($x^{T} w$ is the inner product of x (sample vector) and w (weight vector))

20
Softmax function

● Example:
– input = [1,2,3,4,1,2,3]
– softmax = [0.024, 0.064, 0.175, 0.475, 0.024, 0.064, 0.175]

● The output has most of its weight where the '4' was in the original
input.
● The function highlights the largest values and suppresses values
that are significantly below the maximum value.

(https://en.wikipedia.org/wiki/Softmax_function)
21
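The example can be reproduced with a few lines of numpy (subtracting the maximum before exponentiating is a standard trick for numerical stability and does not change the result):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

print(np.round(softmax([1, 2, 3, 4, 1, 2, 3]), 3))
# -> [0.024 0.064 0.175 0.475 0.024 0.064 0.175]
```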
(http://sebastianruder.com/word-embeddings-1/)

Classical neural language model


(Bengio et al. 2003)

22
Feed-forward neural language model (FFNLM) in SMT

● One more feature in the log-linear phrase-based model

23
(http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/)

Recurrent neural networks language model (RNNLM)


Recurrent neural networks (RNN) are a class of neural networks in
which connections between the units form a directed cycle

They make use of sequential information

They do not assume that inputs (and outputs) are independent of each other

24
(http://karpathy.github.io/2015/05/21/rnn-effectiveness/)

RNNLM

● Conditions on arbitrarily long contexts


● No Markov assumption
● It reads one word at a time and updates the network state incrementally

25
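A minimal numpy sketch of this incremental behaviour: the hidden state is updated one word at a time and is the only thing carrying the (arbitrarily long) context. Weights are random and untrained, and the toy vocabulary is the one used earlier.

```python
import numpy as np

rng = np.random.default_rng(2)
vocab = ["man", "runs", "the", "."]
V, h = len(vocab), 16

E  = rng.normal(scale=0.1, size=(V, h))    # input embeddings
Wh = rng.normal(scale=0.1, size=(h, h))    # recurrent weights
Wo = rng.normal(scale=0.1, size=(h, V))    # output projection

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

state = np.zeros(h)
for word in ["the", "man"]:                              # read one word at a time
    state = np.tanh(E[vocab.index(word)] + state @ Wh)   # incremental state update
    p_next = softmax(state @ Wo)                         # distribution over the next word
    print(word, "->", np.round(p_next, 3))
```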
Overview

● Introduction
● Neural networks
● Neural language models
● Attentional encoder-decoder
● Google NMT

26
Translation modelling

● Source sentence S of length m: x1, . . . , xm


● Target sentence T of length n: y1, . . . , yn

$$T^{*} = \arg\max_{T} P(T \mid S)$$

$$P(T \mid S) = P(y_1, \ldots, y_n \mid x_1, \ldots, x_m)$$

$$P(T \mid S) = \prod_{i=1}^{n} P(y_i \mid y_0, \ldots, y_{i-1}, x_1, \ldots, x_m)$$

27
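The product over i simply chains per-word conditional probabilities; with made-up numbers:

```python
import math

# Hypothetical per-word probabilities P(y_i | y_<i, x_1..x_m) produced by a decoder.
word_probs = [0.4, 0.7, 0.9, 0.95]

p_sentence = math.prod(word_probs)                 # P(T | S) as the product
log_p = sum(math.log(p) for p in word_probs)       # usually accumulated in log space
print(p_sentence, math.exp(log_p))
```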
Encoder-Decoder


Two RNNs (usually LSTM):
– encoder reads input and produces hidden state
representations
– decoder produces output, based on last encoder hidden
state

[Picture from NMT ACL16 slides]


28
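A heavily simplified sketch of the two-RNN idea, using plain tanh RNN cells instead of LSTMs, random untrained weights, toy vocabularies, and greedy decoding; it only shows the data flow (encoder summary vector → decoder initial state), not a working translator.

```python
import numpy as np

rng = np.random.default_rng(3)
src_vocab, tgt_vocab = ["das", "haus", "</s>"], ["the", "house", "</s>"]
h = 16

Es, We = rng.normal(scale=0.1, size=(len(src_vocab), h)), rng.normal(scale=0.1, size=(h, h))
Et, Wd = rng.normal(scale=0.1, size=(len(tgt_vocab), h)), rng.normal(scale=0.1, size=(h, h))
Wo = rng.normal(scale=0.1, size=(h, len(tgt_vocab)))

def encode(source):
    state = np.zeros(h)
    for w in source:                                   # encoder reads the input...
        state = np.tanh(Es[src_vocab.index(w)] + state @ We)
    return state                                       # ...last hidden state = summary vector

def decode(state, max_len=5):
    out, prev = [], "</s>"                             # end symbol doubles as start symbol here
    for _ in range(max_len):                           # decoder generates from the summary
        state = np.tanh(Et[tgt_vocab.index(prev)] + state @ Wd)
        prev = tgt_vocab[int(np.argmax(state @ Wo))]   # greedy choice (untrained = arbitrary)
        if prev == "</s>":
            break
        out.append(prev)
    return out

print(decode(encode(["das", "haus", "</s>"])))
```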
Long short-term memory (LSTM)

● It is a special kind of RNN


● It connects previous information to the present task
● It is capable of learning long-term dependencies

(http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
29
Long short-term memory (LSTM)

● LSTMs have four interacting layers


● But there are many variations of the architecture

(http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
30
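The "four interacting layers" are commonly written as the forget, input, and output gates plus a candidate cell update; below is a single step of the standard LSTM equations in numpy (random weights, arbitrary sizes).

```python
import numpy as np

rng = np.random.default_rng(4)
d, h = 8, 16                                    # input and hidden sizes (arbitrary)
W = rng.normal(scale=0.1, size=(4, d + h, h))   # parameters of the four gates/layers
b = np.zeros((4, h))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, hidden, cell):
    z = np.concatenate([x, hidden])
    f = sigmoid(z @ W[0] + b[0])          # forget gate
    i = sigmoid(z @ W[1] + b[1])          # input gate
    o = sigmoid(z @ W[2] + b[2])          # output gate
    c_tilde = np.tanh(z @ W[3] + b[3])    # candidate cell state
    cell = f * cell + i * c_tilde         # keep long-term memory, add new information
    hidden = o * np.tanh(cell)
    return hidden, cell

hdn, cel = lstm_step(rng.normal(size=d), np.zeros(h), np.zeros(h))
print(hdn.shape, cel.shape)
```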
Encoder-Decoder

● Encoder and decoder are learned jointly

● The supervision signal from parallel corpora is backpropagated

(https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-2/)
31
(https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-2/)

The Encoder (continuous-space representation)

● The encoder linearly projects the 1-of-K coded vector wi with a
matrix E, which has as many columns as there are words in the
source vocabulary and as many rows as you want (typically 100–500)

32
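Because wi is 1-of-K coded, this linear projection is equivalent to picking one column of E, i.e., an embedding lookup; a sketch with toy sizes:

```python
import numpy as np

rng = np.random.default_rng(5)
K, d = 4, 100                            # toy vocabulary size, chosen embedding dimension

E = rng.normal(scale=0.1, size=(d, K))   # as many columns as source words, d rows

w_i = np.zeros(K); w_i[2] = 1.0          # 1-of-K coded source word
s_i = E @ w_i                            # linear projection = picking column 2 of E
print(np.allclose(s_i, E[:, 2]))         # True: equivalent to an embedding lookup
```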
The Encoder (summary vector)

● Last encoder hidden state summarizes source sentence


● But quality decreases for long sentences (fixed-size vector)

33 (https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-2/)
The Encoder (summary vector)

● Projection to 2D using Principal Component Analysis (PCA)

34 (https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-2/)
(https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-2/)

The Decoder

● The inverse of the encoder


● Based on the softmax function

35
Problem with simple E-D architectures

● A fixed-size vector from which the decoder needs to generate a
full translation

● The context vector must contain every single detail of the
source sentence

● The dimensionality of the context vector must be large enough
that a sentence of any length can be compressed into it

(https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/)
36
Problem with simple E-D architectures

● Large models are necessary to cope with long sentences

(experiments with small models)

(https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/)
37
(https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-2/)

Bidirectional recurrent neural network (BRNN)

● Use a memory with as many banks as there are source words, instead
of a fixed-size context vector
● BRNN = forward RNN + backward RNN

38
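A sketch of the idea using plain tanh RNN cells (random weights): one RNN reads the sentence left to right, another right to left, and concatenating their states gives one "memory bank" per source word.

```python
import numpy as np

rng = np.random.default_rng(6)
vocab, h = ["das", "haus", "ist", "gross", "</s>"], 8
E  = rng.normal(scale=0.1, size=(len(vocab), h))
Wf = rng.normal(scale=0.1, size=(h, h))   # forward RNN weights
Wb = rng.normal(scale=0.1, size=(h, h))   # backward RNN weights

def run_rnn(words, W):
    states, s = [], np.zeros(h)
    for w in words:
        s = np.tanh(E[vocab.index(w)] + s @ W)
        states.append(s)
    return states

sentence = ["das", "haus", "ist", "gross", "</s>"]
fwd = run_rnn(sentence, Wf)                                  # forward RNN
bwd = run_rnn(sentence[::-1], Wb)[::-1]                      # backward RNN, re-aligned
banks = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]   # one vector per source word
print(len(banks), banks[0].shape)                            # 5 annotation vectors of size 2h
```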
Bidirectional recurrent neural network (BRNN)

● At any position, the forward and backward vectors together summarize
the whole input sentence

(https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/)
39
Bidirectional recurrent neural network (BRNN)

● This mechanism allows storage of a source sentence as a


variable-length representation

(https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/)
40
Soft Attention mechanism

● It is a small NN that takes as input the previous decoder‘s hidden


state (what has been translated)

(https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/)
41
Soft Attention mechanism

● It contains one hidden layer and outputs a scalar


● Normalization (so that the scores sum to 1) is done with the softmax function

(https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/)
42
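A numpy sketch of this scoring network as described above: a one-hidden-layer net maps (previous decoder state, source annotation) to a scalar, the scores are normalized with a softmax, and the context vector is the weighted sum of the annotations; sizes and weights are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(7)
h_dec, h_enc, h_att = 16, 16, 32     # arbitrary sizes
Wa = rng.normal(scale=0.1, size=(h_dec + h_enc, h_att))
va = rng.normal(scale=0.1, size=h_att)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(decoder_state, annotations):
    # one hidden layer -> one scalar score per source position
    scores = [np.tanh(np.concatenate([decoder_state, a]) @ Wa) @ va for a in annotations]
    weights = softmax(np.array(scores))          # normalized to sum to 1
    context = weights @ np.stack(annotations)    # weighted sum of source annotations
    return context, weights

annotations = [rng.normal(size=h_enc) for _ in range(5)]   # one per source word
context, weights = attend(rng.normal(size=h_dec), annotations)
print(np.round(weights, 3), context.shape)
```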
Soft Attention mechanism

● The model learns attention (alignment) between the two languages

(https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/)
43
Soft Attention mechanism

● With this mechanism, the quality of the translation does not drop
as the sentence length increases

(https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/)
44
Overview

● Introduction
● Neural networks
● Neural language models
● Attentional encoder-decoder
● Google NMT

45
Google Multilingual NMT system (Nov/16)


Simplicity:
– A single NMT model to translate between multiple languages,
instead of many pairwise models (100²)


Low-resource language improvement:
– Improve low-resource language pairs by mixing them with
high-resource languages in a single model


Zero-shot translation:
– The model learns to perform implicit bridging between language
pairs never seen explicitly during training

46 (https://arxiv.org/abs/1611.04558)
(https://arxiv.org/abs/1609.08144)

Google NMT system (Sep-Oct/16)

● Deep LSTM network with 8 encoder and 8 decoder layers

47
Google NMT system (Sep-Oct/16)

● Normal LSTM (left) vs. stacked LSTM (right) with residual


connections

48 (https://arxiv.org/abs/1609.08144)
Google NMT system (Sep-Oct/16)

● The outputs of LSTMf and LSTMb are first concatenated and then
fed to the next LSTM layer, LSTM1

49 (https://arxiv.org/abs/1609.08144)
Google NMT system (Sep-Oct/16)

● Wordpiece model (WPM) implementation initially developed to


solve a Japanese/Korean segmentation problem

● Data-driven approach to maximize the language-model


likelihood of the training data

(“_” is a special character added to mark the beginning of a word.)

50 (https://arxiv.org/abs/1609.08144)
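The wordpiece vocabulary itself is learned with the likelihood criterion above; the sketch below only illustrates how a learned vocabulary might be applied to segment a word, using a simple greedy longest-match rule and a hypothetical toy vocabulary (an illustration, not Google's actual WPM implementation).

```python
def segment(word, pieces):
    """Greedy longest-match segmentation into wordpieces ("_" marks a word start)."""
    out, rest = [], "_" + word
    while rest:
        for end in range(len(rest), 0, -1):      # try the longest matching piece first
            if rest[:end] in pieces:
                out.append(rest[:end])
                rest = rest[end:]
                break
        else:
            return None                          # unsegmentable with this vocabulary
    return out

# Hypothetical toy wordpiece vocabulary.
pieces = {"_trans", "_", "lat", "ion", "s", "_the"}
print(segment("translations", pieces))   # ['_trans', 'lat', 'ion', 's']
```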
Google Multilingual NMT system (Nov/16)


51 (https://arxiv.org/abs/1611.04558)
Google Multilingual NMT system (Nov/16)

● Introduction of an artificial token at the beginning of the input


sentence to indicate the target language the model should
translate to.

52 (https://arxiv.org/abs/1611.04558)
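Mechanically this is just source-side preprocessing; a minimal sketch, assuming the "<2xx>" token format used in the paper's examples.

```python
def add_target_token(source_sentence, target_lang):
    """Prepend an artificial token telling the model which language to translate into."""
    return f"<2{target_lang}> {source_sentence}"

# E.g. ask the single multilingual model for a Spanish translation:
print(add_target_token("How are you?", "es"))   # "<2es> How are you?"
```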
Google Multilingual NMT system (Nov/16)

● Experiments: Many to one

53 (https://arxiv.org/abs/1611.04558)
Google Multilingual NMT system (Nov/16)

● Experiments: One to many

54 (https://arxiv.org/abs/1611.04558)
Google Multilingual NMT system (Nov/16)

● Experiments: Many to many

55 (https://arxiv.org/abs/1611.04558)
Google Multilingual NMT system (Nov/16)

● Experiments: Zero-Shot translation

56 (https://arxiv.org/abs/1611.04558)
Summary


● Very brief introduction to neural networks
● Neural language models
– One-hot representations (1-of-K coded vector)
– Softmax function
● Neural machine translation
– Recurrent NN; LSTM
– Encoder and Decoder
– Soft attention mechanism (BRNN)
● Google MT
– Architecture and multilingual experiments

57
Suggested reading


Artificial Intelligence, Deep Learning, and Neural Networks Explained:
http://www.innoarchitech.com/artificial-intelligence-deep-learning-neural-networks-explained/

Introduction to Neural Machine Translation with GPUs:
https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-with-gpus/

Neural Machine Translation slides, ACL'2016:
https://sites.google.com/site/acl16nmt/

Neural Machine Translation slides (Univ. Edinburgh):
http://statmt.org/mtma16/uploads/mtma16-neural.pdf

58
