M5 Topic 1 - Encoder Decoder

The document presents an overview of the Encoder-Decoder model, particularly in the context of Sequence-to-Sequence (Seq2Seq) tasks such as machine translation and text summarization. It discusses the evolution from rule-based and statistical machine translation to neural Seq2Seq models utilizing LSTMs, which effectively manage long-term dependencies. Additionally, it highlights the limitations of fixed-size context vectors and introduces the attention mechanism and Transformers as advancements that enhance model performance in natural language processing.


THE ENCODER-DECODER MODEL

Presented By:

Rishika Hazarika (210710007037)
Shruti Sarma (210710007047)
Shreyoshi Ghosh (210710007046)
Sequence-to-Sequence (Seq2Seq) Problem

• Involves taking an input sequence (of any length) and mapping it to an output sequence (which can be of a different length).
• Challenge: the model must understand the meaning of the entire input sequence before generating the correct output.
• Examples:
a. Machine Translation: "I love cats" → "J'aime les chats"
b. Speech Recognition: Audio waveform → "Hello, how are you?"
c. Text Summarization: Long document → Short summary
d. Question Answering: "Who discovered gravity?" → "Isaac Newton"
e. Chatbots: "How are you?" → "I'm doing well, thanks!"

• Goal of a Seq2Seq model: to map one input sequence to an output sequence.


Machine Translation Systems Prior to Seq2Seq Models:
Before deep learning-based Seq2Seq models, machine translation was handled by:
a. Rule-Based Machine Translation:

• Relied on manually written grammatical rules and dictionaries for translating text.
• Example: Translating English to French, "I eat an apple" → "Je mange une pomme", required predefined grammar rules.
• Disadvantage: Needed separate rules for different language pairs and for every sentence structure.
• Problem: Could not handle new or complex sentences whose grammatical rules were not predefined.

b. Statistical Machine Translation:

• Used probability-based models trained on bilingual corpora.


• Key idea: Break sentences into phrases and find the most probable translation based on statistical alignment.
• Example: If "I love" → "J'aime" appears most often in the training data, it is chosen.
• Advantage: Learned from large datasets rather than predefined rules; more flexible than rule-based machine translation.
• Disadvantage: Struggled with long sentences and word-order issues; required large amounts of training data.
Neural Sequence-to-Sequence Learning: A Breakthrough
Sequence to Sequence Learning (Sutskever et al., 2014)

• Proposed an end-to-end neural network approach for Seq2Seq tasks with minimal assumptions.
• Used multilayered LSTMs: one to encode the input into a fixed-dimensional vector and another to decode it.
• Achieved a BLEU score of 34.8 on the WMT'14 English-French dataset, outperforming Statistical Machine Translation (33.3).

BLEU (Bilingual Evaluation Understudy) score: a metric to evaluate machine-generated translations by comparing them
with human translations.
Higher BLEU score = better translation quality.
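To make the metric concrete, here is a minimal sketch of how a sentence-level BLEU score might be computed with NLTK; the library choice and the toy sentences are assumptions for illustration, not the WMT'14 setup.

# Minimal sentence-level BLEU sketch (illustrative only).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["j'aime", "les", "chats"]]   # human translation(s), tokenized
hypothesis = ["j'aime", "les", "chats"]    # machine translation, tokenized

# Smoothing avoids zero scores when higher-order n-grams are missing in short sentences.
score = sentence_bleu(reference, hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.2f}")  # higher = closer to the human reference(s)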
What is the Encoder-Decoder Model?
• In the Seq2Seq model, the encoder-decoder architecture converts input sequences into output sequences.
• According to Daniel Jurafsky & James H. Martin in their book “Speech and Language Processing: An
Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition”:
The key idea underlying these networks is the use of an encoder network that takes an input sequence and creates a contextualized representation of it, often called the context. This representation is then passed to a decoder which generates a task-specific output sequence.

Fig 1: A basic encoder-decoder model


LSTMs: A Foundation for Encoder-Decoder Models
RNNs were among the first neural models used to process sequential data.

• Problem with RNNs → They forget long-term context due to the vanishing gradient problem.
• LSTM Solution → Learns what to remember and what to forget, handling longer sequences.
• Memory Paths in LSTM:
  • Long-Term Memory → Stores important past information.
  • Short-Term Memory → Keeps recent context.

Why LSTM for Encoder-Decoder Models?

• Encoders need LSTMs to store and pass on meaningful context.
• Decoders use this context for better predictions.
LSTM Architecture: Memory Cells & Gates
• LSTMs use a memory cell to store important information over long sequences, addressing the vanishing gradient problem in RNNs.
• Three gates regulate information flow:
  • Input Gate → Decides what new information to store in the memory cell.
  • Forget Gate → Removes irrelevant or outdated information.
  • Output Gate → Determines what processed information is sent as output.

How does it help in Encoder-Decoder Models?


• Ensures that relevant context is retained across long sequences.
• Helps the encoder store useful information for the decoder to generate accurate outputs.
LSTM Activation Functions: Sigmoid & Tanh
LSTM uses two key activation functions to regulate information flow in the memory cell.

Sigmoid Activation Function (σ)

• Maps input values between 0 and 1.
• Helps decide what to forget (0) or keep (1).
• Formula: σ(x) = 1 / (1 + e⁻ˣ)

Hyperbolic Tangent Function (Tanh)

• Maps input values between -1 and 1.
• Decides how much information should be added to or removed from the memory cell.
• Formula: tanh(x) = (eˣ − e⁻ˣ) / (eˣ + e⁻ˣ)
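As a quick illustration, a minimal NumPy sketch of these two activations; the function and variable names are my own, not from the slides.

import numpy as np

def sigmoid(x):
    # Squashes any real value into the range (0, 1): 0 ≈ forget, 1 ≈ keep.
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Squashes any real value into the range (-1, 1): a signed "how much" to add or remove.
    return np.tanh(x)

print(sigmoid(np.array([-2.0, 0.0, 2.0])))  # ≈ [0.119, 0.5, 0.881]
print(tanh(np.array([-2.0, 0.0, 2.0])))     # ≈ [-0.964, 0.0, 0.964]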
Understanding LSTM: A Step-by-Step Example
Consider some data where the X-axis represents the day the value was recorded, and the Y-axis represents the stock value.

Our goal is to remember the stock value on Day 1 to accurately predict the value on Day 5.
Understanding LSTM: A Step-by-Step Example
At each step, the LSTM model goes through three key stages:

• Stage 1: Forget Gate – Determines what percentage of the long-term memory should be retained.
• Stage 2: Input Gate – Creates a potential long-term memory and decides how much of it should be added to the existing long-term memory.
• Stage 3: Output Gate – Updates the short-term memory by starting with the new long-term memory and determining how much of it should be passed on to the next step.
Encoder-Decoder Architecture
1. Encoder:
• Takes an input sequence (e.g. a sentence in English) and processes it using layers like RNNs, LSTMs, etc.
• Converts the input into a fixed-length representation called a context vector. This vector captures the meaning
of the entire input sequence.
2. Context Vector:
• This is the compressed form of the input sequence.
• Contains the ‘context’ or meaning of the input sequence.
3. Decoder:
• Takes the context vector and generates the output sequence one step at a time.
• At each step, it predicts the next token (word/character) using previous outputs and the context vector.
• Continues until it generates the full output sequence.

Fig 2: A basic encoder-decoder architecture


Step 1: Word Vector Generation (How Words Become Numbers)
Why do we need word vectors?
• Computers don't understand words directly, so we convert them into numerical vectors (embeddings).

What are word vectors (embeddings)?

• Dense numerical representations of words in a vector space.

How are word vectors formed?

• Word vectors (embeddings) are learned automatically by training a model on large text data. The idea is:
• Words that appear in similar contexts should have similar vectors.
• The model assigns each word a vector of numbers (e.g., 300-dimensional).
• These numbers are adjusted so that words with similar meanings end up closer together in the vector space.

Fig 3: Word embedding (Courtesy of Medium)
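For illustration, a minimal PyTorch sketch of an embedding lookup; the toy vocabulary, the word ids, and the 300-dimensional size are assumptions for this example, not the slides' setup.

import torch
import torch.nn as nn

# Toy vocabulary: each word gets an integer id (an assumption for this example).
vocab = {"<pad>": 0, "i": 1, "love": 2, "coding": 3, "cats": 4}

# Embedding table: one trainable 300-dimensional vector per word.
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=300)

# "I love coding" -> ids -> dense vectors of shape (3, 300).
ids = torch.tensor([vocab["i"], vocab["love"], vocab["coding"]])
vectors = embedding(ids)
print(vectors.shape)  # torch.Size([3, 300])

# During training, these vectors are adjusted so that words used in
# similar contexts end up close together in the vector space.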
Step 2: How Does the Encoder Work?
In an LSTM-based encoder, each word passes through an LSTM cell.
States in an LSTM-based encoder:
1. Hidden state (hₜ) – the working memory (short-term representation).
2. Cell state (cₜ) – the long-term memory storage.
Both the hidden state (hₜ) and the cell state (cₜ) are updated at each step as new words are processed.
How are the States Updated?
Step 1: Forget gate – forgets irrelevant information from the past:
fₜ = σ(W_f · [hₜ₋₁, xₜ] + b_f)
Step 2: Input gate and cell-state update – decides what new information to add:
iₜ = σ(W_i · [hₜ₋₁, xₜ] + b_i)
c̃ₜ = tanh(W_c · [hₜ₋₁, xₜ] + b_c)
cₜ = fₜ ⊙ cₜ₋₁ + iₜ ⊙ c̃ₜ
Step 3: Output gate – decides what part of the memory becomes the hidden state:
oₜ = σ(W_o · [hₜ₋₁, xₜ] + b_o)
hₜ = oₜ ⊙ tanh(cₜ)
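A minimal NumPy sketch of one LSTM cell step following the gate equations above; the dimensions, random weights, and helper names are assumptions for illustration.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    # Concatenate previous hidden state and current input: [h_{t-1}, x_t]
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W_f @ z + b_f)          # forget gate: what to keep from c_{t-1}
    i_t = sigmoid(W_i @ z + b_i)          # input gate: how much new info to add
    c_tilde = np.tanh(W_c @ z + b_c)      # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde    # updated long-term memory
    o_t = sigmoid(W_o @ z + b_o)          # output gate
    h_t = o_t * np.tanh(c_t)              # updated short-term memory (hidden state)
    return h_t, c_t

# Tiny example: 4-dimensional input, 3-dimensional hidden state, random weights.
rng = np.random.default_rng(0)
d_in, d_h = 4, 3
W = lambda: rng.normal(size=(d_h, d_h + d_in)) * 0.1
b = lambda: np.zeros(d_h)
h, c = np.zeros(d_h), np.zeros(d_h)
h, c = lstm_step(rng.normal(size=d_in), h, c, W(), W(), W(), W(), b(), b(), b(), b())
print(h, c)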
Step 3: How is the Context Vector Formed?
After processing all words, the final hidden state of the encoder becomes the context vector.

Example (Encoding "I love coding" into a context vector):

1. Start with an initial hidden state:
   h₀ = [0, 0, 0] (all zeros initially)
2. Process "I":
   h₁ = tanh(W_h · h₀ + W_x · x₁ + b)
   h₁ = [0.1, 0.5, -0.2]
3. Process "love":
   h₂ = tanh(W_h · h₁ + W_x · x₂ + b)
   h₂ = [0.3, 0.6, -0.1]
4. Process "coding":
   h₃ = tanh(W_h · h₂ + W_x · x₃ + b)
   h₃ = [0.2, 0.8, 0.1]
5. Final hidden state = context vector:
   context vector = [0.2, 0.8, 0.1]
This context vector summarizes the entire sentence.
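As a sketch, an LSTM encoder in PyTorch whose final hidden and cell states serve as the context vector; the class name, vocabulary, and dimensions are assumptions, not the slides' exact setup.

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)        # (batch, seq_len, embed_dim)
        outputs, (h_n, c_n) = self.lstm(embedded)   # h_n, c_n: (1, batch, hidden_dim)
        return h_n, c_n                             # final states = context vector

# "I love coding" as toy token ids (an assumption for this example).
vocab = {"<pad>": 0, "i": 1, "love": 2, "coding": 3}
encoder = Encoder(vocab_size=len(vocab))
ids = torch.tensor([[vocab["i"], vocab["love"], vocab["coding"]]])  # batch of 1
context_h, context_c = encoder(ids)
print(context_h.shape)  # torch.Size([1, 1, 64]) — the "summary" of the sentence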
Decoder:
• The decoder is the second part of the encoder-decoder model.
• It generates output sequences from the encoded information.
• Used in machine translation, text generation, and speech recognition.

How does a Decoder work?

1. The encoder's final state (context vector) is passed as the initial state of the decoder.
2. The first input to the decoder is typically a special token <SOS> (Start of Sentence).
3. Using the LSTM architecture, the first hidden state (hₜ) is generated.
4. Compute scores for each word in the vocabulary:
   sₜ = W_s · hₜ + b_s
   Suppose we have 5 words in our vocabulary.
5. Convert the scores to probabilities (softmax layer):

   Word     Score sₜ    Probability P(yₜ)
   Apple    2.1         0.32
   Cat      1.5         0.20
   Dog      1.8         0.24
   Pizza    0.5         0.10
   Run      0.3         0.08

6. The predicted word is used as input for the next time step.
7. Repeat until the End-of-Sentence (<EOS>) token is generated.
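A minimal PyTorch sketch of one decoder step (score computation plus softmax); the 5-word vocabulary, the <SOS> id, and the dimensions are assumptions for illustration.

import torch
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)     # s_t = W_s · h_t + b_s

    def step(self, token_id, state):
        embedded = self.embedding(token_id)              # (batch, 1, embed_dim)
        output, state = self.lstm(embedded, state)       # state initialized from the encoder
        scores = self.out(output.squeeze(1))             # (batch, vocab_size)
        probs = torch.softmax(scores, dim=-1)            # softmax layer
        next_token = probs.argmax(dim=-1, keepdim=True)  # greedy choice of the next word
        return next_token, probs, state

# One step: start from <SOS> (id 0 here, an assumption) and the encoder's context.
decoder = Decoder(vocab_size=5)
sos = torch.tensor([[0]])
state = (torch.zeros(1, 1, 64), torch.zeros(1, 1, 64))  # stand-in for (context_h, context_c)
next_token, probs, state = decoder.step(sos, state)
print(probs)  # probabilities over the 5-word vocabulary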
Training vs Testing - Teacher Forcing Rule
During Training:
• Instead of using the predicted word, we force the correct word from the training data as input.
• This helps the model learn faster and prevents it from getting stuck in errors.
Mathematically, in training:
xₜ₊₁ = Embedding(correct word from the dataset)

During Testing:
• The decoder uses its own predicted words as inputs.
• Errors accumulate if a wrong word is predicted, leading to poor output quality.
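A minimal sketch contrasting the two modes, reusing the hypothetical Decoder from the previous sketch; the targets tensor, token ids, and function name are assumptions for illustration.

import torch

def decode_sequence(decoder, context_state, targets=None, max_len=10, sos_id=0, eos_id=1):
    # targets is not None -> training with teacher forcing (feed the correct word).
    # targets is None     -> inference (feed back the decoder's own prediction).
    token = torch.tensor([[sos_id]])
    state = context_state
    outputs = []
    for t in range(max_len):
        predicted, probs, state = decoder.step(token, state)
        outputs.append(probs)
        if targets is not None:
            token = targets[:, t:t + 1]   # teacher forcing: use the ground-truth word
        else:
            token = predicted             # inference: use the model's own prediction
            if token.item() == eos_id:    # stop once <EOS> is produced
                break
    return outputs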
Advantages and Disadvantages of LSTM-based Encoder-Decoder

ADVANTAGES:
• Better than regular RNNs.
• Can work with different input and output lengths.
• The encoder compresses the input into a fixed-size context vector, which acts as a summary of the sentence.

DISADVANTAGES:
• Fixed-size context vector – loss of information.
• Slow for long sequences.
• High memory usage.

How do we overcome the challenges?


• Attention (as an add-on to the existing LSTM architecture)
Paper: "Neural Machine Translation by Jointly Learning to Align and Translate" (Bahdanau et al., 2015)
• The Transformer model (self-attention mechanism and positional encoding)
Paper: "Attention Is All You Need" (Vaswani et al., 2017)
Conclusion:
• LSTM-based Encoder-Decoder models were a major breakthrough in sequence-to-sequence tasks but struggled with long-term dependencies.
• The Attention Mechanism improved this by dynamically focusing on relevant parts of the input sequence.
• Transformers (introduced in "Attention Is All You Need") completely replaced LSTMs, enabling parallel processing and greater efficiency.
• Modern NLP models, including GPT and BERT, are built on these advancements, making deep learning-based language understanding more powerful than ever.
THANK YOU
