M5 Topic 1 - Encoder Decoder

The document presents an overview of the Encoder-Decoder model, particularly in the context of Sequence-to-Sequence (Seq2Seq) tasks such as machine translation and text summarization. It discusses the evolution from rule-based and statistical machine translation to neural Seq2Seq models utilizing LSTMs, which effectively manage long-term dependencies. Additionally, it highlights the limitations of fixed-size context vectors and introduces the attention mechanism and Transformers as advancements that enhance model performance in natural language processing.


THE ENCODER-DECODER MODEL

Presented By:

Rishika Hazarika (210710007037)
Shruti Sarma (210710007047)
Shreyoshi Ghosh (210710007046)
Sequence-to-Sequence (Seq2Seq) Problem

• Involves taking an input sequence (of any length) and mapping it to an output sequence (which can be of a different length).
• Challenge: the model must understand the meaning of the entire input sequence before generating the correct output.
• Examples:
a. Machine Translation: "I love cats" → "J'aime les chats"
b. Speech Recognition: Audio waveform → "Hello, how are you?"
c. Text Summarization: Long document → Short summary
d. Question Answering: "Who discovered gravity?" → "Isaac Newton"
e. Chatbots: "How are you?" → "I'm doing well, thanks!"

• Goal of a Seq2Seq model: to map one input sequence to an output sequence.


Machine Translation Systems Prior to Seq2Seq Models:
Before deep learning-based Seq2Seq models, machine translation was handled by:
a. Rule-Based Machine Translation:

• Relied on manually written grammatical rules and dictionaries for translating text.
• Example: Translating English to French, "I eat an apple" → "Je mange une pomme", required predefined grammar rules.
• Disadvantage: Needed separate rules for different language pairs and for every sentence structure.
• Problem: Could not handle new or complex sentences whose grammatical rules were not predefined.

b. Statistical Machine Translation:

• Used probability-based models trained on bilingual corpora.


• Key idea: Break sentences into phrases and find the most probable translation based on statistical alignment.
• Example: If "I love" → "J'aime" appears most often in the training data, it is chosen.
• Advantage: Learned from large datasets rather than predefined rules; more flexible than rule-based machine translation.
• Disadvantage: Struggled with long sentences and word-order issues; required large amounts of training data.
Neural Sequence-to-Sequence Learning: A Breakthrough
Sequence to Sequence Learning (Sutskever et al., 2014)

• Proposed an end-to-end neural network approach for Seq2Seq tasks with minimal assumptions.
• Used multilayered LSTMs: one to encode the input into a fixed-dimensional vector and another to decode it.
• Achieved a BLEU score of 34.8 on the WMT'14 English-French dataset, outperforming Statistical Machine Translation (33.3).

BLEU (Bilingual Evaluation Understudy) score: a metric to evaluate machine-generated translations by comparing them
with human translations.
Higher BLEU score = better translation quality.
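To make the metric concrete, here is a minimal sketch of how a sentence-level BLEU score might be computed with NLTK; the library choice and the toy sentences are assumptions for illustration, not the WMT'14 setup.

# Minimal sentence-level BLEU sketch (illustrative only).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["j'aime", "les", "chats"]]   # human translation(s), tokenized
hypothesis = ["j'aime", "les", "chats"]    # machine translation, tokenized

# Smoothing avoids zero scores when higher-order n-grams are missing in short sentences.
score = sentence_bleu(reference, hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.2f}")  # higher = closer to the human reference(s)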
What is the Encoder-Decoder Model?
• In the Seq2Seq model, the encoder-decoder architecture converts input sequences into output sequences.
• According to Daniel Jurafsky & James H. Martin in their book “Speech and Language Processing: An
Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition”:
The key idea underlying these networks is the use of an encoder network that takes an input sequence and creates a contextualized representation of it, often called the context. This representation is then passed to a decoder which generates a task-specific output sequence.

Fig 1: A basic encoder-decoder model


LSTMs: A Foundation for Encoder-Decoder Models
RNNs were among the first neural models used to process sequential data.

• Problem with RNNs → They forget long-term context due to the vanishing gradient problem.
• LSTM Solution → Learns what to remember and what to forget, handling longer sequences.
• Memory Paths in LSTM:
  • Long-Term Memory → Stores important past information.
  • Short-Term Memory → Keeps recent context.

Why LSTM for Encoder-Decoder Models?

• Encoders need LSTMs to store and pass on meaningful context.
• Decoders use this context for better predictions.
LSTM Architecture: Memory Cells & Gates
• LSTMs use a memory cell to store important information over long sequences, addressing the vanishing gradient problem in RNNs.
• Three gates regulate information flow:
  • Input Gate → Decides what new information to store in the memory cell.
  • Forget Gate → Removes irrelevant or outdated information.
  • Output Gate → Determines what processed information is sent as output.

How does it help in Encoder-Decoder Models?


• Ensures that relevant context is retained across long sequences.
• Helps the encoder store useful information for the decoder to generate accurate outputs.
LSTM Activation Functions: Sigmoid & Tanh
LSTM uses two key activation functions to regulate information flow in the memory cell.

Sigmoid Activation Function (σ)

• Maps input values between 0 and 1.
• Helps decide what to forget (0) or keep (1).
• Formula: σ(x) = 1 / (1 + e⁻ˣ)

Hyperbolic Tangent Function (Tanh)

• Maps input values between -1 and 1.
• Decides how much information should be added to or removed from the memory cell.
• Formula: tanh(x) = (eˣ − e⁻ˣ) / (eˣ + e⁻ˣ)
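As a quick illustration, a minimal NumPy sketch of these two activations; the function and variable names are my own, not from the slides.

import numpy as np

def sigmoid(x):
    # Squashes any real value into the range (0, 1): 0 ≈ forget, 1 ≈ keep.
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Squashes any real value into the range (-1, 1): a signed "how much" to add or remove.
    return np.tanh(x)

print(sigmoid(np.array([-2.0, 0.0, 2.0])))  # ≈ [0.119, 0.5, 0.881]
print(tanh(np.array([-2.0, 0.0, 2.0])))     # ≈ [-0.964, 0.0, 0.964]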
Understanding LSTM: A Step-by-Step Example
Consider some data where the X-axis represents the day the value was recorded, and the Y-axis represents the stock value.

Our goal is to remember the stock value on Day 1 to accurately predict the value on Day 5.
Understanding LSTM: A Step-by-Step Example
At each step, the LSTM model goes through three key stages:

• Stage 1: Forget Gate – Determines what percentage of the long-term memory should be retained.
• Stage 2: Input Gate – Creates a potential long-term memory and decides how much of it should be added to the existing long-term memory.
• Stage 3: Output Gate – Updates the short-term memory by starting with the new long-term memory and determining how much of it should be passed on to the next step.
Encoder-Decoder Architecture
1. Encoder:
• Takes an input sequence (e.g. a sentence in English) and processes it using layers like RNNs, LSTMs, etc.
• Converts the input into a fixed-length representation called a context vector. This vector captures the meaning
of the entire input sequence.
2. Context Vector:
• This is the compressed form of the input sequence.
• Contains the ‘context’ or meaning of the input sequence.
3. Decoder:
• Takes the context vector and generates the output sequence one step at a time.
• At each step, it predicts the next token (word/character) using previous outputs and the context vector.
• Continues until it generates the full output sequence.

Fig 2: A basic encoder-decoder architecture


Step 1: Word Vector Generation (How Words Become Numbers)
Why do we need word vectors?
• Computers don't understand words directly, so we convert them into numerical vectors (embeddings).

What are word vectors (embeddings)?

• Dense numerical representations of words in a vector space.

How are word vectors formed?

• Word vectors (embeddings) are learned automatically by training a model on large text data. The idea is:
• Words that appear in similar contexts should have similar vectors.
• The model assigns each word a vector of numbers (e.g., 300-dimensional).
• These numbers are adjusted so that words with similar meanings end up closer together in the vector space.

Fig 3: Word embedding (Courtesy of Medium)
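For illustration, a minimal PyTorch sketch of an embedding lookup; the toy vocabulary, the word ids, and the 300-dimensional size are assumptions for this example, not the slides' setup.

import torch
import torch.nn as nn

# Toy vocabulary: each word gets an integer id (an assumption for this example).
vocab = {"<pad>": 0, "i": 1, "love": 2, "coding": 3, "cats": 4}

# Embedding table: one trainable 300-dimensional vector per word.
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=300)

# "I love coding" -> ids -> dense vectors of shape (3, 300).
ids = torch.tensor([vocab["i"], vocab["love"], vocab["coding"]])
vectors = embedding(ids)
print(vectors.shape)  # torch.Size([3, 300])

# During training, these vectors are adjusted so that words used in
# similar contexts end up close together in the vector space.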
Step 2: How Does the Encoder Work?
In an LSTM-based encoder, each word passes through an LSTM cell.
States in an LSTM-based encoder:
1. Hidden state (hₜ) – the working memory (short-term representation).
2. Cell state (cₜ) – the long-term memory storage.
Both the hidden state (hₜ) and the cell state (cₜ) are updated at each step as new words are processed.
How are the States Updated?
Step 1: Forget gate – forgets irrelevant information from the past:
fₜ = σ(W_f · [hₜ₋₁, xₜ] + b_f)
Step 2: Input gate and cell-state update – decides what new information to add:
iₜ = σ(W_i · [hₜ₋₁, xₜ] + b_i)
c̃ₜ = tanh(W_c · [hₜ₋₁, xₜ] + b_c)
cₜ = fₜ ⊙ cₜ₋₁ + iₜ ⊙ c̃ₜ
Step 3: Output gate – decides what part of the memory becomes the hidden state:
oₜ = σ(W_o · [hₜ₋₁, xₜ] + b_o)
hₜ = oₜ ⊙ tanh(cₜ)
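A minimal NumPy sketch of one LSTM cell step following the gate equations above; the dimensions, random weights, and helper names are assumptions for illustration.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    # Concatenate previous hidden state and current input: [h_{t-1}, x_t]
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W_f @ z + b_f)          # forget gate: what to keep from c_{t-1}
    i_t = sigmoid(W_i @ z + b_i)          # input gate: how much new info to add
    c_tilde = np.tanh(W_c @ z + b_c)      # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde    # updated long-term memory
    o_t = sigmoid(W_o @ z + b_o)          # output gate
    h_t = o_t * np.tanh(c_t)              # updated short-term memory (hidden state)
    return h_t, c_t

# Tiny example: 4-dimensional input, 3-dimensional hidden state, random weights.
rng = np.random.default_rng(0)
d_in, d_h = 4, 3
W = lambda: rng.normal(size=(d_h, d_h + d_in)) * 0.1
b = lambda: np.zeros(d_h)
h, c = np.zeros(d_h), np.zeros(d_h)
h, c = lstm_step(rng.normal(size=d_in), h, c, W(), W(), W(), W(), b(), b(), b(), b())
print(h, c)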
Step 3: How is the Context Vector Formed?
After processing all words, the final hidden state of the encoder becomes the context vector.

Example (Encoding "I love coding" into a context vector):

1. Start with an initial hidden state:
   h₀ = [0, 0, 0] (all zeros initially)
2. Process "I":
   h₁ = tanh(W_h · h₀ + W_x · x₁ + b)
   h₁ = [0.1, 0.5, -0.2]
3. Process "love":
   h₂ = tanh(W_h · h₁ + W_x · x₂ + b)
   h₂ = [0.3, 0.6, -0.1]
4. Process "coding":
   h₃ = tanh(W_h · h₂ + W_x · x₃ + b)
   h₃ = [0.2, 0.8, 0.1]
5. Final hidden state = context vector:
   context vector = [0.2, 0.8, 0.1]
This context vector summarizes the entire sentence.
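As a sketch, an LSTM encoder in PyTorch whose final hidden and cell states serve as the context vector; the class name, vocabulary, and dimensions are assumptions, not the slides' exact setup.

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)        # (batch, seq_len, embed_dim)
        outputs, (h_n, c_n) = self.lstm(embedded)   # h_n, c_n: (1, batch, hidden_dim)
        return h_n, c_n                             # final states = context vector

# "I love coding" as toy token ids (an assumption for this example).
vocab = {"<pad>": 0, "i": 1, "love": 2, "coding": 3}
encoder = Encoder(vocab_size=len(vocab))
ids = torch.tensor([[vocab["i"], vocab["love"], vocab["coding"]]])  # batch of 1
context_h, context_c = encoder(ids)
print(context_h.shape)  # torch.Size([1, 1, 64]) — the "summary" of the sentence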
Decoder:
• The decoder is the second part of the encoder-decoder model.
• It generates output sequences from the encoded information.
• Used in machine translation, text generation, and speech recognition.

How does a Decoder work?

1. The encoder's final state (context vector) is passed as the initial state of the decoder.
2. The first input to the decoder is typically a special token <SOS> (Start of Sentence).
3. Using the LSTM architecture, the first hidden state (hₜ) is generated.
4. Compute scores for each word in the vocabulary:
   sₜ = W_s · hₜ + b_s
   Suppose we have 5 words in our vocabulary.
5. Convert the scores to probabilities (softmax layer):

   Word     Score sₜ    Probability P(yₜ)
   Apple    2.1         0.32
   Cat      1.5         0.20
   Dog      1.8         0.24
   Pizza    0.5         0.10
   Run      0.3         0.08

6. The predicted word is used as input for the next time step.
7. Repeat until the End-of-Sentence (<EOS>) token is generated.
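A minimal PyTorch sketch of one decoder step (score computation plus softmax); the 5-word vocabulary, the <SOS> id, and the dimensions are assumptions for illustration.

import torch
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)     # s_t = W_s · h_t + b_s

    def step(self, token_id, state):
        embedded = self.embedding(token_id)              # (batch, 1, embed_dim)
        output, state = self.lstm(embedded, state)       # state initialized from the encoder
        scores = self.out(output.squeeze(1))             # (batch, vocab_size)
        probs = torch.softmax(scores, dim=-1)            # softmax layer
        next_token = probs.argmax(dim=-1, keepdim=True)  # greedy choice of the next word
        return next_token, probs, state

# One step: start from <SOS> (id 0 here, an assumption) and the encoder's context.
decoder = Decoder(vocab_size=5)
sos = torch.tensor([[0]])
state = (torch.zeros(1, 1, 64), torch.zeros(1, 1, 64))  # stand-in for (context_h, context_c)
next_token, probs, state = decoder.step(sos, state)
print(probs)  # probabilities over the 5-word vocabulary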
Training vs Testing - Teacher Forcing Rule
During Training:
• Instead of using the predicted word, we force the correct word from the training data as input.
• This helps the model learn faster and prevents it from getting stuck in errors.
Mathematically, in training:
xₜ₊₁ = Embedding(correct word from the dataset)

During Testing:
• The decoder uses its own predicted words as inputs.
• Errors accumulate if a wrong word is predicted, leading to poor output quality.
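A minimal sketch contrasting the two modes, reusing the hypothetical Decoder from the previous sketch; the targets tensor, token ids, and function name are assumptions for illustration.

import torch

def decode_sequence(decoder, context_state, targets=None, max_len=10, sos_id=0, eos_id=1):
    # targets is not None -> training with teacher forcing (feed the correct word).
    # targets is None     -> inference (feed back the decoder's own prediction).
    token = torch.tensor([[sos_id]])
    state = context_state
    outputs = []
    for t in range(max_len):
        predicted, probs, state = decoder.step(token, state)
        outputs.append(probs)
        if targets is not None:
            token = targets[:, t:t + 1]   # teacher forcing: use the ground-truth word
        else:
            token = predicted             # inference: use the model's own prediction
            if token.item() == eos_id:    # stop once <EOS> is produced
                break
    return outputs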
Advantages and Disadvantages of LSTM-based Encoder-Decoder

ADVANTAGES:
• Better than regular RNNs.
• Can work with different input and output lengths.
• The encoder compresses the input into a fixed-size context vector, which acts as a summary of the sentence.

DISADVANTAGES:
• Fixed-size context vector – loss of information.
• Slow for long sequences.
• High memory usage.

How do we overcome the challenges?


• Attention (as an add-on to the existing LSTM architecture)
Paper: "Neural Machine Translation by Jointly Learning to Align and Translate" (Bahdanau et al., 2015)
• The Transformer model (self-attention mechanism and positional encoding)
Paper: "Attention Is All You Need" (Vaswani et al., 2017)
Conclusion:
• LSTM-based Encoder-Decoder models were a major breakthrough in sequence-to-sequence tasks but struggled with long-term dependencies.
• The Attention Mechanism improved this by dynamically focusing on relevant parts of the input sequence.
• Transformers (introduced in "Attention Is All You Need") completely replaced LSTMs, enabling parallel processing and greater efficiency.
• Modern NLP models, including GPT and BERT, are built on these advancements, making deep learning-based language understanding more powerful than ever.
THANK YOU
