
Transformer in NLP

INSTRUCTOR NAME: SHUKDEV DATTA


ML DEVELOPER AT INNOVATIVE SKILLS
Transformer in NLP
Transformer models are a type of deep learning model used for natural language processing
(NLP) tasks. They can learn long-range dependencies between words in a sentence, which makes them
very powerful for tasks such as machine translation, text summarization, and question answering.

Transformer models work by first encoding the input sentence into a sequence of vectors. This encoding
is done using a self-attention mechanism, which allows the model to learn the relationships between the
words in the sentence.

Once the input sentence has been encoded, the model decodes it into a sequence of output tokens. This
decoding is also done using a self-attention mechanism.

The attention mechanism is what allows transformer models to learn long-range dependencies between
words in a sentence. The attention mechanism works by focusing on the most relevant words in the
input sentence when decoding the output tokens.
Encoding & Decoding
NLP transformer architecture
The transformer model is made up of two main components: an encoder and a decoder. The encoder takes
the input sentence as input and produces a sequence of vectors. The decoder then takes these vectors as
input and produces the output sentence.
Working Principle of the Transformer Architecture
The encoder consists of a stack of self-attention layers. Each self-attention layer takes a sequence of
vectors as input and produces a new sequence of vectors. The self-attention layer works by first
computing a score for each pair of words in the input sequence. The score for a pair of words is a measure
of how related the two words are. The self-attention layer then uses these scores to compute a weighted
sum of the input vectors. The weighted sum is the output of the self-attention layer.
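The scoring-and-weighted-sum procedure just described can be sketched in a few lines of NumPy. This is a deliberately simplified version: a real transformer first projects the inputs into separate query, key, and value matrices with learned weights, while here the input vectors are used directly for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    """Simplified self-attention: each row of X is one word's vector.
    Scores measure how related each pair of words is; the output is a
    score-weighted sum of the input vectors."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)       # pairwise relatedness scores
    weights = softmax(scores, axis=-1)  # one weight distribution per word
    return weights @ X                  # weighted sum of input vectors

X = np.random.randn(4, 8)   # a "sentence" of 4 words, 8-dim vectors
out = self_attention(X)
print(out.shape)            # one new vector per word: (4, 8)
```

The division by the square root of the vector dimension is the "scaled" part of scaled dot-product attention; it keeps the scores from growing with dimensionality and saturating the softmax.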
The decoder also consists of a stack of self-attention layers, together with encoder-decoder
(cross-)attention layers and feed-forward layers. The self-attention layers work the same way as in
the encoder, while the cross-attention layers attend to the encoder's output vectors. The decoder
produces the output tokens one at a time, each new token conditioned on the tokens generated so far.
The output tokens are the words in the output sentence.

The attention mechanism is what allows the transformer model to learn long-range dependencies
between words in a sentence. The attention mechanism works by focusing on the most relevant words in
the input sentence when decoding the output tokens.

For example, let’s say we want to translate the sentence “I love you” from English to Spanish. The
transformer model would first encode the sentence into a sequence of vectors. Then, the model would
decode the vectors into a sequence of Spanish words. The attention mechanism would allow the model
to focus on the words “I” and “you” in the English sentence when decoding the Spanish words “te amo”.
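The token-by-token decoding described above can be sketched as a simple loop. Here `decoder_step` is a hypothetical stand-in for the full decoder (in reality it would run the attention layers over the encoder output and the tokens generated so far); the toy version below simply returns a canned Spanish translation to show the control flow.

```python
def greedy_decode(encoder_vectors, decoder_step,
                  start_token="<s>", end_token="</s>", max_len=20):
    """Autoregressive decoding sketch: repeatedly ask the decoder for the
    next token, conditioned on the encoder output and what has been
    generated so far, until it emits an end-of-sentence token."""
    generated = [start_token]
    for _ in range(max_len):
        next_tok = decoder_step(encoder_vectors, generated)
        if next_tok == end_token:
            break
        generated.append(next_tok)
    return generated[1:]  # drop the start token

# Toy stand-in for the decoder: returns a canned translation token by token.
canned = ["te", "amo", "</s>"]
def toy_step(enc, generated):
    return canned[len(generated) - 1]

print(greedy_decode(None, toy_step))  # ['te', 'amo']
```

This greedy loop always picks the decoder's single proposed token; real systems often use beam search instead, keeping several candidate translations alive at once.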
Encoding in Detail
Decoding in Detail
Encoder-only models
Decoder-only models
Differences
What are transformer models
built of
Embedding layer: The embedding layer converts the input text into a sequence of vectors. The vectors represent the
meaning of the words in the text.

Self-attention layers: The self-attention layers allow the model to learn long-range dependencies between words in
a sentence. The self-attention layers work by computing a score for each pair of words in the sentence. The score for
a pair of words is a measure of how related the two words are. The self-attention layers then use these scores to
compute a weighted sum of the input vectors. The weighted sum is the output of the self-attention layer.

Positional encoding: The positional encoding layer adds information about the position of each word in the
sentence. This is important for learning long-range dependencies, as it allows the model to know which words are
close to each other in the sentence.

Decoder: The decoder takes the output of the self-attention layers as input and produces a sequence of output
tokens. The output tokens are the words in the output sentence.
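The embedding layer from the first bullet is, at its core, a lookup table: each word id selects one row of a learned matrix. A minimal sketch (the vocabulary and table here are made-up placeholders; in a real model the table's values are learned during training):

```python
import numpy as np

vocab = {"i": 0, "love": 1, "you": 2}   # toy vocabulary
d_model = 8                              # embedding dimension
# In a real model these vectors are learned; here they are random.
embedding_table = np.random.randn(len(vocab), d_model)

def embed(words):
    """Embedding lookup: each word's id selects a row (its vector)."""
    ids = [vocab[w] for w in words]
    return embedding_table[ids]

vectors = embed(["i", "love", "you"])
print(vectors.shape)  # (3, 8): one vector per input word
```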
Positional Encoding
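One common choice, used in the original "Attention Is All You Need" paper, is sinusoidal positional encoding: each position gets a fixed vector of sines and cosines at different frequencies, which is added to the word embeddings. A minimal sketch:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding:
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    Assumes d_model is even."""
    pos = np.arange(seq_len)[:, None]          # positions 0..seq_len-1
    two_i = np.arange(0, d_model, 2)[None, :]  # even dimension indices
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dims get sines
    pe[:, 1::2] = np.cos(angles)  # odd dims get cosines
    return pe

pe = positional_encoding(10, 16)
print(pe.shape)  # (10, 16): one encoding vector per position
```

Because the encodings are deterministic functions of position, the model can in principle generalize to sequence lengths not seen during training; many later models instead use learned position embeddings.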
Training techniques of
Transformer models
Masked language modeling: Masked language modeling is a technique used to train transformer
models to predict missing words in a sentence. This helps the model learn to attend to the most
relevant context words in a sentence.

Attention masking: Attention masking is a technique used to prevent the model from attending to
future words in a sentence. This is important because, at inference time, the model must predict
each word using only the words that come before it.

Gradient clipping: Gradient clipping is a technique used to prevent the gradients from becoming too
large ("exploding gradients"). This helps to stabilize the training process.
Masked Language Modeling
Masked language modeling is a technique used to train transformer models, like BERT, to predict the
missing words in a sentence. The idea is to randomly mask (replace) some of the words in the input
sentence with a special token, such as [MASK], and then train the model to predict what the original
words were. This helps the model to learn to attend to the most relevant words in a sentence, as it has
to figure out which words are missing and what they might be based on the context of the surrounding
words. This is useful for tasks like language understanding, where the model needs to be able to
understand and generate human-like text. Masked language modeling is a key component of the
pretraining process for transformer models, as it helps the model to learn the relationships between
words and how to generate text that is coherent and grammatically correct.
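The masking step itself can be sketched as follows. This is a simplification: BERT actually masks about 15% of tokens, and of those, some are kept unchanged or replaced with a random word rather than `[MASK]`. The `mask_tokens` helper below is illustrative, not a library function, and the demo uses a high masking rate so the effect is visible on a short sentence.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Randomly replace a fraction of tokens with [MASK], recording the
    original words as the prediction targets for training."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok          # remember what the model must predict
            masked.append(mask_token)
        else:
            masked.append(tok)
    return masked, targets

tokens = "the cat sat on the mat".split()
# High mask_prob purely for demonstration on a 6-word sentence.
masked, targets = mask_tokens(tokens, mask_prob=0.5)
print(masked)
print(targets)
```

During training, the model's loss is computed only at the masked positions: it must reconstruct each hidden word from the surrounding context.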
Attention Masking
Attention masking is a technique used in transformer decoders, such as GPT-style models, to prevent
the model from attending to future words in a sentence during training. The idea is to mask out the
attention scores for any word that comes after the current word in the input sequence, so that the
softmax assigns those positions zero weight. This is important because at inference time the model
generates text left to right and must predict each word using only the words before it; if it could
peek at future words during training, the prediction task would become trivial and the model would
learn nothing useful. By masking the attention scores for future words, the model is forced to
attend only to the words that come before the current position. Attention masking (also called
causal masking) is a key component of training autoregressive transformer models, as it teaches
them to generate text that is coherent and grammatically correct one token at a time.
Attention Masking Example
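Concretely, a causal mask can be built by setting the scores for all future positions to negative infinity before the softmax, so they receive exactly zero attention weight. A minimal NumPy sketch:

```python
import numpy as np

def causal_mask(seq_len):
    """Upper-triangular mask: position i may attend only to positions <= i.
    Future positions get -inf, which softmax turns into zero weight."""
    future = np.triu(np.ones((seq_len, seq_len)), k=1)
    return np.where(future == 1, -np.inf, 0.0)

scores = np.random.randn(4, 4)            # raw attention scores
masked_scores = scores + causal_mask(4)   # future positions -> -inf
weights = np.exp(masked_scores)
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))  # lower-triangular: no weight on the future
```

The first row of `weights` is always `[1, 0, 0, 0]`: the first word can attend only to itself, the second word to the first two positions, and so on.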
Gradient Clipping
Gradient clipping is a technique used in machine learning to prevent the gradients of the model's
parameters from becoming too large during training. Gradients are used to update the model's
parameters in the direction that reduces the loss function, which measures how well the model is
performing on the training data. If the gradients become too large (the "exploding gradient"
problem), the model's parameters change too much at each step, which can make training unstable or
cause the loss to diverge. Gradient clipping stabilizes the training process by setting a threshold
for the maximum allowed gradient norm (or value). If the gradient exceeds this threshold, it is
scaled down so that it doesn't become too large. Gradient clipping is a common technique in deep
learning and is particularly useful when training deep neural networks with many layers, where
gradients can compound as they propagate backward.
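Clipping by global norm, the same idea implemented by PyTorch's `torch.nn.utils.clip_grad_norm_`, can be sketched in plain NumPy as follows:

```python
import numpy as np

def clip_by_norm(grads, max_norm):
    """If the combined (global) norm of all gradients exceeds max_norm,
    scale every gradient down so the global norm equals max_norm;
    otherwise leave the gradients unchanged."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

grads = [np.array([3.0, 4.0])]            # global norm = 5
clipped = clip_by_norm(grads, max_norm=1.0)
print(clipped[0])                          # [0.6 0.8], norm = 1
```

Scaling all gradients by the same factor preserves the update direction; only the step size is reduced, which is why norm-based clipping is usually preferred over clipping each value independently.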
Thank You!!!
