Transformer Architectures for NLP

The document provides an overview of transformer architectures used in NLP, detailing the evolution of large language models like BERT, RoBERTa, XLNet, DistilBERT, and the GPT family. It explains the key features and improvements of each model, including their training processes and architectural differences. The summary highlights the advancements in NLP capabilities and the scale differences among the GPT models.


Transformer Architectures for NLP

Understanding the large transformer-based modern architectures used for NLP tasks

Agenda

• The Evolution of Large Language Models
• The BERT Family (BERT, RoBERTa, XLNet and DistilBERT)
• The GPT Family

The Evolution of Large Language Models
• 2018: The BERT Transformer. The first popular modern Transformer architecture and a favorite among practitioners.
• 2018-2019: RoBERTa, XLNet & DistilBERT. Popular BERT variations which improved on BERT's training data, procedure, performance and computational cost.
• 2018-2021: The GPT Family. OpenAI's autoregressive Decoder-only models (GPT, GPT-2, GPT-3).
• 2023+: ChatGPT & Beyond. Generative Transformer models applied to Conversational AI, with increasingly exhaustive text corpora for training.
BERT - History & Context
Bidirectional Encoder Representations from Transformers

Although the original Transformer paper had come out in 2017, it wasn’t until the release of BERT
in 2018 that the industry truly started to take notice of Transformer architectures for NLP.

BERT was released by a team from Google, and has become the de-facto industry baseline in NLP due to its consistently good performance across all categories of NLP tasks.

BERT's training essentially consists of two stages: a Pre-training Stage and a Fine-tuning Stage.

BERT - Pre-training + Fine-tuning
The Pre-training of BERT was done using an Encoder-Decoder architecture on Masked Language Modeling (MLM). Then, only the Encoder blocks are taken for Fine-tuning on Next Sentence Prediction (NSP). The fine-tuned Encoders are what eventually become BERT.

Diagram: Masked Language Modelling uses an Encoder-Decoder stack to reconstruct 'I LOVE ICE CREAM IN SUMMER' from the masked input 'I LOVE ICE [MASK] IN [MASK]'. Next Sentence Prediction uses an Encoder stack topped with a single classification neuron that performs binary classification on a pair of inputs (Sentence 1, Sentence 2).
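To make the Masked Language Modeling objective concrete, here is a minimal sketch in Python, assuming the Hugging Face transformers library and its pre-trained bert-base-uncased checkpoint are available; it asks BERT to fill in a masked token in a sentence similar to the one in the diagram.

from transformers import pipeline

# Wrap a pre-trained BERT checkpoint in a fill-mask pipeline.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the most likely tokens for the [MASK] position,
# using context from both the left and the right of the mask.
for prediction in unmasker("I love ice [MASK] in summer."):
    print(prediction["token_str"], round(prediction["score"], 3))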

BERT - Variants

The BERT architecture has 2 variants, both of which are Encoder-only architectures.

1. BERT-Base has 12 Encoder blocks, a hidden dimension of 768, 110M parameters and 12 Attention heads.
2. BERT-Large has 24 Encoder blocks, a hidden dimension of 1024, 340M parameters and 16 Attention heads.

• BERT's pre-training allows it to learn high-quality latent representations of words and sentences, including context.
• BERT can be further fine-tuned with fewer computational resources on smaller datasets specific to various NLP tasks.
• The contextualized embeddings BERT returns mean that it will return different embeddings even for the same word appearing in different contexts (see the sketch below).
• The success of BERT has made it a staple of Google's NLP operations on its Search Engine in multiple languages.
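As a rough illustration of these contextualized embeddings, the sketch below (assuming the Hugging Face transformers library and PyTorch; the example sentences are made up) extracts BERT's hidden state for the word "bank" in two different sentences and compares them; the similarity is well below 1.0 because the contexts differ.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embedding_of(sentence, word):
    # Run the sentence through BERT's Encoder stack.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
    # Pick the hidden state at the position of the target word.
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

river = embedding_of("he sat by the bank of the river", "bank")
money = embedding_of("she deposited cash at the bank", "bank")
print(torch.cosine_similarity(river, money, dim=0))  # same word, different embeddings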
From BERT to RoBERTa
Robustly Optimized BERT pre-training Approach

The RoBERTa model, released by a Facebook AI research team in 2019, was essentially a modification of key hyperparameters in the way BERT was trained.

RoBERTa has the exact same architecture as BERT, but the Pre-training Approach was changed to Randomly Masked Language Modeling (commonly referred to as dynamic masking), where the choice of words being masked is randomly changed in each epoch of training.

In addition, RoBERTa does not use Next Sentence Prediction, as the authors postulated that the procedure does not add much to the quality of the learned representations.

Despite just these minor tweaks (together with a much bigger training dataset), RoBERTa was able to significantly outperform BERT on several NLP benchmarks, and is one of the state-of-the-art NLP models available today.
RoBERTa - Training Process

The following is an example of this process for RoBERTa:

Diagram: Randomly Masked Language Modelling. The same sentence 'I LOVE ICE CREAM IN SUMMER' is passed through the Encoder and Decoder stacks in every epoch, but a different random set of tokens is masked each time: 'I LOVE ICE [MASK] IN [MASK]' in Epoch 1, 'I LOVE [MASK] CREAM IN SUMMER' in Epoch 2, and 'I [MASK] ICE CREAM [MASK] SUMMER' in Epoch 3.
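A minimal sketch of this random masking idea in plain Python (illustrative only, not the exact RoBERTa implementation or its masking-rate details): a fresh random set of positions is masked every epoch, so the model sees different masks for the same sentence over time.

import random

def randomly_mask(tokens, mask_fraction=0.15):
    # Choose a fresh random set of positions on every call,
    # so each epoch masks different tokens of the same sentence.
    n_masked = max(1, int(len(tokens) * mask_fraction))
    positions = set(random.sample(range(len(tokens)), n_masked))
    return [tok if i not in positions else "[MASK]" for i, tok in enumerate(tokens)]

sentence = "I LOVE ICE CREAM IN SUMMER".split()
for epoch in range(3):
    print(f"Epoch {epoch + 1}:", " ".join(randomly_mask(sentence)))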
XLNet - Another Alternative

XLNet, another variation on BERT which was released around the same time as RoBERTa in 2019,
uses a different Pre-training technique to improve on BERT’s Masked Language Modeling.

XLNet again uses a very similar architecture to BERT. But the XLNet authors believed that predicting all the masked tokens with the Decoder in one go neglects the sequential dependencies between tokens that exist in Natural Language.

So XLNet proposes a different idea: Permutation Language Modeling.

Rather than predicting both "Ice" and "Cream" in the same run, for instance, the model predicts one masked token in one run (say "Ice"), then uses this prediction to predict "Cream" in the next run.

And just like with RoBERTa, this tweak leads to a substantial improvement over BERT on several NLP tasks, often by a large margin.
XLNet - Training Process

Here’s the XLNet training process illustrated with the same example:

Diagram: Permutation Language Modelling on the sentence 'I LOVE ICE CREAM IN SUMMER'. In Epoch 1, the input 'I LOVE [MASK] [MASK] IN SUMMER' is passed through the Encoder and Decoder stacks and only one masked token is predicted, producing 'I LOVE [MASK] CREAM IN SUMMER'. In Epoch 2, this output is used as the new input and the remaining masked token is predicted, recovering 'I LOVE ICE CREAM IN SUMMER'.

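To make the permutation idea concrete, here is a toy sketch in plain Python (predict_token is a hypothetical stand-in for the model, not a real XLNet API): the masked positions are visited in a random order, and each prediction is written back into the sequence before the next one is made, so later predictions can condition on earlier ones.

import random

def permutation_fill(tokens, masked_positions, predict_token):
    # Visit the masked positions in a random order (the "permutation").
    order = list(masked_positions)
    random.shuffle(order)
    for pos in order:
        # Each prediction conditions on everything filled in so far,
        # including tokens predicted in earlier steps of the permutation.
        tokens[pos] = predict_token(tokens, pos)
    return tokens

# Example with the sentence from the diagram; the fake model just looks up
# the answer so the control flow is easy to follow.
answers = {2: "ICE", 3: "CREAM"}
sentence = "I LOVE [MASK] [MASK] IN SUMMER".split()
print(" ".join(permutation_fill(sentence, [2, 3], lambda toks, pos: answers[pos])))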
DistilBERT - The Idea of Knowledge Distillation

DistilBERT goes in a more important direction, applying an idea to improve BERT's computational cost that is generalizable to any Deep Learning model: Knowledge Distillation.

In DistilBERT, a simpler student model is simply trained to replicate the BERT teacher's performance through the mechanism described below.

Diagram: Knowledge Distillation. The same input is fed to both a complex Teacher architecture (producing the target output) and a simpler Student architecture (producing the predicted output). A Distillation Loss compares the two outputs, and backpropagation is used to modify the student's weights.
DistilBERT - The Idea of Knowledge Distillation

This simple idea, also called Teacher-Student Training, allows us to reduce the sometimes
unnecessary size and complexity of Large Language Models like BERT, and is a boon to
resource-constrained settings such as individual laptops, mobile phones and other edge devices
that still need to deploy heavy Deep Learning applications.

DistilBERT was shown to reduce the size of BERT by 40%, improve inference speed by 60%, and yet still retain 97% of BERT's Natural Language Understanding abilities.

Training a Student Model on the output of a Teacher Model in this manner forces it to become more efficient, allowing us to build a much simpler model that can adapt its parameters to perform nearly as well as the more complex model.

This idea can of course be applied with any architecture as the Teacher, or even applied recursively to continually create simpler models.
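As a rough sketch of how the Distillation Loss in the diagram can be implemented (assuming PyTorch; the temperature and weighting values are illustrative, not DistilBERT's exact recipe), the student is penalised both for diverging from the teacher's softened output distribution and for missing the ground-truth labels.

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: match the teacher's softened output distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

A higher temperature T softens the teacher's distribution, so the student also learns from the relative probabilities the teacher assigns to incorrect classes, which is where much of the teacher's "knowledge" lives.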
The GPT Family - Decoder-only Models
Generative Pre-trained Transformer

OpenAI, on the other hand, went down a different path in the pursuit of creating Large Language
Models (LLMs) that are capable of generating high-quality text - Decoder-only Architectures.

Decoder-only models generate Natural Language text in an autoregressive fashion (the output of
one timestep is the input to the next) based on a prompt.
The difference from the earlier Encoder-Decoder style architecture is that in a Decoder-only
model, the initial input prompt is also used as input into the Decoder.

This is the key architectural innovation behind the GPT series of models: GPT (2018), GPT-2 (2019) and GPT-3 (2020).

Beyond this, the GPT models have shown emergent improvements in their understanding of Natural Language simply as the size of their training dataset has increased.
The GPT Family - Training Process
The following is an example of how the GPT models (GPT, GPT-2 & GPT-3) achieve this
autoregressive text generation by utilizing the input prompt text as well.

Diagram: Autoregressive generation for the prompt 'What do we Like in Summer?', with the target completion 'We love Ice Cream'. In the first timestep, the Decoder stack receives 'What do we Like in Summer ? [MASK] [MASK] [MASK] [MASK]' and predicts the first token, producing 'What do we Like in Summer ? We [MASK] [MASK] [MASK]'. In the second timestep, this extended sequence is fed back in and the next token is predicted, producing 'What do we Like in Summer ? We Love [MASK] [MASK]', and so on until the full completion is generated.

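The same autoregressive loop can be sketched in a few lines of Python (assuming the Hugging Face transformers library, PyTorch and the public gpt2 checkpoint; greedy decoding is used purely for simplicity): each step picks the next token from the model's output and appends it to the prompt, which then becomes the input for the next timestep.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_ids = tokenizer.encode("What do we like in summer? We", return_tensors="pt")

for _ in range(5):  # generate 5 tokens, one timestep at a time
    with torch.no_grad():
        logits = model(input_ids).logits
    # Greedily pick the most likely next token at the last position...
    next_id = torch.argmax(logits[:, -1, :], dim=-1, keepdim=True)
    # ...and append it so it becomes part of the input at the next timestep.
    input_ids = torch.cat([input_ids, next_id], dim=-1)

print(tokenizer.decode(input_ids[0]))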
The GPT Family - A Comparison
                     GPT            GPT-2          GPT-3
Parameters           117 Million    1.5 Billion    175 Billion
Decoder Blocks       12             48             96
Context Token Size   512            1024           2048
Hidden Dimension     768            1600           12288
Batch Size           64             512            3.2 Million
Summary
To summarize:
1. We gained an overview of the immediate evolution of Large Language Models in the first few years after the release of the Transformer architecture.
2. We understood the key ideas behind BERT, and how variants such as RoBERTa, XLNet and DistilBERT improved on it in different ways.
3. We also gained an overview of the idea behind the GPT family of LLMs from OpenAI, and how they utilize a Decoder-only architecture for generating text.
4. Finally, we saw a quick comparison of the difference in scale between GPT, GPT-2 and GPT-3, the main point of difference between these iterations of the GPT family.

