Transformer Architectures for NLP

The document provides an overview of transformer architectures used in NLP, detailing the evolution of large language models like BERT, RoBERTa, XLNet, DistilBERT, and the GPT family. It explains the key features and improvements of each model, including their training processes and architectural differences. The summary highlights the advancements in NLP capabilities and the scale differences among the GPT models.


Transformer Architectures for NLP

Understanding the large transformer-based modern architectures used for NLP tasks

Agenda

• The Evolution of Large Language Models
• The BERT Family (BERT, RoBERTa, XLNet and DistilBERT)
• The GPT Family

The Evolution of Large Language Models
• 2018: The BERT Transformer. The first popular modern Transformer architecture and a favorite among practitioners.
• 2018-2019: RoBERTa, XLNet & DistilBERT. Popular BERT variations which improved on BERT's training data, procedure, performance and computational cost.
• 2018-2021: The GPT Family. OpenAI's autoregressive Decoder-only models (GPT, GPT-2, GPT-3).
• 2023+: ChatGPT & Beyond. Generative Transformer models applied to Conversational AI, with increasingly exhaustive text corpora for training.
BERT - History & Context
Bidirectional Encoder Representations from Transformers

Although the original Transformer paper had come out in 2017, it wasn’t until the release of BERT
in 2018 that the industry truly started to take notice of Transformer architectures for NLP.

BERT was released by a team from Google, and has become the de-facto industry baseline in NLP due to its consistently good performance across all categories of NLP tasks.

BERT's training essentially consists of two stages: a Pre-training Stage and a Fine-tuning Stage.

BERT - Pre-training + Fine-tuning
The Pre-training of BERT was done using an Encoder-Decoder architecture on Masked Language Modeling (MLM). Then, only the Encoder blocks are taken for Fine-tuning on Next Sentence Prediction (NSP). The fine-tuned Encoders are what eventually become BERT.

Diagram: Masked Language Modelling uses an Encoder-Decoder stack to reconstruct 'I LOVE ICE CREAM IN SUMMER' from the masked input 'I LOVE ICE [MASK] IN [MASK]'. Next Sentence Prediction uses an Encoder stack topped with a single classification neuron that performs binary classification on a pair of inputs (Sentence 1, Sentence 2).
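To make the Masked Language Modeling objective concrete, here is a minimal sketch in Python, assuming the Hugging Face transformers library and its pre-trained bert-base-uncased checkpoint are available; it asks BERT to fill in a masked token in a sentence similar to the one in the diagram.

from transformers import pipeline

# Wrap a pre-trained BERT checkpoint in a fill-mask pipeline.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the most likely tokens for the [MASK] position,
# using context from both the left and the right of the mask.
for prediction in unmasker("I love ice [MASK] in summer."):
    print(prediction["token_str"], round(prediction["score"], 3))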

BERT - Variants

The BERT architecture has 2 variants, both of which are Encoder-only architectures.

1. BERT-Base has 12 Encoder blocks, a hidden dimension of 768, 110M parameters and 12 Attention heads.
2. BERT-Large has 24 Encoder blocks, a hidden dimension of 1024, 340M parameters and 16 Attention heads.

• BERT's pre-training allows it to learn high-quality latent representations of words and sentences, including context.
• BERT can be further fine-tuned with fewer computational resources on smaller datasets specific to various NLP tasks.
• The contextualized embeddings BERT returns mean that it will return different embeddings even for the same word appearing in different contexts (see the sketch below).
• The success of BERT has made it a staple of Google's NLP operations on its Search Engine in multiple languages.
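As a rough illustration of these contextualized embeddings, the sketch below (assuming the Hugging Face transformers library and PyTorch; the example sentences are made up) extracts BERT's hidden state for the word "bank" in two different sentences and compares them; the similarity is well below 1.0 because the contexts differ.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embedding_of(sentence, word):
    # Run the sentence through BERT's Encoder stack.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
    # Pick the hidden state at the position of the target word.
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

river = embedding_of("he sat by the bank of the river", "bank")
money = embedding_of("she deposited cash at the bank", "bank")
print(torch.cosine_similarity(river, money, dim=0))  # same word, different embeddings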
From BERT to RoBERTa
Robustly Optimized BERT pre-training Approach

The RoBERTa model, released by a Facebook AI research team in 2019, was essentially a modification of key hyperparameters in the way BERT was trained.

RoBERTa has the exact same architecture as BERT, but the Pre-training Approach was changed to Randomly Masked Language Modeling (commonly referred to as dynamic masking), where the choice of words being masked is randomly changed in each epoch of training.

In addition, RoBERTa does not use Next Sentence Prediction, as the authors postulated that the procedure does not add much to the quality of the learned representations.

Despite just these minor tweaks (together with a much bigger training dataset), RoBERTa was able to significantly outperform BERT on several NLP benchmarks, and is one of the state-of-the-art NLP models available today.
RoBERTa - Training Process

The following is an example of this process for RoBERTa:

Diagram: Randomly Masked Language Modelling. The same sentence 'I LOVE ICE CREAM IN SUMMER' is passed through the Encoder and Decoder stacks in every epoch, but a different random set of tokens is masked each time: 'I LOVE ICE [MASK] IN [MASK]' in Epoch 1, 'I LOVE [MASK] CREAM IN SUMMER' in Epoch 2, and 'I [MASK] ICE CREAM [MASK] SUMMER' in Epoch 3.
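A minimal sketch of this random masking idea in plain Python (illustrative only, not the exact RoBERTa implementation or its masking-rate details): a fresh random set of positions is masked every epoch, so the model sees different masks for the same sentence over time.

import random

def randomly_mask(tokens, mask_fraction=0.15):
    # Choose a fresh random set of positions on every call,
    # so each epoch masks different tokens of the same sentence.
    n_masked = max(1, int(len(tokens) * mask_fraction))
    positions = set(random.sample(range(len(tokens)), n_masked))
    return [tok if i not in positions else "[MASK]" for i, tok in enumerate(tokens)]

sentence = "I LOVE ICE CREAM IN SUMMER".split()
for epoch in range(3):
    print(f"Epoch {epoch + 1}:", " ".join(randomly_mask(sentence)))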
XLNet - Another Alternative

XLNet, another variation on BERT which was released around the same time as RoBERTa in 2019,
uses a different Pre-training technique to improve on BERT’s Masked Language Modeling.

XLNet again uses a very similar architecture to BERT. But the XLNet authors believed that predicting all the masked tokens with the Decoder in one go neglects the sequential dependencies between tokens that exist in Natural Language.

So XLNet proposes a different idea: Permutation Language Modeling.

Rather than predicting both "Ice" and "Cream" in the same run, for instance, the model predicts one masked token in one run (say "Ice"), then uses this prediction to predict "Cream" in the next run.

And just like with RoBERTa, this tweak leads to a substantial improvement over BERT on several NLP tasks, often by a large margin.
XLNet - Training Process

Here’s the XLNet training process illustrated with the same example:

Diagram: Permutation Language Modelling on the sentence 'I LOVE ICE CREAM IN SUMMER'. In Epoch 1, the input 'I LOVE [MASK] [MASK] IN SUMMER' is passed through the Encoder and Decoder stacks and only one masked token is predicted, producing 'I LOVE [MASK] CREAM IN SUMMER'. In Epoch 2, this output is used as the new input and the remaining masked token is predicted, recovering 'I LOVE ICE CREAM IN SUMMER'.

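To make the permutation idea concrete, here is a toy sketch in plain Python (predict_token is a hypothetical stand-in for the model, not a real XLNet API): the masked positions are visited in a random order, and each prediction is written back into the sequence before the next one is made, so later predictions can condition on earlier ones.

import random

def permutation_fill(tokens, masked_positions, predict_token):
    # Visit the masked positions in a random order (the "permutation").
    order = list(masked_positions)
    random.shuffle(order)
    for pos in order:
        # Each prediction conditions on everything filled in so far,
        # including tokens predicted in earlier steps of the permutation.
        tokens[pos] = predict_token(tokens, pos)
    return tokens

# Example with the sentence from the diagram; the fake model just looks up
# the answer so the control flow is easy to follow.
answers = {2: "ICE", 3: "CREAM"}
sentence = "I LOVE [MASK] [MASK] IN SUMMER".split()
print(" ".join(permutation_fill(sentence, [2, 3], lambda toks, pos: answers[pos])))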
DistilBERT - The Idea of Knowledge Distillation

DistilBERT goes in a more important direction, applying an idea to improve BERT's computational cost that is generalizable to any Deep Learning model: Knowledge Distillation.

In DistilBERT, a simpler student model is simply trained to replicate the BERT teacher's performance through the mechanism described below.

Diagram: Knowledge Distillation. The same input is fed to both a complex Teacher architecture (producing the target output) and a simpler Student architecture (producing the predicted output). A Distillation Loss compares the two outputs, and backpropagation is used to modify the student's weights.
DistilBERT - The Idea of Knowledge Distillation

This simple idea, also called Teacher-Student Training, allows us to reduce the sometimes
unnecessary size and complexity of Large Language Models like BERT, and is a boon to
resource-constrained settings such as individual laptops, mobile phones and other edge devices
that still need to deploy heavy Deep Learning applications.

DistilBERT was shown to reduce the size of BERT by 40%, improve inference speed by 60%, and yet still retain 97% of BERT's Natural Language Understanding abilities.

Training a Student Model on the output of a Teacher Model in this manner forces it to become more efficient, allowing us to build a much simpler model that can adapt its parameters to perform nearly as well as the more complex model.

This idea can of course be applied with any architecture as the Teacher, or even applied recursively to continually create simpler models.
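As a rough sketch of how the Distillation Loss in the diagram can be implemented (assuming PyTorch; the temperature and weighting values are illustrative, not DistilBERT's exact recipe), the student is penalised both for diverging from the teacher's softened output distribution and for missing the ground-truth labels.

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: match the teacher's softened output distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

A higher temperature T softens the teacher's distribution, so the student also learns from the relative probabilities the teacher assigns to incorrect classes, which is where much of the teacher's "knowledge" lives.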
The GPT Family - Decoder-only Models
Generative Pre-trained Transformer

OpenAI, on the other hand, went down a different path in the pursuit of creating Large Language
Models (LLMs) that are capable of generating high-quality text - Decoder-only Architectures.

Decoder-only models generate Natural Language text in an autoregressive fashion (the output of
one timestep is the input to the next) based on a prompt.
The difference from the earlier Encoder-Decoder style architecture is that in a Decoder-only
model, the initial input prompt is also used as input into the Decoder.

This is the key architectural innovation behind the GPT series of models: GPT (2018), GPT-2 (2019) and GPT-3 (2020).

Beyond this, the GPT models have shown emergent improvements in their understanding of Natural Language simply as the size of their training dataset has increased.
The GPT Family - Training Process
The following is an example of how the GPT models (GPT, GPT-2 & GPT-3) achieve this
autoregressive text generation by utilizing the input prompt text as well.

Diagram: Autoregressive generation for the prompt 'What do we Like in Summer?', with the target completion 'We love Ice Cream'. In the first timestep, the Decoder stack receives 'What do we Like in Summer ? [MASK] [MASK] [MASK] [MASK]' and predicts the first token, producing 'What do we Like in Summer ? We [MASK] [MASK] [MASK]'. In the second timestep, this extended sequence is fed back in and the next token is predicted, producing 'What do we Like in Summer ? We Love [MASK] [MASK]', and so on until the full completion is generated.

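The same autoregressive loop can be sketched in a few lines of Python (assuming the Hugging Face transformers library, PyTorch and the public gpt2 checkpoint; greedy decoding is used purely for simplicity): each step picks the next token from the model's output and appends it to the prompt, which then becomes the input for the next timestep.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_ids = tokenizer.encode("What do we like in summer? We", return_tensors="pt")

for _ in range(5):  # generate 5 tokens, one timestep at a time
    with torch.no_grad():
        logits = model(input_ids).logits
    # Greedily pick the most likely next token at the last position...
    next_id = torch.argmax(logits[:, -1, :], dim=-1, keepdim=True)
    # ...and append it so it becomes part of the input at the next timestep.
    input_ids = torch.cat([input_ids, next_id], dim=-1)

print(tokenizer.decode(input_ids[0]))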
The GPT Family - A Comparison
                     GPT            GPT-2          GPT-3
Parameters           117 Million    1.5 Billion    175 Billion
Decoder Blocks       12             48             96
Context Token Size   512            1024           2048
Hidden Dimension     768            1600           12288
Batch Size           64             512            3.2 Million
Summary
To summarize:
1. We gained an overview of the immediate evolution of Large Language Models in the first few years after the release of the Transformer architecture.
2. We understood the key ideas behind BERT, and how variants such as RoBERTa, XLNet and DistilBERT improved on it in different ways.
3. We also gained an overview of the idea behind the GPT family of LLMs from OpenAI, and how they utilize a Decoder-only architecture for generating text.
4. Finally, we saw a quick comparison of the difference in scale between GPT, GPT-2 and GPT-3, the main point of difference between these iterations of the GPT family.

