Transformer Architectures for NLP
Understanding the large, modern transformer-based architectures used for NLP tasks

Timeline: 2018 - The BERT Transformer | 2018-2021 - The GPT Family
Although the original Transformer paper had come out in 2017, it wasn’t until the release of BERT
in 2018 that the industry truly started to take notice of Transformer architectures for NLP.
BERT’s training essentially consists of two stages: a Pre-training Stage and a Fine-Tuning Stage.
The BERT architecture has 2 variants, both of which are Encoder-only architectures:
1. BERT-Base has 12 Encoder blocks, a hidden dimension of 768, 12 Attention heads and 110M parameters.
2. BERT-Large has 24 Encoder blocks, a hidden dimension of 1024, 16 Attention heads and 340M parameters.
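As a minimal sketch of what these two configurations look like in practice (assuming the Hugging Face transformers library, which is not part of the original slides), the variants can be instantiated and their parameter counts computed as follows:

```python
# Sketch of the two BERT variants described above (Hugging Face `transformers`
# assumed); the parameter counts are computed, not hard-coded.
from transformers import BertConfig, BertModel

base_config = BertConfig(hidden_size=768, num_hidden_layers=12,
                         num_attention_heads=12, intermediate_size=3072)
large_config = BertConfig(hidden_size=1024, num_hidden_layers=24,
                          num_attention_heads=16, intermediate_size=4096)

for name, config in [("BERT-Base", base_config), ("BERT-Large", large_config)]:
    model = BertModel(config)  # randomly initialized Encoder-only stack
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {config.num_hidden_layers} Encoder blocks, "
          f"hidden dimension {config.hidden_size}, ~{n_params / 1e6:.0f}M parameters")
```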
BERT’s pre-training allows it to learn high-quality latent representations of words and sentences, including context. BERT can then be further fine-tuned with fewer computational resources on smaller datasets specific to various NLP tasks.
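For illustration, here is a minimal fine-tuning sketch (assuming the Hugging Face transformers library; the checkpoint name and the binary sentiment task are only illustrative) that loads pre-trained BERT weights and attaches a small task-specific classification head:

```python
# Sketch of the Fine-Tuning Stage: start from pre-trained BERT weights and
# train briefly on a small, task-specific labeled dataset.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=2)  # e.g. sentiment

batch = tokenizer(["I love ice cream in summer"], return_tensors="pt")
labels = torch.tensor([1])               # illustrative positive label

outputs = model(**batch, labels=labels)  # pre-trained Encoder + new head
outputs.loss.backward()                  # one fine-tuning gradient step
```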
The RoBERTa model, released by a Facebook AI research team in 2019, was essentially a modification of key hyperparameters and choices in the way BERT was trained.
RoBERTa has the exact same architecture as BERT, but the Pre-training Approach was changed to Dynamic Masked Language Modeling, where the choice of words being masked is re-randomized in each epoch of training.
In addition, RoBERTa drops the Next Sentence Prediction objective from pre-training, as the authors found that it does not meaningfully improve the quality of the learned representations.
Dynamic masking applied to the same sentence across training epochs:
Original: I LOVE ICE CREAM IN SUMMER
Epoch 1:  I LOVE ICE [MASK] IN [MASK]
Epoch 2:  I LOVE [MASK] CREAM IN SUMMER
Epoch 3:  I [MASK] ICE CREAM [MASK] SUMMER
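A minimal sketch of this dynamic masking idea (illustrative only, not RoBERTa's actual data pipeline): a fresh random set of positions is masked every time the sentence is seen.

```python
# Dynamic masking sketch: re-randomize the masked positions on every pass.
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    masked = tokens[:]
    n_to_mask = max(1, round(mask_prob * len(tokens)))  # at least one mask
    for i in random.sample(range(len(tokens)), n_to_mask):
        masked[i] = mask_token
    return masked

sentence = "I LOVE ICE CREAM IN SUMMER".split()
for epoch in range(1, 4):
    # A different random masking of the same sentence in each epoch.
    print(f"Epoch {epoch}:", " ".join(mask_tokens(sentence, mask_prob=0.3)))
```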
XLNet - Another Alternative
XLNet, another variation on BERT released around the same time as RoBERTa in 2019, uses a different Pre-training technique to improve on BERT’s Masked Language Modeling.
XLNet again uses the exact same architecture as BERT. But the XLNet authors argued that predicting all the masked tokens independently in a single pass neglects the sequential dependencies between tokens that exist in Natural Language.
So XLNet proposes a different idea - Permutation Language Modeling.
Rather than predicting both “Ice” and “Cream” in the same run, for instance, the model predicts
one masked token in one run (let’s say “Ice”), then uses this to predict “Cream” in the next run.
And just like with RoBERTa, this tweak leads to an improvement over BERT on several NLP tasks, often by a large margin.
XLNet - Training Process
Here’s the XLNet training process illustrated with the same example:
[Figure: permutation-based prediction of the masked tokens, shown for Epoch 1 and Epoch 2]
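To make the permutation idea concrete, here is a small illustrative sketch (not the actual XLNet implementation): sample a random factorization order over the token positions, then predict the tokens one at a time in that order, each time conditioning on the tokens already predicted.

```python
# Permutation Language Modeling sketch: predict tokens in a random order,
# conditioning each prediction on the tokens that came earlier in that order.
import random

tokens = "I LOVE ICE CREAM IN SUMMER".split()
order = list(range(len(tokens)))
random.shuffle(order)                    # e.g. [2, 5, 0, 3, 1, 4]

known = {}                               # positions already predicted
for step, pos in enumerate(order, start=1):
    context = {p: tokens[p] for p in sorted(known)}
    print(f"Step {step}: predict position {pos} ({tokens[pos]!r}) "
          f"given {context if context else 'no context'}")
    known[pos] = tokens[pos]
```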
DistilBERT - Knowledge Distillation
DistilBERT goes down a more practically important direction, and applies an idea for reducing BERT’s computational cost that is generalizable to any Deep Learning model - Knowledge Distillation.
This simple idea, also called Teacher-Student Training, allows us to reduce the often unnecessary size and complexity of Large Language Models like BERT, and is a boon to resource-constrained settings such as individual laptops, mobile phones and other edge devices where heavy Deep Learning applications still need to run.
DistilBERT was shown to reduce the size of BERT by 40% and improve compute speed by 60%, while still retaining 97% of BERT’s Natural Language Understanding abilities.
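As a hedged sketch of the Teacher-Student idea (a standard distillation loss, not the exact DistilBERT training recipe), the student is trained to match the teacher's softened output distribution in addition to fitting the true labels:

```python
# Knowledge Distillation sketch: the small student mimics the large teacher.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL divergence between temperature-softened distributions.
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    # Hard targets: ordinary cross-entropy against the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Illustrative tensors: a batch of 4 examples with 3 classes.
student = torch.randn(4, 3, requires_grad=True)
teacher = torch.randn(4, 3)
labels = torch.tensor([0, 2, 1, 0])
distillation_loss(student, teacher, labels).backward()
```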
OpenAI, on the other hand, went down a different path in the pursuit of creating Large Language
Models (LLMs) that are capable of generating high-quality text - Decoder-only Architectures.
Decoder-only models generate Natural Language text in an autoregressive fashion (the output of
one timestep is the input to the next) based on a prompt.
The difference from the earlier Encoder-Decoder style architecture is that in a Decoder-only
model, the initial input prompt is also used as input into the Decoder.
Autoregressive generation of the answer, one token per step:
Step 1: What do we Like in Summer ? [MASK] [MASK] [MASK] [MASK]
Step 2: What do we Like in Summer ? We [MASK] [MASK] [MASK]
Step 3: What do we Like in Summer ? We Love [MASK] [MASK]
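The step-by-step filling above corresponds to a greedy autoregressive loop. A minimal sketch (assuming the publicly available GPT-2 checkpoint from the Hugging Face transformers library; the prompt is the one from the illustration):

```python
# Decoder-only, autoregressive generation sketch: each predicted token is
# appended to the input and fed back in for the next step.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

ids = tokenizer("What do we like in summer? We", return_tensors="pt").input_ids
for _ in range(3):                              # generate three tokens greedily
    with torch.no_grad():
        logits = model(ids).logits
    next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
    ids = torch.cat([ids, next_id], dim=-1)     # output becomes the next input
    print(tokenizer.decode(ids[0]))
```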
The GPT Family and the number of Decoder Blocks in each model:
GPT-1: 12 | GPT-2: 48 | GPT-3: 96
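For the publicly released GPT-2 checkpoints, the number of Decoder blocks can be read directly off the model config; a small sketch (assuming the Hugging Face transformers library; GPT-3 weights are not public, so its 96 blocks are only noted in the comment):

```python
# Inspect the number of Decoder blocks (n_layer) in public GPT-2 checkpoints.
from transformers import AutoConfig

for name in ["gpt2", "gpt2-xl"]:   # GPT-2 small has 12 blocks, GPT-2 XL has 48
    config = AutoConfig.from_pretrained(name)
    print(f"{name}: {config.n_layer} decoder blocks")
# GPT-3 (not publicly released) uses 96 decoder blocks, as in the table above.
```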