
Transformer Based Learning for BERT and GPT Language Models

Lecture Notes on Deep Learning

Avi Kak and Charles Bouman

Purdue University

Friday 19th April, 2024 13:29

©2024 Avinash Kak, Purdue University


Purdue University 1
Preamble

Large Language Modeling (LLM) made its debut in the 2018-2019 timeframe. Google officially
released its BERT model in late 2018, and OpenAI officially released the GPT-2 model in
February 2019.
What is most distinctive about LLMs is that they can ingest terabytes of publicly
available textual datasets, learn from that data without any supervision, and
become experts in word-to-word, clause-to-clause, sentence-to-sentence, and
paragraph-to-paragraph continuity properties of narratives.
My goal in this lecture is to highlight some of the important aspects of LLMs, the
architectures of their neural networks that are based on Transformers, how they
carry out unsupervised learning, etc.
I’ll start by illustrating several of the LLM concepts through my explanations of
BERT. The main reason for that is that my acquaintanceship with BERT dates
back almost to the year it was born. The GPT models have entered my
consciousness rather recently.
BERT started out as an acronym for Bidirectional Encoder Representations from
Transformers.
Purdue University 2
Preamble (contd.)
However, from the way the acronym BERT is now used in publications and in
general communications, you could say that BERT has become a noun unto
itself. BERT was first presented in the paper:
https://arxiv.org/pdf/1810.04805.pdf

Note that, in retrospect, the word “Bidirectional” in what BERT stands for is an
accident of history and now is probably a source of confusion for most new users
of BERT. As you know, there’s nothing fundamentally unidirectional about a
Transformer.
[Based on the explanation in my Week 14 lecture, what a Transformer does is to learn an attention map for the input.
Ignoring the batch axis, if you feed an [Nw, M] tensor at its input, where Nw is the number of elements (words, patches,
etc.) in the input sequence and M the size of the embedding vector representation of each element, the output of the
Transformer will also be shaped [Nw, M], but with a difference. When you multiply the learned Q and K^T tensors for the
final output, you will get an Nw × Nw array that is the Attention Map over the Nw elements of the input sequence. Each
element of the Attention Map indicates to what extent the element indexed i in the input attends to the element indexed j
in the same input.]
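
As a shape-only sketch of that bookkeeping (single head, batch axis ignored; the tensor names below are illustrative and not taken from any particular implementation):

import torch

Nw, M = 10, 512                          ## sequence length and embedding size
embeddings = torch.randn(Nw, M)          ## the [Nw, M] input to the Transformer

W_Q = torch.randn(M, M)                  ## learnable query projection (illustrative)
W_K = torch.randn(M, M)                  ## learnable key projection (illustrative)

Q = embeddings @ W_Q                     ## [Nw, M]
K = embeddings @ W_K                     ## [Nw, M]

attention_map = torch.softmax(Q @ K.T / M ** 0.5, dim=-1)
print(attention_map.shape)               ## torch.Size([10, 10]); entry (i, j) says how much
                                         ## element i of the input attends to element j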

The word “Bidirectional” in BERT came about because it was Google’s advance
over OpenAI’s first version of GPT, which was programmed to calculate
self-attention in an autoregressive manner, by scanning a sentence left to right.
In an autoregressive model, the dot products that go into the calculation of the
attention for each word depend only on the previous words in the input sentence.
Purdue University 3
Preamble (contd.)
Large Language Models are meant to help us cope with the following dilemma:
Solving a practical problem with complex neural-network architectures based on
Transformers requires a large amount of labeled training data. But creating
labeled data can be expensive.
You run into this dilemma particularly when you initialize the learnable
parameters with random weights, a common practice for the simpler
neural networks.
Over the years, researchers have posited that it should be possible to reduce the
burden of supervised training for solving a specific task if we first initialize the
learnable parameters with inexpensive unsupervised training. The
researchers established the veracity of such claims with small-scale experiments.
But, now, through LLMs, we know that the above is true in general. We can
significantly reduce the extent of supervised learning needed if we first initialize
the learnable weights with inexpensive unsupervised learning on freely available
public datasets.

Purdue University 4
Preamble (contd.)
I have already mentioned the main publication for BERT. For GPT-2, the official
publication is
https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf

and for GPT-3, the publication is


https://arxiv.org/pdf/2005.14165.pdf

Here is a brief chronology of the more famous LLMs that have been released to
date:
BERT officially released in Oct 2018

|
v

GPT-2 officially released in Feb 2019

|
v

GPT-3 officially released in Jun 2020

|
v

GPT-4 officially released in March 2023

Purdue University 5
Preamble (contd.)

Here is a comparison of BERT with GPT-3 by Celeste Mottesi, dated February 9, 2023:
https://blog.invgate.com/gpt-3-vs-bert

According to the author, whereas GPT-3 performs better at tasks such as
summarization and translation, BERT can be expected to perform better at tasks
such as sentiment analysis and natural language understanding.
Since GPT-3 was trained on 45 TB of data, whereas BERT was trained on a mere
3 TB, one would think GPT-3 would beat BERT hands down. Perhaps the
deep-learning mantra “the more data you have the better off you are” does not
apply to every aspect of natural language processing.

Purdue University 6
Preamble – How to Learn from These Slides

At your first reading of these slides, just focus on thoroughly understanding the
following three concepts related to LLMs:

At the most fundamental level, you need to understand what’s meant by generative
training as opposed to discriminative training.

You also need to understand why tokenizers play a critical role in generative
training. What would happen if we used no tokenizers at all and just used the words directly
as the tokens?

Why is the fixed input token length of 512 in BERT possibly a good compromise between
the amount of text you need for learning sentence-to-sentence continuity and what you
need for computational efficiency in training with feasible hardware resources?

Purdue University 7
Outline
1 Generative vs. Discriminative Training of a Neural Network 9

2 The Architecture of BERT and the Formatting of its Input 15

3 Pre-Training BERT 28

4 RoBERTa as a Higher-Performance BERT 32

5 Fine-Tuning BERT with Supervised Training 35

6 Getting Around the 512-Token Limitation of BERT 42

7 GPT-2: The Architecture 48

8 Unsupervised Learning in GPT-2 51

9 GPT-3: The Architecture 53

10 Unsupervised Learning in GPT-3 55

Purdue University 8
Generative vs. Discriminative Training of a Neural Network

Outline

1 Generative vs. Discriminative Training of a Neural Network 9

2 The Architecture of BERT and the Formatting of its Input 15

3 Pre-Training BERT 28

4 RoBERTa as a Higher-Performance BERT 32

5 Fine-Tuning BERT with Supervised Training 35

6 Getting Around the 512-Token Limitation of BERT 42

7 GPT-2: The Architecture 48

8 Unsupervised Learning in GPT-2 51

9 GPT-3: The Architecture 53

10 Unsupervised Learning in GPT-3 55

Purdue University 9
Generative vs. Discriminative Training of a Neural Network

Generative vs. Discriminative Training

For most people, the word “generative” is evocative of creating
“something from nothing”, for example generating images from
noise using either diffusion or adversarial learning. That is NOT the
meaning of “generative” in “Generative vs. Discriminative”.

To convey the ideas of what’s meant by “generative vs.
discriminative” in training a neural network, let’s say that X
represents the input to the network and Y its output. Typically, you
train a classification neural network discriminatively because you trust
the training data and you want the network to make a correct
prediction Y for the given input X.

Therefore, the main focus in discriminative training is to get the
network to correctly estimate the conditional probability p(Y|X),
because the accuracy of X is not open to question.

Purdue University 10
Generative vs. Discriminative Training of a Neural Network

Generative vs. Discriminative Training (contd.)


In “generative” training, on the other hand, while you may still need
to make discriminations based on the output Y , you also want to
make the network learn the inter-relationships between the different
elements of the input X .
That way, the network would also become smarter about the
probabilistic variability in the different X that may correspond to the
same Y . In this manner, you could say that you want to put both the
input X and the output Y on an equal footing with regard to where
you want to posit your trust.
That begs the question: How to learn X ?
Modern thinking is that if X has any structure at all (all languages
are highly rich in structure) and if we could prescribe learning
objectives for the different facets of that structure, all we would need
to do would be to let the neural network loose on the internet so that
it can ingest as many different examples of X as it can find and thus
learn the structure of X.
Purdue University 11
Generative vs. Discriminative Training of a Neural Network

Generative vs. Discriminative Training (contd.)


Relative to the supervised training of neural networks, the approach
outlined at the bottom of the previous slide should cost nothing.
It is this idea that is foundational to LLMs. They all use unsupervised
learning to learn X .

The figure on Slide 14 is meant to capture the unsupervised
generative learning at a very general level. As shown, you feed a
sequence of tokens at the input to a Transformer-based network.

The precise format of the input sequence would depend on what sort
of objective function you are trying to maximize (or what sort of a loss
you are trying to minimize).
For example, if the goal is to train the network at NSP (Next
Sentence Prediction), the token sequence at the input will have two
separate parts, one for a given sentence (sentence_A) and the other for
the sentence that follows (sentence_B). Now the objective would be to
maximize the conditional probability p(sentence_B | sentence_A).
Purdue University 12
Generative vs. Discriminative Training of a Neural Network

Generative vs. Discriminative Training (contd.)


On the other hand, in Masked Language Modeling, the goal would be
to mask out one of the tokens in the input token sequence and to
maximize the probability of the correct word at that position given all
the context words. So if the input token sequence is expressed as
(w_1, w_2, ..., w_N) and if we consider w_m to have been masked, we
would want to maximize p(w_m | w_1, ..., w_{m-1}, w_{m+1}, ..., w_N).

For yet another possibility, if your language model is autoregressive,
meaning that the production of each token would depend only on the
previously produced tokens, you would want to maximize
p(w_i | w_{i-1}, w_{i-2}, ..., w_1).
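
To make the three objectives above concrete, here is a small, hedged sketch that writes each of them as a cross-entropy loss over toy logits; the vocabulary size and all tensor names are made up for the illustration and are not taken from any particular implementation:

import torch
import torch.nn.functional as F

vocab_size, Nw = 30522, 8
logits = torch.randn(Nw, vocab_size)               ## per-position vocabulary scores from the network

## MLM: maximize p(w_m | all the other words), i.e., cross-entropy at the masked position m
m = 3
mlm_loss = F.cross_entropy(logits[m:m+1], torch.tensor([17]))

## NSP: maximize p(sentence_B | sentence_A), framed as a two-way classification
nsp_logits = torch.randn(1, 2)                     ## scores for [isNext, notNext]
nsp_loss = F.cross_entropy(nsp_logits, torch.tensor([0]))   ## 0 = isNext

## Autoregressive LM: position i predicts token i+1 from tokens 1..i
token_ids = torch.randint(0, vocab_size, (Nw,))
ar_loss = F.cross_entropy(logits[:-1], token_ids[1:])
print(mlm_loss.item(), nsp_loss.item(), ar_loss.item())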

Purdue University 13
Generative vs. Discriminative Training of a Neural Network

Generative Learning

Purdue University 14
The Architecture of BERT and the Formatting of its Input

Outline

1 Generative vs. Discriminative Training of a Neural Network 9

2 The Architecture of BERT and the Formatting of its Input 15

3 Pre-Training BERT 28

4 RoBERTa as a Higher-Performance BERT 32

5 Fine-Tuning BERT with Supervised Training 35

6 Getting Around the 512-Token Limitation of BERT 42

7 GPT-2: The Architecture 48

8 Unsupervised Learning in GPT-2 51

9 GPT-3: The Architecture 53

10 Unsupervised Learning in GPT-3 55

Purdue University 15
The Architecture of BERT and the Formatting of its Input

BERT Architecture
The basic architecture of BERT is exactly the same as that of the
Master Encoder shown on Slide 36 of my Week 14 slides. That slide
presented both the Master Encoder and the Master Decoder in a
Transformer-based machine translation framework. The figure on the
next slide is just the Master Encoder part of the Week 14 figure as
the neural architecture for BERT.

Since the embedding vectors play a central role in the calculation of
Attention, you need to know right off the bat that the size of the
embeddings in BERT is either 768 or 1024, depending on which
version of BERT you are using. There are two versions, as you will see.

In the BERT paper, the symbol H represents the size of the
embedding vectors.
[For some reason, that paper refers to embeddings as “Hidden vectors” or “Hidden states.” In retrospect, it sounds
bizarre to be using those names for the embeddings — especially because of the unique (and semantically rich) role
played by the “hidden state” in Recurrent Neural Networks that we covered in my Week 13 lecture.]

Purdue University 16
The Architecture of BERT and the Formatting of its Input

BERT Architecture (contd.)

Figure: The BERT architecture, as shown above, is exactly the same as the Master Encoder part of the overall architecture
on Slide 36 of my Week 14 slides.
Purdue University 17
The Architecture of BERT and the Formatting of its Input

BERT Architecture (contd.)

What I have shown as Basic Encoders in the figure on the previous
slide are called Transformer Blocks in the BERT architecture.

The two versions of BERT are denoted BERT-base and BERT-large.

In BERT-base, the number of Transformer Blocks is 12 and the number
of Attention Heads is also 12. On the other hand, in BERT-large, the
number of Transformer Blocks is 24 and the number of Attention
Heads is 16.

Another point of difference between BERT-base and BERT-large is the size
of the embeddings. In keeping with the numbers presented on Slide
16, the embeddings are of size 768 for BERT-base and 1024 for BERT-large.

Also note that the number of learnable parameters in BERT-base is 110
million, while in BERT-large it is 340 million.
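
If you have the transformers package installed, one quick way to confirm these numbers is to inspect the published model configurations; a minimal sketch (assumes access to the Hugging Face hub; the checkpoint names are the standard public identifiers):

from transformers import AutoConfig

for name in ["bert-base-uncased", "bert-large-uncased"]:
    cfg = AutoConfig.from_pretrained(name)
    print(name, cfg.hidden_size, cfg.num_hidden_layers, cfg.num_attention_heads)
## bert-base-uncased   768  12  12
## bert-large-uncased  1024 24  16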
Purdue University 18
The Architecture of BERT and the Formatting of its Input

BERT Input Format


What makes BERT so versatile, and why it continues to be used
for domain-specific LLMs, is the nature of the “formatting” of the
input on which it is trained.

As shown in the caption at the bottom of the figure on Slide 17, the
input consists of a pair of token sequences, with each token sequence
representing a span of continuous text. The two spans at the input
could represent, say, a (Question, Answer) pair, a
(Sentence, Next Sentence) pair, etc.

Although, for most applications, you would want to feed a pair of
token sequences at its input, BERT will also accept a single token
sequence.

Feeding just one token sequence at the input is important for one of
the two modes of generative training of BERT, the mask-based mode.
More on that later.
Purdue University 19
The Architecture of BERT and the Formatting of its Input

BERT Input Format (contd.)


There is an upper limit on the length of a token sequence: 512. This
is dictated by the role played by the parameter max_seq_length in the
constructor of the SelfAttention class for the Transformers in my Week
14 lecture.
[Since people have gotten used to seeing long narratives that the LLMs can now return in response to human
supplied prompts, you are likely to wonder if using 512 as a limit on the input to the BERT network is way too
limiting for modern times. The answer to that is “Not necessarily.” The most basic goal of an LLM is to learn in an
unsupervised manner the fundamental language continuity properties, from word-to-word, clause-to-clause,
sentence-to-sentence, paragraph-to-paragraph, and so on. BERT can do a pretty good job of learning the
word-to-word, clause-to-clause and sentence-to-sentence continuity properties. Consider, for example, the two input
sentences in a Question/Answer framework. Ignoring the special tokens for a moment, you have 256 tokens for the
Question part and the same number for the Answer part. Since the tokenizers used for LLMs do not break up the
most frequently used words, let’s assume, for the sake of argument, that our sentences are composed of only the
common words that do not get the ax from the tokenizer. Next consider that, in English, a typical sentence has
between 20 and 30 words. So a span of 256 words is likely to capture several sentence-to-sentence transitions,
including, obviously, word-to-word and clause-to-clause transitions. You might still ask: What about the
paragraph-to-paragraph continuity properties? Those can be taken care of in a supervised extension of BERT as
discussed later. Supervised extensions of BERT are not computationally demanding.]

The two token sequences at the input are separated by a special
token denoted [SEP]. More precisely, each of the two token sequences
ends in the token [SEP]. Obviously, [SEP] is represented by a learnable
embedding of its own.
Purdue University 20
The Architecture of BERT and the Formatting of its Input

BERT Input Format (contd.)


For convenience, we refer to the text sequence that comes before the
SEP token as “Sentence-A” and the text sequence that comes after the
separator token as “Sentence-B”.
A special learnable token, called the classification token, is
prepended to the token sequence for each sentence-pair input. This
token is denoted [CLS]. [The class token is useful for applications, such as sentiment analysis, in
which you want to classify text. When solving a classification problem, you would feed the embedding corresponding
to the [CLS] input at the output of the Master Encoder into an MLP for classification.]

The figure on the next slide shows the placement of the [CLS] token
for the imaginary case when the two-sentence input is limited to 10
tokens.

In order to give the network a deeper sense of which token belongs to
which of the two input sequences, we add to the embedding of each
input token a learnable segment embedding that marks the token as
belonging to Sentence-A or Sentence-B, as also shown on the next
slide.
Purdue University 21
The Architecture of BERT and the Formatting of its Input

BERT Input Format (contd.)

Figure: This figure is taken from the original BERT paper

Purdue University 22
The Architecture of BERT and the Formatting of its Input

BERT Input Format and the Tokens


And, as in the Transformer implementations you saw in my Week 14
lecture, the network also needs to acquire a sense of where exactly
an input token belongs in relation to the other tokens in a sequence.
As shown in the figure on the previous slide, we designate a
per-position token (more accurately, the learnable embedding for a
per-position token) for that purpose. The embeddings for the position
tokens are added to the embeddings for the input sentence tokens.
Since BERT allows for a maximum of 512 tokens at the input, you
will have a total of 512 position tokens, each represented by its embedding.
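
Here is a small sketch of how the special tokens and the A/B segment ids show up when a sentence pair is encoded with the BERT tokenizer (assumes the transformers package can download bert-base-uncased; the two sentences are made up):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("How are you?", "I am fine.")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
## ['[CLS]', 'how', 'are', 'you', '?', '[SEP]', 'i', 'am', 'fine', '.', '[SEP]']
print(enc["token_type_ids"])      ## 0s for Sentence-A, 1s for Sentence-B
## [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
## The position embeddings (indexed 0 through 511) are added inside the model itself.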
BERT uses the WordPiece tokenization algorithm for segmenting the
words into tokens. To start with, WordPiece initializes the token
vocabulary with the individual characters in the words. It then progressively
merges the characters on the basis of the joint probabilities of the
merged symbols vis-a-vis their marginal probabilities.
Purdue University 23
The Architecture of BERT and the Formatting of its Input

The Tokenizer in BERT


The end result is that the words that occur frequently are left alone
and those that occur relatively rarely are broken into subwords.
The subwords generated in this manner can be identified in the
output of the tokenizer by the prefix '##', as you will see in the
tokenizer output on Slides 26 and 27. The mark '##' is called the
“continuing subword prefix” in the parlance of the BERT tokenizer.

By the way, there is also an upper limit of 100 characters on a word
that is to be subject to tokenization.

As you would expect, the decomposition of the rarer words in this
manner leads to much more frequently occurring subwords, as
demonstrated on Slides 26 and 27.

One of my motivations for the tokenizer demo on Slides 26 and 27 is
to dispel a rather commonly held misconception that the main job of
the tokenizer is to break up long words into smaller subwords.
Purdue University 24
The Architecture of BERT and the Formatting of its Input

The Tokenizer in BERT (contd.)


The next two slides are a demo of the WordPiece tokenizer in BERT.
You will need to install the Python package transformers with a
command like “sudo pip install transformers” if you want to execute
the code yourself. Installing transformers will automatically pull in the
following packages: safetensors, regex, huggingface-hub, tokenizers,
transformers.

And when you execute the code in this demo, it will further download
the files
vocab.txt
tokenizer.json
config.json
tokenizer_config.json

As you would expect, the tokenizer vocabulary (about 30,000+
tokens) is in the file vocab.txt. The file tokenizer.json has the integer
indexes for the vocab entries. If you want to know where these files are
stored, execute the following command at the top of your directory
tree:

find . -name vocab.txt
Purdue University 25
The Architecture of BERT and the Formatting of its Input

BERT Tokenizer (contd.)


## demo_bert_tokenizer.py

from transformers import BertTokenizer                                                        ## (1)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")                                ## (2)

tokens = tokenizer.tokenize("All work and no play makes Jack a dull boy!")                    ## (3)
print(tokens)
## ['all', 'work', 'and', 'no', 'play', 'makes', 'jack', 'a', 'dull', 'boy', '!']             ## (4)

tokens = tokenizer.tokenize("Song in Mary Poppins: It's supercalifragilisticexpialidocious")  ## (5)
print(tokens)
## ['song', 'in', 'mary', 'pop', '##pins', ':', 'it', "'", 's', 'super', '##cal', '##if',
##  '##rag', '##ilis', '##tic', '##ex', '##pia', '##lid', '##oc', '##ious']                   ## (6)

FILE = open("bert_vocab.txt")                                                                 ## (7)
all_tokens = FILE.read().splitlines()   ## readlines() instead will retain '\n' at end of each entry   ## (8)
FILE.close()

all_tokens.sort( key=lambda x: len(x), reverse=True )                                         ## (9)

print("\n\nThe longest ten entries in BERT tokenizer vocab: ", all_tokens[:10])               ## (10)
## ['telecommunications', 'interdisciplinary', 'telecommunication', 'responsibilities', 'autobiographical',
##  'intercontinental', 'entrepreneurship', 'unconstitutional', 'northamptonshire', 'characterization']
##                                                                                            ## (11)

print("\n\nThe shortest ten entries in BERT tokenizer vocab: ", all_tokens[-10:])             ## (12)
## ['!', '(', ')', ',', '-', '.', '/', ':', '?', '~']                                         ## (13)

(Continued on the next slide .....)
Purdue University 26
The Architecture of BERT and the Formatting of its Input

BERT Tokenizer (contd.)


(...... continued from the previous slide)

print("\n\nExperiments with the 'tokenizer.json' file that has the mappings from the tokens to the integer indexes:\n\n")

import json
FILE = open("tokenizer.json")                                              ## (14)
all_entries = json.load(FILE)                                              ## (15)
for key in all_entries['model']['vocab']:                                  ## (16)
    print("%s : %s\n" % (key, all_entries['model']['vocab'][key]))         ## (17)
inverse_look_up = {v:k for k,v in all_entries['model']['vocab'].items()}   ## (18)
print("\n\nshowing the token for the index 102: ", inverse_look_up[102])   ## [SEP]  ## (19)

As you can tell from the code starting in Line (14), the mappings
from the tokens to the integer index values are stored in the JSON
file tokenizer.json. We load the JSON file into our script in Line (15)
and access the integer index values for the tokens at the nested
dictionary at the key vocab of the dictionary that is at the key model.
Lines (16) and (17) will print out the integer index values for all
30,000+ tokens that BERT uses.

We invert the indexes in Line (18) and query the inverted index for
the token associated with the integer 102.
Purdue University 27
Pre-Training BERT

Outline

1 Generative vs. Discriminative Training of a Neural Network 9

2 The Architecture of BERT and the Formatting of its Input 15

3 Pre-Training BERT 28

4 RoBERTa as a Higher-Performance BERT 32

5 Fine-Tuning BERT with Supervised Training 35

6 Getting Around the 512-Token Limitation of BERT 42

7 GPT-2: The Architecture 48

8 Unsupervised Learning in GPT-2 51

9 GPT-3: The Architecture 53

10 Unsupervised Learning in GPT-3 55

Purdue University 28
Pre-Training BERT

Pre-Training BERT
The Generative Pre-Training of BERT is carried out with respect to
the following two tasks:
Masked Language Modeling (MLM)
Next Sentence Prediction (NSP)
.
Regarding MLM, during unsupervised training, you mask out a certain
percentage of the tokens at the input to the BERT network shown in
Slide 17. The tokens that are masked are selected at random.

Subsequently, in keeping with the display in Slide 14, the embedding
vectors that correspond to the masked tokens at the input are fed into
nn.Softmax over the 30,000+ vocabulary of the BERT tokenizer to find
the ML (maximum-likelihood) predictions for the masked tokens. The
loss thus calculated is backpropagated, as you would expect. BERT
pre-training calls for randomly masking 15% of the input tokens.
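
As a quick, hedged illustration of the MLM objective, here is a sketch that uses the already pre-trained masked-LM head rather than the pre-training loop itself (assumes the transformers and torch packages are installed):

import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

enc = tokenizer("The capital of France is [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits                        ## shape [1, seq_len, vocab_size]

mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
predicted_id = logits[0, mask_pos].argmax().item()
print(tokenizer.convert_ids_to_tokens([predicted_id]))  ## most likely ['paris']
## During pre-training, 15% of the positions are masked at random and the
## cross-entropy loss at those positions is backpropagated.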
Purdue University 29
Pre-Training BERT

Pre-Training BERT (contd.)


And, regarding NSP, again during unsupervised training, the goal is to
feed (sentence-A, sentence-B) pairs into the BERT network, with
sentence-B being the actual next sentence to sentence-A 50% of the
time and a randomly chosen sentence the other 50% of the time.

The training objective is then a classification task with a loss function like
the Binary Cross-Entropy Loss (nn.BCELoss) using the labels isNext and
notNext.
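
Here is a hedged sketch of how such 50/50 training pairs might be constructed from a list of consecutive sentences; the toy corpus and the helper function are made up for illustration:

import random

corpus = ["Sentence one.", "Sentence two.", "Sentence three.", "Sentence four."]

def make_nsp_pair(i):
    """Return (sentence_A, sentence_B, label) with label 0 = isNext and 1 = notNext."""
    sentence_A = corpus[i]
    if random.random() < 0.5:
        return sentence_A, corpus[i + 1], 0        ## the actual next sentence
    j = random.choice([k for k in range(len(corpus)) if k != i + 1])
    return sentence_A, corpus[j], 1                ## a randomly chosen sentence

print(make_nsp_pair(0))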

Purdue University 30
Pre-Training BERT

Datasets Used for Pre-Training BERT


The unsupervised pre-training for BERT has been carried out using
the following two datasets:
The BookCorpus, which consists of about 7,000 self-published books,
amounting to roughly 985 million words. According to Wikipedia,
this dataset is no longer available for training.
The English Wikipedia, with its roughly 2.5 billion words.

Purdue University 31
RoBERTa as a Higher-Performance BERT

Outline

1 Generative vs. Discriminative Training of a Neural Network 9

2 The Architecture of BERT and the Formatting of its Input 15

3 Pre-Training BERT 28

4 RoBERTa as a Higher-Performance BERT 32

5 Fine-Tuning BERT with Supervised Training 35

6 Getting Around the 512-Token Limitation of BERT 42

7 GPT-2: The Architecture 48

8 Unsupervised Learning in GPT-2 51

9 GPT-3: The Architecture 53

10 Unsupervised Learning in GPT-3 55

Purdue University 32
RoBERTa as a Higher-Performance BERT

RoBERTa as a More Powerful BERT


Shortly after BERT was announced, the following publication
“RoBERTa: A Robustly Optimized BERT Pretraining Approach” by Liu et al.
improved upon some important aspects of BERT to create a more
robust pre-training framework:
https://arxiv.org/pdf/1907.11692.pdf

The authors claim that their version of BERT outperforms the
original on the same competition datasets that were used for evaluating BERT.
Whether or not the difference in the performance numbers is
significant, that's for you to decide. As far as I can tell, BERT
still dominates the scene (for those who have not switched over to
OpenAI's GPT-based models).
Note that in the basic architecture of RoBERTa, the formatting of
the input data, the unsupervised training protocol, etc., are the same
as for BERT. However, RoBERTa was trained with a much larger
dataset.
Purdue University 33
RoBERTa as a Higher-Performance BERT

RoBERTa (contd.)
From my perspective, an important difference between BERT and
RoBERTa is that the latter uses the BPE (Byte Pair Encoding)
algorithm for tokenization. It is the same tokenizer that is used for
the GPT models from OpenAI.
Here are two significant differences between BPE and the WordPiece
tokenizer used in BERT:
While, for the most part, WordPiece uses the ASCII characters as the
basic units for merging in order to form the subwords, BPE uses the
bytes directly. This allows BPE to be more general with respect to the
different languages. [This is not to imply that WordPiece does not recognize other languages. If you
direct the output of the script I showed on Slides 26 and 27 into a text file and scroll the file, you will see the
basic symbols from practically all the languages in the 30,000+ vocabulary of the WordPiece tokenizer.]
BPE elicits the help of a pre-tokenizer to count the frequencies of all
the words in the corpus. It subsequently uses those frequencies for
merging the bytes into subwords on the basis of the frequencies of the
merged bytes vis-a-vis the frequencies of the subwords that were merged.
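
To see the two tokenizers side by side, here is a small sketch that runs the same rare word through WordPiece and through the byte-level BPE tokenizer used by RoBERTa and GPT-2 (assumes both vocabularies can be downloaded from the Hugging Face hub):

from transformers import BertTokenizer, RobertaTokenizer

wordpiece = BertTokenizer.from_pretrained("bert-base-uncased")
byte_bpe = RobertaTokenizer.from_pretrained("roberta-base")

word = "supercalifragilisticexpialidocious"
print(wordpiece.tokenize(word))   ## subwords carry the '##' continuing-subword prefix
print(byte_bpe.tokenize(word))    ## byte-level BPE pieces, with a different marking convention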
Purdue University 34
Fine-Tuning BERT with Supervised Training

Outline

1 Generative vs. Discriminative Training of a Neural Network 9

2 The Architecture of BERT and the Formatting of its Input 15

3 Pre-Training BERT 28

4 RoBERTa as a Higher-Performance BERT 32

5 Fine-Tuning BERT with Supervised Training 35

6 Getting Around the 512-Token Limitation of BERT 42

7 GPT-2: The Architecture 48

8 Unsupervised Learning in GPT-2 51

9 GPT-3: The Architecture 53

10 Unsupervised Learning in GPT-3 55

Purdue University 35
Fine-Tuning BERT with Supervised Training

But First a Demo of the Power of Pre-Training


The pre-trained BERT is frequently customized by supervised
fine-tuning for applications that involve question answering, sentiment
analysis, and language inference.

Before I describe how you can customize BERT, it's interesting to
note that BERT packs a lot of punch even without any further
fine-tuning. I illustrate this with the script on the next slide.

The script is about evaluating which of the two sentences, sentence_B or
sentence_C, is a better continuation as the next sentence for sentence_A.
[If you are new to the US, you are probably thinking that Purdue Pharma in sentence B must refer to the famous
Pharmacy department at Purdue University. Purdue Pharma, a pharmaceutical company owned by a family known
as Sacklers, is the maker of the painkiller drug OxyContin. It was believed that this drug caused what’s known as
the Opioid Crisis in the United States, with hundreds of thousands of people developing a strong addiction to the
drug. Highly questionable business practices by the company were considered to have contributed to the problem.]

When you execute the script, you get the following answer:
CASE 1:
prediction loss: tensor(1.0133e-05, grad_fn=<NllLossBackward0>)
True Pair
CASE 2:
prediction loss: tensor(2.7418e-06, grad_fn=<NllLossBackward0>)
True Pair
Purdue University 36
Fine-Tuning BERT with Supervised Training

A Demo of the Power of Pre-Training (contd.)

## nsp1.py

from transformers import BertTokenizer, BertForNextSentencePrediction                ## (1)
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')                       ## (2)
model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')           ## (3)

sentence_A = "Purdue University is famous for its engineering program \
at both graduate and undergraduate levels."                                          ## (4)
sentence_B = "The Supreme Court heard arguments over a bankruptcy deal \
for Purdue Pharma that would give billions of dollars to \
those harmed by the opioid epidemic."                                                ## (5)
sentence_C = "Purdue has an enrollment of over 50000 students at its \
West Lafayette campus alone."                                                        ## (6)

print("\n\nCASE 1:")                                                                 ## (7)
## 'tokenized' is a dictionary with keys: 'input_ids', 'token_type_ids', 'attention_mask'   ## (8)
tokenized = tokenizer(sentence_A, sentence_B, return_tensors='pt')                   ## (9)
labels = torch.LongTensor([0])                                                       ## (10)
predict = model(**tokenized, labels=labels)                                          ## (11)
print("\n\nprediction loss: ", predict.loss)                                         ## (12)
prediction = torch.argmax(predict.logits)                                            ## (13)
if prediction == 0:                                                                  ## (14)
    print("True Pair")                                                               ## (15)
else:                                                                                ## (16)
    print("False Pair")                                                              ## (17)

(Continued on the next slide .....)

Purdue University 37
Fine-Tuning BERT with Supervised Training

A Demo of the Power of Pre-Training (contd.)

(...... continued from the previous slide)

print("\n\nCASE 2:")                                                                 ## (18)
tokenized = tokenizer(sentence_A, sentence_C, return_tensors='pt')                   ## (19)
labels = torch.LongTensor([0])                                                       ## (20)
predict = model(**tokenized, labels=labels)                                          ## (21)
print("\n\nprediction loss: ", predict.loss)                                         ## (22)
prediction = torch.argmax(predict.logits)                                            ## (23)
if prediction == 0:                                                                  ## (24)
    print("True Pair")                                                               ## (25)
else:                                                                                ## (26)
    print("False Pair")                                                              ## (27)

The good news is that it declares that sentence_C is significantly closer
to sentence_A than sentence_B.

The bad news is that it declares both as true pairs. It is to
rectify such issues that you need to carry out the sort of supervised-learning-based
customization described next.

Purdue University 38
Fine-Tuning BERT with Supervised Training

A Demo of the Power of Pre-Training (contd.)


The good news mentioned on the previous slide speaks to the power of unsupervised
generative pre-training. Think about it: without any supervision at all, just by letting
your neural network loose on the publicly available data on the internet, it can figure
out how to create a continuous narrative for you on a topic for which it is prompted.

About the code shown on the previous two slides, the statement in
Line (3) will download the pre-trained BERT model as a roughly 440 MB
safetensors archive.

safetensors is the new recommended way to store a model, especially
one that is meant for distribution over the internet. In the past, the
practice has been (and still continues to be) to store the model
weights in a “.bin” file using Python's pickle module. But the Hugging Face
folks who created safetensors say that it is relatively easy to embed
malicious code in a pickled archive. For further information, here is
the GitHub link where you can find out more about safetensors:
https://github.com/huggingface/safetensors
Purdue University 39
Fine-Tuning BERT with Supervised Training

Fine-Tuning BERT
Getting to the main subject of this section, according to the authors
of BERT, “compared to pre-training, fine-tuning is relatively
inexpensive. All of the results in the paper can be replicated in at
most 1 hour on a single Cloud TPU, or a few hours on a GPU,
starting from the exact same pre-trained model.”

For answering questions, the supervised fine-tuning would consist of
feeding (Sentence-A, Sentence-B) pairs at the input, where
Sentence-A would be the question and Sentence-B (possibly in the
form of a short paragraph) the answer.

At the same time, you would also create sentence pairs in which the
second sentence is NOT the answer. The target would be the binary
labels isAnswer and isNotAnswer. In this manner, the task boils
down to a binary classification problem with the Binary Cross-Entropy
Loss as provided by nn.BCELoss.
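
Here is a hedged sketch of one such fine-tuning step, using a two-way sequence-classification head on (question, candidate-answer) pairs; the sentences, labels, and optimizer settings below are placeholders for illustration and are not the recipe from the BERT paper:

import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

question = "Where is Purdue University located?"
candidate = "Purdue University is located in West Lafayette, Indiana."
label = torch.tensor([1])                        ## e.g., 1 = isAnswer, 0 = isNotAnswer

enc = tokenizer(question, candidate, return_tensors="pt", truncation=True, max_length=512)
loss = model(**enc, labels=label).loss           ## cross-entropy over the two classes
loss.backward()
optimizer.step()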
Purdue University 40
Fine-Tuning BERT with Supervised Training

Fine-Tuning BERT
Fine-tuning of the pre-trained BERT is even easier for a Sentiment
Analysis application. Now the input to the BERT network would
consist of (Sentence-A, ∅) pairs, and the prediction made through the
embedding vector for the class token [CLS] would be either True or
False depending on the ground-truth tag for the text in Sentence-A.
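
For instance, a minimal sketch of pulling out the [CLS] embedding and attaching a small classification head to it (the two-way linear head and the example review are illustrative, not from the BERT paper):

import torch.nn as nn
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
head = nn.Linear(bert.config.hidden_size, 2)            ## positive / negative

enc = tokenizer("The movie was a complete waste of time.", return_tensors="pt")
cls_embedding = bert(**enc).last_hidden_state[:, 0, :]  ## output embedding at the [CLS] position
logits = head(cls_embedding)                            ## trained with a cross-entropy loss on labeled text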
More generally, though, the ability of BERT to predict the next
sentence can be leveraged into several different ways of fine-tuning:
MNLI (Multi-Genre Natural Language Inference): This is a large-scale entailment
classification task. Given a pair of sentences, the “goal is to predict
whether the second sentence is an entailment, contradiction, or neutral
with respect to the first sentence.”
QQP (Quora Question Pairs): This is a binary classification task in which the goal is to
determine whether the two questions, as represented by Sentence-A and
Sentence-B, are semantically equivalent.
QNLI (Question Natural Language Inference): This is a version of the Stanford Question
Answering Dataset. In this case, in each sentence pair (Sentence-A,
Sentence-B), either Sentence-B contains the answer to the question in
Sentence-A or it does not.
Appendix B of the BERT paper lists several additional fine-tuning tasks.
Purdue University 41
Getting Around the 512-Token Limitation of BERT

Outline

1 Generative vs. Discriminative Training of a Neural Network 9

2 The Architecture of BERT and the Formatting of its Input 15

3 Pre-Training BERT 28

4 RoBERTa as a Higher-Performance BERT 32

5 Fine-Tuning BERT with Supervised Training 35

6 Getting Around the 512-Token Limitation of BERT 42

7 GPT-2: The Architecture 48

8 Unsupervised Learning in GPT-2 51

9 GPT-3: The Architecture 53

10 Unsupervised Learning in GPT-3 55

Purdue University 42
Getting Around the 512-Token Limitation of BERT

BERT is Great, But ...


BERT has been a huge commercial success story for Google. Many
businesses have used it to create smarter customer-facing e-commerce
webpages. Google itself has used BERT in practically every facet of its
search engine.
Here is a quote from an NVIDIA article on BERT:
“ BERT models are able to understand the nuances of expressions at a much finer
level. For example, when processing the sequence ’Bob needs some medicine from
the pharmacy. His stomach is upset, so can you grab him some antacids?’ BERT
is better able to understand that “Bob” “his” and “him” are all the same person.
Previously, the query ’how to fill bob’s prescriptions’ might fail to understand that
the person being referenced in the second sentence is Bob. With the BERT model
applied, it’s able to understand how all these connections relate.”

However, BERT has one limitation: the 512-token limit at the
input to the BERT network.
BERT's size limitation on the number of input tokens is not a problem if
the goal is to do a good job of learning sentence-to-sentence
continuations.
Purdue University 43
Getting Around the 512-Token Limitation of BERT

BERT is Great, But ...(contd.)


But what about also learning sentence-to-paragraph,
paragraph-to-sentence, and paragraph-to-paragraph continuations?
Or similar continuations that relate a sentence to a
longer-than-a-paragraph narrative?

At this point, it’s good to review why BERT was designed with the
512 limit on the number of input tokens. To understand that you
have to first come to grips with the fact that EVERY Transformer based
implementation is designed for some expected maximum number of tokens at its input.
As to why:
[Every Transformer-based implementation for any deep-learning application is designed for a specific value of what I
have labeled max_seq_length in my Transformer implementation in DLStudio. You can think of that as the
expected maximum number of words in the input sequence. If Nw is the value for this maximum number of words,
your attention map at the output of any of the BasicEncoders will be a 2D array of shape Nw × Nw. That makes sense
because an Attention Map is supposed to tell to what extent each word at the input attends to every other word
in the same input. From the standpoint of the calculations involved, the Attention Map is a result of the dot product
Q · K^T of the query Q and the key K tensors. Ignoring the batch axis and also assuming single-headed attention for
the sake of argument, at the output of each BasicEncoder, the Q tensor is calculated by multiplying the learnable
matrix WQ of shape [Nw, M] with the transpose of the data matrix that is actually at the output of the encoder,
which is also of shape [Nw, M]. Since these matrices must be learnable, their sizes have to be set in advance.
Therefore, the max value for Nw must be known in advance, as must the value for the size M of the embeddings.]

Purdue University 44
Getting Around the 512-Token Limitation of BERT

Approaches Based on Segmenting the Longer Inputs

If you wanted to learn narrative continuity properties at a level higher
than what BERT currently does, one would ask: why not just use a
larger value for the maximum number of input tokens?

Unfortunately, the solution to BERT's limitation is not as easy as that
because of the amount of input data that must be ingested for
generative pre-training in general. In general, the size of the GPU
memory you need goes up quadratically with the maximum number
of tokens at the input.
[Consider the following: With the Transformer implementation I presented in my Week 14 lecture, I run out of GPU
memory if I exceed 4 Attention Heads and 4 Basic Encoders for an embedding size of 256. It would be infeasible
for me to run that code with the numbers used in BERT for the same parameters. My guess is that
doubling the number of input tokens for BERT would obviously be technically feasible, but it would considerably
lengthen the execution time and, also, create a heavier-duty product for its customization for specific applications.]

As it turns out, it is possible to extend BERT in its fine-tuning phase
to create a more expansive framework that can learn language
continuities at a much larger level of abstraction.
Purdue University 45
Getting Around the 512-Token Limitation of BERT

RoBERT, ToBERT, and Sliding Windows

For example, the following publication by Pappagari et al., “Hierarchical
Transformers for Long Document Classification”:

https://arxiv.org/pdf/1910.10781.pdf

shows that if your overall goal is just classification, it is possible to use
the following strategy to learn language continuity properties with
input token sequences that exceed the 512-token limit used in BERT: (1)
You split long input token sequences into 512-token segments in order
to conform to the input constraints of BERT. (2) You feed the output
of BERT on each of the 512-token input segments into an
LSTM-based recurrent layer (I'm sure you'd also be able to use a
GRU). Subsequently, your classification is made on the basis of the
hidden state of the LSTM.

The authors of the paper described above called their overall
framework RoBERT, for “Recurrence over BERT.”
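
A hedged sketch of this “Recurrence over BERT” idea, with the chunking, pooling, and layer sizes simplified and chosen only for illustration:

import torch
import torch.nn as nn
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
lstm = nn.LSTM(input_size=768, hidden_size=256, batch_first=True)
classifier = nn.Linear(256, 2)

long_text = "some very long document " * 400                        ## stands in for a real document
ids = tokenizer(long_text, add_special_tokens=False)["input_ids"]
segments = [ids[i:i + 510] for i in range(0, len(ids), 510)]         ## leave room for [CLS] and [SEP]

cls_vectors = []
for seg in segments:
    seg_ids = [tokenizer.cls_token_id] + seg + [tokenizer.sep_token_id]
    with torch.no_grad():
        out = bert(input_ids=torch.tensor([seg_ids]))
    cls_vectors.append(out.last_hidden_state[:, 0, :])               ## one [CLS] vector per segment

sequence = torch.stack(cls_vectors, dim=1)                           ## [1, num_segments, 768]
_, (h_n, _) = lstm(sequence)
logits = classifier(h_n[-1])                                         ## classify from the LSTM's final hidden state
print(logits.shape)                                                  ## torch.Size([1, 2])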
Purdue University 46
Getting Around the 512-Token Limitation of BERT

RoBERT, ToBERT, and Sliding Windows (contd.)


As a variant of the method described on the previous slide, the
authors of the same paper also claim that they can achieve similar results
by feeding BERT's output for each of the 512-token input segments
into another Transformer. They referred to that variant as ToBERT,
for “Transformer over BERT”.
Another way of extending BERT for longer inputs (one that I believe
would only work for classification tasks) is to use what's known as the
“Sliding Window Approach”. A Google search with a string like
“sliding window for extending BERT” will throw up code fragments
that are based on this technique.
The methods presented above work (when they do) because: (1)
BERT gives you a very strong model of language continuity, which is
definitely the case at the sentence level since it is not unlikely for 512
tokens to include multiple sentences. And (2) the methods described
above are all applied in the supervised fine-tuning phase of BERT,
when you are dealing with small amounts of data.
Purdue University 47
GPT-2: The Architecture

Outline

1 Generative vs. Discriminative Training of a Neural Network 9

2 The Architecture of BERT and the Formatting of its Input 15

3 Pre-Training BERT 28

4 RoBERTa as a Higher-Performance BERT 32

5 Fine-Tuning BERT with Supervised Training 35

6 Getting Around the 512-Token Limitation of BERT 42

7 GPT-2: The Architecture 48

8 Unsupervised Learning in GPT-2 51

9 GPT-3: The Architecture 53

10 Unsupervised Learning in GPT-3 55

Purdue University 48
GPT-2: The Architecture

GPT-2: The Architecture


The main publication for GPT-2 is the paper “Language Models are
Unsupervised Multitask Learners” by Radford et al.:
https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf

There exist several variants of GPT-2, labeled Small, Medium,
Large, and Extra Large. The sizes of the embedding vectors are 768,
1024, 1280, and 1600, respectively. And the numbers of what I refer
to as the BasicDecoders are 12, 24, 36, and 48, respectively.
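
These numbers can be checked against the published configurations; a small sketch, assuming the transformers package and Hugging Face hub access:

from transformers import AutoConfig

for name in ["gpt2", "gpt2-medium", "gpt2-large", "gpt2-xl"]:
    cfg = AutoConfig.from_pretrained(name)
    print(name, cfg.n_embd, cfg.n_layer)
## Expected: embedding sizes 768/1024/1280/1600 and 12/24/36/48 decoder blocks.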
What’s interesting about the architecture of GPT-2 is that it
corresponds to the right-half of the Transformer-based
Encoder-Decoder architecture for English-to-Spanish translation that I
show on Slide 36 of my Week 14 slides.
As you’ll recall from my Week 14 lecture, the Decoder side of that
language translation framework works on autoregressive principles,
meaning that each new word emitted by the Decoder depends only on
the words that came before.
Purdue University 49
GPT-2: The Architecture

GPT-2: The Architecture (contd.)


That’s exactly the case with the GPT-2 architecture: its language
modeling is autoregressive in the sense that each word
produced depends only on the words that come before it. Note that
GPT-2's model does not need the Encoder side of the
architecture you see in Slide 36 of my Week 14 slides.
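
As a minimal, hedged illustration of that autoregressive behavior, here is a sketch that generates a short continuation with the small publicly available "gpt2" checkpoint, producing one token at a time conditioned only on what has been produced so far:

from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = tokenizer("The Transformer architecture", return_tensors="pt")
output_ids = model.generate(**prompt, max_new_tokens=20, do_sample=False,
                            pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output_ids[0]))     ## prompt plus a greedy 20-token continuation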

Purdue University 50
Unsupervised Learning in GPT-2

Outline

1 Generative vs. Discriminative Training of a Neural Network 9

2 The Architecture of BERT and the Formatting of its Input 15

3 Pre-Training BERT 28

4 RoBERTa as a Higher-Performance BERT 32

5 Fine-Tuning BERT with Supervised Training 35

6 Getting Around the 512-Token Limitation of BERT 42

7 GPT-2: The Architecture 48

8 Unsupervised Learning in GPT-2 51

9 GPT-3: The Architecture 53

10 Unsupervised Learning in GPT-3 55

Purdue University 51
Unsupervised Learning in GPT-2

GPT-2: Unsupervised Learning

still in the works

Purdue University 52
GPT-3: The Architecture

Outline

1 Generative vs. Discriminative Training of a Neural Network 9

2 The Architecture of BERT and the Formatting of its Input 15

3 Pre-Training BERT 28

4 RoBERTa as a Higher-Performance BERT 32

5 Fine-Tuning BERT with Supervised Training 35

6 Getting Around the 512-Token Limitation of BERT 42

7 GPT-2: The Architecture 48

8 Unsupervised Learning in GPT-2 51

9 GPT-3: The Architecture 53

10 Unsupervised Learning in GPT-3 55

Purdue University 53
GPT-3: The Architecture

GPT-3

The main publication for GPT-3 is the paper “Language Models are
Few-Shot Learners” by Brown et al.:
https://arxiv.org/pdf/2005.14165.pdf

still in the works

Purdue University 54
Unsupervised Learning in GPT-3

Outline

1 Generative vs. Discriminative Training of a Neural Network 9

2 The Architecture of BERT and the Formatting of its Input 15

3 Pre-Training BERT 28

4 RoBERTa as a Higher-Performance BERT 32

5 Fine-Tuning BERT with Supervised Training 35

6 Getting Around the 512-Token Limitation of BERT 42

7 GPT-2: The Architecture 48

8 Unsupervised Learning in GPT-2 51

9 GPT-3: The Architecture 53

10 Unsupervised Learning in GPT-3 55

Purdue University 55
Unsupervised Learning in GPT-3

GPT-3: Unsupervised Learning

still in the works

Purdue University 56
