LLM Learning
Language Models
Purdue University
Preamble

Large Language Modeling (LLM) made its debut in the 2018-2019 time frame: Google officially released its BERT model in October 2018, and OpenAI officially released the GPT-2 model in February 2019.
What is most distinctive about LLMs is that they can ingest terabytes of publicly
available textual datasets, learn from that data without any supervision, and
become experts in word-to-word, clause-to-clause, sentence-to-sentence, and
paragraph-to-paragraph continuity properties of narratives.
My goal in this lecture is to highlight some of the important aspects of LLMs, the
architectures of their neural networks that are based on Transformers, how they
carry out unsupervised learning, etc.
I’ll start by illustrating several of the LLM concepts through my explanations of
BERT. The main reason for that is that my acquaintanceship with BERT dates
back almost to the year it was born. The GPT models have entered my
consciousness rather recently.
BERT started out as an acronym for Bidirectional Encoder Representations from Transformers.
However, from the way the acronym BERT is now used in publications and in
general communications, you could say that BERT has become a noun unto
itself. BERT was first presented in the paper:
https://arxiv.org/pdf/1810.04805.pdf
Note that, in retrospect, the word “Bidirectional” in what BERT stands for is an
accident of history and now is probably a source of confusion for most new users
of BERT. As you know, there’s nothing fundamentally unidirectional about a
Transformer.
[Based on the explanation in my Week 14 lecture, what a Transformer does is to learn an attention map for the input. Ignoring the batch axis, if you feed an [N_w, M] tensor at its input, where N_w is the number of elements (words, patches, etc.) in the input sequence and M the size of the embedding vector representation of each element, the output of the Transformer will also be shaped [N_w, M], but with a difference. When you multiply the learned Q and K^T tensors for the final output, you will get an N_w × N_w array that is the Attention Map over the N_w elements of the input sequence. Each element of the Attention Map array indicates to what extent the element indexed i in the input attends to the element indexed j in the same input.]
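To make those shapes concrete, here is a minimal single-headed attention sketch in PyTorch (my own illustration, not code from DLStudio; it uses the standard Q = X·W_Q formulation with randomly initialized projections, and DLStudio's internal layout may differ):

import torch

Nw, M = 6, 8                        # number of input elements, embedding size
X = torch.randn(Nw, M)              # input sequence of embedding vectors, shape [Nw, M]

WQ = torch.randn(M, M)              # stand-ins for the learnable projection matrices
WK = torch.randn(M, M)
WV = torch.randn(M, M)

Q, K, V = X @ WQ, X @ WK, X @ WV    # each of shape [Nw, M]

attn = torch.softmax(Q @ K.T / M**0.5, dim=-1)    # Attention Map, shape [Nw, Nw]
out = attn @ V                                    # output sequence, shape [Nw, M]

print(attn.shape, out.shape)        # torch.Size([6, 6]) torch.Size([6, 8])

Row i of attn tells you to what extent element i of the input attends to every other element j.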
The word “Bidirectional” in BERT came about because it was Google's advance over OpenAI's first version of GPT, which was programmed to calculate self-attention in an autoregressive manner, by scanning a sentence left to right. In an Autoregressive Model, the dot-products that go into the calculations of the attention for each word depend only on the previous words in the input sentence.
Large Language Models are meant to help us cope with the following dilemma:
Solving a practical problem with complex neural-network architectures based on
Transformers requires a large amount of labeled training data. But creating
labeled data can be expensive.
You run into this dilemma particularly when you initialize the learnable
parameters with random weights, a common practice for the simpler
neural networks.
Over the years, researchers have posited that it should be possible to reduce the burden of supervised training for solving a specific task if we first initialize the learnable parameters with inexpensive unsupervised training. Researchers established the veracity of such claims with small-scale experiments.
But, now, through LLMs, we know that the above is true in general. We can
significantly reduce the extent of supervised learning needed if we first initialize
the learnable weights with inexpensive unsupervised learning on freely available
public datasets.
I have already mentioned the main publication for BERT. For GPT-2, the official
publication is
https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf
Here is a brief chronology of the more famous LLMs that have been released to
date:
[Timeline figure: GPT-2 officially released in Feb 2019, followed by the later LLM releases shown in the original figure]
Preamble – How to Learn from These Slides
At your first reading of these slides, just focus on thoroughly understanding the
following three concepts related to LLMs:
At the most fundamental level, you need to understand what’s meant by generative
training as opposed to discriminative training.
You also need to understand why tokenizers play a critical role in generative training. What would happen if we used no tokenizer at all and just used the words directly as the tokens?
Why is the fixed input token length of 512 in BERT possibly a good compromise between the amount of text you need for learning sentence-to-sentence continuity and what you need for computational efficiency in training with feasible hardware resources?
Outline
1 Generative vs. Discriminative Training of a Neural Network
2 The Architecture of BERT and the Formatting of its Input
3 Pre-Training BERT
4 RoBERTa as a Higher-Performance BERT
5 Fine-Tuning BERT with Supervised Training
6 Getting Around the 512-Token Limitation of BERT
7 GPT-2: The Architecture
8 Unsupervised Learning in GPT-2
9 GPT-3: The Architecture
10 Unsupervised Learning in GPT-3
Generative vs. Discriminative Training of a Neural Network
The Architecture of BERT and the Formatting of its Input
BERT Architecture
The basic architecture of BERT is exactly the same as that of the
Master Encoder shown on Slide 36 of my Week 14 slides. That slide
presented both the Master Encoder and the Master Decoder in a
Transformer-based machine translation framework. The figure on the
next slide is just the Master Encoder part of the Week 14 figure as
the neural architecture for BERT.
Figure: The BERT architecture, as shown above, is exactly the same as the Master Encoder part of the overall architecture on Slide 36 of my Week 14 slides.
As shown in the caption at the bottom of the figure in Slide 17, the input consists of a pair of token sequences, with each token sequence representing a span of contiguous text. The two spans at the input could represent, say, a (Question, Answer) pair, a (Sentence, Next Sentence) pair, etc.
Feeding just one token sequence at the input is important for one of the two modes for the generative training of BERT, the mask-based mode. More on that later.
The figure on the next slide shows the placement of the [CLS] token
for the imaginary case when the two-sentence input is limited to 10
tokens.
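Here is a minimal sketch (my own, not the lecture's script) that shows the same placement programmatically, by letting the Hugging Face tokenizer insert the [CLS] and [SEP] tokens for a two-sentence input:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

enc = tokenizer("The dog chased the ball.", "It rolled under the couch.")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
## expected to be along the lines of:
## ['[CLS]', 'the', 'dog', 'chased', 'the', 'ball', '.', '[SEP]',
##  'it', 'rolled', 'under', 'the', 'couch', '.', '[SEP]']
print(enc["token_type_ids"])        # 0s for the first span, 1s for the second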
And when you execute the code in this demo, it will further download
the files
vocab.txt
tokenizer.json
config.json
tokenizer_config.json
## (excerpt; the tokenizer object and the all_tokens list are created in earlier lines of the script)
tokens = tokenizer.tokenize("All work and no play makes Jack a dull boy!")                 ## (3)
print(tokens)
## ['all', 'work', 'and', 'no', 'play', 'makes', 'jack', 'a', 'dull', 'boy', '!']          ## (4)

print("\n\nThe longest ten entries in BERT tokenizer vocab: ", all_tokens[:10])            ## (10)
## ['telecommunications', 'interdisciplinary', 'telecommunication', 'responsibilities', 'autobiographical',
##  'intercontinental', 'entrepreneurship', 'unconstitutional', 'northamptonshire', 'characterization']
##                                                                                         ## (11)
print("\n\nThe shortest ten entries in BERT tokenizer vocab: ", all_tokens[-10:])          ## (12)
## ['!', '(', ')', ',', '-', '.', '/', ':', '?', '~']                                      ## (13)
As you can tell from the code starting in Line (14), the mappings from the tokens to the integer index values are stored in the JSON file tokenizer.json. We load the JSON file into our script in Line (16) and access the integer index values for the tokens in the nested dictionary stored at the key vocab inside the dictionary stored at the key model. Lines (16) and (17) will print out the integer index values for all 30,000+ tokens that BERT uses.
We invert the indexes in Line (18) and query the inverted index for the token associated with the integer 102.
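As an alternative to parsing tokenizer.json by hand, the same token-to-index mapping (and its inverse) is available through the tokenizer's vocab attribute. A small sketch of my own, not part of the lecture's script:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

vocab = tokenizer.vocab                             # dict: token -> integer index
inverted = {idx: tok for tok, idx in vocab.items()}

print(len(vocab))                                   # 30522
print(inverted[101])                                # [CLS]
print(inverted[102])                                # [SEP]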
Pre-Training BERT
The Generative Pre-Training of BERT is carried out with respect to
the following two tasks:
Masked Language Modeling (MLM)
Next Sentence Prediction (NSP)
Regarding MLM, during unsupervised training, you mask out a certain
percentage of the tokens at the input to the BERT network shown in
Slide 17. The tokens that are masked are selected at random.
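To make the masking step concrete, here is a toy sketch of my own (not the lecture's code) that replaces a random subset of the input tokens with the [MASK] token. The BERT paper masks roughly 15% of the tokens and adds further refinements (some selected positions get a random token or are left unchanged) that are omitted here for simplicity:

import random
random.seed(0)

tokens = ['[CLS]', 'all', 'work', 'and', 'no', 'play', 'makes',
          'jack', 'a', 'dull', 'boy', '!', '[SEP]']

mask_prob = 0.15
masked, targets = [], {}
for i, tok in enumerate(tokens):
    if tok not in ('[CLS]', '[SEP]') and random.random() < mask_prob:
        targets[i] = tok                 # the token the network must recover
        masked.append('[MASK]')
    else:
        masked.append(tok)

print(masked)                            # input presented to the network
print(targets)                           # positions and tokens to be predicted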
RoBERTa as a Higher-Performance BERT
From my perspective, an important difference between BERT and
RoBERTa is that the latter uses the BPE (Byte Pair Encoding)
algorithm for tokenization. It is the same tokenizer that is used for
the GPT models from OpenAI.
Here are two significant differences between BPE and the WordPiece
tokenizer used in BERT:
While, for the most part, WordPiece uses the ASCII characters as the basic units for merging in order to form the subwords, BPE uses the bytes directly. This allows BPE to be more general with respect to different languages. [This is not to imply that WordPiece does not recognize other languages. If you direct the output of the script I showed on Slides 26 and 27 into a text file and scroll through the file, you will see the basic symbols from practically all the languages in the 30,000+ vocabulary of the WordPiece tokenizer.]
BPE enlists the help of a pre-tokenizer to count the frequencies of all the words in the corpus. It subsequently uses those frequencies to merge bytes into subwords, a merge being chosen on the basis of how frequently the merged unit occurs vis-a-vis the frequencies of the units being merged, as illustrated by the toy sketch below.
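To give a feel for that merge step, here is a toy character-level sketch of my own (the actual GPT-2/RoBERTa tokenizer operates on bytes and on a vastly larger corpus):

from collections import Counter

## word (split into symbols) -> frequency, as supplied by a pre-tokenizer
corpus = {('l','o','w'): 5, ('l','o','w','e','r'): 2,
          ('n','e','w','e','s','t'): 6, ('w','i','d','e','s','t'): 3}

def most_frequent_pair(corpus):
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge(corpus, pair):
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i+1]) == pair:
                out.append(word[i] + word[i+1]); i += 2
            else:
                out.append(word[i]); i += 1
        merged[tuple(out)] = freq
    return merged

for _ in range(3):                       # three merge iterations
    pair = most_frequent_pair(corpus)
    corpus = merge(corpus, pair)
    print(pair, '->', list(corpus.keys()))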
Fine-Tuning BERT with Supervised Training
When you execute the script, you get the following answer:
CASE 1:
prediction loss: tensor(1.0133e-05, grad_fn=<NllLossBackward0>)
True Pair
CASE 2:
prediction loss: tensor(2.7418e-06, grad_fn=<NllLossBackward0>)
True Pair
## nsp1.py
About the code shown on the previous two slides, the statement in Line (3) will download the pre-trained BERT model as a roughly 440 MB safetensors archive.
safetensors is the new recommended way to store a model, especially one that is meant for distribution over the internet. In the past, the practice has been (and still continues to be) to store the model weights in a “.bin” file using Python's pickle module. But the Hugging Face folks who created safetensors say that it is relatively easy to embed malicious code in a pickled archive. For further information, here is the GitHub link where you can find out more about safetensors:
https://github.com/huggingface/safetensors
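As a quick illustration of the difference (a sketch of my own, not part of nsp1.py), the safetensors package stores and restores plain tensor data without going through pickle:

import torch
from safetensors.torch import save_file, load_file

weights = {"linear.weight": torch.randn(4, 4), "linear.bias": torch.zeros(4)}

torch.save(weights, "weights.bin")            # pickle-based; loading can execute arbitrary code
save_file(weights, "weights.safetensors")     # pure tensor data; nothing executable to load

restored = load_file("weights.safetensors")
print(restored["linear.weight"].shape)        # torch.Size([4, 4])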
Fine-Tuning BERT
Getting to the main subject of this section, according to the authors
of BERT, “compared to pre-training, fine-tuning is relatively
inexpensive. All of the results in the paper can be replicated in at
most 1 hour on a single Cloud TPU, or a few hours on a GPU,
starting from the exact same pre-trained model.”
At the same time, you would also create sentence pairs in which the second sentence is NOT the answer. The target would be the binary labels isAnswer and isNotAnswer. In this manner, the problem boils down to binary classification with the Binary Cross-Entropy Loss as provided by nn.BCELoss, along the lines of the sketch shown below.
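Here is a minimal sketch of what such a binary fine-tuning head could look like (my own code, not the lecture's nsp1.py; the sentence pair and the label are made up for illustration):

import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

head = nn.Sequential(nn.Linear(bert.config.hidden_size, 1), nn.Sigmoid())
loss_fn = nn.BCELoss()

enc = tokenizer("Who wrote Hamlet?", "Shakespeare wrote Hamlet.", return_tensors="pt")
cls_embedding = bert(**enc).last_hidden_state[:, 0, :]    # embedding of the [CLS] token

prob = head(cls_embedding).squeeze(-1)                    # probability of isAnswer
loss = loss_fn(prob, torch.tensor([1.0]))                 # 1.0 stands for isAnswer
loss.backward()                                           # gradients flow into BERT and the head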
Fine-tuning of the pre-trained BERT is even easier for a Sentiment Analysis application. Now the input to the BERT network would consist of (Sentence-A, ∅) pairs, and the prediction made through the embedding vector for the class token [CLS] would be either True or False depending on the ground-truth sentiment tag for the text in Sentence-A.
More generally, though, the ability of BERT to predict the next
sentence can be leveraged into several different ways of fine-tuning:
MNLI (Multi-Genre Natural Language Inference): This is a large-scale entailment
classification task. Given a pair of sentences, the “goal is to predict
whether the second sentence is an entailment, contradiction, or neutral
with respect to the first sentence.”
QQP (Quora Question Pairs): This is a binary classification task in which the goal is to
determine whether the two questions, as represented by Sentence-A and
Sentence-B, are semantically equivalent.
QNLI (Question Natural Language Inference): This is a version of the Stanford Question
Answering Dataset. In this case, in each sentence pair (Sentence-A,
Sentence-B), either Sentence-B contains the answer to the question in
Sentence-A or it does not.
Appendix B of the BERT paper lists several additional fine-tuning tasks.
Getting Around the 512-Token Limitation of BERT
However, BERT has one limitation: the 512-token limit at the input to the BERT network.
BERT's limit on the number of input tokens is not a problem if the goal is to do a good job of learning sentence-to-sentence continuations.
At this point, it's good to review why BERT was designed with the 512-token limit on the number of input tokens. To understand that, you have to first come to grips with the fact that EVERY Transformer-based implementation is designed for some expected maximum number of tokens at its input. As to why:
[Every Transformer-based implementation for any deep learning application is designed for a specific value of what I have labeled max_seq_length in my Transformer implementation in DLStudio. You can think of that as the expected maximum number of words in the input sequence. If N_w is the value for this maximum number of words, your attention map at the output of any of the BasicEncoders will be a 2D array of shape N_w × N_w. That makes sense because an Attention Map is supposed to tell to what extent each word at the input attends to every other word in the same input. From the standpoint of the calculations involved, the Attention Map is a result of the dot product Q · K^T of the query Q and the key K tensors. Ignoring the batch axis and also assuming single-headed attention for the sake of argument, at the output of each BasicEncoder, the Q tensor is calculated by multiplying the learnable matrix W_Q of shape [N_w, M] with the transpose of the data matrix that is actually at the output of the encoder, which is also of shape [N_w, M]. Since these matrices must be learnable, their sizes have to be set in advance. Therefore, the max value for N_w must be known in advance, as must the value for the size M of the embeddings.]
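You can see this design decision directly in the released model. A small check of my own, using the Hugging Face configuration object, where the 512-token limit shows up as the size of BERT's learned positional-embedding table:

from transformers import BertConfig

config = BertConfig.from_pretrained("bert-base-uncased")
print(config.max_position_embeddings)    # 512, the maximum number of input tokens
print(config.hidden_size)                # 768, the embedding size M
print(config.num_hidden_layers)          # 12 stacked encoder blocks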
https://arxiv.org/pdf/1910.10781.pdf
GPT-2: The Architecture
There exist several variants of GPT-2, labeled Small, Medium, Large, and Extra Large. The sizes of their embedding vectors are 768, 1024, 1280, and 1600, respectively, and the numbers of what I refer to as BasicDecoders are 12, 24, 36, and 48.
What’s interesting about the architecture of GPT-2 is that it
corresponds to the right-half of the Transformer-based
Encoder-Decoder architecture for English-to-Spanish translation that I
show on Slide 36 of my Week 14 slides.
As you'll recall from my Week 14 lecture, the Decoder side of that language translation framework works on autoregressive principles, meaning that each new word that is emitted by the Decoder depends only on the words that came before.
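Here is a minimal sketch (mine, not from DLStudio) of how that autoregressive constraint is typically enforced, namely with a causal mask applied to the raw attention scores so that position i can attend only to positions 0 through i:

import torch

Nw = 5
scores = torch.randn(Nw, Nw)                          # raw Q.K^T attention scores
causal_mask = torch.tril(torch.ones(Nw, Nw)).bool()   # lower-triangular mask

scores = scores.masked_fill(~causal_mask, float("-inf"))
attn = torch.softmax(scores, dim=-1)                  # upper triangle becomes exactly 0

print(attn)    # row i has nonzero weights only for columns 0 .. i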
Unsupervised Learning in GPT-2
This section is still in the works.
GPT-3: The Architecture
GPT-3
The main publication for GPT-3 is the paper “Language Models are Few-Shot Learners” by Brown et al.:
https://arxiv.org/pdf/2005.14165.pdf
The rest of this section is still in the works.
Unsupervised Learning in GPT-3
This section is still in the works.