BERT from scratch
Umar Jamil
Downloaded from: https://github.com/hkproj/bert-from-scratch
License: Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0):
https://creativecommons.org/licenses/by-nc/4.0/legalcode
We usually train a neural network to predict these probabilities. A neural network trained on a large corpus of text is known as a Large Language Model (LLM).
[Diagram: a prompt is fed to a neural network (Transformer Encoder), which produces an output sequence of 10 tokens TK1 … TK10.]
[Diagram: the neural network (Transformer Encoder) maps each input token to an embedding, a vector of size 512; the numeric values shown on the slide are illustrative.]
We define d_model = 512, which represents the size of the embedding vector of each word.
Source: Speech and Language Processing 3rd Edition Draft, Dan Jurafsky and James H. Martin
We commonly use the cosine similarity, which is based on the dot product between the two
vectors.
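As a minimal sketch of this idea (my own addition, not code from the slides' repository, assuming PyTorch), the cosine similarity between two embedding vectors is their dot product divided by the product of their norms; the vectors below are random placeholders rather than real word embeddings.

```python
import torch

d_model = 512  # size of each embedding vector, as defined above

# Two hypothetical word embeddings (random stand-ins, for illustration only).
cat = torch.randn(d_model)
dog = torch.randn(d_model)

# Cosine similarity = dot product divided by the product of the norms.
cos_sim = torch.dot(cat, dog) / (cat.norm() * dog.norm())
print(cos_sim.item())  # a value in [-1, 1]; higher means the vectors point in more similar directions
```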
[Diagram: the "Encoder Input" matrix, one 512-dimensional row per token; the numeric values shown on the slide are illustrative.]
The positional encodings are defined as:

$PE(pos, 2i) = \sin\!\left(\dfrac{pos}{10000^{2i/d_{model}}}\right)$
$PE(pos, 2i+1) = \cos\!\left(\dfrac{pos}{10000^{2i/d_{model}}}\right)$

[Diagram: the values PE(0, 0), PE(1, 0), PE(2, 0), … and PE(0, 1), PE(1, 1), PE(2, 1), … are added to the embeddings of each sentence, e.g. Sentence 2: "I LOVE YOU".]
[Figure: the positional encoding values for the tokens "Before my bed lies a pool of moon bright"; the same block of values is repeated identically three times, illustrating that the positional encodings depend only on the position, not on the sentence content.]
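A short sketch of the sinusoidal positional encoding defined by the formulas above (my own addition, assuming PyTorch; not the repository's implementation). It builds one (sequence_length, d_model) matrix that is reused for every sentence, which is why the values in the figure repeat identically.

```python
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # the even indices 2i
    angle = pos / (10000 ** (i / d_model))                          # (seq_len, d_model / 2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)   # even dimensions use sin
    pe[:, 1::2] = torch.cos(angle)   # odd dimensions use cos
    return pe

pe = sinusoidal_positional_encoding(seq_len=10, d_model=512)
print(pe.shape)  # torch.Size([10, 512])
```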
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\dfrac{QK^{T}}{\sqrt{d_k}}\right)V$

[Figure: Q of shape (10, 512) multiplied by K^T of shape (512, 10), after scaling and softmax, gives a (10, 10) matrix of attention scores; its rows and columns are labelled with the tokens [SOS] Before my bed lies a pool of moon bright.]
The self-attention mechanism: the reason behind the causal mask
A language model is a probabilistic model that assigns probabilities to sequences of words.
In practice, a language model allows us to compute the probability of the next token given all the tokens that precede it.
To model this probability distribution, each word should only depend on the words that come before it (the left context).
We will see later that BERT makes use of both the left and the right context.
[Figure: on the left, the raw scores $\frac{QK^{T}}{\sqrt{d_k}}$ for the tokens [SOS] Before my bed lies a pool of moon bright, with every position above the main diagonal set to -∞ (the causal mask); on the right, $\mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)$, where the masked positions become exactly 0, so each token can only attend to itself and to the tokens on its left.]
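To see concretely why the mask uses -∞ rather than 0, here is a tiny PyTorch check (my own, not from the slides): after softmax, masked positions contribute exactly zero weight. The score values are taken from the "lies" row of the masked matrix described above.

```python
import torch

# One row of scaled scores; the positions to the right of the current token are set to -inf.
scores = torch.tensor([5.43, 7.59, 3.91, 6.14, 9.03] + [float("-inf")] * 5)
weights = torch.softmax(scores, dim=-1)
print(weights)  # roughly [0.02, 0.18, 0.00, 0.04, 0.75, 0, 0, 0, 0, 0]; masked entries are exactly 0
```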
[Figure: the masked $\mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)$ matrix of shape (10, 10) is multiplied by V of shape (10, 512) to give the "Attention Output" of shape (10, 512).]

Each row of the "Attention Output" matrix represents the embedding of a token of the output sequence: it captures not only the meaning of the token and its position, but also its interaction with the other tokens, limited to the interactions whose softmax score is not zero. All 512 dimensions of each output vector depend only on the attention scores that are non-zero.
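A compact sketch of the computation just described (my own, in PyTorch, not the repository's code): scaled dot-product attention with an optional causal mask. The shapes follow the (10, 512) example; the function name is illustrative.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, causal: bool = False):
    """softmax(Q K^T / sqrt(d_k)) V, optionally with a causal mask hiding the right context.

    q, k, v: tensors of shape (seq_len, d_model); returns a tensor of shape (seq_len, d_model).
    """
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)        # (seq_len, seq_len)
    if causal:
        seq_len = scores.size(-1)
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))     # future positions get -inf
    weights = torch.softmax(scores, dim=-1)                  # each row sums to 1
    return weights @ v                                        # the "Attention Output", one row per token

x = torch.randn(10, 512)                                      # 10 tokens, d_model = 512
out = scaled_dot_product_attention(x, x, x, causal=True)      # BERT would use causal=False (no mask)
print(out.shape)                                              # torch.Size([10, 512])
```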
Outline
• Language Models
• Training
• Inference
• Transformer architecture (Encoder)
• Embedding vectors
• Positional encoding
• Self attention and causal mask
• BERT
• The importance of the left and the right context
• BERT pre-training
• Masked Language Model task
• Next Sentence Prediction task
• BERT fine-tuning
• Text Classification Task
• Question Answering Task
• BERT_BASE
  • 12 encoder layers
  • The hidden size of the feed-forward layer is 3072
  • 12 attention heads
• BERT_LARGE
  • 24 encoder layers
  • The hidden size of the feed-forward layer is 4096
  • 16 attention heads
The output layer depends on the specific task (see the configuration sketch below).
Uses the WordPiece tokenizer, which also allows sub-word tokens. The vocabulary size is ~30,000 tokens.
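For reference, the two configurations could be expressed as follows with the Hugging Face transformers library. This is only an assumption for illustration (the slides build BERT from scratch and do not use this library); the hidden sizes of 768 and 1024 come from the original BERT paper, and vocab_size=30522 is the standard bert-base-uncased WordPiece vocabulary.

```python
from transformers import BertConfig

# BERT-Base: 12 layers, hidden size 768, feed-forward size 3072, 12 heads, ~30k WordPiece vocabulary.
bert_base = BertConfig(
    vocab_size=30522,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
)

# BERT-Large: 24 layers, hidden size 1024, feed-forward size 4096, 16 heads.
bert_large = BertConfig(
    vocab_size=30522,
    hidden_size=1024,
    num_hidden_layers=24,
    num_attention_heads=16,
    intermediate_size=4096,
)
```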
Question Answering in GPT/LLaMA (prompt engineering) vs. Question Answering in BERT (fine-tuning)
[Diagram: Pre-Trained BERT, then Fine-Tune on QA.]
User: Hello! My internet line is not working, could you send a technician?
Operator: Hello! Let me check. Meanwhile, can you try restarting your WiFi router?
User: I have already restarted it but looks like the red light is not going away.
Operator: All right. I’ll send someone.
Rome is the capital of Italy, which is why it hosts many government buildings.
Randomly select one or more tokens and replace them with the
special token [MASK]
Rome is the [MASK] of Italy, which is why it hosts many government buildings.
The model must predict the original token at the masked position: capital
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\dfrac{QK^{T}}{\sqrt{d_k}}\right)V$

[Figure: the same self-attention computation, now with d_model = 768: Q of shape (10, 768) multiplied by K^T of shape (768, 10) gives a (10, 10) matrix of attention scores for the tokens [SOS] Before my bed lies a pool of moon bright; no causal mask is applied, so scores above the diagonal are not zeroed out.]
Masked Language Model (MLM): details
Rome is the capital of Italy, which is why it hosts many government buildings.
The pre-training procedure selects 15% of the tokens from the sentence to be masked (a minimal sketch of this rule follows after the input/output example below).
When a token is selected to be masked (suppose the word “capital” is selected):
• 80% of the time it is replaced with the [MASK] token → Rome is the [MASK] of Italy, which is why it hosts many government buildings.
• 10% of the time it is replaced with a random token → Rome is the zebra of Italy, which is why it hosts many government buildings.
• 10% of the time it is not replaced → Rome is the capital of Italy, which is why it hosts many government buildings.
Output (14 tokens): TK1 TK2 TK3 TK4 TK5 TK6 TK7 TK8 TK9 TK10 TK11 TK12 TK13 TK14
Input (14 tokens): Rome is the [MASK] of Italy, which is why it hosts many government buildings.
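Here is a simplified sketch of the 80/10/10 masking rule above (my own, not the repository's implementation). MASK_ID, VOCAB_SIZE and the -100 "ignore" label are hypothetical placeholders for whatever the real pipeline uses.

```python
import random

MASK_ID = 103       # hypothetical id of the [MASK] token
VOCAB_SIZE = 30000  # hypothetical vocabulary size

def mask_tokens(token_ids, mask_prob=0.15):
    """Return (input_ids, labels): labels keep the original token only at masked positions."""
    input_ids, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:           # select ~15% of the tokens
            labels[i] = tok                        # the model must predict the original token here
            r = random.random()
            if r < 0.8:                            # 80%: replace with [MASK]
                input_ids[i] = MASK_ID
            elif r < 0.9:                          # 10%: replace with a random token
                input_ids[i] = random.randrange(VOCAB_SIZE)
            # remaining 10%: keep the original token unchanged
    return input_ids, labels
```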
Next Sentence Prediction (NSP):
• 50% of the time, we select the actual next sentence (label: IsNext).
• 50% of the time, we select a random sentence from the text (label: NotNext).
Sentence A: Before my bed lies a pool of moon bright
Sentence B: I look up and see the bright shining moon
(The full poem: "Before my bed lies a pool of moon bright / I could imagine that it's frost on the ground / I look up and see the bright shining moon / Bowing my head I am thinking of home".)
Input (20 tokens): [CLS] Before my bed lies a pool of moon bright [SEP] I look up and see the bright shining moon
(Sentence A followed by Sentence B; a sketch of the pair construction follows below.)
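A rough sketch (not from the slides) of how one NSP training pair could be assembled from a document split into sentences; the 50/50 choice and the IsNext/NotNext labels follow the rule above.

```python
import random

def make_nsp_example(sentences, index):
    """Given a document as a list of sentences, build one (text, label) NSP pair."""
    sentence_a = sentences[index]
    if random.random() < 0.5 and index + 1 < len(sentences):
        sentence_b, label = sentences[index + 1], "IsNext"       # the actual next sentence
    else:
        # A random sentence from the text (a real pipeline would avoid picking the true next sentence).
        sentence_b, label = random.choice(sentences), "NotNext"
    text = f"[CLS] {sentence_a} [SEP] {sentence_b}"
    return text, label

poem = [
    "Before my bed lies a pool of moon bright",
    "I could imagine that it's frost on the ground",
    "I look up and see the bright shining moon",
    "Bowing my head I am thinking of home",
]
print(make_nsp_example(poem, 0))
```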
[CLS] token in BERT
The [CLS] token always interacts with all the other tokens, as we do not use any mask. So, we can consider the [CLS] token as a token that "captures" the information from all the other tokens.
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\dfrac{QK^{T}}{\sqrt{d_k}}\right)V$

[Figure: the (10, 10) attention-score matrix for the tokens [CLS] Before my bed lies a pool of moon bright, computed from Q of shape (10, 768) and K^T of shape (768, 10); since no mask is applied, the [CLS] token can attend to every other token.]
[CLS] token: output sequence
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\dfrac{QK^{T}}{\sqrt{d_k}}\right)V$

[Figure: the (10, 10) attention-score matrix is multiplied by V of shape (10, 768) to give the "Attention Output" of shape (10, 768).]

Each row of the "Attention Output" matrix represents the embedding of a token of the output sequence: it captures not only the meaning of the token and its position, but also its interaction with all the other tokens, limited to the interactions whose softmax score is not zero. All 768 dimensions of each output vector depend only on the attention scores that are non-zero.
Outline
• Language Models
• Training
• Inference
• Transformer architecture (Encoder)
• Embedding vectors
• Positional encoding
• Self attention and causal mask
• BERT
• The importance of the left and the right context
• BERT pre-training
• Masked Language Model task
• Next Sentence Prediction task
• BERT fine-tuning
• Text Classification Task
• Question Answering Task
Text Classification: fine-tuning
[Diagram: Pre-Trained BERT, fine-tuned on text classification.]
Example customer requests to classify:
• My router’s led is not working, I tried changing the power socket but still nothing.
• My router’s web page doesn’t allow me to change password anymore… I tried restarting it but nothing.
• In this month’s bill I have been charged 100$ instead of the usual 60$, why is that?
Selected example: My router’s led is not working, I tried changing the power socket but still nothing.
Input (16 tokens): [CLS] My router’s led is not working, I tried changing the power socket but still nothing.
(Sentence A; a sketch of the classification head follows below.)
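A hedged sketch of the fine-tuning setup described above: a linear classification head placed on top of the [CLS] output of the pre-trained encoder. The encoder here is a stand-in module with an assumed signature, hidden_size=768 assumes BERT_BASE, and num_classes=3 matches the three example requests; none of this is the repository's actual code.

```python
import torch
import torch.nn as nn

class BertForTextClassification(nn.Module):
    """Pre-trained encoder plus a linear head on the [CLS] position (illustrative only)."""

    def __init__(self, encoder: nn.Module, hidden_size: int = 768, num_classes: int = 3):
        super().__init__()
        self.encoder = encoder                        # pre-trained BERT encoder (stand-in module)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, input_ids, attention_mask=None):
        # Assumed encoder output shape: (batch, seq_len, hidden_size)
        hidden_states = self.encoder(input_ids, attention_mask)
        cls_embedding = hidden_states[:, 0, :]        # position 0 is the [CLS] token
        return self.classifier(cls_embedding)         # logits over the num_classes categories
```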
Question Answering: start and end positions
Target (1 token): start=TK10, end=TK10
Input (27 tokens): [CLS] What is the fashion capital of China? [SEP] Shanghai is a city in China, it is also a financial center, its fashion capital and industrial city.
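Under the same assumptions as the classification sketch above (a stand-in encoder, hidden_size=768), a minimal sketch of the QA fine-tuning head: for every token it predicts a start logit and an end logit, and the answer span is the text between the argmax start and argmax end positions within the context segment.

```python
import torch
import torch.nn as nn

class BertForSpanQA(nn.Module):
    """Illustrative QA head: one start logit and one end logit per token (not the repo's code)."""

    def __init__(self, encoder: nn.Module, hidden_size: int = 768):
        super().__init__()
        self.encoder = encoder
        self.qa_head = nn.Linear(hidden_size, 2)      # 2 outputs per token: (start, end)

    def forward(self, input_ids, attention_mask=None):
        hidden_states = self.encoder(input_ids, attention_mask)   # (batch, seq_len, hidden_size)
        logits = self.qa_head(hidden_states)                      # (batch, seq_len, 2)
        start_logits, end_logits = logits.unbind(dim=-1)          # two (batch, seq_len) tensors
        return start_logits, end_logits

# At inference time the predicted span runs from argmax(start_logits) to argmax(end_logits),
# restricted to the context part of the input (the tokens after the [SEP]).
```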