Lecture 03 - Introduction To LLMs
Conversation AI
BITS Pilani
Pilani Campus
(S1-24_AIMLCZG521)
Session Content
I. Introduction to Language Models
1. What are LLMs?
2. Tokenization:
• Byte-Pair Encoding (BPE) for GPT
• WordPiece for BERT
3. Autoregressive Models
II. Transformer Architecture
1. Motivation
2. Self-Attention Mechanism
3. Multi-Head Attention
4. Positional Encoding
5. Encoder-Decoder Structure
6. Layer Normalization and Residual Connections
Evolution:
• Moved beyond simpler models like n-grams,
which struggled with sparse data and
generalization (Huang et al., 2018).
• Modern LLMs leverage advanced architectures
(e.g., Transformers) for improved performance.
"This is an example.“ => Initial tokens: ["Th", "is", " ", "i", "s", " ", "a", "n", " ", "e", "x", "a", "m", "p", "l", "e", "."]
=> Frequent pairs ("is", " ")
Reference: Neural Machine Translation of Rare Words with Subword Units (Sennrich et al., 2016)
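A minimal sketch of the BPE merge loop in Python (illustrative only; the toy corpus and helper names are assumptions, not the GPT tokenizer's actual implementation):

```python
from collections import Counter

# Toy corpus: each word is a tuple of symbols, mapped to its frequency.
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
          ("n", "e", "w", "e", "s", "t"): 6, ("w", "i", "d", "e", "s", "t"): 3}

def most_frequent_pair(corpus):
    """Count adjacent symbol pairs across the corpus and return the most frequent one."""
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with the merged symbol."""
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Repeat until the desired vocabulary size is reached (here: 5 merges).
for _ in range(5):
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)
    print("merged", pair)
```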
• Merge Pairs: Merge pairs based on their scores, prioritizing pairs where the individual parts are less frequent.
• Repeat: Continue the process until the desired vocabulary size is reached.
WordPiece's use of a likelihood-based scoring method distinguishes it from BPE's simpler frequency-based
approach, often resulting in a vocabulary that better captures the linguistic structure of the training data.
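A small sketch contrasting the two scoring rules, assuming the commonly described WordPiece score freq(pair) / (freq(first) × freq(second)); the function name and toy corpus are illustrative assumptions:

```python
from collections import Counter

def pair_scores(corpus, likelihood_based=True):
    """Score adjacent symbol pairs over a {word-as-tuple: frequency} corpus.
    BPE: score = freq(pair).
    WordPiece (as commonly described): score = freq(pair) / (freq(first) * freq(second)),
    which favors pairs whose individual parts are less frequent."""
    pair_freq, sym_freq = Counter(), Counter()
    for word, freq in corpus.items():
        for s in word:
            sym_freq[s] += freq
        for a, b in zip(word, word[1:]):
            pair_freq[(a, b)] += freq
    if not likelihood_based:
        return dict(pair_freq)                      # BPE-style: raw pair frequency
    return {p: f / (sym_freq[p[0]] * sym_freq[p[1]]) for p, f in pair_freq.items()}

corpus = {("t", "h", "e"): 50, ("t", "h", "i", "s"): 10, ("q", "u", "i", "z"): 2}  # toy word counts
bpe_scores, wp_scores = pair_scores(corpus, False), pair_scores(corpus)
print(max(bpe_scores, key=bpe_scores.get))   # ('t', 'h') -- most frequent pair wins under BPE
print(max(wp_scores, key=wp_scores.get))     # ('q', 'u') -- rare parts win under the likelihood score
```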
fastText subword embeddings:
• Represent a word by the sum of the vector representations of its n-grams (n-gram embeddings).
• fastText allows sharing subword representations across words, since words are represented by the aggregation of their n-grams (see the sketch below).
• [Figure: tri-gram extraction for the target word "cat" and its co-occurring words ("with", "meat") in a local context window; the summed n-gram embeddings feed the Word2Vec probability expression, and n-gram representations are shared across words]
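A minimal sketch of this aggregation (the boundary-marker tri-gram extraction, embedding size, and lazily created lookup table are illustrative assumptions, not fastText's actual implementation):

```python
import numpy as np

def char_ngrams(word, n=3):
    """Character tri-grams with boundary markers '<' and '>', plus the whole word itself."""
    w = f"<{word}>"
    return [w[i:i + n] for i in range(len(w) - n + 1)] + [w]

rng = np.random.default_rng(0)
dim = 8                                    # toy embedding size (assumption)
ngram_emb = {}                             # n-gram -> vector lookup table (toy)

def word_vector(word):
    """fastText-style word vector: sum of the vectors of the word's n-grams."""
    vec = np.zeros(dim)
    for g in char_ngrams(word):
        if g not in ngram_emb:             # lazily create a toy embedding for each n-gram
            ngram_emb[g] = rng.normal(size=dim)
        vec += ngram_emb[g]
    return vec

# "cat" and "cats" share n-grams such as "<ca" and "cat",
# so their vectors share components even when one of the words is rare.
v_cat, v_cats = word_vector("cat"), word_vector("cats")
```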
Reference: Attention is all you need. (Vaswani et al., 2017) & https://erdem.pl/2021/05/understanding-positional-encoding-in-transformers
Self-Attention
• To calculate the attention weight from a query word w_q (e.g., "rabbit") to another word w_k, the query vector of w_q is compared with the key vector of w_k using a dot product (see the sketch below).
• Each word is represented as a query, key, and value vector. These vectors are obtained by multiplying the input embedding by a learned weight matrix (one matrix each for queries, keys, and values).
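A minimal NumPy sketch of single-head self-attention under these definitions (the toy sizes and random weight matrices are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 4, 16, 8          # toy sizes (assumptions)

X = rng.normal(size=(n_tokens, d_model))   # input embeddings, one row per word
W_Q = rng.normal(size=(d_model, d_k))      # learned projection matrices (random here)
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V        # query / key / value vectors, one per word

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = Q @ K.T / np.sqrt(d_k)            # scaled dot products, shape (n_tokens, n_tokens)
weights = softmax(scores)                  # row i = attention weights from query word i to every word
output = weights @ V                       # each word's output is a weighted sum of value vectors
```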
Multi-Head Attention:
• [Figure: the outputs of the individual attention heads are concatenated and passed through a final linear projection]
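A compact, self-contained sketch of the per-head computation and the concatenation step (head count, sizes, and random weights are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_k, n_heads = 4, 16, 8, 2    # toy sizes (assumptions)
X = rng.normal(size=(n_tokens, d_model))

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

head_outputs = []
for _ in range(n_heads):                         # each head has its own projections
    W_Q = rng.normal(size=(d_model, d_k))
    W_K = rng.normal(size=(d_model, d_k))
    W_V = rng.normal(size=(d_model, d_k))
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    head_outputs.append(softmax(Q @ K.T / np.sqrt(d_k)) @ V)

concat = np.concatenate(head_outputs, axis=-1)   # concatenation: (n_tokens, n_heads * d_k)
W_O = rng.normal(size=(n_heads * d_k, d_model))  # final linear projection
multi_head_out = concat @ W_O                    # back to (n_tokens, d_model)
```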
Scaling:
• Scaling divides the dot products by √dk, which helps manage large values when the key
dimension dk is large (a numeric check follows this list).
• Scaling keeps values in a manageable range,
stabilizing the softmax function.
• Maintains balance in gradient values during
backpropagation, enhancing training efficiency.
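A small numeric check of this effect (purely illustrative; the dimensions and random inputs are assumptions): for random unit-variance vectors, the dot product has standard deviation roughly √dk, so dividing by √dk keeps the scores at unit scale before the softmax.

```python
import numpy as np

rng = np.random.default_rng(0)
for d_k in (8, 64, 512):
    q = rng.normal(size=(10_000, d_k))             # random query vectors, unit variance per component
    k = rng.normal(size=(10_000, d_k))             # random key vectors
    dots = (q * k).sum(axis=1)                     # raw dot products
    print(d_k, dots.std(), (dots / np.sqrt(d_k)).std())
    # raw std grows like sqrt(d_k); the scaled version stays near 1,
    # keeping the softmax out of its saturated (vanishing-gradient) region
```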
Layer Normalization and Residual Connections:
• Faster Convergence: improved gradient flow through the residual paths lets deep models train more quickly.
• Stabilizing Networks: layer normalization and residual connections stabilize training, leading to more reliable convergence, especially for deep architectures. This results in better model performance and higher accuracy on tasks (see the sketch below).
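A minimal sketch of how the two are combined around a Transformer sub-layer (post-norm arrangement, out = LayerNorm(x + SubLayer(x)); the feed-forward sub-layer, sizes, and random weights are illustrative assumptions):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token's features to zero mean, unit variance (Ba et al., 2016).
    The learned gain and bias parameters are omitted here for brevity."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise feed-forward sub-layer: Linear -> ReLU -> Linear."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
n_tokens, d_model, d_ff = 4, 16, 32                # toy sizes (assumptions)
x = rng.normal(size=(n_tokens, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

# Residual connection + layer normalization around the sub-layer:
out = layer_norm(x + feed_forward(x, W1, b1, W2, b2))
```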
Reference: Layer Normalization (Ba et al., 2016) & Deep Residual Learning for Image Recognition (He et al., 2015)
Thank you