Lecture 03 - Introduction To LLMs

Module 3 of the Conversation AI course at BITS Pilani introduces Large Language Models (LLMs) and their underlying technologies, including tokenization methods like Byte-Pair Encoding and WordPiece, as well as the Transformer architecture. It highlights the advantages of LLMs over traditional chatbots in understanding language, context, and user intent, enabling more nuanced interactions. The module also covers the mechanics of autoregressive models, attention mechanisms, and the importance of positional encoding in understanding word order.


Module 3: Introduction to LLMs

Conversation AI
BITS Pilani
Pilani Campus
(S1-24_AIMLCZG521)
Session Content
I. Introduction to Language Models
1. What are LLMs?
2. Tokenization:
• Byte-Pair Encoding (BPE) for GPT
• WordPiece for BERT
3. Autoregressive Models
II. Transformer Architecture
1. Motivation
2. Self-Attention Mechanism
3. Multi-Head Attention
4. Positional Encoding
5. Encoder-Decoder Structure
6. Layer Normalization and Residual Connections



Large Language Models (LLMs)
LLMs are AI models trained on massive text datasets
to understand and generate human language.
They work by predicting the probability of the next
word in a sequence.

Evolution:
• Moved beyond simpler models like n-grams,
which struggled with sparse data and
generalization (Huang et al., 2018).
• Modern LLMs leverage advanced architectures
(e.g., Transformers) for improved performance.



Advantages of LLM-powered Conversational AI
• Language Understanding
  ➢ Traditional chatbots: Basic understanding based on keywords and pattern matching. Example: User: "I wanna pay my bill." Chatbot: Matches "pay bill" to a predefined trigger.
  ➢ LLM-powered chatbots: Deep understanding of semantics, context, and nuances in language. Example: User: "I'm kinda in a rush, can I quickly settle my balance?" Chatbot: Understands the urgency and the desire to pay the bill.
• Intent Recognition
  ➢ Traditional chatbots: Rule-based, keyword matching, or ML classifiers on limited labeled data. Example: If the user says, "check balance," the chatbot matches it to the "check balance" intent based on keywords.
  ➢ LLM-powered chatbots: Contextual understanding using LLMs; robust even with variations. Example: User: "What do I owe?" or "How much is left on my account?" Chatbot identifies the intent as "check balance."
• Slot Filling
  ➢ Traditional chatbots: Predefined slots with rule-based or separate NER model extraction. Example: For "book flight," the chatbot extracts "destination" and "date" from "flight to London tomorrow" using rules.
  ➢ LLM-powered chatbots: Seamless extraction by the LLM; handles complex and nuanced entities. Example: User: "I need a flight for two to New York, sometime next week in the evening." The LLM extracts destination, number of passengers, and flexible date/time.
• Knowledge Intensive (Database / Internet Search)
  ➢ Traditional chatbots: Require predefined queries and structured data access, or limited external search integration. Example: For "check balance," the chatbot uses a fixed query: SELECT balance FROM accounts WHERE user_id = [user_id].
  ➢ LLM-powered chatbots: Can generate queries, interpret results, and leverage broader information sources (database and internet) based on user requests. Example: User: "What are some good restaurants near the Eiffel Tower?" Chatbot searches the internet and summarizes top recommendations.
• Translation
  ➢ Traditional chatbots: Require integration with separate translation services. Example: User: "Hola" (Spanish). Chatbot sends it to a translation API to get "Hello" and then matches the intent.
  ➢ LLM-powered chatbots: Can perform translations directly and understand multiple languages. Example: User: "Bonjour, je veux réserver un billet." (French: "Hello, I want to book a ticket.") Chatbot understands the intent directly in French and proceeds to book a ticket.
• User Profiling & Personalization
  ➢ Traditional chatbots: Store user information and preferences in a database. Example: The chatbot knows the user's name and preferred payment method from their profile.
  ➢ LLM-powered chatbots: Dynamic adaptation of responses based on user history and inferred preferences. Example: If a user frequently asks about vegetarian recipes, the chatbot might suggest vegetarian options without explicit prompting.
• Context Management
  ➢ Traditional chatbots: Rely on session variables or limited context windows. Example: The chatbot stores city = "London" in a session variable and uses it if the user asks, "What's the weather there?"
  ➢ LLM-powered chatbots: Enhanced context understanding through the LLM's attention mechanisms. Example: User: "I want to book a flight." Chatbot: "Where to?" User: "Actually, make it a train." Chatbot understands "it" refers to the travel booking and adapts accordingly.
• Dialog Management
  ➢ Traditional chatbots: State machines or rule-based systems defining the conversation flow. Example: After "check balance," the chatbot moves to the next state: "Do you want to make a payment?"
  ➢ LLM-powered chatbots: Dynamic generation; context-aware and adaptable to user input. Example: User: "I'm not sure which plan is right for me." Chatbot: Asks clarifying questions about needs and suggests options, adapting the flow based on user responses.
• Sentiment Analysis
  ➢ Traditional chatbots: Use rule-based or basic ML models for sentiment detection. Example: If the user uses words like "angry" or "frustrated," the chatbot detects negative sentiment.
  ➢ LLM-powered chatbots: More nuanced sentiment analysis, including sarcasm and subtle emotional cues. Example: User: "Great, another error..." Chatbot detects the sarcastic tone and understands the user is expressing frustration, not positivity.
• Error Handling / Fallback Strategies
  ➢ Traditional chatbots: Pre-defined fallback responses or prompts for clarification. Example: If the chatbot doesn't understand, it might say: "I'm sorry, I didn't understand. Can you rephrase?"
  ➢ LLM-powered chatbots: Contextually relevant fallback responses; attempts rephrasing or escalation. Example: User: "I need a flibbertigibbet." Chatbot: "I'm not familiar with 'flibbertigibbet.' Were you perhaps looking for a specific product or service?"
• Analytics and Performance Monitoring
  ➢ Traditional chatbots: Track key metrics like user engagement and conversation duration. Example: Records how many users complete a booking or how long a conversation lasts.
  ➢ LLM-powered chatbots: Deeper conversational data analysis, pattern identification, and optimization insights. Example: Identifies common points of confusion in conversations or suggests improvements to dialogue flow based on analyzing user interactions.



Language Models
• The classic definition of a language model (LM) is a probability distribution over token sequences w_1, w_2, ..., w_n: it tells us how good (fluent and plausible) or bad a given sequence is.
• Sally fed my cat with meat: P(Sally, fed, my, cat, with, meat) = 0.03
• My cat fed Sally with meat: P(My, cat, fed, Sally, with, meat) = 0.005
• fed cat meat my my with: P(fed, cat, meat, my, my, with) = 0.0001
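To make this concrete, here is a minimal sketch of how a bigram language model would assign probabilities to token sequences. The probability table is invented for illustration, not estimated from a corpus:

# Minimal bigram LM sketch: P(w1..wn) ≈ P(w1|<s>) * Π P(wi | wi-1).
# The probabilities below are made up for illustration only.
bigram_prob = {
    ("<s>", "Sally"): 0.2, ("Sally", "fed"): 0.3, ("fed", "my"): 0.25,
    ("my", "cat"): 0.4, ("cat", "with"): 0.2, ("with", "meat"): 0.3,
}

def sequence_probability(tokens, unseen=1e-4):
    """Score a sentence with the bigram table; unseen pairs get a small back-off value."""
    prob = 1.0
    for prev, word in zip(["<s>"] + tokens, tokens):
        prob *= bigram_prob.get((prev, word), unseen)
    return prob

print(sequence_probability("Sally fed my cat with meat".split()))   # relatively high
print(sequence_probability("fed cat meat my my with".split()))      # much lower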



Tokenization
Other tokenization methods:
• Whitespace Tokenization
• Phrase Tokenization
• Word Tokenization
• Punctuation-Based Tokenization
• MWET (Multi-Word Expression Tokenization)
• Sentence Tokenization
• Character Tokenization
• Semantic Tokenization
• Treebank Word Tokenization
• Number Tokenization
• Tweet Tokenization
• N-gram Tokenization
• Syllable Tokenization
• Character N-gram Tokenization

Advanced tokenization methods used in LLMs:
• WordPiece Tokenization: Used by BERT (Bidirectional Encoder Representations from Transformers) and DistilBERT.
  ➢ Splits text into subword units, which helps in handling rare words and reducing the vocabulary size.
• Byte-Pair Encoding (BPE) Tokenization: Used by GPT-2, GPT-3, GPT-4.
  ➢ Combines the most frequent pairs of bytes in a corpus to create subwords, which helps in efficiently handling large vocabularies.
• Unigram Language Model Tokenization: Used by T5 (Text-To-Text Transfer Transformer).
  ➢ Uses a probabilistic model to determine the most likely subword units.
• SentencePiece Tokenization: Used by ALBERT (A Lite BERT) and T5.
  ➢ A more general approach that can handle languages without clear word boundaries, like Chinese or Japanese.



Tokenization in LLMs - BPE
Byte-Pair Encoding (BPE) is a tokenization method that iteratively merges the most frequent pairs of
bytes in a text corpus to create a fixed-size vocabulary.

How BPE Works:


• Start with a base vocabulary of individual characters.
• Identify the most frequent pair of bytes (or characters) in the text.
• Merge this pair into a new token.
• Repeat the process until the desired vocabulary size is reached.

"This is an example.“ => Initial tokens: ["Th", "is", " ", "i", "s", " ", "a", "n", " ", "e", "x", "a", "m", "p", "l", "e", "."]
=> Frequent pairs ("is", " ")

Reference: Neural Machine Translation of Rare Words with Subword Units (Sennrich et al., 2016)
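A minimal Python sketch of the core BPE loop described above, counting adjacent pairs over a toy word-frequency dictionary and merging the most frequent one. The real GPT tokenizers operate on bytes over a much larger corpus; this is only an illustration of the algorithm:

import re
from collections import Counter

# Toy corpus as word -> frequency, with words pre-split into characters.
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

def get_pair_counts(vocab):
    """Count how often each adjacent symbol pair occurs across the corpus."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the chosen pair with a single merged symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

for step in range(5):                      # run a handful of merges
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)       # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(f"merge {step + 1}: {best}")     # e.g. ('e', 's'), then ('es', 't'), ...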



Tokenization in LLMs - BPE
Advantages:
• Efficient handling of rare words and subword units.
• Reduces the vocabulary size, making the model more efficient.
• Helps in better generalization by breaking down words into subword units.
Limitations:
• Can produce fragmented tokens for some languages, especially those with complex morphology.
• May not capture semantic meaning as effectively as other tokenization methods.



Tokenization in LLMs - Wordpiece
WordPiece Tokenization is a method used to split text into smaller subword units, which helps in handling rare
words and improving the efficiency of language models.

How WordPiece Works:


• Start with a Base Vocabulary: Begin with a small vocabulary that includes individual characters and special tokens.
• Identify Subword Units: Count how often adjacent symbols (characters or existing subwords) appear together in the corpus.
• Compute Scores for Pairs: Instead of merging the most frequent pair, WordPiece computes a score for each pair using the formula:
  score(a, b) = freq(ab) / (freq(a) × freq(b))

• Merge Pairs: Merge pairs based on their scores, prioritizing pairs where the individual parts are less frequent.
• Repeat: Continue the process until the desired vocabulary size is reached.

Reference: Google’s Neural Machine Translation System (Wu et al., 2016)
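A rough sketch, under the same toy setup as the BPE example, of the likelihood-based pair scoring that distinguishes WordPiece from raw-frequency merging. The production BERT tokenizer differs in details (for example the "##" continuation prefix), so treat this only as an illustration of the scoring idea:

from collections import Counter

# Toy corpus: words split into symbols (WordPiece would also add '##' continuation markers).
corpus = {"h u g": 10, "p u g": 5, "p u n": 12, "b u n": 4, "h u g s": 5}

def wordpiece_best_pair(vocab):
    """Pick the pair maximising freq(ab) / (freq(a) * freq(b)) rather than raw freq(ab)."""
    pair_freq, symbol_freq = Counter(), Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for s in symbols:
            symbol_freq[s] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair_freq[(a, b)] += freq
    return max(pair_freq, key=lambda p: pair_freq[p] / (symbol_freq[p[0]] * symbol_freq[p[1]]))

print(wordpiece_best_pair(corpus))  # favours pairs whose individual parts are rare, e.g. ('g', 's')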


Tokenization in LLMs - Wordpiece
Advantages:
• Efficient Handling of Rare Words: Breaks down rare words into subword units, improving model performance.
• Reduces Vocabulary Size: Helps in creating a manageable vocabulary size.
• Better Generalization: By using subword units, it can generalize better across different words and contexts.
Limitations:
• Complexity: The scoring mechanism adds complexity compared to simpler methods.
• Fragmentation: Can still produce fragmented tokens, especially for languages with complex morphology.

WordPiece's use of a likelihood-based scoring method distinguishes it from BPE's simpler frequency-based
approach, often resulting in a vocabulary that better captures the linguistic structure of the training data.



Representing Text in Language Models
• Unsupervised learning of text representations—No supervision needed
• Word embedding: Embed one-hot vectors into lower-dimensional space—Address “curse of dimensionality”
➢ Captures useful properties of word semantics
➢ Word similarity: Words with similar meanings are embedded closer
➢ Word analogy: Linear relationships between words (e.g. king - queen = man - woman)
[Figures: word similarity and word analogy in the embedding space]
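A small illustration of both properties, using hypothetical 3-dimensional vectors (real word embeddings have hundreds of dimensions and are learned from a corpus, not hand-written):

import numpy as np

# Hypothetical toy embeddings, chosen only to illustrate the two properties.
emb = {
    "king":  np.array([0.80, 0.65, 0.10]),
    "queen": np.array([0.78, 0.62, 0.72]),
    "man":   np.array([0.30, 0.20, 0.08]),
    "woman": np.array([0.28, 0.18, 0.70]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Word similarity: semantically related words have high cosine similarity.
print(cosine(emb["king"], emb["queen"]))

# Word analogy: the offset king - queen is close to the offset man - woman.
print(cosine(emb["king"] - emb["queen"], emb["man"] - emb["woman"]))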



Distributed Representations: Word2Vec

• Assumption: If two words have similar contexts, then they have similar semantic meanings!
• Word2Vec training objective:
  ➢ To learn word vector representations that are good at predicting the nearby words (the words that co-occur in a local context window).
[Figure: a centre word ("feed") with its co-occurring context words ("my", "cat", "with", "meat") inside a local context window]
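As a sketch of that objective, the skip-gram variant of Word2Vec extracts (centre word, context word) pairs from a local window and trains the embeddings to predict the context words. The pair extraction might look like this (the window size is an illustrative choice):

def skipgram_pairs(tokens, window=2):
    """Yield (centre, context) pairs from a local context window around each token."""
    for i, centre in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                yield centre, tokens[j]

sentence = "Sally fed my cat with meat".split()
print(list(skipgram_pairs(sentence)))
# e.g. ('fed', 'Sally'), ('fed', 'my'), ('fed', 'cat'), ... : these co-occurrence pairs
# are what the model learns to predict, pushing words with similar contexts together.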



Considering subwords - fasttext
• fastText improves upon Word2Vec by incorporating subword information into the word embedding.
• fastText allows sharing subword representations across words, since words are represented by the aggregation of their n-grams: a word is represented by the sum of the vector representations of its n-grams.
[Figures: tri-gram extraction, the Word2Vec probability expression, and the n-gram embedding]
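A minimal sketch of the character n-gram idea: each word is wrapped in boundary markers and decomposed into n-grams, and its vector is the sum of the n-gram vectors. fastText itself uses n-grams of several lengths plus the whole word; tri-grams here are just for illustration:

def char_ngrams(word, n=3):
    """Character n-grams of a word, with '<' and '>' marking word boundaries (fastText-style)."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("where"))   # ['<wh', 'whe', 'her', 'ere', 're>']

# A word vector is then the sum of its n-gram vectors, so rare or unseen words
# (e.g. "whereabouts") share subword representations with known words.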



Limitations of embeddings

• They are context-free embeddings: each word is mapped to only one vector regardless of its context!
  E.g. "bank" is a polysemous word, but it only has one representation: "Open a bank account" and "On the river bank" share the same representation.
• They do not consider the order of words.
• They treat all the words in the context window equally.



Autoregressive language models

• The chain rule of probability:


• P(Sally, fed, my, cat, with, meat) = P(Sally)
* P(fed | Sally)
* P(my | Sally, fed)
* P(cat | Sally, fed, my)
* P(with | Sally, fed, my, cat)
* P(meat | Sally, fed, my, cat, with)
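The same factorisation written as a sketch: given any function that returns p(next token | prefix), the probability of the whole sequence is the product of the per-token conditionals (here computed in log space). The conditional_prob function below is a stand-in with invented values, not a real model:

import math

def conditional_prob(token, prefix):
    """Stand-in for a real LM's p(token | prefix); here just a made-up lookup."""
    toy = {("Sally",): 0.01, ("Sally", "fed"): 0.2, ("Sally", "fed", "my"): 0.3}
    return toy.get(tuple(prefix) + (token,), 0.1)   # default value for illustration

def sentence_log_prob(tokens):
    """log P(w1..wn) = sum of log p(wi | w1..wi-1), i.e. the chain rule in log space."""
    return sum(math.log(conditional_prob(w, tokens[:i])) for i, w in enumerate(tokens))

print(sentence_log_prob("Sally fed my cat with meat".split()))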



Generation

• If we already have a good language model and a given text prompt w_{1:n}, and we want the model to generate a good sentence completion of length L: how do we find the w_{n+1:n+L} with the highest probability?
• Enumerate over all possible combinations? Infeasible: the number of candidate completions grows exponentially with L.
• Next-token prediction: generate the next token step by step, starting from w_{n+1}, using p(w_{n+1} | w_{1:n}).
• To select the next token from p(w_{n+1} | w_{1:n}), there are also different decoding approaches.
Different Decoding Approaches

• Greedy decoding: At each step, always select the w_t with the highest p(w_t | w_{1:t-1}) (see the sketch below).
• Beam search: Keep track of the k highest-probability partial sequences at each step instead of just one. A reasonable beam size is k = 5-10.
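A minimal sketch of greedy decoding over a toy next-token distribution; beam search would instead keep the k highest-scoring partial sequences at every step. The next_token_distribution function is a stand-in for a real model's output:

def next_token_distribution(prefix):
    """Stand-in for a real LM head: returns p(w_t | w_1..w_{t-1}) as a dict."""
    if prefix[-1] == "my":
        return {"cat": 0.6, "dog": 0.3, "<eos>": 0.1}
    if prefix[-1] == "cat":
        return {"with": 0.5, "<eos>": 0.5}
    return {"my": 0.4, "meat": 0.3, "<eos>": 0.3}

def greedy_decode(prompt, max_new_tokens=5):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        dist = next_token_distribution(tokens)
        best = max(dist, key=dist.get)          # always pick the single most likely token
        if best == "<eos>":
            break
        tokens.append(best)
    return tokens

print(greedy_decode(["Sally", "fed"]))   # e.g. ['Sally', 'fed', 'my', 'cat', 'with', ...]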



Attention is all you need
• Self-Attention: Each token attends to every other token in the
sentence, but with different weights
• Demo: https://github.com/jessevig/bertviz



Encoder and Decoder
• NLP tasks can be generally decomposed into language
understanding and language generation.
• Encoder models are generally used to understand input
sentences, and decoder models are generally used to generate
sentences.



Positional Encoding in Transformers
Sentence-1: "Dog chases cat" vs. Sentence-2: "Cat chases dog"
Self-Attention is Order-Agnostic
• Transformers use self-attention, which treats each word as a separate entity without
inherent order.
• Self-attention alone doesn't understand "first," "second," or "last."
Order is Crucial for Meaning
• Sentence meaning depends heavily on word order.
• Without positional information, "Who ate what?" and "What ate who?" would look the same
to the model!
Enter Positional Encoding
• Positional encoding adds information about the position of words in a sequence.
• It allows the transformer to understand the order of words and thus, the sentence's meaning.

Reference: Attention is all you need. (Vaswani et al., 2017)


Types of Positional Encoding
Absolute Positional Encoding:
• Sinusoidal Functions (Original Transformer):
➢ Uses sine and cosine functions to generate unique encodings for each position.
➢ Allows the model to learn relative positions by leveraging the trigonometric properties.
➢ Formula: PE(pos, 2i) = sin(pos / 10000^(2i / d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

• Learned Positional Embeddings:


➢ Trainable embeddings are learned specifically for each position.
➢ Similar to word embeddings, each position gets a unique vector representation.
Relative Positional Encoding:
• Instead of absolute position, it focuses on the relative distance between words.
• This can be more efficient in capturing relationships between words, especially for longer sequences.
• Examples: Transformer-XL, T5

Reference: Attention is all you need. (Vaswani et al., 2017) & https://erdem.pl/2021/05/understanding-positional-encoding-in-transformers
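A small Python sketch of the sinusoidal encoding defined by the formula above; the sequence length and d_model values are arbitrary illustrative choices:

import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(...)."""
    positions = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                # even dimension indices
    angles = positions / np.power(10000.0, dims / d_model)  # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe                                               # added to the token embeddings

print(sinusoidal_positional_encoding(seq_len=4, d_model=8).round(3))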
Self-Attention
• To calculate the attention weight from a query word w_q (e.g., "rabbit") to another word w_k:
• Each word is represented as a query, key, and value vector. These vectors are obtained by multiplying the input embeddings by learned weight matrices (one each for queries, keys, and values).
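A compact sketch of single-head self-attention with random weights; in a real model W_q, W_k, W_v are learned, and the shapes here are arbitrary illustrative choices:

import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8                 # 4 tokens, illustrative dimensions
X = rng.normal(size=(seq_len, d_model))         # input token embeddings

# Projection matrices (random here, learned in a real model).
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Scaled dot-product attention: each row says how much one token attends to every other token.
weights = softmax(Q @ K.T / np.sqrt(d_k))       # (seq_len, seq_len), rows sum to 1
output = weights @ V                            # context-aware representation per token
print(weights.round(2))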



Multi-Head Attention
• Input: Multiple independent sets of query, key, and value matrices
• Output: Concatenation of the outputs of the attention heads
• Advantage: Each attention head focuses on one representation subspace

[Figure: the outputs of the attention heads are concatenated]
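A rough sketch of the multi-head pattern: split the model dimension across heads, run scaled dot-product attention per head, concatenate, and apply an output projection. Dimensions and random weights are purely illustrative:

import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

X = rng.normal(size=(seq_len, d_model))
heads = []
for _ in range(n_heads):
    # Each head has its own (random here, normally learned) projections, i.e. its own subspace.
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    heads.append(attention(X @ W_q, X @ W_k, X @ W_v))

W_o = rng.normal(size=(d_model, d_model))
multi_head_output = np.concatenate(heads, axis=-1) @ W_o   # concatenate, then project
print(multi_head_output.shape)                             # (4, 8)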



Multi-Head Attention

Scaling (dividing the query-key dot products by √d_k):
• Scaling helps manage large values from the dot products when the key dimension d_k is large.
• Scaling keeps the values in a manageable range, stabilizing the softmax function.
• It maintains balance in gradient values during backpropagation, enhancing training efficiency.
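A quick numeric illustration of why the 1/√d_k factor matters: for larger key dimensions the raw dot products between random vectors grow, and the unscaled softmax tends to saturate towards a near one-hot distribution (exact values depend on the random seed):

import numpy as np

rng = np.random.default_rng(0)
d_k = 512
q = rng.normal(size=d_k)
K = rng.normal(size=(4, d_k))            # four keys

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

scores = K @ q
print(softmax(scores).round(3))                   # typically nearly one-hot: softmax saturated
print(softmax(scores / np.sqrt(d_k)).round(3))    # smoother distribution, better-behaved gradients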



Transformer Model Architecture
• Input Embedding
• Positional Encoding
• 12 Transformer layers
• 6 encoder layers
• 6 decoder layers
• Linear + Softmax layer for next
word prediction



Encoder Model
• The multi-head attention layer captures information from different subspaces at different positions.
• The feed-forward layer is applied to each token position independently, without interaction with other positions.
• Residual connections and layer normalization are applied around each sub-layer.
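Putting these pieces together, one encoder layer (in the post-layer-norm arrangement of the original Transformer) can be summarised as:
x = LayerNorm(x + MultiHeadAttention(x))
x = LayerNorm(x + FeedForward(x))
where FeedForward is applied to each position independently and each "x + ..." term is a residual connection.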
Decoder Model

• Masked multi-head self-attention: each position is only allowed to attend to earlier positions (to its left).
  • The Q, K, and V matrices all come from the previously generated tokens.
• Multi-head cross-attention: attends to the input sequence.
  • Q comes from the generated tokens.
  • The K and V matrices come from the (encoded) input tokens.
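A small sketch of the masking step in the decoder's self-attention: positions to the right of each query are set to minus infinity before the softmax, so every token can only attend to earlier positions (toy scores, single head):

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len = 4
scores = rng.normal(size=(seq_len, seq_len))      # stand-in for Q @ K.T / sqrt(d_k)

# Causal mask: -inf above the diagonal blocks attention to future positions.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
weights = softmax(np.where(mask, -np.inf, scores))
print(weights.round(2))     # lower-triangular: row i attends only to positions <= i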



Encoder & Decoder blocks



Layer Normalization & Residual Connections
Challenge: Deep Transformer stacks are hard to train.
• Issues: Vanishing/exploding gradients, slow convergence.
Solutions:
• Layer Normalization: Normalizes activations within each layer, preventing drastic shifts during training. Think of it like keeping the "volume" consistent across layers.
• Residual Connections: Create "shortcuts" that allow information to bypass layers, ensuring gradients flow more easily. Like providing "express lanes" for data and gradients.

Reference: Layer Normalization (Ba et al., 2016)
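A minimal sketch of both ideas around a single sub-layer; the sub-layer is just a stand-in linear map, and the learned gain and bias of layer normalization are omitted for brevity:

import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalise each token's activation vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)      # learned gain/bias omitted for brevity

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                     # 4 tokens, 8-dimensional activations
W = rng.normal(size=(8, 8))

sublayer_out = x @ W                            # stand-in for attention or feed-forward
y = layer_norm(x + sublayer_out)                # residual "shortcut" + normalisation
print(y.mean(axis=-1).round(6), y.std(axis=-1).round(3))   # ~0 mean, ~1 std per token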



Layer Normalization & Residual Connections
• Research Backing: Layer normalization (Ba et al., 2016) and residual connections (He et al., 2015) have significantly improved deep learning model performance.

• Faster Convergence: Training deep models becomes quicker because the improved gradient flow speeds up convergence.

• Stabilizing Networks: They stabilize training, leading to more reliable convergence, especially for deep architectures. This results in better model performance and higher accuracy on downstream tasks.

Reference: Layer Normalization (Ba et al., 2016) & Deep Residual Learning for Image Recognition (He et al., 2015)
Thank you
