Lecture 03 - Introduction To LLMs
Conversation AI
BITS Pilani
Pilani Campus
(S1-24_AIMLCZG521)
Session Content
I. Introduction to Language Models
1. What are LLMs?
2. Tokenization:
• Byte-Pair Encoding (BPE) for GPT
• WordPiece for BERT
3. Autoregressive Models
II. Transformer Architecture
1. Motivation
2. Self-Attention Mechanism
3. Multi-Head Attention
4. Positional Encoding
5. Encoder-Decoder Structure
6. Layer Normalization and Residual Connections
Evolution:
• Moved beyond simpler models like n-grams,
which struggled with sparse data and
generalization (Huang et al., 2018).
• Modern LLMs leverage advanced architectures
(e.g., Transformers) for improved performance.
"This is an example.“ => Initial tokens: ["Th", "is", " ", "i", "s", " ", "a", "n", " ", "e", "x", "a", "m", "p", "l", "e", "."]
=> Frequent pairs ("is", " ")
Reference: Neural Machine Translation of Rare Words with Subword Units (Sennrich et al., 2016)
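A minimal sketch of the BPE merge loop in Python (illustrative only; the toy corpus and helper names are assumptions, not the GPT tokenizer's actual implementation):

```python
from collections import Counter

# Toy corpus: each word is a tuple of symbols, mapped to its frequency.
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
          ("n", "e", "w", "e", "s", "t"): 6, ("w", "i", "d", "e", "s", "t"): 3}

def most_frequent_pair(corpus):
    """Count adjacent symbol pairs across the corpus and return the most frequent one."""
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with the merged symbol."""
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Repeat until the desired vocabulary size is reached (here: 5 merges).
for _ in range(5):
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)
    print("merged", pair)
```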
• Merge Pairs: Merge pairs based on their scores, prioritizing pairs where the individual parts are less frequent.
• Repeat: Continue the process until the desired vocabulary size is reached.
WordPiece's use of a likelihood-based scoring method distinguishes it from BPE's simpler frequency-based
approach, often resulting in a vocabulary that better captures the linguistic structure of the training data.
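A small sketch contrasting the two scoring rules, assuming the commonly described WordPiece score freq(pair) / (freq(first) × freq(second)); the function name and toy corpus are illustrative assumptions:

```python
from collections import Counter

def pair_scores(corpus, likelihood_based=True):
    """Score adjacent symbol pairs over a {word-as-tuple: frequency} corpus.
    BPE: score = freq(pair).
    WordPiece (as commonly described): score = freq(pair) / (freq(first) * freq(second)),
    which favors pairs whose individual parts are less frequent."""
    pair_freq, sym_freq = Counter(), Counter()
    for word, freq in corpus.items():
        for s in word:
            sym_freq[s] += freq
        for a, b in zip(word, word[1:]):
            pair_freq[(a, b)] += freq
    if not likelihood_based:
        return dict(pair_freq)                      # BPE-style: raw pair frequency
    return {p: f / (sym_freq[p[0]] * sym_freq[p[1]]) for p, f in pair_freq.items()}

corpus = {("t", "h", "e"): 50, ("t", "h", "i", "s"): 10, ("q", "u", "i", "z"): 2}  # toy word counts
bpe_scores, wp_scores = pair_scores(corpus, False), pair_scores(corpus)
print(max(bpe_scores, key=bpe_scores.get))   # ('t', 'h') -- most frequent pair wins under BPE
print(max(wp_scores, key=wp_scores.get))     # ('q', 'u') -- rare parts win under the likelihood score
```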
fastText subword embeddings:
• Represent a word by the sum of the vector representations of its n-grams (n-gram embeddings).
• fastText allows sharing subword representations across words, since words are represented by the aggregation of their n-grams (see the sketch below).
• [Figure: tri-gram extraction for the target word "cat" and its co-occurring words ("with", "meat") in a local context window; the summed n-gram embeddings feed the Word2Vec probability expression, and n-gram representations are shared across words]
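A minimal sketch of this aggregation (the boundary-marker tri-gram extraction, embedding size, and lazily created lookup table are illustrative assumptions, not fastText's actual implementation):

```python
import numpy as np

def char_ngrams(word, n=3):
    """Character tri-grams with boundary markers '<' and '>', plus the whole word itself."""
    w = f"<{word}>"
    return [w[i:i + n] for i in range(len(w) - n + 1)] + [w]

rng = np.random.default_rng(0)
dim = 8                                    # toy embedding size (assumption)
ngram_emb = {}                             # n-gram -> vector lookup table (toy)

def word_vector(word):
    """fastText-style word vector: sum of the vectors of the word's n-grams."""
    vec = np.zeros(dim)
    for g in char_ngrams(word):
        if g not in ngram_emb:             # lazily create a toy embedding for each n-gram
            ngram_emb[g] = rng.normal(size=dim)
        vec += ngram_emb[g]
    return vec

# "cat" and "cats" share n-grams such as "<ca" and "cat",
# so their vectors share components even when one of the words is rare.
v_cat, v_cats = word_vector("cat"), word_vector("cats")
```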
Reference: Attention is all you need. (Vaswani et al., 2017) & https://erdem.pl/2021/05/understanding-positional-encoding-in-transformers
Self-Attention
• To calculate the attention weight from a query word w_q (e.g., "rabbit") to another word w_k, the query vector of w_q is compared with the key vector of w_k using a dot product (see the sketch below).
• Each word is represented as a query, key, and value vector. These vectors are obtained by multiplying the input embedding by a learned weight matrix (one matrix each for queries, keys, and values).
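A minimal NumPy sketch of single-head self-attention under these definitions (the toy sizes and random weight matrices are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 4, 16, 8          # toy sizes (assumptions)

X = rng.normal(size=(n_tokens, d_model))   # input embeddings, one row per word
W_Q = rng.normal(size=(d_model, d_k))      # learned projection matrices (random here)
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V        # query / key / value vectors, one per word

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = Q @ K.T / np.sqrt(d_k)            # scaled dot products, shape (n_tokens, n_tokens)
weights = softmax(scores)                  # row i = attention weights from query word i to every word
output = weights @ V                       # each word's output is a weighted sum of value vectors
```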
Multi-Head Attention:
• [Figure: the outputs of the individual attention heads are concatenated and passed through a final linear projection]
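A compact, self-contained sketch of the per-head computation and the concatenation step (head count, sizes, and random weights are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_k, n_heads = 4, 16, 8, 2    # toy sizes (assumptions)
X = rng.normal(size=(n_tokens, d_model))

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

head_outputs = []
for _ in range(n_heads):                         # each head has its own projections
    W_Q = rng.normal(size=(d_model, d_k))
    W_K = rng.normal(size=(d_model, d_k))
    W_V = rng.normal(size=(d_model, d_k))
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    head_outputs.append(softmax(Q @ K.T / np.sqrt(d_k)) @ V)

concat = np.concatenate(head_outputs, axis=-1)   # concatenation: (n_tokens, n_heads * d_k)
W_O = rng.normal(size=(n_heads * d_k, d_model))  # final linear projection
multi_head_out = concat @ W_O                    # back to (n_tokens, d_model)
```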
Scaling:
• Scaling divides the dot products by √dk, which helps manage large values when the key
dimension dk is large (a numeric check follows this list).
• Scaling keeps values in a manageable range,
stabilizing the softmax function.
• Maintains balance in gradient values during
backpropagation, enhancing training efficiency.
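A small numeric check of this effect (purely illustrative; the dimensions and random inputs are assumptions): for random unit-variance vectors, the dot product has standard deviation roughly √dk, so dividing by √dk keeps the scores at unit scale before the softmax.

```python
import numpy as np

rng = np.random.default_rng(0)
for d_k in (8, 64, 512):
    q = rng.normal(size=(10_000, d_k))             # random query vectors, unit variance per component
    k = rng.normal(size=(10_000, d_k))             # random key vectors
    dots = (q * k).sum(axis=1)                     # raw dot products
    print(d_k, dots.std(), (dots / np.sqrt(d_k)).std())
    # raw std grows like sqrt(d_k); the scaled version stays near 1,
    # keeping the softmax out of its saturated (vanishing-gradient) region
```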
Layer Normalization and Residual Connections:
• Faster Convergence: improved gradient flow through the residual paths lets deep models train more quickly.
• Stabilizing Networks: layer normalization and residual connections stabilize training, leading to more reliable convergence, especially for deep architectures. This results in better model performance and higher accuracy on tasks (see the sketch below).
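A minimal sketch of how the two are combined around a Transformer sub-layer (post-norm arrangement, out = LayerNorm(x + SubLayer(x)); the feed-forward sub-layer, sizes, and random weights are illustrative assumptions):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token's features to zero mean, unit variance (Ba et al., 2016).
    The learned gain and bias parameters are omitted here for brevity."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise feed-forward sub-layer: Linear -> ReLU -> Linear."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
n_tokens, d_model, d_ff = 4, 16, 32                # toy sizes (assumptions)
x = rng.normal(size=(n_tokens, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

# Residual connection + layer normalization around the sub-layer:
out = layer_norm(x + feed_forward(x, W1, b1, W2, b2))
```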
Reference: Layer Normalization (Ba et al., 2016) & Deep Residual Learning for Image Recognition (He et al., 2015)
Thank you