Lecture 20
Large Language Models
Roshan Sharma
Some slides borrowed from Danqi Chen, Chenyan Xiong and Graham Neubig – thanks!
Agenda
● Emergent Abilities and Scaling Effects
● What are LLMs?
● Modern LLM Architecture
● LLM Training Procedure
● LLM Inference – Prompting, In-Context Learning and Chain of Thought
● Evaluating LLMs
● Multimodal LLMs
Review: Language Models as Generalists
● Language models can perform not just a single task but many different tasks, simply by learning to predict the next token or sentence
Review: The LLM Era – Paradigm Shift in Machine Learning
● BERT (Oct 2018) – Representation
● GPT (Jun 2018) – Generation
GPT-2 – Generalizing to Unseen Tasks
● LMs can be used for different tasks by pre-training a “base” model and then fine-tuning for the task(s) of interest
● Practical Issues:
○ Too many copies of the model
○ Need for large-scale labeled data for fine-tuning
○ Each fine-tuned copy can do only a specific task
● Multi-task Training?
○ Data remains a challenge
○ Humans don’t need such large volumes of data to learn – can we do better?
GPT-2 – Task Specifications
● The primary shift is in the modeling assumption: from a single-task model to a general model
○ Single Task Model: P(output | input)
○ General Model: P(output | input, task)
Scaling (Kaplan et al., 2020)
● OpenAI study: Scaling Laws for Neural Language Models (Kaplan et al., 2020)
● Key Findings:
○ Performance depends strongly on scale, and weakly on the model shape
○ Larger models are more sample-efficient
○ Smooth power laws (y = ax^k) relate empirical performance (test loss) to N (number of parameters), D (dataset size), and C (compute)
Scaling Effects
● The effect of some hyperparameters and design choices on big LMs can be predicted before training – optimizer (Adam vs. SGD), model depth, LSTM vs. Transformer
● Idea:
○ Train a few smaller models
○ Establish a scaling law (e.g., an Adam vs. SGD scaling law)
○ Select the optimal hyperparameter based on the scaling-law prediction
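To make the “train a few smaller models, then extrapolate” recipe concrete, here is a minimal sketch (not from the lecture; the run results below are invented purely for illustration) of fitting a power law y = ax^k in log-log space and using it to predict the loss of a larger model:

```python
# A minimal sketch of the "train small, extrapolate" idea. All numbers are invented.
import numpy as np

# Hypothetical (parameter count N, validation loss L) pairs from a few small runs
N = np.array([1e7, 3e7, 1e8, 3e8])
L = np.array([4.20, 3.90, 3.62, 3.35])

# A power law L = a * N^k is a straight line in log-log space:
#   log L = log a + k * log N  ->  fit slope k and intercept log a.
k, log_a = np.polyfit(np.log(N), np.log(L), deg=1)

# Extrapolate the expected loss of a much larger (e.g. 10B-parameter) model
N_big = 1e10
L_pred = np.exp(log_a) * N_big ** k
print(f"fitted exponent k = {k:.3f}, predicted loss at 10B params = {L_pred:.2f}")
```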
Model Scaling: GPT-3
● 175B parameters! (GPT-2 was 1.5B)
● Source: https://bmk.sh/2020/05/29/GPT-3-A-Brief
Emergent Abilities with GPT-3 – Wei et al., 2022
● Emergent abilities:
○ Not present in smaller models but present in larger models
○ Do LLMs like GPT-3 have these?
● Findings:
○ GPT-3 trained on text can do arithmetic problems like addition and subtraction
○ Different abilities “emerge” at different scales
○ Model scale is not the only contributor to emergence – for 14 BIG-Bench tasks, LaMDA 137B and GPT-3 175B models perform at near-random, but PaLM 62B achieves above-random performance
○ Problems LLMs can’t solve today may be emergent for future LLMs
Large Language Models
● Language models that have many parameters (over 1B) and can perform
multiple tasks through prompting
● Decoder-only (GPT)
○ Pre-training: Auto-regressive Language Modeling
○ Stable training, faster convergence
○ Better generalization after pre-training
● Encoder-decoder (T0/T5)
○ Pre-training: Masked Span Prediction
○ Good for tasks like MT, summarization
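To ground the auto-regressive pre-training objective of decoder-only models, here is a minimal PyTorch sketch (my own illustration, not the lecture’s code; the function name and tensor shapes are assumptions): each position is trained to predict the next token under a causal decoder.

```python
# Minimal sketch of the auto-regressive next-token objective.
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """logits: (batch, seq_len, vocab) from a causal decoder; tokens: (batch, seq_len)."""
    pred = logits[:, :-1, :]      # predictions at positions 0..T-2
    target = tokens[:, 1:]        # the tokens they should predict (positions 1..T-1)
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))
```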
T5/T0: Masked Span Prediction
● Masked span prediction involves:
○ Masking a continuous set of tokens (a span) in the input
○ Predicting the masked span from the decoder
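The sketch below (my own toy example; the sentinel naming follows T5’s convention, but the helper itself is hypothetical) shows how one masked span turns into an encoder input and a decoder target:

```python
# T5-style span corruption: a contiguous span is replaced by a sentinel token in
# the encoder input, and the decoder is trained to emit the span after that sentinel.
def corrupt_span(tokens, start, length, sentinel="<extra_id_0>"):
    """Mask tokens[start:start+length]; return (encoder_input, decoder_target)."""
    encoder_input = tokens[:start] + [sentinel] + tokens[start + length:]
    decoder_target = [sentinel] + tokens[start:start + length] + ["</s>"]
    return encoder_input, decoder_target

# Example: masking the span "sat on" in a toy sentence
enc, dec = corrupt_span(["The", "cat", "sat", "on", "the", "mat"], start=2, length=2)
# enc -> ['The', 'cat', '<extra_id_0>', 'the', 'mat']
# dec -> ['<extra_id_0>', 'sat', 'on', '</s>']
```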
Attention patterns (Wang et al.)
Poll 1
Which of the following is true about emergent abilities?
A. A language model with fewer parameters than 175B cannot have any
emergent abilities
B. They are found in large models but not in small models
C. Summarization is likely an emergent ability in a model pre-trained on a
summarization corpus
D. Emergent abilities arise only because of scaling.
Training of Decoder-only LLMs – Llama 2
1. Auto-regressive Pre-training - Train to predict the next token on very large-scale corpora (~3 trillion tokens)
2. Instruction Fine-tuning / Supervised Fine-tuning (SFT) - Fine-tune the pre-trained model on pairs of (instruction + input, output), first with a large dataset and then with a small, high-quality dataset
● Purpose
○ Pre-training makes good generalist auto-completers, but good SFT builds models that can do many unseen tasks
○ SFT can also guide the nature of outputs in terms of safety and helpfulness
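As a minimal sketch of SFT data preparation (the prompt template and helper below are my assumptions, not Llama 2’s actual pipeline): the whole sequence is fed to the decoder, but the loss is computed only on the response tokens, so the model learns to generate answers rather than prompts.

```python
IGNORE_INDEX = -100  # PyTorch's default ignore_index: these positions are skipped by cross-entropy

def build_sft_example(tokenizer, instruction: str, inp: str, output: str):
    # Format the (instruction + input) as a prompt, then append the target output
    prompt = f"### Instruction:\n{instruction}\n\n### Input:\n{inp}\n\n### Response:\n"
    prompt_ids = tokenizer.encode(prompt)
    response_ids = tokenizer.encode(output)
    input_ids = prompt_ids + response_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids  # mask out the prompt tokens
    return input_ids, labels
```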
Instruction Tuning (Wei et al., 2021)
Unsafe Outputs – Alignment Problem
● LLMs may produce
○ Harmful text – abusive or offensive language, bias and discrimination
○ Text that can cause direct harm – allowing easy access to dangerous information
● Can we simply enumerate and block harmful outputs? No! The set of possible harmful outputs is very large and cannot be exhaustively listed
Poll 2
Which of the following is a feature of Llama 2?
A. Swishy activations
B. Relativistic positional embeddings
C. Multi-query attention
D. Grouped-query attention
LLM Inference: Prompting
● Prompts
○ Tell the model what to do in natural language
○ For example: “Generate a textual summary of this paragraph:”
○ Can be as short or as long as required
● Prompt Engineering
○ The task of identifying the correct prompt needed to perform a task
○ General rule of thumb: be as specific and descriptive as possible
○ Can be manual or automatic (prefix-tuning, paraphrasing, etc.)
ChatGPT Prompt example
In-context learning / Few-shot prompting (Brown et al., 2020)
● Provide a few examples along with the instruction
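A small illustration of in-context learning (the format and examples are my own): a few labeled demonstrations are placed in the prompt before the query, and the model is asked to continue the pattern, with no gradient updates.

```python
examples = [
    ("The movie was fantastic!", "positive"),
    ("I want my money back.", "negative"),
]

def few_shot_prompt(examples, query):
    # Stack the demonstrations, then append the unanswered query
    shots = "\n\n".join(f"Review: {x}\nSentiment: {y}" for x, y in examples)
    return f"{shots}\n\nReview: {query}\nSentiment:"

print(few_shot_prompt(examples, "Two hours of my life I will never get back."))
```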
Chain of thought prompting (Wei et al., 2022)
● Get the model to work through the steps of the problem
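An illustrative chain-of-thought exemplar (wording adapted from a widely used example, not taken from the slides): the demonstration spells out intermediate reasoning steps so the model “shows its work” before giving the final answer to the new question.

```python
cot_prompt = """Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each. \
How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 balls each is 6 balls. 5 + 6 = 11. \
The answer is 11.

Q: The cafeteria had 23 apples. They used 20 to make lunch and bought 6 more. \
How many apples do they have?
A:"""
```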
What to Pick?
(Trade-off: stronger task-specific performance vs. more convenient, more general, less data)
1. Full Fine-tuning (FT)
a. + Strongest performance
b. - Need curated and labeled dataset for each new task (typically 1k-100k+ examples)
c. - Poor generalization, spurious feature exploitation
2. Few-shot (FS)
a. + Much less task-specific data needed
b. + No spurious feature exploitation
c. - Challenging
3. One-shot (1S)
a. + “Most natural,” e.g. giving humans instructions
b. - Challenging
4. Zero-shot (0S)
a. + Most convenient
b. - Challenging, can be ambiguous
Note on Parameter Efficient Fine-tuning
● When we don’t have large enough data for SFT
○ Freeze the LM and keep some parameters trainable (which?)
○ Add an external adapter module to adapt model parameters to the task
○ Perform Low-rank Adaptation (LoRA)
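A minimal LoRA sketch (my own simplification, not a specific library’s API): the pre-trained weight W is frozen and only a low-rank update B·A is trained, so the effective weight becomes W + (alpha / r)·B·A with rank r much smaller than the layer width.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # freeze the pre-trained layer
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        # Frozen base output plus the scaled low-rank update
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```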
Poll 3
Which of the following describes in-context learning?
Modality Encoder
● A modality encoder (e.g., a speech encoder) maps non-text input into embeddings that are combined with the text embeddings produced by the text tokenizer.
Modeling data using Discrete Units
● Recently, discrete units have shown promising performance and benefits
Chang, Xuankai, et al. "Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study." arXiv preprint arXiv:2309.15800 (2023).
○ Storage
■ Audio features (HuBERT): 1024 dim * 32 bit (float)
■ Discrete unit (1000 / 2000-cluster): 12 bit
○ Sequence length (> 50% reduction)
■ De-duplication
■ Subword Modeling
○ Performance
■ Better than fbank features, close to (slightly below) continuous SSL features
○ Semantic features from SSL models were used for
■ ASR / ST / SLU
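A back-of-the-envelope check of the storage claim above, using the numbers as stated on the slide (1024-dim float32 HuBERT features vs. a 12-bit discrete unit per frame):

```python
continuous_bits = 1024 * 32      # 32,768 bits per frame of continuous SSL features
discrete_bits = 12               # one cluster index per frame, as stated on the slide
print(continuous_bits / discrete_bits)   # ~2731x less storage per frame
```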
Modeling data using Discrete Units
● SpeechLM: a single language model over a whole vocabulary that is the union of the text vocabulary (from the text tokenizer) and a speech vocabulary of discrete units (A1, A2, … from a quantizer), sharing one token embedding table
● Discrete representations
○ Extracted from self-supervised audio models like VQ-VAEs
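A rough sketch of the joint SpeechLM vocabulary (the token names here are hypothetical): the text tokenizer’s vocabulary is extended with one token per discrete speech unit, so a single embedding table covers both modalities.

```python
text_vocab = {"<pad>": 0, "<s>": 1, "</s>": 2, "hello": 3, "world": 4}  # toy text vocab
num_speech_units = 1000          # e.g. k-means clusters over SSL features

speech_vocab = {f"<unit_{i}>": len(text_vocab) + i for i in range(num_speech_units)}
whole_vocab = {**text_vocab, **speech_vocab}
# Quantized speech frames (A1, A2, ...) map through speech_vocab, text goes through
# the text tokenizer; both index the same token embedding matrix.
```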
Open Challenges - LLMs
● New Capabilities
○ Multimodal
○ Multi-lingual
○ More Complex Tasks
● Performance
○ Reduce Hallucinations
○ Improve Alignment with Human Preference
○ Increase Context Length Efficiently
○ Improve Data, Training Strategy, and Model Architecture
● Efficiency
○ Computational cost, time, and money
○ Compute architecture – GPU/ TPU/ HPU
Open Challenges - LLMs
● Safety
○ Reduce Harm
○ Improve Adversarial Robustness
○ Privacy Concerns
● Interpretability
○ Why do LLMs do what they do?
Summary
● LLMs are large-scale models that possess astounding abilities
● Scaling both data and model capacity is important for performance and
leads to the emergence of new abilities
● Decoder-only architectures are popular for convergence and performance
● LLMs are trained using pre-training, SFT, RLHF
● LLMs are evaluated using prompting strategies like ICL and CoT
● Multimodal LLMs can process audio, text, images and more.
Thank you!