
UNIVERSIDADE FEDERAL
DE MINAS GERAIS

Advanced Seminars on Large Language Models

Introduction to LLMs
Rodrygo L. T. Santos
[email protected]
[Image: silhouette of a human female on the left and a humanoid AI on the right; a white wire connects their brains through their mouths, symbolizing communication. By DALL·E 3]
Language

A natural ability for humans


◦ Effortless use for communication
◦ Expressive of thoughts, emotions, instructions
A challenge for machines
◦ Ambiguity, context-dependency, nuanced semantics
A milestone towards AGI?
[Images omitted; credits: Shutterstock, Forbes, The Verge, The Register; illustrations by Codex and DALL·E 3]
[Video prompt: "Photorealistic closeup video of two pirate ships battling each other as they sail inside a cup of coffee." By SORA]
Language model

A probability distribution over word sequences
◦ P("Today is Wednesday") ≈ 0.001
◦ P("Today Wednesday is") ≈ 0.0000000000001
◦ P("The eigenvalue is positive") ≈ 0.00001
Also a mechanism for "generating" text
◦ P("Wednesday" | "Today is") > P("blah" | "Today is")
(a minimal scoring/generation sketch follows below)
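To make both points concrete, here is a minimal, illustrative sketch: a tiny hand-specified bigram table plays the role of the language model, scoring word sequences and picking a likely next word. All probability values are made up for illustration.

```python
# A minimal sketch of a language model as a probability distribution over
# word sequences, using a tiny hand-specified bigram table. The values
# below are illustrative, not estimated from any corpus.

bigram = {
    ("<s>", "today"): 0.1, ("today", "is"): 0.5, ("is", "wednesday"): 0.2,
    ("today", "wednesday"): 1e-6, ("wednesday", "is"): 1e-5,
    ("is", "blah"): 1e-7,
}

def sequence_prob(words):
    """P(w_1 .. w_n) under a bigram factorization (unseen pairs get a tiny floor)."""
    prob, prev = 1.0, "<s>"
    for w in words:
        prob *= bigram.get((prev, w), 1e-9)
        prev = w
    return prob

print(sequence_prob(["today", "is", "wednesday"]))  # relatively likely
print(sequence_prob(["today", "wednesday", "is"]))  # orders of magnitude less likely

def next_word(prev):
    """The same table also 'generates': pick the most probable next word."""
    candidates = {w: p for (a, w), p in bigram.items() if a == prev}
    return max(candidates, key=candidates.get)

print(next_word("is"))  # 'wednesday' beats 'blah'
```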
Language model

Ideal (aka full-dependence) model
◦ P(w_1 … w_n) = P(w_1) P(w_2 | w_1) ⋯ P(w_n | w_1 … w_{n-1})
Infeasible in practice
◦ Expensive computation
◦ Poor estimation (data sparsity) (see the back-of-the-envelope sketch below)
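A back-of-the-envelope sketch of why the full-dependence model breaks down, assuming an illustrative vocabulary of 50,000 words: the number of distinct histories (and hence conditional distributions to estimate) explodes with context length, which is exactly the data-sparsity problem above.

```python
# Number of distinct contexts a full-dependence model must condition on,
# for an assumed (illustrative) vocabulary of 50,000 words.
V = 50_000
for n in (2, 3, 5, 10):
    print(f"contexts of length {n - 1}: {V ** (n - 1):.3e}")
# Already at n = 3 there are 2.5e9 possible two-word histories; almost none
# of them are ever observed in a real corpus.
```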
Evolution of language models

Statistical LMs (1950s-1990s)
Tunable dependence via n-grams

3-gram ("trigram")
◦ P(w_1 … w_n) = P(w_1) P(w_2 | w_1) ⋯ P(w_n | w_{n-2}, w_{n-1})
2-gram ("bigram")
◦ P(w_1 … w_n) = P(w_1) P(w_2 | w_1) ⋯ P(w_n | w_{n-1})
1-gram ("unigram")
◦ P(w_1 … w_n) = P(w_1) P(w_2) ⋯ P(w_n)
(a minimal estimation sketch follows below)
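A minimal sketch of how such n-gram probabilities are estimated by maximum likelihood from raw counts; the toy corpus is made up for illustration.

```python
from collections import Counter

# Maximum-likelihood bigram estimation from a toy corpus.
corpus = "today is wednesday . today is sunny . tomorrow is wednesday .".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_mle(w, prev):
    """MLE estimate: P(w | prev) = count(prev, w) / count(prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

print(p_mle("wednesday", "is"))  # 2/3
print(p_mle("sunny", "is"))      # 1/3
print(p_mle("blah", "is"))       # 0.0 -- unseen events get zero probability
```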
Improved estimation via smoothing
[Plot: probability P(w) per word w, contrasting maximum likelihood estimation with smoothed estimation]
(a minimal smoothing sketch follows below)
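A minimal sketch of add-one (Laplace) smoothing, one simple instance of the smoothed estimation pictured above; the toy corpus and the choice of add-one smoothing are illustrative, not necessarily the specific method the original slide had in mind.

```python
from collections import Counter

# Add-k (here, add-one) smoothing removes the zero probabilities of MLE
# by pretending every vocabulary word was seen k extra times.
corpus = "today is wednesday . today is sunny . tomorrow is wednesday .".split()
vocab = set(corpus) | {"blah"}          # pretend "blah" is in the vocabulary

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_smoothed(w, prev, k=1.0):
    """P(w | prev) with add-k smoothing over the vocabulary."""
    return (bigrams[(prev, w)] + k) / (unigrams[prev] + k * len(vocab))

print(p_smoothed("wednesday", "is"))  # 0.3 -- the MLE of 2/3 shrinks as mass is redistributed
print(p_smoothed("blah", "is"))       # 0.1 -- unseen, but no longer zero
```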
Evolution of language models

Statistical LMs (1950s-1990s) → Neural LMs (2013)

Neurons
[Diagram: a dense network maps the input words w_{1:t-1} to the predicted next word ŵ_t]
Improved word-level representation
◦ From sparse to distributional semantics
◦ Better generalization to unseen data
Context still lacking
◦ Fixed-length input and output (see the fixed-window sketch below)
◦ Non-sequential representation
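A minimal sketch (untrained, random weights; all sizes illustrative) of such a fixed-window neural language model: the previous k words are embedded, concatenated, and mapped by a dense layer to a distribution over the vocabulary. The forward pass accepts exactly k context words, which is the fixed-length limitation noted above.

```python
import numpy as np

# Fixed-window neural LM: embed k context words, concatenate, apply a dense layer.
rng = np.random.default_rng(0)
vocab = ["today", "is", "wednesday", "sunny", "blah"]
V, d, k = len(vocab), 8, 2              # vocab size, embedding dim, context length

E = rng.normal(size=(V, d))             # dense word embeddings
W = rng.normal(size=(k * d, V))         # dense layer: context -> vocabulary logits

def next_word_distribution(context_ids):
    """Forward pass for exactly k context words -- the fixed-length limitation."""
    x = np.concatenate([E[i] for i in context_ids])   # shape (k*d,)
    logits = x @ W
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()                            # softmax over the vocabulary

p = next_word_distribution([vocab.index("today"), vocab.index("is")])
print(dict(zip(vocab, p.round(3))))     # arbitrary here, since weights are untrained
```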
Neurons… with recurrence!
[Diagram: the recurrent network unrolled over time; at each step a dense network combines the current input word with the hidden state h passed from the previous step and predicts the next word ŵ]
Neurons… with recurrence!
[Diagram: the recurrent network predicts ŵ_t from the input w_{t-1} and the carried-over hidden state h_{t-1}]
Sequential blessing
◦ Dynamic state maintains linguistic context
◦ Enables handling variable-length sequences
Sequential curse
◦ Single state as information bottleneck
◦ Inherently non-parallelizable (see the sketch below)
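A minimal sketch (untrained, random weights) of the recurrent update: a single hidden state is carried through an inherently sequential loop, which is both the blessing (arbitrary-length context) and the curse (one state as bottleneck, no parallelism) listed above.

```python
import numpy as np

# One recurrent language-model pass: update hidden state h word by word.
rng = np.random.default_rng(0)
vocab = ["today", "is", "wednesday", "sunny", "blah"]
V, d, h_dim = len(vocab), 8, 16

E = rng.normal(size=(V, d))             # word embeddings
W_xh = rng.normal(size=(d, h_dim))      # input -> hidden
W_hh = rng.normal(size=(h_dim, h_dim))  # hidden -> hidden (the recurrence)
W_hy = rng.normal(size=(h_dim, V))      # hidden -> next-word logits

def rnn_logits(word_ids):
    h = np.zeros(h_dim)                      # single state = information bottleneck
    for i in word_ids:                       # inherently sequential loop
        h = np.tanh(E[i] @ W_xh + h @ W_hh)  # each step depends on the previous h
    return h @ W_hy

logits = rnn_logits([vocab.index(w) for w in ["today", "is"]])
print(vocab[int(np.argmax(logits))])    # arbitrary here, since weights are untrained
```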
Evolution of language models

Statistical LMs (1950s-1990s) → Neural LMs (2013) → Pretrained LMs (2018)
Vaswani et al. (NIPS 2017)
Neurons… with attention!
[Diagram: an attention layer maps the input words w_{1:t-1} to the predicted next word ŵ_t]
The animal didn't cross the street because it was too ______
Neurons… with attention!
[Diagram: the attention weights link the pronoun "it" back to "the animal", and the model fills in the blank]
The animal didn't cross the street because it was too scared
(a minimal attention sketch follows below)
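A minimal sketch of single-head scaled dot-product attention over the example sentence. The projection matrices are random, so the printed attention pattern is arbitrary; the point is the mechanism by which "it" can weight every other token, including "animal".

```python
import numpy as np

# Scaled dot-product attention over toy token representations.
rng = np.random.default_rng(0)
tokens = "The animal didn't cross the street because it was too".split()
d = 16
X = rng.normal(size=(len(tokens), d))         # toy token representations

Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv

scores = Q @ K.T / np.sqrt(d)                 # similarity of each query to each key
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the sequence
context = weights @ V                         # each token mixes information from all tokens

it = tokens.index("it")
print(dict(zip(tokens, weights[it].round(2))))   # where "it" attends (random here)
```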
Attention is (not) all you need
[Diagram: input w_{1:t-1} → preparation → attention (attn) → hidden representation h → dense layer → predicted ŵ_t]
Preparation: tokenize; mark position; encode
Enrichment: attend to multiple contexts; add nonlinearities
Prediction: select best output; decode
Transformer
[Diagram: a decoder stack of n blocks, each combining attention (attn) and dense layers, maps the input w_{1:t-1} to the predicted next word ŵ_t]
Effective representation
◦ Can attend to entire context – no bottleneck
◦ Attention heads as representation subspaces
◦ Order retained via positional encoding
Efficient processing
◦ Parallelization across tokens and heads
◦ Much faster training and inference
◦ Scalability to massive training datasets (see the decoder-block sketch below)
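A minimal sketch of the decoder loop just described: n blocks, each applying causally masked self-attention followed by a position-wise dense layer, with residual connections. Multiple heads and layer normalization are omitted for brevity, and all weights are random.

```python
import numpy as np

# A stripped-down Transformer decoder stack (single head, no layer norm).
rng = np.random.default_rng(0)
d, n_blocks, seq_len = 16, 2, 5
X = rng.normal(size=(seq_len, d))               # toy input representations

def self_attention(H, Wq, Wk, Wv):
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    scores = Q @ K.T / np.sqrt(d)
    scores += np.triu(np.full((len(H), len(H)), -1e9), k=1)  # causal mask: no future tokens
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

H = X
for _ in range(n_blocks):
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
    W1, W2 = rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d))
    H = H + self_attention(H, Wq, Wk, Wv)       # attend to the whole prefix (residual)
    H = H + np.maximum(H @ W1, 0) @ W2          # position-wise dense layer (residual)

print(H.shape)  # (5, 16): one enriched representation per position, computed in parallel
```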
Transformer architectures
[Diagram: three variants, each a stack of n blocks]
◦ encoder-only (e.g. BERT, 2018): reads the full input w_{1:n} bidirectionally and outputs ŵ_{1:n}
◦ encoder-decoder (e.g. T5, 2019): an encoder reads the input sequence; a decoder consumes w_{1:t-1} and predicts ŵ_t
◦ decoder-only (e.g. GPT, 2018): consumes w_{1:t-1} and predicts ŵ_t
(see the masking sketch below)
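A minimal sketch of the masking difference behind the three variants, using an illustrative 4-token sequence: encoder-only models attend bidirectionally, decoder-only models attend causally, and encoder-decoder models combine both (plus cross-attention from decoder to encoder).

```python
import numpy as np

# 0 = attention allowed, -inf = attention blocked (applied before the softmax).
n = 4
encoder_mask = np.zeros((n, n))                        # bidirectional: every token sees all tokens
decoder_mask = np.triu(np.full((n, n), -np.inf), k=1)  # causal: each token sees only its prefix
print(encoder_mask)
print(decoder_mask)
```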
The power of transfer learning

Self-supervised pretraining (expensive)


◦ Standard language modeling objective
◦ Train on massive textual corpora
Supervised fine-tuning (cheap)
◦ Multiple task-specific objectives
◦ Improved performance downstream
Evolution of language models

Statistical LMs (1950s-1990s) → Neural LMs (2013) → Pretrained LMs (2018) → Large LMs (2020)
Model size vs. time

GPT-1 (2018): 117M parameters
GPT-2 (2019): 1.5B parameters
GPT-3 (2020): 175B parameters
GPT-4 (2023): 1.76T* parameters (* unofficial estimate)

[Slide caption: "Are you guys still there?"]
Model size vs. time
◦ Advent of the Transformer
◦ Availability of massive datasets
◦ Access to powerful computing
[Images omitted; credits: Google, Mistral AI, Anthropic, Reuters]
The power of scaling

LLMs show improved performance with scale


◦ Increased model size (in trillions of parameters)
◦ Increased training size (in trillions of tokens)
Improvements in next token prediction
◦ But also in unforeseen capabilities!
Instruction following
[Diagram: the PROMPT is fed to the LLM, which produces the COMPLETION]
PROMPT: "Classify this review: I loved this film! Sentiment:"
COMPLETION: "Positive"
Instruction following
PROMPT: "Classify this review: I loved this film! Sentiment:"
COMPLETION: "received a very nice book review" (the model merely continues the text instead of following the instruction)
In-context learning
PROMPT: "Classify this review: I don't like this chair! Sentiment: Negative
Classify this review: I loved this film! Sentiment:"
COMPLETION: "Positive"
(a minimal prompt-building sketch follows below)
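A minimal sketch of how the zero-shot and one-shot prompts above can be assembled as plain strings; call_llm is a hypothetical placeholder for whatever model API is actually used.

```python
# Building zero-shot and one-shot (in-context) prompts as plain strings.

def zero_shot(review):
    return f"Classify this review:\n{review}\nSentiment:"

def one_shot(review, example_review, example_label):
    demo = f"Classify this review:\n{example_review}\nSentiment: {example_label}\n\n"
    return demo + zero_shot(review)

prompt = one_shot("I loved this film!", "I don't like this chair!", "Negative")
print(prompt)
# completion = call_llm(prompt)  # hypothetical call; the in-context example
#                                # steers the model toward answering "Positive"
```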
Basic, emerging, augmented capabilities!
The challenges of scaling

System challenges
◦ Substantial compute and energy consumption
◦ Continual learning and adaptation
Data challenges
◦ Data quality and representativeness
◦ Low-resource domains and languages
The challenges of scaling

Human challenges
◦ Responsible alignment
◦ Interpretability and explainability
◦ Privacy and security
Course goals

Understand the fundamentals of LLMs


Explore the capabilities and limitations of LLMs
Keep up with the current state of the field
Have a grasp of where the field is headed
Course scope

LLM architectures – Transformers and beyond


LLM lifecycle
◦ Pretraining: data preparation, objectives
◦ Adaptation: instruction, alignment, PEFT/MEFT
◦ Utilization: prompting, in-context, augmentation
◦ Evaluation: language, downstream
Course structure (tentative)

Intro lectures by instructor
Paper seminars by students
◦ 1 group per class (rotate every 2 weeks)
◦ 2 papers per group (30min + 20min discussion)
◦ 2 students per paper

Week   Mon  Wed
18/03  G1   G2
25/03  G3   G4
01/04  G1   G2
08/04  G3   G4
15/04  G1   G2
22/04  G3   G4
Course structure (tentative)

The final paper list and seminar schedule will be available later today for enrollment (see the schedule table above).
Course grading

Seminar presentations
◦ 3x 20% = 60%
Seminar feedback
◦ 21x 1% = 21%
Class participation
◦ 21x 1% = 21%
Course attendance


Credits for each course will only be awarded to students who earn at least a grade of D and who demonstrate effective attendance of at least 75% (seventy-five percent) of the activities in which they are enrolled; absences cannot be excused.
NGPG, art. 65
Course materials: books & surveys

Build a Large Language Model (from Scratch)


by Raschka (2024)
Large Language Models: A Survey
by Minaee et al. (2024)
A Comprehensive Overview of Large Language Models
by Naveed et al. (2024)
Course materials: books & surveys

Efficient Large Language Models: A Survey


by Wan et al. (2024)
A Survey of Large Language Models
by Zhao et al. (2023)
Course materials: courses and tutorials

Generative AI with Large Language Models


by DeepLearning.AI / AWS
Large Language Models
by Databricks
Neural Networks: Zero to Hero
by Karpathy
Pre-course survey

Fill in a short survey describing your past experience


and expectations related to the course
◦ https://forms.gle/7mcatGc5LtAFM2ta7
UNIVERSIDADE FEDERAL
DE MINAS GERAIS

Coming next…

Architecture of LLMs
Rodrygo L. T. Santos
[email protected]
