01 Introduction
Introduction to LLMs
Rodrygo L. T. Santos
[email protected]
[Image: silhouette of a human on the left and a humanoid AI on the right, a white wire connecting their brains through their mouths, symbolizing communication. By DALL·E 3]
Language
Statistical LMs (1950s–1990s)
Tunable dependence via n-grams
3-gram ("trigram")
◦ $P(w_1 \dots w_n) = P(w_1)\,P(w_2 \mid w_1) \cdots P(w_n \mid w_{n-2}, w_{n-1})$
2-gram ("bigram")
◦ $P(w_1 \dots w_n) = P(w_1)\,P(w_2 \mid w_1) \cdots P(w_n \mid w_{n-1})$
1-gram ("unigram")
◦ $P(w_1 \dots w_n) = P(w_1)\,P(w_2) \cdots P(w_n)$
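To make the n-gram factorization concrete, here is a small, self-contained sketch (not from the slides; the corpus and function names are illustrative) of estimating a bigram model by maximum likelihood and scoring a sentence with it:

```python
from collections import Counter

def train_bigram_mle(corpus):
    """Estimate bigram probabilities P(w_i | w_{i-1}) by maximum likelihood."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    # P(w | v) = count(v, w) / count(v)
    return {(v, w): c / unigrams[v] for (v, w), c in bigrams.items()}

def sentence_probability(model, sentence):
    """P(w_1 ... w_n) as a product of bigram probabilities (0 if unseen)."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for v, w in zip(tokens[:-1], tokens[1:]):
        prob *= model.get((v, w), 0.0)  # unseen bigrams get probability 0
    return prob

corpus = ["the cat sat", "the dog sat", "the cat ran"]
model = train_bigram_mle(corpus)
print(sentence_probability(model, "the cat sat"))  # > 0
print(sentence_probability(model, "the dog ran"))  # 0: unseen bigram, motivating smoothing
```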
Improved estimation via smoothing
[Plot: word probability P(w) per word w, comparing maximum likelihood estimation with smoothed estimation]
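As an illustration of why smoothing matters, the sketch below (my own toy example, using add-one/Laplace smoothing as one common scheme) contrasts maximum likelihood estimates with smoothed estimates for a unigram model:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()
counts = Counter(corpus)
N = len(corpus)          # total tokens
V = len(counts) + 1      # vocabulary size, +1 slot for unseen words

def p_mle(word):
    # Maximum likelihood: unseen words get probability 0
    return counts[word] / N

def p_laplace(word):
    # Add-one smoothing: shifts a little probability mass to unseen words
    return (counts[word] + 1) / (N + V)

for w in ["the", "cat", "dog"]:
    print(w, round(p_mle(w), 3), round(p_laplace(w), 3))
# "dog" has MLE probability 0 but a non-zero smoothed probability
```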
Evolution of language models
[Diagram: a model takes the context w_{1:t-1} as input and predicts the next word ŵ_t]
Neurons… with recurrence!
[Diagram: a recurrent network reads the input w_{1:t-1} one word at a time, carrying a hidden state h_{t-1}, and outputs a distribution over the next word ŵ_t]
Neurons… with recurrence!
Sequential bless
◦ Dynamic state maintains linguistic context
◦ Enables handling variable-length sequences
Sequential curse
◦ Single state as information bottleneck
◦ Inherently non-parallelizable
[Diagram: recurrent network predicting ŵ_t from w_{t-1} and hidden state h_{t-1}]
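A minimal numpy sketch of a single recurrent step may help here: the hidden state h is the only channel through which earlier words reach the prediction (the bottleneck noted above), and the loop over tokens is what prevents parallelization. All shapes, weights, and names are illustrative rather than any particular RNN variant:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 8                                  # toy vocabulary and hidden sizes
E   = rng.normal(size=(V, d)) * 0.1           # word embeddings
W_h = rng.normal(size=(d, d)) * 0.1           # hidden-to-hidden weights
W_x = rng.normal(size=(d, d)) * 0.1           # input-to-hidden weights
W_o = rng.normal(size=(d, V)) * 0.1           # hidden-to-output weights

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def rnn_step(h_prev, word_id):
    """One recurrent step: fold word w_{t-1} into the state, predict w_t."""
    h = np.tanh(h_prev @ W_h + E[word_id] @ W_x)   # new hidden state h_t
    p = softmax(h @ W_o)                           # distribution over the next word
    return h, p

# Process a prefix one word at a time -- inherently sequential
h = np.zeros(d)
for w in [3, 1, 4, 1, 5]:          # token ids of the prefix w_{1:t-1}
    h, p_next = rnn_step(h, w)
print(p_next.round(3))             # P(w_t | w_{1:t-1}), all context squeezed into h
```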
Evolution of language models
[Diagram: attention-based model takes w_{1:t-1} as input and predicts ŵ_t; example: "The animal didn't cross the street because it was too ______"]
Neurons… with attention!
[Diagram: attention over the input context w_{1:t-1} when predicting ŵ_t; example: "The animal didn't cross the street because it was too ______" → scared]
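The sketch below shows scaled dot-product attention in numpy, the basic mechanism behind this example: a query for the word being predicted (here, resolving "it") is compared against every context word, and the output is a weighted mixture of their representations. The toy vectors are made up for illustration:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # query-key similarities
    weights = softmax(scores, axis=-1)        # attention weights over the context
    return weights @ V, weights               # weighted mixture of value vectors

# Toy context of 4 "words"; the query (say, for "it") is made to resemble word 0
rng = np.random.default_rng(0)
d_k = 8
K = V = rng.normal(size=(4, d_k))             # keys and values for the context words
Q = K[0:1] + 0.1 * rng.normal(size=(1, d_k))  # query close to word 0 ("animal")
out, weights = attention(Q, K, V)
print(weights.round(2))   # the largest weight falls on the word the query resembles
```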
Attention is (not) all you need
Enrichment: attend to multiple contexts; add nonlinearities
[Diagram: attn layer producing h, followed by a dense layer, mapping w_{1:t-1} to ŵ_t]
Transformer
Effective representation
◦ Can attend to entire context – no bottleneck
◦ Attention heads as representation subspaces
◦ Order retained via positional encoding
Efficient processing
◦ Parallelization across tokens and heads
◦ Much faster training and inference
◦ Scalability to massive training datasets
[Diagram: decoder stack of n blocks (attn → h → dense) mapping w_{1:t-1} to ŵ_t]
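The sketch below is a deliberately simplified decoder block along the lines described above (causal multi-head attention plus a position-wise dense layer, with sinusoidal positional encoding added to the inputs). It omits layer normalization, some residual connections, and training entirely, and all names and shapes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, n_heads = 5, 16, 2                 # toy context length, model width, heads
d_h = d // n_heads                       # per-head width

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def positional_encoding(T, d):
    """Sinusoidal encodings so token order is retained."""
    pos = np.arange(T)[:, None]
    i = np.arange(d)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def causal_attention_head(X, Wq, Wk, Wv):
    """One head: attend over the whole prefix, never over future tokens."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)          # causal mask
    return softmax(scores) @ V

def decoder_block(X, p):
    # Each head works in its own representation subspace; heads run in parallel
    heads = [causal_attention_head(X, *hw) for hw in p["heads"]]
    attn_out = np.concatenate(heads, axis=-1) @ p["Wo"]
    h = X + attn_out                               # residual connection
    # Position-wise dense layer adds the nonlinearity, same weights for every token
    return np.maximum(0.0, h @ p["W1"]) @ p["W2"]

p = {
    "heads": [tuple(rng.normal(size=(d, d_h)) * 0.1 for _ in range(3))
              for _ in range(n_heads)],
    "Wo": rng.normal(size=(d, d)) * 0.1,
    "W1": rng.normal(size=(d, 4 * d)) * 0.1,
    "W2": rng.normal(size=(4 * d, d)) * 0.1,
}
X = rng.normal(size=(T, d)) + positional_encoding(T, d)   # embeddings + positions
print(decoder_block(X, p).shape)   # (T, d): all T tokens processed at once
```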
Transformer architectures
[Diagram: stacks of n Transformer blocks at increasing scale]
◦ GPT-1: 117M parameters
◦ GPT-2: 1.5B parameters
◦ GPT-4: 1.76T* parameters
Instruction following
[Diagram: PROMPT → LLM → COMPLETION]
In-context learning
[Diagram: PROMPT → LLM → COMPLETION]
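As a concrete, made-up illustration of in-context learning: the "training examples" live entirely in the prompt, and the model is expected to continue the pattern without any weight updates.

```python
# A made-up few-shot prompt: the task is specified only through examples
prompt = """Translate English to Portuguese.

English: good morning
Portuguese: bom dia

English: thank you
Portuguese: obrigado

English: see you tomorrow
Portuguese:"""

# Sent to an LLM completion endpoint, the expected completion would be
# something like " até amanhã" -- inferred from the in-context examples,
# with no gradient updates to the model.
```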
The challenges of scaling
System challenges
◦ Substantial compute and energy consumption
◦ Continual learning and adaptation
Data challenges
◦ Data quality and representativeness
◦ Low-resource domains and languages
Human challenges
◦ Responsible alignment
◦ Interpretability and explainability
◦ Privacy and security
Course goals
Seminar presentations
◦ 3x 20% = 60%
Seminar feedback
◦ 21x 1% = 21%
Class participation
◦ 21x 1% = 21%
Course attendance
❝ Credits for each course will only be granted to students who obtain at least a grade of D and who demonstrate effective attendance in at least 75% (seventy-five percent) of the activities in which they are enrolled; absences may not be waived. ❞
NGPG, art. 65
Course materials: books & surveys
Coming next…
Architecture of LLMs
Rodrygo L. T. Santos
[email protected]