lecture2-transformer

This document is a lecture presentation on Generative AI, focusing on Transformer Language Models and the history of large language models (LLMs). It discusses the evolution of language models, including n-gram models and the introduction of RNNs and LSTMs to address challenges in learning long-distance dependencies. Key models and their specifications are also compared, highlighting advancements in the field of machine learning.

10-423/10-623 Generative AI

Machine Learning Department


School of Computer Science
Carnegie Mellon University

Transformer
Language Models
Matt Gormley
Lecture 2
Jan. 22, 2024

1
Reminders
• Homework 0: PyTorch + Weights & Biases
– Out: Wed, Jan 17
– Due: Wed, Jan 24 at 11:59pm
– Two parts:
1. written part to Gradescope
2. programming part to Gradescope
– unique policy for this assignment: we will grant (essentially) any
and all extension requests

2
Some History of…

LARGE LANGUAGE MODELS

3
Noisy Channel Models
• Prior to 2017, two tasks relied heavily on language models:
– speech recognition
– machine translation
• Definition: a noisy channel model combines a transduction model (probability of
converting y to x) with a language model (probability of y)

ŷ = argmax_y p(y | x) = argmax_y p(x | y) p(y)

where p(x | y) is the transduction model and p(y) is the language model.

• Goal: to recover y from x
– For speech: x is the acoustic signal, y is the transcription
– For machine translation: x is a sentence in the source language, y is a sentence in the target language

4
Large (n-Gram) Language Models
(English n-gram model is ~3 billion parameters)

• The earliest (truly) large language models were n-gram models
• Google n-Grams:
  – 2006: first release, English n-grams
    • trained on 1 trillion tokens of web text (95 billion sentences)
    • included 1-grams, 2-grams, 3-grams, 4-grams, and 5-grams
    • Number of unigrams: 13,588,391
    • Number of bigrams: 314,843,401
    • Number of trigrams: 977,069,902
    • Number of fourgrams: 1,313,818,354
    • Number of fivegrams: 1,176,470,663
  – 2009 – 2010: n-grams in Japanese, Chinese, Swedish, Spanish, Romanian, Portuguese, Polish, Dutch, Italian, French, German, Czech

Sample n-gram counts (English | French):
  serve as the incoming      92      accessoire Accessoires </S>      515
  serve as the incubator     99      accessoire Accord i-CTDi         65
  serve as the independent   794     accessoire Accra accu            312
  serve as the index         223     accessoire Acheter cet           1402
  serve as the indication    72      accessoire Ajouter au            160
  serve as the indicator     120     accessoire Amour Beauté          112
  serve as the indicators    45      accessoire Annuaire LOEIL        49
  serve as the indispensable 111     accessoire Architecture artiste  531
  serve as the indispensible 40      accessoire Attention :           44
  serve as the individual    234
  serve as the industrial    52

5
Large (n-Gram) Language Models
(English n-gram model is ~3 billion parameters, trained on 1 trillion tokens of web text)

Q: Is this a large training set?   A: Yes!
Q: Is this a large model?          A: Yes!

6
How large are LLMs?
Comparison of some recent large language models (LLMs)

Model                 Creators   Year of release   Training Data (# tokens)   Model Size (# parameters)
GPT-2                 OpenAI     2019              ~10 billion (40GB)         1.5 billion
GPT-3 (cf. ChatGPT)   OpenAI     2020              300 billion                175 billion
PaLM                  Google     2022              780 billion                540 billion
Chinchilla            DeepMind   2022              1.4 trillion               70 billion
LaMDA (cf. Bard)      Google     2022              1.56 trillion              137 billion
LLaMA                 Meta       2023              1.4 trillion               65 billion
LLaMA-2               Meta       2023              2 trillion                 70 billion
GPT-4                 OpenAI     2023              ?                          ?

7
FORGETFUL RNNS

10
Recall: Ways of Drawing Neural Networks

Neural Network Diagram
• The diagram represents a neural network
• Nodes are circles
• One node per hidden unit
• Node is labeled with the variable corresponding to the hidden unit
• For a fully connected feed-forward neural network, a hidden unit is a nonlinear function of nodes in the previous layer
• Edges are directed
• Each edge is labeled with its weight (side note: we should be careful about how a matrix can be used to indicate the labels of the edges, and the pitfalls there)
• Other details:
  – Following convention, the intercept term is NOT shown as a node, but rather is assumed to be part of the nonlinear function that yields a hidden unit (i.e. its weight does NOT appear in the picture anywhere)
  – The diagram does NOT include any nodes related to the loss computation

Computation Graph
• The diagram represents an algorithm
• Nodes are rectangles
• One node per intermediate variable in the algorithm
• Node is labeled with the function that it computes (inside the box) and also the variable name (outside the box)
• Edges are directed
• Edges do not have labels (since they don't need them)
• For neural networks:
  – Each intercept term should appear as a node (if it's not folded in somewhere)
  – Each parameter should appear as a node
  – Each constant, e.g. a true label or a feature vector, should appear in the graph
  – It's perfectly fine to include the loss

Example network (see the sketch after this slide):
  (A)  Input:             Given x_i, ∀i                         (A') Parameters: Given α_ij, ∀i, j
  (B)  Hidden (linear):   a_j = Σ_{i=0}^{M} α_ji x_i, ∀j
  (C)  Hidden (sigmoid):  z_j = 1 / (1 + exp(−a_j)), ∀j         (C') Parameters: Given β_j, ∀j
  (D)  Output (linear):   b = Σ_{j=0}^{D} β_j z_j
  (E)  Output (sigmoid):  y = 1 / (1 + exp(−b))                 (E') Label: Given y*
  (F)  Loss:              J = 1/2 (y − y*)²
11
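To make the example above concrete, here is a minimal NumPy sketch of the forward computation that the computation graph describes (function and variable names are mine; intercept terms are omitted for brevity):

```python
import numpy as np

def forward(x, alpha, beta, y_star):
    """Forward pass matching equations (A)-(F) on the slide."""
    a = alpha @ x                    # (B) hidden (linear):  a_j = sum_i alpha_ji * x_i
    z = 1.0 / (1.0 + np.exp(-a))     # (C) hidden (sigmoid): z_j = 1 / (1 + exp(-a_j))
    b = beta @ z                     # (D) output (linear):  b = sum_j beta_j * z_j
    y = 1.0 / (1.0 + np.exp(-b))     # (E) output (sigmoid): y = 1 / (1 + exp(-b))
    J = 0.5 * (y - y_star) ** 2      # (F) loss:             J = 1/2 (y - y*)^2
    return y, J
```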
Recall: RNN Language Model

The bat made noise at night END

p(w1|h1) p(w2|h2) p(w3|h3) p(w4|h4) p(w5|h5) p(w6|h6) p(w7|h7)

h1 h2 h3 h4 h5 h6 h7

START The bat made noise at night

Key Idea:
(1) convert all previous words to a fixed length vector
(2) define distribution p(wt | fθ(wt-1, …, w1)) that conditions on
the vector ht = fθ(wt-1, …, w1)

12
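A minimal PyTorch sketch of this key idea; the class name, embedding sizes, and the use of nn.RNN are illustrative choices, not details from the slide:

```python
import torch.nn as nn

class RNNLM(nn.Module):
    """h_t = f_theta(w_{t-1}, ..., w_1) summarizes the history in a fixed-length
    vector; p(w_t | h_t) is a softmax over the vocabulary."""
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):                 # tokens: (batch, T), beginning with START
        h, _ = self.rnn(self.embed(tokens))    # h[:, t] depends only on tokens up to t
        return self.out(h)                     # logits defining p(w_{t+1} | h_t)
```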
RNNs and Forgetting

13
Long Short-Term Memory (LSTM)
Motivation:
• Standard RNNs have trouble learning long
distance dependencies
• LSTMs combat this issue

y1 y2 … yT-1 yT

h1 h2 … hT-1 hT

x1 x2 … xT-1 xT

15
Long Short-Term Memory (LSTM)
Motivation:
• Vanishing gradient problem for Standard RNNs
• Figure shows sensitivity (darker = more sensitive) to the input at
time t=1

16
Figure from (Graves, 2012)
Long Short-Term Memory (LSTM)
Motivation:
• LSTM units have a rich internal structure
• The various “gates” determine the propagation of information
and can choose to “remember” or “forget” information
17
Figure from (Graves, 2012), Figure 4.4: Preservation of gradient information by LSTM.
Long Short-Term Memory (LSTM)

y1   y2   y3   y4

(each hidden unit is an LSTM memory cell)

x1   x2   x3   x4

The slide embeds an excerpt of Graves et al. (2013) (Fig. 1: Long Short-term Memory Cell; Fig. 2: Bidirectional Recurrent Neural Network):

Given an input sequence x = (x1, …, xT), a standard recurrent neural network (RNN) computes the hidden vector sequence h = (h1, …, hT) and output vector sequence y = (y1, …, yT) by iterating the following equations from t = 1 to T:

  ht = H(Wxh xt + Whh ht−1 + bh)                           (1)
  yt = Why ht + by                                         (2)

where the W terms denote weight matrices (e.g. Wxh is the input-hidden weight matrix), the b terms denote bias vectors (e.g. bh is the hidden bias vector) and H is the hidden layer function. H is usually an elementwise application of a sigmoid function. The Long Short-Term Memory (LSTM) architecture, which uses purpose-built memory cells to store information, is better at finding and exploiting long range context. For the version of LSTM used in that paper, H is implemented by the following composite function:

  it = σ(Wxi xt + Whi ht−1 + Wci ct−1 + bi)                (3)
  ft = σ(Wxf xt + Whf ht−1 + Wcf ct−1 + bf)                (4)
  ct = ft ⊙ ct−1 + it ⊙ tanh(Wxc xt + Whc ht−1 + bc)       (5)
  ot = σ(Wxo xt + Who ht−1 + Wco ct + bo)                  (6)
  ht = ot ⊙ tanh(ct)                                       (7)

where σ is the logistic sigmoid function, and i, f, o and c are respectively the input gate, forget gate, output gate and cell activation vectors, all of which are the same size as the hidden vector h. The weight matrices from the cell to gate vectors (e.g. Wci) are diagonal, so element m in each gate vector only receives input from element m of the cell vector. Bidirectional RNNs process the data in both directions with two separate hidden layers, and deep RNNs stack multiple RNN hidden layers; combining the two gives (deep) bidirectional LSTM.

18
Long Short-Term Memory (LSTM)

• Input gate: masks out the standard RNN inputs
• Forget gate: masks out the previous cell
• Cell: stores the input/forget mixture
• Output gate: masks out the values of the next hidden state

19
Figure from (Graves et al., 2013)
Long Short-Term Memory (LSTM)

• Input gate: masks out the standard RNN inputs
• Forget gate: masks out the previous cell
• Cell: stores the input/forget mixture
• Output gate: masks out the values of the next hidden state

The cell is the LSTM's long-term memory and helps control information flow over time steps.

The hidden state is the output of the LSTM cell (identical to the hidden state of Elman's networks).

20
Figure from (Graves et al., 2013)
Long Short-Term Memory (LSTM)

y1   y2   y3   y4

(each hidden unit is an LSTM memory cell; the slide repeats the Graves et al. (2013) excerpt shown above)

x1   x2   x3   x4

21
Deep Bidirectional LSTM (DBLSTM)

• Figure: input/output layers not shown
• Same general topology as a Deep Bidirectional RNN, but with LSTM units in the hidden layers
• No additional representational power over DBRNN, but easier to learn in practice

22
Figure from (Graves et al., 2013), Fig. 3: Deep Recurrent Neural Network
Deep Bidirectional LSTM (DBLSTM)

How important is this particular architecture?

Jozefowicz et al. (2015) evaluated 10,000 different LSTM-like architectures and found several variants that worked just as well on several tasks.

23
Figure from (Graves et al., 2013)
Why not just use LSTMs for everything?
Everyone did, for a time.

But…
1. They still have difficulty with long-range dependencies
2. Their computation is inherently serial, so it cannot easily be
parallelized on a GPU
3. Even though they (mostly) solve the vanishing gradient problem,
they can still suffer from exploding gradients

24
Transformer Language Models

MODEL: GPT

25
Attention
x′4 = Σ_{j=1}^{4} a4,j vj

a4,1 a4,2 a4,3 a4,4

softmax
s4,1 s4,2 s4,3 s4,4

v1 v2 v3 v4

26
Attention
x′1 = Σ_{j=1}^{1} a1,j vj

a1,1

softmax
s1,1

v1

27
Attention
x′2 = Σ_{j=1}^{2} a2,j vj

a2,1 a2,2

softmax
s2,1 s2,2

v1 v2

28
Attention
x′3 = Σ_{j=1}^{3} a3,j vj

a3,1 a3,2 a3,3

softmax
s3,1 s3,2 s3,3

v1 v2 v3

29
Attention
x′4 = Σ_{j=1}^{4} a4,j vj

a4,1 a4,2 a4,3 a4,4

softmax
s4,1 s4,2 s4,3 s4,4

v1 v2 v3 v4

30
Attention
x1′ x2′ x3′ x4′

x′t = Σ_{j=1}^{t} at,j vj

a4,1 a4,2 a4,3 a4,4


attention weights
softmax
s4,1 s4,2 s4,3 s4,4
scores
v1 v2 v3 v4
values

31
Scaled Dot-Product Attention
x′4 = Σ_{j=1}^{4} a4,j vj

a4,1 a4,2 a4,3 a4,4

softmax
s4,1 s4,2 s4,3 s4,4

Wv v1 v2 v3 v4
vj = WTv xj values
x1 x2 x3 x4

32
Scaled Dot-Product Attention
x′4 = Σ_{j=1}^{4} a4,j vj

a4,1 a4,2 a4,3 a4,4

softmax
s4,1 s4,2 s4,3 s4,4

Wk
k1 k2 k3 k4
kj = WTk xj keys
Wv v1 v2 v3 v4
vj = WTv xj values
x1 x2 x3 x4

33
Scaled Dot-Product Attention
x′4 = Σ_{j=1}^{4} a4,j vj

a4,1 a4,2 a4,3 a4,4

softmax
Wq s4,1 s4,2 s4,3 s4,4

q1 q2 q3 q4
qj = WTq xj queries
Wk
k1 k2 k3 k4
kj = WTk xj keys
Wv v1 v2 v3 v4
vj = WTv xj values
x1 x2 x3 x4

34
Scaled Dot-Product Attention
x′4 = Σ_{j=1}^{4} a4,j vj

a4,1 a4,2 a4,3 a4,4

softmax
Wq s4,1 s4,2 s4,3 s4,4
s4,j = kjᵀ q4 / √dk    scores
q1 q2 q3 q4
qj = WTq xj queries
Wk
k1 k2 k3 k4
kj = WTk xj keys
Wv v1 v2 v3 v4
vj = WTv xj values
x1 x2 x3 x4

35
Scaled Dot-Product Attention
x′4 = Σ_{j=1}^{4} a4,j vj

a4,1 a4,2 a4,3 a4,4


a4 = softmax(s4 ) attention weights

softmax
Wq s4,1 s4,2 s4,3 s4,4
s4,j = kjᵀ q4 / √dk    scores
q1 q2 q3 q4
qj = WTq xj queries
Wk
k1 k2 k3 k4
kj = WTk xj keys
Wv v1 v2 v3 v4
vj = WTv xj values
x1 x2 x3 x4

36
Scaled Dot-Product Attention
x′4 = Σ_{j=1}^{4} a4,j vj

a4,1 a4,2 a4,3 a4,4


a4 = softmax(s4 ) attention weights

softmax
Wq s4,1 s4,2 s4,3 s4,4
s4,j = kjᵀ q4 / √dk    scores
q1 q2 q3 q4    (attention)
qj = WTq xj queries
Wk
k1 k2 k3 k4
kj = WTk xj keys
Wv v1 v2 v3 v4
vj = WTv xj values
x1 x2 x3 x4

37
Scaled Dot-Product Attention
x1′ x2′ x3′ x4′

x′t = Σ_{j=1}^{t} at,j vj

at = softmax(st ) attention weights

Wq
st,j = kjᵀ qt / √dk    scores

attention qj = WTq xj queries


Wk

kj = WTk xj keys
Wv
vj = WTv xj values
x1 x2 x3 x4

38
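Putting the scaled dot-product slides together, here is a small PyTorch sketch that computes x′_t for one position t exactly as written above, summing only over positions j ≤ t (the function and argument names are mine; the matrix form for all positions at once appears in the implementation section later):

```python
import torch

def attention_at_t(x_list, t, W_q, W_k, W_v):
    """Single-query scaled dot-product attention (t is 1-indexed as on the slides).
    Each x_j is a 1-D tensor; W_q, W_k, W_v have shape (d_model, d_k)."""
    q_t = W_q.T @ x_list[t - 1]                              # q_t = W_q^T x_t
    ks = [W_k.T @ x_list[j] for j in range(t)]               # keys   k_1 .. k_t
    vs = [W_v.T @ x_list[j] for j in range(t)]               # values v_1 .. v_t
    d_k = q_t.shape[0]
    s = torch.stack([k @ q_t for k in ks]) / d_k ** 0.5      # scores s_{t,j} = k_j^T q_t / sqrt(d_k)
    a = torch.softmax(s, dim=0)                              # attention weights a_{t,j}
    return sum(a[j] * vs[j] for j in range(t))               # x'_t = sum_j a_{t,j} v_j
```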
Recall: Animation of 3D Convolution

http://cs231n.github.io/convolutional-networks/

39
Figure from Fei-Fei Li & Andrej Karpathy & Justin Johnson (CS231N)
Multi-headed Attention

x1′ x2′ x3′ x4′
1st head, 2nd head, 3rd head (each with its own Wq, Wk, Wv)
multi-headed attention
x1 x2 x3 x4

• Just as we can have multiple channels in a convolution layer, we can use multiple heads in an attention layer
• Each head gets its own parameters
• We can concatenate all the outputs to get a single vector for each time step
40
Multi-headed Attention

• To ensure the dimension of the input embedding xt is the same as the output embedding xt′, Transformers usually choose the embedding sizes and number of heads appropriately:
  – dmodel = dim. of inputs
  – dk = dim. of each output
  – h = # of heads
  – Choose dk = dmodel / h
  – Then concatenate the outputs
• Just as we can have multiple channels in a convolution layer, we can use multiple heads in an attention layer
• Each head gets its own parameters
• We can concatenate all the outputs to get a single vector for each time step

x1′ x2′ x3′ x4′
multi-headed attention (Wq, Wk, Wv per head)
x1 x2 x3 x4

42
Recall: RNN Language Model

The bat made noise at night END

p(w1|h1) p(w2|h2) p(w3|h3) p(w4|h4) p(w5|h5) p(w6|h6) p(w7|h7)

h1 h2 h3 h4 h5 h6 h7

START The bat made noise at night

Key Idea:
(1) convert all previous words to a fixed length vector
(2) define distribution p(wt | fθ(wt-1, …, w1)) that conditions on
the vector ht = fθ(wt-1, …, w1)

43
Transformer Language Model

The bat  made  noise
p(w1|h1) p(w2|h2) p(w3|h3) p(w4|h4)
h1 h2 h3 h4
x1 x2 x3 x4 …

Each layer of a Transformer LM consists of several sublayers:
1. attention
2. feed-forward neural network
3. layer normalization
4. residual connections

Each hidden vector looks back at the hidden vectors of the current and previous timesteps in the previous layer.

The language model part is just like an RNN-LM!

Important!
• RNN computation graph grows linearly with the number of input tokens
• Transformer-LM computation graph grows quadratically with the number of input tokens

45
Layer Normalization

• The Problem: internal covariate shift occurs during training of a deep network when a small change in the low layers amplifies into a large change in the high layers
• One Solution: Layer normalization normalizes each layer and learns elementwise gain/bias
• Such normalization allows for higher learning rates (for faster convergence) without issues of diverging gradients

Given input a ∈ R^K, LayerNorm computes output b ∈ R^K:

  b = γ ⊙ (a − µ) / σ ⊕ β

where we have mean µ = (1/K) Σ_{k=1}^{K} a_k,
standard deviation σ = sqrt( (1/K) Σ_{k=1}^{K} (a_k − µ)² ),
and parameters γ ∈ R^K, β ∈ R^K.
⊙ and ⊕ denote elementwise multiplication and addition.

47
Figure from https://arxiv.org/pdf/1607.06450.pdf
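A minimal sketch of the LayerNorm equation above (the small eps constant is an assumption for numerical stability and is not on the slide; this mirrors what torch.nn.LayerNorm computes):

```python
import torch

def layer_norm(a, gamma, beta, eps=1e-5):
    """b = gamma * (a - mu) / sigma + beta, with mu and sigma computed over
    the K activations of the layer (the last dimension of a)."""
    mu = a.mean(dim=-1, keepdim=True)                             # mu = (1/K) sum_k a_k
    sigma = a.var(dim=-1, keepdim=True, unbiased=False).sqrt()    # population standard deviation
    return gamma * (a - mu) / (sigma + eps) + beta
```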
Residual Connections

• The Problem: as network depth grows very large, a performance degradation occurs that is not explained by overfitting (i.e. train / test error both worsen)
• One Solution: Residual connections pass a copy of the input alongside another function so that information can flow more directly
• These residual connections allow for effective training of very deep networks that perform better than their shallower (though still deep) counterparts

Plain connection:     b = f(a)
Residual connection:  b′ = f(a),  b = b′ + a

48
Figure from https://arxiv.org/pdf/1512.03385.pdf
Residual Connections

Plain connection:     b = f(a)
Residual connection:  b = f(a) + a

Why are residual connections helpful?
Instead of f(a) having to learn a full transformation of a, f(a) only needs to learn an additive modification of a (i.e. the residual).
49
Figure from https://arxiv.org/pdf/1512.03385.pdf
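A tiny sketch of the residual pattern: wrap any sublayer f so the block outputs f(a) + a, and f only has to learn the additive modification (the wrapper class is my own illustration):

```python
import torch.nn as nn

class Residual(nn.Module):
    """Adds the input back to the sublayer's output: b = f(a) + a."""
    def __init__(self, f):
        super().__init__()
        self.f = f

    def forward(self, a):
        return self.f(a) + a    # information can also flow directly through the identity path
```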
Transformer Layer

Each layer of a Transformer LM consists of several sublayers:
1. attention
2. feed-forward neural network
3. layer normalization
4. residual connections

The whole stack below makes up one Transformer layer (shown top to bottom):

x1′ x2′ x3′ x4′
layer normalization
residual connections
feed-forward neural network
layer normalization
residual connections
multi-headed attention (Wq, Wk, Wv)
x1 x2 x3 x4
52
Transformer Layer
Each layer of a Transformer LM
consists of several sublayers:
1. attention
2. feed-forward neural network
3. layer normalization
4. residual connections

x1’ x2’ x3’ x4’

Transformer layer

x1 x2 x3 x4
53
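Here is a sketch of one complete Transformer LM layer assembled from the four sublayers in the order drawn on the slides (attention, then residual connection and layer norm, then feed-forward, then residual connection and layer norm). The hyperparameter values, the ReLU feed-forward network, and the use of PyTorch's built-in multi-head attention are illustrative assumptions, not details taken from the lecture:

```python
import torch
import torch.nn as nn

class TransformerLMLayer(nn.Module):
    """One Transformer LM layer: causal multi-headed attention and a feed-forward
    network, each followed by a residual connection and layer normalization."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                                   # x: (batch, T, d_model)
        T = x.shape[1]
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        a, _ = self.attn(x, x, x, attn_mask=causal)         # True entries are masked out
        x = self.norm1(x + a)                               # residual connection + layer norm
        x = self.norm2(x + self.ff(x))                      # feed-forward + residual + layer norm
        return x
```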
Transformer Language Model

The bat  made  noise
p(w1|h1) p(w2|h2) p(w3|h3) p(w4|h4)
h1 h2 h3 h4
Transformer layer
Transformer layer
Transformer layer
x1 x2 x3 x4 …

Each layer of a Transformer LM consists of several sublayers:
1. attention
2. feed-forward neural network
3. layer normalization
4. residual connections

Each hidden vector looks back at the hidden vectors of the current and previous timesteps in the previous layer.

The language model part is just like an RNN-LM.
54
In-Class Poll

Question:
Suppose we have the following input embeddings and attention weights:
• x1 = [1,0,0,0]    a4,1 = 0.1
• x2 = [0,1,0,0]    a4,2 = 0.2
• x3 = [0,0,2,0]    a4,3 = 0.6
• x4 = [0,0,0,1]    a4,4 = 0.1
And Wv = I. Then we can compute x4′.

Now suppose we swap the embeddings x2 and x3 such that
• x2 = [0,0,2,0]
• x3 = [0,1,0,0]
What is the new value of x4′?

Answer:

55
Position Embeddings

The bat  made  noise
p(w1|h1) p(w2|h2) p(w3|h3) p(w4|h4)
h1 h2 h3 h4
Transformer layer
Transformer layer
Transformer layer
w1+p1  w2+p2  w3+p3  w4+p4

• The Problem: Because attention is position invariant, we need a way to learn about positions
• The Solution: Use (or learn) a collection of position-specific embeddings: pt represents what it means to be in position t. And add this to the word embedding wt.
  The key idea is that every word that appears in position t uses the same position embedding pt
• There are a number of varieties of position embeddings:
  – Some are fixed (based on sine and cosine), whereas others are learned (like word embeddings)
  – Some are absolute (as described above) but we can also use relative position embeddings (i.e. relative to the position of the query vector)
56
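A sketch of the learned, absolute variety of position embeddings described above: every word that appears in position t gets the same p_t added to its word embedding w_t. The max_len value and module layout are my assumptions; a fixed sine/cosine table is the other common choice.

```python
import torch
import torch.nn as nn

class EmbeddingWithPositions(nn.Module):
    """Input to the first Transformer layer is w_t + p_t."""
    def __init__(self, vocab_size, d_model, max_len=1024):
        super().__init__()
        self.word = nn.Embedding(vocab_size, d_model)   # w_t: depends on the token identity
        self.pos = nn.Embedding(max_len, d_model)       # p_t: depends only on the position t

    def forward(self, tokens):                          # tokens: (batch, T)
        T = tokens.shape[1]
        positions = torch.arange(T, device=tokens.device)
        return self.word(tokens) + self.pos(positions)  # broadcasting adds p_t to every sequence
```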
GPT-3
• GPT stands for Generative Pre-trained Transformer
• GPT is just a Transformer LM, but with a huge number of
parameters

Model          # layers   dimension of states   dimension of inner states   # attention heads   # params
GPT (2018)     12         768                   3072                        12                  117M
GPT-2 (2019)   48         1600                  --                          --                  1542M
GPT-3 (2020)   96         12288                 4*12288                     96                  175000M

59
IMPLEMENTING A TRANSFORMER LM

60
Matrix Version of Single-Headed Attention

x′4 = Σ_{j=1}^{4} a4,j vj

a4 = softmax(s4)                 attention weights
s4,j = kjᵀ q4 / √dk              scores
qj = Wqᵀ xj                      queries
kj = Wkᵀ xj                      keys
vj = Wvᵀ xj                      values

x1 x2 x3 x4

62
Matrix Version of Single-Headed Attention

• For speed, we compute all the queries at once using matrix operations
• First we pack the queries, keys, values into matrices
• Then we compute all the queries at once

  X′ = AV = softmax(QKᵀ / √dk) V

  A = [a1, …, a4]ᵀ = softmax(S)
  S = [s1, …, s4]ᵀ = QKᵀ / √dk
  Q = [q1, …, q4]ᵀ = X Wq
  K = [k1, …, k4]ᵀ = X Wk
  V = [v1, …, v4]ᵀ = X Wv
  X = [x1, …, x4]ᵀ
64
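A direct transcription of the matrix form above into PyTorch (no causal mask yet; X stacks x_1 … x_T as rows, and the projection matrices are assumed to have shape (d_model, d_k)):

```python
import torch

def single_head_attention(X, W_q, W_k, W_v):
    """X' = softmax(Q K^T / sqrt(d_k)) V, with Q = X W_q, K = X W_k, V = X W_v."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v     # pack all queries / keys / values at once
    d_k = K.shape[-1]
    S = Q @ K.T / d_k ** 0.5                # S[t, j] = k_j^T q_t / sqrt(d_k)
    A = torch.softmax(S, dim=-1)            # softmax over j, i.e. row-wise
    return A @ V                            # row t of the result is x'_t
```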
Matrix Version of Single-Headed Attention

  X′ = AV = softmax(QKᵀ / √dk) V

Holy cow, that's a lot of new arrows… do we always want/need all of those?
• Suppose we're training our transformer to predict the next token(s) given the input…
• … then attending to tokens that come after the current token is cheating!

So what is this model?
• This version is the standard Transformer block (more on this later!)
• But we want the Transformer LM block
• And that requires masking!
65
Matrix Version of Single-Headed Attention

  X′ = AV = softmax(QKᵀ / √dk) V
  A = softmax(S)
  S = QKᵀ / √dk
  Q = X Wq
  K = X Wk
  V = X Wv
  X = [x1, …, x4]ᵀ

Question: How is the softmax applied?
A. column-wise
B. row-wise

Answer:
66
Matrix Version of Single-Headed (Causal) Attention

Insight: if some element in the input to the softmax is −∞, then the corresponding output is 0!

  X′ = AV = softmax(QKᵀ / √dk + M) V
  Acausal = softmax(S + M)

In practice, the attention weights are computed for all time steps T, then we mask out (by setting to −∞) all the inputs to the softmax that are for the timesteps to the right of the query.

Question: For a causal LM, which is the correct matrix?

  A:  M = [  0    0    0    0
            −∞    0    0    0
            −∞   −∞    0    0
            −∞   −∞   −∞    0 ]

  B:  M = [  0   −∞   −∞   −∞
             0    0   −∞   −∞
             0    0    0   −∞
             0    0    0    0 ]

  C:  M = [  0   −∞   −∞   −∞
            −∞    0   −∞   −∞
            −∞   −∞    0   −∞
            −∞   −∞   −∞    0 ]

Answer:
67
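The masking trick in code: build M with 0 on and below the diagonal and −∞ strictly above it, add it to the scores, and the softmax sends the masked positions to weight 0 (a sketch with my own function name):

```python
import torch

def causal_attention(X, W_q, W_k, W_v):
    """Causal (masked) single-headed attention: each position can only attend
    to itself and earlier positions."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    S = Q @ K.T / d_k ** 0.5
    T = S.shape[0]
    M = torch.full((T, T), float("-inf")).triu(diagonal=1)   # 0 on/below the diagonal, -inf above
    A = torch.softmax(S + M, dim=-1)                         # masked entries get weight 0
    return A @ V
```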
Matrix Version of Multi-Headed (Causal) Attention

x1′ x2′ x3′ x4′
multi-headed attention (Wq(i), Wk(i), Wv(i) for each head i)
x1 x2 x3 x4

  X′ = concat(X′(1), …, X′(h))

  X′(i) = softmax( Q(i) (K(i))ᵀ / √dk + M ) V(i)

  Q(i) = X Wq(i)
  K(i) = X Wk(i)
  V(i) = X Wv(i)
  X = [x1, …, x4]ᵀ
69
Matrix Version of Multi-Headed (Causal) Attention

Recall: To ensure the dimension of the input embedding xt is the same as the output embedding xt′, Transformers usually choose the embedding sizes and number of heads appropriately:
• dmodel = dim. of inputs
• dk = dim. of each output
• h = # of heads
• Choose dk = dmodel / h

  X′ = concat(X′(1), …, X′(h))
  X′(i) = softmax( Q(i) (K(i))ᵀ / √dk + M ) V(i)
  Q(i) = X Wq(i)
  K(i) = X Wk(i)
  V(i) = X Wv(i)
  X = [x1, …, x4]ᵀ

x1′ x2′ x3′ x4′
multi-headed attention
x1 x2 x3 x4
70
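Finally, a sketch of the full multi-headed (causal) attention sublayer with h heads of width d_k = d_model / h, matching the equations above. Packing every head's projection into one Linear layer, and omitting any extra output projection beyond concatenation, are my implementation choices, not requirements from the slides.

```python
import torch
import torch.nn as nn

class MultiHeadCausalAttention(nn.Module):
    """h heads of causal scaled dot-product attention; per-head outputs X'(i)
    are concatenated so the output width equals d_model."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0, "choose d_k = d_model / h"
        self.d_k = d_model // n_heads
        self.n_heads = n_heads
        # each Linear packs all heads' (d_model, d_k) projections side by side
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)

    def forward(self, X):                                   # X: (T, d_model)
        T = X.shape[0]
        def split(Z):                                       # (T, d_model) -> (n_heads, T, d_k)
            return Z.view(T, self.n_heads, self.d_k).transpose(0, 1)
        Q, K, V = split(self.W_q(X)), split(self.W_k(X)), split(self.W_v(X))
        S = Q @ K.transpose(-2, -1) / self.d_k ** 0.5       # (n_heads, T, T) scores
        M = torch.full((T, T), float("-inf"), device=X.device).triu(diagonal=1)
        A = torch.softmax(S + M, dim=-1)                    # causal attention weights per head
        out = A @ V                                         # (n_heads, T, d_k)
        return out.transpose(0, 1).reshape(T, self.n_heads * self.d_k)  # concatenate the heads
```

For example, MultiHeadCausalAttention(d_model=512, n_heads=8) maps a (T, 512) input to a (T, 512) output, one row per position.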
