Transformer Language Models
Matt Gormley
Lecture 2
Jan. 22, 2024
1
Reminders
• Homework 0: PyTorch + Weights & Biases
– Out: Wed, Jan 17
– Due: Wed, Jan 24 at 11:59pm
– Two parts:
1. written part to Gradescope
2. programming part to Gradescope
– unique policy for this assignment: we will grant (essentially) any
and all extension requests
2
Some History of…
3
Noisy Channel Models
• Prior to 2017, two tasks relied heavily on language models:
– speech recognition
– machine translation
• Definition: a noisy channel model combines a transduction model (probability of
converting y to x) with a language model (probability of y)
4
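Written as a worked equation (a standard Bayes-rule reading of the definition above, where x is the observed audio or source-language text and y is the output text):

```latex
\hat{y} \;=\; \operatorname*{argmax}_{y}\; p(y \mid x)
       \;=\; \operatorname*{argmax}_{y}\; \underbrace{p(x \mid y)}_{\text{transduction model}}\;\underbrace{p(y)}_{\text{language model}}
```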
Large (n-Gram) Language Models
• The earliest (truly) large language models were n-gram models
• Google n-Grams:
– 2006: first release, English n-grams
• trained on 1 trillion tokens of web text (95 billion sentences)
• included 1-grams, 2-grams, 3-grams, 4-grams, and 5-grams
– 2009 – 2010: n-grams in Japanese, Chinese, Swedish, Spanish, Romanian, Portuguese, Polish, Dutch, Italian, French, German, Czech
The English n-gram model alone is ~3 billion parameters:
Number of unigrams: 13,588,391
Number of bigrams: 314,843,401
Number of trigrams: 977,069,902
Number of fourgrams: 1,313,818,354
Number of fivegrams: 1,176,470,663
Example counts from the English n-grams:
serve as the indicator 120
serve as the indicators 45
serve as the indispensable 111
serve as the indispensible 40
serve as the individual 234
serve as the industrial 52
Example counts from the French n-grams:
accessoire Amour Beauté 112
accessoire Annuaire LOEIL 49
accessoire Architecture artiste 531
accessoire Attention : 44
6
How large are LLMs?
Comparison of some recent large language models (LLMs)
7
FORGETFUL RNNS
10
Recall: Ways of Drawing Neural Networks
Neural Network Diagram
• The diagram represents a neural network
• Nodes are circles
• One node per hidden unit
• Node is labeled with the variable corresponding to the hidden unit
• For a fully connected feed-forward neural network, a hidden unit is a nonlinear function of nodes in the previous layer
• Edges are directed
• Each edge is labeled with its weight (side note: we should be careful about how a matrix can be used to indicate the labels of the edges, and the pitfalls there)
• Other details:
– Following standard convention, the intercept term is NOT shown as a node, but rather is assumed to be part of the nonlinear function that yields a hidden unit (i.e. its weight does NOT appear in the picture anywhere)
– The diagram does NOT include any nodes related to the loss computation
Computation Graph
• The diagram represents an algorithm
• Nodes are rectangles
• One node per intermediate variable in the algorithm
• Node is labeled with the function that it computes (inside the box) and also the variable name (outside the box)
• Edges are directed
• Edges do not have labels (since they don't need them)
• For neural networks:
– Each intercept term should appear as a node (if it's not folded in somewhere)
– Each parameter should appear as a node
– Each constant, e.g. a true label or a feature vector, should appear in the graph
– It's perfectly fine to include the loss
Example computation graph (one-hidden-layer network), labeled bottom-up:
(A) Input: given x_i, ∀i    (A') Parameters: given α_{ij}, ∀i, j
(B) Hidden (linear): a_j = Σ_{i=0}^{M} α_{ji} x_i, ∀j
(C) Hidden (sigmoid): z_j = 1 / (1 + exp(−a_j)), ∀j    (C') Parameters: given β_j, ∀j
(D) Output (linear): b = Σ_{j=0}^{D} β_j z_j
(E) Output (sigmoid): y = 1 / (1 + exp(−b))    (E') Label: given y*
(F) Loss: J = ½ (y − y*)²
11
Recall: RNN Language Model
h1 h2 h3 h4 h5 h6 h7
Key Idea:
(1) convert all previous words to a fixed-length vector
(2) define a distribution p(w_t | f_θ(w_{t−1}, …, w_1)) that conditions on the vector h_t = f_θ(w_{t−1}, …, w_1)
12
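A minimal PyTorch sketch of this recalled idea (the module name, nn.RNN, and the sizes are illustrative assumptions, not the lecture's exact model):

```python
import torch.nn as nn

class RNNLM(nn.Module):
    """Sketch of the key idea: h_t = f_theta(w_{t-1}, ..., w_1) is a fixed-length
    vector, and p(w_t | h_t) is a softmax over the vocabulary."""
    def __init__(self, vocab_size, d_embed=64, d_hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_embed)
        self.rnn = nn.RNN(d_embed, d_hidden, batch_first=True)
        self.out = nn.Linear(d_hidden, vocab_size)  # logits; softmax gives p(w | h)

    def forward(self, tokens):          # tokens: (batch, T) integer word ids
        e = self.embed(tokens)          # (batch, T, d_embed)
        h, _ = self.rnn(e)              # h[:, t] summarizes the prefix up to position t
        return self.out(h)              # (batch, T, vocab_size) next-word logits
```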
RNNs and Forgetting
13
Long Short-Term Memory (LSTM)
Motivation:
• Standard RNNs have trouble learning long
distance dependencies
• LSTMs combat this issue
y1 y2 … yT-1 yT
h1 h2 … hT-1 hT
x1 x2 … xT-1 xT
15
Long Short-Term Memory (LSTM)
Motivation:
• Vanishing gradient problem for Standard RNNs
• Figure shows sensitivity (darker = more sensitive) to the input at
time t=1
16
Figure from (Graves, 2012)
Long Short-Term Memory (LSTM)
Motivation:
• LSTM units have a rich internal structure
• The various “gates” determine the propagation of information
and can choose to “remember” or “forget” information
17
Figure from (Graves, 2012), Figure 4.4: Preservation of gradient information by LSTM.
Long Short-Term Memory (LSTM)
y1 y2 y3 y4
[Figure: a deep bidirectional LSTM unrolled over inputs x1 … x4 with outputs y1 … y4. The slide background is an excerpt from Graves et al. (repeated four times in the original extraction, reproduced once below); panel captions: Fig. 1. Long Short-term Memory Cell; Fig. 2. Bidirectional Recurrent Neural Network.]
Given an input sequence x = (x_1, …, x_T), a standard recurrent neural network (RNN) computes the hidden vector sequence h = (h_1, …, h_T) and output vector sequence y = (y_1, …, y_T) by iterating the following equations from t = 1 to T:
h_t = H(W_xh x_t + W_hh h_{t−1} + b_h)    (1)
y_t = W_hy h_t + b_y    (2)
where the W terms denote weight matrices (e.g. W_xh is the input-hidden weight matrix), the b terms denote bias vectors (e.g. b_h is the hidden bias vector) and H is the hidden layer function.
H is usually an elementwise application of a sigmoid function. However, the Long Short-Term Memory (LSTM) architecture [11], which uses purpose-built memory cells to store information, is better at finding and exploiting long range context. For the version of LSTM used in this paper [12], H is implemented by the following composite function:
i_t = σ(W_xi x_t + W_hi h_{t−1} + W_ci c_{t−1} + b_i)    (3)
f_t = σ(W_xf x_t + W_hf h_{t−1} + W_cf c_{t−1} + b_f)    (4)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ tanh(W_xc x_t + W_hc h_{t−1} + b_c)    (5)
o_t = σ(W_xo x_t + W_ho h_{t−1} + W_co c_t + b_o)    (6)
h_t = o_t ⊙ tanh(c_t)    (7)
where σ is the logistic sigmoid function, and i, f, o and c are respectively the input gate, forget gate, output gate and cell activation vectors, all of which are the same size as the hidden vector h. The weight matrices from the cell to gate vectors (e.g. W_ci) are diagonal, so element m in each gate vector only receives input from element m of the cell vector.
One shortcoming of conventional RNNs is that they are only able to make use of previous context. In speech recognition, where whole utterances are transcribed at once, there is no reason not to exploit future context as well. Bidirectional RNNs (BRNNs) [13] do this by processing the data in both directions with two separate hidden layers, which are then fed forwards to the same output layer. Combining BRNNs with LSTM gives bidirectional LSTM [14], which can access long-range context in both input directions.
A crucial element of the recent success of hybrid systems is the use of deep architectures, which are able to build up progressively higher level representations of acoustic data. Deep RNNs can be created by stacking multiple RNN hidden layers on top of each other, with the output sequence of one layer forming the input sequence for the next. Assuming the same hidden layer function is used for all N layers in the stack, the hidden vector sequences h^n are iteratively computed from n = 1 to N and t = 1 to T:
h^n_t = H(W_{h^{n−1} h^n} h^{n−1}_t + W_{h^n h^n} h^n_{t−1} + b^n_h)    (11)
where we define h^0 = x. The network outputs y_t are
y_t = W_{h^N y} h^N_t + b_y    (12)
Deep bidirectional RNNs can be implemented by replacing each hidden sequence h^n with forward and backward sequences.
x1 x2 x3 x4
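A direct PyTorch reading of the composite function in Eqs. (3)–(7), as a sketch (the class name, the packed 4-way linear layers, and zero-initialized peephole vectors are implementation assumptions):

```python
import torch
import torch.nn as nn

class LSTMCellSketch(nn.Module):
    """One step of the LSTM composite function H from Eqs. (3)-(7).
    Peephole weights (W_ci, W_cf, W_co) are diagonal, so they are stored
    as vectors and applied elementwise."""
    def __init__(self, d_in, d_hidden):
        super().__init__()
        self.W_x = nn.Linear(d_in, 4 * d_hidden)                   # W_xi, W_xf, W_xc, W_xo (+ biases)
        self.W_h = nn.Linear(d_hidden, 4 * d_hidden, bias=False)   # W_hi, W_hf, W_hc, W_ho
        self.w_ci = nn.Parameter(torch.zeros(d_hidden))            # diagonal peepholes
        self.w_cf = nn.Parameter(torch.zeros(d_hidden))
        self.w_co = nn.Parameter(torch.zeros(d_hidden))

    def forward(self, x_t, h_prev, c_prev):
        zi, zf, zc, zo = (self.W_x(x_t) + self.W_h(h_prev)).chunk(4, dim=-1)
        i_t = torch.sigmoid(zi + self.w_ci * c_prev)          # Eq. (3)
        f_t = torch.sigmoid(zf + self.w_cf * c_prev)          # Eq. (4)
        c_t = f_t * c_prev + i_t * torch.tanh(zc)             # Eq. (5)
        o_t = torch.sigmoid(zo + self.w_co * c_t)             # Eq. (6)
        h_t = o_t * torch.tanh(c_t)                           # Eq. (7)
        return h_t, c_t
```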
But…
1. They still have difficulty with long-range dependencies
2. Their computation is inherently serial, so it can't be easily parallelized on a GPU
3. Even though they (mostly) solve the vanishing gradient problem, they can still suffer from exploding gradients
24
Transformer Language Models
MODEL: GPT
25
Attention
x'_4 = Σ_{j=1}^{4} a_{4,j} v_j
softmax
s_{4,1} s_{4,2} s_{4,3} s_{4,4}
v1 v2 v3 v4
26
Attention
x'_1 = Σ_{j=1}^{1} a_{1,j} v_j
a_{1,1}
softmax
s_{1,1}
v1
27
Attention
x'_2 = Σ_{j=1}^{2} a_{2,j} v_j
a_{2,1} a_{2,2}
softmax
s_{2,1} s_{2,2}
v1 v2
28
Attention
x'_3 = Σ_{j=1}^{3} a_{3,j} v_j
softmax
s_{3,1} s_{3,2} s_{3,3}
v1 v2 v3
29
Attention
x'_4 = Σ_{j=1}^{4} a_{4,j} v_j
softmax
s_{4,1} s_{4,2} s_{4,3} s_{4,4}
v1 v2 v3 v4
30
Attention
x1' x2' x3' x4'
x'_t = Σ_{j=1}^{t} a_{t,j} v_j
31
Scaled Dot-Product Attention
x'_4 = Σ_{j=1}^{4} a_{4,j} v_j
softmax
s_{4,1} s_{4,2} s_{4,3} s_{4,4}
W_v:  v1 v2 v3 v4    v_j = W_v^T x_j   (values)
x1 x2 x3 x4
32
Scaled Dot-Product Attention
x'_4 = Σ_{j=1}^{4} a_{4,j} v_j
softmax
s_{4,1} s_{4,2} s_{4,3} s_{4,4}
W_k:  k1 k2 k3 k4    k_j = W_k^T x_j   (keys)
W_v:  v1 v2 v3 v4    v_j = W_v^T x_j   (values)
x1 x2 x3 x4
33
Scaled Dot-Product Attention
x'_4 = Σ_{j=1}^{4} a_{4,j} v_j
softmax
s_{4,1} s_{4,2} s_{4,3} s_{4,4}
W_q:  q1 q2 q3 q4    q_j = W_q^T x_j   (queries)
W_k:  k1 k2 k3 k4    k_j = W_k^T x_j   (keys)
W_v:  v1 v2 v3 v4    v_j = W_v^T x_j   (values)
x1 x2 x3 x4
34
Scaled Dot-Product Attention
x'_4 = Σ_{j=1}^{4} a_{4,j} v_j
softmax
s_{4,1} s_{4,2} s_{4,3} s_{4,4}    s_{4,j} = k_j^T q_4 / √d_k   (scores)
W_q:  q1 q2 q3 q4    q_j = W_q^T x_j   (queries)
W_k:  k1 k2 k3 k4    k_j = W_k^T x_j   (keys)
W_v:  v1 v2 v3 v4    v_j = W_v^T x_j   (values)
x1 x2 x3 x4
35
Scaled Dot-Product Attention
x'_4 = Σ_{j=1}^{4} a_{4,j} v_j
softmax
s_{4,1} s_{4,2} s_{4,3} s_{4,4}    s_{4,j} = k_j^T q_4 / √d_k   (scores)
W_q:  q1 q2 q3 q4    q_j = W_q^T x_j   (queries)
W_k:  k1 k2 k3 k4    k_j = W_k^T x_j   (keys)
W_v:  v1 v2 v3 v4    v_j = W_v^T x_j   (values)
x1 x2 x3 x4
36
Scaled Dot-Product Attention
x'_4 = Σ_{j=1}^{4} a_{4,j} v_j
softmax
s_{4,1} s_{4,2} s_{4,3} s_{4,4}    s_{4,j} = k_j^T q_4 / √d_k   (scores)
W_q:  q1 q2 q3 q4    q_j = W_q^T x_j   (queries)
W_k:  k1 k2 k3 k4    k_j = W_k^T x_j   (keys)
W_v:  v1 v2 v3 v4    v_j = W_v^T x_j   (values)
(the boxed computation above is labeled "attention")
x1 x2 x3 x4
37
Scaled Dot-Product Attention
x1' x2' x3' x4'
x'_t = Σ_{j=1}^{t} a_{t,j} v_j
W_q
s_{t,j} = k_j^T q_t / √d_k   (scores)
k_j = W_k^T x_j   (keys)
W_v
v_j = W_v^T x_j   (values)
x1 x2 x3 x4
38
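A minimal PyTorch sketch of these per-timestep formulas (the function name and shapes are assumptions: x is a (T, d_model) matrix of input embeddings, and W_q, W_k, W_v are (d_model, d_k) matrices):

```python
import torch

def single_query_attention(x, W_q, W_k, W_v, t):
    """Scaled dot-product attention output x'_t for one query position t,
    following the slide's per-timestep formulas (a direct, unvectorized sketch)."""
    d_k = W_k.shape[1]
    q_t = x[t] @ W_q                       # q_t = W_q^T x_t
    keys = x[: t + 1] @ W_k                # k_j = W_k^T x_j, for j = 1 .. t
    values = x[: t + 1] @ W_v              # v_j = W_v^T x_j
    scores = keys @ q_t / d_k**0.5         # s_{t,j} = k_j^T q_t / sqrt(d_k)
    a_t = torch.softmax(scores, dim=0)     # attention weights a_{t,j}
    return a_t @ values                    # x'_t = sum_j a_{t,j} v_j
```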
Recall: Animation of 3D Convolution
http://cs231n.github.io/convolutional-networks/
39
Figure from Fei-Fei Li & Andrej Karpathy & Justin Johnson (CS231N)
Multi-headed Attention
x1 x2 x3 x4
40
Multi-headed Attention
• To ensure the dimension of the input embedding x_t is the same as the output embedding x_t', Transformers usually choose the embedding sizes and number of heads appropriately:
• d_model = dim. of inputs
• d_k = dim. of each output
• h = # of heads
• Choose d_k = d_model / h
• Then concatenate the outputs
• Just as we can have multiple channels in a convolution layer, we can use multiple heads in an attention layer
• Each head gets its own parameters
• We can concatenate all the outputs to get a single vector for each time step
x1' x2' x3' x4'
W_q
W_k    multi-headed attention
W_v
x1 x2 x3 x4
41
Multi-headed Attention
• To ensure the dimension of the input embedding x_t is the same as the output embedding x_t', Transformers usually choose the embedding sizes and number of heads appropriately:
• d_model = dim. of inputs
• d_k = dim. of each output
• h = # of heads
• Choose d_k = d_model / h
• Then concatenate the outputs
• Just as we can have multiple channels in a convolution layer, we can use multiple heads in an attention layer
• Each head gets its own parameters
• We can concatenate all the outputs to get a single vector for each time step
x1' x2' x3' x4'
W_q
W_k    multi-headed attention
W_v
x1 x2 x3 x4
42
Recall: RNN Language Model
h1 h2 h3 h4 h5 h6 h7
Key Idea:
(1) convert all previous words to a fixed-length vector
(2) define a distribution p(w_t | f_θ(w_{t−1}, …, w_1)) that conditions on the vector h_t = f_θ(w_{t−1}, …, w_1)
43
Transformer Language Model
The bat made noise
p(w1|h1) p(w2|h2) p(w3|h3) p(w4|h4)
h1 h2 h3 h4
x1 x2 x3 x4 …
Each layer of a Transformer LM consists of several sublayers:
1. attention
2. feed-forward neural network
3. layer normalization
4. residual connections
Each hidden vector looks back at the hidden vectors of the current and previous timesteps in the previous layer.
The language model part is just like an RNN-LM!
Important!
• RNN computation graph grows linearly with the number of input tokens
• Transformer-LM computation graph grows quadratically with the number of input tokens
44
Transformer Language Model
The bat made noise
p(w1|h1) p(w2|h2) p(w3|h3) p(w4|h4)
h1 h2 h3 h4
x1 x2 x3 x4 …
Each layer of a Transformer LM consists of several sublayers:
1. attention
2. feed-forward neural network
3. layer normalization
4. residual connections
Each hidden vector looks back at the hidden vectors of the current and previous timesteps in the previous layer.
The language model part is just like an RNN-LM!
Important!
• RNN computation graph grows linearly with the number of input tokens
• Transformer-LM computation graph grows quadratically with the number of input tokens
45
Layer Normalization
• The Problem: internal covariate shift occurs during training of a deep network when a small change in the low layers amplifies into a large change in the high layers
• One Solution: Layer Normalization
Given input a ∈ R^K, LayerNorm computes output b ∈ R^K:
b = γ ⊙ (a − µ)/σ ⊕ β
where we have mean µ = (1/K) Σ_{k=1}^{K} a_k and standard deviation σ = sqrt((1/K) Σ_{k=1}^{K} (a_k − µ)²), and ⊙, ⊕ denote elementwise multiplication and addition.
47
Figure from https://arxiv.org/pdf/1607.06450.pdf
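A sketch of that formula applied over the last dimension (the eps term is an added numerical-stability assumption, not part of the slide's formula):

```python
import torch

def layer_norm(a, gamma, beta, eps=1e-5):
    """LayerNorm: b = gamma * (a - mu) / sigma + beta, per the slide's formula."""
    mu = a.mean(dim=-1, keepdim=True)                            # mu = (1/K) sum_k a_k
    sigma = a.var(dim=-1, keepdim=True, unbiased=False).sqrt()   # population std dev
    return gamma * (a - mu) / (sigma + eps) + beta

# Usage sketch: normalize a batch of K = 8 dimensional activations
a = torch.randn(3, 8)
b = layer_norm(a, gamma=torch.ones(8), beta=torch.zeros(8))
```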
Residual Connections
• The Problem: as network depth grows very large, a performance degradation occurs that is not explained by overfitting (i.e. train / test error both worsen)
• One Solution: Residual connections pass a copy of the input alongside another function so that information can flow more directly
• These residual connections allow for effective training of very deep networks that perform better than their shallower (though still deep) counterparts
Plain Connection:     b = f(a)
Residual Connection:  b' = f(a),  b = b' + a
48
Figure from https://arxiv.org/pdf/1512.03385.pdf
Residual Connections
• The Problem: as network depth grows very large, a performance degradation occurs that is not explained by overfitting (i.e. train / test error both worsen)
• One Solution: Residual connections pass a copy of the input alongside another function so that information can flow more directly
• These residual connections allow for effective training of very deep networks that perform better than their shallower (though still deep) counterparts
Plain Connection:     b = f(a)
Residual Connection:  b = f(a) + a
Why are residual connections helpful?
Instead of f(a) having to learn a full transformation of a, f(a) only needs to learn an additive modification of a (i.e. the residual).
49
Figure from https://arxiv.org/pdf/1512.03385.pdf
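A sketch of a residual wrapper around an arbitrary sublayer f (the module name and sizes are illustrative):

```python
import torch
import torch.nn as nn

class Residual(nn.Module):
    """Wraps any sublayer f so the block computes f(a) + a: the sublayer only
    has to learn the additive modification (the residual) of its input."""
    def __init__(self, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer

    def forward(self, a):
        return self.sublayer(a) + a   # b = f(a) + a

# Usage sketch: a residual feed-forward block
ff = Residual(nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 64)))
b = ff(torch.randn(2, 64))
```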
Transformer Layer
Each layer of a Transformer LM consists of several sublayers:
1. attention
2. feed-forward neural network
3. layer normalization
4. residual connections
[Figure, read bottom-up: x1 x2 x3 x4 → multi-headed attention (W_q, W_k, W_v) → residual connections → layer normalization → feed-forward neural network → residual connections → layer normalization → x1' x2' x3' x4']
50
Transformer Layer
Each layer of a Transformer LM consists of several sublayers:
1. attention
2. feed-forward neural network
3. layer normalization
4. residual connections
[Figure, read bottom-up: x1 x2 x3 x4 → multi-headed attention (W_q, W_k, W_v) → residual connections → layer normalization → feed-forward neural network → residual connections → layer normalization → x1' x2' x3' x4'; the whole stack is grouped into a single "Transformer Layer" box]
51
Transformer Layer
Each layer of a Transformer LM consists of several sublayers:
1. attention
2. feed-forward neural network
3. layer normalization
4. residual connections
[Figure, read bottom-up: x1 x2 x3 x4 → multi-headed attention (W_q, W_k, W_v) → residual connections → layer normalization → feed-forward neural network → residual connections → layer normalization → x1' x2' x3' x4'; the whole stack is grouped into a single "Transformer Layer" box]
52
Transformer Layer
Each layer of a Transformer LM
consists of several sublayers:
1. attention
2. feed-forward neural network
3. layer normalization
4. residual connections
Transformer layer
x1 x2 x3 x4
53
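A compact sketch of how those four sublayers compose into one layer (the post-LN ordering, use of nn.MultiheadAttention, and the sizes are assumptions; causal masking is covered a few slides later and omitted here):

```python
import torch.nn as nn

class TransformerLayerSketch(nn.Module):
    """One Transformer LM layer built from the four sublayers the slide lists:
    attention, feed-forward network, layer normalization, residual connections."""
    def __init__(self, d_model=64, n_heads=4, d_ff=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):                 # x: (batch, T, d_model)
        a, _ = self.attn(x, x, x)         # 1. (multi-headed) attention
        x = self.ln1(x + a)               # 4. residual connection + 3. layer norm
        x = self.ln2(x + self.ff(x))      # 2. feed-forward, again with residual + norm
        return x
```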
Transformer Language Model
The bat made noise
p(w1|h1) p(w2|h2) p(w3|h3) p(w4|h4)
h1 h2 h3 h4
Transformer layer
Transformer layer
Transformer layer
x1 x2 x3 x4 …
Each layer of a Transformer LM consists of several sublayers:
1. attention
2. feed-forward neural network
3. layer normalization
4. residual connections
Each hidden vector looks back at the hidden vectors of the current and previous timesteps in the previous layer.
The language model part is just like an RNN-LM.
54
Position Embeddings
• The Problem: Because attention is position invariant, we need a way to learn about positions
• The Solution: Use (or learn) a collection of position-specific embeddings: p_t represents what it means to be in position t. And add this to the word embedding w_t.
The key idea is that every word that appears in position t uses the same position embedding p_t
• There are a number of varieties of position embeddings:
– Some are fixed (based on sine and cosine), whereas others are learned (like word embeddings)
– Some are absolute (as described above) but we can also use relative position embeddings (i.e. relative to the position of the query vector)
p(w1|h1) p(w2|h2) p(w3|h3) p(w4|h4)
h1 h2 h3 h4
Transformer layer
Transformer layer
Transformer layer
w1+p1  w2+p2  w3+p3  w4+p4  …
56
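A sketch of the learned, absolute variant of x_t = w_t + p_t (the class name and sizes are assumptions; sinusoidal and relative variants exist, as the slide notes):

```python
import torch
import torch.nn as nn

class InputEmbeddingSketch(nn.Module):
    """Token embedding plus a learned absolute position embedding: x_t = w_t + p_t."""
    def __init__(self, vocab_size, max_len, d_model):
        super().__init__()
        self.word = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)

    def forward(self, tokens):                              # tokens: (batch, T)
        T = tokens.size(1)
        positions = torch.arange(T, device=tokens.device)   # 0, 1, ..., T-1
        return self.word(tokens) + self.pos(positions)      # broadcast over the batch
```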
GPT-3
• GPT stands for Generative Pre-trained Transformer
• GPT is just a Transformer LM, but with a huge number of
parameters
59
IMPLEMENTING A TRANSFORMER LM
60
Matrix Version of Single-Headed Attention
x'_4 = Σ_{j=1}^{4} a_{4,j} v_j
softmax
s_{4,1} s_{4,2} s_{4,3} s_{4,4}    s_{4,j} = k_j^T q_4 / √d_k   (scores)
W_q:  q1 q2 q3 q4    q_j = W_q^T x_j   (queries)
W_k:  k1 k2 k3 k4    k_j = W_k^T x_j   (keys)
W_v:  v1 v2 v3 v4    v_j = W_v^T x_j   (values)
x1 x2 x3 x4
61
Matrix Version of Single-Headed Attention
x'_4 = Σ_{j=1}^{4} a_{4,j} v_j
a_{4,1} a_{4,2} a_{4,3} a_{4,4}    a_4 = softmax(s_4)   (attention weights)
softmax
s_{4,1} s_{4,2} s_{4,3} s_{4,4}    s_{4,j} = k_j^T q_4 / √d_k   (scores)
W_q:  q1 q2 q3 q4    q_j = W_q^T x_j   (queries)
W_k:  k1 k2 k3 k4    k_j = W_k^T x_j   (keys)
W_v:  v1 v2 v3 v4    v_j = W_v^T x_j   (values)
x1 x2 x3 x4
62
Matrix Version of Single-Headed Attention
• For speed, we compute all the queries at once using matrix operations
• First we pack the queries, keys, and values into matrices
• Then we compute all the queries at once
X' = AV = softmax(QK^T / √d_k) V
a_{4,1} a_{4,2} a_{4,3} a_{4,4}    A = [a_1, …, a_4]^T = softmax(S)
softmax
s_{4,1} s_{4,2} s_{4,3} s_{4,4}    S = [s_1, …, s_4]^T = QK^T / √d_k
W_q:  q1 q2 q3 q4    Q = [q_1, …, q_4]^T = XW_q
W_k:  k1 k2 k3 k4    K = [k_1, …, k_4]^T = XW_k
W_v:  v1 v2 v3 v4    V = [v_1, …, v_4]^T = XW_v
x1 x2 x3 x4          X = [x_1, …, x_4]^T
63
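A vectorized sketch of exactly these matrix formulas (the function name and shapes are assumptions: X is (T, d_model), the weight matrices are (d_model, d_k)):

```python
import torch

def single_head_attention(X, W_q, W_k, W_v):
    """Matrix form of single-headed attention:
    X' = softmax(Q K^T / sqrt(d_k)) V, with Q = X W_q, K = X W_k, V = X W_v."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v       # pack all queries/keys/values at once
    d_k = K.shape[-1]
    S = Q @ K.transpose(-2, -1) / d_k**0.5    # scores, one row per query
    A = torch.softmax(S, dim=-1)              # attention weights (row-wise softmax)
    return A @ V                              # X'
```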
Matrix Version of Single-Headed Attention
• For speed, we compute all the queries at once using matrix operations
• First we pack the queries, keys, and values into matrices
• Then we compute all the queries at once
X' = AV = softmax(QK^T / √d_k) V
A = [a_1, …, a_4]^T = softmax(S)
softmax
S = [s_1, …, s_4]^T = QK^T / √d_k
W_q:  q1 q2 q3 q4    Q = [q_1, …, q_4]^T = XW_q
W_k:  k1 k2 k3 k4    K = [k_1, …, k_4]^T = XW_k
W_v:  v1 v2 v3 v4    V = [v_1, …, v_4]^T = XW_v
x1 x2 x3 x4          X = [x_1, …, x_4]^T
64
Matrix Version of Single-Headed Attention
Holy cow, that's a lot of new arrows… do we always want/need all of those?
• Suppose we're training our transformer to predict the next token(s) given the input…
• … then attending to tokens that come after the current token is cheating!
So what is this model?
• This version is the standard Transformer block. (more on this later!)
• But we want the Transformer LM block
• And that requires masking!
X' = AV = softmax(QK^T / √d_k) V
A = [a_1, …, a_4]^T = softmax(S)
softmax
S = [s_1, …, s_4]^T = QK^T / √d_k
W_q:  q1 q2 q3 q4    Q = [q_1, …, q_4]^T = XW_q
W_k:  k1 k2 k3 k4    K = [k_1, …, k_4]^T = XW_k
W_v:  v1 v2 v3 v4    V = [v_1, …, v_4]^T = XW_v
x1 x2 x3 x4          X = [x_1, …, x_4]^T
65
Matrix Version of Single-Headed Attention
X' = AV = softmax(QK^T / √d_k) V
x1 x2 x3 x4
X = [x_1, …, x_4]^T
66
Matrix Version of Single-Headed (Causal) Attention
Insight: if some element in the input to the softmax is −∞, then the corresponding output is 0!
Question: For a causal LM, which is the correct mask matrix M?
A:
M = [  0    0    0    0
      −∞    0    0    0
      −∞   −∞    0    0
      −∞   −∞   −∞    0 ]
B:
M = [  0   −∞   −∞   −∞
       0    0   −∞   −∞
       0    0    0   −∞
       0    0    0    0 ]
C:
M = [  0   −∞   −∞   −∞
      −∞    0   −∞   −∞
      −∞   −∞    0   −∞
      −∞   −∞   −∞    0 ]
In practice, the attention weights are computed for all time steps T, then we mask out (by setting to −∞) all the inputs to the softmax that are for the timesteps to the right of the query.
X' = AV = softmax(QK^T / √d_k + M) V
A_causal = softmax(S + M)
S = QK^T / √d_k,   Q = XW_q,   K = XW_k,   V = XW_v
x1 x2 x3 x4
X = [x_1, …, x_4]^T
Answer:
67
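A sketch of the masked computation (the mask here has 0 on and below the diagonal and −∞ strictly above it, matching the slide's in-practice description of masking timesteps to the right of the query; names and shapes are assumptions):

```python
import torch

def causal_attention(X, W_q, W_k, W_v):
    """Single-headed causal attention: scores for future positions are set to
    -inf before the softmax, so their attention weights become 0."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    S = Q @ K.transpose(-2, -1) / d_k**0.5
    T = S.shape[-1]
    M = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)  # 0 on/below diag, -inf above
    A = torch.softmax(S + M, dim=-1)    # A_causal = softmax(S + M)
    return A @ V
```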
Matrix Version of Multi-Headed (Causal) Attention
x1' x2' x3' x4'
X'^(i) = softmax( Q^(i) (K^(i))^T / √d_k + M ) V^(i)
W_q^(1) W_q^(2) W_q^(3)    Q^(i) = XW_q^(i)
W_k^(1) W_k^(2) W_k^(3)    K^(i) = XW_k^(i)    multi-headed attention
W_v^(1) W_v^(2) W_v^(3)    V^(i) = XW_v^(i)
x1 x2 x3 x4
X = [x_1, …, x_4]^T
68
Matrix Version of Multi-Headed (Causal) Attention
X' = concat(X'^(1), …, X'^(h))
x1' x2' x3' x4'
X'^(i) = softmax( Q^(i) (K^(i))^T / √d_k + M ) V^(i)
W_q^(1) W_q^(2) W_q^(3)    Q^(i) = XW_q^(i)
W_k^(1) W_k^(2) W_k^(3)    K^(i) = XW_k^(i)    multi-headed attention
W_v^(1) W_v^(2) W_v^(3)    V^(i) = XW_v^(i)
x1 x2 x3 x4
X = [x_1, …, x_4]^T
69
Matrix Version of Multi-Headed (Causal) Attention
Recall: To ensure the dimension of the input embedding x_t is the same as the output embedding x_t', Transformers usually choose the embedding sizes and number of heads appropriately:
• d_model = dim. of inputs
• d_k = dim. of each output
• h = # of heads
• Choose d_k = d_model / h
X' = concat(X'^(1), …, X'^(h))
x1' x2' x3' x4'
X'^(i) = softmax( Q^(i) (K^(i))^T / √d_k + M ) V^(i)
W_q    Q^(i) = XW_q^(i)
W_k    K^(i) = XW_k^(i)
W_v    V^(i) = XW_v^(i)
x1 x2 x3 x4
X = [x_1, …, x_4]^T
70
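Putting the pieces together, a sketch of the per-head computation and the concatenation step (the explicit per-head loop and the tiny d_model = 8, h = 2 example are illustrative assumptions):

```python
import torch

def multi_head_causal_attention(X, W_q, W_k, W_v):
    """Multi-headed causal attention: run masked scaled dot-product attention
    once per head with that head's parameters, then concatenate the outputs.
    W_q, W_k, W_v are lists of per-head (d_model x d_k) weight matrices."""
    T = X.shape[0]
    M = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)   # causal mask
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(W_q, W_k, W_v):
        Q, K, V = X @ Wq_i, X @ Wk_i, X @ Wv_i
        d_k = K.shape[-1]
        A = torch.softmax(Q @ K.T / d_k**0.5 + M, dim=-1)   # softmax(QK^T/sqrt(d_k) + M)
        heads.append(A @ V)                                  # X'^(i)
    return torch.cat(heads, dim=-1)                          # X' = concat(X'^(1), ..., X'^(h))

# Usage sketch: d_model = 8, h = 2 heads, so d_k = d_model / h = 4
d_model, h = 8, 2
X = torch.randn(5, d_model)                                  # T = 5 input embeddings
W_q, W_k, W_v = ([torch.randn(d_model, d_model // h) for _ in range(h)] for _ in range(3))
X_out = multi_head_causal_attention(X, W_q, W_k, W_v)        # shape (5, d_model)
```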