
Transformers

Chris Watkins
Programming with gradients

Set up a ‘model’ with:
• parameters, also known as trainable weights
• a forward function, which computes output from input
• a loss function

    import torch
    from torch import nn

    class MyModel( nn.Module ):

        def __init__( self ):
            super().__init__()
            self.weights1 = nn.Linear( 2, 5 )
            self.relu = nn.ReLU()
            self.weights2 = nn.Linear( 5, 3 )

        def forward( self, x ):
            x = self.weights1( x )
            x = self.relu( x )
            x = self.weights2( x )
            return x
Setting up the network for learning

    mymodel = MyModel()                      # set up my model, initialize weights
    all_my_weights = mymodel.parameters()    # get all the weights

    # put the weights in the optimizer
    my_favourite_optimizer = torch.optim.AdamW( all_my_weights )
Learning

The learning loop of every neural network:

1. Randomly select X (input) and Y (target) from the data.
2. output = mymodel.forward(X)
   (this is the ‘forward pass’)
3. loss = loss_function( output, Y )
   (loss is just one number)
4. loss.backward()
   (this computes all gradients of the loss with respect to the weights; the
   gradients are attached to the weight tensors. This is ‘back-propagation’.)
5. optimizer.step()
   (the optimizer has the weight tensors and the gradients; it now uses the
   gradients to slightly change the weights)
6. Zero all stored gradients.
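
A minimal sketch of this loop in PyTorch, using MyModel from the earlier slide; the mean-squared-error loss and the random training data are illustrative assumptions, not part of the slides:

    import torch
    from torch import nn

    mymodel = MyModel()                                # model from the earlier slide
    loss_function = nn.MSELoss()                       # assumed loss, for illustration
    optimizer = torch.optim.AdamW( mymodel.parameters() )

    # made-up data: 100 examples with 2-dimensional inputs and 3-dimensional targets
    X_data = torch.randn( 100, 2 )
    Y_data = torch.randn( 100, 3 )

    for step in range( 1000 ):
        idx = torch.randint( 0, len(X_data), (16,) )   # randomly select X and Y from data
        X, Y = X_data[idx], Y_data[idx]

        output = mymodel( X )                          # the ‘forward pass’
        loss = loss_function( output, Y )              # loss is just one number

        optimizer.zero_grad()                          # zero all stored gradients
        loss.backward()                                # back-propagation: compute all gradients
        optimizer.step()                               # use gradients to slightly change weights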
GPT transformer design overview

[Figure: input tokens (The, fat, cat, eats, ...) each produce a probability distribution over possible next tokens (big, fat, eats, dog, white, cat, …). Each token has a sequence of embedding vectors going through the transformer network (embed, transform, mix, process), which finally produces predictions of next tokens.]
Exploded view of a transformer head applied at token i in GPT

For each token j up to and including i, with embedding vector u_j:

    q_i = Q u_i        k_j = K u_j        v_j = V u_j

    y_ij = q_i . k_j                                   (attention scores)
    p_i0, …, p_ii = softmax( y_i0, …, y_ii )           (mixing proportions)

    output_i = p_i0 v_0 + … + p_ii v_i

The mixed output then goes through some feedforward neural network.
Note that we only mix with previous tokens in the sequence.
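
A minimal sketch of one such head in PyTorch; the dimensions and the class name are made up for illustration, and the score scaling by 1/sqrt(d_head) used in real GPT heads is included even though the slide omits it:

    import math
    import torch
    from torch import nn

    class OneAttentionHead( nn.Module ):
        # one causal self-attention head; d_model and d_head are illustrative sizes
        def __init__( self, d_model=64, d_head=16 ):
            super().__init__()
            self.Q = nn.Linear( d_model, d_head, bias=False )   # q_i = Q u_i
            self.K = nn.Linear( d_model, d_head, bias=False )   # k_j = K u_j
            self.V = nn.Linear( d_model, d_head, bias=False )   # v_j = V u_j

        def forward( self, u ):              # u: (seq_len, d_model) embedding vectors
            q, k, v = self.Q(u), self.K(u), self.V(u)
            y = q @ k.T / math.sqrt( k.shape[-1] )              # y_ij = q_i . k_j (scaled)
            # only mix with previous tokens: mask out j > i before the softmax
            mask = torch.triu( torch.ones_like(y, dtype=torch.bool), diagonal=1 )
            y = y.masked_fill( mask, float('-inf') )
            p = torch.softmax( y, dim=-1 )                      # mixing proportions p_ij
            return p @ v                                        # p_i0 v_0 + … + p_ii v_i

Calling OneAttentionHead() on a (seq_len, d_model) tensor returns one mixed vector per token, which would then go through the feedforward network.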
Travelling between symbol world and vector world

Symbol to vector: look up the token (e.g. ‘cat’) in the embedding table, which gives its stored (and trained) embedding vector.

Vector to symbol: take an embedding vector (usually the result of much processing) and compute its dot product with each vector in the stored output embedding table. This gives one logit per symbol (cow, cat, dog, …); the logits are turned into symbol probabilities with a softmax.
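
A minimal sketch of both directions in PyTorch; the five-word vocabulary and the embedding size are made up, and the output vector is random, standing in for the result of much processing:

    import torch
    from torch import nn

    vocab = ['cow', 'cat', 'dog', 'eats', 'the']       # made-up vocabulary
    d_model = 8
    embedding = nn.Embedding( len(vocab), d_model )    # stored (and trained) embedding table

    # symbol -> vector: look up ‘cat’ in the embedding table
    cat_id = torch.tensor( vocab.index('cat') )
    cat_vector = embedding( cat_id )                   # shape: (d_model,)

    # vector -> symbol: dot product with every embedding vector gives logits,
    # which softmax turns into symbol probabilities
    output_vector = torch.randn( d_model )             # stand-in for the processed vector
    logits = embedding.weight @ output_vector          # one logit per symbol
    probs = torch.softmax( logits, dim=-1 )            # probabilities over the vocabulary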
Multiple transformer heads

[Figure: token streams 0–3 passing through 3 transformer heads (Head 0, Head 1, Head 2), each with different weights.]

Each transformer head applies the same weights to each token stream.
The transformer heads all have different weights ‘inside’ them.
The outputs of the three transformer heads are concatenated to form the input to the next layer.
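
A minimal sketch of this concatenation, reusing the illustrative OneAttentionHead class from the earlier sketch; three heads follow the slide, the other names and sizes are assumptions:

    import torch
    from torch import nn

    class MultiHead( nn.Module ):
        # several heads with different weights; their outputs are concatenated
        def __init__( self, n_heads=3, d_model=64, d_head=16 ):
            super().__init__()
            self.heads = nn.ModuleList(
                [ OneAttentionHead(d_model, d_head) for _ in range(n_heads) ] )

        def forward( self, u ):                        # u: (seq_len, d_model)
            # each head applies its own weights to every token stream,
            # then the head outputs are concatenated along the feature dimension
            return torch.cat( [head(u) for head in self.heads], dim=-1 )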
Two layers of transformer heads, with three transformer heads in each layer

[Figure: token streams 0–3 pass through Layer 1 (Heads 0–2) and then Layer 2 (Heads 0–2): 6 different transformer heads in total.]
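
A minimal sketch of two stacked layers, reusing the hypothetical MultiHead class above; the linear projection back to d_model between layers is an assumption added so the shapes line up:

    import torch
    from torch import nn

    class TwoLayers( nn.Module ):
        # two layers of three heads each: 6 different transformer heads in total
        def __init__( self, d_model=64, d_head=16, n_heads=3 ):
            super().__init__()
            self.layer1 = MultiHead( n_heads, d_model, d_head )
            self.proj1 = nn.Linear( n_heads * d_head, d_model )   # back to width d_model
            self.layer2 = MultiHead( n_heads, d_model, d_head )
            self.proj2 = nn.Linear( n_heads * d_head, d_model )

        def forward( self, u ):                        # u: (seq_len, d_model)
            u = self.proj1( self.layer1(u) )
            u = self.proj2( self.layer2(u) )
            return u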
What have I left out?

Surely it can’t be so simple.

Well, I’ve only left out layer normalization and skip-connections…

It really is this simple.

But GPT-3 is very big:

• 96 attention layers
• Various embedding dimensions, up to 12,288
• 175,000,000,000 trainable weights
• Batch size of several million tokens
Why do transformers work so well?

Transformers have been the dominant NN architecture (for large problems) since 2017.

There is no general agreement or precise theory on why even small transformers work for non-NLP problems.

There is no agreement as to why very large and deep transformers work so well in language modelling and question answering.
