Transformer Structure
Chris Watkins
Programming with gradients
Set up a ‘model’ with:
- parameters, also known as trainable weights
- forward function, which computes output from input
- loss function

import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.weights1 = nn.Linear(2, 5)   # first layer of trainable weights: 2 inputs -> 5 hidden units
        self.relu = nn.ReLU()
        self.weights2 = nn.Linear(5, 3)   # second layer of trainable weights: 5 hidden units -> 3 outputs

    def forward(self, x):                 # forward function: computes output from input
        x = self.weights1(x)
        x = self.relu(x)
        x = self.weights2(x)
        return x
Setting up the network for learning
mymodel = MyModel()                       # set up my model, initialize weights
all_my_weights = mymodel.parameters()     # get all the weights
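As a rough sketch of how learning then proceeds (the loss function and optimizer below are assumptions, not given in the slides; here mean-squared error and plain SGD):

import torch
import torch.nn as nn

mymodel = MyModel()                                          # set up my model, initialize weights
optimizer = torch.optim.SGD(mymodel.parameters(), lr=0.01)   # optimizer gets all the weights
loss_fn = nn.MSELoss()                                       # assumed loss function

x = torch.randn(8, 2)                                        # a batch of 8 inputs, 2 features each
target = torch.randn(8, 3)                                   # made-up targets, 3 outputs each

optimizer.zero_grad()                                        # clear old gradients
loss = loss_fn(mymodel(x), target)                           # forward pass + loss
loss.backward()                                              # gradients of loss w.r.t. all weights
optimizer.step()                                             # gradient step: update the trainable weights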
Embedding vectors
[Figure: example tokens (big, fat, eats, dog, white, cat, …) are looked up as embedding vectors, which the network transforms, mixes, and processes.]
Each token has a sequence of embedding vectors going through the transformer network, which finally produces predictions of the next token.
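A minimal sketch of that pipeline, with a made-up 6-word vocabulary and embedding size 16; the transformer blocks themselves are elided, only the embedding lookup and the final next-token prediction are shown:

import torch
import torch.nn as nn

vocab = ['big', 'fat', 'eats', 'dog', 'white', 'cat']
embed = nn.Embedding(len(vocab), 16)                  # one trainable 16-dim vector per token

token_ids = torch.tensor([vocab.index('fat'), vocab.index('cat')])
u = embed(token_ids)                                  # sequence of embedding vectors, shape (2, 16)

# ... the transformer network transforms / mixes / processes these vectors ...

to_logits = nn.Linear(16, len(vocab))                 # map each final vector to logits over the vocabulary
next_token_probs = torch.softmax(to_logits(u), dim=-1)   # prediction of the next token at each position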
Exploded view of transformer head applied at token i in GPT
For each token j ≤ i, with embedding vector u_j:
  query  q_j = Q u_j
  key    k_j = K u_j
  value  v_j = V u_j

Attention scores:    y_ij = q_i · k_j
Mixing proportions:  (p_i0, …, p_ii) = softmax(y_i0, …, y_ii)
Mixed value:         p_i0 v_0 + … + p_ii v_i, which is then passed through a feedforward neural network.

Note that we only mix with previous tokens in the sequence.
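A minimal PyTorch sketch of this computation for one head, with assumed sizes (embedding dimension 16, head dimension 8, 5 tokens); the usual 1/sqrt(d) scaling of the scores is omitted, as on the slide:

import torch
import torch.nn as nn

d_model, d_head, seq_len = 16, 8, 5                   # assumed sizes
Q = nn.Linear(d_model, d_head, bias=False)            # trainable query matrix
K = nn.Linear(d_model, d_head, bias=False)            # trainable key matrix
V = nn.Linear(d_model, d_head, bias=False)            # trainable value matrix

u = torch.randn(seq_len, d_model)                     # embedding vectors u_0 .. u_4
q, k, v = Q(u), K(u), V(u)

i = 3                                                 # position of the current token
y = q[i] @ k[: i + 1].T                               # scores y_i0 .. y_ii (previous tokens only)
p = torch.softmax(y, dim=-1)                          # mixing proportions p_i0 .. p_ii
mixed = p @ v[: i + 1]                                # p_i0 v_0 + ... + p_ii v_i
# 'mixed' is then passed through the feedforward neural network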
Travelling between symbol world and vector world
Symbol to vector: lookup in the embedding table, giving the stored (and trained) embedding vector for ‘cat’.

Vector to symbol: the dot product of an embedding vector (usually the result of much processing) with each stored output embedding vector (one per symbol: cow, cat, dog, …) gives logits, which are turned into symbol probabilities with softmax.
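Both directions in a small sketch, with a toy three-word vocabulary and embedding size 16 (both assumptions); vector-to-symbol takes the dot product of the processed vector with each stored output embedding to get logits, then softmax:

import torch
import torch.nn as nn

vocab = ['cow', 'cat', 'dog']
d = 16

# Symbol to vector: lookup in the stored (and trained) embedding table
input_embeddings = nn.Embedding(len(vocab), d)
u_cat = input_embeddings(torch.tensor(vocab.index('cat')))   # embedding vector for 'cat'

# Vector to symbol: dot product with each stored output embedding vector
output_embeddings = torch.randn(len(vocab), d)        # one stored output vector per symbol
h = torch.randn(d)                                    # vector, usually the result of much processing
logits = output_embeddings @ h                        # one logit per symbol
symbol_probs = torch.softmax(logits, dim=-1)          # symbol probabilities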
Multiple transformer heads
[Figure: 3 transformer heads, each with different weights, applied across tokens 0–3.]
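A sketch of several heads with different weights, reusing the single-head computation above across all positions at once; combining the heads by concatenation followed by a linear layer is a standard choice, assumed here since the slide does not say how the heads’ outputs are merged:

import torch
import torch.nn as nn

d_model, d_head, n_heads, seq_len = 16, 8, 3, 4       # 3 heads over tokens 0..3 (sizes assumed)

heads = nn.ModuleList(
    nn.ModuleDict({'Q': nn.Linear(d_model, d_head, bias=False),
                   'K': nn.Linear(d_model, d_head, bias=False),
                   'V': nn.Linear(d_model, d_head, bias=False)})
    for _ in range(n_heads))                          # each head has its own weights
combine = nn.Linear(n_heads * d_head, d_model)        # assumed: concatenate heads, then mix

u = torch.randn(seq_len, d_model)                     # embedding vectors for tokens 0..3
causal = torch.tril(torch.ones(seq_len, seq_len)).bool()   # only mix with previous tokens

outputs = []
for head in heads:
    q, k, v = head['Q'](u), head['K'](u), head['V'](u)
    scores = (q @ k.T).masked_fill(~causal, float('-inf'))
    p = torch.softmax(scores, dim=-1)                 # mixing proportions for every token
    outputs.append(p @ v)                             # mixed values from this head
out = combine(torch.cat(outputs, dim=-1))             # shape (seq_len, d_model)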