3 Coding Attention Mechanisms
In the previous chapter, you learned how to prepare the input text for
training LLMs. This involved splitting text into individual word and
subword tokens, which can be encoded into vector representations,
the so-called embeddings, for the LLM.
For example, consider an input text like "Your journey starts with one step." In this case, each element of the sequence, such as x(1), corresponds to a d-dimensional embedding vector representing a specific token, like "Your." In figure 3.7, these input vectors are shown as three-dimensional embeddings.
import torch
inputs = torch.tensor(
[[0.43, 0.15, 0.89], # Your (x^1)
[0.55, 0.87, 0.66], # journey (x^2)
[0.57, 0.85, 0.64], # starts (x^3)
[0.22, 0.58, 0.33], # with (x^4)
[0.77, 0.25, 0.10], # one (x^5)
[0.05, 0.80, 0.55]] # step (x^6)
)
query = inputs[1]  #A The second input token serves as the query
attn_scores_2 = torch.empty(inputs.shape[0])
for i, x_i in enumerate(inputs):
    attn_scores_2[i] = torch.dot(x_i, query)
print(attn_scores_2)
res = 0.
for idx, element in enumerate(inputs[0]):
    res += inputs[0][idx] * query[idx]
print(res)
print(torch.dot(inputs[0], query))
tensor(0.9544)
tensor(0.9544)
Figure 3.9 After computing the attention scores ω21 to ω2T with
respect to the input query x(2), the next step is to obtain the
attention weights α21 to α2T by normalizing the attention scores.
def softmax_naive(x):
    return torch.exp(x) / torch.exp(x).sum(dim=0)

attn_weights_2_naive = softmax_naive(attn_scores_2)
print("Attention weights:", attn_weights_2_naive)
print("Sum:", attn_weights_2_naive.sum())
As the output shows, the softmax function also meets the objective and normalizes the attention weights so that they sum to 1.
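In practice, PyTorch's built-in softmax implementation is preferable because it is numerically more stable than the naive version; a minimal sketch of this step, reusing attn_scores_2 from above:
attn_weights_2 = torch.softmax(attn_scores_2, dim=0)  # numerically stable softmax
print("Attention weights:", attn_weights_2)
print("Sum:", attn_weights_2.sum())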
In this case, we can see that it yields the same results as our previous softmax_naive function:
Figure 3.10 The final step, after calculating and normalizing the
attention scores to obtain the attention weights for query x(2), is
to compute the context vector z(2). This context vector is a
combination of all input vectors x(1) to x(T) weighted by the
attention weights.
The context vector z(2) depicted in figure 3.10 is calculated as a weighted sum of all input vectors. This involves multiplying each input vector by its corresponding attention weight:
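A rough sketch of this weighted sum, using the attn_weights_2 computed above and introducing context_vec_2 as the variable name for the result:
query = inputs[1]  # again, the second input token is the query
context_vec_2 = torch.zeros(query.shape)
for i, x_i in enumerate(inputs):
    context_vec_2 += attn_weights_2[i] * x_i
print(context_vec_2)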
Figure 3.11 The highlighted row shows the attention weights for
the second input element as a query, as we computed in the
previous section. This section generalizes the computation to
obtain all other attention weights.
We follow the same three steps as before, as summarized in figure 3.12, except that we make a few modifications in the code to compute all context vectors instead of only the second context vector, z(2).
attn_scores = torch.empty(6, 6)
for i, x_i in enumerate(inputs):
    for j, x_j in enumerate(inputs):
        attn_scores[i, j] = torch.dot(x_i, x_j)
print(attn_scores)
The resulting attention scores are as follows:
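For loops like the one above are generally slow; the same pairwise dot products can be obtained with a single matrix multiplication, sketched here with the inputs tensor from earlier:
attn_scores = inputs @ inputs.T  # element [i, j] is the dot product of inputs i and j
print(attn_scores)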
We can visually confirm that the results are the same as before.
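In step 2, each row of this score matrix is normalized so that its values sum to 1; a sketch of that normalization with PyTorch's softmax:
attn_weights = torch.softmax(attn_scores, dim=-1)  # normalize each row into attention weights
print(attn_weights)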
This returns the following attention weight tensor that matches the values shown in figure 3.10:
Before we move on to step 3, the final step shown in figure 3.12, let's briefly verify that the rows indeed all sum to 1:
row_2_sum = sum([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
print("Row 2 sum:", row_2_sum)
print("All row sums:", attn_weights.sum(dim=-1))
In the third and last step, we now use these attention weights to compute all context vectors via matrix multiplication:
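A sketch of this step, where each row of the result is one context vector:
all_context_vecs = attn_weights @ inputs  # row i is the context vector z(i)
print(all_context_vecs)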
We can double-check that the code is correct by comparing the second row with the context vector z(2) that we computed previously in section 3.3.1:
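For example, assuming the single context vector from the earlier sketch is still stored in context_vec_2, the comparison might look like this:
print("Previous 2nd context vector:", context_vec_2)
print("Second row of all context vectors:", all_context_vecs[1])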
x_2 = inputs[1]  #A The second input element
d_in = inputs.shape[1]  #B The input embedding size, d_in=3
d_out = 2  #C The output embedding size, d_out=2
Note that in GPT-like models, the input and output dimensions are usually the same, but for illustration purposes, to better follow the computation, we choose different input ( d_in=3 ) and output ( d_out=2 ) dimensions here.
Next, we initialize the three weight matrices Wq, Wk, and Wv that are shown in figure 3.14:
torch.manual_seed(123)
W_query = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_key = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_value = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
Next, we compute the query, key, and value vectors as shown earlier in figure 3.14:
query_2 = x_2 @ W_query
key_2 = x_2 @ W_key
value_2 = x_2 @ W_value
print(query_2)
As we can see based on the output for the query, this results in a two-dimensional vector since we set the number of columns of the corresponding weight matrix, via d_out , to 2:
tensor([0.4306, 1.4551])
Even though our temporary goal is only to compute the one context vector, z(2), we still require the key and value vectors for all input elements, as they are involved in computing the attention weights with respect to the query q(2), as illustrated in figure 3.14.
We can obtain all keys and values via matrix multiplication:
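A sketch of that step, reusing the weight matrices defined above:
keys = inputs @ W_key      # one key vector per input token
values = inputs @ W_value  # one value vector per input token
print("keys.shape:", keys.shape)
print("values.shape:", values.shape)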
keys_2 = keys[1]  #A Remember that Python starts indexing at 0
attn_score_22 = query_2.dot(keys_2)
print(attn_score_22)
tensor(1.8524)
The third step is now going from the attention scores to the attention weights, as illustrated in figure 3.16.
Figure 3.16 After computing the attention scores ω, the next step
is to normalize these scores using the softmax function to obtain
the attention weights α.
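A sketch of these steps: the single score generalizes to all attention scores for query 2 via a matrix multiplication, the scores are divided by the square root of the key embedding dimension before the softmax (the scaling that gives scaled dot-product attention its name), and the resulting weights are applied to the value vectors:
attn_scores_2 = query_2 @ keys.T          # all attention scores for query 2
d_k = keys.shape[-1]                      # embedding dimension of the keys
attn_weights_2 = torch.softmax(attn_scores_2 / d_k**0.5, dim=-1)
context_vec_2 = attn_weights_2 @ values   # weighted sum over the value vectors
print(context_vec_2)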
tensor([0.3061, 0.8210])
So far, we have only computed a single context vector, z(2). In the next section, we will generalize the code to compute all context vectors in the input sequence, z(1) to z(T).
The key is like a database key used for indexing and searching. In the attention mechanism, each item in the input sequence (e.g., each word in a sentence) has an associated key. These keys are used to match the query.
During the forward pass, using the forward method, we compute the attention scores ( attn_scores ) by multiplying queries and keys and normalize these scores using softmax. Finally, we create a context vector by weighting the values with these normalized attention scores.
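A minimal sketch of such a SelfAttention_v1 class, consistent with the forward pass just described:
import torch.nn as nn

class SelfAttention_v1(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W_query = nn.Parameter(torch.rand(d_in, d_out))
        self.W_key = nn.Parameter(torch.rand(d_in, d_out))
        self.W_value = nn.Parameter(torch.rand(d_in, d_out))

    def forward(self, x):
        keys = x @ self.W_key
        queries = x @ self.W_query
        values = x @ self.W_value
        attn_scores = queries @ keys.T  # omega
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1
        )
        context_vec = attn_weights @ values  # weighted sum of the value vectors
        return context_vec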
torch.manual_seed(123)
sa_v1 = SelfAttention_v1(d_in, d_out)
print(sa_v1(inputs))
tensor([[0.2996, 0.8053],
[0.3061, 0.8210],
[0.3058, 0.8203],
[0.2948, 0.7939],
[0.2927, 0.7891],
[0.2990, 0.8040]], grad_fn=<MmBackward0>)
torch.manual_seed(789)
sa_v2 = SelfAttention_v2(d_in, d_out)
print(sa_v2(inputs))
The output is
tensor([[-0.0739, 0.0713],
[-0.0748, 0.0703],
[-0.0749, 0.0702],
[-0.0760, 0.0685],
[-0.0763, 0.0679],
[-0.0754, 0.0693]], grad_fn=<MmBackward0>)
queries = sa_v2.W_query(inputs)  #A Reuse the query and key weight matrices of the SelfAttention_v2 object
keys = sa_v2.W_key(inputs)
attn_scores = queries @ keys.T
attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
print(attn_weights)
context_length = attn_scores.shape[0]
mask_simple = torch.tril(torch.ones(context_length, context_length))
print(mask_simple)
Now, we can multiply this mask with the attention weights to zero out the values above the diagonal:
masked_simple = attn_weights*mask_simple
print(masked_simple)
Information leakage
Now all we need to do is apply the softmax function to these masked results, and we are done:
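A sketch of this more efficient masking trick, which fills the positions above the diagonal with negative infinity so that softmax turns them into zero weights without a separate renormalization step:
mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)
masked = attn_scores.masked_fill(mask.bool(), -torch.inf)  # future tokens get -inf
attn_weights = torch.softmax(masked / keys.shape[-1]**0.5, dim=-1)
print(attn_weights)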
As we can see based on the output, the values in each row sum to 1, and no further normalization is necessary:
Here, we will apply the dropout mask after computing the attention weights, as illustrated in figure 3.22, because it is the more common variant in practice.
Figure 3.22 Using the causal attention mask (upper left), we apply
an additional dropout mask (upper right) to zero out additional
attention weights to reduce overfitting during training.
torch.manual_seed(123)
dropout = torch.nn.Dropout(0.5)  #A A dropout rate of 50%
example = torch.ones(6, 6)  #B A matrix of ones for demonstration
print(dropout(example))
torch.manual_seed(123)
print(dropout(attn_weights))
Note that the resulting dropout outputs may look different depending on your operating system; you can read more about this inconsistency on the PyTorch issue tracker at https://github.com/pytorch/pytorch/issues/121595.
Crq ebfero ow igben, fro’z neeusr urrs rqo sgvx azn ehnald cshtbea
itcgsonnsi xl tkxm dznr vne iptnu ce urrz rpv CausalAttention
ssacl ppsstour rgx ahtbc sotutpu cdodepru dh bro zrqs ordael wk
tmelmeidepn jn cthraep 2.
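One simple way to simulate such a batch is to stack two copies of the existing six-token inputs tensor along a new batch dimension (a sketch):
batch = torch.stack((inputs, inputs), dim=0)  # 2 inputs with 6 tokens each; each token is a 3-dimensional embedding
print(batch.shape)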
torch.Size([2, 6, 3])
While all added code lines should be familiar from previous sections, we have now added a self.register_buffer() call in the __init__ method. The use of register_buffer in PyTorch is not strictly necessary for all use cases but offers several advantages here. For instance, when we use the CausalAttention class in our LLM, buffers are automatically moved to the appropriate device (CPU or GPU) along with our model, which will be relevant when training the LLM in future chapters. This means we don't need to manually ensure these tensors are on the same device as the model parameters, avoiding device mismatch errors.
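A minimal sketch of a CausalAttention class along these lines, registering the causal mask as a buffer:
class CausalAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, qkv_bias=False):
        super().__init__()
        self.d_out = d_out
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.dropout = nn.Dropout(dropout)
        self.register_buffer(  # the mask moves to the model's device automatically
            "mask",
            torch.triu(torch.ones(context_length, context_length), diagonal=1)
        )

    def forward(self, x):
        b, num_tokens, d_in = x.shape
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)
        attn_scores = queries @ keys.transpose(1, 2)  # batched dot products
        attn_scores.masked_fill_(
            self.mask.bool()[:num_tokens, :num_tokens], -torch.inf
        )
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1
        )
        attn_weights = self.dropout(attn_weights)
        context_vec = attn_weights @ values
        return context_vec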
torch.manual_seed(123)
context_length = batch.shape[1]
ca = CausalAttention(d_in, d_out, context_length, 0.0)
context_vecs = ca(batch)
print("context_vecs.shape:", context_vecs.shape)
The resulting context vector is a three-dimensional tensor where each token is now represented by a two-dimensional embedding.
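Extending this to multiple heads, a MultiHeadAttentionWrapper like the one instantiated below can be sketched as a module that stacks several independent CausalAttention heads and concatenates their context vectors along the last dimension:
class MultiHeadAttentionWrapper(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout,
                 num_heads, qkv_bias=False):
        super().__init__()
        self.heads = nn.ModuleList(  # one causal-attention head per num_heads
            [CausalAttention(d_in, d_out, context_length, dropout, qkv_bias)
             for _ in range(num_heads)]
        )

    def forward(self, x):
        # Concatenate the per-head context vectors along the embedding dimension
        return torch.cat([head(x) for head in self.heads], dim=-1)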
torch.manual_seed(123)
context_length = batch.shape[1] # This is the number of tokens
d_in, d_out = 3, 2
mha = MultiHeadAttentionWrapper(
d_in, d_out, context_length, 0.0, num_heads=2
)
context_vecs = mha(batch)
print(context_vecs)
print("context_vecs.shape:", context_vecs.shape)
class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length,
                 dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert d_out % num_heads == 0, "d_out must be divisible by num_heads"
        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads  #A Reduce the projection dim to match the desired output dim
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)  #B Linear layer to combine the head outputs
        self.dropout = nn.Dropout(dropout)
        self.register_buffer(
            "mask",
            torch.triu(torch.ones(context_length, context_length),
                       diagonal=1)
        )

    def forward(self, x):
        b, num_tokens, d_in = x.shape
        keys = self.W_key(x)  #C Shape: (b, num_tokens, d_out)
        queries = self.W_query(x)  #C
        values = self.W_value(x)  #C
        # Split the last dimension into (num_heads, head_dim)
        keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
        values = values.view(b, num_tokens, self.num_heads, self.head_dim)
        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)
        keys = keys.transpose(1, 2)  #E Shape: (b, num_heads, num_tokens, head_dim)
        queries = queries.transpose(1, 2)  #E
        values = values.transpose(1, 2)  #E
        attn_scores = queries @ keys.transpose(2, 3)  # dot product for each head
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]  # mask truncated to the number of tokens
        attn_scores.masked_fill_(mask_bool, -torch.inf)  #H Fill masked positions with -inf
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)
        context_vec = (attn_weights @ values).transpose(1, 2)  # (b, num_tokens, num_heads, head_dim)
        context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)  # combine the heads
        context_vec = self.out_proj(context_vec)  # optional output projection
        return context_vec
first_head = a[0, 0, :, :]
first_res = first_head @ first_head.T
print("First head:\n", first_res)
second_head = a[0, 1, :, :]
second_res = second_head @ second_head.T
print("\nSecond head:\n", second_res)
The results are exactly the same as those we obtained when using the batched matrix multiplication print(a @ a.transpose(2, 3)) earlier:
First head:
tensor([[1.3208, 1.1631, 1.2879],
[1.1631, 2.2150, 1.8424],
[1.2879, 1.8424, 2.0402]])
Second head:
tensor([[0.4391, 0.7003, 0.5903],
[0.7003, 1.3737, 1.0620],
[0.5903, 1.0620, 0.9912]])
torch.manual_seed(123)
batch_size, context_length, d_in = batch.shape
d_out = 2
mha = MultiHeadAttention(d_in, d_out, context_length, 0.0, num_heads=2)
context_vecs = mha(batch)
print(context_vecs)
tensor([[[0.3190, 0.4858],
[0.2943, 0.3897],
[0.2856, 0.3593],
[0.2693, 0.3873],
[0.2639, 0.3928],
[0.2575, 0.4028]],
[[0.3190, 0.4858],
[0.2943, 0.3897],
[0.2856, 0.3593],
[0.2693, 0.3873],
[0.2639, 0.3928],
[0.2575, 0.4028]]], grad_fn=<ViewBackward0>)
context_vecs.shape: torch.Size([2, 6, 2])
3.7 Summary
Attention mechanisms transform input elements into
enhanced context vector representations that incorporate
information about all inputs.
A self-attention mechanism computes the context vector
representation as a weighted sum over the inputs.
In a simplified attention mechanism, the attention weights
are computed via dot products.
A dot product is just a concise way of multiplying two
vectors element-wise and then summing the products.
Matrix multiplications, while not strictly required, help us
to implement computations more efficiently and compactly
by replacing nested for-loops.
In self-attention mechanisms used in LLMs, also called
scaled dot-product attention, we include trainable weight
matrices to compute intermediate transformations of the
inputs: queries, values, and keys.
When working with LLMs that read and generate text from
left to right, we add a causal attention mask to prevent the
LLM from accessing future tokens.
In addition to causal attention masks, which zero out
attention weights for future tokens, we can also add a
dropout mask to reduce overfitting in LLMs.
The attention modules in transformer-based LLMs involve
multiple instances of causal attention, which is called
multi-head attention.
We can create a multi-head attention module by stacking
multiple instances of causal attention modules.
A more efficient way of creating multi-head attention
modules involves batched matrix multiplications.