4 Implementing a GPT model from scratch to generate text


This chapter covers

Coding a GPT-like large language model (LLM) that can be trained to generate human-like text
Normalizing layer activations to stabilize neural network training
Adding shortcut connections in deep neural networks
Implementing transformer blocks to create GPT models of various sizes
Computing the number of parameters and storage requirements of GPT models

In the previous chapter, you learned and coded the multi-head attention
mechanism, one of the core components of LLMs. In this chapter, we will
now code the other building blocks of an LLM and assemble them into a
GPT-like model that we will train in the next chapter to generate human-
like text, as illustrated in figure 4.1.

Figure 4.1 A mental model of the three main stages of coding an LLM,
pretraining the LLM on a general text dataset, and finetuning it on a
labeled dataset. This chapter focuses on implementing the LLM
architecture, which we will train in the next chapter.

The LLM architecture, referenced in figure 4.1, consists of several building


blocks that we will implement throughout this chapter. We will begin with
a top-down view of the model architecture in the next section before
covering the individual components in more detail.


4.1 Coding an LLM architecture


LLMs, such as GPT (which stands for generative pretrained transformer),
are large deep neural network architectures designed to generate new text
one word (or token) at a time. However, despite their size, the model
architecture is less complicated than you might think, since many of its
components are repeated, as we will see later. Figure 4.2 provides a top-
down view of a GPT-like LLM, with its main components highlighted.

Figure 4.2 A mental model of a GPT model. Next to the embedding


layers, it consists of one or more transformer blocks containing the
masked multi-head attention module we implemented in the previous
chapter.


As you can see in figure 4.2, we have already covered several aspects, such as input tokenization and embedding, as well as the masked multi-head attention module. The focus of this chapter will be on implementing the core structure of the GPT model, including its transformer blocks, which we will then train in the next chapter to generate human-like text.

In the previous chapters, we used smaller embedding dimensions for simplicity, ensuring that the concepts and examples could comfortably fit on a single page. Now, in this chapter, we are scaling up to the size of a small GPT-2 model, specifically the smallest version with 124 million parameters, as described in Radford et al.'s paper, "Language Models are Unsupervised Multitask Learners." Note that while the original report mentions 117 million parameters, this was later corrected.

Chapter 6 will focus on loading pretrained weights into our implementation and adapting it for larger GPT-2 models with 345, 762, and 1,542 million parameters. In the context of deep learning and LLMs like GPT, the term "parameters" refers to the trainable weights of the model. These weights are essentially the internal variables of the model that are adjusted and optimized during the training process to minimize a specific loss function. This optimization allows the model to learn from the training data.

For example, in a neural network layer that is represented by a 2,048 × 2,048-dimensional matrix (or tensor) of weights, each element of this matrix is a parameter. Since there are 2,048 rows and 2,048 columns, the total number of parameters in this layer is 2,048 multiplied by 2,048, which equals 4,194,304 parameters.
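As a quick sanity check (a minimal sketch, not one of the book's listings), we can verify this count with PyTorch:

import torch.nn as nn

layer = nn.Linear(2048, 2048, bias=False)   # a 2,048 x 2,048 weight matrix
print(layer.weight.numel())                 # 4194304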

GPT-2 vs. GPT-3

Note that we are focusing on GPT-2 because OpenAI has made the weights of the pretrained model publicly available, which we will load into our implementation in chapter 6. GPT-3 is fundamentally the same in terms of model architecture, except that it is scaled up from 1.5 billion parameters in GPT-2 to 175 billion parameters in GPT-3, and it is trained on more data. As of this writing, the weights for GPT-3 are not publicly available. GPT-2 is also a better choice for learning how to implement LLMs, as it can be run on a single laptop computer, whereas GPT-3 requires a GPU cluster for training and inference. According to Lambda Labs, it would take 355 years to train GPT-3 on a single V100 datacenter GPU and 665 years on a consumer RTX 8000 GPU.

We specify the configuration of the small GPT-2 model via the following Python dictionary, which we will use in the code examples later:

GPT_CONFIG_124M = {
    "vocab_size": 50257,     # Vocabulary size
    "context_length": 1024,  # Context length
    "emb_dim": 768,          # Embedding dimension
    "n_heads": 12,           # Number of attention heads
    "n_layers": 12,          # Number of layers
    "drop_rate": 0.1,        # Dropout rate
    "qkv_bias": False        # Query-Key-Value bias
}

In the GPT_CONFIG_124M dictionary, we use concise variable names for clarity and to prevent long lines of code:

vocab_size refers to a vocabulary of 50,257 words, as used by the BPE tokenizer from chapter 2.
context_length denotes the maximum number of input tokens the model can handle via the positional embeddings discussed in chapter 2.
emb_dim represents the embedding size, transforming each token into a 768-dimensional vector.
n_heads indicates the count of attention heads in the multi-head attention mechanism, as implemented in chapter 3.
n_layers specifies the number of transformer blocks in the model, which will be elaborated on in upcoming sections.
drop_rate indicates the intensity of the dropout mechanism (0.1 implies a 10% drop of hidden units) to prevent overfitting, as covered in chapter 3.
qkv_bias determines whether to include a bias vector in the Linear layers of the multi-head attention for query, key, and value computations. We will initially disable this, following the norms of modern LLMs, but will revisit it in chapter 6 when we load pretrained GPT-2 weights from OpenAI into our model.

Using this configuration, we will start this chapter by implementing a GPT placeholder architecture (DummyGPTModel) in this section, as shown in figure 4.3. This will provide us with a big-picture view of how everything fits together and what other components we need to code in the upcoming sections to assemble the full GPT model architecture.

Figure 4.3 A mental model outlining the order in which we code the
GPT architecture. In this chapter, we will start with the GPT backbone,
a placeholder architecture, before we get to the individual core pieces
and eventually assemble them in a transformer block for the final GPT
architecture.

The numbered boxes shown in figure 4.3 illustrate the order in which we tackle the individual concepts required to code the final GPT architecture. We will start with step 1, a placeholder GPT backbone we call DummyGPTModel.

Listing 4.1 A placeholder GPT model architecture class


import torch
import torch.nn as nn

class DummyGPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])
        self.trf_blocks = nn.Sequential(
            *[DummyTransformerBlock(cfg)              # Use a placeholder for TransformerBlock
              for _ in range(cfg["n_layers"])]
        )
        self.final_norm = DummyLayerNorm(cfg["emb_dim"])   # Use a placeholder for LayerNorm
        self.out_head = nn.Linear(
            cfg["emb_dim"], cfg["vocab_size"], bias=False
        )

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)
        pos_embeds = self.pos_emb(
            torch.arange(seq_len, device=in_idx.device)
        )
        x = tok_embeds + pos_embeds
        x = self.drop_emb(x)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x)
        return logits

class DummyTransformerBlock(nn.Module):     # A placeholder class to be replaced by a real TransformerBlock later
    def __init__(self, cfg):
        super().__init__()

    def forward(self, x):                   # This block does nothing and just returns its input
        return x

class DummyLayerNorm(nn.Module):            # A placeholder class to be replaced by a real LayerNorm later
    def __init__(self, normalized_shape, eps=1e-5):   # The parameters here are just to mimic the LayerNorm interface
        super().__init__()

    def forward(self, x):
        return x

The DummyGPTModel class in this code defines a simplified version of a GPT-like model using PyTorch's neural network module (nn.Module). The model architecture in the DummyGPTModel class consists of token and positional embeddings, dropout, a series of transformer blocks (DummyTransformerBlock), a final layer normalization (DummyLayerNorm), and a linear output layer (out_head). The configuration is passed in via a Python dictionary, for instance, the GPT_CONFIG_124M dictionary we created earlier.

The forward method describes the data flow through the model: it computes token and positional embeddings for the input indices, applies dropout, processes the data through the transformer blocks, applies normalization, and finally produces logits with the linear output layer.

The preceding code is already functional, as we will see later in this section after we prepare the input data. However, for now, note in the preceding code that we have used placeholders (DummyLayerNorm and DummyTransformerBlock) for the transformer block and layer normalization, which we will develop in later sections.

Next, we will prepare the input data and initialize a new GPT model to illustrate its usage. Building on the figures we have seen in chapter 2, where we coded the tokenizer, figure 4.4 provides a high-level overview of how data flows in and out of a GPT model.

Figure 4.4 A big-picture overview showing how the input data is


tokenized, embedded, and fed to the GPT model. Note that in our
DummyGPTModel class coded earlier, the token embedding is handled inside
the GPT model. In LLMs, the embedded input token dimension
typically matches the output dimension. The output embeddings here
represent the context vectors we discussed in chapter 3.

To implement the steps shown in figure 4.4, we tokenize a batch consisting of two text inputs for the GPT model using the tiktoken tokenizer introduced in chapter 2:

import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
batch = []
txt1 = "Every effort moves you"
txt2 = "Every day holds a"

batch.append(torch.tensor(tokenizer.encode(txt1)))
batch.append(torch.tensor(tokenizer.encode(txt2)))
batch = torch.stack(batch, dim=0)
print(batch)

The resulting token IDs for the two texts are as follows:

tensor([[6109,  3626,  6100,   345],
        [6109,  1110,  6622,   257]])

Next, we initialize a new 124-million-parameter DummyGPTModel instance and feed it the tokenized batch:

torch.manual_seed(123)
model = DummyGPTModel(GPT_CONFIG_124M)
logits = model(batch)
print("Output shape:", logits.shape)
print(logits)

The model outputs, which are commonly referred to as logits, are as follows:

Output shape: torch.Size([2, 4, 50257])
tensor([[[-1.2034,  0.3201, -0.7130,  ..., -1.5548, -0.2390, -0.4667],
         [-0.1192,  0.4539, -0.4432,  ...,  0.2392,  1.3469,  1.2430],
         [ 0.5307,  1.6720, -0.4695,  ...,  1.1966,  0.0111,  0.5835],
         [ 0.0139,  1.6755, -0.3388,  ...,  1.1586, -0.0435, -1.0400]],

        [[-1.0908,  0.1798, -0.9484,  ..., -1.6047,  0.2439, -0.4530],
         [-0.7860,  0.5581, -0.0610,  ...,  0.4835, -0.0077,  1.6621],
         [ 0.3567,  1.2698, -0.6398,  ..., -0.0162, -0.1296,  0.3717],
         [-0.2407, -0.7349, -0.5102,  ...,  2.0057, -0.3694,  0.1814]]],
       grad_fn=<UnsafeViewBackward0>)

The output tensor has two rows corresponding to the two text samples. Each text sample consists of four tokens; each token is a 50,257-dimensional vector, which matches the size of the tokenizer's vocabulary.

The embedding has 50,257 dimensions because each of these dimensions refers to a unique token in the vocabulary. At the end of this chapter, when we implement the postprocessing code, we will convert these 50,257-dimensional vectors back into token IDs, which we can then decode into words.

Now that we have taken a top-down look at the GPT architecture and its inputs and outputs, we will code the individual placeholders in the upcoming sections, starting with the real layer normalization class that will replace the DummyLayerNorm in the previous code.


4.2 Normalizing activations with layer normalization
Training deep neural networks with many layers can sometimes prove challenging due to problems like vanishing or exploding gradients. These problems lead to unstable training dynamics and make it difficult for the network to effectively adjust its weights, which means the learning process struggles to find a set of parameters (weights) for the neural network that minimizes the loss function. In other words, the network has difficulty learning the underlying patterns in the data to a degree that would allow it to make accurate predictions or decisions. (If you are new to neural network training and the concepts of gradients, a brief introduction to these concepts can be found in section A.4 in appendix A. However, a deep mathematical understanding of gradients is not required to follow the contents of this book.)

In this section, we will implement layer normalization to improve the stability and efficiency of neural network training.

The main idea behind layer normalization is to adjust the activations (outputs) of a neural network layer to have a mean of 0 and a variance of 1, also known as unit variance. This adjustment speeds up the convergence to effective weights and ensures consistent, reliable training. As we have seen in the previous section based on the DummyLayerNorm placeholder, in GPT-2 and modern transformer architectures, layer normalization is typically applied before and after the multi-head attention module and before the final output layer.

Before we implement layer normalization in code, figure 4.5 provides a visual overview of how layer normalization functions.

Figure 4.5 An illustration of layer normalization where the outputs of a layer, also called activations, are normalized such that they have a 0 mean and a variance of 1.

We can recreate the example shown in figure 4.5 via the following code, where we implement a neural network layer with five inputs and six outputs that we apply to two input examples:

torch.manual_seed(123)
batch_example = torch.randn(2, 5)   # create 2 training examples with 5 features each
layer = nn.Sequential(nn.Linear(5, 6), nn.ReLU())
out = layer(batch_example)
print(out)

This prints the following tensor, where the first row lists the layer outputs for the first input and the second row lists the layer outputs for the second input:

tensor([[0.2260, 0.3470, 0.0000, 0.2216, 0.0000, 0.0000],
        [0.2133, 0.2394, 0.0000, 0.5198, 0.3297, 0.0000]],
       grad_fn=<ReluBackward0>)

The neural network layer we have coded consists of a Linear layer followed by a nonlinear activation function, ReLU (short for rectified linear unit), which is a standard activation function in neural networks. If you are unfamiliar with ReLU, it simply thresholds negative inputs to 0, ensuring that a layer outputs only positive values, which explains why the resulting layer output does not contain any negative values. (Note that we will use another, more sophisticated activation function in GPT, which we will introduce in the next section.)

Before we apply layer normalization to these outputs, let's examine the mean and variance:

mean = out.mean(dim=-1, keepdim=True)
var = out.var(dim=-1, keepdim=True)
print("Mean:\n", mean)
print("Variance:\n", var)

The output is as follows:

Mean:
 tensor([[0.1324],
        [0.2170]], grad_fn=<MeanBackward1>)
Variance:
 tensor([[0.0231],
        [0.0398]], grad_fn=<VarBackward0>)

The first row in the mean tensor here contains the mean value for the first input row, and the second output row contains the mean for the second input row.

Using keepdim=True in operations like mean or variance calculation ensures that the output tensor retains the same number of dimensions as the input tensor, even though the operation reduces the tensor along the dimension specified via dim. For instance, without keepdim=True, the returned mean tensor would be a two-dimensional vector [0.1324, 0.2170] instead of a 2 × 1-dimensional matrix [[0.1324], [0.2170]].

The dim parameter specifies the dimension along which the calculation of the statistic (here, mean or variance) should be performed in a tensor, as shown in figure 4.6.

Figure 4.6 An illustration of the dim parameter when calculating the


mean of a tensor. For instance, if we have a two-dimensional tensor
(matrix) with dimensions [rows, columns] , using dim=0 will
perform the operation across rows (vertically, as shown at the
bottom), resulting in an output that aggregates the data for each
column. Using dim=1 or dim=-1 will perform the operation across
columns (horizontally, as shown at the top), resulting in an output
aggregating the data for each row.

As figure 4.6 explains, for a two-dimensional tensor (like a matrix), using dim=-1 for operations such as mean or variance calculation is the same as using dim=1. This is because -1 refers to the tensor's last dimension, which corresponds to the columns in a two-dimensional tensor. Later, when adding layer normalization to the GPT model, which produces three-dimensional tensors with shape [batch_size, num_tokens, embedding_size], we can still use dim=-1 for normalization across the last dimension, avoiding a change from dim=1 to dim=2.
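The following small sketch (not one of the book's listings) makes the effect of dim and keepdim concrete:

import torch

t = torch.tensor([[1., 2., 3.],
                  [4., 5., 6.]])
print(t.mean(dim=-1))                 # tensor([2., 5.]); the last dimension is reduced away
print(t.mean(dim=-1, keepdim=True))   # tensor([[2.], [5.]]); the reduced dimension is kept, shape [2, 1]
print(t.mean(dim=0))                  # tensor([2.5000, 3.5000, 4.5000]); column-wise means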

Next, let us apply layer normalization to the layer outputs we obtained earlier. The operation consists of subtracting the mean and dividing by the square root of the variance (also known as the standard deviation):

out_norm = (out - mean) / torch.sqrt(var)
mean = out_norm.mean(dim=-1, keepdim=True)
var = out_norm.var(dim=-1, keepdim=True)
print("Normalized layer outputs:\n", out_norm)
print("Mean:\n", mean)
print("Variance:\n", var)
As we can see based on the results, the normalized layer outputs, which now also contain negative values, have 0 mean and a variance of 1:

Normalized layer outputs:
 tensor([[ 0.6159,  1.4126, -0.8719,  0.5872, -0.8719, -0.8719],
        [-0.0189,  0.1121, -1.0876,  1.5173,  0.5647, -1.0876]],
       grad_fn=<DivBackward0>)
Mean:
 tensor([[2.9802e-08],
        [3.9736e-08]], grad_fn=<MeanBackward1>)
Variance:
 tensor([[1.],
        [1.]], grad_fn=<VarBackward0>)


Note that the value 2.9802e-08 in the output tensor is the scientific notation for 2.9802 × 10⁻⁸, which is 0.0000000298 in decimal form. This value is very close to 0, but it is not exactly 0 due to small numerical errors that can accumulate because of the finite precision with which computers represent numbers.

To improve readability, we can also turn off the scientific notation when printing tensor values by setting sci_mode to False:

torch.set_printoptions(sci_mode=False)
print("Mean:\n", mean)
print("Variance:\n", var)

Mean:
 tensor([[    0.0000],
        [    0.0000]], grad_fn=<MeanBackward1>)
Variance:
 tensor([[1.],
        [1.]], grad_fn=<VarBackward0>)

So far, in this section, we have coded and applied layer normalization in a step-by-step process. Let's now encapsulate this process in a PyTorch module that we can use in the GPT model later.

Listing 4.2 A layer normalization class


class LayerNorm(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift

This specific implementation of layer normalization operates on the last dimension of the input tensor x, which represents the embedding dimension (emb_dim). The variable eps is a small constant (epsilon) added to the variance to prevent division by zero during normalization. The scale and shift are two trainable parameters (of the same dimension as the input) that the LLM automatically adjusts during training if it is determined that doing so would improve the model's performance on its training task. This allows the model to learn appropriate scaling and shifting that best suit the data it is processing.

Biased variance

In our variance calculation method, we have opted for an implementation detail by setting unbiased=False. For those curious about what this means, in the variance calculation, we divide by the number of inputs n in the variance formula. This approach does not apply Bessel's correction, which typically uses n – 1 instead of n in the denominator to adjust for bias in sample variance estimation. This decision results in a so-called biased estimate of the variance. For LLMs, where the embedding dimension n is significantly large, the difference between using n and n – 1 is practically negligible. We chose this approach to ensure compatibility with the GPT-2 model's normalization layers and because it reflects TensorFlow's default behavior, which was used to implement the original GPT-2 model. Using a similar setting ensures our method is compatible with the pretrained weights we will load in chapter 6.
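The difference between the two settings is easy to verify directly (a small sketch on a toy tensor, not from the book's listings):

import torch

x = torch.tensor([1., 2., 3., 4.])
print(x.var(unbiased=False))   # tensor(1.2500), divides by n, as in our LayerNorm class
print(x.var(unbiased=True))    # tensor(1.6667), divides by n - 1, PyTorch's default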

Let's now try the LayerNorm module in practice and apply it to the batch input:

ln = LayerNorm(emb_dim=5)
out_ln = ln(batch_example)
mean = out_ln.mean(dim=-1, keepdim=True)
var = out_ln.var(dim=-1, unbiased=False, keepdim=True)
print("Mean:\n", mean)
print("Variance:\n", var)

As we can see based on the results, the layer normalization code works as expected and normalizes the values of each of the two inputs such that they have a mean of 0 and a variance of 1:

Mean:
 tensor([[    -0.0000],
        [     0.0000]], grad_fn=<MeanBackward1>)
Variance:
 tensor([[1.0000],
        [1.0000]], grad_fn=<VarBackward0>)

In this section, we covered one of the building blocks we will need to implement the GPT architecture, as shown in the mental model in figure 4.7.

Figure 4.7 A mental model listing the different building blocks we


implement in this chapter to assemble the GPT architecture.

In the next section, we will look at the GELU activation function, which is one of the activation functions used in LLMs, instead of the traditional ReLU function we used in this section.

Layer normalization vs. batch normalization

If you are familiar with batch normalization, a common and traditional normalization method for neural networks, you may wonder how it compares to layer normalization. Unlike batch normalization, which normalizes across the batch dimension, layer normalization normalizes across the feature dimension. LLMs often require significant computational resources, and the available hardware or the specific use case can dictate the batch size during training or inference. Since layer normalization normalizes each input independently of the batch size, it offers more flexibility and stability in these scenarios. This is particularly beneficial for distributed training or when deploying models in environments where resources are constrained.

4.3 Implementing a feed forward network with GELU activations

In this section, we implement a small neural network submodule that is used as part of the transformer block in LLMs. We begin with implementing the GELU activation function, which plays a crucial role in this neural network submodule. (For additional information on implementing neural networks in PyTorch, please see section A.5 in appendix A.)

Historically, the ReLU activation function has been commonly used in deep learning due to its simplicity and effectiveness across various neural network architectures. However, in LLMs, several other activation functions are employed beyond the traditional ReLU. Two notable examples are GELU (Gaussian error linear unit) and SwiGLU (Swish-gated linear unit).

GELU and SwiGLU are more complex and smooth activation functions incorporating Gaussian and sigmoid-gated linear units, respectively. They offer improved performance for deep learning models, unlike the simpler ReLU.
The GELU activation function can be implemented in several ways; the exact version is defined as GELU(x) = x⋅Φ(x), where Φ(x) is the cumulative distribution function of the standard Gaussian distribution. In practice, however, it's common to implement a computationally cheaper approximation (the original GPT-2 model was also trained with this approximation):
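For reference, the tanh-based approximation that the following listing implements can be written as

\[
\mathrm{GELU}(x) \approx 0.5\,x\left(1 + \tanh\!\left[\sqrt{2/\pi}\,\bigl(x + 0.044715\,x^{3}\bigr)\right]\right)
\]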

In code, we can implement this function as a PyTorch module, as shown in the following listing.

Listing 4.3 An implementation of the GELU activation function


class GELU(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return 0.5 * x * (1 + torch.tanh(
            torch.sqrt(torch.tensor(2.0 / torch.pi)) *
            (x + 0.044715 * torch.pow(x, 3))
        ))

Next, to get an idea of what this GELU function looks like and how it compares to the ReLU function, let's plot these functions side by side:

import matplotlib.pyplot as plt

gelu, relu = GELU(), nn.ReLU()

x = torch.linspace(-3, 3, 100)   # create 100 sample data points in the range -3 to 3
y_gelu, y_relu = gelu(x), relu(x)
plt.figure(figsize=(8, 3))
for i, (y, label) in enumerate(zip([y_gelu, y_relu], ["GELU", "ReLU"]), 1):
    plt.subplot(1, 2, i)
    plt.plot(x, y)
    plt.title(f"{label} activation function")
    plt.xlabel("x")
    plt.ylabel(f"{label}(x)")
    plt.grid(True)
plt.tight_layout()
plt.show()

As we can see in the resulting plot in figure 4.8, ReLU is a piecewise linear function that outputs the input directly if it is positive; otherwise, it outputs zero. GELU is a smooth, nonlinear function that approximates ReLU but with a non-zero gradient for negative values.

Figure 4.8 The output of the GELU and ReLU plots using matplotlib.
The x-axis shows the function inputs and the y-axis shows the function
outputs.
The smoothness of GELU, as shown in figure 4.8, can lead to better optimization properties during training, as it allows for more nuanced adjustments to the model's parameters. In contrast, ReLU has a sharp corner at zero, which can sometimes make optimization harder, especially in networks that are very deep or have complex architectures. Moreover, unlike ReLU, which outputs zero for any negative input, GELU allows for a small, non-zero output for negative values. This characteristic means that during the training process, neurons that receive negative input can still contribute to the learning process, albeit to a lesser extent than positive inputs.
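As a quick numerical illustration (a minimal sketch reusing the GELU class from listing 4.3 and PyTorch's built-in ReLU):

x = torch.tensor([-1.0, 0.0, 1.0])
print(GELU()(x))      # approximately tensor([-0.1588,  0.0000,  0.8412])
print(nn.ReLU()(x))   # tensor([0., 0., 1.])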

Next, let's use the GELU function to implement the small neural network module, FeedForward, that we will be using in the LLM's transformer block later.

Listing 4.4 A feed forward neural network module


class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
            GELU(),
            nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),
        )

    def forward(self, x):
        return self.layers(x)

As we can see in the preceding code, the FeedForward module is a small neural network consisting of two Linear layers and a GELU activation function. In the 124-million-parameter GPT model, it receives input batches with tokens that have an embedding size of 768 each via the GPT_CONFIG_124M dictionary, where GPT_CONFIG_124M["emb_dim"] = 768. Figure 4.9 shows how the embedding size is manipulated inside this small feed forward neural network when we pass it some inputs.

Figure 4.9 provides a visual overview of the connections between the


layers of the feed forward neural network. It is important to note that
this neural network can accommodate variable batch sizes and
numbers of tokens in the input. However, the embedding size for each
token is determined and fixed when initializing the weights.

Following the example in figure 4.9, let's initialize a new FeedForward module with a token embedding size of 768 and feed it a batch input with two samples and three tokens each:
ffn = FeedForward(GPT_CONFIG_124M)
x = torch.rand(2, 3, 768)   # create a sample input with batch size 2, 3 tokens, and an embedding size of 768
out = ffn(x)
print(out.shape)

As we can see, the shape of the output tensor is the same as that of the input tensor:

torch.Size([2, 3, 768])

The FeedForward module we implemented in this section plays a crucial role in enhancing the model's ability to learn from and generalize the data. Although the input and output dimensions of this module are the same, it internally expands the embedding dimension into a higher-dimensional space through the first linear layer, as illustrated in figure 4.10. This expansion is followed by a nonlinear GELU activation and then a contraction back to the original dimension with the second linear transformation. Such a design allows for the exploration of a richer representation space.

Figure 4.10 An illustration of the expansion and contraction of the


layer outputs in the feed forward neural network. First, the inputs
expand by a factor of 4 from 768 to 3,072 values. Then, the second
layer compresses the 3,072 values back into a 768-dimensional
representation.

Moreover, the uniformity in input and output dimensions simplifies the architecture by enabling the stacking of multiple layers, as we will do later, without the need to adjust dimensions between them, thus making the model more scalable.

As illustrated in figure 4.11, we have now implemented most of the LLM's building blocks.

Figure 4.11 A mental model showing the topics we cover in this


chapter, with the black checkmarks indicating those we have already
covered.

In the next section, we will go over the concept of shortcut connections that we insert between different layers of a neural network, which are important for improving the training performance in deep neural network architectures.

4.4 Adding shortcut connections

Next, let's discuss the concept behind shortcut connections, also known as skip or residual connections. Originally, shortcut connections were proposed for deep networks in computer vision (specifically, residual networks) to mitigate the challenge of vanishing gradients. The vanishing gradient problem refers to the issue where gradients (which guide weight updates during training) become progressively smaller as they propagate backward through the layers, making it difficult to effectively train earlier layers, as illustrated in figure 4.12.

Figure 4.12 A comparison between a deep neural network consisting


of five layers without (on the left) and with shortcut connections (on
the right). Shortcut connections involve adding the inputs of a layer to
its outputs, effectively creating an alternate path that bypasses
certain layers. The gradient illustrated in the figure denotes the mean
absolute gradient at each layer, which we will compute in the code
example that follows.

As illustrated in figure 4.12, a shortcut connection creates an alternative, shorter path for the gradient to flow through the network by skipping one or more layers, which is achieved by adding the output of one layer to the output of a later layer. This is why these connections are also known as skip connections. They play a crucial role in preserving the flow of gradients during the backward pass in training.

In the following code example, we implement the neural network shown in figure 4.12 to see how we can add shortcut connections in the forward method.

Listing 4.5 A neural network to illustrate shortcut connections


class ExampleDeepNeuralNetwork(nn.Module):
    def __init__(self, layer_sizes, use_shortcut):
        super().__init__()
        self.use_shortcut = use_shortcut
        self.layers = nn.ModuleList([
            # Implement 5 layers
            nn.Sequential(nn.Linear(layer_sizes[0], layer_sizes[1]), GELU()),
            nn.Sequential(nn.Linear(layer_sizes[1], layer_sizes[2]), GELU()),
            nn.Sequential(nn.Linear(layer_sizes[2], layer_sizes[3]), GELU()),
            nn.Sequential(nn.Linear(layer_sizes[3], layer_sizes[4]), GELU()),
            nn.Sequential(nn.Linear(layer_sizes[4], layer_sizes[5]), GELU())
        ])

    def forward(self, x):
        for layer in self.layers:
            # Compute the output of the current layer
            layer_output = layer(x)
            # Check if shortcut can be applied
            if self.use_shortcut and x.shape == layer_output.shape:
                x = x + layer_output
            else:
                x = layer_output
        return x

The code implements a deep neural network with five layers, each consisting of a Linear layer and a GELU activation function. In the forward pass, we iteratively pass the input through the layers and optionally add the shortcut connections depicted in figure 4.12 if the self.use_shortcut attribute is set to True.

Let's use this code to first initialize a neural network without shortcut connections. Here, each layer will be initialized such that it accepts an example with three input values and returns three output values. The last layer returns a single output value:

layer_sizes = [3, 3, 3, 3, 3, 1]
sample_input = torch.tensor([[1., 0., -1.]])
torch.manual_seed(123)   # specify a random seed for the initial weights for reproducibility
model_without_shortcut = ExampleDeepNeuralNetwork(
    layer_sizes, use_shortcut=False
)

Next, we implement a function that computes the gradients in the model's backward pass:

def print_gradients(model, x):
    # Forward pass
    output = model(x)
    target = torch.tensor([[0.]])

    # Calculate loss based on how close the target
    # and output are
    loss = nn.MSELoss()
    loss = loss(output, target)

    # Backward pass to calculate the gradients
    loss.backward()

    for name, param in model.named_parameters():
        if 'weight' in name:
            # Print the mean absolute gradient of the weights
            print(f"{name} has gradient mean of {param.grad.abs().mean().item()}")

In the preceding code, we specify a loss function that computes how close the model output and a user-specified target (here, for simplicity, the value 0) are. Then, when calling loss.backward(), PyTorch computes the loss gradient for each layer in the model. We can iterate through the weight parameters via model.named_parameters(). Suppose we have a 3 × 3 weight parameter matrix for a given layer. In that case, this layer will have 3 × 3 gradient values, and we print the mean absolute gradient of these 3 × 3 gradient values to obtain a single gradient value per layer to compare the gradients between layers more easily.

In short, the .backward() method is a convenient method in PyTorch that computes loss gradients, which are required during model training, without implementing the math for the gradient calculation ourselves, thereby making working with deep neural networks much more accessible. If you are unfamiliar with the concept of gradients and neural network training, I recommend reading sections A.4 and A.7 in appendix A.

Let's now use the print_gradients function and apply it to the model without skip connections:

print_gradients(model_without_shortcut, sample_input)

The output is as follows:

layers.0.0.weight has gradient mean of 0.00020173587836325169
layers.1.0.weight has gradient mean of 0.0001201116101583466
layers.2.0.weight has gradient mean of 0.0007152041653171182
layers.3.0.weight has gradient mean of 0.001398873864673078
layers.4.0.weight has gradient mean of 0.005049646366387606



As we can see based on the output of the print_gradients function, the gradients become smaller as we progress from the last layer (layers.4) to the first layer (layers.0), which is a phenomenon called the vanishing gradient problem.

Let's now instantiate a model with skip connections and see how it compares:

torch.manual_seed(123)
model_with_shortcut = ExampleDeepNeuralNetwork(
    layer_sizes, use_shortcut=True
)
print_gradients(model_with_shortcut, sample_input)

The output is as follows:

layers.0.0.weight has gradient mean of 0.22169792652130127
layers.1.0.weight has gradient mean of 0.20694105327129364
layers.2.0.weight has gradient mean of 0.32896995544433594
layers.3.0.weight has gradient mean of 0.2665732502937317
layers.4.0.weight has gradient mean of 1.3258541822433472

As we can see, based on the output, the last layer (layers.4) still has a larger gradient than the other layers. However, the gradient value stabilizes as we progress toward the first layer (layers.0) and doesn't shrink to a vanishingly small value.

In conclusion, shortcut connections are important for overcoming the limitations posed by the vanishing gradient problem in deep neural networks. Shortcut connections are a core building block of very large models such as LLMs, and they will help facilitate more effective training by ensuring consistent gradient flow across layers when we train the GPT model in the next chapter.

After introducing shortcut connections, we will now connect all of the previously covered concepts (layer normalization, GELU activations, feed forward module, and shortcut connections) in a transformer block in the next section, which is the final building block we need to code the GPT architecture.

4.5 Connecting attention and linear layers in a transformer block

In this section, we are implementing the transformer block, a fundamental building block of GPT and other LLM architectures. This block, which is repeated a dozen times in the 124-million-parameter GPT-2 architecture, combines several concepts we have previously covered: multi-head attention, layer normalization, dropout, feed forward layers, and GELU activations, as illustrated in figure 4.13. In the next section, we will connect this transformer block to the remaining parts of the GPT architecture.

Figure 4.13 An illustration of a transformer block. The bottom of the


diagram shows input tokens that have been embedded into 768-
dimensional vectors. Each row corresponds to one token’s vector
representation. The outputs of the transformer block are vectors of
the same dimension as the input, which can then be fed into
subsequent layers in an LLM.

As shown in figure 4.13, the transformer block combines several components, including the masked multi-head attention module from chapter 3 and the FeedForward module we implemented in section 4.3.

When a transformer block processes an input sequence, each element in the sequence (for example, a word or subword token) is represented by a fixed-size vector (in the case of figure 4.13, 768 dimensions). The operations within the transformer block, including multi-head attention and feed forward layers, are designed to transform these vectors in a way that preserves their dimensionality.

Bdo uozj ja ruzr qrx folc-naotttein csehmimna nj brx mutil-pgsv iteattnno


lokcb sinfeiited ynz elazysna horsnetsiilpa etweebn slnteeem jn brx itpnu
euncqsee. Jn ntrctsao, rkp lgvo darfwor erwknto difisemo orq psrz
dlnaviiiudyl rc dcks osontipi. Yajg ncaiomtinbo krn nqfx enaeslb s xtvm
nanuced gndneuirntdas zng crsogepins lv rgk pinut uhr fcva ehanesnc xgr
oledm’a aervoll ctpcyaai ktl gndnlahi ompclex csgr tnarptse.

In code, we can create the TransformerBlock as shown in the following listing.

Listing 4.6 The transformer block component of GPT


from chapter03 import MultiHeadAttention

class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.att = MultiHeadAttention(
            d_in=cfg["emb_dim"],
            d_out=cfg["emb_dim"],
            context_length=cfg["context_length"],
            num_heads=cfg["n_heads"],
            dropout=cfg["drop_rate"],
            qkv_bias=cfg["qkv_bias"])
        self.ff = FeedForward(cfg)
        self.norm1 = LayerNorm(cfg["emb_dim"])
        self.norm2 = LayerNorm(cfg["emb_dim"])
        self.drop_shortcut = nn.Dropout(cfg["drop_rate"])

    def forward(self, x):
        shortcut = x                  # Shortcut connection for the attention block
        x = self.norm1(x)
        x = self.att(x)
        x = self.drop_shortcut(x)
        x = x + shortcut              # Add the original input back

        shortcut = x                  # Shortcut connection for the feed forward block
        x = self.norm2(x)
        x = self.ff(x)
        x = self.drop_shortcut(x)
        x = x + shortcut              # Add the original input back
        return x

The given code defines a TransformerBlock class in PyTorch that includes a multi-head attention mechanism (MultiHeadAttention) and a feed forward network (FeedForward), both configured based on a provided configuration dictionary (cfg), such as GPT_CONFIG_124M.

Layer normalization (LayerNorm) is applied before each of these two components, and dropout is applied after them to regularize the model and prevent overfitting. This is also known as Pre-LayerNorm. Older architectures, such as the original transformer model, applied layer normalization after the self-attention and feed forward networks instead, known as Post-LayerNorm, which often leads to worse training dynamics.

The class also implements the forward pass, where each component is followed by a shortcut connection that adds the input of the block to its output. This critical feature helps gradients flow through the network during training and improves the learning of deep models, as explained in section 4.4.

Using the GPT_CONFIG_124M dictionary we defined earlier, let's instantiate a transformer block and feed it some sample data:

torch.manual_seed(123)
x = torch.rand(2, 4, 768)   # create a sample input of shape [batch_size, num_tokens, emb_dim]
block = TransformerBlock(GPT_CONFIG_124M)
output = block(x)

print("Input shape:", x.shape)
print("Output shape:", output.shape)

The output is as follows:

Input shape: torch.Size([2, 4, 768])
Output shape: torch.Size([2, 4, 768])

As we can see from the code output, the transformer block maintains the input dimensions in its output, indicating that the transformer architecture processes sequences of data without altering their shape throughout the network.

The preservation of shape throughout the transformer block architecture is not incidental but a crucial aspect of its design. This design enables its effective application across a wide range of sequence-to-sequence tasks, where each output vector directly corresponds to an input vector, maintaining a one-to-one relationship. However, the output is a context vector that encapsulates information from the entire input sequence, as we learned in chapter 3. This means that while the physical dimensions of the sequence (length and feature size) remain unchanged as it passes through the transformer block, the content of each output vector is re-encoded to integrate contextual information from across the entire input sequence.

With the transformer block implemented in this section, we now have all the building blocks, as shown in figure 4.14, needed to implement the GPT architecture in the next section.

Figure 4.14 A mental model of the different concepts we have


implemented in this chapter so far.

As illustrated in figure 4.14, the transformer block combines layer normalization, the feed forward network, including GELU activations, and shortcut connections, which we already covered earlier in this chapter. As we will see in the upcoming section, this transformer block will make up the main component of the GPT architecture we will implement.


4.6 Coding the GPT model


We started this chapter with a big-picture overview of a GPT architecture that we called DummyGPTModel. In this DummyGPTModel code implementation, we showed the input and outputs to the GPT model, but its building blocks remained a black box using a DummyTransformerBlock and DummyLayerNorm class as placeholders.

In this section, we are now replacing the DummyTransformerBlock and DummyLayerNorm placeholders with the real TransformerBlock and LayerNorm classes we coded earlier in this chapter to assemble a fully working version of the original 124-million-parameter version of GPT-2. In chapter 5, we will pretrain a GPT-2 model, and in chapter 6, we will load in the pretrained weights from OpenAI.

Before we assemble the GPT-2 model in code, let's look at its overall structure in figure 4.15, which combines all the concepts we have covered so far in this chapter.

Figure 4.15 An overview of the GPT model architecture. This figure


illustrates the flow of data through the GPT model. Starting from the
bottom, tokenized text is first converted into token embeddings,
which are then augmented with positional embeddings. This combined
information forms a tensor that is passed through a series of
transformer blocks shown in the center (each containing multi-head
attention and feed forward neural network layers with dropout and
layer normalization), which are stacked on top of each other and
repeated 12 times.

As shown in figure 4.15, the transformer block we coded in section 4.5 is repeated many times throughout a GPT model architecture. In the case of the 124-million-parameter GPT-2 model, it's repeated 12 times, which we specify via the n_layers entry in the GPT_CONFIG_124M dictionary. In the case of the largest GPT-2 model with 1,542 million parameters, this transformer block is repeated 48 times.

As shown in figure 4.15, the output from the final transformer block then
goes through a final layer normalization step before reaching the linear
output layer. This layer maps the transformer’s output to a high-
dimensional space (in this case, 50,257 dimensions, corresponding to the
model’s vocabulary size) to predict the next token in the sequence.
Let’s now implement the architecture we see in figure 4.15 in code.

Listing 4.7 The GPT model architecture implementation


class GPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])

        self.trf_blocks = nn.Sequential(
            *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])

        self.final_norm = LayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(
            cfg["emb_dim"], cfg["vocab_size"], bias=False
        )

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)
        # The device setting allows us to run the model on a CPU or GPU,
        # depending on which device the input data sits on
        pos_embeds = self.pos_emb(
            torch.arange(seq_len, device=in_idx.device)
        )
        x = tok_embeds + pos_embeds
        x = self.drop_emb(x)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x)
        return logits

Thanks to the TransformerBlock class we implemented in section 4.5, the


GPTModel class is relatively small and compact.

The __init__ constructor of this GPTModel class initializes the token and
positional embedding layers using the configurations passed in via a
Python dictionary, cfg . These embedding layers are responsible for
converting input token indices into dense vectors and adding positional
information, as discussed in chapter 2.

Next, the __init__ method creates a sequential stack of TransformerBlock modules equal to the number of layers specified in cfg. Following the transformer blocks, a LayerNorm layer is applied, standardizing the outputs from the transformer blocks to stabilize the learning process. Finally, a linear output head without bias is defined, which projects the transformer's output into the vocabulary space of the tokenizer to generate logits for each token in the vocabulary.

The forward method takes a batch of input token indices, computes their embeddings, applies the positional embeddings, passes the sequence through the transformer blocks, normalizes the final output, and then computes the logits, representing the next token's unnormalized probabilities. We will convert these logits into tokens and text outputs in the next section.

Let's now initialize the 124-million-parameter GPT model using the GPT_CONFIG_124M dictionary we pass into the cfg parameter and feed it with the batch text input we created at the beginning of this chapter:

torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)

out = model(batch)
print("Input batch:\n", batch)
print("\nOutput shape:", out.shape)
print(out)

The preceding code prints the contents of the input batch followed by the output tensor:
Input batch:
 tensor([[6109, 3626, 6100,  345],    # token IDs of text 1
        [6109, 1110, 6622,  257]])    # token IDs of text 2

Output shape: torch.Size([2, 4, 50257])
tensor([[[ 0.3613,  0.4222, -0.0711,  ...,  0.3483,  0.4661, -0.2838],
         [-0.1792, -0.5660, -0.9485,  ...,  0.0477,  0.5181, -0.3168],
         [ 0.7120,  0.0332,  0.1085,  ...,  0.1018, -0.4327, -0.2553],
         [-1.0076,  0.3418, -0.1190,  ...,  0.7195,  0.4023,  0.0532]],

        [[-0.2564,  0.0900,  0.0335,  ...,  0.2659,  0.4454, -0.6806],
         [ 0.1230,  0.3653, -0.2074,  ...,  0.7705,  0.2710,  0.2246],
         [ 1.0558,  1.0318, -0.2800,  ...,  0.6936,  0.3205, -0.3178],
         [-0.1565,  0.3926,  0.3288,  ...,  1.2630, -0.1858,  0.0388]]],
       grad_fn=<UnsafeViewBackward0>)


As we can see, the output tensor has the shape [2, 4, 50257], since we passed in two input texts with four tokens each. The last dimension, 50,257, corresponds to the vocabulary size of the tokenizer. In the next section, we will see how to convert each of these 50,257-dimensional output vectors back into tokens.

Before we move on to the next section and code the function that converts the model outputs into text, let's spend a bit more time with the model architecture itself and analyze its size.

Using the numel() method, short for "number of elements," we can collect the total number of parameters in the model's parameter tensors:

total_params = sum(p.numel() for p in model.parameters())
print(f"Total number of parameters: {total_params:,}")

The result is as follows:

Total number of parameters: 163,009,536


Now, a curious reader might notice a discrepancy. Earlier, we spoke of initializing a 124-million-parameter GPT model, so why is the actual number of parameters 163 million, as shown in the preceding code output?

The reason is a concept called weight tying that is used in the original GPT-2 architecture, which means that the original GPT-2 architecture is reusing the weights from the token embedding layer in its output layer. To understand what this means, let's take a look at the shapes of the token embedding layer and linear output layer that we initialized on the model via the GPTModel earlier:

print("Token embedding layer shape:", model.tok_emb.weight.shape)
print("Output layer shape:", model.out_head.weight.shape)

As we can see based on the print outputs, the weight tensors for both these layers have the same shape:

Token embedding layer shape: torch.Size([50257, 768])
Output layer shape: torch.Size([50257, 768])

The token embedding and output layers are very large due to the number of rows for the 50,257 entries in the tokenizer's vocabulary. Let's remove the output layer parameter count from the total GPT-2 model count according to the weight tying:
total_params_gpt2 = (
    total_params - sum(p.numel()
    for p in model.out_head.parameters())
)
print(f"Number of trainable parameters "
      f"considering weight tying: {total_params_gpt2:,}"
)
The output is as follows:

Number of trainable parameters considering weight tying: 124,412,160

As we can see, the model is now only 124 million parameters large, matching the original size of the GPT-2 model.

Weight tying reduces the overall memory footprint and computational complexity of the model. However, in my experience, using separate token embedding and output layers results in better training and model performance; hence, we use separate layers in our GPTModel implementation. The same is true for modern LLMs. However, we will revisit and implement the weight tying concept later in chapter 6 when we load the pretrained weights from OpenAI.
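To illustrate the concept (a hypothetical sketch, not the approach used in this chapter), weight tying could be applied to our GPTModel by sharing the token embedding matrix with the output head, since both weight tensors have the shape [50257, 768]:

tied_model = GPTModel(GPT_CONFIG_124M)                   # separate instance, for illustration only
tied_model.out_head.weight = tied_model.tok_emb.weight   # share one weight tensor between both layers

tied_params = sum(p.numel() for p in tied_model.parameters())
print(f"Number of parameters with weight tying: {tied_params:,}")   # the shared tensor is counted only once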

Exercise 4.1 Number of parameters in feed forward and attention modules

Calculate and compare the number of parameters that are contained in the feed forward module and those that are contained in the multi-head attention module.
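As a starting point for this exercise, a sketch along the following lines could be used; it assumes that the transformer blocks are stored in model.trf_blocks and that each TransformerBlock exposes its sub-modules under the attribute names ff and att.

block = model.trf_blocks[0]    # all transformer blocks have identical shapes
ff_params = sum(p.numel() for p in block.ff.parameters())
att_params = sum(p.numel() for p in block.att.parameters())
print(f"Feed forward module: {ff_params:,} parameters")
print(f"Multi-head attention module: {att_params:,} parameters")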

Lastly, let us compute the memory requirements of the 163 million parameters in our GPTModel object:

total_size_bytes = total_params * 4                #A Total size in bytes, assuming float32 (4 bytes per parameter)
total_size_mb = total_size_bytes / (1024 * 1024)   #B Convert bytes to megabytes
print(f"Total size of the model: {total_size_mb:.2f} MB")


The result is as follows:

Total size of the model: 621.83 MB


In conclusion, by calculating the memory requirements for the 163 million parameters in our GPTModel object and assuming each parameter is a 32-bit float taking up 4 bytes, we find that the total size of the model amounts to 621.83 MB, illustrating the relatively large storage capacity required to accommodate even relatively small LLMs.
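The same arithmetic generalizes to other assumed precisions; the small helper below is a sketch (not part of the chapter's code) that recomputes the size for a given number of bytes per parameter.

def model_size_mb(num_params, bytes_per_param=4):
    # size in megabytes, assuming a fixed number of bytes per parameter
    return num_params * bytes_per_param / (1024 * 1024)

print(f"32-bit floats: {model_size_mb(total_params, 4):.2f} MB")    # 621.83 MB
print(f"16-bit floats: {model_size_mb(total_params, 2):.2f} MB")    # roughly half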

In this section, we implemented the GPTModel architecture and saw that it outputs numeric tensors of shape [batch_size, num_tokens, vocab_size] . In the next section, we will write the code to convert these output tensors into text.

Exercise 4.2 Initializing larger GPT models

In this chapter, we initialized a 124-million-parameter GPT model, which is known as "GPT-2 small." Without making any code modifications besides updating the configuration file, use the GPTModel class to implement GPT-2 medium (using 1,024-dimensional embeddings, 24 transformer blocks, 16 multi-head attention heads), GPT-2 large (1,280-dimensional embeddings, 36 transformer blocks, 20 multi-head attention heads), and GPT-2 XL (1,600-dimensional embeddings, 48 transformer blocks, 25 multi-head attention heads). As a bonus, calculate the total number of parameters in each GPT model.
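One possible way to start this exercise is sketched below: copy GPT_CONFIG_124M and overwrite only the three settings listed above. The model_configs dictionary name is an illustrative choice rather than code from the chapter, and instantiating GPT-2 XL requires several gigabytes of RAM.

model_configs = {    # overrides for the larger GPT-2 variants described in the exercise
    "gpt2-medium": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt2-large":  {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt2-xl":     {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}
for name, overrides in model_configs.items():
    cfg = GPT_CONFIG_124M.copy()
    cfg.update(overrides)
    gpt = GPTModel(cfg)
    print(f"{name}: {sum(p.numel() for p in gpt.parameters()):,} parameters")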

4.7 Generating text

In this final section of the chapter, we will implement the code that converts the tensor outputs of the GPT model back into text. Before we get started, let's briefly review how a generative model like an LLM generates text one word (or token) at a time, as shown in figure 4.16.

Figure 4.16 This diagram illustrates the step-by-step process by which an LLM generates text, one token at a time. Starting with an initial input context ("Hello, I am"), the model predicts a subsequent token during each iteration, appending it to the input context for the next round of prediction. As shown, the first iteration adds "a," the second "model," and the third "ready," progressively building the sentence.

Figure 4.16 illustrates the step-by-step process by which a GPT model generates text given an input context, such as "Hello, I am," on a big-picture level. With each iteration, the input context grows, allowing the model to generate coherent and contextually appropriate text. By the sixth iteration, the model has constructed a complete sentence: "Hello, I am a model ready to help."

In the previous section, we saw that our current GPTModel implementation outputs tensors with shape [batch_size, num_tokens, vocab_size] . Now the question is: How does a GPT model go from these output tensors to the generated text shown in figure 4.16?

The process by which a GPT model goes from output tensors to generated text involves several steps, as illustrated in figure 4.17. These steps include decoding the output tensors, selecting tokens based on a probability distribution, and converting these tokens into human-readable text.

Figure 4.17 details the mechanics of text generation in a GPT model by showing a single iteration in the token generation process. The process begins by encoding the input text into token IDs, which are then fed into the GPT model. The outputs of the model are then converted back into text and appended to the original input text.
The next-token generation process detailed in figure 4.17 illustrates a single step where the GPT model generates the next token given its input. In each step, the model outputs a matrix with vectors representing potential next tokens. The vector corresponding to the next token is extracted and converted into a probability distribution via the softmax function. Within the vector containing the resulting probability scores, the index of the highest value is located, which translates to the token ID. This token ID is then decoded back into text, producing the next token in the sequence. Finally, this token is appended to the previous inputs, forming a new input sequence for the subsequent iteration. This step-by-step process enables the model to generate text sequentially, building coherent phrases and sentences from the initial input context.

In practice, we repeat this process over many iterations, such as shown in figure 4.16, until we reach a user-specified number of generated tokens. In code, we can implement the token-generation process as shown in the following listing.

Listing 4.8 A function for the GPT model to generate text


def generate_text_simple(model, idx,            #A idx is a (batch, n_tokens) tensor of token IDs in the current context
                         max_new_tokens, context_size):
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -context_size:]       #B Crop the context if it exceeds the supported context size
        with torch.no_grad():
            logits = model(idx_cond)

        logits = logits[:, -1, :]               #C Focus only on the last time step: (batch, n_tokens, vocab_size) becomes (batch, vocab_size)
        probas = torch.softmax(logits, dim=-1)  #D probas has shape (batch, vocab_size)
        idx_next = torch.argmax(probas, dim=-1, keepdim=True)  #E idx_next has shape (batch, 1)
        idx = torch.cat((idx, idx_next), dim=1) #F Append the new token ID to the running sequence: idx becomes (batch, n_tokens+1)

    return idx


The preceding code snippet demonstrates a simple implementation of a generative loop for a language model using PyTorch. It iterates for a specified number of new tokens to be generated, crops the current context to fit the model's maximum context size, computes predictions, and then selects the next token based on the highest-probability prediction.

In the preceding code, the generate_text_simple function, we use a softmax function to convert the logits into a probability distribution from which we identify the position with the highest value via torch.argmax . The softmax function is monotonic, meaning it preserves the order of its inputs when transformed into outputs. So, in practice, the softmax step is redundant since the position with the highest score in the softmax output tensor is the same position as in the logits tensor. In other words, we could apply the torch.argmax function to the logits tensor directly and get identical results. However, we coded the conversion to illustrate the full process of transforming logits into probabilities, which can add additional intuition: the model generates the most likely next token, which is known as greedy decoding.
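A tiny example makes this equivalence concrete; the toy logits below are made up purely for illustration.

toy_logits = torch.tensor([[1.2, -0.5, 3.1, 0.7]])
print(torch.argmax(torch.softmax(toy_logits, dim=-1), dim=-1))    # tensor([2])
print(torch.argmax(toy_logits, dim=-1))                           # tensor([2]), identical result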

In the next chapter, when we implement the GPT training code, we will also introduce additional sampling techniques where we modify the softmax outputs such that the model doesn't always select the most likely token, which introduces variability and creativity in the generated text.
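As a brief, non-authoritative preview of what such a sampling step might look like (the next chapter covers the actual techniques, such as temperature scaling), one could replace the argmax with probabilistic sampling:

probas = torch.softmax(torch.tensor([[1.2, -0.5, 3.1, 0.7]]), dim=-1)    # toy probabilities for illustration
idx_next = torch.multinomial(probas, num_samples=1)    # sample an index in proportion to its probability
print(idx_next)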

This process of generating one token ID at a time and appending it to the context using the generate_text_simple function is further illustrated in figure 4.18. (The token ID generation process for each iteration is detailed in figure 4.17.)
Figure 4.18 An illustration showing six iterations of a token prediction
cycle, where the model takes a sequence of initial token IDs as input,
predicts the next token, and appends this token to the input sequence
for the next iteration. (The token IDs are also translated into their
corresponding text for better understanding.)

As shown in figure 4.18, we generate the token IDs in an iterative fashion. For instance, in iteration 1, the model is provided with the tokens corresponding to "Hello, I am," predicts the next token (with ID 257, which is "a"), and appends it to the input. This process is repeated until the model produces the complete sentence "Hello, I am a model ready to help" after six iterations.

Let's now try out the generate_text_simple function with the "Hello, I am" context as model input, as shown in figure 4.18, in practice. First, we encode the input context into token IDs:

start_context = "Hello, I am"


encoded = tokenizer.encode(start_context)
print("encoded:", encoded)
encoded_tensor = torch.tensor(encoded).unsqueeze(0) #A
print("encoded_tensor.shape:", encoded_tensor.shape)


The encoded IDs are as follows:

encoded: [15496, 11, 314, 716]
encoded_tensor.shape: torch.Size([1, 4])


Next, we put the model into .eval() mode, which disables random components like dropout that are only used during training, and apply the generate_text_simple function to the encoded input tensor:

model.eval()    #A Disable random components such as dropout, which are only used during training
out = generate_text_simple(
    model=model,
    idx=encoded_tensor,
    max_new_tokens=6,
    context_size=GPT_CONFIG_124M["context_length"]
)
print("Output:", out)
print("Output length:", len(out[0]))


The resulting output token IDs are as follows:

Output: tensor([[15496, 11, 314, 716, 27018, 24086, 47843,
        30961, 42348, 7267]])
Output length: 10

Using the .decode method of the tokenizer, we can convert the IDs back into text:

decoded_text = tokenizer.decode(out.squeeze(0).tolist())
print(decoded_text)

The model output in text format is as follows: 

Hello, I am Featureiman Byeswickattribute argue


As we can see, based on the preceding output, the model generated gibberish, which is not at all like the coherent text shown in figure 4.18. What happened? The reason the model is unable to produce coherent text is that we haven't trained it yet. So far, we have just implemented the GPT architecture and initialized a GPT model instance with initial random weights. Model training is a large topic in itself, and we will tackle it in the next chapter.

Exercise 4.3 Using separate dropout parameters

At the beginning of this chapter, we defined a global drop_rate setting in the GPT_CONFIG_124M dictionary to set the dropout rate in various places throughout the GPTModel architecture. Change the code to specify a separate dropout value for the various dropout layers throughout the model architecture. (Hint: there are three distinct places where we used dropout layers: the embedding layer, shortcut layer, and multi-head attention module.)
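One way to begin this exercise is sketched below: split the single drop_rate entry into three separate entries and reference them in the corresponding dropout layers. The three new key names are illustrative choices, not taken from the chapter.

GPT_CONFIG_124M = {
    "vocab_size": 50257,
    "context_length": 1024,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate_emb": 0.1,         # dropout applied after the embedding layers
    "drop_rate_shortcut": 0.1,    # dropout in the transformer block's shortcut connections
    "drop_rate_attn": 0.1,        # dropout inside the multi-head attention module
    "qkv_bias": False
}
# The Dropout layers in GPTModel, TransformerBlock, and MultiHeadAttention would then
# read these keys instead of the single "drop_rate" entry.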

4.8 Summary
Layer normalization stabilizes training by ensuring that each
layer’s outputs have a consistent mean and variance.
Shortcut connections are connections that skip one or more
layers by feeding the output of one layer directly to a deeper
layer, which helps mitigate the vanishing gradient problem
when training deep neural networks, such as LLMs.
Transformer blocks are a core structural component of GPT
models, combining masked multi-head attention modules with
fully connected feed forward networks that use the GELU
activation function.
GPT models are LLMs with many repeated transformer blocks
that have millions to billions of parameters.
GPT models come in various sizes, for example, 124, 345, 762,
and 1,542 million parameters, which we can implement with the
same GPTModel Python class.
The text-generation capability of a GPT-like LLM involves
decoding output tensors into human-readable text by
sequentially predicting one token at a time based on a given
input context.
Without training, a GPT model generates incoherent text, which
underscores the importance of model training for coherent text
generation, which is the topic of subsequent chapters.
