4 Implementing A GPT Model From Scratch To Generate Text - Build A Large Language Model (From Scratch)
In the previous chapter, you learned and coded the multi-head attention
mechanism, one of the core components of LLMs. In this chapter, we will
now code the other building blocks of an LLM and assemble them into a
GPT-like model that we will train in the next chapter to generate human-
like text, as illustrated in figure 4.1.
Figure 4.1 A mental model of the three main stages of coding an LLM,
pretraining the LLM on a general text dataset, and finetuning it on a
labeled dataset. This chapter focuses on implementing the LLM
architecture, which we will train in the next chapter.
As you can see in figure 4.2, we have already covered several aspects, such as input tokenization and embedding, as well as the masked multi-head attention module. The focus of this chapter will be on implementing the core structure of the GPT model, including its transformer blocks, which we will then train in the next chapter to generate human-like text.
Note that we are focusing on GPT-2 because OpenAI has made the weights of the pretrained model publicly available, which we will load into our implementation in chapter 6. GPT-3 is fundamentally the same in terms of model architecture, except that it is scaled up from 1.5 billion parameters in GPT-2 to 175 billion parameters in GPT-3, and it is trained on more data. As of this writing, the weights for GPT-3 are not publicly available. GPT-2 is also a better choice for learning how to implement LLMs, as it can be run on a single laptop computer, whereas GPT-3 requires a GPU cluster for training and inference. According to Lambda Labs, it would take 355 years to train GPT-3 on a single V100 datacenter GPU and 665 years on a consumer RTX 8000 GPU.
We specify the configuration of the small GPT-2 model via the following Python dictionary, which we will use in the code examples later:
GPT_CONFIG_124M = {
    "vocab_size": 50257,     # Vocabulary size
    "context_length": 1024,  # Context length
    "emb_dim": 768,          # Embedding dimension
    "n_heads": 12,           # Number of attention heads
    "n_layers": 12,          # Number of layers
    "drop_rate": 0.1,        # Dropout rate
    "qkv_bias": False        # Query-Key-Value bias
}
Figure 4.3 A mental model outlining the order in which we code the
GPT architecture. In this chapter, we will start with the GPT backbone,
a placeholder architecture, before we get to the individual core pieces
and eventually assemble them in a transformer block for the final GPT
architecture.
The numbered boxes shown in figure 4.3 illustrate the order in which we tackle the individual concepts required to code the final GPT architecture. We will start with step 1, a placeholder GPT backbone we call DummyGPTModel.
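A minimal sketch of such a placeholder backbone, assuming simple pass-through stand-ins named DummyTransformerBlock and DummyLayerNorm for the pieces we have not built yet, might look like this:

import torch
import torch.nn as nn

class DummyTransformerBlock(nn.Module):   # placeholder; simply returns its input
    def __init__(self, cfg):
        super().__init__()

    def forward(self, x):
        return x

class DummyLayerNorm(nn.Module):          # placeholder; simply returns its input
    def __init__(self, normalized_shape, eps=1e-5):
        super().__init__()

    def forward(self, x):
        return x

class DummyGPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])
        self.trf_blocks = nn.Sequential(
            *[DummyTransformerBlock(cfg) for _ in range(cfg["n_layers"])])
        self.final_norm = DummyLayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False)

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)                                       # token embeddings
        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))  # positional embeddings
        x = self.drop_emb(tok_embeds + pos_embeds)                              # dropout on the summed embeddings
        x = self.trf_blocks(x)                                                  # (dummy) transformer blocks
        x = self.final_norm(x)                                                  # (dummy) final normalization
        return self.out_head(x)                                                 # project to vocabulary-sized logits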
The forward method describes the data flow through the model: it computes token and positional embeddings for the input indices, applies dropout, processes the data through the transformer blocks, applies normalization, and finally produces logits with the linear output layer.
Next, we will prepare the input data and initialize a new GPT model to illustrate its usage. Building on the figures we have seen in chapter 2, where we coded the tokenizer, figure 4.4 provides a high-level overview of how data flows in and out of a GPT model.
import torch
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

batch = []
txt1 = "Every effort moves you"
txt2 = "Every day holds a"

batch.append(torch.tensor(tokenizer.encode(txt1)))
batch.append(torch.tensor(tokenizer.encode(txt2)))
batch = torch.stack(batch, dim=0)
print(batch)
The resulting token IDs for the two texts are as follows:

tensor([[6109, 3626, 6100,  345],
        [6109, 1110, 6622,  257]])
torch.manual_seed(123)
model = DummyGPTModel(GPT_CONFIG_124M)
logits = model(batch)
print("Output shape:", logits.shape)
print(logits)
The output tensor has two rows corresponding to the two text samples. Each text sample consists of four tokens; each token is a 50,257-dimensional vector, which matches the size of the tokenizer's vocabulary.
Now that we have taken a top-down look at the GPT architecture and its inputs and outputs, we will code the individual placeholders in the upcoming sections, starting with the real layer normalization class that will replace the DummyLayerNorm in the previous code.
The main idea behind layer normalization is to adjust the activations (outputs) of a neural network layer to have a mean of 0 and a variance of 1, also known as unit variance. This adjustment speeds up the convergence to effective weights and ensures consistent, reliable training. As we have seen in the previous section, based on the DummyLayerNorm placeholder, in GPT-2 and modern transformer architectures, layer normalization is typically applied before and after the multi-head attention module and before the final output layer.
We can recreate the example shown in figure 4.5 via the following code, where we implement a neural network layer with five inputs and six outputs that we apply to two input examples:
torch.manual_seed(123)
batch_example = torch.randn(2, 5)   # two training examples with five features each
layer = nn.Sequential(nn.Linear(5, 6), nn.ReLU())
out = layer(batch_example)
print(out)
This prints the following tensor, where the first row lists the layer outputs for the first input and the second row lists the layer outputs for the second input:
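Before normalizing these outputs, let's check their per-row mean and variance. A short sketch of this check, reusing the out tensor from above, might look like this:

mean = out.mean(dim=-1, keepdim=True)   # mean across the feature dimension of each row
var = out.var(dim=-1, keepdim=True)     # variance across the feature dimension of each row
print("Mean:\n", mean)
print("Variance:\n", var)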
The first row in the mean tensor here contains the mean value for the first input row, and the second output row contains the mean for the second input row.
The dim parameter specifies the dimension along which the calculation of the statistics (here, mean or variance) should be performed in a tensor, as shown in figure 4.6.
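Next, we can normalize the layer outputs by subtracting the mean and dividing by the square root of the variance (the standard deviation). A minimal sketch of this step, reusing mean and var from above, looks like this:

out_norm = (out - mean) / torch.sqrt(var)   # center each row and scale it to unit variance
mean = out_norm.mean(dim=-1, keepdim=True)
var = out_norm.var(dim=-1, keepdim=True)
print("Normalized layer outputs:\n", out_norm)
print("Mean:\n", mean)
print("Variance:\n", var)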
As we can see based on the results, the normalized layer outputs, which now also contain negative values, have a mean of 0 and a variance of 1:
Note that the value 2.9802e-08 in the output tensor is the scientific notation for 2.9802 × 10⁻⁸, which is 0.0000000298 in decimal form. This value is very close to 0, but it is not exactly 0 due to small numerical errors that can accumulate because of the finite precision with which computers represent numbers.
To improve readability, we can also turn off the scientific notation when printing tensor values by setting sci_mode to False:
torch.set_printoptions(sci_mode=False)
print("Mean:\n", mean)
print("Variance:\n", var)

The output is as follows:

Mean:
 tensor([[    0.0000],
        [    0.0000]], grad_fn=<MeanBackward1>)
Variance:
 tensor([[1.],
        [1.]], grad_fn=<VarBackward0>)
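So far, we have applied layer normalization in a step-by-step fashion. Let's now encapsulate this process in a LayerNorm module with trainable scale and shift parameters. A minimal sketch of such a class, assuming a small eps constant is added to the variance to avoid division by zero, might look like this:

class LayerNorm(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5                                    # avoids division by zero
        self.scale = nn.Parameter(torch.ones(emb_dim))     # trainable scale
        self.shift = nn.Parameter(torch.zeros(emb_dim))    # trainable shift

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)  # biased variance (divides by n)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift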
Biased variance
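In the variance calculation above, setting unbiased=False means the variance is computed by dividing by the number of inputs n rather than n − 1 (Bessel's correction). For LLMs, where the embedding dimension is large, the difference between using n and n − 1 is practically negligible, and this choice keeps the normalization compatible with the behavior of the original GPT-2 model.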
Let's now try the LayerNorm module in practice and apply it to the batch input:
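A usage sketch, reusing the 2 × 5 batch_example tensor from before, might look as follows:

ln = LayerNorm(emb_dim=5)
out_ln = ln(batch_example)
mean = out_ln.mean(dim=-1, keepdim=True)
var = out_ln.var(dim=-1, unbiased=False, keepdim=True)
print("Mean:\n", mean)
print("Variance:\n", var)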
Cz vw snz zxk ebdas nk gro srlutes, vyr rleya ltmrazaoinino ykka koswr ac
tpdexcee zgn smaiolrezn prk euvlas vl sqcv vl rvu erw pnitus sbcd yrrz ubrx
sogo c onms xl 0 cpn s nivcaera lv 1:
Mean:
 tensor([[    -0.0000],
        [     0.0000]], grad_fn=<MeanBackward1>)
Variance:
 tensor([[1.0000],
        [1.0000]], grad_fn=<VarBackward0>)
Jn gro vknr ceisotn, vw fwjf xfvx rs xur UPVK vniocattia utoifnnc, whhci ja
nxx vl xgr taicavonti fcotunins xcqb jn EVWa, edsnait el pro rdintoitlaa
CkPN oinfncut vw yuoc nj cdrj entocis.
GELU and SwiGLU are more complex and smooth activation functions incorporating Gaussian and sigmoid-gated linear units, respectively. They offer improved performance for deep learning models, unlike the simpler ReLU.
The GELU activation function can be implemented in several ways; the exact version is defined as GELU(x) = x·Φ(x), where Φ(x) is the cumulative distribution function of the standard Gaussian distribution. In practice, however, it's common to implement a computationally cheaper approximation (the original GPT-2 model was also trained with this approximation):
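A sketch of the widely used tanh-based approximation, GELU(x) ≈ 0.5 · x · (1 + tanh[√(2/π) · (x + 0.044715 · x³)]), implemented as a PyTorch module, might look like this:

class GELU(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return 0.5 * x * (1 + torch.tanh(
            torch.sqrt(torch.tensor(2.0 / torch.pi)) *
            (x + 0.044715 * torch.pow(x, 3))
        ))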
Next, to get an idea of what this GELU function looks like and how it compares to the ReLU function, let's plot these functions side by side:
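A plotting sketch using matplotlib, assuming the GELU class defined above, might look as follows:

import matplotlib.pyplot as plt

gelu, relu = GELU(), nn.ReLU()
x = torch.linspace(-3, 3, 100)      # sample 100 input values between -3 and 3
y_gelu, y_relu = gelu(x), relu(x)

plt.figure(figsize=(8, 3))
for i, (y, label) in enumerate(zip([y_gelu, y_relu], ["GELU", "ReLU"]), 1):
    plt.subplot(1, 2, i)
    plt.plot(x, y)
    plt.title(f"{label} activation function")
    plt.xlabel("x")
    plt.ylabel(f"{label}(x)")
plt.tight_layout()
plt.show()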
As we can see in the resulting plot in figure 4.8, ReLU is a piecewise linear function that outputs the input directly if it is positive; otherwise, it outputs zero. GELU is a smooth, nonlinear function that approximates ReLU but with a non-zero gradient for negative values.
Figure 4.8 The output of the GELU and ReLU plots using matplotlib.
The x-axis shows the function inputs and the y-axis shows the function
outputs.
Next, let's use the GELU function to implement the small neural network module, FeedForward, that we will be using in the LLM's transformer block later.
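A minimal sketch of such a module, assuming the GELU class from the previous section and the 4× expansion of the embedding dimension used in GPT-2-style feed forward networks, might look like this:

class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),   # expand: 768 -> 3072
            GELU(),
            nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),   # contract: 3072 -> 768
        )

    def forward(self, x):
        return self.layers(x)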
Figure 4.9 shows how the embedding size is manipulated inside this small feed forward neural network when we pass it some inputs.
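A usage sketch that passes a batch of random token embeddings through the module might look like this:

ffn = FeedForward(GPT_CONFIG_124M)
x = torch.rand(2, 3, 768)   # batch of 2 samples with 3 tokens of 768 dimensions each
out = ffn(x)
print(out.shape)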
torch.Size([2, 3, 768])
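Shortcut (skip) connections are the next ingredient. To see why they matter for deep networks, we can build a small multi-layer network that can be toggled between using and not using them. A sketch of such an ExampleDeepNeuralNetwork class (the exact layer composition with GELU activations is an assumption for illustration) might look like this:

class ExampleDeepNeuralNetwork(nn.Module):
    def __init__(self, layer_sizes, use_shortcut):
        super().__init__()
        self.use_shortcut = use_shortcut
        self.layers = nn.ModuleList([
            nn.Sequential(nn.Linear(layer_sizes[i], layer_sizes[i + 1]), GELU())
            for i in range(len(layer_sizes) - 1)
        ])

    def forward(self, x):
        for layer in self.layers:
            layer_output = layer(x)
            if self.use_shortcut and x.shape == layer_output.shape:
                x = x + layer_output   # shortcut: add the layer input to its output
            else:
                x = layer_output
        return x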
Let's use this code to first initialize a neural network without shortcut connections. Here, each layer will be initialized such that it accepts an example with three input values and returns three output values. The last layer returns a single output value:
layer_sizes = [3, 3, 3, 3, 3, 1]
sample_input = torch.tensor([[1., 0., -1.]])
torch.manual_seed(123)   # seed for reproducible initial weights
model_without_shortcut = ExampleDeepNeuralNetwork(
    layer_sizes, use_shortcut=False
)
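Next, we need a helper that runs a forward and backward pass and reports the gradient magnitude of each layer. A sketch of such a print_gradients function, assuming a mean squared error loss against a dummy target of 0, might look like this:

def print_gradients(model, x):
    output = model(x)                       # forward pass
    target = torch.tensor([[0.]])           # dummy target value
    loss = nn.MSELoss()(output, target)     # how close is the output to the target?
    loss.backward()                         # backward pass computes the gradients
    for name, param in model.named_parameters():
        if 'weight' in name:
            # print the mean absolute gradient to obtain a single value per layer
            print(f"{name} has gradient mean of {param.grad.abs().mean().item()}")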
In the preceding code, we specify a loss function that computes how close the model output and a user-specified target (here, for simplicity, the value 0) are. Then, when calling loss.backward(), PyTorch computes the loss gradient for each layer in the model. We can iterate through the weight parameters via model.named_parameters(). Suppose we have a 3 × 3 weight parameter matrix for a given layer. In that case, this layer will have 3 × 3 gradient values, and we print the mean absolute gradient of these 3 × 3 gradient values to obtain a single gradient value per layer so we can compare the gradients between layers more easily.
Let's now use the print_gradients function and apply it to the model without skip connections:
print_gradients(model_without_shortcut, sample_input)
The output shows the mean absolute gradient per layer; without shortcut connections, the gradients become progressively smaller as we move from the last layer toward the first, which illustrates the vanishing gradient problem.
Let's now instantiate a model with skip connections and see how it compares:
torch.manual_seed(123)
model_with_shortcut = ExampleDeepNeuralNetwork(
    layer_sizes, use_shortcut=True
)
print_gradients(model_with_shortcut, sample_input)
As we can see, based on the output, the last layer (layers.4) still has a larger gradient than the other layers. However, the gradient value stabilizes as we progress toward the first layer (layers.0) and doesn't shrink to a vanishingly small value.
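The same shortcut idea is used inside the transformer block that combines the components we have built so far. A sketch of such a TransformerBlock class, assuming the MultiHeadAttention class from the previous chapter with the constructor arguments shown, might look like this:

class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.att = MultiHeadAttention(          # multi-head attention from chapter 3
            d_in=cfg["emb_dim"],
            d_out=cfg["emb_dim"],
            context_length=cfg["context_length"],
            num_heads=cfg["n_heads"],
            dropout=cfg["drop_rate"],
            qkv_bias=cfg["qkv_bias"])
        self.ff = FeedForward(cfg)
        self.norm1 = LayerNorm(cfg["emb_dim"])
        self.norm2 = LayerNorm(cfg["emb_dim"])
        self.drop_shortcut = nn.Dropout(cfg["drop_rate"])

    def forward(self, x):
        shortcut = x                 # shortcut connection for the attention block
        x = self.norm1(x)
        x = self.att(x)
        x = self.drop_shortcut(x)
        x = x + shortcut             # add the original input back

        shortcut = x                 # shortcut connection for the feed forward block
        x = self.norm2(x)
        x = self.ff(x)
        x = self.drop_shortcut(x)
        x = x + shortcut
        return x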
The class also implements the forward pass, where each component is followed by a shortcut connection that adds the input of the block to its output. This critical feature helps gradients flow through the network during training and improves the learning of deep models, as explained in section 4.4.
torch.manual_seed(123)
x = torch.rand(2, 4, 768)   # sample input of shape [batch_size, num_tokens, emb_dim]
block = TransformerBlock(GPT_CONFIG_124M)
output = block(x)

print("Input shape:", x.shape)
print("Output shape:", output.shape)
As we can see from the code output, the transformer block maintains the input dimensions in its output, indicating that the transformer architecture processes sequences of data without altering their shape throughout the network.
With the transformer block implemented in this section, we now have all the building blocks, as shown in figure 4.14, needed to implement the GPT architecture in the next section.
Before we assemble the GPT-2 model in code, let's look at its overall structure in figure 4.15, which combines all the concepts we have covered so far in this chapter.
As shown in figure 4.15, the output from the final transformer block then
goes through a final layer normalization step before reaching the linear
output layer. This layer maps the transformer’s output to a high-
dimensional space (in this case, 50,257 dimensions, corresponding to the
model’s vocabulary size) to predict the next token in the sequence.
Let’s now implement the architecture we see in figure 4.15 in code.
class GPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])
        self.trf_blocks = nn.Sequential(
            *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])
        self.final_norm = LayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(
            cfg["emb_dim"], cfg["vocab_size"], bias=False
        )
The __init__ constructor of this GPTModel class initializes the token and
positional embedding layers using the configurations passed in via a
Python dictionary, cfg . These embedding layers are responsible for
converting input token indices into dense vectors and adding positional
information, as discussed in chapter 2.
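A sketch of a forward method matching the description below, assuming the embedding, dropout, transformer-block, normalization, and output-head attributes defined in the constructor above, might look like this:

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)               # token embeddings
        pos_embeds = self.pos_emb(                      # positional embeddings for positions 0..seq_len-1
            torch.arange(seq_len, device=in_idx.device))
        x = self.drop_emb(tok_embeds + pos_embeds)
        x = self.trf_blocks(x)                          # pass through all transformer blocks
        x = self.final_norm(x)
        logits = self.out_head(x)                       # unnormalized scores over the vocabulary
        return logits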
The forward method takes a batch of input token indices, computes their embeddings, applies the positional embeddings, passes the sequence through the transformer blocks, normalizes the final output, and then computes the logits, representing the next token's unnormalized probabilities. We will convert these logits into tokens and text outputs in the next section.
Let's now initialize the 124 million parameter GPT model using the GPT_CONFIG_124M dictionary we pass into the cfg parameter and feed it with the batch text input we created at the beginning of this chapter:
torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
out = model(batch)
print("Input batch:\n", batch)
print("\nOutput shape:", out.shape)
print(out)
The preceding code prints the contents of the input batch followed by the output tensor:
Input batch:
 tensor([[6109, 3626, 6100,  345],    # token IDs of text 1
         [6109, 1110, 6622,  257]])   # token IDs of text 2

Output shape: torch.Size([2, 4, 50257])
tensor([...,
        [[-0.2564,  0.0900,  0.0335,  ...,  0.2659,  0.4454, -0.6806],
         [ 0.1230,  0.3653, -0.2074,  ...,  0.7705,  0.2710,  0.2246],
         [ 1.0558,  1.0318, -0.2800,  ...,  0.6936,  0.3205, -0.3178],
         [-0.1565,  0.3926,  0.3288,  ...,  1.2630, -0.1858,  0.0388]]],
       grad_fn=<UnsafeViewBackward0>)
As we can see, the output tensor has the shape [2, 4, 50257], since we passed in two input texts with four tokens each. The last dimension, 50,257, corresponds to the vocabulary size of the tokenizer. In the next section, we will see how to convert each of these 50,257-dimensional output vectors back into tokens.
Before we move on to the next section and code the function that converts the model outputs into text, let's spend a bit more time with the model architecture itself and analyze its size.
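A sketch of the parameter count, tallying all trainable parameters via numel() and defining the total_params variable that the weight-tying calculation below relies on, might look like this:

total_params = sum(p.numel() for p in model.parameters())
print(f"Total number of parameters: {total_params:,}")

For this configuration, the count comes out to 163,009,536 parameters, noticeably more than the 124 million the model name suggests.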
The reason is a concept called weight tying that is used in the original GPT-2 architecture, which means that the original GPT-2 architecture reuses the weights from the token embedding layer in its output layer. To understand what this means, let's take a look at the shapes of the token embedding layer and the linear output layer that we initialized on the model via the GPTModel earlier:
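A sketch of this shape check, assuming the tok_emb and out_head attribute names from the constructor above, might look like this:

print("Token embedding layer shape:", model.tok_emb.weight.shape)
print("Output layer shape:", model.out_head.weight.shape)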
As we can see based on the print outputs, the weight tensors for both these layers have the same shape, torch.Size([50257, 768]).
The token embedding and output layers are both very large due to the 50,257 rows corresponding to the tokenizer's vocabulary. Let's remove the output layer parameter count from the total GPT-2 model count according to the weight tying:
total_params_gpt2 = (
    total_params - sum(p.numel()
    for p in model.out_head.parameters())
)
print(f"Number of trainable parameters "
      f"considering weight tying: {total_params_gpt2:,}"
)
Number of trainable parameters considering weight tying: 124,412,160
As we can see, the model is now only 124 million parameters large, matching the original size of the GPT-2 model.
Lastly, let's compute the memory requirements of the model, assuming each parameter is a 32-bit float that takes up 4 bytes:

total_size_bytes = total_params * 4                 # total size in bytes (float32 = 4 bytes per parameter)
total_size_mb = total_size_bytes / (1024 * 1024)    # convert bytes to megabytes
print(f"Total size of the model: {total_size_mb:.2f} MB")
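As a quick sanity check, assuming the full (untied) count of 163,009,536 parameters from earlier: 163,009,536 × 4 bytes ≈ 652 million bytes, which is roughly 622 MB after dividing by 1,024², so even the smallest GPT-2 variant needs well over half a gigabyte just to store its weights.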
In this final section of the chapter, we will implement the code that converts the tensor outputs of the GPT model back into text. Before we get started, let's briefly review how a generative model like an LLM generates text one word (or token) at a time, as shown in figure 4.16.
The process by which a GPT model goes from output tensors to generated text involves several steps, as illustrated in figure 4.17. These steps include decoding the output tensors, selecting tokens based on a probability distribution, and converting these tokens into human-readable text.
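The generate_text_simple function used next implements this as a greedy decoding loop: it repeatedly crops the running context to the model's supported length, converts the last position's logits into probabilities with softmax, picks the most likely token, and appends it to the sequence. A sketch of such a function, matching the model, idx, max_new_tokens, and context_size arguments used later, might look like this:

def generate_text_simple(model, idx, max_new_tokens, context_size):
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -context_size:]        # crop the context to the supported length
        with torch.no_grad():
            logits = model(idx_cond)
        logits = logits[:, -1, :]                # focus on the last time step only
        probas = torch.softmax(logits, dim=-1)   # convert logits to probabilities
        idx_next = torch.argmax(probas, dim=-1, keepdim=True)   # greedy: pick the most likely token
        idx = torch.cat((idx, idx_next), dim=1)  # append the predicted token ID
    return idx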
In the next chapter, when we implement the GPT training code, we will also introduce additional sampling techniques where we modify the softmax outputs such that the model doesn't always select the most likely token, which introduces variability and creativity in the generated text.
Let's now try out the generate_text_simple function with the "Hello, I am" context as model input, as shown in figure 4.18, in practice. First, we encode the input context into token IDs:
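A sketch of the encoding step, producing the encoded_tensor variable used below (the added batch dimension is needed because the model expects a batch of inputs), might look like this:

start_context = "Hello, I am"
encoded = tokenizer.encode(start_context)
print("encoded:", encoded)
encoded_tensor = torch.tensor(encoded).unsqueeze(0)   # add a batch dimension
print("encoded_tensor.shape:", encoded_tensor.shape)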
Next, we put the model into .eval() mode, which disables random components like dropout that are only used during training, and use the generate_text_simple function on the encoded input tensor:
model.eval()   # disable dropout, since we are not training the model
out = generate_text_simple(
    model=model,
    idx=encoded_tensor,
    max_new_tokens=10,
    context_size=GPT_CONFIG_124M["context_length"]
)
print("Output:", out)
print("Output length:", len(out[0]))
Using the .decode method of the tokenizer, we can convert the IDs back into text:
decoded_text = tokenizer.decode(out.squeeze(0).tolist())
print(decoded_text)
4.8 Summary
Layer normalization stabilizes training by ensuring that each
layer’s outputs have a consistent mean and variance.
Shortcut connections are connections that skip one or more
layers by feeding the output of one layer directly to a deeper
layer, which helps mitigate the vanishing gradient problem
when training deep neural networks, such as LLMs.
Transformer blocks are a core structural component of GPT
models, combining masked multi-head attention modules with
fully connected feed forward networks that use the GELU
activation function.
GPT models are LLMs with many repeated transformer blocks
that have millions to billions of parameters.
GPT models come in various sizes, for example, 124, 345, 762,
and 1,542 million parameters, which we can implement with the
same GPTModel Python class.
The text-generation capability of a GPT-like LLM involves
decoding output tensors into human-readable text by
sequentially predicting one token at a time based on a given
input context.
Without training, a GPT model generates incoherent text, which
underscores the importance of model training for coherent text
generation, which is the topic of subsequent chapters.