Let's Build Our Own GPT Model From Scratch With PyTorch - by Shubh Mishra - Nov, 2024 - Level Up Coding
Today, we will step away from our Vision Transformer series and discuss building a
basic variant of a Generative Pre-trained Transformer (GPT).
This imperfectly predictable, or stochastic, behaviour [1] can be loosely related to the way the next token is predicted in our “i like to eat” example: we help the model be less deterministic by letting it randomly sample the next token (e.g. <ice-cream>, <cookies>, etc.), which we will get to later in the article.
We will train the model on a Shakespeare corpus so that it learns to generate text in Shakespeare's style. This will be my longest article yet, so take a deep breath and feel free to take breaks as needed. Let's dive right in!
Content
1. Loading Data — Creating Data Batch Loader and Data Split.
Note: The code in this article follows this video on GPT by none other than Andrej Karpathy. His video was, in fact, my very first implementation of the attention mechanism, from which I went on to follow various other architectures and papers on convolutional attention, shifted windows, etc.
If you've already seen it, you won't find much difference here beyond minor code changes, so you can follow this one as a quick revision. If you haven't… let's dive straight into it.
Loading Data
import torch
import torch.nn as nn
from torch.nn import functional as F

# We start by downloading our Shakespeare txt file (stored with the name input.txt)
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
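Before we can batch anything, we need a character-level tokenizer and a train/validation split. Below is a minimal sketch of that step, written to be consistent with the output that follows; the 90/10 split ratio, the dropout value, and the number of layers are assumptions, while block_size (8), batch_size (32), embed_size (64) and num_head (4) follow the values used later in the article.

text = open('input.txt', 'r', encoding='utf-8').read()

# hyperparameters used throughout
block_size = 8        # tokens per training sequence
batch_size = 32       # sequences per batch
embed_size = 64       # embedding dimension
num_head = 4          # attention heads per block
n_layers = 4          # number of decoder blocks (assumption)
dropout = 0.1         # dropout probability (assumption)
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# character-level vocabulary and tokenizer
vocab = sorted(list(set(text)))
stoi = {ch: i for i, ch in enumerate(vocab)}
itos = {i: ch for i, ch in enumerate(vocab)}
encode = lambda s: [stoi[c] for c in s]       # string -> list of token ids
decode = lambda ids: [itos[i] for i in ids]   # list of token ids -> list of characters

# 90/10 train / validation split of the encoded corpus (ratio is an assumption)
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))
train, val = data[:n], data[n:]

# quick sanity check of the tokenizer
ids = encode("I like to eat")
txt = decode(ids)
print("ids:", ids)
print("txt:", txt)
print("".join(txt))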
# Output:-
ids: [21, 1, 50, 47, 49, 43, 1, 58, 53, 1, 43, 39, 58]
txt: ['I', ' ', 'l', 'i', 'k', 'e', ' ', 't', 'o', ' ', 'e', 'a', 't']
I like to eat
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train if split == 'train' else val
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x.to(device), y.to(device)
xb, yb = get_batch('train')
We want to draw random sequences of block_size (8) tokens from the corpus, so we generate batch_size (32) random starting indexes (ix). For each index we take the next 8 token ids and stack them into a batch. The target y is offset by one position from x (i+1 to i+block_size+1), because at every position we want to predict the next token in the sequence.
Example:-

ix = [33]

for i in ix:
    print(train[i:i+18])
    print(train[i+3:i+18+3])  # I've chosen +3 over +1 only for the sake of example

for i in ix:
    print("".join(decode(train[i:i+18])).replace("\n", ""))
    print("".join(decode(train[i+3:i+18+3])).replace("\n", ""))
# Output:-
tensor([39, 52, 63, 1, 44, 59, 56, 58, 46, 43, 56, 6, 1, 46, 43, 39, 56, 1])
tensor([ 1, 44, 59, 56, 58, 46, 43, 56, 6, 1, 46, 43, 39, 56, 1, 51, 43, 1])
any further, hear
further, hear me
BigramLanguageModel
A bigram language model predicts each token from just the one before it. As a quick refresher on n-grams:

2-gram (Bigram) for “i like to eat”: ["i like", "like to", "to eat"]
3-gram (Trigram) for “i like to eat”: ["i like to", "like to eat"]
Now let's get to the heart of the article: Multi-Head Attention. As I've already implemented this part in over a dozen articles in the Vision Transformer series, I'll try not to waste your time and get straight to the concept. (Man… I could easily have stretched this blog into two parts, but screw it, let's do it anyway.)
GPT borrows from the Transformer architecture proposed in the Attention Is All You Need paper. However, it differs by stacking only the (masked) Multi-Head Attention blocks from the decoder section.
class BigramLanguageModel(nn.Module):

    def forward(self, idx, targets=None):
        ...
        # flatten the (B, T) grid of predictions so that F.cross_entropy
        # can compare every position against its target token
        B, T, C = logits.shape
        logits = logits.view(B*T, C)
        targets = targets.view(B*T)
        loss = F.cross_entropy(logits, targets)
Here the input idx is the batch we generated earlier, of shape (B, T), where T is the block size (token length).
The forward pass first generates an embedding for each token, giving a tensor of shape (B, T, C). As the figure above shows, we then add a positional embedding to the token embedding. The token embedding represents the information each token holds in a fixed embedding dimension, but it carries no information about where the token sits in the sequence; that is why we additionally add positional embeddings, so the model has context for each token's position. If you have any doubts, I'd recommend referring to my earlier blog, where I explain all of this in more detail.
nn.Embedding in PyTorch is a layer that maps discrete, categorical values (like word indices) to continuous, dense vectors. The layer takes integer indices as input, where each index represents a unique categorical item (e.g. a word, a token, or some other categorical value). Internally, nn.Embedding maintains an embedding matrix of shape (num_embeddings, embedding_dim), so it can produce a dense representation of each token. Since we are building a simplistic version of GPT, we use nn.Embedding directly to generate the positional embeddings as well, rather than using other standard approaches.
From here, it's straightforward: we have our Block (a stack of decoder modules), and its output is a tensor of the same shape as the input, but in which each token now carries information about all the tokens preceding it.
Finally, we apply Layer Norm (a common practice to stabilize training) and then a linear layer that maps the embedding dimension C to our vocab dimension. The vocab dimension is simply the number of unique characters in input.txt. A general way to measure how accurate our predictions are is to compare the output of the block stack with the target indices.
The output logits are meant to represent a probability distribution over the vocab size V, predicting the next (target) token at each position in the sequence. Cross-entropy loss is therefore used to measure how close our output is to the target token sequence.
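Putting that description together, a minimal sketch of the full model might look like this; the Block class is defined further down in the article, and the exact attribute names here are assumptions.

class BigramLanguageModel(nn.Module):
    # minimal sketch assembled from the description above
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, embed_size)
        self.position_embedding = nn.Embedding(block_size, embed_size)
        self.blocks = nn.Sequential(*[Block(embed_size, num_head) for _ in range(n_layers)])
        self.ln = nn.LayerNorm(embed_size)
        self.lm_head = nn.Linear(embed_size, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        tok_emb = self.token_embedding(idx)                                   # (B, T, C)
        pos_emb = self.position_embedding(torch.arange(T, device=idx.device)) # (T, C)
        x = tok_emb + pos_emb                                                 # (B, T, C)
        x = self.blocks(x)                                                    # (B, T, C)
        x = self.ln(x)
        logits = self.lm_head(x)                                              # (B, T, vocab_size)

        if targets is None:
            return logits, None
        B, T, C = logits.shape
        loss = F.cross_entropy(logits.view(B*T, C), targets.view(B*T))
        return logits, loss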
Now that we have covered the Bigram implementation, it is time to see how the Block uses Multi-Head Attention (MHA) to create the attention matrices.

We could pass the input (B, T, embedding_size) directly to a single attention block, generate one set of Q, K, V and compute attention weights over the full embedding_size. A faster approach is to split the attention into several smaller heads, compute the attention weights separately in each head, and concatenate the results at the end.
class Head(nn.Module):
    def __init__(self, head_size):
        super().__init__()
        self.head_size = head_size
        self.key = nn.Linear(embed_size, head_size, bias=False)
        self.query = nn.Linear(embed_size, head_size, bias=False)
        self.value = nn.Linear(embed_size, head_size, bias=False)
Following the logic above, we create a single attention head. Now let's make some sense of it.

After adding the positional embedding, we have an input of shape (batch size, token length, embed_dim), where each token is represented by an embed_dim (64) vector. But no token yet has any information about the tokens preceding it.
To create embeddings enriched with such information, we use the attention mechanism, generating Key, Query, and Value vectors.
The attention mechanism in the Head class is designed to help the model focus on different
parts of the input sequence when generating the output, which is particularly useful in
tasks like language modeling.
The key, query, and value projections come from the concept of querying information relevant to each token's context in the sequence. Each token is represented by a vector (x), and by linearly transforming it into separate key, query, and value vectors, we can calculate which tokens in a sequence should attend to each other.

When q (queries) is dotted with k (keys), the result (wei) tells us the "relevance" or "attention" scores between each token and every other token. Higher scores mean a token is more relevant or "important" to another token in that context. The scaling factor 1 / sqrt(head_size) prevents these scores from getting too large, which can make the softmax distribution too sharp and harder to optimize.
class Head(nn.Module):
    def __init__(self, head_size):
        super().__init__()
        self.head_size = head_size
        self.key = nn.Linear(embed_size, head_size, bias=False)
        self.query = nn.Linear(embed_size, head_size, bias=False)
        self.value = nn.Linear(embed_size, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)
The causal mask, tril, ensures that each token can only "see" itself and the tokens before it. This is crucial for autoregressive tasks like text generation, where the model must not look ahead at future tokens when predicting the next one. Setting the masked positions to negative infinity with masked_fill makes them zero after softmax, so they don't contribute to the final attention calculation; this keeps the model from cheating by peeking at the very tokens it is trying to predict. Finally, we take a dot product between the attention weights and the value matrix and return the output.
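Following those steps, a minimal sketch of the forward pass (written as a method of the Head class above) would be:

    # minimal sketch of Head.forward following the steps described above
    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)      # (B, T, head_size)
        q = self.query(x)    # (B, T, head_size)
        v = self.value(x)    # (B, T, head_size)

        # attention scores, scaled by 1 / sqrt(head_size)
        wei = q @ k.transpose(-2, -1) * self.head_size**-0.5   # (B, T, T)
        # causal mask: positions after the current token get -inf
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        wei = self.dropout(wei)
        out = wei @ v        # (B, T, head_size)
        return out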
You can check this example output to get a better sense of the transformations that
happen:
# Output:-
Query:
tensor([[[2, 8, 8],
[4, 2, 4],
[1, 2, 9]]])
Value:
tensor([[[9, 5, 7],
[3, 1, 4],
[6, 2, 9]]])
weights:
tensor([[[65.8179, 26.5581, 57.7350],
[42.7239, 17.3205, 36.9504],
[47.3427, 23.6714, 52.5389]]])
Triangular Matrix:
tensor([[1., 0., 0.],
[1., 1., 0.],
[1., 1., 1.]])
Masked Weights
tensor([[[65.8179, -inf, -inf],
[42.7239, 17.3205, -inf],
[47.3427, 23.6714, 52.5389]]])
Softmax ( e^-inf = 0 )
tensor([[[1.0000, 0.0000, 0.0000],
         [1.0000, 0.0000, 0.0000],
         [0.0055, 0.0000, 0.9945]]])
class MultiHeadAttention(nn.Module):
    def __init__(self, head_size, num_head):
        super().__init__()
        self.sa_head = nn.ModuleList([Head(head_size) for _ in range(num_head)])
        self.dropout = nn.Dropout(dropout)
        self.proj = nn.Linear(embed_size, embed_size)
Here we pass the input x (B, T, E) to the different attention heads, where each head returns a vector of size (B, T, head_size), with head_size = E (64) / num_head (4) = 16. Looping over the heads and concatenating their outputs brings us back to the original size (B, T, 4*16).

Multi-head attention is simply faster and more efficient at scale, where the embedding dimension is much larger.

After the concatenation, we pass the result through a linear projection layer; the idea is to let the embeddings in the final vector further communicate what they learned about each other during the attention computation. The result then goes through a dropout layer and is returned.
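A minimal sketch of the corresponding forward pass, written as a method of MultiHeadAttention and following the description above:

    # minimal sketch of MultiHeadAttention.forward
    def forward(self, x):
        # run each head independently and concatenate along the channel dimension
        out = torch.cat([head(x) for head in self.sa_head], dim=-1)  # (B, T, num_head*head_size)
        out = self.dropout(self.proj(out))
        return out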
class FeedForward(nn.Module):
    def __init__(self, embed_size):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(embed_size, 4*embed_size),
            nn.ReLU(),
            nn.Linear(4*embed_size, embed_size),
            nn.Dropout(dropout)
        )

    def forward(self, x):
        return self.ff(x)
class Block(nn.Module):
    def __init__(self, embed_size, num_head):
        super().__init__()
        head_size = embed_size // num_head
        self.multihead = MultiHeadAttention(head_size, num_head)
        self.ff = FeedForward(embed_size)
        self.ll1 = nn.LayerNorm(embed_size)
        self.ll2 = nn.LayerNorm(embed_size)
The head size is calculated as explained earlier. The input is passed through a layer norm, then our multi-head attention network, then another layer norm, and finally through the feed-forward network.
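A minimal sketch of Block's forward pass, following that order and adding the residual (skip) connections that the standard pre-norm Transformer block uses:

    # minimal sketch of Block.forward: pre-norm attention and feed-forward,
    # each wrapped in a residual (skip) connection
    def forward(self, x):
        x = x + self.multihead(self.ll1(x))
        x = x + self.ff(self.ll2(x))
        return x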
class BigramLanguageModel(nn.Module):
Here, the Block is instantiated inside an nn.Sequential layer, once for each of the n_layers decoder layers (as sketched earlier).
Generating tokens…
We start by passing a one-dimensional idx tensor (the indices of our prompt tokens) to the generate function, along with the maximum number of new tokens we want to generate. Because our model is built for a block size of 8, we can only feed 8 tokens at a time, so we crop idx to its last 8 tokens (all tokens are used if idx is shorter than the block size).

We pass the cropped idx to our BigramLanguageModel. Since the logits hold the probability distribution over the possible target tokens, we are only interested in the last position: the last target token (y) is the next token after the sequence (x), as explained in the batch loader section.

The logits are now of shape (B, C), where C is the vocab size, and represent the distribution over the entire vocabulary for the last position. We apply a softmax to turn this vector into a probability vector (i.e. its elements sum to 1).
Now, remember how at the very beginning of the article we talked about the imperfectly predictable, or stochastic, term and how we let the model randomly select the next token in the sequence? To do so we use torch.multinomial, which samples from a given probability distribution; here it samples the next index randomly according to the probabilities we just computed.

We then take the predicted index, concatenate it with the previous idx, and continue the loop, generating each new index from the previous ones until we reach the maximum number of tokens.
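A minimal sketch of generate, written as a method of BigramLanguageModel and following exactly the steps above (crop to block_size, forward pass, softmax over the last position, multinomial sampling, concatenate):

    # minimal sketch of BigramLanguageModel.generate
    @torch.no_grad()
    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -block_size:]            # crop to the last block_size tokens
            logits, _ = self(idx_cond)                 # (B, T, vocab_size)
            logits = logits[:, -1, :]                  # keep only the last position -> (B, vocab_size)
            probs = F.softmax(logits, dim=-1)          # probability distribution over the vocab
            idx_next = torch.multinomial(probs, num_samples=1)   # sample the next token
            idx = torch.cat((idx, idx_next), dim=1)    # append and continue
        return idx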
Training
Luckily the training part is very straightforward.
m = BigramLanguageModel(65).to(device)   # 65 = vocab size (number of unique characters in input.txt)
m.train()
Here we train for 5000 epochs, which takes roughly two minutes on an Nvidia RTX 3050 with 4 GB of VRAM.

Each step starts with the forward pass: we fetch a batch from get_batch() and pass it to our BigramLanguageModel. We then call optimizer.zero_grad(), loss.backward(), and optimizer.step(). We use the AdamW optimizer, which is more than enough for what we need here.
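Putting that together, a minimal sketch of the training loop; the 5000 steps come from above, while the learning rate and the logging cadence are assumptions.

epochs = 5000
lr = 1e-3                                   # learning rate is an assumption; not quoted in the article
optimizer = torch.optim.AdamW(m.parameters(), lr=lr)

Loss = 0.0
for k in range(epochs):
    xb, yb = get_batch('train')             # forward pass on a fresh batch
    logits, loss = m(xb, yb)

    optimizer.zero_grad(set_to_none=True)   # backward pass and parameter update
    loss.backward()
    optimizer.step()

    Loss += loss.item()
    if (k + 1) % 500 == 0:                  # print a running average now and then (cadence is my choice)
        avg_train_loss = Loss / (k + 1)
        print(f"step {k+1}: avg train loss {avg_train_loss:.4f}")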
The ids tensor has shape (1, 18) (batch size, tokens). We start with just 18 tokens (indices into our vocabulary) and generate 2000 more characters. For context, our vocab is the set of all unique characters in input.txt, which we built earlier in the data loading section, i.e. vocab = sorted(list(set(text))).
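For completeness, a sketch of how such an output can be produced; the actual 18-character prompt is not shown here, so the one below is only a stand-in.

# Hypothetical usage; the prompt is a stand-in of 18 characters.
m.eval()
prompt = "All the world is a"
ids = torch.tensor([encode(prompt)], dtype=torch.long, device=device)   # shape (1, 18)
out = m.generate(ids, max_new_tokens=2000)[0]
print("".join(decode(out.tolist())))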
Output:-
LUCENTIO:
Go what a doubled mistressed well.
Taildoes, not to memble, the peashat you;--are master, in thou comsand of the f
Petruchio?
Fece poor this cockepopen neve so it do old loaps islied I'comment and curh
and blate sure poccient you the miad e'er a to partink,
Unory speitied where buzzarr'd formorns,
Pitedame,
Beach, and whom I firit.
ANDO:
O the virtuus a parros that that is acleast, not for suck could mighreature wel
I'll toence after counteent,
Signior to paptista?
Shile you cappier?
BIANCA:
PROSPERO:
I, as expase caspierfed success,
This all no be trutes from the good the island mognied buzent; tensting in this
Do be marriage.
TRANIO:
'Tis, jointer.
GRUCHIO:
Soubt sI'll show I freek born.
PROSPETRUCHIO:
The vant mine; it it
I know… it doesn't make much sense. But remember that real language models are not trained on a single dataset of Shakespeare's writings; they take a vast amount of GPU power, better tokenization techniques, and much larger corpora.

Given that we trained a very small model, the performance isn't bad: the output still has structure, and the model has learned to produce actual English-like words rather than random gibberish.

You can experiment with the model by trying different datasets, token sizes, batch sizes, numbers of layers, and so on.
Thank You!
The main goal of this blog was to explain in detail how you can build your own language model from scratch and train it on your own dataset… Well, now you know!
With this, I can't thank you enough for giving learning a chance and reading this far; I hope you enjoyed it. The entire code is available in my GitHub repository ML-Models, where I implement various deep learning architectures from scratch.

If you liked this article or found it informative, please consider giving it a clap and dropping me a follow. If you have any doubts, feel free to reach out.
References:
[1] Large language models are called autoregressive, but they are not classical autoregressive models in this sense because they are not linear.