Let's Build Our Own GPT Model From Scratch With PyTorch - by Shubh Mishra - Nov, 2024 - Level Up Coding
Today, we will step away from our Vision Transformer series and discuss building a
basic variant of a Generative Pre-trained Transformer (GPT).
This imperfectly predictable, or stochastic, behaviour [1] can be loosely related to the way the next token is predicted in our “i like to eat” example: we help the model be less deterministic by letting it randomly sample the next token (e.g. <ice-cream>, <cookies>, etc.), which we will get to later in the article.
We will train the model on a Shakespeare corpus so that it learns to generate text in Shakespeare's style. This will be my longest article yet, so take a deep breath and feel free to take breaks as needed. Let's dive right in!
Content
1. Loading Data — Creating Data Batch Loader and Data Split.
Note: The code in this article follows this video on GPT by none other than Andrej Karpathy. His video was, in fact, my very first implementation of the attention mechanism, from which I went on to follow various other architectures and papers on convolutional attention, shifted windows, etc.
If you've already seen it, you won't find much difference here beyond minor code changes, so you can follow this one as a quick revision. If you haven't… let's dive straight into it.
Loading Data
import torch
import torch.nn as nn
from torch.nn import functional as F

# We start by downloading our Shakespeare txt file (stored with the name input.txt)
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
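Before we can batch anything, we need a character-level tokenizer and a train/validation split. Below is a minimal sketch of that step, written to be consistent with the output that follows; the 90/10 split ratio, the dropout value, and the number of layers are assumptions, while block_size (8), batch_size (32), embed_size (64) and num_head (4) follow the values used later in the article.

text = open('input.txt', 'r', encoding='utf-8').read()

# hyperparameters used throughout
block_size = 8        # tokens per training sequence
batch_size = 32       # sequences per batch
embed_size = 64       # embedding dimension
num_head = 4          # attention heads per block
n_layers = 4          # number of decoder blocks (assumption)
dropout = 0.1         # dropout probability (assumption)
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# character-level vocabulary and tokenizer
vocab = sorted(list(set(text)))
stoi = {ch: i for i, ch in enumerate(vocab)}
itos = {i: ch for i, ch in enumerate(vocab)}
encode = lambda s: [stoi[c] for c in s]       # string -> list of token ids
decode = lambda ids: [itos[i] for i in ids]   # list of token ids -> list of characters

# 90/10 train / validation split of the encoded corpus (ratio is an assumption)
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))
train, val = data[:n], data[n:]

# quick sanity check of the tokenizer
ids = encode("I like to eat")
txt = decode(ids)
print("ids:", ids)
print("txt:", txt)
print("".join(txt))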
# Output:-
ids: [21, 1, 50, 47, 49, 43, 1, 58, 53, 1, 43, 39, 58]
txt: ['I', ' ', 'l', 'i', 'k', 'e', ' ', 't', 'o', ' ', 'e', 'a', 't']
I like to eat
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train if split == 'train' else val
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x.to(device), y.to(device)
xb, yb = get_batch('train')
We want to draw random sequences of block_size (8) tokens from the corpus, so we generate batch_size (32) random starting indexes (ix). For each index we take the next 8 token ids and stack them into a batch. The target y is offset by one position from x (i+1 to i+block_size+1), because at every position we want to predict the next token in the sequence.
Example:-

ix = [33]

for i in ix:
    print(train[i:i+18])
    print(train[i+3:i+18+3])  # I've chosen +3 over +1 only for the sake of example

for i in ix:
    print("".join(decode(train[i:i+18])).replace("\n", ""))
    print("".join(decode(train[i+3:i+18+3])).replace("\n", ""))
# Output:-
tensor([39, 52, 63, 1, 44, 59, 56, 58, 46, 43, 56, 6, 1, 46, 43, 39, 56, 1])
tensor([ 1, 44, 59, 56, 58, 46, 43, 56, 6, 1, 46, 43, 39, 56, 1, 51, 43, 1])
any further, hear
further, hear me
BigramLanguageModel
A bigram language model predicts each token from just the one before it. As a quick refresher on n-grams:

2-gram (Bigram) for “i like to eat”: ["i like", "like to", "to eat"]
3-gram (Trigram) for “i like to eat”: ["i like to", "like to eat"]
Now let's get to the heart of the article: Multi-Head Attention. As I've already implemented this part in over a dozen articles in the Vision Transformer series, I'll try not to waste your time and get straight to the concept. (Man… I could easily have stretched this blog into two parts, but screw it, let's do it anyway.)
GPT borrows from the Transformer architecture proposed in the Attention Is All You Need paper. However, it differs by stacking only the (masked) Multi-Head Attention blocks from the decoder section.
class BigramLanguageModel(nn.Module):

    def forward(self, idx, targets=None):
        ...
        # flatten the (B, T) grid of predictions so that F.cross_entropy
        # can compare every position against its target token
        B, T, C = logits.shape
        logits = logits.view(B*T, C)
        targets = targets.view(B*T)
        loss = F.cross_entropy(logits, targets)
Here the input idx is the batch we generated earlier, of shape (B, T), where T is the block size (token length).
The forward pass first generates an embedding for each token, giving a tensor of shape (B, T, C). As the figure above shows, we then add a positional embedding to the token embedding. The token embedding represents the information each token holds in a fixed embedding dimension, but it carries no information about where the token sits in the sequence; that is why we additionally add positional embeddings, so the model has context for each token's position. If you have any doubts, I'd recommend referring to my earlier blog, where I explain all of this in more detail.
nn.Embedding in PyTorch is a layer that maps discrete, categorical values (like word indices) to continuous, dense vectors. The layer takes integer indices as input, where each index represents a unique categorical item (e.g. a word, a token, or some other categorical value). Internally, nn.Embedding maintains an embedding matrix of shape (num_embeddings, embedding_dim), so it can produce a dense representation of each token. Since we are building a simplistic version of GPT, we use nn.Embedding directly to generate the positional embeddings as well, rather than using other standard approaches.
From here, it's straightforward: we have our Block (a stack of decoder modules), and its output is a tensor of the same shape as the input, but in which each token now carries information about all the tokens preceding it.
Finally, we apply Layer Norm (a common practice to stabilize training) and then a linear layer that maps the embedding dimension C to our vocab dimension. The vocab dimension is simply the number of unique characters in input.txt. A general way to measure how accurate our predictions are is to compare the output of the block stack with the target indices.
The output logits are meant to represent a probability distribution over the vocab size V, predicting the next (target) token at each position in the sequence. Cross-entropy loss is therefore used to measure how close our output is to the target token sequence.
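Putting that description together, a minimal sketch of the full model might look like this; the Block class is defined further down in the article, and the exact attribute names here are assumptions.

class BigramLanguageModel(nn.Module):
    # minimal sketch assembled from the description above
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, embed_size)
        self.position_embedding = nn.Embedding(block_size, embed_size)
        self.blocks = nn.Sequential(*[Block(embed_size, num_head) for _ in range(n_layers)])
        self.ln = nn.LayerNorm(embed_size)
        self.lm_head = nn.Linear(embed_size, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        tok_emb = self.token_embedding(idx)                                   # (B, T, C)
        pos_emb = self.position_embedding(torch.arange(T, device=idx.device)) # (T, C)
        x = tok_emb + pos_emb                                                 # (B, T, C)
        x = self.blocks(x)                                                    # (B, T, C)
        x = self.ln(x)
        logits = self.lm_head(x)                                              # (B, T, vocab_size)

        if targets is None:
            return logits, None
        B, T, C = logits.shape
        loss = F.cross_entropy(logits.view(B*T, C), targets.view(B*T))
        return logits, loss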
Now that we have covered the Bigram implementation, it is time to see how the Block uses Multi-Head Attention (MHA) to create the attention matrices.

We could pass the input (B, T, embedding_size) directly to a single attention block, generate one set of Q, K, V and compute attention weights over the full embedding_size. A faster approach is to split the attention into several smaller heads, compute the attention weights separately in each head, and concatenate the results at the end.
class Head(nn.Module):
    def __init__(self, head_size):
        super().__init__()
        self.head_size = head_size
        self.key = nn.Linear(embed_size, head_size, bias=False)
        self.query = nn.Linear(embed_size, head_size, bias=False)
        self.value = nn.Linear(embed_size, head_size, bias=False)
Following the logic above, we create a single attention head. Now let's make some sense of it.

After adding the positional embedding, we have an input of shape (batch size, token length, embed_dim), where each token is represented by an embed_dim (64) vector. But no token yet has any information about the tokens preceding it.
To create embeddings enriched with such information, we use the attention mechanism, generating Key, Query, and Value vectors.
The attention mechanism in the Head class is designed to help the model focus on different
parts of the input sequence when generating the output, which is particularly useful in
tasks like language modeling.
The key, query, and value projections come from the concept of querying information relevant to each token's context in the sequence. Each token is represented by a vector (x), and by linearly transforming it into separate key, query, and value vectors, we can calculate which tokens in a sequence should attend to each other.

When q (queries) is dotted with k (keys), the result (wei) tells us the "relevance" or "attention" scores between each token and every other token. Higher scores mean a token is more relevant or "important" to another token in that context. The scaling factor 1 / sqrt(head_size) prevents these scores from getting too large, which can make the softmax distribution too sharp and harder to optimize.
class Head(nn.Module):
    def __init__(self, head_size):
        super().__init__()
        self.head_size = head_size
        self.key = nn.Linear(embed_size, head_size, bias=False)
        self.query = nn.Linear(embed_size, head_size, bias=False)
        self.value = nn.Linear(embed_size, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)
The causal mask, tril, ensures that each token can only "see" itself and the tokens before it. This is crucial for autoregressive tasks like text generation, where the model must not look ahead at future tokens when predicting the next one. Setting the masked positions to negative infinity with masked_fill makes them zero after softmax, so they don't contribute to the final attention calculation; this keeps the model from cheating by peeking at the very tokens it is trying to predict. Finally, we take a dot product between the attention weights and the value matrix and return the output.
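Following those steps, a minimal sketch of the forward pass (written as a method of the Head class above) would be:

    # minimal sketch of Head.forward following the steps described above
    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)      # (B, T, head_size)
        q = self.query(x)    # (B, T, head_size)
        v = self.value(x)    # (B, T, head_size)

        # attention scores, scaled by 1 / sqrt(head_size)
        wei = q @ k.transpose(-2, -1) * self.head_size**-0.5   # (B, T, T)
        # causal mask: positions after the current token get -inf
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        wei = self.dropout(wei)
        out = wei @ v        # (B, T, head_size)
        return out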
You can check this example output to get a better sense of the transformations that
happen:
# Output:-
Query:
tensor([[[2, 8, 8],
[4, 2, 4],
[1, 2, 9]]])
Value:
tensor([[[9, 5, 7],
[3, 1, 4],
[6, 2, 9]]])
weights:
tensor([[[65.8179, 26.5581, 57.7350],
[42.7239, 17.3205, 36.9504],
[47.3427, 23.6714, 52.5389]]])
Triangular Matrix:
tensor([[1., 0., 0.],
[1., 1., 0.],
[1., 1., 1.]])
Masked Weights
tensor([[[65.8179, -inf, -inf],
[42.7239, 17.3205, -inf],
[47.3427, 23.6714, 52.5389]]])
Softmax ( e^-inf = 0 )
tensor([[[1.0000, 0.0000, 0.0000],
         [1.0000, 0.0000, 0.0000],
         [0.0055, 0.0000, 0.9945]]])
class MultiHeadAttention(nn.Module):
    def __init__(self, head_size, num_head):
        super().__init__()
        self.sa_head = nn.ModuleList([Head(head_size) for _ in range(num_head)])
        self.dropout = nn.Dropout(dropout)
        self.proj = nn.Linear(embed_size, embed_size)
Here we pass the input x (B, T, E) to the different attention heads, where each head returns a vector of size (B, T, head_size), with head_size = E (64) / num_head (4) = 16. Looping over the heads and concatenating their outputs brings us back to the original size (B, T, 4*16).

Multi-head attention is simply faster and more efficient at scale, where the embedding dimension is much larger.

After the concatenation, we pass the result through a linear projection layer; the idea is to let the embeddings in the final vector further communicate what they learned about each other during the attention computation. The result then goes through a dropout layer and is returned.
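A minimal sketch of the corresponding forward pass, written as a method of MultiHeadAttention and following the description above:

    # minimal sketch of MultiHeadAttention.forward
    def forward(self, x):
        # run each head independently and concatenate along the channel dimension
        out = torch.cat([head(x) for head in self.sa_head], dim=-1)  # (B, T, num_head*head_size)
        out = self.dropout(self.proj(out))
        return out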
class FeedForward(nn.Module):
    def __init__(self, embed_size):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(embed_size, 4*embed_size),
            nn.ReLU(),
            nn.Linear(4*embed_size, embed_size),
            nn.Dropout(dropout)
        )

    def forward(self, x):
        return self.ff(x)
class Block(nn.Module):
    def __init__(self, embed_size, num_head):
        super().__init__()
        head_size = embed_size // num_head
        self.multihead = MultiHeadAttention(head_size, num_head)
        self.ff = FeedForward(embed_size)
        self.ll1 = nn.LayerNorm(embed_size)
        self.ll2 = nn.LayerNorm(embed_size)
The head size is calculated as explained earlier. The input is passed through a layer norm, then our multi-head attention network, then another layer norm, and finally through the feed-forward network.
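A minimal sketch of Block's forward pass, following that order and adding the residual (skip) connections that the standard pre-norm Transformer block uses:

    # minimal sketch of Block.forward: pre-norm attention and feed-forward,
    # each wrapped in a residual (skip) connection
    def forward(self, x):
        x = x + self.multihead(self.ll1(x))
        x = x + self.ff(self.ll2(x))
        return x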
class BigramLanguageModel(nn.Module):
Here, the Block is instantiated inside an nn.Sequential layer, once for each of the n_layers decoder layers (as sketched earlier).
Generating tokens…
We start by passing a one-dimensional idx tensor (the indices of our prompt tokens) to the generate function, along with the maximum number of new tokens we want to generate. Because our model is built for a block size of 8, we can only feed 8 tokens at a time, so we crop idx to its last 8 tokens (all tokens are used if idx is shorter than the block size).

We pass the cropped idx to our BigramLanguageModel. Since the logits hold the probability distribution over the possible target tokens, we are only interested in the last position: the last target token (y) is the next token after the sequence (x), as explained in the batch loader section.

The logits are now of shape (B, C), where C is the vocab size, and represent the distribution over the entire vocabulary for the last position. We apply a softmax to turn this vector into a probability vector (i.e. its elements sum to 1).
Now, remember how at the very beginning of the article we talked about the imperfectly predictable, or stochastic, term and how we let the model randomly select the next token in the sequence? To do so we use torch.multinomial, which samples from a given probability distribution; here it samples the next index randomly according to the probabilities we just computed.

We then take the predicted index, concatenate it with the previous idx, and continue the loop, generating each new index from the previous ones until we reach the maximum number of tokens.
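A minimal sketch of generate, written as a method of BigramLanguageModel and following exactly the steps above (crop to block_size, forward pass, softmax over the last position, multinomial sampling, concatenate):

    # minimal sketch of BigramLanguageModel.generate
    @torch.no_grad()
    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -block_size:]            # crop to the last block_size tokens
            logits, _ = self(idx_cond)                 # (B, T, vocab_size)
            logits = logits[:, -1, :]                  # keep only the last position -> (B, vocab_size)
            probs = F.softmax(logits, dim=-1)          # probability distribution over the vocab
            idx_next = torch.multinomial(probs, num_samples=1)   # sample the next token
            idx = torch.cat((idx, idx_next), dim=1)    # append and continue
        return idx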
Training
Luckily the training part is very straightforward.
m = BigramLanguageModel(65).to(device)   # 65 = vocab size (number of unique characters in input.txt)
m.train()
Here we train for 5000 epochs, which takes roughly two minutes on an Nvidia RTX 3050 with 4 GB of VRAM.

Each step starts with the forward pass: we fetch a batch from get_batch() and pass it to our BigramLanguageModel. We then call optimizer.zero_grad(), loss.backward(), and optimizer.step(). We use the AdamW optimizer, which is more than enough for what we need here.
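Putting that together, a minimal sketch of the training loop; the 5000 steps come from above, while the learning rate and the logging cadence are assumptions.

epochs = 5000
lr = 1e-3                                   # learning rate is an assumption; not quoted in the article
optimizer = torch.optim.AdamW(m.parameters(), lr=lr)

Loss = 0.0
for k in range(epochs):
    xb, yb = get_batch('train')             # forward pass on a fresh batch
    logits, loss = m(xb, yb)

    optimizer.zero_grad(set_to_none=True)   # backward pass and parameter update
    loss.backward()
    optimizer.step()

    Loss += loss.item()
    if (k + 1) % 500 == 0:                  # print a running average now and then (cadence is my choice)
        avg_train_loss = Loss / (k + 1)
        print(f"step {k+1}: avg train loss {avg_train_loss:.4f}")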
The ids tensor has shape (1, 18) (batch size, tokens). We start with just 18 tokens (indices into our vocabulary) and generate 2000 more characters. For context, our vocab is the set of all unique characters in input.txt, which we built earlier in the data loading section, i.e. vocab = sorted(list(set(text))).
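For completeness, a sketch of how such an output can be produced; the actual 18-character prompt is not shown here, so the one below is only a stand-in.

# Hypothetical usage; the prompt is a stand-in of 18 characters.
m.eval()
prompt = "All the world is a"
ids = torch.tensor([encode(prompt)], dtype=torch.long, device=device)   # shape (1, 18)
out = m.generate(ids, max_new_tokens=2000)[0]
print("".join(decode(out.tolist())))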
Output:-
LUCENTIO:
Go what a doubled mistressed well.
Taildoes, not to memble, the peashat you;--are master, in thou comsand of the f
Petruchio?
Fece poor this cockepopen neve so it do old loaps islied I'comment and curh
and blate sure poccient you the miad e'er a to partink,
Unory speitied where buzzarr'd formorns,
Pitedame,
Beach, and whom I firit.
ANDO:
O the virtuus a parros that that is acleast, not for suck could mighreature wel
I'll toence after counteent,
Signior to paptista?
Shile you cappier?
BIANCA:
PROSPERO:
I, as expase caspierfed success,
This all no be trutes from the good the island mognied buzent; tensting in this
Do be marriage.
TRANIO:
'Tis, jointer.
GRUCHIO:
Soubt sI'll show I freek born.
PROSPETRUCHIO:
The vant mine; it it
I know… it doesn't make much sense. But remember that real language models are not trained on a single dataset of Shakespeare's writings; they take a vast amount of GPU power, better tokenization techniques, and much larger corpora.

Given that we trained a very small model, the performance isn't bad: the output still has structure, and the model has learned to produce actual English-like words rather than random gibberish.

You can experiment with the model by trying different datasets, token sizes, batch sizes, numbers of layers, and so on.
Thank You!
The main goal of this blog was to explain in detail how you can build your own language model from scratch and train it on your own dataset… Well, now you know!
With this, I can't thank you enough for giving learning a chance and reading this far; I hope you enjoyed it. The entire code is available in my GitHub repository ML-Models, where I implement various deep learning architectures from scratch.

If you liked this article or found it informative, please consider giving it a clap and dropping me a follow. If you have any doubts, feel free to reach out.
References:
[1] Large language models are called autoregressive, but they are not classical autoregressive models in this sense because they are not linear.