
Decoder-Only Transformer (LLM) For Question Asking "FROM SCRATCH"

Notebook Structure

Data
    Data source
    Tokenization
    Features and Target
    Test data
Model Design
    Positional encoding
    Multi-head attention
    Transformer Decoder
    Final Architecture
Training script
Simplistic Inference Script
Issues and mistakes
    Pre-training with a downstream task
    Not masking padding tokens
    Context window

In [1]: # necessary imports
import numpy as np
import pandas as pd
import torch
import matplotlib.pyplot as plt
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel
import random

In [2]: if torch.cuda.is_available():
    device = torch.device("cuda")
    print("GPU is available")
else:
    device = torch.device("cpu")
    print("GPU is not available, using CPU")

GPU is available

DATA

Data Source
The data I used for this project is the Stanford Question Answering Dataset (SQuAD).
SQuAD was prepared such that a question and a context map to an answer (Q + C --> A).
I modified the data so that a context maps to a question (C --> Q).
Find my modified data here = link to dataset
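For illustration, a minimal sketch of the kind of reshaping involved. The SQuAD field names ('context', 'question') and the loading step are assumptions; only the output 'conversation' format mirrors the example record shown under Tokenization below.

# Hypothetical sketch: flip SQuAD-style (question + context -> answer) records
# into (context -> question) conversation pairs.
def squad_to_context2question(squad_examples):
    conversations = []
    for ex in squad_examples:  # assumed schema: {'context': ..., 'question': ..., ...}
        conversations.append([
            {'from': 'human', 'value': ex['context']},   # model input: the context paragraph
            {'from': 'gpt',   'value': ex['question']},  # model target: the question
        ])
    return {'conversation': conversations}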

In [3]: data = pd.read_json('{fill with path to your data}').to_dict(orient='list')

Tokenization
In [ ]: #bert tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

In [5]: # Example
data['conversation'][0]

Out[5]: [{'from': 'human',
  'value': 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".'},
 {'from': 'gpt', 'value': 'When did Beyonce start becoming popular?'}]

In [6]: def tokenize_input(qa):
    # 1. Tokenize with a max sequence length of 300 and zero-padding.
    # 2. Add an <sos> and <eos> token to the target values; in this case BERT's [CLS] and [SEP],
    #    which the tokenizer adds by default around the question.
    seq_length = 300
    q_tokens = tokenizer(qa[0]['value'], add_special_tokens=False)['input_ids']  # context
    a_tokens = tokenizer(qa[1]['value'], padding=True)['input_ids']              # [CLS] question [SEP]

    # Next-token shift: the input drops the last target token,
    # the target drops the first context token.
    x_tokens = q_tokens + a_tokens[:-1]
    y_tokens = q_tokens[1:] + a_tokens

    x_pad = [0 for _ in range(seq_length - len(x_tokens))]
    y_pad = [0 for _ in range(seq_length - len(y_tokens))]
    final_x = x_tokens + x_pad
    final_y = y_tokens + y_pad

    return final_x, final_y
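To make the shift concrete, a quick check (assuming the tokenizer and tokenize_input above; the toy sentences are placeholders):

# Toy example: at every position t, final_y[t] is the token that should follow final_x[t].
example = [{'from': 'human', 'value': 'Houston is a city in Texas.'},
           {'from': 'gpt',   'value': 'Where is Houston?'}]
x, y = tokenize_input(example)
# Roughly: x = [ctx_1 ... ctx_n, [CLS], where, is, houston, ?]          + padding
#          y = [ctx_2 ... ctx_n, [CLS], where, is, houston, ?, [SEP]]   + padding
print(len(x), len(y))            # 300 300
print(tokenizer.decode(x[:8]))
print(tokenizer.decode(y[:8]))   # the same tokens shifted left by one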

In [7]: # tokenizing all data
tokens = []
targets = []
for i in random.sample(data['conversation'], len(data['conversation'])):
    try:
        x, y = tokenize_input(i)
        if len(x) == 300:   # keep only samples that fit in the 300-token window
            tokens.append(x)
            targets.append(y)
    except:
        pass                # skip samples that fail to tokenize

X = torch.IntTensor(tokens)
Y = torch.LongTensor(targets)

Token indices sequence length is longer than the specified maximum sequence length for this model (718 > 512). Running this sequence through the model will result in indexing errors
In [8]: X.shape, Y.shape

Out[8]: (torch.Size([124975, 300]), torch.Size([124975, 300]))

Test Data
Create your test data here
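The notebook leaves this as a placeholder; a minimal hold-out sketch (the 90/10 split ratio is an arbitrary assumption):

# Hypothetical hold-out split of the tokenized tensors; the samples were already
# shuffled by random.sample above, so a tail slice is a reasonable test set.
n_test = int(0.1 * len(X))
X_train, Y_train = X[:-n_test], Y[:-n_test]
X_test,  Y_test  = X[-n_test:], Y[-n_test:]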

Model Design

Important Notes
Embedding layer: I used the embedding layer from the BERT model (bert-base-uncased)
Positional encoding: Sinusoidal encoding from "Attention Is All You Need"
Attention: Multi-head (4 heads)
Linear projection: The 768-dim input is projected to 224 before passing it through the decoders
Number of decoders: 8
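For reference, the choices above gathered into one place (values are only those stated in this notebook):

# Hyperparameter summary (reference only, not consumed by the code below).
config = {
    'embedding':        'bert-base-uncased hidden states (768-dim)',
    'positional':       'sinusoidal, as in "Attention Is All You Need"',
    'attention_heads':  4,
    'projected_dmodel': 224,   # linear projection 768 -> 224 before the decoder stack
    'num_decoders':     8,
    'max_seq_length':   300,
}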

Embedding layer
In [9]: class embed(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Pre-trained BERT; only its output hidden states are used as embeddings.
        self.embedder = AutoModel.from_pretrained('bert-base-uncased')

    def forward(self, x_tokens):
        inputs = {'input_ids': x_tokens}
        with torch.no_grad():
            # Ignore padding (token id 0) inside BERT and zero out those positions afterwards.
            attention_mask = (inputs['input_ids'] != 0).int()
            outputs = self.embedder(**inputs, attention_mask=attention_mask)
            embeddings = outputs.last_hidden_state * attention_mask.unsqueeze(-1)
        return embeddings
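A quick shape check of the embedder (a sketch; the two-sample slice is arbitrary and X comes from the Data section):

# BERT-base returns 768-dim vectors; padded positions are zeroed by the mask above.
emb = embed().to(device)
sample = X[:2].long().to(device)
with torch.no_grad():
    out = emb(sample)
print(out.shape)   # torch.Size([2, 300, 768])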
Positional Encoding
In [10]: ## sinusoidal positional encoding (Vaswani et al., 2017)
import math

class pos_enc(torch.nn.Module):
    def __init__(self) -> None:
        super().__init__()

    def forward(self, x):
        batch_size, max_seq_length, dmodel = x.shape
        pe = torch.zeros_like(x)  # positional encoding matrix

        # Compute the positional encoding values:
        # PE(pos, 2i)   = sin(pos / 10000^(2i/dmodel))
        # PE(pos, 2i+1) = cos(pos / 10000^(2i/dmodel))
        for pos in range(max_seq_length):
            for i in range(0, dmodel):
                angle = pos / (10000 ** ((2 * (i // 2)) / dmodel))
                if i % 2 == 0:
                    pe[:, pos, i] = math.sin(angle)
                else:
                    pe[:, pos, i] = math.cos(angle)

        x = x + pe
        return x
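The double loop above works but is slow for long sequences; a vectorized sketch that produces the same sinusoidal table (assuming an even dmodel, as used here):

# Vectorized sinusoidal positional encoding (equivalent table, no Python loops).
import math

def sinusoidal_pe(max_seq_length: int, dmodel: int) -> torch.Tensor:
    position = torch.arange(max_seq_length).unsqueeze(1)                              # (seq, 1)
    div_term = torch.exp(torch.arange(0, dmodel, 2) * (-math.log(10000.0) / dmodel))  # (dmodel/2,)
    pe = torch.zeros(max_seq_length, dmodel)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
    return pe   # add to the input as x + pe.unsqueeze(0)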

Self-attention mechanism
In [11]: class self_attention(torch.nn.Module):
    def __init__(self, no_of_heads: int, shape: tuple, mask: bool = False, QKV: list = []):
        '''
        Initializes a Self Attention module as described in the "Attention Is All You Need" paper.
        This module splits the input into multiple heads to allow the model to jointly attend
        to information from different representation subspaces at different positions. After
        attention is computed on each head, the module concatenates and linearly transforms the results.

        ## Parameters:
        * no_of_heads (int): Number of attention heads. To implement single-head attention, set this to 1.
        * shape (tuple): A tuple (seq_length, dmodel) where `seq_length` is the length of the input
          sequence and `dmodel` is the dimensionality of the input feature space.
        * mask (bool, optional): If True, a causal mask is applied to prevent attention to future positions.
        * QKV (list, optional): A list containing pre-computed Query (Q), Key (K), and Value (V) tensors.

        The forward pass computes the multi-head attention for input `x` and returns the result
        with a residual connection.
        '''
        super().__init__()
        self.h = no_of_heads
        self.seq_length, self.dmodel = shape
        self.dk = self.dmodel // self.h
        self.softmax = torch.nn.Softmax(dim=-1)
        # One (dmodel -> dk) projection per head for queries, keys and values
        self.mQW = torch.nn.ModuleList([torch.nn.Linear(self.dmodel, self.dk) for _ in range(self.h)])
        self.mKW = torch.nn.ModuleList([torch.nn.Linear(self.dmodel, self.dk) for _ in range(self.h)])
        self.mVW = torch.nn.ModuleList([torch.nn.Linear(self.dmodel, self.dk) for _ in range(self.h)])
        self.output_linear = torch.nn.Linear(self.dmodel, self.dmodel)
        self.mask = mask
        self.QKV = QKV

    def __add_mask(self, atten_values):
        # Causal mask: large negative values above the diagonal so the softmax
        # assigns ~0 probability to future positions.
        mask_value = -1e9
        mask = torch.triu(torch.ones(atten_values.shape) * mask_value, diagonal=1)
        masked = atten_values + mask.to(device)
        return masked

    def forward(self, x):
        heads = []
        for i in range(self.h):
            # Apply linear projections: dmodel => h x d_k
            if self.QKV:
                q = self.mQW[i](self.QKV[0])
                k = self.mKW[i](self.QKV[1])
                v = self.mVW[i](self.QKV[2])
            else:
                q = self.mQW[i](x)
                k = self.mKW[i](x)
                v = self.mVW[i](x)

            # Scaled dot-product attention using the projected vectors q, k, and v
            self.scores = torch.matmul(q, k.transpose(-1, -2)) / torch.sqrt(torch.tensor(self.dk, dtype=torch.float32))
            if self.mask:
                self.scores = self.__add_mask(self.scores)

            attn = self.softmax(self.scores)
            head_i = torch.matmul(attn, v)
            heads.append(head_i)

        # Concatenate all the heads together
        multi_head = torch.cat(heads, dim=-1)
        # Final linear layer
        output = self.output_linear(multi_head)

        return output + x  # Residual connection
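A quick usage check of the module above (random input; the shape matches the decoder configuration used later in this notebook):

# Sanity check: the output keeps the input shape; mask=True makes the attention causal.
attn = self_attention(no_of_heads=4, shape=(300, 224), mask=True).to(device)
dummy = torch.randn(2, 300, 224, device=device)
print(attn(dummy).shape)   # torch.Size([2, 300, 224])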

Decoder
In [12]: class decoder_layer(torch.nn.Module):
    def __init__(self, shape: tuple, no_of_heads: int = 1):
        '''
        Implementation of a Transformer decoder layer.
        Parameters:
            shape (tuple): The shape (seq_length, dmodel) of the input tensor
            no_of_heads (int): number of heads in the attention mechanism. Set this to 1 for single-head attention.
        Returns:
            Tensor: The output of the decoder layer after applying attention, feed-forward and layer normalization.
        '''
        super().__init__()

        self.max_seq_length, self.dmodel = shape

        def ff_weights():
            # Position-wise feed-forward network: dmodel -> 600 -> 600 -> dmodel
            layer1 = torch.nn.Linear(self.dmodel, 600)
            layer2 = torch.nn.Linear(600, 600)
            layer3 = torch.nn.Linear(600, self.dmodel)
            return layer1, layer2, layer3

        self.no_of_heads = no_of_heads
        self.multi_head = self_attention(no_of_heads=no_of_heads, mask=True,
                                         shape=(self.max_seq_length, self.dmodel))
        self.layer1, self.layer2, self.layer3 = ff_weights()
        self.softmax = torch.nn.Softmax(dim=-1)
        self.layerNorm = torch.nn.LayerNorm(shape)  # the same LayerNorm module is reused after both sub-layers
        self.relu1 = torch.nn.ReLU()
        self.relu2 = torch.nn.ReLU()

    def feed_forward(self, x):
        f = self.layer1(x)
        f = self.relu1(f)
        f = self.layer2(f)
        f = self.relu2(f)
        f = self.layer3(f)
        return self.layerNorm(f + x)  # residual connection

    def forward(self, x):
        x = self.multi_head(x)   # masked multi-head self-attention (residual added inside)
        x = self.layerNorm(x)
        x = self.feed_forward(x)
        x = self.layerNorm(x)
        return x

Full Model Architecture


In [13]: class architecture(torch.nn.Module):
    def __init__(self, n_classes, shape) -> None:
        super().__init__()
        self.max_seq_length, self.dmodel = shape
        self.projected_dmodel = 224
        self.embedding_layer = embed()
        self.proj_to_224 = torch.nn.Linear(self.dmodel, self.projected_dmodel)
        self.positional = pos_enc()
        self.decoder1 = decoder_layer(shape=(self.max_seq_length, self.projected_dmodel), no_of_heads=4)
        self.decoder2 = decoder_layer(shape=(self.max_seq_length, self.projected_dmodel), no_of_heads=4)
        self.decoder3 = decoder_layer(shape=(self.max_seq_length, self.projected_dmodel), no_of_heads=4)
        self.decoder4 = decoder_layer(shape=(self.max_seq_length, self.projected_dmodel), no_of_heads=4)
        self.decoder5 = decoder_layer(shape=(self.max_seq_length, self.projected_dmodel), no_of_heads=4)
        self.decoder6 = decoder_layer(shape=(self.max_seq_length, self.projected_dmodel), no_of_heads=4)
        self.decoder7 = decoder_layer(shape=(self.max_seq_length, self.projected_dmodel), no_of_heads=4)
        self.decoder8 = decoder_layer(shape=(self.max_seq_length, self.projected_dmodel), no_of_heads=4)
        self.final_MLP = torch.nn.Linear(self.projected_dmodel, n_classes)
        self.softmax = torch.nn.Softmax(dim=2)

    def forward(self, x, temperature=1.0):
        x = self.embedding_layer(x)   # (batch, 300, 768) BERT embeddings
        x = self.proj_to_224(x)       # project 768 -> 224
        x = self.positional(x)        # add sinusoidal positional encoding
        x = self.decoder1(x)
        x = self.decoder2(x)
        x = self.decoder3(x)
        x = self.decoder4(x)
        x = self.decoder5(x)
        x = self.decoder6(x)
        x = self.decoder7(x)
        x = self.decoder8(x)
        x = self.final_MLP(x)         # per-position logits over the vocabulary
        logits = x / temperature
        x = self.softmax(logits)      # note: CrossEntropyLoss in the training script applies log-softmax again
        return x
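A rough sanity check of the assembled model (a sketch; it downloads bert-base-uncased, and the two-sample batch is arbitrary):

# Output is a per-position probability distribution over BERT's vocabulary (30522 tokens).
m = architecture(n_classes=tokenizer.vocab_size, shape=(300, 768)).to(device)
n_params = sum(p.numel() for p in m.parameters())   # includes the (effectively frozen) BERT embedder
print(f'parameters: {n_params:,}')
with torch.no_grad():
    probs = m(X[:2].long().to(device))
print(probs.shape)   # torch.Size([2, 300, 30522])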

Training Script
In [ ]: !pip install torchmetrics

In [14]: from torchmetrics import Accuracy
from tqdm import tqdm

In [15]: dataset = torch.utils.data.TensorDataset(X, Y)
loader = torch.utils.data.DataLoader(dataset, batch_size=20, num_workers=0, shuffle=False)

In [ ]: vocab_size = tokenizer.vocab_size
model = architecture(n_classes = vocab_size, shape = (300,768))
model = model.to(device)
model.load_state_dict(torch.load("{fill with path to model weights if any}"))

In [76]: metric = Accuracy(num_classes=vocab_size, task='multiclass').to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)
criterion = torch.nn.CrossEntropyLoss(ignore_index=0, label_smoothing=0.01)

In [ ]: from tqdm import tqdm

print('Training Started')
NUM_EPOCHS = 1

for epoch in range(NUM_EPOCHS):
    model.train()  # Set the model to training mode
    running_loss = 0.0
    epoch_accuracy = 0.0
    num_batches = len(loader)

    # Initialize tqdm progress bar
    with tqdm(total=num_batches, desc=f"Epoch {epoch + 1}", leave=True) as pbar:
        for i, (x_batch, y_batch) in enumerate(loader):
            x_batch, y_batch = x_batch.to(device), y_batch.to(device)

            # Zero the parameter gradients
            optimizer.zero_grad()

            # Forward pass
            outputs = model(x_batch)
            # Flatten to (batch * seq_length, vocab) and (batch * seq_length,)
            outputs = outputs.view(-1, outputs.shape[-1])
            y_batch = y_batch.view(-1)

            # Loss calculation (padding positions are ignored via ignore_index=0)
            loss = criterion(outputs, y_batch)

            # Backward pass and optimize
            loss.backward()
            optimizer.step()

            # Metrics
            argmax_pred = outputs.argmax(axis=1)
            metric.update(argmax_pred, y_batch)

            # Print statistics
            running_loss += loss.item()
            if i % 10 == 9:  # update every 10 mini-batches
                accuracy = metric.compute().item()
                epoch_accuracy += accuracy
                pbar.set_postfix({'Loss': running_loss / (i + 1), 'Accuracy': accuracy})

            # Update the progress bar
            pbar.update(1)

            # Save model weights periodically
            if i % 10 == 9:
                torch.save(model.state_dict(), '/kaggle/working/model_weights.pth')

    # Compute and print average loss and accuracy for the epoch
    avg_loss = running_loss / num_batches
    avg_accuracy = epoch_accuracy / (num_batches // 10)  # accuracy is only accumulated once every 10 batches
    print(f'Epoch {epoch + 1} - Loss: {avg_loss:.4f}, Accuracy: {avg_accuracy:.4f}')

print('Training Completed')

Training Started
Epoch 1:  66%|██████    | 2770/4166 [1:58:07<1:00:08, 2.59s/it, Loss=9.89, Accuracy=0.234]

In [ ]: # Link to download model
from IPython.display import FileLink
FileLink(r'model_weights.pth')

Simplistic Inference Script


In [17]: text = '''Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) ...'''  # the same context paragraph shown in the tokenization example above (truncated here)

In [6]: def model_pred(tokens, temp):
    model.eval()
    with torch.no_grad():
        pred = model(tokens, temp)
        pred = pred.view(-1, pred.shape[-1]).argmax(axis=1)  # greedy: most likely token per position
    return pred

def tokenize_text(text):
    seq_length = 300
    q_tokens = tokenizer(text, add_special_tokens=False)['input_ids']
    pad = [0 for _ in range(seq_length - len(q_tokens))]
    final_tokens = [q_tokens + pad]
    last_index = len(q_tokens) - 1
    return torch.tensor(final_tokens), last_index

def inference(text, starter='', temperature=1.0):
    curr = 0
    pred_list = []
    # Append [CLS] to the context so generation starts right after it
    t, last_token = tokenize_text(text + '[CLS]' + starter)
    t = t.to(device)

    while curr != 102:  # 102 is BERT's [SEP] token, used here as <eos>
        print('\n', "Generating...")
        all_pred = model_pred(t, temperature)
        pred = all_pred[last_token].item()
        pred_list.append(pred)
        t[0][last_token + 1] = pred  # feed the prediction back in as the next input token
        last_token += 1
        curr = pred

        if curr > 10:
            break

    print("Question from the model: ".upper(), starter + ' ' + tokenizer.decode(pred_list))

    return starter + ' ' + tokenizer.decode(pred_list)

inference(text, '')
QUESTION FROM THE MODEL: what is the name of the singer? [SEP]
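model_pred above always takes the argmax, so the temperature argument has no visible effect on the generated text; a small sketch of a sampling variant that would actually use it (an alternative, not what the notebook runs):

# Hypothetical sampling variant: draw from the temperature-scaled distribution
# instead of taking the argmax (the model already returns softmax probabilities).
def model_sample(tokens, temp=1.0):
    model.eval()
    with torch.no_grad():
        probs = model(tokens, temp)               # (1, 300, vocab_size)
        flat = probs.view(-1, probs.shape[-1])    # (300, vocab_size), one distribution per position
        return torch.multinomial(flat, num_samples=1).squeeze(-1)
# Swapping model_pred for model_sample inside inference() would make higher temperatures more diverse.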

Issues
Not enough data: I trained on only 130K samples, which is too small.
Pre-training on a downstream task: pre-training is supposed to be self-supervised.
