0% found this document useful (0 votes)

45 views9 pages

Decoder-Only Transformer (LLM) For Question Asking: Notebook Structure

Natural Language Processing

Uploaded by

Daniel Ayesu Kissiedu

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

45 views9 pages

Decoder-Only Transformer (LLM) For Question Asking: Notebook Structure

Natural Language Processing

Uploaded by

Daniel Ayesu Kissiedu

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 9

Decoder-Only Transformer (LLM) For Question

Asking
"FROM SCRATCH"

Notebook Structure

Data
Data source
Tokenization
Features and Target
Test data
Model Design
Positional encoding
Multi-head attention
Transformer Decoder
Final Architecture
Training script
Simplistic Inference Script
Issues and mistakes
Pre-training with a downstream task
Not masking Padding layers
Context window

In [1]: #necessary imports

import numpy as np
import pandas as pd
import torch
import matplotlib.pyplot as plot
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel
import random

In [2]: if torch.cuda.is_available():
device = torch.device("cuda")
print("GPU is available")
else:
device = torch.device("cpu")
print("GPU is not available, using CPU")

GPU is available

DATA

Data Source
The data I used for this project is the Stanford Question Ansewring Dataset (SQuAD).
SQuAD was prepared such that a question and a context would map to an Answer (Q+C -->
A). I modified this the data so that a Context would map to question (C --> Q).
Find my modified data here = link to dataset

In [3]: data = pd.read_json('{fill with path to your data}').to_dict(orient='list')

Tokenization
In [ ]: #bert tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

In [5]: #Example
data['conversation'][0]

[{'from': 'human',
Out[5]:
'value': 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1
981) is an American singer, songwriter, record producer and actress. Born and raised in
Houston, Texas, she performed in various singing and dancing competitions as a child, an
d rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Mana
ged by her father, Mathew Knowles, the group became one of the world\'s best-selling gir
l groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerousl
y in Love (2003), which established her as a solo artist worldwide, earned five Grammy A
wards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Bo
y".'},
{'from': 'gpt', 'value': 'When did Beyonce start becoming popular?'}]

In [6]: def tokenize_input(qa):

#1. tokenizing with a max seq length of 300 and padding layers
#2. Adding an <sos> and <eos> token to target values. In this case; [CLS] and [SEP]
seq_length = 300
q_tokens = tokenizer(qa[0]['value'],add_special_tokens=False)['input_ids']
a_tokens = tokenizer(qa[1]['value'],padding=True)['input_ids']

x_tokens = q_tokens + a_tokens[:-1]

y_tokens = q_tokens[1:] + a_tokens

x_pad = [0 for i in range(seq_length-len(x_tokens))]

y_pad = [0 for i in range(seq_length-len(x_tokens))]
final_x = x_tokens + x_pad
final_y = y_tokens + y_pad

return final_x, final_y

In [7]: #tokenizing all data

tokens = []
targets = []
for i in random.sample(data['conversation'],len(data['conversation'])):
try:
x, y = tokenize_input(i)

if len(x) == 300:
tokens.append(x)
targets.append(y)
except:
pass

X = torch.IntTensor(tokens)
Y = torch.LongTensor(targets)

Token indices sequence length is longer than the specified maximum sequence length for t
his model (718 > 512). Running this sequence through the model will result in indexing e
rrors
In [8]: X.shape, Y.shape

(torch.Size([124975, 300]), torch.Size([124975, 300]))

Out[8]:

Test Data
Create your test data here

Model Design

Important Notes
Embedding layer: I used the emmbedding layer from the bert model.
Positional encoding: Sinusoidal encoding from Attention is all you need
Attention: Multihead (4 heads)
Linear projection: Projected input to 224 before passing it through decoder
Number of decoders: 8

Embedding layer
In [9]: class embed(torch.nn.Module):
def __init__(self):
super().__init__()
self.embedder = AutoModel.from_pretrained('bert-base-uncased')

def forward(self,x_tokens):
inputs = {'input_ids':x_tokens}
with torch.no_grad():
attention_mask = (inputs['input_ids'] != 0).int()
outputs = self.embedder(**inputs,attention_mask=attention_mask)
embeddings = outputs.last_hidden_state * attention_mask.unsqueeze(-1)
return embeddings
Positional Encoding
In [10]: ## sinusoidal positional encoding

class pos_enc(torch.nn.Module):
def __init__(self) -> None:
super().__init__()

def forward(self,x):
batch_size, max_seq_length, dmodel = x.shape
pe = torch.zeros_like(x) #position encoding matrix

# Compute the positional encoding values

for pos in range(max_seq_length):
for i in range(0, dmodel):
if i % 2 == 0:
pe[:, pos, i] = torch.math.sin(pos / (10000 ** (2 * i / dmodel)))
else:
pe[:, pos, i] = torch.math.cos(pos / (10000 ** (2 * i / dmodel)))

x = x + pe
return x

Self-attention mechanisim
In [11]: class self_attention(torch.nn.Module):
def __init__(self,no_of_heads: int ,shape: tuple, mask: bool=False, QKV: list=[]
'''
Initializes a Self Attention module as described in the "Attention is all you ne
This module splits the input into multiple heads to allow the model to jointly a
from different representation subspaces at different positions. After attention
on each head, the module concatenates and linearly transforms the results.

## Parameters:
* no_of_heads (int): Number of attention heads. To implement single head att

* shape (tuple): A tuple (seq_length, dmodel) where `seq_length` is the lengt

and `dmodel` is the dimensionality of the input feature space

* mask (bool, optional): If True, a mask will be applied to prevent attentio

* QKV (list, optional): A list containing pre-computed Query (Q), Key (K), a
The forward pass computes the multi-head attention for input `x` and returns the
'''
super().__init__()
self.h = no_of_heads
self.seq_length,self.dmodel = shape
self.dk = self.dmodel//self.h
self.softmax = torch.nn.Softmax(dim=-1)
self.mQW = torch.nn.ModuleList([torch.nn.Linear(self.dmodel,self.dk) for
self.mKW = torch.nn.ModuleList([torch.nn.Linear(self.dmodel,self.dk) for
self.mVW = torch.nn.ModuleList([torch.nn.Linear(self.dmodel,self.dk) for
self.output_linear = torch.nn.Linear(self.dmodel,self.dmodel)
self.mask = mask
self.QKV = QKV

def __add_mask(self,atten_values):
#masking attention values
mask_value = -1e9
mask = torch.triu(torch.ones(atten_values.shape) * mask_value, diagonal=1)
masked = atten_values + mask.to(device)
return masked

def forward(self, x):

heads = []
for i in range(self.h):
# Apply linear projections in batch from dmodel => h x d_k
if self.QKV:
q = self.mQW[i](self.QKV[0])
k = self.mKW[i](self.QKV[1])
v = self.mVW[i](self.QKV[2])
else:
q = self.mQW[i](x)
k = self.mKW[i](x)
v = self.mVW[i](x)

# Calculate attention using the projected vectors q, k, and v

self.scores = torch.matmul(q, k.transpose(-1, -2)) / torch.sqrt(torch.te
if self.mask:
self.scores = self.__add_mask(self.scores)

attn = self.softmax(self.scores)
head_i = torch.matmul(attn, v)

heads.append(head_i)

# Concatenate all the heads together

multi_head = torch.cat(heads, dim=-1)
# Final linear layer
output = self.output_linear(multi_head)

return output + x # Residual connection

Decoder
In [12]: class decoder_layer(torch.nn.Module):
def __init__(self,shape: tuple,no_of_heads:int = 1):
'''
Implementation of Transformer Dencoder
Parameters:
shape (tuple): The shape (H, W) of the input tensor
no_of_heads (int): number of heads in the attention mechanism. set this to 1
Returns:
Tensor: The output of the encoder layer after applying attention, feedforwar
'''
super().__init__()

self.max_seq_length,self.dmodel = shape
def ff_weights():
layer1 = torch.nn.Linear(self.dmodel,600)
layer2 = torch.nn.Linear(600,600)
layer3 = torch.nn.Linear(600,self.dmodel)
return layer1,layer2,layer3

self.no_of_heads = no_of_heads

self.multi_head = self_attention(no_of_heads=no_of_heads, mask=True,

shape=(self.max_seq_length,self.dmod

self.layer1,self.layer2,self.layer3 = ff_weights()
self.softmax = torch.nn.Softmax(dim=-1)
self.layerNorm = torch.nn.LayerNorm(shape)
self.relu1 = torch.nn.ReLU()
self.relu2 = torch.nn.ReLU()

def feed_forward(self,x):
f = self.layer1(x)
f = self.relu1(f)
f = self.layer2(f)
f = self.relu2(f)
f = self.layer3(f)

return self.layerNorm(f + x) #residual connection

def forward(self,x):
x = self.multi_head(x)
x = self.layerNorm(x)
x = self.feed_forward(x)
x = self.layerNorm(x)

return x

Full Model Architecture

In [13]: class architecture(torch.nn.Module):
def __init__(self,n_classes,shape) -> None:
super().__init__()
self.max_seq_length,self.dmodel = shape
self.projected_dmodel = 224
self.embedding_layer = embed()
self.proj_to_224 = torch.nn.Linear(self.dmodel, self.projected_dmodel)
self.positional = pos_enc()
self.decoder1 = decoder_layer(shape=(self.max_seq_length,self.projected_dmodel),
self.decoder2 = decoder_layer(shape=(self.max_seq_length,self.projected_dmodel),
self.decoder3 = decoder_layer(shape=(self.max_seq_length,self.projected_dmodel),
self.decoder4 = decoder_layer(shape=(self.max_seq_length,self.projected_dmodel),
self.decoder5 = decoder_layer(shape=(self.max_seq_length,self.projected_dmodel),
self.decoder6 = decoder_layer(shape=(self.max_seq_length,self.projected_dmodel),
self.decoder7 = decoder_layer(shape=(self.max_seq_length,self.projected_dmodel),
self.decoder8 = decoder_layer(shape=(self.max_seq_length,self.projected_dmodel),
# self.decoder5 = decoder_layer(shape=(self.max_seq_length,self.projected_dmodel
self.final_MLP = torch.nn.Linear(self.projected_dmodel,n_classes)
self.softmax = torch.nn.Softmax(dim=2)

def forward(self,x,temperature=1.0):
x = self.embedding_layer(x)
x = self.proj_to_224(x)
x = self.positional(x)
x = self.decoder1(x)
x = self.decoder2(x)
x = self.decoder3(x)
x = self.decoder4(x)
x = self.decoder5(x)
x = self.decoder6(x)
x = self.decoder7(x)
x = self.decoder8(x)
x = self.final_MLP(x)
logits = x / temperature
x = self.softmax(logits)

return x

Training Script
In [ ]: !pip install torchmetrics

In [14]: from torchmetrics import Accuracy

from tqdm import tqdm

In [15]: dataset = torch.utils.data.TensorDataset(X,Y)

loader = torch.utils.data.DataLoader(dataset,batch_size=20,num_workers=0,shuffle=False)

In [ ]: vocab_size = tokenizer.vocab_size
model = architecture(n_classes = vocab_size, shape = (300,768))
model = model.to(device)
model.load_state_dict(torch.load("{fill with path to model weights if any}"))

In [76]: metric = Accuracy(num_classes=vocab_size,task='multiclass').to(device)

optimizer = torch.optim.Adam(model.parameters(),lr=0.0001)
criterion = torch.nn.CrossEntropyLoss(ignore_index=0,label_smoothing=0.01)

In [ ]: from tqdm import tqdm

print('Training Started')
NUM_EPOCHS = 1

for epoch in range(NUM_EPOCHS):

model.train() # Set the model to training mode
running_loss = 0.0
epoch_accuracy = 0.0
num_batches = len(loader)

# Initialize tqdm progress bar

with tqdm(total=num_batches, desc=f"Epoch {epoch + 1}", leave=True) as pbar:
for i, (x_batch, y_batch) in enumerate(loader):
x_batch, y_batch = x_batch.to(device), y_batch.to(device)

# Zero the parameter gradients

optimizer.zero_grad()

# Forward pass
outputs = model(x_batch)
# Flatten the outputs and y_batch tensors one dimension lower
outputs = outputs.view(-1, outputs.shape[-1])
y_batch = y_batch.view(-1)

# Loss calculation
loss = criterion(outputs, y_batch).to(device)

# Backward pass and optimize

loss.backward()
optimizer.step()

# Metrics
argmax_pred = outputs.argmax(axis=1)
metric.update(argmax_pred, y_batch)

# Print statistics
running_loss += loss.item()
if i % 10 == 9: # update every 10 mini-batches
accuracy = metric.compute().item()
epoch_accuracy += accuracy
pbar.set_postfix({'Loss': running_loss / (i + 1), 'Accuracy': accuracy})

# Update the progress bar

pbar.update(1)
# Save model weights periodically
if i % 10 == 9:
torch.save(model.state_dict(), '/kaggle/working/model_weights.pth')

# Compute and print average loss and accuracy for the epoch
avg_loss = running_loss / num_batches
avg_accuracy = epoch_accuracy / (num_batches // 10) # since we're summing accuracy
print(f'Epoch {epoch + 1} - Loss: {avg_loss:.4f}, Accuracy: {avg_accuracy:.4f}')

print('Training Completed')

Training Started
Epoch 1: 66%|██████ | 2770/4166 [1:58:07<1:00:08, 2.59s/it, Loss=9.89, Accuracy=0.2
34]

In [ ]: #Link to download model

from IPython.display import FileLink
FileLink(r'model_weights.pth')

Simplistic Inference Script

In [17]: text = '''Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 198

In [6]: def model_pred(tokens,temp):

model.eval()
with torch.no_grad():
pred = model(tokens,temp)
pred = pred.view(-1, pred.shape[-1]).argmax(axis=1)
return pred

def tokenize_text(text):
seq_length = 300
q_tokens = tokenizer(text,add_special_tokens=False)['input_ids']
pad = [0 for i in range(seq_length-len(q_tokens))]
final_tokens = [q_tokens + pad]
last_index = len(q_tokens)-1

return torch.tensor(final_tokens),last_index

def inference(text, starter='', temperature=1.0):

curr = 0
pred_list = []
t, last_token = tokenize_text(text + '[CLS]' + starter)
t = t.to(device)

while curr != 102:

print('\n',"Generating...")
all_pred = model_pred(t, temperature)
pred = all_pred[last_token].item()
pred_list.append(pred)
t[0][last_token + 1] = pred
last_token += 1
curr = pred

if curr > 10:

break
print("Question from the model: ".upper(), starter + ' ' + tokenizer.decode(pred_lis

return starter + ' ' + tokenizer.decode(pred_list)

inference(text, '')
QUESTION FROM THE MODEL: what is the name of the singer? [SEP]

Issues
Not Enough data: I trained on only 130K samples which is too small
Pre-trainig on a downstream task: Pretraining is supposed to be self supervised

LNG Custody Transfer Handbook PDF
100% (2)
LNG Custody Transfer Handbook PDF
108 pages
CS236 Introduction To PyTorch
100% (4)
CS236 Introduction To PyTorch
33 pages
Transformers Torch
No ratings yet
Transformers Torch
38 pages
q1 General Physics Module 2
No ratings yet
q1 General Physics Module 2
13 pages
Internship Supervisor's Evalution Form Makerere
100% (1)
Internship Supervisor's Evalution Form Makerere
2 pages
Normality, T-Test, ANOVA, Chi Square, Correlation
No ratings yet
Normality, T-Test, ANOVA, Chi Square, Correlation
31 pages
IBest DeepLearning
No ratings yet
IBest DeepLearning
123 pages
NLP Week8 Transformers
No ratings yet
NLP Week8 Transformers
66 pages
CNNs and Transformers
No ratings yet
CNNs and Transformers
90 pages
6 - Risk Appetite Statement Template
100% (1)
6 - Risk Appetite Statement Template
5 pages
Font Transfer 2 Autoencoders
No ratings yet
Font Transfer 2 Autoencoders
78 pages
Bahdanau Attention Mechanism (Also Known As Additive Attention)
No ratings yet
Bahdanau Attention Mechanism (Also Known As Additive Attention)
41 pages
Attention Variants
No ratings yet
Attention Variants
24 pages
Chapter 2
No ratings yet
Chapter 2
52 pages
Pytorch Tutorial 1
No ratings yet
Pytorch Tutorial 1
48 pages
Anlp 05 Transformers
No ratings yet
Anlp 05 Transformers
40 pages
Position Encoding: Intuition Lack Inherent Word Order Awareness
No ratings yet
Position Encoding: Intuition Lack Inherent Word Order Awareness
33 pages
Modeling Chatglm
No ratings yet
Modeling Chatglm
20 pages
11 Transformers Notes
No ratings yet
11 Transformers Notes
25 pages
Coding Attention Mechanisms
No ratings yet
Coding Attention Mechanisms
24 pages
Lecture Notes - Advanced Language Model - BERT, GPT
No ratings yet
Lecture Notes - Advanced Language Model - BERT, GPT
24 pages
PyTorch Crash Course 1713016363
No ratings yet
PyTorch Crash Course 1713016363
15 pages
Astro AI
No ratings yet
Astro AI
20 pages
Project Source
No ratings yet
Project Source
21 pages
Autoencoder From Scratch
No ratings yet
Autoencoder From Scratch
21 pages
495 Lecture 10 Attall
No ratings yet
495 Lecture 10 Attall
18 pages
Understanding and Coding The Self-Attention Mechanism of Large Language Models From Scratch
No ratings yet
Understanding and Coding The Self-Attention Mechanism of Large Language Models From Scratch
20 pages
Convolutional Autoencoder in Pytorch On MNIST Dataset - by Eugenia Anello - DataSeries - Medium
No ratings yet
Convolutional Autoencoder in Pytorch On MNIST Dataset - by Eugenia Anello - DataSeries - Medium
18 pages
How To Implement Multi-Head Attention From Scratch in TensorFlow and Keras
No ratings yet
How To Implement Multi-Head Attention From Scratch in TensorFlow and Keras
20 pages
PyTorch CrashCourse
No ratings yet
PyTorch CrashCourse
17 pages
Harvard CS197 Lecture 6 & 7 Notes
No ratings yet
Harvard CS197 Lecture 6 & 7 Notes
18 pages
Pytorch 101: Deep Learning PHD Course 2017/2018
No ratings yet
Pytorch 101: Deep Learning PHD Course 2017/2018
19 pages
PyTorch CrashCourse
No ratings yet
PyTorch CrashCourse
16 pages
Variational AutoEncoders (VAE) With PyTorch - Alexander Van de Kleut
No ratings yet
Variational AutoEncoders (VAE) With PyTorch - Alexander Van de Kleut
17 pages
NLP 4
No ratings yet
NLP 4
10 pages
Transformers Implementations 1731410319
No ratings yet
Transformers Implementations 1731410319
10 pages
Transformer
No ratings yet
Transformer
10 pages
LLM Code Ref
No ratings yet
LLM Code Ref
10 pages
Hyper Parameteres: Dataset
No ratings yet
Hyper Parameteres: Dataset
13 pages
CS541 HW4
No ratings yet
CS541 HW4
11 pages
Mlp-Fromscratch Sigmoid-Mse
No ratings yet
Mlp-Fromscratch Sigmoid-Mse
13 pages
Notes On Implementing Attention - Eli Bendersky
No ratings yet
Notes On Implementing Attention - Eli Bendersky
12 pages
GPT2 From Scratch in PyTorch
No ratings yet
GPT2 From Scratch in PyTorch
13 pages
Fast Transformer Decoding - One Write-Head Is All You Need
No ratings yet
Fast Transformer Decoding - One Write-Head Is All You Need
9 pages
Assignment No 4
No ratings yet
Assignment No 4
8 pages
Self Attention With Trainable Weights 1726701162
No ratings yet
Self Attention With Trainable Weights 1726701162
12 pages
Pytorch Demo 1749471354
No ratings yet
Pytorch Demo 1749471354
10 pages
Ilovepdf Merged
No ratings yet
Ilovepdf Merged
10 pages
Transformers
No ratings yet
Transformers
15 pages
Transformer Flux
No ratings yet
Transformer Flux
11 pages
PyTorch Cheat Sheet & Quick Reference
No ratings yet
PyTorch Cheat Sheet & Quick Reference
6 pages
Karpathy MinGPT Model
No ratings yet
Karpathy MinGPT Model
7 pages
Transformer
No ratings yet
Transformer
4 pages
Transformer 2
No ratings yet
Transformer 2
6 pages
Pytorch Tutorial For Beginner: Department of Computer Science & Engineering University of Washington
No ratings yet
Pytorch Tutorial For Beginner: Department of Computer Science & Engineering University of Washington
11 pages
Transformer
No ratings yet
Transformer
10 pages
Tutorials Sources Beginner Ptcheat
No ratings yet
Tutorials Sources Beginner Ptcheat
7 pages
TXT
No ratings yet
TXT
7 pages
Course No.: Math 1213 Differential Equation Lecturer: Dr. M. Saifur Rahman
100% (1)
Course No.: Math 1213 Differential Equation Lecturer: Dr. M. Saifur Rahman
24 pages
Code File
No ratings yet
Code File
6 pages
DL Lab (6-10) With Output
No ratings yet
DL Lab (6-10) With Output
5 pages
A4
No ratings yet
A4
8 pages
Homework IntroToDL
No ratings yet
Homework IntroToDL
3 pages
Adjusted American - Putney
No ratings yet
Adjusted American - Putney
228 pages
Analysis Report - Soil Nail SGHR100 MacMat
No ratings yet
Analysis Report - Soil Nail SGHR100 MacMat
2 pages
Abstrak NG Thesis Filipino
100% (2)
Abstrak NG Thesis Filipino
6 pages
What Is ISO
No ratings yet
What Is ISO
3 pages
Resume Design For Process Engineering
No ratings yet
Resume Design For Process Engineering
3 pages
A Ph.D. Research Proposal: YUSUF, Idowu Olusola
No ratings yet
A Ph.D. Research Proposal: YUSUF, Idowu Olusola
10 pages
Matter
No ratings yet
Matter
38 pages
PHD Thesis Comments
100% (2)
PHD Thesis Comments
4 pages
Syllabus
No ratings yet
Syllabus
4 pages
s1 Homework
100% (2)
s1 Homework
4 pages
Screening Test: (English & Telugu Versions)
No ratings yet
Screening Test: (English & Telugu Versions)
38 pages
HCHCR6142T HCHCR6142P
No ratings yet
HCHCR6142T HCHCR6142P
4 pages
LT LG400 13 Inspection Maintenance Procedures
No ratings yet
LT LG400 13 Inspection Maintenance Procedures
6 pages
Module 2 LESSON 2
No ratings yet
Module 2 LESSON 2
3 pages
OACC Data Analysis SAC 2019 Final
No ratings yet
OACC Data Analysis SAC 2019 Final
9 pages
Introduction To Bioinformatics - Notes
No ratings yet
Introduction To Bioinformatics - Notes
18 pages
Ccaa Training Catalogue - March 2023 2
No ratings yet
Ccaa Training Catalogue - March 2023 2
27 pages
Provide Compassionate, Provide Compassionate, Provide Compassionate, Respectful and Caring Service Learninig Guide 02
No ratings yet
Provide Compassionate, Provide Compassionate, Provide Compassionate, Respectful and Caring Service Learninig Guide 02
13 pages
Data Analysis Coca Cola
No ratings yet
Data Analysis Coca Cola
7 pages
Sharifi Yazdi2019
No ratings yet
Sharifi Yazdi2019
20 pages
2019 Mark Mathys Award
No ratings yet
2019 Mark Mathys Award
11 pages
Proving Areaof Triangle Using Series
No ratings yet
Proving Areaof Triangle Using Series
9 pages
B Agad Template 2024 2025
No ratings yet
B Agad Template 2024 2025
3 pages
ZYAROCK Artec Pot Leaflet (En)
No ratings yet
ZYAROCK Artec Pot Leaflet (En)
2 pages
Advanced C Concepts and Programming: First Edition
From Everand
Advanced C Concepts and Programming: First Edition
Gayatri
3/5 (1)
Computer Engineering Laboratory Solution Primer
From Everand
Computer Engineering Laboratory Solution Primer
Karan Bhandari
No ratings yet
Profound Python Data Science
From Everand
Profound Python Data Science
Onder Teker
No ratings yet

Decoder-Only Transformer (LLM) For Question Asking: Notebook Structure

Uploaded by

Decoder-Only Transformer (LLM) For Question Asking: Notebook Structure

Uploaded by

Decoder-Only Transformer (LLM) For Question

In [1]: #necessary imports

In [3]: data = pd.read_json('{fill with path to your data}').to_dict(orient='list')

In [6]: def tokenize_input(qa):

x_tokens = q_tokens + a_tokens[:-1]

x_pad = [0 for i in range(seq_length-len(x_tokens))]

return final_x, final_y

In [7]: #tokenizing all data

(torch.Size([124975, 300]), torch.Size([124975, 300]))

# Compute the positional encoding values

* shape (tuple): A tuple (seq_length, dmodel) where `seq_length` is the lengt

* mask (bool, optional): If True, a mask will be applied to prevent attentio

def forward(self, x):

# Calculate attention using the projected vectors q, k, and v

# Concatenate all the heads together

return output + x # Residual connection

self.multi_head = self_attention(no_of_heads=no_of_heads, mask=True,

return self.layerNorm(f + x) #residual connection

Full Model Architecture

In [14]: from torchmetrics import Accuracy

In [15]: dataset = torch.utils.data.TensorDataset(X,Y)

In [76]: metric = Accuracy(num_classes=vocab_size,task='multiclass').to(device)

In [ ]: from tqdm import tqdm

for epoch in range(NUM_EPOCHS):

# Initialize tqdm progress bar

# Zero the parameter gradients

# Backward pass and optimize

# Update the progress bar

In [ ]: #Link to download model

Simplistic Inference Script

In [6]: def model_pred(tokens,temp):

def inference(text, starter='', temperature=1.0):

while curr != 102:

if curr > 10:

return starter + ' ' + tokenizer.decode(pred_list)

You might also like