Decoder-Only Transformer (LLM) For Question Asking: Notebook Structure
Decoder-Only Transformer (LLM) For Question Asking: Notebook Structure
Asking
"FROM SCRATCH"
Notebook Structure
Data
Data source
Tokenization
Features and Target
Test data
Model Design
Positional encoding
Multi-head attention
Transformer Decoder
Final Architecture
Training script
Simplistic Inference Script
Issues and mistakes
Pre-training with a downstream task
Not masking Padding layers
Context window
In [2]: if torch.cuda.is_available():
device = torch.device("cuda")
print("GPU is available")
else:
device = torch.device("cpu")
print("GPU is not available, using CPU")
GPU is available
DATA
Data Source
The data I used for this project is the Stanford Question Ansewring Dataset (SQuAD).
SQuAD was prepared such that a question and a context would map to an Answer (Q+C -->
A). I modified this the data so that a Context would map to question (C --> Q).
Find my modified data here = link to dataset
Tokenization
In [ ]: #bert tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
In [5]: #Example
data['conversation'][0]
[{'from': 'human',
Out[5]:
'value': 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1
981) is an American singer, songwriter, record producer and actress. Born and raised in
Houston, Texas, she performed in various singing and dancing competitions as a child, an
d rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Mana
ged by her father, Mathew Knowles, the group became one of the world\'s best-selling gir
l groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerousl
y in Love (2003), which established her as a solo artist worldwide, earned five Grammy A
wards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Bo
y".'},
{'from': 'gpt', 'value': 'When did Beyonce start becoming popular?'}]
if len(x) == 300:
tokens.append(x)
targets.append(y)
except:
pass
X = torch.IntTensor(tokens)
Y = torch.LongTensor(targets)
Token indices sequence length is longer than the specified maximum sequence length for t
his model (718 > 512). Running this sequence through the model will result in indexing e
rrors
In [8]: X.shape, Y.shape
Test Data
Create your test data here
Model Design
Important Notes
Embedding layer: I used the emmbedding layer from the bert model.
Positional encoding: Sinusoidal encoding from Attention is all you need
Attention: Multihead (4 heads)
Linear projection: Projected input to 224 before passing it through decoder
Number of decoders: 8
Embedding layer
In [9]: class embed(torch.nn.Module):
def __init__(self):
super().__init__()
self.embedder = AutoModel.from_pretrained('bert-base-uncased')
def forward(self,x_tokens):
inputs = {'input_ids':x_tokens}
with torch.no_grad():
attention_mask = (inputs['input_ids'] != 0).int()
outputs = self.embedder(**inputs,attention_mask=attention_mask)
embeddings = outputs.last_hidden_state * attention_mask.unsqueeze(-1)
return embeddings
Positional Encoding
In [10]: ## sinusoidal positional encoding
class pos_enc(torch.nn.Module):
def __init__(self) -> None:
super().__init__()
def forward(self,x):
batch_size, max_seq_length, dmodel = x.shape
pe = torch.zeros_like(x) #position encoding matrix
x = x + pe
return x
Self-attention mechanisim
In [11]: class self_attention(torch.nn.Module):
def __init__(self,no_of_heads: int ,shape: tuple, mask: bool=False, QKV: list=[]
'''
Initializes a Self Attention module as described in the "Attention is all you ne
This module splits the input into multiple heads to allow the model to jointly a
from different representation subspaces at different positions. After attention
on each head, the module concatenates and linearly transforms the results.
## Parameters:
* no_of_heads (int): Number of attention heads. To implement single head att
* QKV (list, optional): A list containing pre-computed Query (Q), Key (K), a
The forward pass computes the multi-head attention for input `x` and returns the
'''
super().__init__()
self.h = no_of_heads
self.seq_length,self.dmodel = shape
self.dk = self.dmodel//self.h
self.softmax = torch.nn.Softmax(dim=-1)
self.mQW = torch.nn.ModuleList([torch.nn.Linear(self.dmodel,self.dk) for
self.mKW = torch.nn.ModuleList([torch.nn.Linear(self.dmodel,self.dk) for
self.mVW = torch.nn.ModuleList([torch.nn.Linear(self.dmodel,self.dk) for
self.output_linear = torch.nn.Linear(self.dmodel,self.dmodel)
self.mask = mask
self.QKV = QKV
def __add_mask(self,atten_values):
#masking attention values
mask_value = -1e9
mask = torch.triu(torch.ones(atten_values.shape) * mask_value, diagonal=1)
masked = atten_values + mask.to(device)
return masked
attn = self.softmax(self.scores)
head_i = torch.matmul(attn, v)
heads.append(head_i)
Decoder
In [12]: class decoder_layer(torch.nn.Module):
def __init__(self,shape: tuple,no_of_heads:int = 1):
'''
Implementation of Transformer Dencoder
Parameters:
shape (tuple): The shape (H, W) of the input tensor
no_of_heads (int): number of heads in the attention mechanism. set this to 1
Returns:
Tensor: The output of the encoder layer after applying attention, feedforwar
'''
super().__init__()
self.max_seq_length,self.dmodel = shape
def ff_weights():
layer1 = torch.nn.Linear(self.dmodel,600)
layer2 = torch.nn.Linear(600,600)
layer3 = torch.nn.Linear(600,self.dmodel)
return layer1,layer2,layer3
self.no_of_heads = no_of_heads
self.layer1,self.layer2,self.layer3 = ff_weights()
self.softmax = torch.nn.Softmax(dim=-1)
self.layerNorm = torch.nn.LayerNorm(shape)
self.relu1 = torch.nn.ReLU()
self.relu2 = torch.nn.ReLU()
def feed_forward(self,x):
f = self.layer1(x)
f = self.relu1(f)
f = self.layer2(f)
f = self.relu2(f)
f = self.layer3(f)
def forward(self,x):
x = self.multi_head(x)
x = self.layerNorm(x)
x = self.feed_forward(x)
x = self.layerNorm(x)
return x
def forward(self,x,temperature=1.0):
x = self.embedding_layer(x)
x = self.proj_to_224(x)
x = self.positional(x)
x = self.decoder1(x)
x = self.decoder2(x)
x = self.decoder3(x)
x = self.decoder4(x)
x = self.decoder5(x)
x = self.decoder6(x)
x = self.decoder7(x)
x = self.decoder8(x)
x = self.final_MLP(x)
logits = x / temperature
x = self.softmax(logits)
return x
Training Script
In [ ]: !pip install torchmetrics
In [ ]: vocab_size = tokenizer.vocab_size
model = architecture(n_classes = vocab_size, shape = (300,768))
model = model.to(device)
model.load_state_dict(torch.load("{fill with path to model weights if any}"))
print('Training Started')
NUM_EPOCHS = 1
# Forward pass
outputs = model(x_batch)
# Flatten the outputs and y_batch tensors one dimension lower
outputs = outputs.view(-1, outputs.shape[-1])
y_batch = y_batch.view(-1)
# Loss calculation
loss = criterion(outputs, y_batch).to(device)
# Metrics
argmax_pred = outputs.argmax(axis=1)
metric.update(argmax_pred, y_batch)
# Print statistics
running_loss += loss.item()
if i % 10 == 9: # update every 10 mini-batches
accuracy = metric.compute().item()
epoch_accuracy += accuracy
pbar.set_postfix({'Loss': running_loss / (i + 1), 'Accuracy': accuracy})
# Compute and print average loss and accuracy for the epoch
avg_loss = running_loss / num_batches
avg_accuracy = epoch_accuracy / (num_batches // 10) # since we're summing accuracy
print(f'Epoch {epoch + 1} - Loss: {avg_loss:.4f}, Accuracy: {avg_accuracy:.4f}')
print('Training Completed')
Training Started
Epoch 1: 66%|██████ | 2770/4166 [1:58:07<1:00:08, 2.59s/it, Loss=9.89, Accuracy=0.2
34]
def tokenize_text(text):
seq_length = 300
q_tokens = tokenizer(text,add_special_tokens=False)['input_ids']
pad = [0 for i in range(seq_length-len(q_tokens))]
final_tokens = [q_tokens + pad]
last_index = len(q_tokens)-1
return torch.tensor(final_tokens),last_index
inference(text, '')
QUESTION FROM THE MODEL: what is the name of the singer? [SEP]
Issues
Not Enough data: I trained on only 130K samples which is too small
Pre-trainig on a downstream task: Pretraining is supposed to be self supervised