
In this tutorial, we will build a basic Transformer model from scratch using PyTorch. The Transformer model, introduced by Vaswani et al. in the paper "Attention Is All You Need", is a deep learning architecture designed for sequence-to-sequence tasks, such as machine translation and text summarization. It is based on self-attention mechanisms and has become the foundation for many state-of-the-art natural language processing models, like GPT and BERT.

To understand Transformer models in more detail, see the article Introduction to Transformers.

To build the Transformer model, we'll follow these steps:


1. Import necessary libraries and modules
2. Define the basic building blocks: Multi-Head Attention, Position-Wise Feed-Forward Networks, Positional Encoding
3. Build the Encoder and Decoder layers
4. Combine the Encoder and Decoder layers to create the complete Transformer model
5. Prepare sample data
6. Train the model

Let's start by importing the necessary libraries and modules


import torch
import torch.nn as nn
import torch.optim as optim
import math
import copy

Now, we'll define the basic building blocks of the Transformer model.

Multi-Head Attention

The Multi-Head Attention mechanism computes the attention between each pair of positions in a sequence. It consists of multiple "attention heads" that capture different aspects of the input sequence.

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        assert d_model % num_heads == 0  # d_model must be divisible by num_heads

        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        attn_scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            attn_scores = attn_scores.masked_fill(mask == 0, -1e9)

        attn_probs = torch.softmax(attn_scores, dim=-1)
        output = torch.matmul(attn_probs, V)
        return output

    def split_heads(self, x):
        batch_size, seq_length, d_model = x.size()
        return x.view(batch_size, seq_length, self.num_heads, self.d_k).transpose(1, 2)

    def combine_heads(self, x):
        batch_size, _, seq_length, d_k = x.size()
        return x.transpose(1, 2).contiguous().view(batch_size, seq_length, self.d_model)

    def forward(self, Q, K, V, mask=None):
        Q = self.split_heads(self.W_q(Q))
        K = self.split_heads(self.W_k(K))
        V = self.split_heads(self.W_v(V))

        attn_output = self.scaled_dot_product_attention(Q, K, V, mask)
        output = self.W_o(self.combine_heads(attn_output))
        return output

The MultiHeadAttention code initializes the module with input parameters and linear transformation layers. It calculates attention scores, reshapes the input tensor into multiple heads, and combines the attention outputs from all heads. The forward method computes the multi-head self-attention, allowing the model to focus on different aspects of the input sequence.
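A quick way to sanity-check the module is to run it on random tensors; the sketch below assumes example values d_model=512 and num_heads=8:

# Minimal sanity check with random tensors (example values: d_model=512, num_heads=8)
mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.rand(2, 10, 512)   # (batch_size, seq_length, d_model)
out = mha(x, x, x)           # self-attention: Q, K and V are the same tensor
print(out.shape)             # torch.Size([2, 10, 512])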

Position-Wise Feed-Forward Networks


class PositionWiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super(PositionWiseFeedForward, self).__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.fc2(self.relu(self.fc1(x)))

The PositionWiseFeedForward class extends PyTorch’s nn.Module and implements a position-wise feed-forward network. The class initializes with two linear transformation layers and a ReLU activation function. The forward method applies these transformations and the activation function sequentially to compute the output. The network is applied to each position of the sequence independently, letting the model refine every position’s representation after attention.
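As a small illustration (example values d_model=512 and d_ff=2048), the block maps each position’s 512-dimensional vector through a 2048-dimensional hidden layer and back, leaving the tensor shape unchanged:

# Shape check: the feed-forward block preserves (batch_size, seq_length, d_model)
ffn = PositionWiseFeedForward(d_model=512, d_ff=2048)
x = torch.rand(2, 10, 512)
print(ffn(x).shape)  # torch.Size([2, 10, 512])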

Positional Encoding


Positional Encoding is used to inject the position information of each token in the input
sequence. It uses sine and cosine functions of different frequencies to generate the positional
encoding.
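For reference, this is the sinusoidal encoding from the paper, which the code below implements (pos is the token position and i indexes the embedding dimension):

PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))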
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_seq_length):
        super(PositionalEncoding, self).__init__()

        pe = torch.zeros(max_seq_length, d_model)
        position = torch.arange(0, max_seq_length, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))

        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)

        self.register_buffer('pe', pe.unsqueeze(0))

    def forward(self, x):
        return x + self.pe[:, :x.size(1)]

The PositionalEncoding class initializes with input parameters d_model and max_seq_length,
creating a tensor to store positional encoding values. The class calculates sine and cosine
values for even and odd indices, respectively, based on the scaling factor div_term. The
forward method computes the positional encoding by adding the stored positional encoding
values to the input tensor, allowing the model to capture the position information of the input
sequence.
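A short check (example values only) shows that the encoding is precomputed for max_seq_length positions and simply added to the input, so the shape is unchanged:

# The buffer holds encodings for max_seq_length positions; forward adds a slice of it
pos_enc = PositionalEncoding(d_model=512, max_seq_length=100)
x = torch.zeros(2, 20, 512)
print(pos_enc(x).shape)   # torch.Size([2, 20, 512])
print(pos_enc.pe.shape)   # torch.Size([1, 100, 512])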

Now, we’ll build the Encoder and Decoder layers.

Encoder Layer


An Encoder layer consists of a Multi-Head Attention layer, a Position-wise Feed-Forward layer,
and two Layer Normalization layers.

class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask):
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))
        return x

The EncoderLayer class initializes with input parameters and components, including a MultiHeadAttention module, a PositionWiseFeedForward module, two layer normalization modules, and a dropout layer. The forward method computes the encoder layer output by applying self-attention, adding the attention output to the input tensor, and normalizing the result. Then, it computes the position-wise feed-forward output, combines it with the normalized self-attention output, and normalizes the final result before returning the processed tensor.
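For illustration, a single encoder layer can be run on a random tensor (example hyperparameters; passing mask=None means every position may attend to every other position):

# One encoder layer pass on random input (example hyperparameters)
enc_layer = EncoderLayer(d_model=512, num_heads=8, d_ff=2048, dropout=0.1)
x = torch.rand(2, 10, 512)
print(enc_layer(x, mask=None).shape)  # torch.Size([2, 10, 512])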
Decoder Layer

A Decoder layer consists of two Multi-Head Attention layers, a Position-wise Feed-Forward layer, and three Layer Normalization layers.

class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(DecoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_output, src_mask, tgt_mask):
        attn_output = self.self_attn(x, x, x, tgt_mask)
        x = self.norm1(x + self.dropout(attn_output))
        attn_output = self.cross_attn(x, enc_output, enc_output, src_mask)
        x = self.norm2(x + self.dropout(attn_output))
        ff_output = self.feed_forward(x)
        x = self.norm3(x + self.dropout(ff_output))
        return x

The DecoderLayer initializes with input parameters and components such as MultiHeadAttention modules for masked self-attention and cross-attention, a PositionWiseFeedForward module, three layer normalization modules, and a dropout layer.

The forward method computes the decoder layer output by performing the following steps:

1. Calculate the masked self-attention output and add it to the input tensor, followed by
dropout and layer normalization.

2. Compute the cross-attention output between the decoder and encoder outputs, and add it
to the normalized masked self-attention output, followed by dropout and layer
normalization.

3. Calculate the position-wise feed-forward output and combine it with the normalized
cross-attention output, followed by dropout and layer normalization.

4. Return the processed tensor.

These operations enable the decoder to generate target sequences based on the input and the
encoder output.
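For illustration, a single decoder layer can be exercised with random tensors (example hyperparameters; masks are omitted here for brevity, and the source and target lengths may differ):

# One decoder layer pass with a separate (random) encoder output
dec_layer = DecoderLayer(d_model=512, num_heads=8, d_ff=2048, dropout=0.1)
tgt = torch.rand(2, 12, 512)      # decoder input: (batch_size, tgt_seq_length, d_model)
enc_out = torch.rand(2, 10, 512)  # encoder output: (batch_size, src_seq_length, d_model)
print(dec_layer(tgt, enc_out, None, None).shape)  # torch.Size([2, 12, 512])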

Now, let's combine the Encoder and Decoder layers to create the complete Transformer model.

Transformer Model


Merging it all together:

class Transformer(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, num_heads, num_layers, d_ff, max_seq_length, dropout):
        super(Transformer, self).__init__()
        self.encoder_embedding = nn.Embedding(src_vocab_size, d_model)
        self.decoder_embedding = nn.Embedding(tgt_vocab_size, d_model)
        self.positional_encoding = PositionalEncoding(d_model, max_seq_length)

        self.encoder_layers = nn.ModuleList([EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.decoder_layers = nn.ModuleList([DecoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])

        self.fc = nn.Linear(d_model, tgt_vocab_size)
        self.dropout = nn.Dropout(dropout)

    def generate_mask(self, src, tgt):
        src_mask = (src != 0).unsqueeze(1).unsqueeze(2)
        tgt_mask = (tgt != 0).unsqueeze(1).unsqueeze(3)
        seq_length = tgt.size(1)
        nopeak_mask = (1 - torch.triu(torch.ones(1, seq_length, seq_length), diagonal=1)).bool()
        tgt_mask = tgt_mask & nopeak_mask
        return src_mask, tgt_mask

    def forward(self, src, tgt):
        src_mask, tgt_mask = self.generate_mask(src, tgt)
        src_embedded = self.dropout(self.positional_encoding(self.encoder_embedding(src)))
        tgt_embedded = self.dropout(self.positional_encoding(self.decoder_embedding(tgt)))

        enc_output = src_embedded
        for enc_layer in self.encoder_layers:
            enc_output = enc_layer(enc_output, src_mask)

        dec_output = tgt_embedded
        for dec_layer in self.decoder_layers:
            dec_output = dec_layer(dec_output, enc_output, src_mask, tgt_mask)

        output = self.fc(dec_output)
        return output

The Transformer class combines the previously defined modules to create a complete
Transformer model. During initialization, the Transformer module sets up input parameters and
initializes various components, including embedding layers for source and target sequences, a
PositionalEncoding module, EncoderLayer and DecoderLayer modules to create stacked layers,
a linear layer for projecting decoder output, and a dropout layer.

The generate_mask method creates binary masks for source and target sequences to ignore
padding tokens and prevent the decoder from attending to future tokens. The forward method
computes the Transformer model’s output through the following steps:

1. Generate source and target masks using the generate_mask method.

2. Compute source and target embeddings, and apply positional encoding and dropout.

3. Process the source sequence through encoder layers, updating the enc_output tensor.

4. Process the target sequence through decoder layers, using enc_output and masks, and
updating the dec_output tensor.

5. Apply the linear projection layer to the decoder output, obtaining output logits.

These steps enable the Transformer model to process input sequences and generate output
sequences based on the combined functionality of its components.
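To see what generate_mask produces, here is a small illustration with a toy configuration (hypothetical values; token index 0 is treated as padding):

# Toy model just to inspect the mask shapes (hypothetical hyperparameters)
toy = Transformer(src_vocab_size=10, tgt_vocab_size=10, d_model=32, num_heads=4,
                  num_layers=1, d_ff=64, max_seq_length=5, dropout=0.0)
src = torch.tensor([[4, 7, 2, 0, 0]])   # last two source tokens are padding
tgt = torch.tensor([[1, 5, 3, 0, 0]])
src_mask, tgt_mask = toy.generate_mask(src, tgt)
print(src_mask.shape)  # torch.Size([1, 1, 1, 5]) -- hides padded source positions
print(tgt_mask.shape)  # torch.Size([1, 1, 5, 5]) -- padding mask combined with the no-peek mask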

Preparing Sample Data


In this example, we will create a toy dataset for demonstration purposes. In practice, you would
use a larger dataset, preprocess the text, and create vocabulary mappings for source and target
languages.
src_vocab_size = 5000
tgt_vocab_size = 5000
d_model = 512
num_heads = 8
num_layers = 6
d_ff = 2048
max_seq_length = 100
dropout = 0.1

transformer = Transformer(src_vocab_size, tgt_vocab_size, d_model, num_heads, num_layers, d_ff, max_seq_length, dropout)

# generate sample data
src_data = torch.randint(1, src_vocab_size, (64, max_seq_length))
tgt_data = torch.randint(1, tgt_vocab_size, (64, max_seq_length))

print(src_data.shape)
print(tgt_data.shape)

torch.Size([64, 100])
torch.Size([64, 100])

criterion = nn.CrossEntropyLoss(ignore_index=0)
optimizer = optim.Adam(transformer.parameters(),lr=0.001,betas=(0.9,0.98),eps=1e-9)

transformer.train()

for epoch in range(10):
    optimizer.zero_grad()
    output = transformer(src_data, tgt_data[:, :-1])
    loss = criterion(output.contiguous().view(-1, tgt_vocab_size), tgt_data[:, 1:].contiguous().view(-1))
    loss.backward()
    optimizer.step()
    print(f"Epoch: {epoch+1}, Loss: {loss.item()}")

Epoch: 1, Loss: 8.661681175231934
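Since the data is random, the loss mainly shows that the pipeline runs end to end. A minimal sketch of an evaluation step on freshly sampled random "validation" data (for illustration only) could look like this:

# Evaluation sketch on random "validation" data (illustration only)
transformer.eval()
val_src = torch.randint(1, src_vocab_size, (64, max_seq_length))
val_tgt = torch.randint(1, tgt_vocab_size, (64, max_seq_length))
with torch.no_grad():
    val_output = transformer(val_src, val_tgt[:, :-1])
    val_loss = criterion(val_output.contiguous().view(-1, tgt_vocab_size),
                         val_tgt[:, 1:].contiguous().view(-1))
print(f"Validation Loss: {val_loss.item()}")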
