
In this tutorial, we will build a basic Transformer model from scratch using PyTorch. The Transformer model, introduced by Vaswani et al. in the paper "Attention Is All You Need", is a deep learning architecture designed for sequence-to-sequence tasks, such as machine translation and text summarization. It is based on self-attention mechanisms and has become the foundation for many state-of-the-art natural language processing models, like GPT and BERT.

To understand Transformer models in more detail, see the article Introduction to Transformers.

To build the Transformer model, we'll follow these steps:


1. Import necessary libraries and modules
2. Define the basic building blocks: Multi-Head Attention, Position-Wise Feed-Forward Networks, Positional Encoding
3. Build the Encoder and Decoder layers
4. Combine the Encoder and Decoder layers to create the complete Transformer model
5. Prepare sample data
6. Train the model

Let's start by importing the necessary libraries and modules


import torch
import torch.nn as nn
import torch.optim as optim
import math
import copy

Now, we'll define the basic building blocks of the Transformer model.

Multi-Head Attention

The Multi-Head Attention mechanism computes the attention between each pair of positions in a sequence. It consists of multiple "attention heads" that capture different aspects of the input sequence.

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        assert d_model % num_heads == 0  # d_model must be divisible by num_heads

        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        attn_scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            attn_scores = attn_scores.masked_fill(mask == 0, -1e9)

        attn_probs = torch.softmax(attn_scores, dim=-1)
        output = torch.matmul(attn_probs, V)
        return output

    def split_heads(self, x):
        batch_size, seq_length, d_model = x.size()
        return x.view(batch_size, seq_length, self.num_heads, self.d_k).transpose(1, 2)

    def combine_heads(self, x):
        batch_size, _, seq_length, d_k = x.size()
        return x.transpose(1, 2).contiguous().view(batch_size, seq_length, self.d_model)

    def forward(self, Q, K, V, mask=None):
        Q = self.split_heads(self.W_q(Q))
        K = self.split_heads(self.W_k(K))
        V = self.split_heads(self.W_v(V))

        attn_output = self.scaled_dot_product_attention(Q, K, V, mask)
        output = self.W_o(self.combine_heads(attn_output))
        return output

The MultiHeadAttention code initializes the module with input parameters and linear transformation layers. It calculates attention scores, reshapes the input tensor into multiple heads, and combines the attention outputs from all heads. The forward method computes the multi-head self-attention, allowing the model to focus on different aspects of the input sequence.
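A quick way to sanity-check the module is to run it on random tensors; the sketch below assumes example values d_model=512 and num_heads=8:

# Minimal sanity check with random tensors (example values: d_model=512, num_heads=8)
mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.rand(2, 10, 512)   # (batch_size, seq_length, d_model)
out = mha(x, x, x)           # self-attention: Q, K and V are the same tensor
print(out.shape)             # torch.Size([2, 10, 512])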

Position-Wise Feed-Forward Networks


class PositionWiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super(PositionWiseFeedForward, self).__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.fc2(self.relu(self.fc1(x)))

The PositionWiseFeedForward class extends PyTorch’s nn.Module and implements a position-wise feed-forward network. The class initializes with two linear transformation layers and a ReLU activation function. The forward method applies these transformations and the activation function sequentially to compute the output. The network is applied to each position of the sequence independently, letting the model refine every position’s representation after attention.
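As a small illustration (example values d_model=512 and d_ff=2048), the block maps each position’s 512-dimensional vector through a 2048-dimensional hidden layer and back, leaving the tensor shape unchanged:

# Shape check: the feed-forward block preserves (batch_size, seq_length, d_model)
ffn = PositionWiseFeedForward(d_model=512, d_ff=2048)
x = torch.rand(2, 10, 512)
print(ffn(x).shape)  # torch.Size([2, 10, 512])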

Positional Encoding


Positional Encoding is used to inject the position information of each token in the input
sequence. It uses sine and cosine functions of different frequencies to generate the positional
encoding.
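For reference, this is the sinusoidal encoding from the paper, which the code below implements (pos is the token position and i indexes the embedding dimension):

PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))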
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_seq_length):
        super(PositionalEncoding, self).__init__()

        pe = torch.zeros(max_seq_length, d_model)
        position = torch.arange(0, max_seq_length, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))

        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)

        self.register_buffer('pe', pe.unsqueeze(0))

    def forward(self, x):
        return x + self.pe[:, :x.size(1)]

The PositionalEncoding class initializes with input parameters d_model and max_seq_length,
creating a tensor to store positional encoding values. The class calculates sine and cosine
values for even and odd indices, respectively, based on the scaling factor div_term. The
forward method computes the positional encoding by adding the stored positional encoding
values to the input tensor, allowing the model to capture the position information of the input
sequence.
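A short check (example values only) shows that the encoding is precomputed for max_seq_length positions and simply added to the input, so the shape is unchanged:

# The buffer holds encodings for max_seq_length positions; forward adds a slice of it
pos_enc = PositionalEncoding(d_model=512, max_seq_length=100)
x = torch.zeros(2, 20, 512)
print(pos_enc(x).shape)   # torch.Size([2, 20, 512])
print(pos_enc.pe.shape)   # torch.Size([1, 100, 512])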

Now, we’ll build the Encoder and Decoder layers.

Encoder Layer


An Encoder layer consists of a Multi-Head Attention layer, a Position-wise Feed-Forward layer,
and two Layer Normalization layers.

class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask):
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))
        return x

The EncoderLayer class initializes with input parameters and components, including a MultiHeadAttention module, a PositionWiseFeedForward module, two layer normalization modules, and a dropout layer. The forward method computes the encoder layer output by applying self-attention, adding the attention output to the input tensor, and normalizing the result. Then, it computes the position-wise feed-forward output, combines it with the normalized self-attention output, and normalizes the final result before returning the processed tensor.
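For illustration, a single encoder layer can be run on a random tensor (example hyperparameters; passing mask=None means every position may attend to every other position):

# One encoder layer pass on random input (example hyperparameters)
enc_layer = EncoderLayer(d_model=512, num_heads=8, d_ff=2048, dropout=0.1)
x = torch.rand(2, 10, 512)
print(enc_layer(x, mask=None).shape)  # torch.Size([2, 10, 512])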
Decoder Layer

A Decoder layer consists of two Multi-Head Attention layers, a Position-wise Feed-Forward layer, and three Layer Normalization layers.

class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(DecoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_output, src_mask, tgt_mask):
        attn_output = self.self_attn(x, x, x, tgt_mask)
        x = self.norm1(x + self.dropout(attn_output))
        attn_output = self.cross_attn(x, enc_output, enc_output, src_mask)
        x = self.norm2(x + self.dropout(attn_output))
        ff_output = self.feed_forward(x)
        x = self.norm3(x + self.dropout(ff_output))
        return x

The DecoderLayer initializes with input parameters and components such as MultiHeadAttention modules for masked self-attention and cross-attention, a PositionWiseFeedForward module, three layer normalization modules, and a dropout layer.

The forward method computes the decoder layer output by performing the following steps:

1. Calculate the masked self-attention output and add it to the input tensor, followed by
dropout and layer normalization.

2. Compute the cross-attention output between the decoder and encoder outputs, and add it
to the normalized masked self-attention output, followed by dropout and layer
normalization.

3. Calculate the position-wise feed-forward output and combine it with the normalized
cross-attention output, followed by dropout and layer normalization.

4. Return the processed tensor.

These operations enable the decoder to generate target sequences based on the input and the
encoder output.
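For illustration, a single decoder layer can be exercised with random tensors (example hyperparameters; masks are omitted here for brevity, and the source and target lengths may differ):

# One decoder layer pass with a separate (random) encoder output
dec_layer = DecoderLayer(d_model=512, num_heads=8, d_ff=2048, dropout=0.1)
tgt = torch.rand(2, 12, 512)      # decoder input: (batch_size, tgt_seq_length, d_model)
enc_out = torch.rand(2, 10, 512)  # encoder output: (batch_size, src_seq_length, d_model)
print(dec_layer(tgt, enc_out, None, None).shape)  # torch.Size([2, 12, 512])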

Now, let's combine the Encoder and Decoder layers to create the complete Transformer model.

Transformer Model


Merging it all together:

class Transformer(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, num_heads, num_layers, d_ff, max_seq_length, dropout):
        super(Transformer, self).__init__()
        self.encoder_embedding = nn.Embedding(src_vocab_size, d_model)
        self.decoder_embedding = nn.Embedding(tgt_vocab_size, d_model)
        self.positional_encoding = PositionalEncoding(d_model, max_seq_length)

        self.encoder_layers = nn.ModuleList([EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.decoder_layers = nn.ModuleList([DecoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])

        self.fc = nn.Linear(d_model, tgt_vocab_size)
        self.dropout = nn.Dropout(dropout)

    def generate_mask(self, src, tgt):
        src_mask = (src != 0).unsqueeze(1).unsqueeze(2)
        tgt_mask = (tgt != 0).unsqueeze(1).unsqueeze(3)
        seq_length = tgt.size(1)
        nopeak_mask = (1 - torch.triu(torch.ones(1, seq_length, seq_length), diagonal=1)).bool()
        tgt_mask = tgt_mask & nopeak_mask
        return src_mask, tgt_mask

    def forward(self, src, tgt):
        src_mask, tgt_mask = self.generate_mask(src, tgt)
        src_embedded = self.dropout(self.positional_encoding(self.encoder_embedding(src)))
        tgt_embedded = self.dropout(self.positional_encoding(self.decoder_embedding(tgt)))

        enc_output = src_embedded
        for enc_layer in self.encoder_layers:
            enc_output = enc_layer(enc_output, src_mask)

        dec_output = tgt_embedded
        for dec_layer in self.decoder_layers:
            dec_output = dec_layer(dec_output, enc_output, src_mask, tgt_mask)

        output = self.fc(dec_output)
        return output

The Transformer class combines the previously defined modules to create a complete
Transformer model. During initialization, the Transformer module sets up input parameters and
initializes various components, including embedding layers for source and target sequences, a
PositionalEncoding module, EncoderLayer and DecoderLayer modules to create stacked layers,
a linear layer for projecting decoder output, and a dropout layer.

The generate_mask method creates binary masks for source and target sequences to ignore
padding tokens and prevent the decoder from attending to future tokens. The forward method
computes the Transformer model’s output through the following steps:

1. Generate source and target masks using the generate_mask method.

2. Compute source and target embeddings, and apply positional encoding and dropout.

3. Process the source sequence through encoder layers, updating the enc_output tensor.

4. Process the target sequence through decoder layers, using enc_output and masks, and
updating the dec_output tensor.

5. Apply the linear projection layer to the decoder output, obtaining output logits.

These steps enable the Transformer model to process input sequences and generate output
sequences based on the combined functionality of its components.
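To see what generate_mask produces, here is a small illustration with a toy configuration (hypothetical values; token index 0 is treated as padding):

# Toy model just to inspect the mask shapes (hypothetical hyperparameters)
toy = Transformer(src_vocab_size=10, tgt_vocab_size=10, d_model=32, num_heads=4,
                  num_layers=1, d_ff=64, max_seq_length=5, dropout=0.0)
src = torch.tensor([[4, 7, 2, 0, 0]])   # last two source tokens are padding
tgt = torch.tensor([[1, 5, 3, 0, 0]])
src_mask, tgt_mask = toy.generate_mask(src, tgt)
print(src_mask.shape)  # torch.Size([1, 1, 1, 5]) -- hides padded source positions
print(tgt_mask.shape)  # torch.Size([1, 1, 5, 5]) -- padding mask combined with the no-peek mask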

Preparing Sample Data


In this example, we will create a toy dataset for demonstration purposes. In practice, you would
use a larger dataset, preprocess the text, and create vocabulary mappings for source and target
languages.
src_vocab_size = 5000
tgt_vocab_size = 5000
d_model = 512
num_heads = 8
num_layers = 6
d_ff = 2048
max_seq_length = 100
dropout = 0.1

transformer = Transformer(src_vocab_size, tgt_vocab_size, d_model, num_heads, num_layers, d_ff, max_seq_length, dropout)

# generate sample data
src_data = torch.randint(1, src_vocab_size, (64, max_seq_length))
tgt_data = torch.randint(1, tgt_vocab_size, (64, max_seq_length))

print(src_data.shape)
print(tgt_data.shape)

torch.Size([64, 100])
torch.Size([64, 100])

criterion = nn.CrossEntropyLoss(ignore_index=0)
optimizer = optim.Adam(transformer.parameters(),lr=0.001,betas=(0.9,0.98),eps=1e-9)

transformer.train()

for epoch in range(10):
    optimizer.zero_grad()
    output = transformer(src_data, tgt_data[:, :-1])
    loss = criterion(output.contiguous().view(-1, tgt_vocab_size), tgt_data[:, 1:].contiguous().view(-1))
    loss.backward()
    optimizer.step()
    print(f"Epoch: {epoch+1}, Loss: {loss.item()}")

Epoch: 1, Loss: 8.661681175231934
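Since the data is random, the loss mainly shows that the pipeline runs end to end. A minimal sketch of an evaluation step on freshly sampled random "validation" data (for illustration only) could look like this:

# Evaluation sketch on random "validation" data (illustration only)
transformer.eval()
val_src = torch.randint(1, src_vocab_size, (64, max_seq_length))
val_tgt = torch.randint(1, tgt_vocab_size, (64, max_seq_length))
with torch.no_grad():
    val_output = transformer(val_src, val_tgt[:, :-1])
    val_loss = criterion(val_output.contiguous().view(-1, tgt_vocab_size),
                         val_tgt[:, 1:].contiguous().view(-1))
print(f"Validation Loss: {val_loss.item()}")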
