
Lab 6: Transformers

Machine Learning Hardware Course

This notebook is Part 1 of the transformer lab. It covers:

1. Environment setup
2. Understanding basic operations in the Transformer.

References:

1. https://uvadlc-notebooks.readthedocs.io
2. https://github.com/phlippe/uvadlc_notebooks

PART 1: Basic Operations in Transformer

## Standard libraries
import os
import numpy as np
import random
import math
import json
from functools import partial

## Imports for plotting
import matplotlib.pyplot as plt
plt.set_cmap('cividis')
%matplotlib inline
from IPython.display import set_matplotlib_formats
set_matplotlib_formats('svg', 'pdf')  # For export
from matplotlib.colors import to_rgb
import matplotlib
matplotlib.rcParams['lines.linewidth'] = 2.0
import seaborn as sns
sns.reset_orig()

## tqdm for loading bars
from tqdm.notebook import tqdm

## PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.data as data
import torch.optim as optim

## Torchvision
import torchvision
from torchvision.datasets import CIFAR100
from torchvision import transforms

# PyTorch Lightning
try:
    import pytorch_lightning as pl
except ModuleNotFoundError:
    # Google Colab does not have PyTorch Lightning installed by default.
    # Hence, we do it here if necessary.
    !pip install --quiet "pytorch-lightning>=1.4"
    import pytorch_lightning as pl
from pytorch_lightning.callbacks import LearningRateMonitor, ModelCheckpoint

# Path to the folder where the datasets are/should be downloaded (e.g. CIFAR10)
DATASET_PATH = "../data"
# Path to the folder where the pretrained models are saved
CHECKPOINT_PATH = "../saved_models/tutorial6"

# Setting the seed
pl.seed_everything(42)

# Ensure that all operations are deterministic on GPU (if used) for reproducibility
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")
print("Device:", device)

<ipython-input-1-c8fdf9d3ded7>:14: DeprecationWarning: `set_matplotlib_formats` is deprecated since IPython 7.23, directly use `matplotlib_inline.backend_inline.set_matplotlib_formats()`
  set_matplotlib_formats('svg', 'pdf')  # For export
INFO:lightning_fabric.utilities.seed:Seed set to 42

Device: cuda:0

<Figure size 640x480 with 0 Axes>

Two pre-trained models are downloaded below. Make sure you have adjusted CHECKPOINT_PATH before running this code if you have not already done so.

import urllib.request
from urllib.error import HTTPError

# Github URL where saved models are stored for this tutorial
base_url = "https://raw.githubusercontent.com/phlippe/saved_models/main/tutorial6/"
# Files to download
pretrained_files = ["ReverseTask.ckpt", "SetAnomalyTask.ckpt"]

# Create checkpoint path if it doesn't exist yet
os.makedirs(CHECKPOINT_PATH, exist_ok=True)

# For each file, check whether it already exists. If not, try downloading it.
for file_name in pretrained_files:
    file_path = os.path.join(CHECKPOINT_PATH, file_name)
    if "/" in file_name:
        os.makedirs(file_path.rsplit("/", 1)[0], exist_ok=True)
    if not os.path.isfile(file_path):
        file_url = base_url + file_name
        print(f"Downloading {file_url}...")
        try:
            urllib.request.urlretrieve(file_url, file_path)
        except HTTPError as e:
            print("Something went wrong. Please try to download the file from the GDrive folder,"
                  " or contact the author with the full output including the following error:\n", e)
Downloading https://raw.githubusercontent.com/phlippe/saved_models/main/tutorial6/ReverseTask.ckpt...
Downloading https://raw.githubusercontent.com/phlippe/saved_models/main/tutorial6/SetAnomalyTask.ckpt...

The Transformer architecture

In the first part of this notebook, we will implement the Transformer architecture by hand. As the architecture is so popular, there already exists a PyTorch module nn.Transformer (documentation) and a tutorial on how to use it for next-token prediction. However, we will implement it here ourselves to get through to the smallest details.
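
For reference, a minimal sketch of calling the built-in module (shapes and hyperparameters here are arbitrary; note that nn.Transformer defaults to the [SeqLen, Batch, Dims] layout unless batch_first=True is set):

# Minimal sketch of PyTorch's built-in nn.Transformer (not used further in this lab)
import torch
import torch.nn as nn

model = nn.Transformer(d_model=64, nhead=4, num_encoder_layers=2, num_decoder_layers=2)
src = torch.rand(10, 32, 64)   # [source seq len, batch, d_model]
tgt = torch.rand(8, 32, 64)    # [target seq len, batch, d_model]
out = model(src, tgt)          # -> torch.Size([8, 32, 64])
print(out.shape)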

What is Attention?

The attention mechanism describes a recent group of layers in neural networks that has attracted a lot of interest in the past few years, especially in sequence tasks. There are a lot of different possible definitions of "attention" in the literature, but the one we will use here is the following: the attention mechanism describes a weighted average of (sequence) elements with the weights dynamically computed based on an input query and elements' keys. So what does this exactly mean? The goal is to take an average over the features of multiple elements. However, instead of weighting each element equally, we want to weight them depending on their actual values. In other words, we want to dynamically decide on which inputs we want to "attend" more than others. In particular, an attention mechanism usually has four parts we need to specify:

- Query: The query is a feature vector that describes what we are looking for in the sequence, i.e. what we might want to pay attention to.
- Keys: For each input element, we have a key which is again a feature vector. This feature vector roughly describes what the element is "offering", or when it might be important. The keys should be designed such that we can identify the elements we want to pay attention to based on the query.
- Values: For each input element, we also have a value vector. This feature vector is the one we want to average over.
- Score function: To rate which elements we want to pay attention to, we need to specify a score function f_(attn). The score function takes the query and a key as input, and outputs the score/attention weight of the query-key pair. It is usually implemented by simple similarity metrics like a dot product, or a small MLP.

The weights of the average are calculated by a softmax over all score
function outputs. Hence, we assign those value vectors a higher weight
whose corresponding key is most similar to the query. If we try to
describe it with pseudo-math, we can write:

$$
\alpha_i = \frac{\exp\left(f_{attn}\left(\text{key}_i, \text{query}\right)\right)}{\sum_j \exp\left(f_{attn}\left(\text{key}_j, \text{query}\right)\right)}, \hspace{5mm} \text{out} = \sum_i \alpha_i \cdot \text{value}_i
$$

For every word, we have one key and one value vector. The query is
compared to all keys with a score function (in this case the dot
product) to determine the weights. The softmax is not visualized for
simplicity. Finally, the value vectors of all words are averaged using
the attention weights.
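
As a tiny numerical sketch of the formula above (the numbers are arbitrary), the weights α are a softmax over the scores, and the output is the weighted sum of the value vectors:

import torch
import torch.nn.functional as F

# Toy example of the weighted average defined above (arbitrary numbers)
scores = torch.tensor([2.0, 0.5, -1.0])          # f_attn(key_i, query) for three elements
values = torch.tensor([[1.0, 0.0],
                       [0.0, 1.0],
                       [1.0, 1.0]])              # one value vector per element
alpha = F.softmax(scores, dim=0)                 # attention weights, sum to 1
out = (alpha.unsqueeze(1) * values).sum(dim=0)   # weighted average of the values
print(alpha, out)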

Most attention mechanisms differ in terms of what queries they use, how the key and value vectors are defined, and what score function is used. The attention applied inside the Transformer architecture is called self-attention. In self-attention, each sequence element provides a key, value, and query. For each element, we perform an attention layer where, based on its query, we check the similarity of all sequence elements' keys and return a different, averaged value vector for each element. We will now go into a bit more detail by first looking at the specific implementation of the attention mechanism used in the Transformer: the scaled dot product attention.

Scaled Dot Product Attention

The core concept behind self-attention is the scaled dot product attention. Our goal is to have an attention mechanism with which any element in a sequence can attend to any other while still being efficient to compute. The dot product attention takes as input a set of queries Q ∈ ℝ^(T × d_(k)), keys K ∈ ℝ^(T × d_(k)) and values V ∈ ℝ^(T × d_(v)), where T is the sequence length, and d_(k) and d_(v) are the hidden dimensionalities for queries/keys and values respectively. For simplicity, we neglect the batch dimension for now. The attention value from element i to j is based on the similarity of the query Q_(i) and key K_(j), using the dot product as the similarity metric. In math, we calculate the dot product attention as follows:

$$\text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

The matrix multiplication QK^(T) performs the dot product for every possible pair of queries and keys, resulting in a matrix of the shape T × T. Each row represents the attention logits for a specific element i to all other elements in the sequence. On these, we apply a softmax and multiply with the value vector to obtain a weighted mean (the weights being determined by the attention). Another perspective on this attention mechanism is offered by the computation graph visualized below (figure credit - Vaswani et al., 2017).

One aspect we haven't discussed yet is the scaling factor of $1/\sqrt{d_k}$. This scaling factor is crucial to maintain an appropriate variance of attention values after initialization. Remember that we initialize our layers with the intention of having equal variance throughout the model, and hence, Q and K might also have a variance close to 1. However, performing a dot product over two vectors with a variance σ² results in a scalar having d_(k)-times higher variance:

$$q_i \sim \mathcal{N}(0,\sigma^2), k_i \sim \mathcal{N}(0,\sigma^2) \to \text{Var}\left(\sum_{i=1}^{d_k} q_i\cdot k_i\right) = \sigma^4\cdot d_k$$

If we do not scale down the variance back to ∼ σ², the softmax over the
logits will already saturate to 1 for one random element and 0 for all
others. The gradients through the softmax will be close to zero so that
we can't learn the parameters appropriately. Note that the extra factor
of σ², i.e., having σ⁴ instead of σ², is usually not an issue, since we
keep the original variance σ² close to 1 anyways.
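
A quick empirical check of this scaling argument (a sketch with arbitrary sizes): the variance of the un-scaled dot products grows roughly linearly with d_k, while dividing by sqrt(d_k) keeps it close to 1.

# Sanity check of the variance argument above (arbitrary sizes)
d_k = 256
q = torch.randn(10000, d_k)          # queries with variance ~1
k = torch.randn(10000, d_k)          # keys with variance ~1
logits = (q * k).sum(dim=-1)         # un-scaled dot products
print(logits.var())                  # roughly d_k (~256)
print((logits / d_k**0.5).var())     # roughly 1 after scaling by sqrt(d_k)
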
The block Mask (opt.) in the diagram above represents the optional
masking of specific entries in the attention matrix. This is for
instance used if we stack multiple sequences with different lengths into
a batch. To still benefit from parallelization in PyTorch, we pad the
sentences to the same length and mask out the padding tokens during the
calculation of the attention values. This is usually done by setting the
respective attention logits to a very low value.

After we have discussed the details of the scaled dot product attention
block, we can write a function below which computes the output features
given the triple of queries, keys, and values:

def scaled_dot_product(q, k, v, mask=None):
    d_k = q.size()[-1]
    attn_logits = torch.matmul(q, k.transpose(-2, -1))
    attn_logits = attn_logits / math.sqrt(d_k)
    if mask is not None:
        attn_logits = attn_logits.masked_fill(mask == 0, -9e15)
    attention = F.softmax(attn_logits, dim=-1)
    values = torch.matmul(attention, v)
    return values, attention

Note that our code above supports any additional dimensionality in front
of the sequence length so that we can also use it for batches. However,
for a better understanding, let's generate a few random queries, keys,
and value vectors, and calculate the attention outputs:

seq_len, d_k = 3, 2
pl.seed_everything(42)
q = torch.randn(seq_len, d_k)
k = torch.randn(seq_len, d_k)
v = torch.randn(seq_len, d_k)
values, attention = scaled_dot_product(q, k, v)
print("Q\n", q)
print("K\n", k)
print("V\n", v)
print("Values\n", values)
print("Attention\n", attention)

INFO:lightning_fabric.utilities.seed:Seed set to 42

Q
tensor([[ 0.3367, 0.1288],
[ 0.2345, 0.2303],
[-1.1229, -0.1863]])
K
tensor([[ 2.2082, -0.6380],
[ 0.4617, 0.2674],
[ 0.5349, 0.8094]])
V
tensor([[ 1.1103, -1.6898],
[-0.9890, 0.9580],
[ 1.3221, 0.8172]])
Values
tensor([[ 0.5698, -0.1520],
[ 0.5379, -0.0265],
[ 0.2246, 0.5556]])
Attention
tensor([[0.4028, 0.2886, 0.3086],
[0.3538, 0.3069, 0.3393],
[0.1303, 0.4630, 0.4067]])

Before continuing, make sure you can follow the calculation of the
specific values here, and also check it by hand. It is important to
fully understand how the scaled dot product attention is calculated.
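
A padding mask of the kind described earlier can be passed directly via the mask argument. A small sketch (a batch of two sequences where the last token of the second sequence is treated as padding):

# Sketch: masking out a padding position with the function defined above
seq_len, d_k = 3, 2
q = torch.randn(2, seq_len, d_k)   # batch of two sequences
k = torch.randn(2, seq_len, d_k)
v = torch.randn(2, seq_len, d_k)

# mask[b, i, j] = 1 means query position i may attend to key position j
mask = torch.ones(2, seq_len, seq_len, dtype=torch.long)
mask[1, :, 2] = 0                  # second sequence: last token is padding
values, attention = scaled_dot_product(q, k, v, mask=mask)
print(attention[1])                # column 2 receives (almost) zero attention weight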

Multi-Head Attention

The scaled dot product attention allows a network to attend over a sequence. However, often there are multiple different aspects a sequence element wants to attend to, and a single weighted average is not a good option for this. This is why we extend the attention mechanism to multiple heads, i.e. multiple different query-key-value triplets on the same features. Specifically, given a query, key, and value matrix, we transform those into h sub-queries, sub-keys, and sub-values, which we pass through the scaled dot product attention independently. Afterward, we concatenate the heads and combine them with a final weight matrix. Mathematically, we can express this operation as:

$$
\begin{split}
\text{Multihead}(Q,K,V) & = \text{Concat}(\text{head}_1,...,\text{head}_h)W^{O}\\
\text{where } \text{head}_i & = \text{Attention}(QW_i^Q,KW_i^K, VW_i^V)
\end{split}
$$

We refer to this as a Multi-Head Attention layer with the learnable parameters W_(1...h)^(Q) ∈ ℝ^(D × d_(k)), W_(1...h)^(K) ∈ ℝ^(D × d_(k)), W_(1...h)^(V) ∈ ℝ^(D × d_(v)), and W^(O) ∈ ℝ^(h ⋅ d_(v) × d_(out)) (D being the input dimensionality).

How do we apply a Multi-Head Attention layer in a neural network, where we don't have an arbitrary query, key, and value vector as input? Looking at the computation graph above, a simple but effective implementation is to set the current feature map in a NN, X ∈ ℝ^(B × T × d_(model)), as Q, K and V (B being the batch size, T the sequence length, d_(model) the hidden dimensionality of X). The consecutive weight matrices W^(Q), W^(K), and W^(V) can transform X to the corresponding feature vectors that represent the queries, keys, and values of the input. Using this approach, we can implement the Multi-Head Attention module below.

# Helper function to support different mask shapes.
# Output shape supports (batch_size, number of heads, seq length, seq length)
# If 2D: broadcasted over batch size and number of heads
# If 3D: broadcasted over number of heads
# If 4D: leave as is
def expand_mask(mask):
    assert mask.ndim >= 2, "Mask must be at least 2-dimensional with seq_length x seq_length"
    if mask.ndim == 3:
        mask = mask.unsqueeze(1)
    while mask.ndim < 4:
        mask = mask.unsqueeze(0)
    return mask

# http://jalammar.github.io/illustrated-transformer/
class MultiheadAttention(nn.Module):

    def __init__(self, input_dim, embed_dim, num_heads):
        super().__init__()
        assert embed_dim % num_heads == 0, "Embedding dimension must be 0 modulo number of heads."

        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads

        # Stack all weight matrices 1...h together for efficiency
        # Note that in many implementations you see "bias=False" which is optional
        self.qkv_proj = nn.Linear(input_dim, 3*embed_dim)
        self.o_proj = nn.Linear(embed_dim, input_dim)

        self._reset_parameters()

    def _reset_parameters(self):
        # Original Transformer initialization, see PyTorch documentation
        nn.init.xavier_uniform_(self.qkv_proj.weight)
        self.qkv_proj.bias.data.fill_(0)
        nn.init.xavier_uniform_(self.o_proj.weight)
        self.o_proj.bias.data.fill_(0)

    def forward(self, x, mask=None, return_attention=False):
        batch_size, seq_length, _ = x.size()
        if mask is not None:
            mask = expand_mask(mask)
        qkv = self.qkv_proj(x)

        # Separate Q, K, V from linear output
        qkv = qkv.reshape(batch_size, seq_length, self.num_heads, 3*self.head_dim)
        qkv = qkv.permute(0, 2, 1, 3)  # [Batch, Head, SeqLen, Dims]
        q, k, v = qkv.chunk(3, dim=-1)

        # Determine value outputs
        values, attention = scaled_dot_product(q, k, v, mask=mask)
        values = values.permute(0, 2, 1, 3)  # [Batch, SeqLen, Head, Dims]
        values = values.reshape(batch_size, seq_length, self.embed_dim)
        o = self.o_proj(values)

        if return_attention:
            return o, attention
        else:
            return o
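
As a quick usage example, we can check the expected input and output shapes of the module (sizes chosen arbitrarily):

# Sketch: sanity-checking the shapes of MultiheadAttention
mha = MultiheadAttention(input_dim=128, embed_dim=128, num_heads=8)
x = torch.randn(4, 16, 128)                      # [Batch, SeqLen, input_dim]
out, attn = mha(x, return_attention=True)
print(out.shape)                                 # torch.Size([4, 16, 128])
print(attn.shape)                                # torch.Size([4, 8, 16, 16]) -> [Batch, Heads, SeqLen, SeqLen]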

One crucial characteristic of the multi-head attention is that it is permutation-equivariant with respect to its inputs. This means that if we switch two input elements in the sequence, e.g. X₁ ↔ X₂ (neglecting the batch dimension for now), the output is exactly the same besides the elements 1 and 2 switched. Hence, the multi-head attention is actually looking at the input not as a sequence, but as a set of elements. This property makes the multi-head attention block and the Transformer architecture so powerful and widely applicable! But what if the order of the input is actually important for solving the task, like language modeling? The answer is to encode the position in the input features, which we will take a closer look at later (topic Positional encodings below).
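
This property is easy to verify empirically; a small sketch (arbitrary sizes, no mask and no positional information):

# Sketch: permuting the input sequence permutes the output in the same way
mha = MultiheadAttention(input_dim=64, embed_dim=64, num_heads=4)
x = torch.randn(1, 5, 64)
perm = torch.tensor([1, 0, 2, 3, 4])                            # swap elements 0 and 1
out_then_perm = mha(x)[:, perm]
perm_then_out = mha(x[:, perm])
print(torch.allclose(out_then_perm, perm_then_out, atol=1e-6))  # True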

Transformer Encoder

Next, we will look at how to apply the multi-head attention block inside
the Transformer architecture.

The encoder consists of N identical blocks that are applied in sequence. Taking x as input, it is first passed through a Multi-Head Attention block as we have implemented above. The output is added to the original input using a residual connection, and we apply a consecutive Layer Normalization on the sum. Overall, it calculates LayerNorm(x+Multihead(x,x,x)) (x being Q, K and V input to the attention layer). The residual connection is crucial in the Transformer architecture for two reasons:

1. Similar to ResNets, Transformers are designed to be very deep. Some models contain more than 24 blocks in the encoder. Hence, the residual connections are crucial for enabling a smooth gradient flow through the model.
2. Without the residual connection, the information about the original sequence is lost. Remember that the Multi-Head Attention layer ignores the position of elements in a sequence, and can only learn it based on the input features. Removing the residual connections would mean that this information is lost after the first attention layer (after initialization), and with a randomly initialized query and key vector, the output vector for position i has no relation to its original input. All outputs of the attention are likely to represent similar/same information, and there is no chance for the model to distinguish which information came from which input element. An alternative option to the residual connection would be to fix at least one head to focus on its original input, but this is very inefficient and does not have the benefit of the improved gradient flow.

The Layer Normalization also plays an important role in the Transformer architecture as it enables faster training and provides small regularization. Additionally, it ensures that the features are of a similar magnitude among the elements in the sequence. We are not using Batch Normalization because it depends on the batch size, which is often small with Transformers (they require a lot of GPU memory), and BatchNorm has been shown to perform particularly badly on language, as the features of words tend to have a much higher variance (there are many very rare words which need to be considered for a good distribution estimate).

In addition to the Multi-Head Attention, a small fully connected feed-forward network is added to the model, which is applied to each position separately and identically. Specifically, the model uses a Linear→ReLU→Linear MLP. The full transformation including the residual connection can be expressed as:

$$
\begin{split}
\text{FFN}(x) & = \max(0, xW_1+b_1)W_2 + b_2\\
x & = \text{LayerNorm}(x + \text{FFN}(x))
\end{split}
$$
This MLP adds extra complexity to the model and allows transformations on each sequence element separately. You can imagine this as allowing the model to "post-process" the new information added by the previous Multi-Head Attention, and to prepare it for the next attention block. Usually, the inner dimensionality of the MLP is 2-8× larger than d_(model), i.e. the dimensionality of the original input x. The general advantage of a wider layer instead of a narrow, multi-layer MLP is the faster, parallelizable execution.

Finally, after looking at all parts of the encoder architecture, we can start implementing it below. We first start by implementing a single encoder block. In addition to the layers described above, we will add dropout layers in the MLP and on the output of the MLP and Multi-Head Attention for regularization.

class EncoderBlock(nn.Module):

    def __init__(self, input_dim, num_heads, dim_feedforward, dropout=0.0):
        """
        Inputs:
            input_dim - Dimensionality of the input
            num_heads - Number of heads to use in the attention block
            dim_feedforward - Dimensionality of the hidden layer in the MLP
            dropout - Dropout probability to use in the dropout layers
        """
        super().__init__()

        # Attention layer
        self.self_attn = MultiheadAttention(input_dim, input_dim, num_heads)

        # Two-layer MLP
        self.linear_net = nn.Sequential(
            nn.Linear(input_dim, dim_feedforward),
            nn.Dropout(dropout),
            nn.ReLU(inplace=True),
            nn.Linear(dim_feedforward, input_dim)
        )

        # Layers to apply in between the main layers
        self.norm1 = nn.LayerNorm(input_dim)
        self.norm2 = nn.LayerNorm(input_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Attention part
        attn_out = self.self_attn(x, mask=mask)
        x = x + self.dropout(attn_out)
        x = self.norm1(x)

        # MLP part
        linear_out = self.linear_net(x)
        x = x + self.dropout(linear_out)
        x = self.norm2(x)

        return x

Based on this block, we can implement a module for the full Transformer
encoder.
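
A minimal sketch of such an encoder module, which simply stacks num_layers encoder blocks (the class name and the get_attention_maps helper are assumptions, not shown in this excerpt):

# Sketch of a full encoder that stacks several EncoderBlock modules
class TransformerEncoder(nn.Module):

    def __init__(self, num_layers, **block_args):
        super().__init__()
        self.layers = nn.ModuleList([EncoderBlock(**block_args) for _ in range(num_layers)])

    def forward(self, x, mask=None):
        for layer in self.layers:
            x = layer(x, mask=mask)
        return x

    def get_attention_maps(self, x, mask=None):
        # Convenience helper: collect the attention matrix of every block
        attention_maps = []
        for layer in self.layers:
            _, attn_map = layer.self_attn(x, mask=mask, return_attention=True)
            attention_maps.append(attn_map)
            x = layer(x, mask=mask)
        return attention_maps
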
Positional encoding

We have discussed before that the Multi-Head Attention block is permutation-equivariant, and cannot distinguish whether an input comes before another one in the sequence or not. In tasks like language understanding, however, the position is important for interpreting the input words. The position information can therefore be added via the input features. We could learn an embedding for every possible position, but this would not generalize to dynamic input sequence lengths. Hence, the better option is to use feature patterns that the network can identify from the features and potentially generalize to larger sequences. The specific pattern chosen by Vaswani et al. is sine and cosine functions of different frequencies, as follows:

$$
PE_{(pos,i)} = \begin{cases}
\sin\left(\frac{pos}{10000^{i/d_{\text{model}}}}\right) & \text{if}\hspace{3mm} i \text{ mod } 2=0\\
\cos\left(\frac{pos}{10000^{(i-1)/d_{\text{model}}}}\right) & \text{otherwise}\\
\end{cases}
$$

PE_((pos,i)) represents the position encoding at position pos in the sequence, and hidden dimensionality i. These values, concatenated for all hidden dimensions, are added to the original input features (in the Transformer visualization above, see "Positional encoding"), and constitute the position information. We distinguish between even (i mod 2 = 0) and uneven (i mod 2 = 1) hidden dimensionalities where we apply a sine/cosine respectively. The intuition behind this encoding is that you can represent PE_((pos+k,:)) as a linear function of PE_((pos,:)), which might allow the model to easily attend to relative positions. The wavelengths in different dimensions range from 2π to 10000 ⋅ 2π.

The positional encoding is implemented below. The code is taken from the
PyTorch tutorial about Transformers on NLP and adjusted for our
purposes.

class PositionalEncoding(nn.Module):

    def __init__(self, d_model, max_len=5000):
        """
        Inputs
            d_model - Hidden dimensionality of the input.
            max_len - Maximum length of a sequence to expect.
        """
        super().__init__()

        # Create matrix of [SeqLen, HiddenDim] representing the positional encoding for max_len inputs
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)

        # register_buffer => Tensor which is not a parameter, but should be part of the module's state.
        # Used for tensors that need to be on the same device as the module.
        # persistent=False tells PyTorch to not add the buffer to the state dict (e.g. when we save the model)
        self.register_buffer('pe', pe, persistent=False)

    def forward(self, x):
        x = x + self.pe[:, :x.size(1)]
        return x

To understand the positional encoding, we can visualize it below. We will generate an image of the positional encoding over hidden dimensionality and position in a sequence. Each pixel, therefore, represents the change of the input feature we perform to encode the specific position. Let's do it below.

encod_block = PositionalEncoding(d_model=48, max_len=96)
pe = encod_block.pe.squeeze().T.cpu().numpy()

fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(8,3))
pos = ax.imshow(pe, cmap="RdGy", extent=(1, pe.shape[1]+1, pe.shape[0]+1, 1))
fig.colorbar(pos, ax=ax)
ax.set_xlabel("Position in sequence")
ax.set_ylabel("Hidden dimension")
ax.set_title("Positional encoding over hidden dimensions")
ax.set_xticks([1]+[i*10 for i in range(1, 1+pe.shape[1]//10)])
ax.set_yticks([1]+[i*10 for i in range(1, 1+pe.shape[0]//10)])
plt.show()

[Figure: heatmap of the positional encoding over position and hidden dimension]

You can clearly see the sine and cosine waves with different wavelengths
that encode the position in the hidden dimensions. Specifically, we can
look at the sine/cosine wave for each hidden dimension separately, to
get a better intuition of the pattern. Below we visualize the positional
encoding for the hidden dimensions 1, 2, 3 and 4.

sns.set_theme()
fig, ax = plt.subplots(2, 2, figsize=(12,4))
ax = [a for a_list in ax for a in a_list]
for i in range(len(ax)):
    ax[i].plot(np.arange(1,17), pe[i,:16], color=f'C{i}', marker="o", markersize=6, markeredgecolor="black")
    ax[i].set_title(f"Encoding in hidden dimension {i+1}")
    ax[i].set_xlabel("Position in sequence", fontsize=10)
    ax[i].set_ylabel("Positional encoding", fontsize=10)
    ax[i].set_xticks(np.arange(1,17))
    ax[i].tick_params(axis='both', which='major', labelsize=10)
    ax[i].tick_params(axis='both', which='minor', labelsize=8)
    ax[i].set_ylim(-1.2, 1.2)
fig.subplots_adjust(hspace=0.8)
sns.reset_orig()
plt.show()

[Figure: sine/cosine positional encoding curves for hidden dimensions 1-4]

As we can see, the patterns between the hidden dimension 1 and 2 only
differ in the starting angle. The wavelength is 2π, hence the repetition
after position 6. The hidden dimensions 2 and 3 have about twice the
wavelength.

PART 2: Vision Transformer

This notebook is Part 2 of the transformer lab.

In this tutorial, we will take a closer look at a recent trend: Transformers for Computer Vision. Since Alexey Dosovitskiy et al. successfully applied a Transformer to a variety of image recognition benchmarks, there have been an incredible number of follow-up works showing that CNNs might not be the optimal architecture for Computer Vision anymore. But how do Vision Transformers work exactly, and what benefits and drawbacks do they offer in contrast to CNNs? We will answer these questions by implementing a Vision Transformer ourselves and training it on the popular, small dataset CIFAR10. We will use PyTorch Lightning. Let's start with importing our standard set of libraries.

## Standard libraries
import os
import numpy as np
import random
import math
import json
from functools import partial
from PIL import Image

## Imports for plotting
import matplotlib.pyplot as plt
plt.set_cmap('cividis')
%matplotlib inline
from IPython.display import set_matplotlib_formats
set_matplotlib_formats('svg', 'pdf')  # For export
from matplotlib.colors import to_rgb
import matplotlib
matplotlib.rcParams['lines.linewidth'] = 2.0
import seaborn as sns
sns.reset_orig()

## tqdm for loading bars
from tqdm.notebook import tqdm

## PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.data as data
import torch.optim as optim

## Torchvision
import torchvision
from torchvision.datasets import CIFAR10
from torchvision import transforms

# PyTorch Lightning
try:
    import pytorch_lightning as pl
except ModuleNotFoundError:
    # Google Colab does not have PyTorch Lightning installed by default.
    # Hence, we do it here if necessary.
    !pip install --quiet "pytorch-lightning>=1.4"
    import pytorch_lightning as pl
from pytorch_lightning.callbacks import LearningRateMonitor, ModelCheckpoint

# Import tensorboard
%load_ext tensorboard

# Path to the folder where the datasets are/should be downloaded (e.g. CIFAR10)
DATASET_PATH = "../data"
# Path to the folder where the pretrained models are saved
CHECKPOINT_PATH = "../saved_models/tutorial15"

# Setting the seed
pl.seed_everything(42)

# Ensure that all operations are deterministic on GPU (if used) for reproducibility
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")
print("Device:", device)

<ipython-input-11-4eb4ebd6ea16>:15: DeprecationWarning: `set_matplotlib_formats` is deprecated since IPython 7.23, directly use `matplotlib_inline.backend_inline.set_matplotlib_formats()`
  set_matplotlib_formats('svg', 'pdf')  # For export
INFO:lightning_fabric.utilities.seed:Seed set to 42

Device: cuda:0

<Figure size 640x480 with 0 Axes>

We provide a pre-trained Vision Transformer which we download in the next cell. However, Vision Transformers can be trained relatively quickly on CIFAR10, with an overall training time of less than an hour on an NVIDIA TitanRTX. Feel free to experiment with training your own Transformer once you have gone through the whole notebook.

import urllib.request
from urllib.error import HTTPError

# Github URL where saved models are stored for this tutorial
base_url = "https://raw.githubusercontent.com/phlippe/saved_models/main/"
# Files to download
pretrained_files = ["tutorial15/ViT.ckpt",
                    "tutorial15/tensorboards/ViT/events.out.tfevents.ViT",
                    "tutorial5/tensorboards/ResNet/events.out.tfevents.resnet"]

# Create checkpoint path if it doesn't exist yet
os.makedirs(CHECKPOINT_PATH, exist_ok=True)

# For each file, check whether it already exists. If not, try downloading it.
for file_name in pretrained_files:
    file_path = os.path.join(CHECKPOINT_PATH, file_name.split("/", 1)[1])
    if "/" in file_name.split("/", 1)[1]:
        os.makedirs(file_path.rsplit("/", 1)[0], exist_ok=True)
    if not os.path.isfile(file_path):
        file_url = base_url + file_name
        print(f"Downloading {file_url}...")
        try:
            urllib.request.urlretrieve(file_url, file_path)
        except HTTPError as e:
            print("Something went wrong. Please try to download the file from the GDrive folder,"
                  " or contact the author with the full output including the following error:\n", e)

Downloading https://raw.githubusercontent.com/phlippe/saved_models/main/tutorial15/ViT.ckpt...
Downloading https://raw.githubusercontent.com/phlippe/saved_models/main/tutorial15/tensorboards/ViT/events.out.tfevents.ViT...
Downloading https://raw.githubusercontent.com/phlippe/saved_models/main/tutorial5/tensorboards/ResNet/events.out.tfevents.resnet...

We load the CIFAR10 dataset below. We use the same setup of the datasets
and data augmentations as for the CNNs. The constants in the
transforms.Normalize correspond to the values that scale and shift the
data to a zero mean and standard deviation of one.
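
As a sanity check, these constants can be recomputed from the raw training images; a brief sketch (it loads the full training set into memory, assuming DATASET_PATH is set as above):

# Sketch: recomputing the per-channel mean/std used in transforms.Normalize
raw_train = CIFAR10(root=DATASET_PATH, train=True, transform=transforms.ToTensor(), download=True)
imgs = torch.stack([img for img, _ in raw_train], dim=0)   # [50000, 3, 32, 32], values in [0, 1]
print("mean:", imgs.mean(dim=[0, 2, 3]))                   # approximately [0.4914, 0.4822, 0.4465]
print("std: ", imgs.std(dim=[0, 2, 3]))                    # approximately [0.2470, 0.2435, 0.2616]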

test_transform = transforms.Compose([transforms.ToTensor(),
                                     transforms.Normalize([0.49139968, 0.48215841, 0.44653091],
                                                          [0.24703223, 0.24348513, 0.26158784])
                                     ])
# For training, we add some augmentation. Networks are too powerful and would overfit.
train_transform = transforms.Compose([transforms.RandomHorizontalFlip(),
                                      transforms.RandomResizedCrop((32,32), scale=(0.8,1.0), ratio=(0.9,1.1)),
                                      transforms.ToTensor(),
                                      transforms.Normalize([0.49139968, 0.48215841, 0.44653091],
                                                           [0.24703223, 0.24348513, 0.26158784])
                                      ])
# Loading the training dataset. We need to split it into a training and validation part.
# We need to do a little trick because the validation set should not use the augmentation.
train_dataset = CIFAR10(root=DATASET_PATH, train=True, transform=train_transform, download=True)
val_dataset = CIFAR10(root=DATASET_PATH, train=True, transform=test_transform, download=True)
pl.seed_everything(42)
train_set, _ = torch.utils.data.random_split(train_dataset, [45000, 5000])
pl.seed_everything(42)
_, val_set = torch.utils.data.random_split(val_dataset, [45000, 5000])

# Loading the test set
test_set = CIFAR10(root=DATASET_PATH, train=False, transform=test_transform, download=True)

# We define a set of data loaders that we can use for various purposes later.
train_loader = data.DataLoader(train_set, batch_size=128, shuffle=True, drop_last=True, pin_memory=True, num_workers=4)
val_loader = data.DataLoader(val_set, batch_size=128, shuffle=False, drop_last=False, num_workers=4)
test_loader = data.DataLoader(test_set, batch_size=128, shuffle=False, drop_last=False, num_workers=4)

# Visualize some examples
NUM_IMAGES = 4
CIFAR_images = torch.stack([val_set[idx][0] for idx in range(NUM_IMAGES)], dim=0)
img_grid = torchvision.utils.make_grid(CIFAR_images, nrow=4, normalize=True, pad_value=0.9)
img_grid = img_grid.permute(1, 2, 0)

plt.figure(figsize=(8,8))
plt.title("Image examples of the CIFAR10 dataset")
plt.imshow(img_grid)
plt.axis('off')
plt.show()
plt.close()

100%|██████████| 170M/170M [00:08<00:00, 20.4MB/s]


INFO:lightning_fabric.utilities.seed:Seed set to 42
INFO:lightning_fabric.utilities.seed:Seed set to 42
/usr/local/lib/python3.11/dist-packages/torch/utils/data/dataloader.py:624:
UserWarning: This DataLoader will create 4 worker processes in total. Our suggested
max number of worker in current system is 2, which is smaller than what this
DataLoader is going to create. Please be aware that excessive worker creation might
get DataLoader running slow or even freeze, lower the worker number to avoid
potential slowness/freeze if necessary.
warnings.warn(

[Figure: example images from the CIFAR10 dataset]

Transformers for image classification

Transformers were originally proposed to process sets, since they form a permutation-equivariant architecture, i.e., they produce the same output, permuted, if the input is permuted. To apply Transformers to sequences, we have simply added a positional encoding to the input feature vectors, and the model learned by itself what to do with it. So, why not do the same thing on images? This is exactly what Alexey Dosovitskiy et al. proposed in their paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". Specifically, the Vision Transformer is a model for image classification that views images as sequences of smaller patches. As a preprocessing step, we split an image of, for example, 48 × 48 pixels into 9 patches of 16 × 16 pixels each. Each of those patches is considered to be a "word"/"token" and projected to a feature space. By adding positional encodings and a classification token on top, we can apply a Transformer as usual to this sequence and start training it for our task. A nice GIF visualization of the architecture is shown below (figure credit - Phil Wang):

We will walk step by step through the Vision Transformer, and implement all parts by ourselves. First, let's implement the image preprocessing: an image of size N × N has to be split into (N/M)² patches of size M × M. These represent the input words to the Transformer.

def img_to_patch(x, patch_size, flatten_channels=True):
    """
    Inputs:
        x - torch.Tensor representing the image of shape [B, C, H, W]
        patch_size - Number of pixels per dimension of the patches (integer)
        flatten_channels - If True, the patches will be returned in a flattened format
                           as a feature vector instead of an image grid.
    """
    B, C, H, W = x.shape
    x = x.reshape(B, C, H//patch_size, patch_size, W//patch_size, patch_size)
    x = x.permute(0, 2, 4, 1, 3, 5)  # [B, H', W', C, p_H, p_W]
    x = x.flatten(1, 2)              # [B, H'*W', C, p_H, p_W]
    if flatten_channels:
        x = x.flatten(2, 4)          # [B, H'*W', C*p_H*p_W]
    return x

Let's take a look at how that works for our CIFAR examples above. For
our images of size 32 × 32, we choose a patch size of 4. Hence, we
obtain sequences of 64 patches of size 4 × 4. We visualize them below:

img_patches = img_to_patch(CIFAR_images, patch_size=4, flatten_channels=False)

fig, ax = plt.subplots(CIFAR_images.shape[0], 1, figsize=(14,3))
fig.suptitle("Images as input sequences of patches")
for i in range(CIFAR_images.shape[0]):
    img_grid = torchvision.utils.make_grid(img_patches[i], nrow=64, normalize=True, pad_value=0.9)
    img_grid = img_grid.permute(1, 2, 0)
    ax[i].imshow(img_grid)
    ax[i].axis('off')
plt.show()
plt.close()

[Figure: CIFAR10 images shown as sequences of 4 × 4 patches]

Compared to the original images, it is much harder to recognize the objects from those patch lists now. Still, this is the input we provide to the Transformer for classifying the images. The model has to learn by itself how to combine the patches to recognize the objects. The inductive bias in CNNs that an image is a grid of pixels is lost in this input format.

After we have looked at the preprocessing, we can now start building the Transformer model. Since we have discussed the fundamentals of Multi-Head Attention, we will use the PyTorch module nn.MultiheadAttention here. Further, we use the Pre-Layer Normalization version of the Transformer blocks proposed by Ruibin Xiong et al. in 2020. The idea is to apply Layer Normalization not in between residual blocks, but instead as the first layer within the residual blocks. This reorganization of the layers supports better gradient flow and removes the necessity of a warm-up stage. A visualization of the difference between the standard Post-LN and the Pre-LN version is shown below. [Figure: pre_layer_norm.svg]

The implementation of the Pre-LN attention block looks as follows:

class AttentionBlock(nn.Module):

    def __init__(self, embed_dim, hidden_dim, num_heads, dropout=0.0):
        """
        Inputs:
            embed_dim - Dimensionality of input and attention feature vectors
            hidden_dim - Dimensionality of hidden layer in feed-forward network
                         (usually 2-4x larger than embed_dim)
            num_heads - Number of heads to use in the Multi-Head Attention block
            dropout - Amount of dropout to apply in the feed-forward network
        """
        super().__init__()

        self.layer_norm_1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, dropout=dropout)
        self.layer_norm_2 = nn.LayerNorm(embed_dim)
        self.linear = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, embed_dim),
            nn.Dropout(dropout)
        )

    def forward(self, x):
        inp_x = self.layer_norm_1(x)
        x = x + self.attn(inp_x, inp_x, inp_x)[0]
        x = x + self.linear(self.layer_norm_2(x))
        return x

Now we have all modules ready to build our own Vision Transformer. Besides the Transformer encoder, we need the following modules:

- A linear projection layer that maps the input patches to a feature vector of larger size. It is implemented by a simple linear layer that takes each patch independently as input.
- A classification token that is added to the input sequence. We will use the output feature vector of the classification token (CLS token in short) for determining the classification prediction.
- Learnable positional encodings that are added to the tokens before being processed by the Transformer. Those are needed to learn position-dependent information, and convert the set to a sequence. Since we usually work with a fixed resolution, we can learn the positional encodings instead of having the pattern of sine and cosine functions.
- An MLP head that takes the output feature vector of the CLS token, and maps it to a classification prediction. This is usually implemented by a small feed-forward network or even a single linear layer.

With those components in mind, let's implement the full Vision Transformer below:

class VisionTransformer(nn.Module):

    def __init__(self, embed_dim, hidden_dim, num_channels, num_heads,
                 num_layers, num_classes, patch_size, num_patches, dropout=0.0):
        """
        Inputs:
            embed_dim - Dimensionality of the input feature vectors to the Transformer
            hidden_dim - Dimensionality of the hidden layer in the feed-forward networks
                         within the Transformer
            num_channels - Number of channels of the input (3 for RGB)
            num_heads - Number of heads to use in the Multi-Head Attention block
            num_layers - Number of layers to use in the Transformer
            num_classes - Number of classes to predict
            patch_size - Number of pixels that the patches have per dimension
            num_patches - Maximum number of patches an image can have
            dropout - Amount of dropout to apply in the feed-forward network and
                      on the input encoding
        """
        super().__init__()

        self.patch_size = patch_size

        # Layers/Networks
        self.input_layer = nn.Linear(num_channels*(patch_size**2), embed_dim)
        self.transformer = nn.Sequential(*[AttentionBlock(embed_dim, hidden_dim, num_heads, dropout=dropout)
                                           for _ in range(num_layers)])
        self.mlp_head = nn.Sequential(
            nn.LayerNorm(embed_dim),
            nn.Linear(embed_dim, num_classes)
        )
        self.dropout = nn.Dropout(dropout)

        # Parameters/Embeddings
        self.cls_token = nn.Parameter(torch.randn(1, 1, embed_dim))
        self.pos_embedding = nn.Parameter(torch.randn(1, 1+num_patches, embed_dim))

    def forward(self, x):
        # Preprocess input
        x = img_to_patch(x, self.patch_size)
        B, T, _ = x.shape
        x = self.input_layer(x)

        # Add CLS token and positional encoding
        cls_token = self.cls_token.repeat(B, 1, 1)
        x = torch.cat([cls_token, x], dim=1)
        x = x + self.pos_embedding[:, :T+1]

        # Apply Transformer
        x = self.dropout(x)
        x = x.transpose(0, 1)
        x = self.transformer(x)

        # Perform classification prediction
        cls = x[0]
        out = self.mlp_head(cls)
        return out

Finally, we can put everything into a PyTorch Lightning Module as usual. We use torch.optim.AdamW as the optimizer, which is Adam with a corrected weight decay implementation. Since we use the Pre-LN Transformer version, we do not need to use a learning rate warmup stage anymore. Instead, we use the same learning rate scheduler as for the CNNs on image classification.

class ViT(pl.LightningModule):

    def __init__(self, model_kwargs, lr):
        super().__init__()
        self.save_hyperparameters()
        self.model = VisionTransformer(**model_kwargs)
        self.example_input_array = next(iter(train_loader))[0]

    def forward(self, x):
        return self.model(x)

    def configure_optimizers(self):
        optimizer = optim.AdamW(self.parameters(), lr=self.hparams.lr)
        lr_scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[100, 150], gamma=0.1)
        return [optimizer], [lr_scheduler]

    def _calculate_loss(self, batch, mode="train"):
        imgs, labels = batch
        preds = self.model(imgs)
        loss = F.cross_entropy(preds, labels)
        acc = (preds.argmax(dim=-1) == labels).float().mean()

        self.log(f'{mode}_loss', loss)
        self.log(f'{mode}_acc', acc)
        return loss

    def training_step(self, batch, batch_idx):
        loss = self._calculate_loss(batch, mode="train")
        return loss

    def validation_step(self, batch, batch_idx):
        self._calculate_loss(batch, mode="val")

    def test_step(self, batch, batch_idx):
        self._calculate_loss(batch, mode="test")

ViT Inference

Inference latency with different layers

import time

def test_inference_time(model, input_tensor, num_runs=100,
                        device='cuda' if torch.cuda.is_available() else 'cpu'):
    """
    Test the inference runtime of the VisionTransformer model.

    Parameters:
        model: VisionTransformer model instance
        input_tensor: Input tensor, shape (batch_size, num_channels, height, width)
        num_runs: Number of inference runs to compute average time
        device: Device to run on ('cuda' or 'cpu')

    Returns:
        avg_time: Average inference time per run (seconds)
    """
    # Move model and input to the specified device
    model = model.to(device)
    input_tensor = input_tensor.to(device)

    # Set model to evaluation mode
    model.eval()

    # Warm-up runs to eliminate initial run overhead
    with torch.no_grad():
        for _ in range(100):
            _ = model(input_tensor)

    # Record inference time
    total_time = 0.0
    with torch.no_grad():
        for _ in range(num_runs):
            start_time = time.time()
            _ = model(input_tensor)
            torch.cuda.synchronize() if device == 'cuda' else None  # Ensure GPU execution is complete
            end_time = time.time()
            total_time += (end_time - start_time)

    # Calculate average inference time
    avg_time = total_time / num_runs
    print(f"Average inference time over {num_runs} runs: {avg_time:.6f} seconds")

    return avg_time

# Model parameters
model_kwargs = {
    'embed_dim': 256,
    'hidden_dim': 512,
    'num_channels': 3,
    'num_heads': 8,
    'num_layers': 3,
    'num_classes': 10,
    'patch_size': 4,
    'num_patches': 64,
    'dropout': 0.2
}

# Initialize the model
model = VisionTransformer(**model_kwargs)

# Create example input tensor (batch_size=32, 3 channels, 32x32 images)
batch_size = 32
input_tensor = torch.randn(batch_size, 3, 32, 32)

# Test inference time
avg_time = test_inference_time(model, input_tensor, num_runs=1000)

Average inference time over 1000 runs: 0.003622 seconds

Inference latency vs. number of layers

def test_latency_vs_layers(model_kwargs, input_tensor, layer_range, num_runs=100,
                           device='cuda' if torch.cuda.is_available() else 'cpu'):
    """
    Test the inference latency of VisionTransformer with different numbers of
    layers and plot the results.

    Parameters:
        model_kwargs: Dictionary of initialization parameters for VisionTransformer
        input_tensor: Input tensor, shape (batch_size, num_channels, height, width)
        layer_range: Range of layer counts to test (list or range)
        num_runs: Number of inference runs per test
        device: Device to run on ('cuda' or 'cpu')

    Returns:
        latencies: List of average inference times for each layer count
    """
    latencies = []

    for num_layers in layer_range:
        print(f"\nTesting with {num_layers} layers...")
        model_kwargs['num_layers'] = num_layers
        model = VisionTransformer(**model_kwargs)
        avg_time = test_inference_time(model, input_tensor, num_runs, device)
        latencies.append(avg_time)

    # Plot the chart
    plt.figure(figsize=(10, 6))
    plt.plot(layer_range, latencies, marker='o', linestyle='-', color='b', label='Inference Latency')
    plt.title('Inference Latency vs. Number of Transformer Layers')
    plt.xlabel('Number of Layers')
    plt.ylabel('Average Inference Time (seconds)')
    plt.grid(True)
    plt.legend()
    plt.show()

    return latencies

# Test different layer numbers
layer_range = range(1, 13)
latencies = test_latency_vs_layers(model_kwargs, input_tensor, layer_range, num_runs=1000)

for layers, latency in zip(layer_range, latencies):
    print(f"Layers: {layers}, Latency: {latency:.6f} seconds")

Testing with 1 layers...
Average inference time over 1000 runs: 0.001378 seconds

Testing with 2 layers...
Average inference time over 1000 runs: 0.002617 seconds

Testing with 3 layers...
Average inference time over 1000 runs: 0.003600 seconds

Testing with 4 layers...
Average inference time over 1000 runs: 0.005463 seconds

Testing with 5 layers...
Average inference time over 1000 runs: 0.005944 seconds

Testing with 6 layers...
Average inference time over 1000 runs: 0.007114 seconds

Testing with 7 layers...
Average inference time over 1000 runs: 0.008288 seconds

Testing with 8 layers...
Average inference time over 1000 runs: 0.009561 seconds

Testing with 9 layers...
Average inference time over 1000 runs: 0.010852 seconds

Testing with 10 layers...
Average inference time over 1000 runs: 0.012179 seconds

Testing with 11 layers...
Average inference time over 1000 runs: 0.013592 seconds

Testing with 12 layers...
Average inference time over 1000 runs: 0.015046 seconds

[Figure: inference latency vs. number of Transformer layers]

Layers: 1, Latency: 0.001378 seconds
Layers: 2, Latency: 0.002617 seconds
Layers: 3, Latency: 0.003600 seconds
Layers: 4, Latency: 0.005463 seconds
Layers: 5, Latency: 0.005944 seconds
Layers: 6, Latency: 0.007114 seconds
Layers: 7, Latency: 0.008288 seconds
Layers: 8, Latency: 0.009561 seconds
Layers: 9, Latency: 0.010852 seconds
Layers: 10, Latency: 0.012179 seconds
Layers: 11, Latency: 0.013592 seconds
Layers: 12, Latency: 0.015046 seconds

Inference latency vs. number of attention heads

def test_latency_vs_heads(model_kwargs, input_tensor, head_range, num_runs=100,
                          device='cuda' if torch.cuda.is_available() else 'cpu'):
    """
    Test the inference latency of VisionTransformer with different numbers of
    attention heads and plot the results.

    Parameters:
        model_kwargs: Dictionary of initialization parameters for VisionTransformer
        input_tensor: Input tensor, shape (batch_size, num_channels, height, width)
        head_range: Range of attention head counts to test (list or range)
        num_runs: Number of inference runs per test
        device: Device to run on ('cuda' or 'cpu')

    Returns:
        latencies: List of average inference times for each head count
        valid_heads: List of valid head counts tested
    """
    embed_dim = model_kwargs['embed_dim']
    # Filter valid num_heads to ensure embed_dim is divisible by num_heads
    valid_heads = [h for h in head_range if embed_dim % h == 0]
    if not valid_heads:
        raise ValueError(f"No valid num_heads in {head_range} can divide embed_dim={embed_dim}")

    latencies = []

    for num_heads in valid_heads:
        print(f"\nTesting with {num_heads} heads...")
        model_kwargs['num_heads'] = num_heads
        model = VisionTransformer(**model_kwargs)
        avg_time = test_inference_time(model, input_tensor, num_runs, device)
        latencies.append(avg_time)

    # Plot the chart
    plt.figure(figsize=(10, 6))
    plt.plot(valid_heads, latencies, marker='o', linestyle='-', color='b', label='Inference Latency')
    plt.title('Inference Latency vs. Number of Attention Heads')
    plt.xlabel('Number of Attention Heads')
    plt.ylabel('Average Inference Time (seconds)')
    plt.grid(True)
    plt.legend()
    plt.show()

    return latencies, valid_heads

# Inference latency with different head counts
head_range = [2, 4, 8, 16, 32]
latencies, valid_heads = test_latency_vs_heads(model_kwargs, input_tensor, head_range, num_runs=1000)

for heads, latency in zip(valid_heads, latencies):
    print(f"Heads: {heads}, Latency: {latency:.6f} seconds")

Testing with 2 heads...
Average inference time over 1000 runs: 0.013578 seconds

Testing with 4 heads...
Average inference time over 1000 runs: 0.013773 seconds

Testing with 8 heads...
Average inference time over 1000 runs: 0.014814 seconds

Testing with 16 heads...
Average inference time over 1000 runs: 0.017040 seconds

Testing with 32 heads...
Average inference time over 1000 runs: 0.021518 seconds

[Figure: inference latency vs. number of attention heads]

Heads: 2, Latency: 0.013578 seconds
Heads: 4, Latency: 0.013773 seconds
Heads: 8, Latency: 0.014814 seconds
Heads: 16, Latency: 0.017040 seconds
Heads: 32, Latency: 0.021518 seconds

ViT Training

Commonly, Vision Transformers are applied to large-scale image classification benchmarks such as ImageNet to leverage their full potential. However, here we take a step back and ask: can Vision Transformers also succeed on classical, small benchmarks such as CIFAR10? To find this out, we train a Vision Transformer from scratch on the CIFAR10 dataset. Let's first create a training function for our PyTorch Lightning module which also loads the pre-trained model if you have downloaded it above.

def train_model(**kwargs):
    trainer = pl.Trainer(default_root_dir=os.path.join(CHECKPOINT_PATH, "ViT"),
                         accelerator="gpu" if str(device).startswith("cuda") else "cpu",
                         devices=1,
                         max_epochs=180,
                         callbacks=[ModelCheckpoint(save_weights_only=True, mode="max", monitor="val_acc"),
                                    LearningRateMonitor("epoch")])
    trainer.logger._log_graph = True          # If True, we plot the computation graph in tensorboard
    trainer.logger._default_hp_metric = None  # Optional logging argument that we don't need

    # Check whether pretrained model exists. If yes, load it and skip training
    pretrained_filename = os.path.join(CHECKPOINT_PATH, "ViT.ckpt")
    if os.path.isfile(pretrained_filename):
        print(f"Found pretrained model at {pretrained_filename}, loading...")
        # Automatically loads the model with the saved hyperparameters
        model = ViT.load_from_checkpoint(pretrained_filename)
    else:
        pl.seed_everything(42)  # To be reproducible
        model = ViT(**kwargs)
        trainer.fit(model, train_loader, val_loader)
        # Load best checkpoint after training
        model = ViT.load_from_checkpoint(trainer.checkpoint_callback.best_model_path)

    # Test best model on validation and test set
    val_result = trainer.test(model, val_loader, verbose=False)
    test_result = trainer.test(model, test_loader, verbose=False)
    result = {"test": test_result[0]["test_acc"], "val": val_result[0]["test_acc"]}

    return model, result

Now, we can already start training our model. As seen in our implementation, we have a couple of hyperparameters that we have to set. When creating this notebook, we performed a small grid search over hyperparameters and listed the best hyperparameters in the cell below. Nevertheless, it is worth discussing the influence that each hyperparameter has, and what intuition we have for choosing its value. First, let's consider the patch size. The smaller we make the patches, the longer the input sequences to the Transformer become. While in general this allows the Transformer to model more complex functions, it requires a longer computation time due to its quadratic memory usage in the attention layer. Furthermore, small patches can make the task more difficult since the Transformer has to learn which patches are close-by and which are far away. We experimented with patch sizes of 2, 4, and 8, which give us input sequence lengths of 256, 64, and 16 respectively. We found 4 to result in the best performance and hence pick it below.
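
The relation between patch size and sequence length is easy to compute explicitly; a small sketch for the 32 × 32 CIFAR10 images:

# Sketch: number of patches and features per patch as a function of patch size for 32x32 RGB images
image_size = 32
for patch_size in [2, 4, 8]:
    num_patches = (image_size // patch_size) ** 2
    features_per_patch = 3 * patch_size ** 2       # flattened RGB patch
    print(f"patch_size={patch_size}: {num_patches} patches, {features_per_patch} features per patch")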

Next, the embedding and hidden dimensionality have a similar impact on a Transformer as on an MLP. The larger the sizes, the more complex the model becomes, and the longer it takes to train. In Transformers, however, we have one more aspect to consider: the query-key sizes in the Multi-Head Attention layers. Each key has a feature dimensionality of embed_dim/num_heads. Considering that we have an input sequence length of 64, a minimum reasonable size for the key vectors is 16 or 32. Lower dimensionalities can restrain the possible attention maps too much. We observed that more than 8 heads are not necessary for the Transformer, and therefore pick an embedding dimensionality of 256. The hidden dimensionality in the feed-forward networks is usually 2-4× larger than the embedding dimensionality, and thus we pick 512.

Finally, the learning rate for Transformers is usually relatively small, and in papers, a common value to use is 3e-5. However, since we work with a smaller dataset and have a potentially easier task, we found that we are able to increase the learning rate to 3e-4 without any problems. To reduce overfitting, we use a dropout value of 0.2. Remember that we also use small image augmentations as regularization during training.

Feel free to explore the hyperparameters yourself by changing the values below. In general, the Vision Transformer did not prove to be too sensitive to the hyperparameter choices on the CIFAR10 dataset.

model, results = train_model(model_kwargs={
                                 'embed_dim': 256,
                                 'hidden_dim': 512,
                                 'num_heads': 8,
                                 'num_layers': 6,
                                 'patch_size': 4,
                                 'num_channels': 3,
                                 'num_patches': 64,
                                 'num_classes': 10,
                                 'dropout': 0.2
                             },
                             lr=3e-4)
print("ViT results", results)

INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs

Found pretrained model at ../saved_models/tutorial15/ViT.ckpt, loading...

INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.6.4 to v2.5.1. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint ../saved_models/tutorial15/ViT.ckpt`
/usr/local/lib/python3.11/dist-packages/torch/utils/data/dataloader.py:624: UserWarning: This DataLoader will create 4 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

{"model_id":"bbea62e005eb49bdab2884d048ebd2bb","version_major":2,"version_minor":0}

INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES:
[0]

{"model_id":"958bcbc0e61c4e728dff9280fcd1a70f","version_major":2,"version_minor":0}

ViT results {'test': 0.7713000178337097, 'val': 0.7781999707221985}

Change the parameters and retrain the ViT. Below is an example that varies the patch size / number of patches.

### Takeaway: Transformer training takes a lot of time!

# Always train (ignore the pretrained checkpoint)
def train_model(**kwargs):
    trainer = pl.Trainer(default_root_dir=os.path.join(CHECKPOINT_PATH, "ViT"),
                         accelerator="gpu" if str(device).startswith("cuda") else "cpu",
                         devices=1,
                         max_epochs=180,
                         callbacks=[ModelCheckpoint(save_weights_only=True, mode="max", monitor="val_acc"),
                                    LearningRateMonitor("epoch")])
    trainer.logger._log_graph = True          # If True, we plot the computation graph in tensorboard
    trainer.logger._default_hp_metric = None  # Optional logging argument that we don't need

    # Check whether pretrained model exists. If yes, load it and skip training
    pretrained_filename = os.path.join(CHECKPOINT_PATH, "ViT.ckpt")
    # if os.path.isfile(pretrained_filename):
    #     print(f"Found pretrained model at {pretrained_filename}, loading...")
    #     model = ViT.load_from_checkpoint(pretrained_filename)  # Automatically loads the model with the saved hyperparameters
    # else:
    pl.seed_everything(42)  # To be reproducible
    model = ViT(**kwargs)
    trainer.fit(model, train_loader, val_loader)
    model = ViT.load_from_checkpoint(trainer.checkpoint_callback.best_model_path)  # Load best checkpoint after training

    # Test best model on validation and test set
    val_result = trainer.test(model, val_loader, verbose=False)
    test_result = trainer.test(model, test_loader, verbose=False)
    result = {"test": test_result[0]["test_acc"], "val": val_result[0]["test_acc"]}

    return model, result

embed_dim_list = [128, 256]
hidden_dim = [256, 512]
num_heads_list = [4, 8]
num_layers = [4, 8]
patch_num = [[2, 256], [8, 16]]

for i in range(len(patch_num)):
    model, results = train_model(model_kwargs={
                                     'embed_dim': 256,
                                     'hidden_dim': 512,
                                     'num_heads': 8,
                                     'num_layers': 6,
                                     'patch_size': patch_num[i][0],
                                     'num_channels': 3,
                                     'num_patches': patch_num[i][1],
                                     'num_classes': 10,
                                     'dropout': 0.2
                                 },
                                 lr=3e-4)
    print(f"patch size: {patch_num[i][0]}, patch number: {patch_num[i][1]}, ViT results", results)

INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:lightning_fabric.utilities.seed:Seed set to 42
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name  | Type              | Params | Mode  | In sizes         | Out sizes
------------------------------------------------------------------------------
0 | model | VisionTransformer | 3.2 M  | train | [128, 3, 32, 32] | [128, 10]
------------------------------------------------------------------------------
3.2 M     Trainable params
0         Non-trainable params
3.2 M     Total params
12.940    Total estimated model params size (MB)
73        Modules in train mode
0         Modules in eval mode

{"model_id":"a25b7b0eec2641d6ac6166223071454d","version_major":2,"version_minor":0}

{"model_id":"445c6475e74840d88038f795fded2f71","version_major":2,"version_minor":0}

{"model_id":"a1325543bb104a00b3d05f686850edeb","version_major":2,"version_minor":0}
{"model_id":"57caedea2676489a878d535e97fde89f","version_major":2,"version_minor":0}

{"model_id":"04d680079a294fa3a9bf7ee19a259fbd","version_major":2,"version_minor":0}

{"model_id":"651db07634164257abfc158f585559ef","version_major":2,"version_minor":0}

{"model_id":"86c6d9e87d2e4a6d9fe8859e67b96790","version_major":2,"version_minor":0}

{"model_id":"e36d62529b1247ebbc1b6d566ff1028c","version_major":2,"version_minor":0}

{"model_id":"426d15a05c6e4cf693e4977f852ba7ee","version_major":2,"version_minor":0}

{"model_id":"170599fa1cba4f3eb35a4f5dbfd0c1dd","version_major":2,"version_minor":0}

{"model_id":"62edb6c9927c45ec8800ca74d4fd22df","version_major":2,"version_minor":0}

{"model_id":"7efcc7f4fbb84e56bb4115ec1b141974","version_major":2,"version_minor":0}

{"model_id":"d04f33f07f794cf4815add6e0c4e0237","version_major":2,"version_minor":0}

{"model_id":"01187947a6db4acbb3b27a25ff98541a","version_major":2,"version_minor":0}

{"model_id":"bdfef0bbd020467184fa33b8b5eb3e9d","version_major":2,"version_minor":0}

The Vision Transformer achieves a validation and test performance of about 75%. In comparison, almost all CNN architectures that we have tested obtained a classification performance of around 90%. This is a considerable gap and shows that although Vision Transformers perform strongly on ImageNet with potential pretraining, they cannot come close to simple CNNs on CIFAR10 when being trained from scratch. The differences between a CNN and a Transformer can be well observed in the training curves.

All these observed phenomena can be explained with a concept that we have visited before: inductive biases. Convolutional Neural Networks have been designed with the assumption that images are translation invariant. Hence, we apply convolutions with shared filters across the image. Furthermore, a CNN architecture integrates the concept of distance in an image: two pixels that are close to each other are more related than two distant pixels. Local patterns are combined into larger patterns until we perform our classification prediction. All those aspects are inductive biases of a CNN. In contrast, a Vision Transformer does not know which two pixels are close to each other and which are far apart. It has to learn this information solely from the sparse learning signal of the classification task. This is a huge disadvantage when we have a small dataset, since such information is crucial for generalizing to an unseen test dataset. With large enough datasets and/or good pre-training, a Transformer can learn this information without the need for inductive biases, and instead is more flexible than a CNN. Especially long-distance relations between local patterns can be difficult to process in CNNs, while in Transformers, all patches have a distance of one. This is why Vision Transformers are so strong on large-scale datasets such as ImageNet but underperform a lot when being applied to a small dataset such as CIFAR10.
