
COMPUTER VISION

2024 - 2025
VISION TRANSFORMERS
UTRECHT UNIVERSITY
RONALD POPPE
CNN RECAP
CNNs take visual inputs and produce an output
• Label, bounding box, segmentation map, etc.
• For recognition tasks: feature extractor part and classification part
RECEPTIVE FIELD RECAP
CNNs hierarchically process local information
• Convolution and pooling kernels operate on small neighborhoods
• Receptive field increases with layers
• Spatially remote areas only processed jointly when in the same receptive field
RECEPTIVE FIELD RECAP2
Example: receptive field importance

Even then, relative positioning is encoded in the convolutional kernels
• Invariances have to be learned

In CNNs, spatially distant information is typically handled in the FC layers
INDUCTIVE BIAS
Inductive biases govern how models generalize to unseen data
• Inherent properties of the model affect what it focuses on

Helps in obtaining a good solution in the direction of the bias
• Prevents overfitting

Might not provide the optimal solution


INDUCTIVE BIAS IN CNNS
CNNs process locally through kernel operations
• Consequently, focus on local patterns
• Texture typically dominant over shape
• Learning local invariances is easier than learning spatial invariances
• Difficulty in capturing long-range patterns
TRANSFORMERS
TRANSFORMERS
Transformers are a type of sequence-to-sequence models
• Contain encoder and decoder

Output of the encoder is a vectorized representation of the input

Output of the decoder is the translated version of the input
TRANSFORMERS2
Many different sequence-to-sequence tasks:
• Language translation
• Image-to-text
• Conversational models
• Etc.

Classification is also possible
• Requires only a trivial adaptation
TRANSFORMERS3
First sequence-to-sequence models used recurrent neural networks (RNNs)
• The hidden state (encoder vector) summarizes the full input
TRANSFORMERS4
Transformers instead use self-attention
• Encoder and decoder have L transformer blocks

We focus on image classification
• Decoder not used, only a label as output
• Decoder returns in the next lecture

Vaswani et al., “Attention Is All You Need”, NeurIPS 2017


VISION TRANSFORMER
VISION TRANSFORMER
Vision transformers (ViTs) are adapted from NLP to visual inputs
• Initially used for classification instead of translation
• Nowadays adapted to a variety of tasks
• Detection, segmentation, etc.

Dosovitskiy et al., “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”, ICLR 2021
VISION TRANSFORMER2
Core property of transformer models is self-attention
• Ability to weight information through correlations within a layer
• Less directed so lower inductive bias
• More flexible
VISION TRANSFORMER3
VISION TRANSFORMER4
Familiar structure
• Feature encoder
• Classifier
VISION TRANSFORMER5
A forward pass in a vision transformer:
1. Split an image into patches
2. Flatten the patches
3. Produce lower-dimensional embeddings
4. Add a class token
5. Add positional encodings
6. Feed the sequence to transformer encoder
7. Employ MLP to make class predictions
SPLIT IMAGE INTO PATCHES
Step 1: divide image into square patches
• Non-overlapping: stride = patch size

Global spatial information is lost


SPLIT IMAGE INTO PATCHES2
Original image: W x H x C
• N patches
• Each patch P x P x C pixels
FLATTEN IMAGE PATCH
Step 2: flatten image patches
• We further remove local spatial information by flattening each patch
• Output size N x (P²C): N rows, one per patch, each with P²C values (see the sketch below)
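
A minimal PyTorch sketch of steps 1 and 2; the sizes (224x224 RGB image, 16x16 patches) are illustrative assumptions, not taken from the slides:

```python
import torch

# Assumed sizes: 224 x 224 RGB image, 16 x 16 patches (so N = 196, P^2*C = 768)
W = H = 224; C = 3; P = 16
img = torch.randn(1, C, H, W)                   # (B, C, H, W)

# Step 1: split into non-overlapping P x P patches (stride = patch size)
patches = img.unfold(2, P, P).unfold(3, P, P)   # (B, C, H/P, W/P, P, P)

# Step 2: flatten each patch into a row of P^2*C values
patches = patches.permute(0, 2, 3, 1, 4, 5)     # (B, H/P, W/P, C, P, P)
patches = patches.reshape(1, -1, C * P * P)     # (B, N, P^2*C)
```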
EMBED PATCHES
Step 3: Produce lower-dimensional embeddings from the flattened patches
• We linearly project each patch to a fixed-dimensional vector with length D
• We use a trainable weight matrix of size P²C x D
• Output is matrix of size N x D

Flattened patches (N x P²C) · weight matrix (P²C x D) = patch embeddings (N x D)
EMBED PATCHES2
Patching and embedding patches implemented as convolution layer
• Stride and kernel size are P: extract patches
• Each kernel has size P x P x C
• Number of kernels is D: achieves the linear projection

Each output is a weighted linear sum of all inputs


• Output size is (W/P) x (H/P) x D, which is then reshaped to N x D
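
A sketch of the same patching and embedding implemented as a single convolution layer, assuming the illustrative sizes above plus an embedding dimension D = 768:

```python
import torch
import torch.nn as nn

# Assumed sizes: 224 x 224 RGB image, P = 16, embedding dimension D = 768
C, P, D = 3, 16, 768
patch_embed = nn.Conv2d(C, D, kernel_size=P, stride=P)   # D kernels of size P x P x C

img = torch.randn(1, C, 224, 224)
x = patch_embed(img)               # (1, D, W/P, H/P): each output is a linear projection of one patch
x = x.flatten(2).transpose(1, 2)   # reshape to (1, N, D) with N = 196
```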
CLASS TOKEN
Step 4: Add class token
• Class or [CLS] token is appended to the sequence of patch tokens
• Results in (N + 1) x D input

CLASS TOKEN2
Class token accumulates the information in the transformer encoder
• Final class token embedding is used by classifier

Alternative is to use final embeddings of all patch tokens


• Would result in a differently sized embedding when the number of patches N changes

Initial value of the class token is learned during training
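
A sketch of step 4, continuing from the embedding sketch above; the zero initialization is an assumption (the actual initial value is learned during training):

```python
import torch
import torch.nn as nn

D = 768
x = torch.randn(1, 196, D)                      # patch embeddings (B, N, D)

cls_token = nn.Parameter(torch.zeros(1, 1, D))  # initial value is learned during training
cls = cls_token.expand(x.shape[0], -1, -1)      # one copy per image in the batch
x = torch.cat([cls, x], dim=1)                  # (B, N + 1, D)
```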


POSITIONAL ENCODINGS
Step 5: Add positional encodings
• During patching, position information was lost
• We infuse position information per patch using positional encodings

We could number patches and use indices as positional encodings


• Idea abandoned in NLP due to varying sequence lengths
• Rare indices would hamper training
POSITIONAL ENCODINGS2
Instead, we can learn the encodings
• Size of positional encoding vector is the same as patch embedding: D
• Allows flexibility in learning how regions should relate spatially
• Further reduces inductive bias

In practice: rows/columns implicitly encoded


• Larger similarity between close-by patches
POSITIONAL ENCODINGS3
In the original transformer formulation (Vaswani et al.), sine and cosine encodings were used
• Again, size of positional encoding vector is same as patch embedding: D

For a patch at position pos, and embedding dimension i of D


• PE(pos, 2i) = sin(pos / 10000^(2i/D)), PE(pos, 2i+1) = cos(pos / 10000^(2i/D))
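
A sketch of the sinusoidal encoding above, assuming N = 196 patches and D = 768:

```python
import torch

N, D = 196, 768
pos = torch.arange(N, dtype=torch.float).unsqueeze(1)   # (N, 1) patch positions
two_i = torch.arange(0, D, 2, dtype=torch.float)        # even dimensions 2i = 0, 2, 4, ...

pe = torch.zeros(N, D)
pe[:, 0::2] = torch.sin(pos / 10000 ** (two_i / D))     # PE(pos, 2i)
pe[:, 1::2] = torch.cos(pos / 10000 ** (two_i / D))     # PE(pos, 2i+1)

# Step 5: element-wise sum with the patch embeddings (after extending pe with a class-token row)
```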
POSITIONAL ENCODINGS4
Sinusoidal positional encoding has favorable properties:
• Each patch has a unique encoding
• All values are naturally in the -1…1 range
• Distance between neighboring patches is the same
• Encodings can be interpolated when the number of patches changes
POSITIONAL ENCODINGS5
Eventually, the positional encodings are element-wise summed with the patch embeddings
• Output size remains (N + 1) x D

Summing instead of concatenating is unintuitive but efficient
• The transformer can learn to disentangle signal and position
TRANSFORMER ENCODER
At this point, we have the image embedded as a sequence of patch tokens
• Added a class token
• Added positional embeddings
TRANSFORMER ENCODER2
Step 6: Feed the sequence to transformer encoder

Transformer encoder consists of L encoder blocks that contain:


• Self-attention module
• MLP module

Output of each encoder block is input to next


• Size of input and output does not change
• Each transformer block is identical in terms of architecture
TRANSFORMER ENCODER3
Self-attention module combines normalization and self-attention
• Skip connection used to add the original value

Self-attention allows for sharing information between inputs


• Scaled dot-product attention used

Layer normalization deals with sequential data
• Independent of the number of inputs
TRANSFORMER ENCODER4
Scaled dot-product attention focuses on important information

Takes input and calculates:


• Query (Q): representation of the current token/patch
• Key (K): representation of all tokens/patches
• Value (V): value associated with each token
TRANSFORMER ENCODER5
Q, K and V are calculated using weight matrices WQ, WK and WV
• All three are trained
• Q = XWQ ((N+1) x DQ)
• K = XWK ((N+1) x DQ)
• V = XWV ((N+1) x D)

We calculate the dot product of Q and K: QK^T


• Results in attention matrix E (size (N+1) x (N+1))

Scaled by the square root of the embedding dimensionality


TRANSFORMER ENCODER6
Softmax is used to make attention relative over all tokens

Final multiplication with V


• Attention weighted according to the importance of each token

Summary: A = softmax(QK^T / √D) V

• Output dimension is (N+1) x D, same as input dimension
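
A minimal single-head sketch of the attention computation above, with random matrices standing in for the trained weights (sizes are illustrative):

```python
import torch

T, D = 197, 768                           # T = N + 1 tokens, embedding dimension D
X = torch.randn(T, D)                     # input token embeddings
W_Q, W_K, W_V = torch.randn(D, D), torch.randn(D, D), torch.randn(D, D)  # trained in practice

Q, K, V = X @ W_Q, X @ W_K, X @ W_V       # queries, keys, values
E = Q @ K.T / D ** 0.5                    # scaled attention matrix, (N+1) x (N+1)
A = torch.softmax(E, dim=-1) @ V          # weighted values, (N+1) x D: same size as the input
```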


BREAK!
TRANSFORMER ENCODER7
Calculate keys and values for three tokens
TRANSFORMER ENCODER8
Calculate queries for three tokens
TRANSFORMER ENCODER9
Calculate attention between first query and all keys
TRANSFORMER ENCODER10
For first query, multiply softmax attention scores with corresponding values
TRANSFORMER ENCODER11
For first query, sum outputs for all keys
TRANSFORMER ENCODER12
For queries 2 and 3, repeat process
TRANSFORMER ENCODER13
Weight matrices WQ, WK and WV provide one perspective on information
• Usually multiple heads used in parallel: multi-head self-attention (MSA)
• Attend to different patterns
TRANSFORMER ENCODER14
Outputs of h different attention heads are concatenated
• Then projected back to the original dimensions (N + 1) x D using WO

AMSA ((N+1) x hD) · WO (hD x D) = A ((N+1) x D)
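
A sketch of the multi-head output projection, assuming h = 12 heads that each keep the full dimension D as on the slide (many implementations use D/h per head instead):

```python
import torch

T, D, h = 197, 768, 12
head_outputs = [torch.randn(T, D) for _ in range(h)]   # attention output of each head
A_msa = torch.cat(head_outputs, dim=-1)                # (N+1) x hD
W_O = torch.randn(h * D, D)                            # trained projection matrix
A = A_msa @ W_O                                        # back to (N+1) x D
```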
TRANSFORMER ENCODER15
Second module in the transformer block contains the MLP
• Also contains skip connection
• First, layer normalization

MLP consists of two FC layers with dropout


• GELU activation in-between
• Output dimension same as input dimension
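
A sketch of the MLP module; the hidden size of 4·D and the dropout rate of 0.1 are common choices assumed here, not given in the slides:

```python
import torch
import torch.nn as nn

D = 768
mlp_block = nn.Sequential(
    nn.LayerNorm(D),        # layer normalization first
    nn.Linear(D, 4 * D),    # first FC layer
    nn.GELU(),              # GELU activation in-between
    nn.Dropout(0.1),
    nn.Linear(4 * D, D),    # second FC layer, output dimension same as input
    nn.Dropout(0.1),
)

x = torch.randn(1, 197, D)
x = x + mlp_block(x)        # skip connection adds the original value
```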
TRANSFORMER ENCODER16
GELU resembles ReLU but:
• Is differentiable everywhere
• Has a small negative value for negative inputs
MLP HEAD
After L transformer blocks, we have an output for each token/patch
• Including class token

We only consider the class token


• Has size 1 x D
• Considered the feature vector
MLP HEAD2
The MLP head consists of a limited number of FC layers
• Last with softmax activation
• Outputs are class probabilities
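
A sketch of the MLP head, assuming a single FC layer and K = 1000 classes (in practice the softmax is often folded into the loss):

```python
import torch
import torch.nn as nn

D, K = 768, 1000
head = nn.Sequential(nn.Linear(D, K), nn.Softmax(dim=-1))

tokens = torch.randn(1, 197, D)   # output of the last transformer block
probs = head(tokens[:, 0])        # class token only (1 x D) -> class probabilities (1 x K)
```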
TRAINING
TRAINING
When training a transformer, the following parameters need to be estimated:

For the image embedding:
• Patch embedding matrix
• Positional encoding (when using learned encodings)
• Initial class embedding

Per transformer block:


• Weight matrices WK, WQ, WV
• FC layers in MLP
• WO (for multi-head self-attention)
TRAINING2
Training can be supervised end-to-end
• Recall that transformers have low inductive bias
• Large amounts of training data are needed

Training similar to CNNs


• Calculate loss between actual and predicted label
• Backpropagate to obtain partial derivatives per weight
• Adjust weights using optimizer
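
A minimal supervised training sketch; `model`, `loader`, and the AdamW settings are assumptions, not prescribed by the slides:

```python
import torch
import torch.nn as nn

# Assumed: `model` is a vision transformer, `loader` yields (images, labels) batches
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

for images, labels in loader:
    logits = model(images)             # forward pass
    loss = criterion(logits, labels)   # loss between actual and predicted label
    optimizer.zero_grad()
    loss.backward()                    # backpropagate to obtain partial derivatives per weight
    optimizer.step()                   # adjust weights using the optimizer
```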
TRAINING3
Several pre-trained models available
• Original ViT trained on JFT-300M
• Internal Google dataset with 300M labeled images
• Check PyTorch: https://pytorch.org/vision/0.20/models/vision_transformer.html

Typical approach is to fine-tune these models on novel datasets


• Usually, none of the layers/matrices are frozen
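
A sketch of loading one of the pre-trained torchvision models from the link above and preparing it for fine-tuning, assuming 10 target classes:

```python
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)        # ImageNet pre-trained ViT-B/16
model.heads.head = nn.Linear(model.heads.head.in_features, 10)  # replace the classifier head
# None of the other layers/matrices are frozen: all weights are updated during fine-tuning
```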

Alternative is to pre-train using self-supervision


• Topic of the next lecture
SWIN TRANSFORMER
SWIN TRANSFORMER
Computation cost of the global attention in the vision transformer is high
• Scales quadratically in N

Swin (shifted windows) transformer addresses this issue


• Using hierarchical processing
• Using local attention within windows

Increased inductive bias due to stronger focus on local processing


• Long-range feature interactions through cross-window connections (shifted windows)

Liu et al., “Swin Transformer: Hierarchical Vision Transformer using Shifted Windows”, ICCV 2021
SWIN TRANSFORMER2
A Swin transformer block contains two sub-blocks
• Window multi-head self-attention (W-MSA)
• Shifted window MSA (SW-MSA)
SWIN TRANSFORMER3
Window MSA considers only patches within a smaller window
• Fewer inputs for the self-attention

Shifted window MSA allows information to flow between neighboring windows


SWIN TRANSFORMER4
Swin transformer gradually covers larger areas
• Start from 4 x 4 pixels per patch
• After each stage, 2 x 2 neighboring patches are merged and linearly projected (see the sketch below)
• Simultaneously, the embedding channel size increases
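
A sketch of the patch merging step between stages, assuming a (B, H, W, D) grid of patch embeddings; the 4D → 2D linear projection follows the Swin paper:

```python
import torch
import torch.nn as nn

def patch_merging(x: torch.Tensor, proj: nn.Linear) -> torch.Tensor:
    """Merge each 2 x 2 group of neighboring patches and project the channels."""
    # x: (B, H, W, D); gather the four patches of every 2 x 2 group
    merged = torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                        x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)   # (B, H/2, W/2, 4D)
    return proj(merged)                                                # (B, H/2, W/2, 2D)

D = 96
proj = nn.Linear(4 * D, 2 * D)          # spatial size halves, channel size doubles
x = torch.randn(1, 56, 56, D)           # e.g. 224/4 = 56 patches per side after 4 x 4 patching
print(patch_merging(x, proj).shape)     # torch.Size([1, 28, 28, 192])
```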
COMPARISON
On ImageNet, ViT and Swin outperform CNN-based classifiers

Swin has fewer FLOPs than ViT

Swin scales better with image size
QUESTIONS?
FINALLY…
ASSIGNMENT
Assignment 3
• Deadline Sunday March 16, 23:00
• Assignment support session at 15:15 in BBG 209/219
• Use the assignments channel!
NEXT LECTURE
Next lecture:
• Thursday March 13, 11:00-12:45, Ruppert-040
• Vision language models
• Assignment 4 walk-through
MATERIALS
Background materials:
• Vaswani et al., “Attention Is All You Need”, NeurIPS 2017
• Dosovitskiy et al., “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”, ICLR 2021
• Liu et al., “Swin Transformer: Hierarchical Vision Transformer using Shifted Windows”, ICCV 2021
• PyTorch vision transformers
