Computer Vision 11 Transformers
2024 - 2025
VISION TRANSFORMERS
UTRECHT UNIVERSITY
RONALD POPPE
CNN RECAP
CNNs take visual inputs and produce an output
• Label, bounding box, segmentation map, etc.
• For recognition tasks: feature extractor part and classification part
RECEPTIVE FIELD RECAP
CNNs hierarchically process local information
• Convolution and pooling kernels operate on small neighborhoods
• Receptive field increases with layers
• Spatially remote areas only processed jointly when in the same receptive field
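As a reminder of how the receptive field grows, the standard recursion (not shown on the slides) is r_l = r_(l−1) + (k_l − 1) · s_1 · … · s_(l−1) with r_0 = 1; for example, three stacked 3×3 convolutions with stride 1 give receptive fields of 3, 5 and 7 pixels.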
RECEPTIVE FIELD RECAP
Example: receptive field importance
Dosovitskiy et al., “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”, ICLR 2021
VISION TRANSFORMER
Core property of transformer models is self-attention
• Ability to weight information through correlations within a layer
• Less hard-wired than convolution, so a lower inductive bias
• More flexible
VISION TRANSFORMER
Familiar structure
• Feature encoder
• Classifier
VISION TRANSFORMER
A forward pass in a vision transformer (see the code sketch after this list):
1. Split an image into patches
2. Flatten the patches
3. Produce lower-dimensional embeddings
4. Add a class token
5. Add positional encodings
6. Feed the sequence to transformer encoder
7. Employ MLP to make class predictions
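A minimal PyTorch sketch of these seven steps (the class name MiniViT and hyperparameters such as embed_dim=192, depth=4 and num_heads=3 are illustrative assumptions, and the built-in nn.TransformerEncoder stands in for the encoder described later):

import torch
import torch.nn as nn

class MiniViT(nn.Module):
    # Illustrative sketch of the ViT forward pass; hyperparameters are assumptions
    def __init__(self, img_size=224, patch_size=16, in_chans=3,
                 embed_dim=192, depth=4, num_heads=3, num_classes=10):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Steps 1-3: split into patches, flatten, linearly embed (as a convolution)
        self.patch_embed = nn.Conv2d(in_chans, embed_dim,
                                     kernel_size=patch_size, stride=patch_size)
        # Step 4: learnable class token
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # Step 5: learned positional encodings for N patches + class token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        # Step 6: transformer encoder
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           dim_feedforward=4 * embed_dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # Step 7: MLP head on the final class token embedding
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):                      # x: (B, C, H, W)
        x = self.patch_embed(x)                # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)       # (B, N, D)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)         # (B, N+1, D)
        x = x + self.pos_embed                 # add positional encodings
        x = self.encoder(x)                    # self-attention blocks
        return self.head(x[:, 0])              # classify from the class token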
SPLIT IMAGE INTO PATCHES
Step 1: divide image into square patches
• Non-overlapping: stride = patch size
EMBED PATCHES
Flatten each patch and project it linearly to D dimensions: (N × P²C) · (P²C × D) = N × D
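As a concrete example (standard ViT-Base numbers, not taken from the slides): a 224 × 224 × 3 image with patch size P = 16 yields N = (224/16)² = 196 patches, each flattened to a P²C = 16 · 16 · 3 = 768-dimensional vector before the projection to D dimensions.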
Patching and embedding are implemented together as a single convolution layer (see the sketch after this list)
• Stride and kernel size are P: extract patches
• Each kernel has size P x P x C
• Number of kernels is D: achieves the linear projection
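A minimal sketch of this convolution-as-patch-embedding, assuming illustrative values P = 16, C = 3 and D = 768:

import torch
import torch.nn as nn

P, C, D = 16, 3, 768                         # patch size, input channels, embedding dim (assumed)
patch_embed = nn.Conv2d(C, D, kernel_size=P, stride=P)   # D kernels of size P x P x C

x = torch.randn(1, C, 224, 224)              # one 224 x 224 RGB image
tokens = patch_embed(x)                      # (1, D, 14, 14): one D-dim embedding per patch
tokens = tokens.flatten(2).transpose(1, 2)   # (1, N, D) with N = 14 * 14 = 196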
CLASS TOKEN
A learned class token is prepended, so the N × D patch embeddings become an (N+1) × D sequence
Class token accumulates the information in the transformer encoder
• Final class token embedding is used by classifier
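A sketch of how the class token can be prepended (shapes continue the example above; variable names are illustrative):

import torch
import torch.nn as nn

B, N, D = 1, 196, 768
tokens = torch.randn(B, N, D)                    # patch embeddings
cls_token = nn.Parameter(torch.zeros(1, 1, D))   # learned class token
x = torch.cat([cls_token.expand(B, -1, -1), tokens], dim=1)   # (B, N+1, D)
# after the encoder, x[:, 0] (the class token embedding) is fed to the classifier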
POSITIONAL ENCODINGS
Instead, we can learn the encodings
• Size of positional encoding vector is the same as patch embedding: D
• Allows flexibility in learning how regions should relate spatially
• Further reduces inductive bias
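Learned positional encodings then amount to one extra trainable parameter with the same shape as the token sequence (a sketch continuing the shapes above):

import torch
import torch.nn as nn

B, N, D = 1, 196, 768
x = torch.randn(B, N + 1, D)                         # class token + patch embeddings
pos_embed = nn.Parameter(torch.zeros(1, N + 1, D))   # one learned D-dim encoding per position
x = x + pos_embed                                    # added before the transformer encoder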
TRANSFORMER ENCODER
Summary: A = softmax(Q Kᵀ / √D) V
Multi-head self-attention: the h concatenated head outputs A_MSA, of size (N+1) × hD, are projected back to (N+1) × D by W_O of size hD × D:
A_MSA · W_O = A
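A from-scratch sketch of multi-head self-attention with exactly these shapes (each of the h heads outputs D dimensions, so the concatenation is hD wide before W_O projects back to D; the defaults D = 192 and h = 3 are assumptions):

import torch
import torch.nn as nn
import torch.nn.functional as F

class MSA(nn.Module):
    # Multi-head self-attention following the shapes on the slide:
    # each of the h heads outputs D dimensions, so the concatenation is (N+1) x hD,
    # and W_O projects hD back to D. (Many implementations use D/h per head instead.)
    def __init__(self, D=192, h=3):
        super().__init__()
        self.h, self.D = h, D
        self.qkv = nn.Linear(D, 3 * h * D)   # Q, K, V for all heads at once
        self.w_o = nn.Linear(h * D, D)       # W_O: hD -> D

    def forward(self, x):                    # x: (B, N+1, D)
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)             # each (B, T, hD)
        # split into heads: (B, h, T, D)
        q, k, v = (t.view(B, T, self.h, D).transpose(1, 2) for t in (q, k, v))
        attn = F.softmax(q @ k.transpose(-2, -1) / D ** 0.5, dim=-1)   # (B, h, T, T)
        a = attn @ v                                       # (B, h, T, D)
        a = a.transpose(1, 2).reshape(B, T, self.h * D)    # concatenate heads: (B, T, hD)
        return self.w_o(a)                                 # project back to (B, T, D)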
TRANSFORMER ENCODER
The second module in the transformer block contains the MLP (see the sketch after this list)
• Also contains a skip connection
• Layer normalization is applied first
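A sketch of one full encoder block with both modules and their skip connections (this uses PyTorch's built-in nn.MultiheadAttention, which splits D over the heads, and a 4×D MLP width; both are common choices, not taken from the slides):

import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    # One transformer encoder block: LayerNorm -> MSA -> skip, then LayerNorm -> MLP -> skip
    def __init__(self, D=192, h=3):
        super().__init__()
        self.norm1 = nn.LayerNorm(D)
        self.msa = nn.MultiheadAttention(D, h, batch_first=True)
        self.norm2 = nn.LayerNorm(D)
        self.mlp = nn.Sequential(nn.Linear(D, 4 * D), nn.GELU(), nn.Linear(4 * D, D))

    def forward(self, x):                                  # x: (B, N+1, D)
        y = self.norm1(x)                                  # layer normalization first
        x = x + self.msa(y, y, y, need_weights=False)[0]   # self-attention module + skip connection
        x = x + self.mlp(self.norm2(x))                    # MLP module + skip connection
        return x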
Per image:
• Patch embedding matrix
• Positional encoding (when using learned encodings)
• Initial class embedding
SWIN TRANSFORMER
Liu et al., “Swin Transformer: Hierarchical Vision Transformer using Shifted Windows”, ICCV 2021
The Swin transformer block consists of two successive attention blocks
• Window multi-head self-attention (W-MSA)
• Shifted window MSA (SW-MSA)
SWIN TRANSFORMER3
Window MSA considers only patches within a smaller window
• Fewer inputs for the self-attention
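The saving can be made concrete with the complexity comparison reported by Liu et al. (ICCV 2021) for h × w patches, embedding dimension C and window size M:
Ω(MSA) = 4hwC² + 2(hw)²C
Ω(W-MSA) = 4hwC² + 2M²hwC
The term that is quadratic in the number of patches becomes linear once attention is restricted to M × M windows.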