
COMPUTER VISION

2024 - 2025
VISION TRANSFORMERS
UTRECHT UNIVERSITY
RONALD POPPE
CNN RECAP
CNNs take visual inputs and produce an output
• Label, bounding box, segmentation map, etc.
• For recognition tasks: feature extractor part and classification part
RECEPTIVE FIELD RECAP
CNNs hierarchically process local information
• Convolution and pooling kernels operate on small neighborhoods
• Receptive field increases with layers
• Spatially remote areas only processed jointly when in the same receptive field
RECEPTIVE FIELD RECAP2
Example: receptive field importance

Even then, relative positioning is encoded in the convolutional kernels
• Invariances have to be learned

In CNNs, spatially distant information is typically handled in the FC layers
INDUCTIVE BIAS
Inductive biases govern how models generalize to unseen data
• Inherent properties of the model affect what it focuses on

Helps in obtaining a good solution in the direction of the bias
• Prevents overfitting

Might not provide the optimal solution


INDUCTIVE BIAS IN CNNS
CNNs process locally through kernel operations
• Consequently, focus on local patterns
• Texture typically dominant over shape
• Learning local invariances is easier than learning spatial invariances
• Difficulty in capturing long-range patterns
TRANSFORMERS
TRANSFORMERS
Transformers are a type of sequence-to-sequence models
• Contain encoder and decoder

Output of the encoder is a vectorized representation of the input

Output of the decoder is the translated version of the input
TRANSFORMERS2
Many different sequence-to-sequence tasks:
• Language translation
• Image-to-text
• Conversational models
• Etc.

Classification is also possible
• Requires only a trivial adaptation
TRANSFORMERS3
First sequence-to-sequence models used recurrent neural networks (RNNs)
• The hidden state (encoder vector) summarizes the full input
TRANSFORMERS4
Transformers instead use self-attention
• Encoder and decoder have L transformer blocks

We focus on image classification
• Decoder not used, only a label as output
• Decoder returns in the next lecture

Vaswani et al., “Attention Is All You Need”, NeurIPS 2017


VISION TRANSFORMER
VISION TRANSFORMER
Vision transformers (ViTs) are adapted from NLP to visual inputs
• Initially used for classification instead of translation
• Nowadays adapted to a variety of tasks
• Detection, segmentation, etc.

Dosovitskiy et al., “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”, ICLR 2021
VISION TRANSFORMER2
Core property of transformer models is self-attention
• Ability to weight information through correlations within a layer
• Less directed so lower inductive bias
• More flexible
VISION TRANSFORMER3
VISION TRANSFORMER4
Familiar structure
• Feature encoder
• Classifier
VISION TRANSFORMER5
A forward pass in a vision transformer:
1. Split an image into patches
2. Flatten the patches
3. Produce lower-dimensional embeddings
4. Add a class token
5. Add positional encodings
6. Feed the sequence to transformer encoder
7. Employ MLP to make class predictions
SPLIT IMAGE INTO PATCHES
Step 1: divide image into square patches
• Non-overlapping: stride = patch size

Global spatial information is lost


SPLIT IMAGE INTO PATCHES2
Original image: W x H x C
• N patches
• Each patch P x P x C pixels
FLATTEN IMAGE PATCH
Step 2: flatten image patches
• We further remove local spatial information by flattening each patch
• Output size N x (P²C): N rows, one per patch, each with P²C values (see the sketch below)
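
A minimal PyTorch sketch of steps 1 and 2; the sizes (224x224 RGB image, 16x16 patches) are illustrative assumptions, not taken from the slides:

```python
import torch

# Assumed sizes: 224 x 224 RGB image, 16 x 16 patches (so N = 196, P^2*C = 768)
W = H = 224; C = 3; P = 16
img = torch.randn(1, C, H, W)                   # (B, C, H, W)

# Step 1: split into non-overlapping P x P patches (stride = patch size)
patches = img.unfold(2, P, P).unfold(3, P, P)   # (B, C, H/P, W/P, P, P)

# Step 2: flatten each patch into a row of P^2*C values
patches = patches.permute(0, 2, 3, 1, 4, 5)     # (B, H/P, W/P, C, P, P)
patches = patches.reshape(1, -1, C * P * P)     # (B, N, P^2*C)
```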
EMBED PATCHES
Step 3: Produce lower-dimensional embeddings from the flattened patches
• We linearly project each patch to a fixed-dimensional vector with length D
• We use a trainable weight matrix of size P²C x D
• Output is matrix of size N x D

Flattened patches (N x P²C) · weight matrix (P²C x D) = patch embeddings (N x D)
EMBED PATCHES2
Patching and embedding patches implemented as convolution layer
• Stride and kernel size are P: extract patches
• Each kernel has size P x P x C
• Number of kernels is D: achieves the linear projection

Each output is a weighted linear sum of all inputs


• Output size is (W/P) x (H/P) x D, which is then reshaped to N x D
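
A sketch of the same patching and embedding implemented as a single convolution layer, assuming the illustrative sizes above plus an embedding dimension D = 768:

```python
import torch
import torch.nn as nn

# Assumed sizes: 224 x 224 RGB image, P = 16, embedding dimension D = 768
C, P, D = 3, 16, 768
patch_embed = nn.Conv2d(C, D, kernel_size=P, stride=P)   # D kernels of size P x P x C

img = torch.randn(1, C, 224, 224)
x = patch_embed(img)               # (1, D, W/P, H/P): each output is a linear projection of one patch
x = x.flatten(2).transpose(1, 2)   # reshape to (1, N, D) with N = 196
```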
CLASS TOKEN
Step 4: Add class token
• Class or [CLS] token is appended to the sequence of patch tokens
• Results in (N + 1) x D input

CLASS TOKEN2
Class token accumulates the information in the transformer encoder
• Final class token embedding is used by classifier

Alternative is to use final embeddings of all patch tokens


• Would result in a differently sized embedding when the number of patches N changes

Initial value of the class token is learned during training
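
A sketch of step 4, continuing from the embedding sketch above; the zero initialization is an assumption (the actual initial value is learned during training):

```python
import torch
import torch.nn as nn

D = 768
x = torch.randn(1, 196, D)                      # patch embeddings (B, N, D)

cls_token = nn.Parameter(torch.zeros(1, 1, D))  # initial value is learned during training
cls = cls_token.expand(x.shape[0], -1, -1)      # one copy per image in the batch
x = torch.cat([cls, x], dim=1)                  # (B, N + 1, D)
```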


POSITIONAL ENCODINGS
Step 5: Add positional encodings
• During patching, position information was lost
• We infuse position information per patch using positional encodings

We could number patches and use indices as positional encodings


• Idea abandoned in NLP due to varying sequence lengths
• Rare indices would hamper training
POSITIONAL ENCODINGS2
Instead, we can learn the encodings
• Size of positional encoding vector is the same as patch embedding: D
• Allows flexibility in learning how regions should relate spatially
• Further reduces inductive bias

In practice: rows/columns implicitly encoded


• Larger similarity between close-by patches
POSITIONAL ENCODINGS3
In the original transformer formulation (Vaswani et al.), sine and cosine encodings were used
• Again, size of positional encoding vector is same as patch embedding: D

For a patch at position pos, and embedding dimension i of D


• PE(pos, 2i) = sin(pos / 10000^(2i/D)), PE(pos, 2i+1) = cos(pos / 10000^(2i/D))
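
A sketch of the sinusoidal encoding above, assuming N = 196 patches and D = 768:

```python
import torch

N, D = 196, 768
pos = torch.arange(N, dtype=torch.float).unsqueeze(1)   # (N, 1) patch positions
two_i = torch.arange(0, D, 2, dtype=torch.float)        # even dimensions 2i = 0, 2, 4, ...

pe = torch.zeros(N, D)
pe[:, 0::2] = torch.sin(pos / 10000 ** (two_i / D))     # PE(pos, 2i)
pe[:, 1::2] = torch.cos(pos / 10000 ** (two_i / D))     # PE(pos, 2i+1)

# Step 5: element-wise sum with the patch embeddings (after extending pe with a class-token row)
```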
POSITIONAL ENCODINGS4
Sinusoidal positional encoding has favorable properties:
• Each patch has a unique encoding
• All values are naturally in the -1…1 range
• Distance between neighboring patches is the same
• Encodings can be interpolated when the number of patches changes
POSITIONAL ENCODINGS5
Eventually, the positional encodings are element-wise summed with the patch embeddings
• Output size remains (N + 1) x D

Summing instead of concatenating is unintuitive but efficient
• The transformer can learn to disentangle signal and position
TRANSFORMER ENCODER
At this point, we have the image embedded as a sequence of patch tokens
• Added a class token
• Added positional embeddings
TRANSFORMER ENCODER2
Step 6: Feed the sequence to transformer encoder

Transformer encoder consists of L encoder blocks that contain:


• Self-attention module
• MLP module

Output of each encoder block is input to next


• Size of input and output does not change
• Each transformer block is identical in terms of architecture
TRANSFORMER ENCODER3
Self-attention module combines normalization and self-attention
• Skip connection used to add the original value

Self-attention allows for sharing information between inputs


• Scaled dot-product attention used

Layer normalization deals with sequential data
• Independent of the number of inputs
TRANSFORMER ENCODER4
Scaled dot-product attention focuses on important information

Takes input and calculates:


• Query (Q): representation of the current token/patch
• Key (K): representation of all tokens/patches
• Value (V): value associated with each token
TRANSFORMER ENCODER5
Q, K and V are calculated using weight matrices WQ, WK and WV
• All three are trained
• Q = XWQ ((N+1) x DQ)
• K = XWK ((N+1) x DQ)
• V = XWV ((N+1) x D)

We calculate the dot product of Q and K: QK^T


• Results in attention matrix E (size (N+1) x (N+1))

Scaled by the square root of the embedding dimensionality


TRANSFORMER ENCODER6
Softmax is used to make attention relative over all tokens

Final multiplication with V


• Attention weighted according to the importance of each token

Summary: A = softmax(QK^T / √D) V

• Output dimension is (N+1) x D, same as input dimension
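
A minimal single-head sketch of the attention computation above, with random matrices standing in for the trained weights (sizes are illustrative):

```python
import torch

T, D = 197, 768                           # T = N + 1 tokens, embedding dimension D
X = torch.randn(T, D)                     # input token embeddings
W_Q, W_K, W_V = torch.randn(D, D), torch.randn(D, D), torch.randn(D, D)  # trained in practice

Q, K, V = X @ W_Q, X @ W_K, X @ W_V       # queries, keys, values
E = Q @ K.T / D ** 0.5                    # scaled attention matrix, (N+1) x (N+1)
A = torch.softmax(E, dim=-1) @ V          # weighted values, (N+1) x D: same size as the input
```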


BREAK!
TRANSFORMER ENCODER7
Calculate keys and values for three tokens
TRANSFORMER ENCODER8
Calculate queries for three tokens
TRANSFORMER ENCODER9
Calculate attention between first query and all keys
TRANSFORMER ENCODER10
For first query, multiply softmax attention scores with corresponding values
TRANSFORMER ENCODER11
For first query, sum outputs for all keys
TRANSFORMER ENCODER12
For queries 2 and 3, repeat process
TRANSFORMER ENCODER13
Weight matrices WQ, WK and WV provide one perspective on information
• Usually multiple heads used in parallel: multi-head self-attention (MSA)
• Attend to different patterns
TRANSFORMER ENCODER14
Outputs of h different attention heads are concatenated
• Then projected back to the original dimensions (N + 1) x D using WO

AMSA ((N+1) x hD) · WO (hD x D) = A ((N+1) x D)
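
A sketch of the multi-head output projection, assuming h = 12 heads that each keep the full dimension D as on the slide (many implementations use D/h per head instead):

```python
import torch

T, D, h = 197, 768, 12
head_outputs = [torch.randn(T, D) for _ in range(h)]   # attention output of each head
A_msa = torch.cat(head_outputs, dim=-1)                # (N+1) x hD
W_O = torch.randn(h * D, D)                            # trained projection matrix
A = A_msa @ W_O                                        # back to (N+1) x D
```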
TRANSFORMER ENCODER15
Second module in the transformer block contains the MLP
• Also contains skip connection
• First, layer normalization

MLP consists of two FC layers with dropout


• GELU activation in-between
• Output dimension same as input dimension
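
A sketch of the MLP module; the hidden size of 4·D and the dropout rate of 0.1 are common choices assumed here, not given in the slides:

```python
import torch
import torch.nn as nn

D = 768
mlp_block = nn.Sequential(
    nn.LayerNorm(D),        # layer normalization first
    nn.Linear(D, 4 * D),    # first FC layer
    nn.GELU(),              # GELU activation in-between
    nn.Dropout(0.1),
    nn.Linear(4 * D, D),    # second FC layer, output dimension same as input
    nn.Dropout(0.1),
)

x = torch.randn(1, 197, D)
x = x + mlp_block(x)        # skip connection adds the original value
```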
TRANSFORMER ENCODER16
GELU resembles ReLU but:
• Is differentiable everywhere
• Has a small negative value for negative inputs
MLP HEAD
After L transformer blocks, we have an output for each token/patch
• Including class token

We only consider the class token


• Has size 1 x D
• Considered the feature vector
MLP HEAD2
The MLP head consists of a limited number of FC layers
• Last with softmax activation
• Outputs are class probabilities
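
A sketch of the MLP head, assuming a single FC layer and K = 1000 classes (in practice the softmax is often folded into the loss):

```python
import torch
import torch.nn as nn

D, K = 768, 1000
head = nn.Sequential(nn.Linear(D, K), nn.Softmax(dim=-1))

tokens = torch.randn(1, 197, D)   # output of the last transformer block
probs = head(tokens[:, 0])        # class token only (1 x D) -> class probabilities (1 x K)
```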
TRAINING
TRAINING
When training a transformer, the following parameters need to be estimated:

For the image embedding:
• Patch embedding matrix
• Positional encoding (when using learned encodings)
• Initial class embedding

Per transformer block:


• Weight matrices WK, WQ, WV
• FC layers in MLP
• WO (for multi-head self-attention)
TRAINING2
Training can be supervised end-to-end
• Recall that transformers have low inductive bias
• Large amounts of training data are needed

Training similar to CNNs


• Calculate loss between actual and predicted label
• Backpropagate to obtain partial derivatives per weight
• Adjust weights using optimizer
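
A minimal supervised training sketch; `model`, `loader`, and the AdamW settings are assumptions, not prescribed by the slides:

```python
import torch
import torch.nn as nn

# Assumed: `model` is a vision transformer, `loader` yields (images, labels) batches
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

for images, labels in loader:
    logits = model(images)             # forward pass
    loss = criterion(logits, labels)   # loss between actual and predicted label
    optimizer.zero_grad()
    loss.backward()                    # backpropagate to obtain partial derivatives per weight
    optimizer.step()                   # adjust weights using the optimizer
```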
TRAINING3
Several pre-trained models available
• Original ViT trained on JFT-300M
• Internal Google dataset with 300M labeled images
• Check PyTorch: https://pytorch.org/vision/0.20/models/vision_transformer.html

Typical approach is to fine-tune these models on novel datasets


• Usually, none of the layers/matrices are frozen
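
A sketch of loading one of the pre-trained torchvision models from the link above and preparing it for fine-tuning, assuming 10 target classes:

```python
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)        # ImageNet pre-trained ViT-B/16
model.heads.head = nn.Linear(model.heads.head.in_features, 10)  # replace the classifier head
# None of the other layers/matrices are frozen: all weights are updated during fine-tuning
```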

Alternative is to pre-train using self-supervision


• Topic of the next lecture
SWIN TRANSFORMER
SWIN TRANSFORMER
Computation cost of the global attention in the vision transformer is high
• Scales quadratically in N

Swin (shifted windows) transformer addresses this issue


• Using hierarchical processing
• Using local attention within windows

Increased inductive bias due to stronger focus on local processing


• Long-range feature interactions through cross-window connections (shifted windows)

Liu et al., “Swin Transformer: Hierarchical Vision Transformer using Shifted Windows”, ICCV 2021
SWIN TRANSFORMER2
A Swin transformer block contains two sub-blocks
• Window multi-head self-attention (W-MSA)
• Shifted window MSA (SW-MSA)
SWIN TRANSFORMER3
Window MSA considers only patches within a smaller window
• Fewer inputs for the self-attention

Shifted window MSA allows information to flow between neighboring windows


SWIN TRANSFORMER4
Swin transformer gradually covers larger areas
• Start from 4 x 4 pixels per patch
• After each stage, 2 x 2 neighboring patches are merged and linearly projected (see the sketch below)
• Simultaneously, the embedding channel size increases
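
A sketch of the patch merging step between stages, assuming a (B, H, W, D) grid of patch embeddings; the 4D → 2D linear projection follows the Swin paper:

```python
import torch
import torch.nn as nn

def patch_merging(x: torch.Tensor, proj: nn.Linear) -> torch.Tensor:
    """Merge each 2 x 2 group of neighboring patches and project the channels."""
    # x: (B, H, W, D); gather the four patches of every 2 x 2 group
    merged = torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                        x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)   # (B, H/2, W/2, 4D)
    return proj(merged)                                                # (B, H/2, W/2, 2D)

D = 96
proj = nn.Linear(4 * D, 2 * D)          # spatial size halves, channel size doubles
x = torch.randn(1, 56, 56, D)           # e.g. 224/4 = 56 patches per side after 4 x 4 patching
print(patch_merging(x, proj).shape)     # torch.Size([1, 28, 28, 192])
```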
COMPARISON
On ImageNet, ViT and Swin outperform CNN-based classifiers

Swin has fewer FLOPs than ViT

Swin scales better with image size
QUESTIONS?
FINALLY…
ASSIGNMENT
Assignment 3
• Deadline Sunday March 16, 23:00
• Assignment support session at 15:15 in BBG 209/219
• Use the assignments channel!
NEXT LECTURE
Next lecture:
• Thursday March 13, 11:00-12:45, Ruppert-040
• Vision language models
• Assignment 4 walk-through
MATERIALS
Background materials:
• Vaswani et al., “Attention Is All You Need”, NeurIPS 2017
• Dosovitskiy et al., “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”, ICLR 2021
• Liu et al., “Swin Transformer: Hierarchical Vision Transformer using Shifted Windows”, ICCV 2021
• PyTorch vision transformers
