
CS485/585 Deep Generative Networks
Chapter 5: Transformers

Bilkent University
CNNs vs. Transformers
• CNNs exhibit a strong bias towards feature locality,
as well as spatial invariance due to sharing filter
weights across all locations.
• Transformers have strong representation capability
and are free of human-defined inductive bias.
Transformers
• To understand transformers, we first need to understand attention.
• Connections between positions are computed on the fly via attention.
• Transformers require more data.
Attention
• Which part of the input should I focus on?
Sequence Modeling
[Figure: an encoder-decoder model translates "The book is on the table" into "Kitap masanın üstünde" (Turkish).]

Sequence-to-sequence modeling


Sequence Modeling - RNN

[Figure: an RNN encoder-decoder translates "The book is on the table" into "Kitap masanın üstünde", one token at a time.]

Sequence-to-sequence modeling
Attention is all you need

Attention Is All You Need, Neurips 2017


Sequence Modeling
• Challenges with RNNs:
– Long-range dependencies are hard to capture
– Gradient vanishing
– Serial operations
• Transformer networks:
– Long-range dependencies enabled
– No gradient vanishing
– Parallel computing

Concept of Database
Retrieval from a database: a query is compared against the stored keys (Key 1 … Key 4), and the value paired with the matching key (Value 1 … Value 4) is returned.
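As a rough analogy (a sketch, not from the slides), hard retrieval returns the single value whose key matches the query exactly; the attention mechanism on the next slides relaxes this to a weighted sum over all values.

# Hard key-value retrieval: the query must match a stored key exactly.
database = {"key1": "value1", "key2": "value2", "key3": "value3", "key4": "value4"}

def retrieve(query):
    return database[query]   # KeyError unless the query equals one of the keys

print(retrieve("key2"))      # -> value2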
Attention Mechanism
• Mimics retrieval from a database.
• Measure the similarity between the query and each key, and produce an output based on these similarities:

• $\mathrm{attention}(q, K, V) = \sum_i \mathrm{similarity}(q, k_i)\, v_i$


Attention Mechanism
[Figure: similarity scores $s_1, s_2, s_3, s_4$ are computed between the query and the keys $k_1, k_2, k_3, k_4$.]
Attention Mechanism
• Similarity measures:
– Dot product: $q^\top k_i$
– Scaled dot product: $\dfrac{q^\top k_i}{\sqrt{d}}$, where $d$ is the dimensionality of each key
– General dot product: $q^\top W k_i$
Attention Mechanism
[Figure: the query $q_1$ and the keys $k_1, k_2, k_3$ drawn as vectors; the largest scaled dot product comes from the key most aligned with the query.]
Attention Mechanism
• Similarity: dot product $s_i = q^\top k_i$

• Similarities compete through a softmax:

$a_i = \dfrac{\exp(s_i)}{\sum_j \exp(s_j)}$
Attention Mechanism
[Figure: the same query and key vectors; a softmax is applied to the dot-product scores to obtain the attention weights.]
Attention Mechanism

• $\mathrm{attention}(q, K, V) = \sum_i a_i \, v_i$

• This provides a soft attention over the values.
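A minimal sketch of this soft attention in PyTorch (scaled dot-product similarity followed by a softmax); the tensor shapes are illustrative assumptions.

import torch
import torch.nn.functional as F

def soft_attention(q, K, V):
    # q: (d,) query, K: (n, d) keys, V: (n, d_v) values
    scores = K @ q / K.shape[-1] ** 0.5   # scaled dot-product similarities s_i
    a = F.softmax(scores, dim=-1)         # attention weights a_i (sum to 1)
    return a @ V                          # soft retrieval: sum_i a_i * v_i

q = torch.randn(64)
K, V = torch.randn(4, 64), torch.randn(4, 64)
out = soft_attention(q, K, V)             # a (64,)-dimensional blend of the values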


Attention
• Which part of the input should I focus on?
Attention
• Self-attention
• Query, Key, and Value all depend on the input features x (see the sketch below):
• Q = conv1(x)
• K = conv2(x)
• V = conv3(x)
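A sketch of such a self-attention block over a 2-D feature map, loosely in the style of non-local networks; the 1×1 convolutions and the channel reduction factor are assumptions, not prescribed by the slide.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention2d(nn.Module):
    """Self-attention over the spatial positions of a feature map (sketch)."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.conv_q = nn.Conv2d(channels, channels // reduction, 1)  # Q = conv1(x)
        self.conv_k = nn.Conv2d(channels, channels // reduction, 1)  # K = conv2(x)
        self.conv_v = nn.Conv2d(channels, channels, 1)               # V = conv3(x)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.conv_q(x).flatten(2).transpose(1, 2)            # (b, hw, c//r)
        k = self.conv_k(x).flatten(2)                             # (b, c//r, hw)
        v = self.conv_v(x).flatten(2).transpose(1, 2)             # (b, hw, c)
        attn = F.softmax(q @ k / q.shape[-1] ** 0.5, dim=-1)      # (b, hw, hw)
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)      # back to a feature map
        return out

x = torch.randn(2, 64, 16, 16)
y = SelfAttention2d(64)(x)   # same shape as x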
Attention

Non-local Neural Networks, CVPR 2018


Attention is all you need
Multihead Attention
Multi-head attention concatenates multiple attention outputs per query, each computed with different projection weights (see the sketch below):
• Q1 = conv1_1(x)
• K1 = conv1_2(x)
• V1 = conv1_3(x)
• Q2 = conv2_1(x)
• K2 = conv2_2(x)
• V2 = conv2_3(x)
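A compact two-head sketch of this idea using linear projections instead of convolutions; the head count, dimensions, and weight shapes are assumptions.

import torch
import torch.nn.functional as F

def head(x, w_q, w_k, w_v):
    # x: (n, d) token features; one attention head with its own projection weights
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    a = F.softmax(q @ k.transpose(0, 1) / k.shape[-1] ** 0.5, dim=-1)
    return a @ v

d, d_head, n = 64, 32, 10
x = torch.randn(n, d)
weights = [[torch.randn(d, d_head) for _ in range(3)] for _ in range(2)]   # two heads
out = torch.cat([head(x, *w) for w in weights], dim=-1)                    # (n, 2 * d_head)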
Self-attention GAN

Self-Attention Generative Adversarial Networks, ICML 2019


Self-attention GAN

Self-Attention Generative Adversarial Networks, ICML 2019


Self-attention GAN

Visualization of attention maps. These images were generated by SAGAN. We visualize the attention maps of the last generator layer that uses attention, since this layer is the closest to the output pixels and is the most straightforward to project into pixel space and interpret.

Self-Attention Generative Adversarial Networks, ICML 2019


Self-attention GAN

Self-Attention Generative Adversarial Networks, ICML 2019


Self-attention
• Pairwise self-attention
– generalizes standard dot-product attention
• Patchwise self-attention
– strictly more powerful than convolution

Exploring Self-attention for Image Recognition, CVPR 2020


Self-attention

Exploring Self-attention for Image Recognition, CVPR 2020


Self-attention

Exploring Self-attention for Image Recognition, CVPR 2020


Attention vs. Convolution
• Convolution:
– has a local receptive field, so long-range dependencies can only be processed after passing through several convolutional layers
– is computationally efficient
• Attention:
– enables long-range dependencies directly
– is computationally expensive
– can be combined with convolution as a complementary layer
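A back-of-the-envelope illustration of the cost difference; the resolution, channel count, and operation counts below are rough assumptions, not figures from the slides.

# A 3x3 convolution scales linearly with the number of positions, while
# self-attention over all positions scales quadratically.
h = w = 64          # feature map resolution
c = 256             # channels / embedding dimension
n = h * w           # number of spatial positions (tokens)

conv3x3_ops = n * 3 * 3 * c * c      # one 3x3 convolution layer
attention_ops = 2 * n * n * c        # QK^T scores + weighted sum over values

print(f"conv ~{conv3x3_ops / 1e9:.1f} GFLOPs, attention ~{attention_ops / 1e9:.1f} GFLOPs")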
Vision Transformer
• Completely abandon convolution?
• Extend transformer to an image, each pixel is a
word?
• Each patch is a word?

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, ICLR 2021
Vision Transformer
• Completely abandon convolution?
– Limited receptive field
– Translation invariance

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, ICLR 2021
Vision Transformer
• Instead of local attention over pixels, do global attention over patches (see the patch-embedding sketch below).

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, ICLR 2021
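A minimal sketch of turning an image into a sequence of patch tokens; the 16×16 patch size matches the paper title, while the embedding dimension and the Conv2d-based implementation are common choices assumed here.

import torch
import torch.nn as nn

# Linear patch embedding: a strided convolution cuts the image into non-overlapping
# patches and projects each one to an embedding vector ("each patch is a word").
patch, dim = 16, 768
to_patches = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

img = torch.randn(1, 3, 224, 224)
tokens = to_patches(img).flatten(2).transpose(1, 2)   # (1, 196, 768): 14x14 patch tokens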
Positional Encoding
• Ways of encoding spatial information using positional embeddings (see the sketch below):
– 1-dimensional positional embedding: treat the inputs as a sequence of patches in raster order
– 2-dimensional positional embedding: treat the inputs as a grid of patches in two dimensions
– Learned positional embedding
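A sketch of the learned 1-D variant: one embedding per patch position in raster order, added to the patch tokens. Sizes carry over from the previous sketch and are assumptions.

import torch
import torch.nn as nn

# Learned 1-D positional embedding, one vector per raster-order patch position.
# (A 2-D variant would instead combine a row embedding and a column embedding
# for each grid position.)
num_patches, dim = 196, 768
pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))

tokens = torch.randn(1, num_patches, dim)   # patch embeddings from the previous step
tokens = tokens + pos_embed                 # position information injected here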
TransGAN

TransGAN: Two Pure Transformers Can Make One Strong GAN, and That Can Scale Up, arXiv 2021
TransGAN

TransGAN: Two Pure Transformers Can Make One Strong GAN, and That Can Scale Up, arXiv 2021
Grid Attention

Grid self-attention across different transformer stages. Standard self-attention is replaced with grid self-attention when the resolution is higher than 32 × 32; the grid size is set to 16 × 16 by default.

TransGAN: Two Pure Transformers Can Make One Strong GAN, and That Can Scale Up, arXiv 2021
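A rough sketch of the idea: attention is restricted to non-overlapping 16 × 16 windows of the token grid. The projection-free formulation and shapes are simplifying assumptions, not TransGAN's exact implementation.

import torch
import torch.nn.functional as F

def grid_self_attention(x, grid=16):
    # x: (b, h, w, c) token grid; attention is computed only within each
    # non-overlapping grid x grid window.
    b, h, w, c = x.shape
    x = x.reshape(b, h // grid, grid, w // grid, grid, c)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, grid * grid, c)    # windows as sequences
    attn = F.softmax(x @ x.transpose(1, 2) / c ** 0.5, dim=-1)     # attention within a window
    x = attn @ x
    x = x.reshape(b, h // grid, w // grid, grid, grid, c)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(b, h, w, c)

y = grid_self_attention(torch.randn(1, 64, 64, 96))   # used when resolution > 32x32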
Swin Transformer

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, arXiv 2021
Transformers are data hungry
• They need large-scale datasets for pretraining
• Data augmentation is crucial for TransGAN

TransGAN: Two Pure Transformers Can Make One Strong GAN, and That Can Scale Up, arXiv 2021
Augmentation for GAN
• 10⁵–10⁶ images are required to train a modern high-quality, high-resolution GAN
• The key problem with small datasets is that the
discriminator overfits to the training examples; its
feedback to the generator becomes meaningless
and training starts to diverge.
• In almost all areas of deep learning, dataset
augmentation is the standard solution against
overfitting.

Training Generative Adversarial Networks with Limited Data, Neurips 2020


Augmentation for GAN

Training Generative Adversarial Networks with Limited Data, Neurips 2020


Augmentation for GAN
• In contrast, a GAN trained under similar dataset augmentations learns to generate the augmented distribution.
• Such "leaking" of augmentations into the generated samples is highly undesirable.

Training Generative Adversarial Networks with Limited Data, Neurips 2020


Designing augmentations that do not leak
• Discriminator augmentation corresponds to putting
distorting goggles on the discriminator, and asking
the generator to produce samples that cannot be
distinguished from the training set when viewed
through the goggles.

Training Generative Adversarial Networks with Limited Data, Neurips 2020


Augmentation for GAN
• Evaluate the discriminator only using augmented images, and do this also when training the generator.
• The augmentations need to be differentiable. This is achieved by implementing them using standard differentiable primitives offered by the deep learning framework (see the sketch below).

Training Generative Adversarial Networks with Limited Data, Neurips 2020
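A minimal sketch of the general idea: the discriminator only ever sees augmented images, and the augmentation is built from plain differentiable tensor ops so gradients flow back to the generator. The two augmentations shown here are illustrative assumptions, not the paper's actual pipeline.

import torch

def augment(imgs):
    # Differentiable augmentations: random horizontal flip + random brightness shift.
    # Both are ordinary tensor ops, so gradients pass through them to the generator.
    if torch.rand(()) < 0.5:
        imgs = torch.flip(imgs, dims=[3])
    return imgs + 0.2 * torch.rand(imgs.shape[0], 1, 1, 1, device=imgs.device)

# The discriminator is evaluated only on augmented images, for both real and
# generated batches, and also when computing the generator's loss:
#   d_real = D(augment(real_images))
#   d_fake = D(augment(G(z)))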


Augmentation for GAN

Training Generative Adversarial Networks with Limited Data, Neurips 2020


Results

Training Generative Adversarial Networks with Limited Data, Neurips 2020


Results

Training Generative Adversarial Networks with Limited Data, Neurips 2020
