Transformer
Hichem Felouat
[email protected]
Contents
• We can represent this tiny network against the time axis, as shown in the figure on the right. This is called unrolling the network through time.
• Seq-to-seq (top left), seq-to-vector (top right), vector-to-seq (bottom left), and Encoder–Decoder
(bottom right) networks.
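As a rough illustration of these input/output patterns (a minimal sketch, not from the original slides, assuming TensorFlow/Keras), toggling return_sequences switches a recurrent layer between sequence-to-sequence and sequence-to-vector behavior, while a vector-to-sequence model is typically built by repeating the input vector before the recurrent layer:

```python
import tensorflow as tf

# Toy batch: 4 sequences, 10 time steps, 3 features per step.
x = tf.random.normal((4, 10, 3))

# Seq-to-seq: the layer emits one output vector per time step.
seq_to_seq = tf.keras.layers.SimpleRNN(8, return_sequences=True)
print(seq_to_seq(x).shape)   # (4, 10, 8)

# Seq-to-vector: only the final hidden state is returned.
seq_to_vec = tf.keras.layers.SimpleRNN(8, return_sequences=False)
print(seq_to_vec(x).shape)   # (4, 8)

# Vector-to-seq: repeat a single vector across time, then run an RNN over it.
v = tf.random.normal((4, 3))
vec_to_seq = tf.keras.Sequential([
    tf.keras.layers.RepeatVector(10),
    tf.keras.layers.SimpleRNN(8, return_sequences=True),
])
print(vec_to_seq(v).shape)   # (4, 10, 8)
```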
Recurrent Neural Network (RNN)
As we encode the word "it" in encoder #5 (the top encoder in the stack), part of the attention mechanism focuses on "The Animal" and bakes a part of its representation into the encoding of "it".
Hichem Felouat - [email protected] - 2024 24
Self-Attention in Detail
Multiplying x1 by the WQ weight matrix produces q1, the "query" vector associated with that word. We end up
creating a "query", a "key", and a "value" projection of each word in the input sentence.
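A minimal NumPy sketch of this projection step (the dimensions and random weight values below are illustrative stand-ins for the learned matrices WQ, WK, WV, not the slide's actual numbers):

```python
import numpy as np

d_model, d_k = 512, 64                    # illustrative dimensions
rng = np.random.default_rng(0)

x1 = rng.normal(size=(d_model,))          # embedding of one input word
WQ = rng.normal(size=(d_model, d_k))      # learned projection matrices (random here)
WK = rng.normal(size=(d_model, d_k))
WV = rng.normal(size=(d_model, d_k))

q1 = x1 @ WQ   # "query" vector for this word
k1 = x1 @ WK   # "key" vector
v1 = x1 @ WV   # "value" vector
print(q1.shape, k1.shape, v1.shape)       # (64,) (64,) (64,)
```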
Self-Attention in Detail
The score between two words is computed as the dot product of the first word's query vector with the second word's key vector; the scores are then scaled and passed through a softmax before weighting the value vectors.
As we encode the word "it", one attention head is focusing most on "the animal", while another is focusing on "tired"; in a sense, the model's representation of the word "it" bakes in some of the representation of both "animal" and "tired". If we add all the attention heads to the picture, however, things can be harder to interpret.
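A minimal sketch of the scoring and weighting steps just described, i.e., standard scaled dot-product attention softmax(QKᵀ/√d_k)V (NumPy, illustrative shapes; a multi-head layer simply runs several such attentions in parallel on separate projections and concatenates the results):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # dot product of each query with every key
    weights = softmax(scores, axis=-1)   # attention distribution over the sequence
    return weights @ V, weights          # weighted sum of the value vectors

n, d_k = 5, 64                           # illustrative sequence length and key size
rng = np.random.default_rng(0)
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_k))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape, w.shape)                # (5, 64) (5, 5)
```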
Transformer
Positional Encoding:
The transformer adds a vector to each input embedding. These vectors
follow a specific pattern that the model learns, which helps it determine
the position of each word or the distance between different words in
the sequence.
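One widely used choice for such a pattern is the fixed sinusoidal encoding from the original Transformer paper; a minimal sketch (NumPy, with illustrative max_len and d_model, assuming an even d_model):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal encodings: PE[pos, 2i] = sin(pos / 10000^(2i/d_model)),
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))."""
    pos = np.arange(max_len)[:, None]                # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]             # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)  # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                     # even dimensions
    pe[:, 1::2] = np.cos(angles)                     # odd dimensions
    return pe

pe = positional_encoding(max_len=50, d_model=512)
print(pe.shape)   # (50, 512) -- added element-wise to the input embeddings
```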
Performance benchmark comparison of Vision Transformers (ViT) with ResNet and MobileNet when trained
from scratch on ImageNet.
Vision Transformers (ViTs) vs CNNs
The authors in [1] demonstrated that CNNs trained on ImageNet are strongly biased
towards recognizing textures rather than shapes. Below is an excellent example of
such a case:
[1]: ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. https://arxiv.org/abs/1811.12231
Vision Transformers (ViTs) vs CNNs
• Cognitive science studies (The importance of shape in early lexical learning [1])
showed that object shape is the single most important cue for human object
recognition.
• Intuitively, the object shape remains relatively stable, while other cues can be
easily distorted by all sorts of noise [2].
[1]: https://psycnet.apa.org/doi/10.1016/0885-2014(88)90014-7
[2]: https://arxiv.org/abs/1811.12231
Vision Transformers (ViTs) vs CNNs
Accuracies and example stimuli for five different experiments without cue conflict.
Source: https://arxiv.org/abs/1811.12231
Vision Transformers (ViTs) vs CNNs
• Texture alone is not sufficient for determining whether the zebra is rotated; thus, predicting rotation requires modeling shape to some extent.
https://arxiv.org/abs/2106.01548
The authors in [1] looked at the self-attention of the CLS token on the heads of the last layer. Crucially, no labels are used
during the self-supervised training. These maps demonstrate that the learned class-specific features lead to remarkable
unsupervised segmentation masks and visibly correlate with the shape of semantic objects in the images.
[1]: Self-Supervised Vision Transformers with DINO. https://arxiv.org/abs/2104.14294
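As a rough sketch of how such maps can be formed (this is not the authors' code; the attention tensor below is a random stand-in, with illustrative sizes, for what a ViT's last self-attention layer would return), the CLS token's attention over the patch tokens is simply reshaped into one spatial map per head:

```python
import numpy as np

# Hypothetical setup: a ViT with 14x14 = 196 patch tokens plus one CLS token,
# and 6 attention heads in its last layer. Random values stand in for the
# attention weights a real model would produce.
num_heads, grid, num_patches = 6, 14, 14 * 14
attn = np.random.rand(num_heads, num_patches + 1, num_patches + 1)
attn = attn / attn.sum(axis=-1, keepdims=True)   # rows behave like softmax outputs

# Attention of the CLS token (index 0) over the patch tokens, per head.
cls_attn = attn[:, 0, 1:]                        # (heads, num_patches)
maps = cls_attn.reshape(num_heads, grid, grid)   # one 14x14 map per head
print(maps.shape)  # (6, 14, 14) -- upsampled to image size for visualization
```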
Vision Transformers (ViTs) vs CNNs
• The adversarial perturbations computed for a ViT and a ResNet model.
• The adversarial perturbations are qualitatively very different even though both models may
perform similarly in image recognition.
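As a hedged sketch of how such perturbations can be computed (using the single-step FGSM attack as one simple example; the slide's figure may well use a different attack, and the randomly initialized ResNet-18 below is only a stand-in for the pretrained ViT and ResNet classifiers being compared), assuming PyTorch and a recent torchvision:

```python
import torch
import torchvision

def fgsm_perturbation(model, x, y, eps=8 / 255):
    """One-step FGSM: perturb x in the direction of the sign of the loss gradient."""
    x = x.clone().detach().requires_grad_(True)
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    return eps * x.grad.sign()

# Illustrative example with a randomly initialized ResNet-18; in practice one
# would run the same attack on a pretrained ViT and a pretrained ResNet and
# compare the resulting perturbations, as in the figure above.
model = torchvision.models.resnet18(weights=None).eval()
x = torch.rand(1, 3, 224, 224)            # dummy input image
y = torch.tensor([0])                     # dummy target label
delta = fgsm_perturbation(model, x, y)
print(delta.shape, delta.abs().max())     # (1, 3, 224, 224), bounded by eps
```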
https://github.com/hichemfelouat/my-codes-of-machine-learning/blob/master/Vision_Transformer_(ViT)_for_Image_Classification_(cifar10_dataset).ipynb
Vision Transformers (ViTs)
https://github.com/Vision-CAIR/MiniGPT-4
Thank You For Attending
Q&A
Hichem Felouat …