Transformers

Hichem Felouat
[email protected]
Contents

1. Natural Language Processing (NLP)
2. Self-Attention
3. Transformer
4. Vision Transformer (ViT)
5. Large Language Models
6. Vision Language Models

Hichem Felouat - [email protected] - 2024 2


Natural Language Processing (NLP)

• Natural language processing (NLP) is a subfield of artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.

• Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.

Hichem Felouat - [email protected] - 2024 3




Generic NLP Pipeline

Hichem Felouat - [email protected] - 2024 6


Texts to Sequence/Matrix
• In natural language processing (NLP), texts can be represented as a
sequence or a matrix, depending on the task and the model type.

texts = ["I love Algeria", "machine learning", "Artificial intelligence", "AI"]

• The total number of documents: 4
• The number of distinct words (after tokenization): 8
• word_index:
{'i': 1, 'love': 2, 'algeria': 3, 'machine': 4, 'learning': 5, 'artificial': 6, 'intelligence': 7, 'ai': 8}
• texts_to_sequences: input [Algeria love AI]
[3, 2, 8]
• sequences_to_texts: input [3, 4, 7, 2, 8, 1, 3]
['algeria machine intelligence love ai i algeria']
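A minimal sketch reproducing these outputs, assuming the Keras Tokenizer API (the word_index, texts_to_sequences, and sequences_to_texts names above match it):

```python
# Minimal sketch, assuming tensorflow.keras.preprocessing.text.Tokenizer.
from tensorflow.keras.preprocessing.text import Tokenizer

texts = ["I love Algeria", "machine learning", "Artificial intelligence", "AI"]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)           # builds the vocabulary from the 4 documents

print(tokenizer.document_count)         # 4
print(tokenizer.word_index)             # {'i': 1, 'love': 2, 'algeria': 3, ...}
print(tokenizer.texts_to_sequences(["Algeria love AI"]))      # [[3, 2, 8]]
print(tokenizer.sequences_to_texts([[3, 4, 7, 2, 8, 1, 3]]))  # ['algeria machine intelligence love ai i algeria']
```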

Hichem Felouat - [email protected] - 2024 7


Texts to Sequence/Matrix
• binary: Whether or not each word is present in the document. This is the default.
• count: The count of each word in the document.
• freq: The frequency of each word as a ratio of the words within each document.
• tfidf: The Term Frequency-Inverse Document Frequency (TF-IDF) score for each word in the document.
texts = [
"blue car and blue window", "black crow in the window","i see my reflection in the window"
]
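A short sketch of the four modes above, again assuming the Keras Tokenizer and its texts_to_matrix method:

```python
# Minimal sketch of the document-matrix modes, assuming the Keras Tokenizer API.
from tensorflow.keras.preprocessing.text import Tokenizer

texts = [
    "blue car and blue window", "black crow in the window", "i see my reflection in the window"
]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)

for mode in ("binary", "count", "freq", "tfidf"):
    matrix = tokenizer.texts_to_matrix(texts, mode=mode)
    print(mode, matrix.shape)   # (3, vocabulary_size + 1); column 0 is reserved
```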

Hichem Felouat - [email protected] - 2024 8




Sequence Padding
• Sequence padding is the process of adding zeroes or other filler
tokens to sequences of variable length so that all sequences have the
same length.
• Many machine learning models require fixed-length inputs, and
variable-length sequences can't be fed directly into these models.

sequences = [ [1, 2, 3, 4], [1, 2, 3], [1] ]

maxlen = 4
result: [[1 2 3 4]
         [0 1 2 3]
         [0 0 0 1]]
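A minimal sketch, assuming Keras' pad_sequences; its default "pre" padding matches the result shown above:

```python
# Minimal padding sketch, assuming tensorflow.keras pad_sequences.
from tensorflow.keras.preprocessing.sequence import pad_sequences

sequences = [[1, 2, 3, 4], [1, 2, 3], [1]]
padded = pad_sequences(sequences, maxlen=4)   # padding="pre" is the default
print(padded)
# [[1 2 3 4]
#  [0 1 2 3]
#  [0 0 0 1]]
```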
Hichem Felouat - [email protected] - 2024 10
Deep Learning-Based NLP

Hichem Felouat - [email protected] - 2024 11


Word Embedding

• Word embedding is a technique used in NLP to represent words as numerical vectors in a high-dimensional space.

• Word embedding aims to capture the meaning and context of words in a way that is useful for downstream NLP tasks, such as text classification, sentiment analysis, and machine translation.

• There are several popular algorithms for creating word embeddings, such as Word2Vec, GloVe, and fastText.
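As a rough illustration, a trainable embedding layer in Keras maps each word index to a dense vector; the vocabulary size and dimension below are illustrative, not from the slides:

```python
# Minimal sketch of a trainable embedding layer; 10000 and 128 are illustrative values.
import tensorflow as tf

embedding = tf.keras.layers.Embedding(input_dim=10_000, output_dim=128)

token_ids = tf.constant([[3, 2, 8, 0]])      # one padded sequence of word indices
vectors = embedding(token_ids)               # shape: (1, 4, 128), one vector per token
print(vectors.shape)
```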

Hichem Felouat - [email protected] - 2024 12




Recurrent Neural Network (RNN)
• The simplest possible RNN is composed of one neuron receiving inputs, producing an output, and sending that output back to itself (figure, left).

• We can represent this tiny network against the time axis, as shown in the figure (right). This is called unrolling the network through time.

Hichem Felouat - [email protected] - 2024 14


Recurrent Neural Network (RNN)
• You can easily create a layer of recurrent neurons. At each time step t, every
neuron receives both the input vector x(t) and the output vector from the
previous time step y(t–1).
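A minimal sketch of such a layer using Keras' SimpleRNN; the batch, sequence length, and feature sizes are illustrative:

```python
# Sketch of a layer of recurrent neurons (Keras SimpleRNN); shapes are illustrative.
import tensorflow as tf

layer = tf.keras.layers.SimpleRNN(16, return_sequences=True)   # 16 recurrent neurons
x = tf.random.normal((2, 5, 8))   # x(t) for every time step t
y = layer(x)                      # y(t) depends on x(t) and y(t-1)
print(y.shape)                    # (2, 5, 16)
```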

A recurrent neuron (left) unrolled through time (right)


Hichem Felouat - [email protected] - 2024 15
Recurrent Neural Network (RNN)

• Seq-to-seq (top left), seq-to-vector (top right), vector-to-seq (bottom left), and Encoder–Decoder (bottom right) networks.
Recurrent Neural Network (RNN)

Deep RNN (left) unrolled through time (right)


Hichem Felouat - [email protected] - 2024 17
Long Short-Term Memory (LSTM)
• As data traverses an RNN, some information is lost at each time step. After a while, the RNN's state contains virtually no trace of the first inputs.

Hichem Felouat - [email protected] - 2024 18


Gated Recurrent Unit (GRU)
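Both the LSTM and the GRU can be used as drop-in recurrent layers; a minimal Keras sketch with illustrative sizes:

```python
# Minimal sketch of LSTM and GRU layers, which keep a longer-term state than a
# SimpleRNN; layer sizes are illustrative, not from the slides.
import tensorflow as tf

inputs = tf.keras.Input(shape=(None, 8))              # variable-length sequences of 8 features
h = tf.keras.layers.LSTM(32, return_sequences=True)(inputs)
h = tf.keras.layers.GRU(32)(h)                        # keeps only the last output
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(h)
model = tf.keras.Model(inputs, outputs)
model.summary()
```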

Hichem Felouat - [email protected] - 2024 19


Bidirectional RNNs
For example, in Neural Machine Translation it is often preferable to look ahead at the next words before encoding a given word.

• Consider the phrases "the queen of the United Kingdom", "the queen of hearts", and "the queen bee": to properly encode the word "queen", you need to look ahead.
Hichem Felouat - [email protected] - 2024 20
Bidirectional RNNs
• To implement this, run two recurrent layers on the same inputs, one reading the words from left to right and the other reading them from right to left, then simply concatenate their outputs.
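A minimal sketch with Keras' Bidirectional wrapper, which runs the two directions and concatenates their outputs; sizes are illustrative:

```python
# Sketch of a bidirectional recurrent layer: one LSTM reads left-to-right,
# a copy reads right-to-left, and the outputs are concatenated.
import tensorflow as tf

bi = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(32, return_sequences=True),
    merge_mode="concat",            # concatenate the two directions (the default)
)
x = tf.random.normal((2, 7, 16))    # illustrative batch of embedded sequences
print(bi(x).shape)                  # (2, 7, 64): 32 units per direction
```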

Hichem Felouat - [email protected] - 2024 21


Self-Attention

Hichem Felouat - [email protected] - 2024 22


Self-Attention
• The following sentence is an input sentence we want to translate: "The animal didn't cross the street because it was too tired."

• What does "it" in this sentence refer to? Is it referring to the street or to the animal? It's a simple question for a human, but not as simple for an algorithm.

• When the model is processing the word "it", self-attention allows it to associate "it" with "animal".
Hichem Felouat - [email protected] - 2024 23
Self-Attention

As we are encoding the word "it" in encoder #5 (the top encoder in the stack), part of the attention mechanism
was focusing on "The Animal", and baked a part of its representation into the encoding of "it".
Hichem Felouat - [email protected] - 2024 24
Self-Attention in Detail
Multiplying x1 by the WQ weight matrix produces q1, the "query" vector associated with that word. We end up creating a "query", a "key", and a "value" projection of each word in the input sentence.
Hichem Felouat - [email protected] - 2024 25
Self-Attention in Detail

The score for each word is obtained by taking the dot product of its query vector with the key vector of every word; the scores are divided by the square root of the key dimension, passed through a softmax, and used to weight the value vectors.

Hichem Felouat - [email protected] - 2024 26


Matrix Calculation of Self-Attention

Every row in the X matrix corresponds to a word in the input sentence.
Hichem Felouat - [email protected] - 2024 27
The Attention Mechanism from Scratch
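A from-scratch numpy sketch of scaled dot-product self-attention (queries and keys are compared with dot products, the scores are softmaxed, and the values are summed with those weights); all sizes are illustrative:

```python
# Scaled dot-product self-attention from scratch:
# scores = Q K^T / sqrt(d_k), softmax over each row, then a weighted sum of V.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))                     # one row per input word (3 words, dim 4)
WQ, WK, WV = (rng.normal(size=(4, 3)) for _ in range(3))

Q, K, V = X @ WQ, X @ WK, X @ WV
scores = Q @ K.T / np.sqrt(K.shape[-1])         # (3, 3) attention scores
weights = softmax(scores, axis=-1)              # each row sums to 1
Z = weights @ V                                 # one attended output vector per word
print(Z)
```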

Hichem Felouat - [email protected] - 2024 28




Multi-Headed Attention

Multi-Headed Attention improves the performance of the attention layer in two ways:

• It expands the model's ability to focus on different positions. Yes, in the example above, z1 contains a little bit of every other encoding, but it could be dominated by the actual word itself.

• It gives the attention layer multiple representation subspaces.
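A sketch using Keras' built-in MultiHeadAttention layer; the number of heads and dimensions are illustrative:

```python
# Multi-headed self-attention: each of the 8 heads projects the input into its
# own query/key/value subspace. Sizes are illustrative, not from the slides.
import tensorflow as tf

mha = tf.keras.layers.MultiHeadAttention(num_heads=8, key_dim=64)
x = tf.random.normal((1, 10, 512))               # 1 sentence, 10 tokens, d_model = 512
z, attn = mha(query=x, value=x, key=x, return_attention_scores=True)
print(z.shape)      # (1, 10, 512): per-token outputs after concatenating the heads
print(attn.shape)   # (1, 8, 10, 10): one 10x10 attention map per head
```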

Hichem Felouat - [email protected] - 2024 30




Multi-Headed Attention

As we encode the word "it", one attention head is focusing most on If we add all the attention heads to the
"the animal", while another is focusing on "tired" , in a sense, the picture, however, things can be harder to
model's representation of the word "it" bakes in some of the interpret.
representation of both "animal" and "tired".
Hichem Felouat - [email protected] - 2024 32
Transformer
Positional Encoding:
The transformer adds a vector to each input embedding. These vectors
follow a specific pattern that the model learns, which helps it determine
the position of each word or the distance between different words in
the sequence.
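The original paper uses fixed sinusoidal vectors for this (learned positional embeddings are a common alternative); a minimal numpy sketch of the sinusoidal encoding:

```python
# Sinusoidal positional encoding from "Attention Is All You Need":
# PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)).
# The vector for each position is added to the corresponding input embedding.
import numpy as np

def positional_encoding(max_len, d_model):
    positions = np.arange(max_len)[:, None]                   # (max_len, 1)
    dims = np.arange(d_model)[None, :]                        # (1, d_model)
    angles = positions / np.power(10000, (2 * (dims // 2)) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                     # even dimensions
    pe[:, 1::2] = np.cos(angles[:, 1::2])                     # odd dimensions
    return pe

# embeddings = token_embeddings + positional_encoding(seq_len, d_model)
print(positional_encoding(50, 128).shape)   # (50, 128)
```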

Hichem Felouat - [email protected] - 2024 33




Transformer

Attention Is All You Need
https://arxiv.org/abs/1706.03762

Hichem Felouat - [email protected] - 2024 36


Transformer

The Annotated Transformer: a line-by-line implementation
http://nlp.seas.harvard.edu/annotated-transformer/

Hichem Felouat - [email protected] - 2024 37




Vision Transformer (ViT)

Hichem Felouat - [email protected] - 2024 39


Vision Transformers (ViTs) vs CNNs

Performance benchmark comparison of Vision Transformers (ViT) with ResNet and MobileNet when trained
from scratch on ImageNet.
Hichem Felouat - [email protected] - 2024 40
Vision Transformers (ViTs) vs CNNs
The authors in [1] demonstrated that CNNs trained on ImageNet are strongly biased
towards recognizing textures rather than shapes. Below is an excellent example of
such a case:

[1]: ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. https://arxiv.org/abs/1811.12231
Hichem Felouat - [email protected] - 2024 41
Vision Transformers (ViTs) vs CNNs
• Neuroscience studies (The importance of shape in early lexical learning [1])
showed that object shape is the single most important cue for human object
recognition.

• By studying the visual pathway of humans regarding image recognition, researchers identified that the perception of object shape is invariant to most perturbations. So, as far as we know, shape is the most reliable cue.

• Intuitively, the object shape remains relatively stable, while other cues can be easily distorted by all sorts of noise [2].

1: https://psycnet.apa.org/doi/10.1016/0885-2014(88)90014-7
2: https://arxiv.org/abs/1811.12231
Hichem Felouat - [email protected] - 2024 42
Vision Transformers (ViTs) vs CNNs

Accuracies and example stimuli for five different experiments without cue conflict.
Source: https://arxiv.org/abs/1811.12231
Hichem Felouat - [email protected] - 2024 43
Vision Transformers (ViTs) vs CNNs
• The texture is not sufficient for determining whether the zebra is rotated. Thus,
predicting rotation requires modeling shape, to some extent.

• The object's shape can be invariant to rotations.

Hichem Felouat - [email protected] - 2024 44


Vision Transformers (ViTs) vs CNNs
Self-attention captures long-range dependencies and contextual information in the input data. The self-attention mechanism allows a ViT model to attend to different regions of the input data based on their relevance to the task at hand.

Raw images (left) and attention maps of ViT-S/16 with (right) and without (middle).
https://arxiv.org/abs/2106.01548

Hichem Felouat - [email protected] - 2024 45


Vision Transformers (ViTs) vs CNNs

The authors in [1] looked at the self-attention of the CLS token on the heads of the last layer. Crucially, no labels are used
during the self-supervised training. These maps demonstrate that the learned class-specific features lead to remarkable
unsupervised segmentation masks and visibly correlate with the shape of semantic objects in the images.
1: Self-Supervised Vision Transformers with DINO. https://arxiv.org/abs/2104.14294
Hichem Felouat - [email protected] - 2024 46
Vision Transformers (ViTs) vs CNNs
• The adversarial perturbations computed for a ViT and a ResNet model.

• The adversarial perturbations are qualitatively very different even though both models may
perform similarly in image recognition.

ViTs and ResNets process their inputs very differently. https://arxiv.org/abs/2103.14586


Hichem Felouat - [email protected] - 2024 47
Vision Transformers (ViTs) vs CNNs
• The transformer can attend to all the
tokens (image patches) at each block
by design. The originally proposed ViT
model in [1] already demonstrated that
heads from early layers tend to attend
to far-away pixels, while heads from
later layers do not.

How heads of different layers attend to their surrounding pixels [1].
[1]: https://arxiv.org/abs/2010.11929
Hichem Felouat - [email protected] - 2024 48
Vision Transformers (ViTs)
How the Vision Transformer Works (steps 1-5 are sketched in code after the list):

1. Split an image into patches
2. Flatten the patches
3. Produce lower-dimensional linear embeddings from the flattened patches
4. Add positional embeddings
5. Feed the sequence as an input to a standard transformer encoder
6. Pretrain the model with image labels (fully supervised on a huge dataset)
7. Finetune on the downstream dataset for image classification
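A minimal Keras sketch of steps 1-5 with illustrative sizes (224x224 images, 16x16 patches, d_model = 768); the class token and classification head are omitted for brevity:

```python
# Sketch: patchify, flatten, linearly embed, add positional embeddings, and run
# one standard transformer encoder block (a real ViT stacks several of these).
import tensorflow as tf

image_size, patch_size, d_model, num_heads = 224, 16, 768, 12
num_patches = (image_size // patch_size) ** 2          # 196 patches

images = tf.random.normal((1, image_size, image_size, 3))

# 1-2) split into patches and flatten each one
patches = tf.image.extract_patches(
    images, sizes=[1, patch_size, patch_size, 1],
    strides=[1, patch_size, patch_size, 1], rates=[1, 1, 1, 1], padding="VALID")
patches = tf.reshape(patches, (1, num_patches, patch_size * patch_size * 3))

# 3) lower-dimensional linear embedding of each flattened patch
tokens = tf.keras.layers.Dense(d_model)(patches)

# 4) add positional embeddings (learned, one per patch position)
positions = tf.range(num_patches)
tokens = tokens + tf.keras.layers.Embedding(num_patches, d_model)(positions)

# 5) one standard transformer encoder block
attn = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model // num_heads)
h = tf.keras.layers.LayerNormalization()(tokens)
h = tokens + attn(h, h)
mlp = tf.keras.Sequential([tf.keras.layers.Dense(4 * d_model, activation="gelu"),
                           tf.keras.layers.Dense(d_model)])
encoded = h + mlp(tf.keras.layers.LayerNormalization()(h))
print(encoded.shape)   # (1, 196, 768)
```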
Hichem Felouat - [email protected] - 2024 49
Vision Transformers (ViTs)

https://github.com/hichemfelouat/my-codes-of-machine-learning/blob/master/Vision_Transformer_(ViT)_for_Image_Classification_(cifar10_dataset).ipynb
Hichem Felouat - [email protected] - 2024 50
Vision Transformers (ViTs)

Global Context Vision Transformer (GC ViT):
https://github.com/NVlabs/GCViT

Hichem Felouat - [email protected] - 2024 51


Large Language Models

A Survey of Large Language Models:
https://arxiv.org/abs/2303.18223
Vision Language Models

The architecture of MiniGPT-4
https://minigpt-4.github.io
Hichem Felouat - [email protected] - 2024 53
Vision Language Models

https://github.com/Vision-CAIR/MiniGPT-4
Hichem Felouat - [email protected] - 2024 54
Thank You For Attending
Q&A

Hichem Felouat …

Hichem Felouat - [email protected] - 2024 55
