Transformers

Hichem Felouat
[email protected]
Contents

1. Natural Language Processing (NLP)
2. Self-Attention
3. Transformer
4. Vision Transformer (ViT)
5. Large Language Models
6. Vision Language Models

Hichem Felouat - [email protected] - 2024 2


Natural Language Processing (NLP)

• Natural language processing (NLP) is a subfield of artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.

• Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.

Hichem Felouat - [email protected] - 2024 3




Generic NLP Pipeline

Hichem Felouat - [email protected] - 2024 6


Texts to Sequence/Matrix
• In natural language processing (NLP), texts can be represented as a
sequence or a matrix, depending on the task and the model type.

texts = ["I love Algeria", "machine learning", "Artificial intelligence", "AI"]

• The total number of documents: 4
• The number of distinct words (after tokenization): 8
• word_index:
{'i': 1, 'love': 2, 'algeria': 3, 'machine': 4, 'learning': 5, 'artificial': 6, 'intelligence': 7, 'ai': 8}
• texts_to_sequences: input [Algeria love AI]
[3, 2, 8]
• sequences_to_texts: input [3, 4, 7, 2, 8, 1, 3]
['algeria machine intelligence love ai i algeria']
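A minimal sketch reproducing these outputs, assuming the Keras Tokenizer API (the word_index, texts_to_sequences, and sequences_to_texts names above match it):

```python
# Minimal sketch, assuming tensorflow.keras.preprocessing.text.Tokenizer.
from tensorflow.keras.preprocessing.text import Tokenizer

texts = ["I love Algeria", "machine learning", "Artificial intelligence", "AI"]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)           # builds the vocabulary from the 4 documents

print(tokenizer.document_count)         # 4
print(tokenizer.word_index)             # {'i': 1, 'love': 2, 'algeria': 3, ...}
print(tokenizer.texts_to_sequences(["Algeria love AI"]))      # [[3, 2, 8]]
print(tokenizer.sequences_to_texts([[3, 4, 7, 2, 8, 1, 3]]))  # ['algeria machine intelligence love ai i algeria']
```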

Hichem Felouat - [email protected] - 2024 7


Texts to Sequence/Matrix
• binary: Whether or not each word is present in the document. This is the default.
• count: The count of each word in the document.
• freq: The frequency of each word as a ratio of the words within each document.
• tfidf: The Term Frequency-Inverse Document Frequency (TF-IDF) score for each word in the document.
texts = [
"blue car and blue window", "black crow in the window","i see my reflection in the window"
]
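A short sketch of the four modes above, again assuming the Keras Tokenizer and its texts_to_matrix method:

```python
# Minimal sketch of the document-matrix modes, assuming the Keras Tokenizer API.
from tensorflow.keras.preprocessing.text import Tokenizer

texts = [
    "blue car and blue window", "black crow in the window", "i see my reflection in the window"
]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)

for mode in ("binary", "count", "freq", "tfidf"):
    matrix = tokenizer.texts_to_matrix(texts, mode=mode)
    print(mode, matrix.shape)   # (3, vocabulary_size + 1); column 0 is reserved
```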

Hichem Felouat - [email protected] - 2024 8




Sequence Padding
• Sequence padding is the process of adding zeroes or other filler
tokens to sequences of variable length so that all sequences have the
same length.
• Many machine learning models require fixed-length inputs, and
variable-length sequences can't be fed directly into these models.

sequences = [ [1, 2, 3, 4], [1, 2, 3], [1] ]

maxlen = 4
result: [[1 2 3 4]
         [0 1 2 3]
         [0 0 0 1]]
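A minimal sketch, assuming Keras' pad_sequences; its default "pre" padding matches the result shown above:

```python
# Minimal padding sketch, assuming tensorflow.keras pad_sequences.
from tensorflow.keras.preprocessing.sequence import pad_sequences

sequences = [[1, 2, 3, 4], [1, 2, 3], [1]]
padded = pad_sequences(sequences, maxlen=4)   # padding="pre" is the default
print(padded)
# [[1 2 3 4]
#  [0 1 2 3]
#  [0 0 0 1]]
```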
Hichem Felouat - [email protected] - 2024 10
Deep Learning-Based NLP

Hichem Felouat - [email protected] - 2024 11


Word Embedding

• Word embedding is a technique used in NLP to represent words as numerical vectors in a high-dimensional space.

• Word embedding aims to capture the meaning and context of words in a way that is useful for downstream NLP tasks, such as text classification, sentiment analysis, and machine translation.

• There are several popular algorithms for creating word embeddings, such as Word2Vec, GloVe, and fastText.
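As a rough illustration, a trainable embedding layer in Keras maps each word index to a dense vector; the vocabulary size and dimension below are illustrative, not from the slides:

```python
# Minimal sketch of a trainable embedding layer; 10000 and 128 are illustrative values.
import tensorflow as tf

embedding = tf.keras.layers.Embedding(input_dim=10_000, output_dim=128)

token_ids = tf.constant([[3, 2, 8, 0]])      # one padded sequence of word indices
vectors = embedding(token_ids)               # shape: (1, 4, 128), one vector per token
print(vectors.shape)
```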

Hichem Felouat - [email protected] - 2024 12




Recurrent Neural Network (RNN)
• The simplest possible RNN is composed of one neuron receiving inputs, producing an output, and sending that output back to itself (figure, left).

• We can represent this tiny network against the time axis, as shown in the figure (right). This is called unrolling the network through time.

Hichem Felouat - [email protected] - 2024 14


Recurrent Neural Network (RNN)
• You can easily create a layer of recurrent neurons. At each time step t, every
neuron receives both the input vector x(t) and the output vector from the
previous time step y(t–1).
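A minimal sketch of such a layer using Keras' SimpleRNN; the batch, sequence length, and feature sizes are illustrative:

```python
# Sketch of a layer of recurrent neurons (Keras SimpleRNN); shapes are illustrative.
import tensorflow as tf

layer = tf.keras.layers.SimpleRNN(16, return_sequences=True)   # 16 recurrent neurons
x = tf.random.normal((2, 5, 8))   # x(t) for every time step t
y = layer(x)                      # y(t) depends on x(t) and y(t-1)
print(y.shape)                    # (2, 5, 16)
```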

A recurrent neuron (left) unrolled through time (right)


Hichem Felouat - [email protected] - 2024 15
Recurrent Neural Network (RNN)

• Seq-to-seq (top left), seq-to-vector (top right), vector-to-seq (bottom left), and Encoder–Decoder (bottom right) networks.
Recurrent Neural Network (RNN)

Deep RNN (left) unrolled through time (right)


Hichem Felouat - [email protected] - 2024 17
Long Short-Term Memory (LSTM)
• As data traverses an RNN, some information is lost at each time step. After a while, the RNN's state contains virtually no trace of the first inputs.

Hichem Felouat - [email protected] - 2024 18


Gated Recurrent Unit (GRU)
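Both the LSTM and the GRU can be used as drop-in recurrent layers; a minimal Keras sketch with illustrative sizes:

```python
# Minimal sketch of LSTM and GRU layers, which keep a longer-term state than a
# SimpleRNN; layer sizes are illustrative, not from the slides.
import tensorflow as tf

inputs = tf.keras.Input(shape=(None, 8))              # variable-length sequences of 8 features
h = tf.keras.layers.LSTM(32, return_sequences=True)(inputs)
h = tf.keras.layers.GRU(32)(h)                        # keeps only the last output
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(h)
model = tf.keras.Model(inputs, outputs)
model.summary()
```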

Hichem Felouat - [email protected] - 2024 19


Bidirectional RNNs
For example, in Neural Machine Translation it is often preferable to look ahead at the next words before encoding a given word.

• Consider the phrases "the queen of the United Kingdom", "the queen of hearts", and "the queen bee": to properly encode the word "queen", you need to look ahead.
Hichem Felouat - [email protected] - 2024 20
Bidirectional RNNs
• To implement this, run two recurrent layers on the same inputs, one reading the words from left to right and the other reading them from right to left, then simply concatenate their outputs.
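A minimal sketch with Keras' Bidirectional wrapper, which runs the two directions and concatenates their outputs; sizes are illustrative:

```python
# Sketch of a bidirectional recurrent layer: one LSTM reads left-to-right,
# a copy reads right-to-left, and the outputs are concatenated.
import tensorflow as tf

bi = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(32, return_sequences=True),
    merge_mode="concat",            # concatenate the two directions (the default)
)
x = tf.random.normal((2, 7, 16))    # illustrative batch of embedded sequences
print(bi(x).shape)                  # (2, 7, 64): 32 units per direction
```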

Hichem Felouat - [email protected] - 2024 21


Self-Attention

Hichem Felouat - [email protected] - 2024 22


Self-Attention
• The following sentence is an input sentence we want to translate: "The animal didn't cross the street because it was too tired."

• What does "it" in this sentence refer to? Is it referring to the street or to the animal? It's a simple question for a human, but not as simple for an algorithm.

• When the model is processing the word "it", self-attention allows it to associate "it" with "animal".
Hichem Felouat - [email protected] - 2024 23
Self-Attention

As we are encoding the word "it" in encoder #5 (the top encoder in the stack), part of the attention mechanism
was focusing on "The Animal", and baked a part of its representation into the encoding of "it".
Hichem Felouat - [email protected] - 2024 24
Self-Attention in Detail
Multiplying x1 by the WQ weight matrix produces q1, the "query" vector associated with that word. We end up creating a "query", a "key", and a "value" projection of each word in the input sentence.
Hichem Felouat - [email protected] - 2024 25
Self-Attention in Detail

The score for each word is obtained by taking the dot product of its query vector with the key vector of every word; the scores are divided by the square root of the key dimension, passed through a softmax, and used to weight the value vectors.

Hichem Felouat - [email protected] - 2024 26


Matrix Calculation of Self-Attention

Every row in the X matrix corresponds to a word in the input sentence.
Hichem Felouat - [email protected] - 2024 27
The Attention Mechanism from Scratch
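A from-scratch numpy sketch of scaled dot-product self-attention (queries and keys are compared with dot products, the scores are softmaxed, and the values are summed with those weights); all sizes are illustrative:

```python
# Scaled dot-product self-attention from scratch:
# scores = Q K^T / sqrt(d_k), softmax over each row, then a weighted sum of V.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))                     # one row per input word (3 words, dim 4)
WQ, WK, WV = (rng.normal(size=(4, 3)) for _ in range(3))

Q, K, V = X @ WQ, X @ WK, X @ WV
scores = Q @ K.T / np.sqrt(K.shape[-1])         # (3, 3) attention scores
weights = softmax(scores, axis=-1)              # each row sums to 1
Z = weights @ V                                 # one attended output vector per word
print(Z)
```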

Hichem Felouat - [email protected] - 2024 28




Multi-Headed Attention

Multi-Headed Attention improves the performance of the attention layer in two ways:

• It expands the model's ability to focus on different positions. Yes, in the example above, z1 contains a little bit of every other encoding, but it could be dominated by the actual word itself.

• It gives the attention layer multiple representation subspaces.
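A sketch using Keras' built-in MultiHeadAttention layer; the number of heads and dimensions are illustrative:

```python
# Multi-headed self-attention: each of the 8 heads projects the input into its
# own query/key/value subspace. Sizes are illustrative, not from the slides.
import tensorflow as tf

mha = tf.keras.layers.MultiHeadAttention(num_heads=8, key_dim=64)
x = tf.random.normal((1, 10, 512))               # 1 sentence, 10 tokens, d_model = 512
z, attn = mha(query=x, value=x, key=x, return_attention_scores=True)
print(z.shape)      # (1, 10, 512): per-token outputs after concatenating the heads
print(attn.shape)   # (1, 8, 10, 10): one 10x10 attention map per head
```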

Hichem Felouat - [email protected] - 2024 30




Multi-Headed Attention

As we encode the word "it", one attention head is focusing most on If we add all the attention heads to the
"the animal", while another is focusing on "tired" , in a sense, the picture, however, things can be harder to
model's representation of the word "it" bakes in some of the interpret.
representation of both "animal" and "tired".
Hichem Felouat - [email protected] - 2024 32
Transformer
Positional Encoding:
The transformer adds a vector to each input embedding. These vectors
follow a specific pattern that the model learns, which helps it determine
the position of each word or the distance between different words in
the sequence.
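The original paper uses fixed sinusoidal vectors for this (learned positional embeddings are a common alternative); a minimal numpy sketch of the sinusoidal encoding:

```python
# Sinusoidal positional encoding from "Attention Is All You Need":
# PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)).
# The vector for each position is added to the corresponding input embedding.
import numpy as np

def positional_encoding(max_len, d_model):
    positions = np.arange(max_len)[:, None]                   # (max_len, 1)
    dims = np.arange(d_model)[None, :]                        # (1, d_model)
    angles = positions / np.power(10000, (2 * (dims // 2)) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                     # even dimensions
    pe[:, 1::2] = np.cos(angles[:, 1::2])                     # odd dimensions
    return pe

# embeddings = token_embeddings + positional_encoding(seq_len, d_model)
print(positional_encoding(50, 128).shape)   # (50, 128)
```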

Hichem Felouat - [email protected] - 2024 33




Transformer

Attention Is All You Need
https://arxiv.org/abs/1706.03762

Hichem Felouat - [email protected] - 2024 36


Transformer

The Annotated Transformer: a line-by-line implementation
http://nlp.seas.harvard.edu/annotated-transformer/

Hichem Felouat - [email protected] - 2024 37




Vision Transformer (ViT)

Hichem Felouat - [email protected] - 2024 39


Vision Transformers (ViTs) vs CNNs

Performance benchmark comparison of Vision Transformers (ViT) with ResNet and MobileNet when trained
from scratch on ImageNet.
Hichem Felouat - [email protected] - 2024 40
Vision Transformers (ViTs) vs CNNs
The authors in [1] demonstrated that CNNs trained on ImageNet are strongly biased
towards recognizing textures rather than shapes. Below is an excellent example of
such a case:

[1]: ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. https://arxiv.org/abs/1811.12231
Hichem Felouat - [email protected] - 2024 41
Vision Transformers (ViTs) vs CNNs
• Neuroscience studies (The importance of shape in early lexical learning [1])
showed that object shape is the single most important cue for human object
recognition.

• By studying the visual pathway of humans regarding image recognition, researchers identified that the perception of object shape is invariant to most perturbations. So, as far as we know, shape is the most reliable cue.

• Intuitively, the object shape remains relatively stable, while other cues can be easily distorted by all sorts of noise [2].

1: https://psycnet.apa.org/doi/10.1016/0885-2014(88)90014-7
2: https://arxiv.org/abs/1811.12231
Hichem Felouat - [email protected] - 2024 42
Vision Transformers (ViTs) vs CNNs

Accuracies and example stimuli for five different experiments without cue conflict.
Source: https://arxiv.org/abs/1811.12231
Hichem Felouat - [email protected] - 2024 43
Vision Transformers (ViTs) vs CNNs
• The texture is not sufficient for determining whether the zebra is rotated. Thus,
predicting rotation requires modeling shape, to some extent.

• The object's shape can be invariant to rotations.

Hichem Felouat - [email protected] - 2024 44


Vision Transformers (ViTs) vs CNNs
Self-attention captures long-range dependencies and contextual information in the input data. The self-attention mechanism allows a ViT model to attend to different regions of the input data based on their relevance to the task at hand.

Raw images (left) and attention maps of ViT-S/16 with (right) and without (middle).
https://arxiv.org/abs/2106.01548

Hichem Felouat - [email protected] - 2024 45


Vision Transformers (ViTs) vs CNNs

The authors in [1] looked at the self-attention of the CLS token on the heads of the last layer. Crucially, no labels are used
during the self-supervised training. These maps demonstrate that the learned class-specific features lead to remarkable
unsupervised segmentation masks and visibly correlate with the shape of semantic objects in the images.
1: Self-Supervised Vision Transformers with DINO. https://arxiv.org/abs/2104.14294
Hichem Felouat - [email protected] - 2024 46
Vision Transformers (ViTs) vs CNNs
• The adversarial perturbations computed for a ViT and a ResNet model.

• The adversarial perturbations are qualitatively very different even though both models may
perform similarly in image recognition.

ViTs and ResNets process their inputs very differently. https://arxiv.org/abs/2103.14586


Hichem Felouat - [email protected] - 2024 47
Vision Transformers (ViTs) vs CNNs
• The transformer can attend to all the
tokens (image patches) at each block
by design. The originally proposed ViT
model in [1] already demonstrated that
heads from early layers tend to attend
to far-away pixels, while heads from
later layers do not.

How heads of different layers attend to their surrounding pixels [1].
[1]: https://arxiv.org/abs/2010.11929
Hichem Felouat - [email protected] - 2024 48
Vision Transformers (ViTs)
How the Vision Transformer Works (steps 1-5 are sketched in code after the list):

1. Split an image into patches
2. Flatten the patches
3. Produce lower-dimensional linear embeddings from the flattened patches
4. Add positional embeddings
5. Feed the sequence as an input to a standard transformer encoder
6. Pretrain the model with image labels (fully supervised on a huge dataset)
7. Finetune on the downstream dataset for image classification
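A minimal Keras sketch of steps 1-5 with illustrative sizes (224x224 images, 16x16 patches, d_model = 768); the class token and classification head are omitted for brevity:

```python
# Sketch: patchify, flatten, linearly embed, add positional embeddings, and run
# one standard transformer encoder block (a real ViT stacks several of these).
import tensorflow as tf

image_size, patch_size, d_model, num_heads = 224, 16, 768, 12
num_patches = (image_size // patch_size) ** 2          # 196 patches

images = tf.random.normal((1, image_size, image_size, 3))

# 1-2) split into patches and flatten each one
patches = tf.image.extract_patches(
    images, sizes=[1, patch_size, patch_size, 1],
    strides=[1, patch_size, patch_size, 1], rates=[1, 1, 1, 1], padding="VALID")
patches = tf.reshape(patches, (1, num_patches, patch_size * patch_size * 3))

# 3) lower-dimensional linear embedding of each flattened patch
tokens = tf.keras.layers.Dense(d_model)(patches)

# 4) add positional embeddings (learned, one per patch position)
positions = tf.range(num_patches)
tokens = tokens + tf.keras.layers.Embedding(num_patches, d_model)(positions)

# 5) one standard transformer encoder block
attn = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model // num_heads)
h = tf.keras.layers.LayerNormalization()(tokens)
h = tokens + attn(h, h)
mlp = tf.keras.Sequential([tf.keras.layers.Dense(4 * d_model, activation="gelu"),
                           tf.keras.layers.Dense(d_model)])
encoded = h + mlp(tf.keras.layers.LayerNormalization()(h))
print(encoded.shape)   # (1, 196, 768)
```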
Hichem Felouat - [email protected] - 2024 49
Vision Transformers (ViTs)

https://github.com/hichemfelouat/my-codes-of-machine-learning/blob/master/Vision_Transformer_(ViT)_for_Image_Classification_(cifar10_dataset).ipynb
Hichem Felouat - [email protected] - 2024 50
Vision Transformers (ViTs)

Global Context Vision Transformer (GC ViT):
https://github.com/NVlabs/GCViT

Hichem Felouat - [email protected] - 2024 51


Large Language Models

A Survey of Large Language Models:
https://arxiv.org/abs/2303.18223
Vision Language Models

The architecture of MiniGPT-4
https://minigpt-4.github.io
Hichem Felouat - [email protected] - 2024 53
Vision Language Models

https://github.com/Vision-CAIR/MiniGPT-4
Hichem Felouat - [email protected] - 2024 54
Thank You For Attending
Q&A

Hichem Felouat …

Hichem Felouat - [email protected] - 2024 55
