Unit 6

Transfer learning is a machine learning technique that reuses pre-trained models for new tasks, enabling efficient training with less data. Key concepts include feature extraction, fine-tuning, and domain adaptation, with applications in computer vision and natural language processing. The Reformer model addresses the limitations of traditional transformers by using Locality-Sensitive Hashing for efficient attention and reversible layers for memory efficiency.

BUILDING MODELS / CASE STUDIES

TRANSFER LEARNING
 Transfer learning is a machine learning (ML) technique where an already developed ML model is reused in another task.
 Transfer learning is a popular approach in deep learning, as it enables the training of deep neural networks with less data.
TRANSFER LEARNING
 Key Concepts of Transfer Learning
 Pre-trained Models: These are models that have already been trained on large datasets (like ImageNet for images). Instead of starting from zero, you can use these models to help with your specific task.
 Feature Extraction: The earlier layers of a neural network usually learn general features, such as shapes and colors. The later layers learn more specific features for the original task. In transfer learning, you can keep the early layers fixed (not changing them) and only train the later layers on your new task (see the sketch below).
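
A minimal sketch of feature extraction, assuming PyTorch and torchvision are installed; the ResNet-18 backbone and NUM_CLASSES are illustrative choices, not part of the original slides.

# Feature extraction: freeze a pre-trained backbone, train only a new head.
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 5  # hypothetical number of classes in the new task

# Load a model pre-trained on ImageNet.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Keep the early, general-feature layers fixed (frozen).
for param in model.parameters():
    param.requires_grad = False

# Replace the final classification layer; only this new head will be trained.
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

# Only the head's parameters are given to the optimizer.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)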
TRANSFER LEARNING
 Fine-tuning: This means making small adjustments to the model's weights by training it a bit on your new dataset. This helps the model learn the specifics of your new task without losing what it already knows (a code sketch follows at the end of this slide).
 Domain Adaptation: Sometimes, the
original task and your task are quite
different. Domain adaptation helps adjust the
model so it can work better on the new task.
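
Continuing the feature-extraction sketch above, fine-tuning could look like the following; which layers to unfreeze and the learning rates are illustrative choices.

# Fine-tuning: unfreeze the last residual block of the ResNet-18 above and
# update it gently, so the pre-trained weights shift only slightly.
for param in model.layer4.parameters():
    param.requires_grad = True

optimizer = torch.optim.Adam([
    {"params": model.layer4.parameters(), "lr": 1e-5},  # small steps for pre-trained weights
    {"params": model.fc.parameters(), "lr": 1e-3},      # the new head can learn faster
])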
APPLICATIONS OF TRANSFER LEARNING
 Computer Vision: Using models trained on large datasets for specific tasks, like recognizing medical images.
 Natural Language Processing: Using pre-trained language models (like BERT or GPT) to perform tasks such as understanding sentiment or identifying names in text.
ADVANTAGES OF TRANSFER LEARNING
 Reduced Training Time: Since the model already knows a lot, it requires less time to train on the new task.
 Better Performance: Especially when you have limited data, transfer learning can give better results compared to training a new model from scratch.
 Lower Resource Requirements: It usually needs less computing power and memory.
CHALLENGES
 Negative Transfer: If the original task is
very different from the new task, the model
might perform worse than if it had been
trained from scratch.
 Domain Shift: If the pre-trained model was trained in a different environment, it might not work well on your new data.
LINK TO CODE
 https://colab.research.google.com/drive/1_UZ2xtL6Ejvqj6CjLXc5xAWvLwCD-ese?usp=sharing
BERT
BERT (Bidirectional Encoder Representations
from Transformers) is a powerful language
model developed by Google in 2018.
 It’s designed to understand language better than earlier models, making it highly effective for natural language processing (NLP) tasks.
BERT
 Bidirectional Context: Unlike traditional models that process text left-to-right (or right-to-left), BERT reads text bidirectionally, capturing the full context of a word by looking at both its previous and next words.
 This leads to better comprehension of nuanced language.
BERT
 Transformer Architecture: BERT is based on the Transformer, a deep learning architecture that uses attention mechanisms to weigh the importance of different words in a sentence.
 This allows it to handle long-range dependencies and relationships between words.
BERT
 Pre-trained on Large Corpora: BERT is pre-trained on large text corpora (like Wikipedia and BookCorpus) using two tasks: masked language modeling (MLM) and next sentence prediction (NSP); a short MLM example follows below.
 This helps it generalize well across different NLP tasks.
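
A short illustration of the masked language modeling objective, assuming the Hugging Face transformers library and the public bert-base-uncased checkpoint; the example sentence is made up.

# Masked language modeling: BERT predicts the token hidden behind [MASK].
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
for prediction in unmasker("BERT is pre-trained on large text [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))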
ARCHITECTURE: TRANSFORMER-BASED
 Capture Long-Range Dependencies: Self-attention enables BERT to weigh relationships between all words in a sentence simultaneously, rather than sequentially, as in recurrent neural networks (RNNs).
 Parallelize Processing: Unlike RNNs, Transformers process
tokens in parallel, allowing for much faster training.
 The Transformer consists of:
 Self-Attention Layers: Compute how much attention each
word should pay to every other word in the sequence.
 Feedforward Neural Networks: After the self-attention
layer, these apply transformations to enrich token
embeddings.
 Positional Encoding: Since self-attention has no inherent notion of word order, BERT adds positional embeddings to encode the order of words in a sentence.
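
A minimal NumPy sketch of the scaled dot-product self-attention these layers compute; the shapes and random weights are only illustrative.

# Scaled dot-product self-attention for a single head (illustrative shapes).
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model); Wq, Wk, Wv: (d_model, d_k) projection matrices."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # attention of every word to every other word
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the sequence
    return weights @ V                                 # context-mixed token representations

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 16))                           # 8 tokens, 16-dimensional embeddings
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)             # (8, 16)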
BIDIRECTIONAL TRAINING
 Understand Polysemy: By considering the
surrounding words, BERT can interpret words
with multiple meanings correctly based on
context (e.g., "bank" in "river bank" vs.
"savings bank").
 Improve Contextual Understanding:
Bidirectional training enables BERT to
understand more intricate dependencies,
such as handling complex negations or
sarcasm.
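
A sketch of how the "bank" example can be observed, assuming the Hugging Face transformers library and PyTorch: the contextual vector for "bank" differs between the two sentences.

# Polysemy: the same word gets different contextual vectors in different sentences.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]            # (seq_len, 768)
    bank_id = tokenizer.convert_tokens_to_ids("bank")
    position = inputs["input_ids"][0].tolist().index(bank_id)    # where "bank" sits
    return hidden[position]

v_river = bank_vector("He sat on the river bank.")
v_money = bank_vector("She deposited cash at the bank.")
print(torch.cosine_similarity(v_river, v_money, dim=0))          # noticeably below 1.0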
COMMON APPLICATIONS
 Text Classification: Assigning categories to
text (e.g., spam detection, sentiment
analysis).
 Question Answering: Extracting answers from text for given questions.
 Named Entity Recognition (NER):
Identifying names, dates, locations, etc.,
within text.
 Sentence Similarity: Evaluating how
similar two sentences are, which is helpful in
paraphrasing and semantic search.
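
As a brief illustration of the text classification use case, a pipeline sketch assuming the Hugging Face transformers library (it downloads a default English sentiment checkpoint; the example inputs are made up).

# Sentiment analysis with a pre-trained transformer classifier.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier(["I love this course!", "The lecture was far too long."]))
# e.g. [{'label': 'POSITIVE', 'score': ...}, {'label': 'NEGATIVE', 'score': ...}]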
REFORMER MODEL
 Traditional transformer models (like BERT and GPT)
revolutionized natural language processing (NLP) by
introducing mechanisms that allowed them to
understand context and relationships in text.
However, these models have limitations:
 Complexity: The standard self-attention
mechanism computes attention scores between
every pair of tokens, leading to quadratic complexity
(O(n²)), which makes it computationally expensive
and memory-intensive for long sequences.
 Memory Constraints: As the length of input
sequences increases, the memory required to store
intermediate activations also increases, limiting the
length of sequences that can be processed.
THE REFORMER ARCHITECTURE
 The Reformer model addresses these limitations through two key innovations:
 Locality-Sensitive Hashing (LSH)
 Locality-Sensitive Hashing (LSH) is used in the attention mechanism to reduce the number of tokens each token attends to. Instead of computing attention scores for all pairs of tokens, LSH groups tokens based on similarity, allowing each token to attend only to its closest neighbors.
THE REFORMER ARCHITECTURE
 Example:
 Suppose you have a sequence of 8 tokens: ["I",
"love", "to", "learn", "about", "deep", "learning",
"today"].
 In traditional self-attention, every token would
compute attention scores with every other token
(8x8 matrix).
 With LSH, you might hash these tokens into
buckets based on their similarity (e.g., based on
their embeddings), allowing each token to only
attend to other tokens within the same bucket.
 This drastically reduces the computational
complexity from O(n²) to O(n log n), making it
feasible to process longer sequences.
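
A simplified sketch of the bucketing idea from the example above, using a random projection hash; the real Reformer shares queries and keys and uses several hash rounds, so this only illustrates how tokens could be grouped.

# LSH-style bucketing (simplified): hash token embeddings so that similar
# tokens tend to land in the same bucket, then attend only within a bucket.
import numpy as np

rng = np.random.default_rng(0)
tokens = ["I", "love", "to", "learn", "about", "deep", "learning", "today"]
embeddings = rng.normal(size=(len(tokens), 16))        # stand-in token embeddings

n_buckets = 4
projection = rng.normal(size=(16, n_buckets))          # random directions used as the hash
buckets = np.argmax(embeddings @ projection, axis=-1)  # bucket id for each token

for b in range(n_buckets):
    members = [t for t, bucket in zip(tokens, buckets) if bucket == b]
    if members:
        print(f"bucket {b}: attention computed only among {members}")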
THE REFORMER ARCHITECTURE
 Reversible Layers
 Reformer's architecture includes reversible layers, which allow for memory-efficient training.
 Instead of storing all intermediate activations for each layer, reversible networks compute the output of each layer based on the output of the previous layer and can reconstruct previous activations during backpropagation.
THE REFORMER ARCHITECTURE
 Example:
 In a traditional feed-forward neural network, each layer’s output is saved for backpropagation.
 If you have 12 layers, you need to store 12 outputs, consuming significant memory.
 In a reversible layer, when you compute the output of layer n, you can discard the output of layer n−1, since it can be computed again during backpropagation.
 This means you only need to store the input to the first layer and the output of the last layer, drastically reducing memory usage.
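
A toy sketch of a reversible block in the RevNet style that Reformer builds on: the input is split into two halves, and the inverse recovers them exactly from the outputs, so activations can be recomputed instead of stored. F and G stand in for the real attention and feed-forward sublayers.

# Reversible block: y1 = x1 + F(x2), y2 = x2 + G(y1); the inverse recovers x1, x2.
import numpy as np

F = lambda x: np.tanh(x)        # stand-in for the attention sublayer
G = lambda x: 0.5 * x           # stand-in for the feed-forward sublayer

def forward(x1, x2):
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def inverse(y1, y2):
    x2 = y2 - G(y1)             # recompute instead of storing
    x1 = y1 - F(x2)
    return x1, x2

x1, x2 = np.random.default_rng(0).normal(size=(2, 4))
y1, y2 = forward(x1, x2)
r1, r2 = inverse(y1, y2)
print(np.allclose(x1, r1), np.allclose(x2, r2))   # True True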
THE REFORMER ARCHITECTURE
 Attention Mechanism
 The Reformer model employs a modified self-attention mechanism where the attention scores are computed only for the tokens in the same bucket, as determined by LSH.
 The efficient attention mechanism in Reformer uses:
 Chunking (Buckets): The input sequence is divided into smaller chunks, and attention is computed within these chunks.
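
A sketch of chunked attention under the simplifying assumption that tokens attend only within their own fixed-size chunk; the real Reformer also lets each chunk look at neighbouring chunks.

# Chunked attention (simplified): full attention only inside each chunk,
# so the per-chunk cost is chunk_size**2 rather than seq_len**2 overall.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def chunked_attention(Q, K, V, chunk_size):
    out = np.zeros_like(V)
    for start in range(0, len(Q), chunk_size):
        s = slice(start, start + chunk_size)
        scores = Q[s] @ K[s].T / np.sqrt(K.shape[-1])   # attention within one chunk only
        out[s] = softmax(scores) @ V[s]
    return out

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 1024, 64))                # a long sequence of 1,024 tokens
print(chunked_attention(Q, K, V, chunk_size=64).shape)  # (1024, 64)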
