
Unit 4

Transfer Learning Techniques


Introduction to transfer learning
• We humans are very good at transferring knowledge between tasks.
• Similarly, transfer learning is a technique in machine learning where a
model uses knowledge gained on one task to help with a different but related task.
• Instead of learning from scratch, the model builds on what it already knows to solve new
problems faster and better.
• Transfer learning is making a big impact in areas such as natural language
understanding and image recognition.
• Transfer learning involves applying knowledge gained in one domain to another.
In deep learning, pre-trained models are fine-tuned for new tasks, reducing the
need for extensive data and training time.
• For example, if you trained a simple classifier to predict whether an image
contains a backpack, you could use the knowledge that the model gained during
its training to recognize other objects like sunglasses.
• Transfer learning is a technique in machine learning where a model
trained on one task is used as the starting point for a model on a
second task.
• This can be useful when the second task is similar to the first task, or
when there is limited data available for the second task.
• By using the learned features from the first task as a starting point,
the model can learn more quickly and effectively on the second task.
• This can also help to prevent overfitting, as the model will have
already learned general features that are likely to be useful in the
second task.
• With transfer learning, we basically try to exploit what has been learned in
one task to improve generalization in another. We transfer the weights that
a network has learned at “task A” to a new “task B.”
• Transfer learning is mostly used in computer vision and natural language
processing tasks such as sentiment analysis, because training such models
from scratch requires a huge amount of data and computational power.
• Transfer learning isn’t really a machine learning algorithm in itself; it is better
seen as a “design methodology” within the field, much like active learning.
• It is also not an exclusive sub-area of machine learning.
Nevertheless, it has become quite popular in combination with neural
networks, which require huge amounts of data and computational power.
Need for Transfer Learning

• Transfer learning is essential in machine learning for several reasons:


• Limited Data: In many real-world scenarios, obtaining a large amount
of labeled data for training a model from scratch can be difficult and
expensive. Transfer learning allows us to leverage pre-trained models
and their knowledge, reducing the need for vast amounts of data.
• Improved Performance: By starting with a pre-trained model, which
has already learned from a large dataset, we can achieve better
performance on new tasks more quickly. This is especially useful in
applications where accuracy and efficiency are crucial.
• Time and Cost Efficiency: Transfer learning saves time and resources
because it speeds up the training process. Instead of training a new
model from scratch, we can build on existing models and fine-tune
them for specific tasks.
• Adaptability: Models trained on one task can be adapted to perform
well on related tasks. This adaptability makes transfer learning
suitable for a wide range of applications, from image recognition to
natural language processing.
• Transfer learning is also particularly useful when computing
resources are limited. Many state-of-the-art models take several
days, and in some cases weeks, to train even on highly powerful
GPU machines. Rather than repeating that expensive process,
transfer learning allows us to use pre-trained weights as a
starting point.
• Different transfer learning strategies and techniques are applied
based on the domain of the application, the task at hand, and the
availability of data.
• Before deciding on a transfer learning strategy, it is crucial
to answer the following questions:
• Which part of the knowledge can be transferred from the source
to the target to improve the performance of the target task?
• When to transfer and when not to, so that one improves the
target task performance/results and does not degrade them?
• How to transfer the knowledge gained from the source model
based on our current domain/task?
How Transfer Learning Works
• This is a general summary of how transfer learning works:
• Pre-trained Model: Start with a model that has previously been trained for a
certain task using a large dataset. Because it was trained on extensive data,
this model has already identified general features and patterns that are
relevant to many related tasks.
• Base Model: The pre-trained model is known as the base model.
It is made up of layers that have learned hierarchical feature representations
from the input data.
• Transfer Layers: In the pre-trained model, find a set of layers that capture
generic information relevant to the new task as well as the original one.
Because they learn low-level, generic features, these layers are usually
found near the beginning (input side) of the network.
• Fine-tuning: Retrain the chosen layers using the dataset from the new task;
this procedure is called fine-tuning. The goal is to preserve the knowledge
from pre-training while allowing the model to adjust its parameters to better
suit the demands of the new task.
• Frozen and Trainable Layers:
• In transfer learning, there are two main components: frozen layers and modifiable
layers.
1. Frozen Layers: These are the layers of a pre-trained model that are kept unchanged
during the fine-tuning process. Frozen layers retain the knowledge learned from the
original task and are used to extract general features from the input data.
2. Modifiable Layers: These are the layers of the model that are adjusted or re-trained
during fine-tuning. Modifiable layers learn task-specific features from the new
dataset. By focusing on these layers, the model can adapt to the specific
requirements of the new task.
• Now, one may ask how to determine which layers to freeze and which
layers to train.
• The answer is simple: the more you want to inherit features from the
pre-trained model, the more layers you have to freeze.
• Let’s consider all the situations where the size of the target dataset and its
similarity to the original training data vary.
• Transfer Learning Scenarios
1. New dataset that is small and similar to the original training dataset
2. New dataset that is small but different from the original training dataset
3. New dataset that is large and similar to the original training dataset
4. New dataset that is large but different from the original training dataset
• If the new dataset is small and similar to the original training
data:
• remove the final fully connected (output) layer of the network;
• add a new fully connected layer whose output dimension equals the
number of classes in the new dataset;
• randomize the weights of the new fully connected layer;
• freeze all the weights from the pre-trained network;
• train the network to update only the weights of the new fully connected layer.
• All the convolutional layers of the pre-trained model are kept constant, i.e.
frozen, because the images are similar and those layers already capture the
relevant higher-level features. Retraining them on such a small dataset would
tend to overfit, so the weights of the original pre-trained model are held
constant and not retrained. A minimal sketch of this setup is shown below.
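Below is a minimal Keras sketch of this scenario. It assumes an ImageNet-pre-trained VGG16 base and a hypothetical 5-class target dataset; names such as num_classes are placeholders and not part of the original slides.

```python
# Minimal sketch: small dataset that is similar to the original training data.
# Assumption: TensorFlow/Keras with an ImageNet-pre-trained VGG16 base.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

num_classes = 5  # hypothetical number of classes in the new dataset

# Load the pre-trained network without its original fully connected head.
base = VGG16(weights="imagenet", include_top=False, pooling="avg",
             input_shape=(224, 224, 3))

# Freeze all the pre-trained weights.
base.trainable = False

# Add a new, randomly initialized fully connected layer for the new classes.
model = models.Sequential([
    base,
    layers.Dense(num_classes, activation="softmax"),
])

# Only the new layer's weights will be updated during training.
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```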
• If the new dataset is small and different from the original training data, the approach is as
follows:
• remove the fully connected layers at the end of the network and some of the later CNN layers;
• add a new fully connected layer whose output dimension equals the number of classes in
the new dataset;
• randomize the weights of the new fully connected layer;
• freeze all the weights of the remaining pre-trained CNN layers;
• train the network to update only the weights of the new fully connected layer.
• In this case the dataset is small but different. Because the data are still images, we keep the
beginning of the network, which extracts generic low-level features, and remove the later CNN
layers that extract the higher-level features specific to the original task. Retraining on such a
small dataset would again tend to overfit, so the remaining pre-trained weights are held
constant and not retrained. A sketch of this setup follows.
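Below is a minimal Keras sketch of this scenario, again assuming an ImageNet-pre-trained VGG16. The cut-off layer block3_pool is one of VGG16's intermediate pooling layers and is used here only as an illustrative truncation point.

```python
# Minimal sketch: small dataset that is different from the original training data.
# Keep only the earlier convolutional blocks of a pre-trained VGG16.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

num_classes = 5  # hypothetical number of classes in the new dataset

full_base = VGG16(weights="imagenet", include_top=False,
                  input_shape=(224, 224, 3))

# Cut the network after an earlier block so only low-level features are reused.
cutoff = full_base.get_layer("block3_pool").output
truncated_base = models.Model(inputs=full_base.input, outputs=cutoff)
truncated_base.trainable = False  # freeze the remaining pre-trained layers

# New classification head on top of the truncated feature extractor.
model = models.Sequential([
    truncated_base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(num_classes, activation="softmax"),
])

model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```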
• If the new dataset is large and similar to the original training
data, the approach is as follows:
• remove the final fully connected (output) layer of the network;
• add a new fully connected layer whose output dimension equals the
number of classes in the new dataset;
• randomize the weights of the new fully connected layer;
• initialize the remaining layers with the weights from the pre-trained network;
• train the whole network to update all of the weights.
• Since the new dataset is similar to the original training data, the higher-layer
features are not removed from the pre-trained network. Because the dataset is
large, overfitting is not a major concern; therefore, we can re-train all of the weights.
• If the new dataset is large and different from the
original training data, the approach is as follows:
• remove the final fully connected layer and add
a new fully connected layer whose output dimension
equals the number of classes in the new dataset;
• randomize the weights of the new fully connected layer;
the remaining layers can be initialized either with random
weights or with the pre-trained weights;
• train the whole network to update all of the weights.
• In this case, the CNN layers are mostly retrained from scratch,
but we could just as well initialize them with the pre-trained weights.
Difference between transfer
learning and fine-tuning
• Fine-tuning is an optional step in transfer learning. Fine-tuning will
usually improve the performance of the model. However, since it
retrains the entire model (or a large part of it), it can easily overfit.
• Overfitting is avoidable: retrain the model, or part of it, using a low
learning rate. This is important because it prevents large weight
updates that would destroy the pre-trained features and result in
poor performance.
• Using a callback to stop the training process when the model has
stopped improving is also helpful.
Transfer Learning Process
1. Obtain a pre-trained model. The first step is to choose
the pre-trained model we would like to keep as the base
of our training, depending on the task.
2. Create a base model.
3. Freeze layers.
4. Add new trainable layers.
5. Train the new layers.
6. Fine-tune your model.
Step-1 Obtain the pre-trained model
• The first step is to get the pre-trained model that you would like to use for
your problem and keep it as the base of your training, depending on the task.
Transfer learning requires a strong correlation between the knowledge of the
pre-trained source model and the target task domain for
them to be compatible.
• Here are some of the pre-trained models you can use:
• For computer vision:
• VGG-16, VGG-19, Inception V3, Xception, ResNet-50
• For NLP tasks:
• Word2Vec, GloVe, FastText
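For the computer vision case, a pre-trained model can be obtained directly from Keras Applications; the following is a minimal sketch (assuming TensorFlow is installed), not the only way to obtain such a model.

```python
# Minimal sketch: obtain pre-trained computer vision models from Keras Applications.
from tensorflow.keras.applications import VGG16, ResNet50

# Download ImageNet weights for the chosen architectures.
vgg = VGG16(weights="imagenet")        # full model, including its original classifier
resnet = ResNet50(weights="imagenet")  # an alternative pre-trained base

vgg.summary()  # inspect the layers available for transfer
```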
Step-2 Create a base model
• Usually, the first step is to instantiate the base model using one of the
architectures such as ResNet or Xception.
• You can also optionally download the pre-trained weights. If you don’t
download the weights, you will have to use the architecture to train your
model from scratch.
• Recall that the base model will usually have more units in the final output
layer than you require.
• When creating the base model, you, therefore, have to remove the final
output layer. Later on, you will add a final output layer that is compatible
with your problem.
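A minimal sketch of this step, assuming the Xception architecture from Keras Applications; include_top=False drops the original 1000-class output layer so a new head can be added later.

```python
# Minimal sketch: create a base model without its original output layer.
from tensorflow.keras.applications import Xception

base_model = Xception(
    weights="imagenet",          # download the pre-trained weights
    include_top=False,           # remove the original 1000-class output layer
    input_shape=(299, 299, 3),
    pooling="avg",               # produce one flat feature vector per image
)
```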
Step-3 Freeze layers so they don’t change during training
• Freezing the layers from the pre-trained model is vital. This is because
you don’t want the weights in those layers to be modified during
training. If they are, then you will lose all the learning that has already
taken place, and it will be no different from training the model from
scratch. A one-line sketch follows.
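Continuing the base-model sketch from Step 2 (the names below are the same hypothetical ones), freezing the pre-trained layers in Keras is a single flag:

```python
# Minimal sketch: freeze the pre-trained layers so they are not updated.
base_model.trainable = False

# Alternatively, freeze only a subset of layers:
# for layer in base_model.layers[:-4]:
#     layer.trainable = False
```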
Step-4 Add new trainable layers
• The next step is to add new trainable layers that will turn old features
into predictions on the new dataset. This is important because the
pre-trained model is loaded without the final output layer.
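Continuing the same hypothetical sketch, new trainable layers can be stacked on top of the frozen base; num_classes is a placeholder for the number of classes in the new dataset.

```python
# Minimal sketch: add new trainable layers on top of the frozen base model.
from tensorflow.keras import layers, models

num_classes = 2  # hypothetical number of classes in the new dataset

model = models.Sequential([
    base_model,                                       # frozen feature extractor
    layers.Dense(128, activation="relu"),             # new trainable layer
    layers.Dropout(0.3),                              # regularization for a small dataset
    layers.Dense(num_classes, activation="softmax"),  # new output layer
])
```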
Step-5 Train the new layers on the dataset
• Remember that the pre-trained model’s final output will most likely
be different from the output that you want for your model.
• For example, pre-trained models trained on the ImageNet dataset will
output 1000 classes. However, your model might just have two
classes. In this case, you have to train the model with a new output
layer in place.
• Therefore, you will add some new dense layers as you please, but
most importantly, a final dense layer with units corresponding to the
number of outputs expected by your model.
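Continuing the same sketch, training the new layers might look as follows; train_ds and val_ds are hypothetical tf.data.Dataset objects standing in for your actual data.

```python
# Minimal sketch: train only the new layers (the base model stays frozen).
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

history = model.fit(train_ds, validation_data=val_ds, epochs=10)
```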
Step-6 Improve the model via fine-tuning
• Once you have done the previous step, you will have a model that can make predictions on your
dataset. Optionally, you can improve its performance through fine-tuning.
• Fine-tuning is done by unfreezing the base model or part of it and training the entire model
again on the whole dataset at a very low learning rate. The low learning rate will increase the
performance of the model on the new dataset while preventing overfitting.
• The learning rate has to be low because the model is quite large while the dataset is small. This
is a recipe for overfitting, hence the low learning rate.
• Recompile the model once you have made these changes so that they can take effect. This is
because the behavior of a model is frozen whenever you call the compile function. That means
that you have to call the compile function again whenever you want to change the model’s
behavior.
• The next step will be to train the model again while monitoring it via callbacks to ensure it does
not overfit.
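Continuing the same sketch, the fine-tuning step could look like this; the learning rate, epoch count, and callback settings are illustrative values, not prescriptions from the original slides.

```python
# Minimal sketch: fine-tune by unfreezing the base model and recompiling
# at a very low learning rate.
import tensorflow as tf

base_model.trainable = True  # unfreeze the pre-trained layers

# Recompile so the change in trainable state takes effect.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# Monitor validation loss and stop before the model starts to overfit.
callbacks = [tf.keras.callbacks.EarlyStopping(monitor="val_loss",
                                              patience=2,
                                              restore_best_weights=True)]

model.fit(train_ds, validation_data=val_ds, epochs=5, callbacks=callbacks)
```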
Advantages of transfer learning
• Speed up the training process: By using a pre-trained model, the model
can learn more quickly and effectively on the second task, as it already
has a good understanding of the features and patterns in the data.
• Better performance: Transfer learning can lead to better performance
on the second task, as the model can leverage the knowledge it has
gained from the first task.
• Handling small datasets: When there is limited data available for the
second task, transfer learning can help to prevent overfitting, as the
model will have already learned general features that are likely to be
useful in the second task.
Disadvantages of transfer
learning
• Domain mismatch: The pre-trained model may not be well-suited to
the second task if the two tasks are vastly different or the data
distribution between the two tasks is very different.
• Overfitting: Transfer learning can lead to overfitting if the model is
fine-tuned too much on the second task, as it may learn task-specific
features that do not generalize well to new data.
• Complexity: The pre-trained model and the fine-tuning process can
be computationally expensive and may require specialized hardware.
Variants of CNN: DenseNet
• Densely Connected Convolutional Networks (DenseNet) is a feed-
forward convolutional neural network (CNN) architecture that links
each layer to every other layer.
• This allows the network to learn more effectively by reusing features,
hence reducing the number of parameters and enhancing the
gradient flow during training.
• In 2016, Gao Huang et al. presented the architecture in their
DenseNet paper “Densely Connected Convolutional Networks”.
• In a standard ConvNet, the input image goes through
multiple convolutions to obtain high-level features.
• In ResNet, identity mappings (skip connections) are proposed to promote gradient
propagation, using element-wise addition. This can be viewed as passing a state
from one ResNet module to the next.
• In DenseNet, each layer obtains additional inputs from all
preceding layers and passes on its own feature maps to all
subsequent layers. Concatenation is used, so each layer
receives the “collective knowledge” of all preceding
layers.
• Since each layer receives feature maps from all
preceding layers, the network can be thinner and more compact,
i.e. the number of channels per layer can be smaller. The growth rate k is
the number of additional channels contributed by each layer.
• As a result, DenseNet has higher computational efficiency and memory
efficiency. The concatenation during forward propagation can be
illustrated as follows:
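Below is a minimal Keras sketch of this concatenation, assuming a growth rate k = 12 and a four-layer dense block; the layer count and input size are illustrative only and not taken from the original slides.

```python
# Minimal sketch: feature-map concatenation inside a DenseNet dense block.
from tensorflow.keras import layers, Input, Model

k = 12  # growth rate: each layer adds k new feature maps

def composition_layer(x):
    # Pre-activation: BN -> ReLU -> 3x3 Conv producing k feature maps.
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    return layers.Conv2D(k, 3, padding="same")(x)

inputs = Input(shape=(32, 32, 16))
features = [inputs]
x = inputs
for _ in range(4):  # a four-layer dense block (illustrative)
    new_maps = composition_layer(x)
    features.append(new_maps)
    # Each layer receives the concatenation of all preceding feature maps.
    x = layers.Concatenate()(features)

dense_block = Model(inputs, x)
dense_block.summary()  # channel count grows as 16, 16+k, 16+2k, ...
```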
DenseNet Architecture
• 1. Basic DenseNet Composition Layer
• In each composition layer, pre-activation Batch Norm (BN) and ReLU are
applied, followed by a 3×3 Conv that outputs a feature map of k channels,
for example transforming the concatenation of x0, x1, x2, x3 into x4. This
pre-activation ordering is the idea taken from Pre-Activation ResNet.
• 2. DenseNet-B (Bottleneck Layers)
• To reduce the model complexity and size, a BN-ReLU-1×1 Conv
is done before the BN-ReLU-3×3 Conv.
• 3. DenseNet-BC (Further Compression)
• If a dense block produces m feature maps, the transition
layer generates ⌊θm⌋ output feature maps, where 0 < θ ≤ 1 is
referred to as the compression factor.
• When θ = 1, the number of feature maps across transition layers
remains unchanged. DenseNet with θ < 1 is referred to as DenseNet-C,
and θ = 0.5 was used in the experiments.
• When both the bottleneck layers and transition layers with θ < 1
are used, the model is referred to as DenseNet-BC.
• Finally, DenseNets with and without the B/C variants, and with
different depths L and growth rates k, are trained.
• Multiple Dense Blocks with Transition Layers
• 1×1 Conv followed by 2×2 average pooling are used as
the transition layers between two contiguous dense
blocks.
• Feature map sizes are the same within the dense block
so that they can be concatenated together easily.
• At the end of the last dense block, a global average
pooling is performed and then a SoftMax classifier is
attached.
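Below is a minimal Keras sketch of a transition layer with compression, assuming θ = 0.5; the input tensor shape is illustrative only.

```python
# Minimal sketch: DenseNet transition layer with compression factor theta.
from tensorflow.keras import layers, Input, Model

theta = 0.5  # compression factor (theta < 1 gives the "C" in DenseNet-BC)

def transition_layer(x):
    m = x.shape[-1]                  # number of incoming feature maps
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    # 1x1 Conv compresses the channels from m to floor(theta * m).
    x = layers.Conv2D(int(theta * m), 1)(x)
    # 2x2 average pooling halves the spatial resolution between dense blocks.
    return layers.AveragePooling2D(pool_size=2, strides=2)(x)

inputs = Input(shape=(32, 32, 64))   # e.g. the output of a dense block
outputs = transition_layer(inputs)   # -> shape (16, 16, 32)
Model(inputs, outputs).summary()
```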
Advantages of DenseNet
• Performance: As previously stated, DenseNet’s state-of-the-art
performance can be observed in a range of computer vision tasks
including image classification, object detection, and semantic
segmentation.
• Feature Reuse: DenseNet lets each layer access the features of all previous
layers, improving the gradient flow during training and allowing the
network to learn more effectively.
• Overfitting: The DenseNet design successfully tackles overfitting by
lowering the number of parameters and enabling feature reuse,
enhancing the model’s capacity to generalize to unknown data.
• Vanishing Gradients: The DenseNet design mitigates the vanishing
gradient issue by allowing gradients to flow across the whole
network, allowing the training of deeper networks.
• Redundancy: The DenseNet design manages redundancy successfully
by offering feature reuse and lowering the number of parameters,
enhancing the model’s capacity to generalize to unknown data.
Variant of CNN: PixelNet
• Self study
