Unit 4
Transfer Learning Techniques
Introduction to transfer learning
• We humans are very good at transferring knowledge between tasks.
• Similarly, transfer learning is a machine learning method in which a model uses knowledge gained from one task to help with a different but related task.
• Instead of learning from zero, the model uses what it already knows to solve new problems faster and better.
• Transfer learning is making a big impact in areas such as understanding language and recognizing images.
• Transfer learning involves applying knowledge gained in one domain to another. In deep learning, pre-trained models are fine-tuned for new tasks, reducing the need for extensive data and training time.
• For example, if you trained a simple classifier to predict whether an image contains a backpack, you could use the knowledge the model gained during training to recognize other objects such as sunglasses.
• Transfer learning is a technique in machine learning where a model trained on one task is used as the starting point for a model on a second task.
• This is useful when the second task is similar to the first task, or when there is limited data available for the second task.
• By using the learned features from the first task as a starting point, the model can learn more quickly and effectively on the second task.
• This can also help to prevent overfitting, because the model has already learned general features that are likely to be useful in the second task.
• With transfer learning, we basically try to exploit what has been learned in one task to improve generalization in another: we transfer the weights that a network has learned on "task A" to a new "task B" (a small sketch follows this list).
• Transfer learning is mostly used in computer vision and natural language processing tasks, such as sentiment analysis, because of the huge amount of computational power these tasks require.
• Transfer learning is not really a single machine learning technique; it can be seen as a "design methodology" within the field, much like active learning.
• It is also not an exclusive part or study area of machine learning. Nevertheless, it has become quite popular in combination with neural networks, which require huge amounts of data and computational power.
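To make the "task A → task B" idea concrete, here is a minimal Keras sketch of reusing a trained network's weights as the starting point for a new task. The layer sizes, input dimension, and task names are illustrative assumptions, not details from these notes.

```python
import tensorflow as tf
from tensorflow import keras

# Task A model (e.g., "does the image contain a backpack?"), assumed to
# have already been trained. Sizes here are purely illustrative.
inputs = keras.Input(shape=(128,))
x = keras.layers.Dense(64, activation="relu")(inputs)
x = keras.layers.Dense(32, activation="relu")(x)
task_a_output = keras.layers.Dense(1, activation="sigmoid")(x)
task_a_model = keras.Model(inputs, task_a_output)
# task_a_model.fit(...)  # training on task A would happen here

# Task B ("does the image contain sunglasses?") reuses the feature layers
# learned on task A and only attaches a fresh output layer.
features = task_a_model.layers[-2].output          # last hidden layer
task_b_output = keras.layers.Dense(1, activation="sigmoid",
                                   name="task_b_output")(features)
task_b_model = keras.Model(task_a_model.input, task_b_output)
task_b_model.compile(optimizer="adam", loss="binary_crossentropy")
```

Because the hidden layers are shared between the two models, everything learned on task A is carried over as the starting point for task B.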
Need for Transfer Learning
• Transfer learning is essential in machine learning for several reasons:
• Limited Data: In many real-world scenarios, obtaining a large amount of labeled data to train a model from scratch is difficult and expensive. Transfer learning lets us leverage pre-trained models and their knowledge, reducing the need for vast amounts of data.
• Improved Performance: By starting with a pre-trained model, which has already learned from a large dataset, we can achieve better performance on new tasks more quickly. This is especially useful in applications where accuracy and efficiency are crucial.
• Time and Cost Efficiency: Transfer learning saves time and resources because it speeds up the training process. Instead of training a new model from scratch, we can build on existing models and fine-tune them for specific tasks.
• Adaptability: Models trained on one task can be adapted to perform well on related tasks. This adaptability makes transfer learning suitable for a wide range of applications, from image recognition to natural language processing.
• Transfer learning is also particularly useful with limited computing resources. Many state-of-the-art models take several days, and in some cases weeks, to train even on highly powerful GPU machines. To avoid repeating that effort, transfer learning lets us use pre-trained weights as a starting point.
• Different transfer learning strategies and techniques are applied based on the domain of the application, the task at hand, and the availability of data.
• Before deciding on a transfer learning strategy, it is crucial to answer the following questions:
• Which part of the knowledge can be transferred from the source to the target to improve the performance of the target task?
• When to transfer and when not to, so that the target task's performance improves rather than degrades?
• How to transfer the knowledge gained from the source model, given our current domain/task?
How Transfer Learning Works
• This is a general summary of how transfer learning works:
• Pre-trained Model: Start with a model that has previously been trained for a certain task on a large dataset. Having been trained on extensive data, this model has identified general features and patterns relevant to many related tasks.
• Base Model: The pre-trained model is known as the base model. It is made up of layers that have learned hierarchical feature representations from the input data.
• Transfer Layers: In the pre-trained model, find a set of layers that capture generic information relevant to the new task as well as the original one. Because they tend to learn low-level, generic features, these layers are usually found near the beginning of the network (closer to the input).
• Fine-tuning: Retrain the chosen layers using the dataset of the new task. This procedure is called fine-tuning. The goal is to preserve the knowledge from pre-training while letting the model adjust its parameters to better suit the new task.
• Frozen and Trainable Layers:
• In transfer learning, there are two main components: frozen layers and modifiable layers.
1. Frozen Layers: These are the layers of a pre-trained model that are kept unchanged during the fine-tuning process. Frozen layers retain the knowledge learned from the original task and are used to extract general features from the input data.
2. Modifiable Layers: These are the layers of the model that are adjusted or re-trained during fine-tuning. Modifiable layers learn task-specific features from the new dataset. By focusing on these layers, the model can adapt to the specific requirements of the new task.
• Now, one may ask how to determine which layers to freeze and which layers to train (see the sketch below).
• The answer is simple: the more you want to inherit features from the pre-trained model, the more layers you have to freeze.
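The distinction between frozen and modifiable layers maps directly onto the trainable flag of Keras layers. The following minimal sketch freezes all but the last ten layers of a pre-trained network; the choice of ResNet-50 and of the number ten is an illustrative assumption, not a prescription from these notes.

```python
import tensorflow as tf
from tensorflow import keras

# Pre-trained base model (ImageNet weights, classifier head removed).
base_model = keras.applications.ResNet50(weights="imagenet", include_top=False)

# Frozen layers: keep the pre-trained weights unchanged during training.
for layer in base_model.layers[:-10]:
    layer.trainable = False

# Modifiable layers: leave the last few layers trainable so they can
# adapt to the new task.
for layer in base_model.layers[-10:]:
    layer.trainable = True

num_trainable = sum(1 for layer in base_model.layers if layer.trainable)
print(f"{num_trainable} trainable layers out of {len(base_model.layers)}")
```

The more layers are covered by the first loop, the more of the pre-trained features are inherited unchanged.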
• Let us consider the situations where the size of the new dataset and its similarity to the base network's training data vary.
• Transfer Learning Scenarios:
1. New dataset that is small and similar to the original training dataset.
2. New dataset that is small but different from the original training dataset.
3. New dataset that is large and similar to the original training dataset.
4. New dataset that is large but different from the original training data.
• If the new dataset is small and similar to the original training data:
• remove the end of the fully connected neural network;
• add a new fully connected layer whose output dimension equals the number of classes in the new dataset;
• randomize the weights of the new fully connected layer;
• freeze all the weights of the pre-trained network;
• train the network to update the weights of the new fully connected layer.
• All CNN layers of the pre-trained model are kept constant, i.e., frozen, because the images are similar and the pre-trained network already contains the relevant higher-level features. Retraining on such a small dataset would tend to overfit, which is why the weights of the original pre-trained model are held constant and not retrained.
• If the new dataset is small and different from the original training data, the approach is as follows:
• remove the end of the fully connected neural network and some CNN layers at the end of the network;
• add a new fully connected layer whose output dimension equals the number of classes in the new dataset;
• randomize the weights of the new fully connected layer;
• freeze all the weights of the remaining pre-trained CNN layers;
• train the network to update the weights of the new fully connected layer.
• In this case the dataset is small but different. Because our data are images, we keep the beginning of the network and remove the CNN layers that extract higher-level features just before the fully connected layers (see the sketch after this list). This approach would also tend to overfit a small dataset, so the remaining weights of the original pre-trained model are held constant and not retrained.
• If the new dataset is large and similar to the original training data, the approach is as follows:
• remove the end of the fully connected neural network;
• add a new fully connected layer whose output dimension equals the number of classes in the new dataset;
• randomize the weights of the new fully connected layer;
• initialize the remaining weights from the pre-trained network;
• train the network, updating both the pre-trained weights and the new fully connected layer.
• Since the new dataset is similar to the original training data, the higher-layer features are not removed from the pre-trained network. Overfitting is not a major concern, so we can re-train all of the weights.
• If the new dataset is large and different from the original training data, the approach is as follows:
• remove the end of the fully connected neural network and add a new fully connected layer whose output dimension equals the number of classes in the new dataset;
• randomize the weights of the new fully connected layer and initialize the rest of the network with random weights;
• train the entire network.
• In this case, the CNN layers are mostly retrained from scratch, although we could just as well initialize them with the pre-trained weights.
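As a sketch of the second scenario (small dataset, different from the original data), the example below keeps only the earlier convolutional blocks of a pre-trained VGG-16 and attaches a new, randomly initialized classifier. The cut-off layer ("block3_pool" in Keras's VGG-16 layer naming), the input size, and the five-class head are illustrative assumptions, not values from these notes.

```python
import tensorflow as tf
from tensorflow import keras

# Pre-trained base (ImageNet weights, classifier head removed).
base = keras.applications.VGG16(weights="imagenet", include_top=False,
                                input_shape=(224, 224, 3))
base.trainable = False  # freeze all remaining pre-trained weights

# Cut the network early: take the output of an earlier block, which holds
# lower-level, more generic features than the final convolutional layers.
early_features = base.get_layer("block3_pool").output

# New, randomly initialized classifier for the new task.
x = keras.layers.GlobalAveragePooling2D()(early_features)
x = keras.layers.Dense(128, activation="relu")(x)
outputs = keras.layers.Dense(5, activation="softmax")(x)  # e.g. 5 new classes

model = keras.Model(inputs=base.input, outputs=outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```

Only the new dense layers are trained; the kept convolutional blocks act as a fixed, generic feature extractor.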
Difference between transfer learning and fine-tuning
• Fine-tuning is an optional step in transfer learning. Fine-tuning will usually improve the performance of the model; however, since it retrains the entire model, it is more likely to overfit.
• Overfitting is avoidable: retrain the model, or part of it, using a low learning rate. This is important because it prevents large weight updates, which would result in poor performance.
• Using a callback to stop the training process when the model has stopped improving is also helpful.
Transfer Learning Process
1. Obtain a pre-trained model. The first step is to choose the pre-trained model we would like to keep as the base of our training, depending on the task.
2. Create a base model.
3. Freeze layers.
4. Add new trainable layers.
5. Train the new layers.
6. Fine-tune your model.
Step-1 Obtain the pre-trained model
• The first step is to get the pre-trained model that you would like to use as the base of your training, depending on the task. Transfer learning requires a strong correlation between the knowledge of the pre-trained source model and the target task domain for them to be compatible.
• Here are some of the pre-trained models you can use:
• For computer vision: VGG-16, VGG-19, Inception V3, Xception, ResNet-50
• For NLP tasks: Word2Vec, GloVe, FastText
Step-2 Create a base model
• Usually, the first step is to instantiate the base model using one of the architectures such as ResNet or Xception.
• You can also optionally download the pre-trained weights. If you do not download the weights, you will have to use the architecture to train your model from scratch.
• Recall that the base model will usually have more units in the final output layer than you require. When creating the base model, you therefore have to remove the final output layer. Later on, you will add a final output layer that is compatible with your problem.
Step-3 Freeze layers so they don't change during training
• Freezing the layers from the pre-trained model is vital. You don't want the weights in those layers to be re-initialized; if they are, you will lose all the learning that has already taken place, which would be no different from training the model from scratch.
Step-4 Add new trainable layers
• The next step is to add new trainable layers that turn the old features into predictions on the new dataset. This is important because the pre-trained model is loaded without its final output layer.
Step-5 Train the new layers on the dataset
• Remember that the pre-trained model's final output will most likely differ from the output you want for your model.
• For example, pre-trained models trained on the ImageNet dataset output 1000 classes, whereas your model might have just two classes. In this case, you have to train the model with a new output layer in place.
• Therefore, you will add new dense layers as needed, and most importantly a final dense layer with units corresponding to the number of outputs expected by your model.
Step-6 Improve the model via fine-tuning
• Once you have completed the previous step, you will have a model that can make predictions on your dataset. Optionally, you can improve its performance through fine-tuning.
• Fine-tuning is done by unfreezing the base model, or part of it, and training the entire model again on the whole dataset at a very low learning rate. The low learning rate improves the model's performance on the new dataset while preventing overfitting.
• The learning rate has to be low because the model is quite large while the dataset is small; this is a recipe for overfitting, hence the low learning rate.
• Recompile the model once you have made these changes so that they take effect. This is because the behavior of a model is fixed when you call the compile function, so you have to call compile again whenever you want to change the model's behavior.
• The next step is to train the model again while monitoring it via callbacks to ensure it does not overfit. The sketch below walks through all six steps.
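The six steps can be put together in one Keras sketch. The dataset objects (train_ds, val_ds), the two-class head, the learning rates, and the callback settings are illustrative assumptions rather than values from these notes; the fit calls are left commented out because no dataset is defined here.

```python
import tensorflow as tf
from tensorflow import keras

num_classes = 2  # e.g. a binary problem instead of ImageNet's 1000 classes

# Steps 1-2: obtain a pre-trained model and create the base model
# (ImageNet weights, final 1000-class output layer removed).
base_model = keras.applications.Xception(weights="imagenet",
                                         include_top=False,
                                         input_shape=(224, 224, 3))

# Step 3: freeze the base so its learned weights are not re-initialized.
base_model.trainable = False

# Step 4: add new trainable layers that map old features to new predictions.
inputs = keras.Input(shape=(224, 224, 3))
x = base_model(inputs, training=False)   # keep BatchNorm layers in inference mode
x = keras.layers.GlobalAveragePooling2D()(x)
outputs = keras.layers.Dense(num_classes, activation="softmax")(x)
model = keras.Model(inputs, outputs)

# Step 5: train only the new layers on the new dataset.
model.compile(optimizer=keras.optimizers.Adam(1e-3),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=10)

# Step 6 (optional): unfreeze the base and fine-tune the whole model at a
# very low learning rate; recompile so the change takes effect, and use a
# callback to stop training once the model stops improving.
base_model.trainable = True
model.compile(optimizer=keras.optimizers.Adam(1e-5),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=3,
                                           restore_best_weights=True)
# model.fit(train_ds, validation_data=val_ds, epochs=10, callbacks=[early_stop])
```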
Advantages of transfer learning
• Speeds up the training process: by using a pre-trained model, the model can learn more quickly and effectively on the second task, as it already has a good understanding of the features and patterns in the data.
• Better performance: transfer learning can lead to better performance on the second task, as the model can leverage the knowledge it gained from the first task.
• Handling small datasets: when there is limited data available for the second task, transfer learning helps prevent overfitting, because the model has already learned general features that are likely to be useful in the second task.
Disadvantages of transfer learning
• Domain mismatch: the pre-trained model may not be well suited to the second task if the two tasks are vastly different or if the data distributions of the two tasks differ greatly.
• Overfitting: transfer learning can lead to overfitting if the model is fine-tuned too much on the second task, as it may learn task-specific features that do not generalize well to new data.
• Complexity: the pre-trained model and the fine-tuning process can be computationally expensive and may require specialized hardware.
Variants of CNN: DenseNet
• Densely Connected Convolutional Networks (DenseNet) is a feed-forward convolutional neural network (CNN) architecture that connects each layer to every other layer within a dense block.
• This allows the network to learn more effectively by reusing features, hence reducing the number of parameters and enhancing gradient flow during training.
• In 2016, Gao Huang et al. presented the architecture in their paper "Densely Connected Convolutional Networks".
• In a standard ConvNet, the input image goes through multiple convolutions to obtain high-level features.
• In ResNet, identity mappings are proposed to promote gradient propagation, using element-wise addition. It can be viewed as an algorithm with a state passed from one ResNet module to the next.
• In DenseNet, each layer obtains additional inputs from all preceding layers and passes its own feature maps on to all subsequent layers. Concatenation is used, so each layer receives the "collective knowledge" of all preceding layers (a small sketch of a dense block follows).
• Since each layer receives feature maps from all preceding layers, the network can be thinner and more compact, i.e., the number of channels can be fewer. The growth rate k is the number of additional channels contributed by each layer.
• As a result, DenseNet has higher computational and memory efficiency.
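A minimal sketch of the concatenation idea: every layer in a dense block produces k new feature maps (the growth rate) and receives the concatenation of everything produced before it. The block depth, growth rate, and input shape below are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow import keras

def conv_block(x, growth_rate):
    """BN-ReLU-3x3 Conv producing `growth_rate` new feature maps."""
    y = keras.layers.BatchNormalization()(x)
    y = keras.layers.ReLU()(y)
    y = keras.layers.Conv2D(growth_rate, 3, padding="same")(y)
    return y

def dense_block(x, num_layers, growth_rate):
    for _ in range(num_layers):
        y = conv_block(x, growth_rate)
        # Concatenate the new feature maps with all previous ones.
        x = keras.layers.Concatenate()([x, y])
    return x

inputs = keras.Input(shape=(32, 32, 16))
outputs = dense_block(inputs, num_layers=4, growth_rate=12)
model = keras.Model(inputs, outputs)
model.summary()
```

Starting from 16 input channels, four layers with k = 12 end with 16 + 4×12 = 64 channels, which is why each individual DenseNet layer can be kept thin.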
(Figure: concatenation of feature maps during forward propagation.)
DenseNet Architecture
• 1. Basic DenseNet composition layer: for each composition layer, pre-activation Batch Norm (BN) and ReLU are applied, followed by a 3×3 convolution, producing output feature maps with k channels, for example to transform x0, x1, x2, x3 into x4. This idea comes from Pre-Activation ResNet.
• 2. DenseNet-B (bottleneck layers): to reduce model complexity and size, BN-ReLU-1×1 Conv is applied before BN-ReLU-3×3 Conv.
• 3. DenseNet-BC (further compression): if a dense block produces m feature maps, the transition layer generates θm output feature maps, where 0 < θ ≤ 1 is referred to as the compression factor.
• When θ = 1, the number of feature maps across transition layers remains unchanged. DenseNet with θ < 1 is referred to as DenseNet-C; θ = 0.5 was used in the experiments.
• When both the bottleneck layers and transition layers with θ < 1 are used, the model is referred to as DenseNet-BC.
• Finally, DenseNets with and without B/C, with different numbers of layers L and different growth rates k, are trained.
• Multiple dense blocks with transition layers: a 1×1 convolution followed by 2×2 average pooling is used as the transition layer between two contiguous dense blocks.
• Feature map sizes are the same within a dense block so that the feature maps can be concatenated easily.
• At the end of the last dense block, global average pooling is performed and a SoftMax classifier is attached.
Advantages of DenseNet
• Performance: DenseNet achieves state-of-the-art performance in a range of computer vision tasks, including image classification, object recognition, and semantic segmentation.
• Feature reuse: DenseNet lets each layer access the features of all previous layers, optimizing gradient flow during training and allowing the network to learn more effectively.
• Overfitting: the DenseNet design tackles overfitting by lowering the number of parameters and enabling feature reuse, enhancing the model's capacity to generalize to unseen data.
• Vanishing gradients: the DenseNet design mitigates the vanishing gradient problem by allowing gradients to flow across the whole network, enabling the training of deeper networks.
• Redundancy: the DenseNet design manages redundancy by offering feature reuse and lowering the number of parameters, enhancing the model's capacity to generalize to unseen data.
Variant of CNN: PixelNet
• Self study