Assignment_8

The document discusses the use of pre-trained architectures for image classification, highlighting the merits and demerits of popular models like VGG, ResNet, and Inception. It also contrasts fine-tuning versus feature extraction strategies, emphasizing the benefits and drawbacks of each approach. Additionally, it touches on multi-modal pre-trained models, such as CLIP, and their application in zero-shot learning, noting their strengths and limitations.


Date of Submission: Friday 6th Sep 2024

Ans 1 and 2

Using pre-trained architectures to develop an image classification model leverages the knowledge learned by
models trained on large and diverse datasets, allowing high accuracy to be reached with less data.

1. Choice of Pre-Trained Model Architecture


One of the first design decisions in leveraging pre-trained models is selecting the appropriate architecture.
Popular choices include VGG, ResNet, and Inception, each with a different configuration and parameter count
(a minimal loading sketch follows the merits and demerits below).

a. Merits:
VGG16: Known for its simplicity, VGG16 has become a benchmark for CNNs due to its uniform design of stacked
3x3 convolutional layers.
ResNet: ResNet architectures, such as ResNet-50, introduce residual connections that enable training of very
deep networks by addressing the vanishing gradient problem. This results in improved performance on
complex tasks.
Inception: The Inception model incorporates multi-scale processing by using different kernel sizes within the
same layer, allowing the model to capture features at various scales.

b. Demerits:
VGG16: The simplicity of VGG16 comes with a high computational cost due to its depth and the number of
fully-connected layers, leading to a large number of parameters.
ResNet: While ResNet models are powerful, they can be complex to implement and may require more
computational resources than simpler architectures.
Inception: The complexity of the Inception architecture can make it difficult to modify and fine-tune for
specific tasks.
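
As a concrete illustration of these trade-offs, the sketch below (assuming PyTorch with torchvision 0.13 or
newer is installed) loads ImageNet-pretrained versions of the three architectures and compares their parameter
counts; the weight enums are torchvision's, everything else is illustrative.

    from torchvision import models

    # Load ImageNet-pretrained backbones (weights are downloaded on first use).
    vgg16 = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
    resnet50 = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    inception = models.inception_v3(weights=models.Inception_V3_Weights.IMAGENET1K_V1)

    # Compare parameter counts, one of the trade-offs noted above.
    for name, model in [("VGG16", vgg16), ("ResNet-50", resnet50), ("Inception v3", inception)]:
        n_params = sum(p.numel() for p in model.parameters())
        print(f"{name}: {n_params / 1e6:.1f}M parameters")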

2. Fine-Tuning vs Feature Extraction


A second design choice is whether to fine-tune the entire pre-trained model or to use it as a fixed feature
extractor, where the pre-trained layers are frozen and only a new classifier head is trained on the new dataset
(a sketch of both strategies follows the list below).

a. Merits:
Fine-Tuning: Fine-tuning can lead to higher accuracy since the model is adjusted to the specifics of the new
task.
Feature Extraction: This approach is faster and requires less computational power, as only a small part of the
model is being trained.

b. Demerits:
Fine-Tuning: It is computationally expensive and may lead to overfitting if the new dataset is small.
Feature Extraction: While efficient, this might result in lower accuracy as the pre-trained features may not be
optimal for the new task.
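
A minimal sketch of the two strategies, assuming PyTorch/torchvision and a hypothetical 10-class target
dataset (num_classes is a placeholder):

    import torch
    import torch.nn as nn
    from torchvision import models

    num_classes = 10  # hypothetical number of classes in the new dataset

    # Feature extraction: freeze the pre-trained backbone and train only a new head.
    feat_model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    for param in feat_model.parameters():
        param.requires_grad = False
    feat_model.fc = nn.Linear(feat_model.fc.in_features, num_classes)  # new head stays trainable

    # Fine-tuning: replace the head but keep every layer trainable, typically with
    # a small learning rate so the pre-trained weights are not destroyed.
    ft_model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    ft_model.fc = nn.Linear(ft_model.fc.in_features, num_classes)
    optimizer = torch.optim.Adam(ft_model.parameters(), lr=1e-5)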

3. Use of Multi-Modal Pre-trained Models


The third design choice involves the use of multi-modal pre-trained models like CLIP, which have been trained
on internet image-text pairs and can be applied in a zero-shot manner to various tasks (a zero-shot
classification sketch follows the list below).

a. Merits:
Multi-Modal Models: They offer robust and aligned semantic representations, which can be beneficial for
tasks involving both visual and textual data.
Zero-Shot Learning: These models can be used without further training, making them highly versatile and
efficient.
b. Demerits:
Multi-Modal Models: They may not outperform specialized image classification models on tasks that are
purely visual.
Zero-Shot Learning: The performance can be unpredictable on specific datasets or niche tasks.
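
A minimal zero-shot classification sketch, assuming OpenAI's clip package and Pillow are installed;
"photo.jpg" and the label list are placeholders:

    import torch
    import clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    # Encode one image and a set of candidate labels phrased as text prompts.
    image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
    class_names = ["dog", "cat", "airplane"]  # hypothetical label set
    text = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

    # No training step: image-text embedding similarities give the prediction directly.
    with torch.no_grad():
        logits_per_image, _ = model(image, text)
        probs = logits_per_image.softmax(dim=-1)
    print(dict(zip(class_names, probs[0].tolist())))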

Ans 3

Fine-tuning the deeper layers is generally a better choice for image classification, as they capture more task-
specific, abstract features. Shallower layers tend to learn general features (e.g., edges, textures), which are
often transferable across tasks, while deeper layers adapt to the particular dataset or problem.
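
For example, a sketch of this partial fine-tuning strategy on torchvision's ResNet-50 (the stage name layer4
is ResNet-specific; the 10-class head is hypothetical):

    import torch.nn as nn
    from torchvision import models

    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    for param in model.parameters():
        param.requires_grad = False        # shallow, general-purpose layers stay frozen
    for param in model.layer4.parameters():
        param.requires_grad = True         # deepest residual stage is fine-tuned
    model.fc = nn.Linear(model.fc.in_features, 10)  # new task-specific classifier head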
