
CLIP Model Implementation and Comparative Analysis

Introduction
This documentation outlines the steps taken to implement and analyze the CLIP model using
ViT-B/32 for image retrieval. We compare various models, storage techniques for embeddings,
and similarity search methods while addressing challenges faced and solutions applied.
Additionally, we explore DINO and EVA-2 models to understand their advantages.

Dataset Selection and Preprocessing


• We selected CIFAR-100 as our dataset because it provides diverse image categories,
making it suitable for comparative model analysis.

• The dataset was loaded with PyTorch's DataLoader, which facilitates batch processing
and augmentation.

• The CLIP model (ViT-B/32) was used for feature extraction, where images were
preprocessed before encoding.

from torchvision import datasets, transforms
from torch.utils.data import DataLoader

def load_cifar100():
    # CIFAR-100 training set, converted to tensors and served in batches of 128
    dataset = datasets.CIFAR100(root="./data", train=True,
                                download=True, transform=transforms.ToTensor())
    dataloader = DataLoader(dataset, batch_size=128, shuffle=False,
                            num_workers=4)
    return dataset, dataloader

dataset, dataloader = load_cifar100()

Why ViT over Other Models?


CLIP offers both ViT and ResNet image encoders. We selected ViT-B/32 because its
transformer architecture captures global dependencies within an image, giving stronger
feature extraction and better semantic understanding than the ResNet-based variants,
which focus primarily on local features.
CLIP Model Variants Comparison

CLIP Model Type   Architecture         Strengths                           Weaknesses
ViT-B/32          Vision Transformer   Good performance on vision tasks    Higher computational cost
ViT-L/14          Vision Transformer   Better accuracy than ViT-B/32       Requires more memory
ResNet-50         CNN-based            Faster inference speed              Less flexible than ViT
ResNet-101        CNN-based            Improved accuracy over ResNet-50    Requires more computation

Model Selection and Comparison

Model                   Description                      Reason for Selection
CLIP (ViT-B/32)         Transformer-based vision model   Chosen for its superior performance
ResNet-18 & ResNet-50   CNN-based models                 Used for comparison with ViT
Swin Transformer        Hierarchical Transformer         Explored for advanced feature learning
DINO                    Self-supervised ViT              Learns without labeled data
EVA-2                   Optimized Transformer            Evaluated for better efficiency


CLIP Model Explanation
CLIP (Contrastive Language-Image Pretraining) is trained using a contrastive learning approach:

• One model processes text and outputs a semantic vector.

• Another model processes images and outputs a visual vector.

• Both vectors are optimized to have high similarity when they represent the same
concept.

Contrastive learning helps in distinguishing similar and dissimilar data points, improving
retrieval accuracy.
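
To make this concrete, the sketch below encodes one image and two candidate captions with the same ViT-B/32 checkpoint and compares them by cosine similarity (the file name example.jpg and the captions are hypothetical placeholders):

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
texts = clip.tokenize(["a photo of a dog", "a photo of an airplane"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(texts)

# Normalize so the dot product equals cosine similarity
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
similarity = image_features @ text_features.T  # higher score = better match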

DINO Model Explanation


• Uses a self-distillation framework without labeled data.

• Employs a student-teacher model where the student network learns to generate
representations without explicit labels.

• Processes images as sequences of patches, similar to text in transformers.
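
For reference, DINO features could be extracted for the same images as sketched below, assuming torch.hub access to the facebookresearch/dino repository and the ViT-S/16 checkpoint (an illustration, not part of the original pipeline):

import torch
from torchvision import transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
dino = torch.hub.load('facebookresearch/dino:main', 'dino_vits16').to(device).eval()

# DINO expects ImageNet-style inputs, so resize CIFAR tensors and normalize them
dino_preprocess = transforms.Compose([
    transforms.Resize(224),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def get_dino_embeddings(images):
    # images: a batch of CIFAR-100 tensors in [0, 1], shape (B, 3, 32, 32)
    with torch.no_grad():
        batch = dino_preprocess(images).to(device)
        return dino(batch).cpu().numpy()  # (B, 384) feature vectors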


Challenges Faced and Resolutions
1. Slow Feature Extraction

▪ Initially, we stored embeddings in a plain Python list, which made processing slow; finding a faster storage approach is noted under Future Work.

2. Incorrect Precision Calculation

▪ Precision was initially miscalculated because only the top-5 retrieved images were considered; the final evaluation uses k = 20 (see calculate_precision below).

Feature Extraction and Embeddings


• CLIP was used to generate embeddings by encoding images into a latent space where
similar images have closer representations.

• The embeddings were normalized to improve retrieval performance.

• FAISS and LSH were used to store and retrieve embeddings efficiently.

import torch
import clip
import numpy as np
from torchvision import transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def get_clip_embeddings(dataloader):
    model.eval()
    all_features = []
    with torch.no_grad():
        for images, _ in dataloader:
            # Convert the tensor batch to PIL images and apply CLIP's preprocessing
            images = preprocess_images(images)
            features = model.encode_image(images)
            # L2-normalize the embeddings to improve retrieval performance
            features = features / features.norm(dim=-1, keepdim=True)
            all_features.append(features.cpu().numpy())
    return np.vstack(all_features)

def preprocess_images(images):
    # Convert each CIFAR tensor back to a PIL image, apply CLIP's preprocess
    # transform (resize, crop, normalize), and move the batch to the device
    pil_images = [transforms.ToPILImage()(img) for img in images]
    return torch.stack([preprocess(img) for img in pil_images]).to(device)
Query Image and Retrieval Results

(Figures not reproduced here: FAISS similar images, LSH similar images, and combined FAISS + LSH similar images for the query image.)


Similarity Search Techniques Explored
FAISS (Facebook AI Similarity Search)
• Supports efficient similarity searches using indexing methods.

• Organizes vectors using K-means clustering, product quantization, and proximity graphs.

• Utilizes GPU acceleration for fast search performance.

import faiss

def build_faiss_index(embeddings):
    # FAISS expects contiguous float32 vectors
    embeddings = np.ascontiguousarray(embeddings.astype("float32"))
    index = faiss.IndexFlatL2(embeddings.shape[1])  # exact L2 (flat) index
    index.add(embeddings)
    return index

clip_embeddings = get_clip_embeddings(dataloader)  # extract features once and reuse
faiss_index = build_faiss_index(clip_embeddings)

LSH (Locality Sensitive Hashing)


• Maps high-dimensional data into low-dimensional hash buckets.
• Reduces search space by clustering similar embeddings.
• Uses random projections to approximate similarity efficiently.

from sklearn.neighbors import NearestNeighbors

def build_lsh_index(embeddings, n_neighbors=10):
    # Note: NearestNeighbors with a ball tree performs an exact search;
    # it is used here as a practical stand-in for an LSH index.
    lsh = NearestNeighbors(n_neighbors=n_neighbors,
                           algorithm='ball_tree', metric='euclidean')
    lsh.fit(embeddings)
    return lsh

lsh_index = build_lsh_index(clip_embeddings)  # reuse the embeddings extracted above
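
Because NearestNeighbors performs an exact search, it does not actually hash vectors into buckets. A minimal sketch of the random-projection hashing described in the bullets above, assuming only NumPy (illustrative, not the code used in this project):

def build_random_projection_lsh(embeddings, n_bits=16, seed=0):
    # Hash each embedding to an n_bits signature using random hyperplanes
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((embeddings.shape[1], n_bits))
    bits = (embeddings @ planes) > 0                 # sign of each projection
    codes = bits.dot(1 << np.arange(n_bits))         # pack bits into bucket ids
    buckets = {}
    for i, code in enumerate(codes):
        buckets.setdefault(int(code), []).append(i)  # group vectors by bucket
    return planes, buckets

def lsh_candidates(query, planes, buckets):
    # Vectors sharing the query's bucket are the candidate neighbours
    bits = (query @ planes) > 0
    code = int(bits.dot(1 << np.arange(planes.shape[1])))
    return buckets.get(code, [])

Candidates returned this way can then be ranked by exact distance to the query.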

Comparison of FAISS and LSH


• FAISS organizes vectors into index structures, offering optimized search speeds.

• LSH hashes similar items into the same bucket, reducing the search space for similarity
comparisons.

• Combined Approach: first retrieve candidate images with FAISS, then refine the results
with LSH for improved accuracy (a sketch follows below).
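
A minimal sketch of this combined approach, assuming the clip_embeddings, faiss_index, and lsh_index objects built above; the refinement rule (keeping FAISS candidates that the LSH index also returns) is an assumption:

def combined_retrieval(query_embedding, k_candidates=50, k_final=5):
    query = np.ascontiguousarray(query_embedding.reshape(1, -1).astype("float32"))

    # Step 1: broad candidate retrieval with FAISS, ordered by L2 distance
    _, faiss_ids = faiss_index.search(query, k_candidates)
    faiss_ids = faiss_ids[0]

    # Step 2: refine with the LSH index, keeping candidates both methods agree on
    _, lsh_ids = lsh_index.kneighbors(query, n_neighbors=k_candidates)
    refined = [idx for idx in faiss_ids if idx in set(lsh_ids[0])]

    return refined[:k_final]

top_images = combined_retrieval(clip_embeddings[0])  # e.g. query with the first image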
Metrics and Evaluation
Metric                      Description
Precision@K                 Measures relevance of retrieved images
Recall@K                    Assesses proportion of relevant images retrieved
False Positive Rate (FPR)   Determines retrieval errors

def calculate_precision(true_indices, retrieved_indices, k=20):
    # Fraction of the k retrieved images that are relevant
    relevant_count = sum(1 for idx in retrieved_indices if idx in true_indices)
    return relevant_count / k
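
Recall@K and the false positive rate are listed above without code; a minimal sketch of Recall@K under the same conventions (an assumed implementation, not taken from the original code):

def calculate_recall(true_indices, retrieved_indices):
    # Fraction of all relevant images that appear in the retrieved set
    if len(true_indices) == 0:
        return 0.0
    relevant_count = sum(1 for idx in retrieved_indices if idx in true_indices)
    return relevant_count / len(true_indices)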

Future Work
• Explore more models like EVA-2 and refine feature extraction methods.

• Optimize retrieval methods using hybrid approaches.

• Reduce the dataset size to 10 classes for easier comparison.

• Identify alternative methods to speed up the feature extraction process instead of using
lists.
