CLIP Model
Comparative Analysis
Introduction
This documentation outlines the steps taken to implement and analyze the CLIP model using
ViT-B/32 for image retrieval. We compare various models, storage techniques for embeddings,
and similarity search methods while addressing challenges faced and solutions applied.
Additionally, we explore DINO and EVA-2 models to understand their advantages.
• The dataset was loaded using the PyTorch DataLoader, which facilitates batch processing
and augmentation.
• The CLIP model (ViT-B/32) was used for feature extraction, where images were preprocessed
before encoding.
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def load_cifar100():
    # Load the CIFAR-100 training set as raw tensors; CLIP-specific preprocessing is applied later.
    dataset = datasets.CIFAR100(root="./data", train=True,
                                download=True, transform=transforms.ToTensor())
    dataloader = DataLoader(dataset, batch_size=128, shuffle=False,
                            num_workers=4)
    return dataset, dataloader
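The later snippets assume the loader has been created once, e.g.:

dataset, dataloader = load_cifar100()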
CLIP Model
Model           | Architecture Type              | Strengths                           | Weaknesses
ViT-L/14        | Vision Transformer             | Better accuracy than ViT-B/32       | Requires more memory
CLIP (ViT-B/32) | Transformer-based vision model | Chosen for its superior performance |
• Both the image and text embeddings are optimized to have high similarity when they represent
the same concept.
Contrastive learning helps in distinguishing similar and dissimilar data points, improving
retrieval accuracy.
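As an illustration of this shared embedding space, the minimal sketch below (assuming the openai clip package already used in this project, a placeholder image file example.jpg, and illustrative caption strings) compares an image embedding against two caption embeddings; the matching caption is expected to score higher.

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# example.jpg and the captions below are hypothetical placeholders for illustration.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
texts = clip.tokenize(["a photo of a dog", "a photo of an airplane"]).to(device)

with torch.no_grad():
    image_emb = model.encode_image(image)
    text_emb = model.encode_text(texts)

# Normalize so the dot product equals cosine similarity.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
similarity = (image_emb @ text_emb.T).squeeze(0)
print(similarity)  # the matching caption should receive the higher score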
▪ Precision was initially miscalculated due to retrieving only the top-5 images.
• FAISS and LSH were used to store and retrieve embeddings efficiently.
import torch
import clip
import numpy as np
from torchvision import transforms

# Load CLIP ViT-B/32 together with its image preprocessing pipeline.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def preprocess_images(images):
    # Convert raw CIFAR-100 tensors back to PIL images, apply CLIP's own
    # preprocessing (resize, crop, normalize), then move the batch to the device.
    pil_images = [transforms.ToPILImage()(img) for img in images]
    return torch.stack([preprocess(img) for img in pil_images]).to(device)

def get_clip_embeddings(dataloader):
    model.eval()
    all_features = []
    with torch.no_grad():
        for images, _ in dataloader:
            images = preprocess_images(images)  # keep raw tensors on CPU until preprocessed
            features = model.encode_image(images)
            all_features.append(features.cpu().numpy())
    return np.vstack(all_features)
Query Image and Retrieval Results
import faiss
import numpy as np

def build_faiss_index(embeddings):
    # FAISS expects contiguous float32 arrays; CLIP may return float16 embeddings on GPU.
    embeddings = np.ascontiguousarray(embeddings.astype("float32"))
    index = faiss.IndexFlatL2(embeddings.shape[1])  # exact L2 (Euclidean) search
    index.add(embeddings)
    return index
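A minimal sketch of build_lsh_index, the helper used below, based on FAISS's binary LSH index; the choice of faiss.IndexLSH and the number of hash bits (nbits=256) are illustrative assumptions rather than values from the original experiments.

def build_lsh_index(embeddings, nbits=256):
    # Hash each embedding into an nbits-wide binary code; similar vectors tend to
    # fall into the same buckets, which shrinks the search space.
    embeddings = np.ascontiguousarray(embeddings.astype("float32"))
    index = faiss.IndexLSH(embeddings.shape[1], nbits)
    index.add(embeddings)
    return index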
embeddings = get_clip_embeddings(dataloader)  # compute embeddings once, reuse for both indexes
faiss_index = build_faiss_index(embeddings)
lsh_index = build_lsh_index(embeddings)
• LSH hashes similar items into the same bucket, reducing the search space for similarity
comparisons.
• Combined Approach: first retrieves a candidate set with FAISS, then refines the results with
LSH for improved accuracy (see the sketch after this list).
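A minimal sketch of this two-stage retrieval, assuming the faiss_index and lsh_index built above; the candidate-pool size (pool=50) and the re-ranking rule (ordering FAISS candidates by their LSH rank) are illustrative choices, not taken from the original experiments.

def combined_search(query_emb, faiss_index, lsh_index, k=5, pool=50):
    query = np.ascontiguousarray(query_emb.astype("float32").reshape(1, -1))
    # Stage 1: exact L2 search with FAISS to get a candidate pool.
    _, faiss_ids = faiss_index.search(query, pool)
    # Stage 2: approximate LSH search; candidates that LSH also returns are ranked first.
    _, lsh_ids = lsh_index.search(query, pool)
    lsh_rank = {int(i): r for r, i in enumerate(lsh_ids[0])}
    candidates = [int(i) for i in faiss_ids[0]]
    refined = sorted(candidates, key=lambda i: lsh_rank.get(i, pool + candidates.index(i)))
    return refined[:k]

# Example: retrieve the 5 most similar CIFAR-100 images for the first embedding.
top5 = combined_search(embeddings[0], faiss_index, lsh_index, k=5)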
Metrics and Evaluation
Metric | Description
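As one concrete metric, a minimal sketch of precision@k over the retrieved indices, assuming relevance is defined by matching CIFAR-100 class labels; this label-based notion of relevance is an assumption for illustration, not the original evaluation code.

def precision_at_k(retrieved_ids, query_label, labels, k=5):
    # Fraction of the top-k retrieved images whose class label matches the query's label.
    top_k = retrieved_ids[:k]
    return sum(1 for i in top_k if labels[i] == query_label) / k

# e.g. with the retrieval example above (hypothetical query at index 0):
# precision_at_k(top5, dataset.targets[0], dataset.targets, k=5)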
Future Work
• Explore more models like EVA-2 and refine feature extraction methods.
• Identify faster alternatives to accumulating features in Python lists during feature
extraction (a sketch follows this list).
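One possible direction is to preallocate the output array and write each batch into its slice instead of growing a Python list; a minimal sketch, assuming the ViT-B/32 embedding dimension of 512 and the model and preprocess_images defined earlier.

def get_clip_embeddings_prealloc(dataloader, dim=512):
    # Preallocate the full output array (ViT-B/32 produces 512-dimensional embeddings)
    # and fill it batch by batch, avoiding list appends and a final np.vstack copy.
    n = len(dataloader.dataset)
    out = np.empty((n, dim), dtype=np.float32)
    offset = 0
    model.eval()
    with torch.no_grad():
        for images, _ in dataloader:
            images = preprocess_images(images)
            features = model.encode_image(images).cpu().numpy().astype(np.float32)
            out[offset:offset + features.shape[0]] = features
            offset += features.shape[0]
    return out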