DINOv2 Presentation

DINOv2 is a self-supervised learning model by Meta AI that excels in extracting robust visual features from unlabeled images using Vision Transformer architecture, achieving 86.5% Top-1 accuracy in image classification. It addresses challenges in self-supervised learning by improving generalization and efficiency while also ensuring demographic fairness in its datasets. The model has diverse applications in fields such as autonomous systems, medical imaging, and AR/VR, paving the way for future advancements in AI.

Presentation on DINOv2

Learning Robust Visual Features without Supervision
Presented by: Drashti Bhavsar
Company: LOTI AI, Inc.
Abstract
• DINOv2 is a state-of-the-art self-supervised learning model developed by Meta AI for computer vision. It uses a Vision Transformer (ViT) architecture to extract robust visual features from unlabeled images, enabling tasks such as image classification, depth estimation, and semantic segmentation without task-specific fine-tuning.
Literature Review
• DINOv2 achieves exceptional performance compared to prior models:
• OpenCLIP (ViT-G/14): 86.2% Top-1 accuracy with weak supervision.
• iBOT (ViT-L/16): 82.3% Top-1 accuracy with self-supervision.
• DINOv2 (ViT-g/14): 86.5% Top-1 accuracy with self-supervision.
Methodology
• DINOv2 employs the following key techniques:
• Vision Transformers (ViTs) for patch-based feature extraction.
• Self-supervised learning with curated datasets and clustering.
• Knowledge distillation to transfer features from large to smaller models.
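The patch-based feature extraction above can be sketched in a few lines. This is a minimal, illustrative NumPy version of the first ViT step (splitting an image into non-overlapping patches and linearly projecting them to token embeddings), not DINOv2's actual implementation; the function name `patchify`, the 384-dim embedding size, and the random projection are assumptions for the demo.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int = 14) -> np.ndarray:
    """Split an (H, W, C) image into non-overlapping, flattened patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    patches = image.reshape(h // patch, patch, w // patch, patch, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)
    return patches

rng = np.random.default_rng(0)
img = rng.standard_normal((224, 224, 3))    # toy "image"
tokens = patchify(img)                       # (256, 588): 16x16 patches of 14*14*3 values
proj = rng.standard_normal((588, 384))       # toy linear projection (learned in a real ViT)
embeddings = tokens @ proj                   # (256, 384): input tokens for the transformer
```

A ViT-*/14 model like DINOv2's backbones uses exactly this 14-pixel patch grid; at 224×224 input that yields 256 patch tokens.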
Data Processing Pipeline
• The pipeline combines curated and uncurated datasets to create a diverse training set:
• Curated images are mapped to embeddings.
• Uncurated images are deduplicated and matched with curated images.
• Clustering ensures balanced representation across groups.
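The deduplicate-then-match step can be sketched with cosine similarity on embeddings. This is a toy NumPy illustration of the idea, not the actual DINOv2 curation pipeline (which operates at web scale with approximate nearest-neighbour search); the function names and thresholds here are assumptions.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def dedup_and_match(uncurated, curated, dup_thresh=0.95, match_thresh=0.5):
    """Drop near-duplicate uncurated embeddings, then keep only those close
    to at least one curated embedding (cosine similarity)."""
    u, c = l2_normalize(uncurated), l2_normalize(curated)
    kept = []
    for i in range(len(u)):
        # keep i only if it is not a near-duplicate of anything already kept
        if all(float(u[i] @ u[j]) < dup_thresh for j in kept):
            kept.append(i)
    sims = u[kept] @ c.T                     # cosine similarity to the curated set
    return [kept[i] for i in range(len(kept)) if sims[i].max() > match_thresh]

curated = np.array([[1.0, 0.0], [0.0, 1.0]])
uncurated = np.array([[1.0, 0.01], [1.0, 0.011], [0.3, -1.0]])
matched = dedup_and_match(uncurated, curated)
# index 1 is a near-duplicate of index 0; index 2 matches nothing curated
```

Only index 0 survives both filters, which mirrors how the pipeline retains uncurated images that resemble the curated distribution without inflating it with duplicates.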
Key Results
• DINOv2 demonstrates significant improvements across tasks:
• Image Classification: 86.5% Top-1 accuracy on ImageNet-1k.
• Depth Estimation: RMSE improved from 0.358 to 0.279.
• Semantic Segmentation: 53.1 mIoU on ADE20k.
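For reference, the three metrics quoted above can be computed as follows. This is a generic NumPy sketch of the standard definitions (Top-1 accuracy, root-mean-square error, mean intersection-over-union), not code from the DINOv2 evaluation suite.

```python
import numpy as np

def top1_accuracy(logits: np.ndarray, labels: np.ndarray) -> float:
    """Fraction of samples where the highest-scoring class is the true label."""
    return float((logits.argmax(axis=1) == labels).mean())

def rmse(pred: np.ndarray, target: np.ndarray) -> float:
    """Root-mean-square error, as used for depth estimation."""
    return float(np.sqrt(np.mean((pred - target) ** 2)))

def mean_iou(pred: np.ndarray, target: np.ndarray, num_classes: int) -> float:
    """Mean intersection-over-union across classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))
```

Lower is better for RMSE (hence 0.358 → 0.279 is an improvement); higher is better for Top-1 accuracy and mIoU.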
Visualizations
• Qualitative examples highlight DINOv2's capabilities:
• Semantic segmentation results with frozen features.
• Depth estimation with smoother predictions.
• Principal Component Analysis (PCA) of patch features.
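The PCA visualization works by projecting each patch token onto the top principal components and mapping them to colour channels. A minimal sketch of that projection, assuming NumPy and a toy feature matrix (the real visualization uses features from the frozen DINOv2 backbone):

```python
import numpy as np

def pca_project(features: np.ndarray, k: int = 3) -> np.ndarray:
    """Project patch features onto their top-k principal components via SVD."""
    centered = features - features.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T               # (num_patches, k)

rng = np.random.default_rng(0)
feats = rng.standard_normal((256, 384))      # 256 patch tokens, 384-dim features
rgb = pca_project(feats)                     # (256, 3): one principal component per colour
```

Reshaping `rgb` back to the 16×16 patch grid and normalizing to [0, 1] yields the characteristic false-colour images where the first components separate foreground objects from background.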
Applications
• DINOv2 finds applications in various domains:
• Autonomous Systems: Navigation and object detection.
• Medical Imaging: Accurate semantic segmentation.
• AR/VR: Enhanced 3D reconstruction and depth estimation.
Fairness and Bias Analysis
• DINOv2 addresses demographic fairness:
• Geographical Fairness: The evaluation dataset includes images from 54 countries.
• Bias Mitigation: Analysis across skin tones, genders, and age groups.
Challenges Addressed
• DINOv2 overcomes key limitations in self-supervised learning:
• Improved generalization across tasks.
• Efficient training at scale with optimized pipelines.
• High-quality feature learning from diverse datasets.
Conclusion
• DINOv2 represents a milestone in self-supervised learning for computer vision:
• State-of-the-art results across benchmarks.
• Scalable architecture for diverse applications.
• Paves the way for future advancements in AI.
Future Directions
• Potential areas for enhancing DINOv2:
• Extending to multimodal domains such as video, text, and audio.
• Optimizing for mobile and embedded systems.
• Hybrid models combining self-supervision with domain-specific fine-tuning.
Thank You
• Contact: [email protected]
• LinkedIn: linkedin.com/in/drashti-bhavsar
