CV Project


OBJECT RECOGNITION

BENCHMARKING ON UNUSUAL DATASETS

SIMRAN CHOUDHARY (2021205)
SIDDHANT (2021204)
AISHIKI BHATTACHARYA (2021007)
IIITD
PROBLEM STATEMENT
In the field of computer vision, object recognition models often fall short when applied to specialized
environments, such as the diverse Indian Driving Dataset (IDD), wildlife identification, and medical imaging.

These models, typically validated on conventional datasets, may not perform reliably under the unique
conditions present in these specific domains. This presentation aims to benchmark object recognition models
on these unusual datasets and conduct a detailed error analysis using the TIDE tool.

By identifying and dissecting specific types of errors, we seek to enhance model robustness and tailor
advancements to meet the unique demands of these critical applications.
DATASETS

YOLO : YOU ONLY LOOK ONCE

To implement YOLO for object detection, we'll need a dataset rich in variety: images containing objects outlined by bounding boxes and assigned class labels (like "car" or "person").

This variety should encompass diverse backgrounds and object appearances to train the model for real-world scenarios where objects might not always look the same. The more labeled data you have, the better the model will perform.

Datasets: COCO, CITYSCAPES, PASCAL VOC (used in paper)

DEFORMABLE CONVOLUTIONAL NETWORKS

For implementing Deformable Convolutional Networks, you'll ideally want a dataset focused on object detection tasks. This means the data should have high-resolution images containing various object categories, each marked with bounding boxes.

The more diverse the object shapes and poses within these categories, the better, as Deformable Convolutional Networks benefit from learning to adapt to different object characteristics. While not essential, annotations like segmentation masks can further enhance the model's performance.

Datasets: IDD, COCO (used in paper), PASCAL VOC (used in paper)


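As a rough illustration of the kind of labels both approaches expect, a single COCO-style object annotation looks roughly like the sketch below. The field values are made up; the bounding box is given as [x, y, width, height] in pixels.

```python
# Illustrative COCO-style annotation for one object (values are invented):
annotation = {
    "id": 1,
    "image_id": 42,
    "category_id": 3,                    # e.g. "car" in the dataset's category list
    "bbox": [120.0, 80.0, 64.0, 48.0],   # [x, y, width, height] in pixels
    "area": 3072.0,
    "iscrowd": 0,
    # optional: a "segmentation" polygon or mask, which can further help DCNs
}
```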
APPROACHES INVESTIGATED : WORKING

YOLO : YOU ONLY LOOK ONCE

The model "looks" at an image only once and predicts which objects exist and where they are located by outputting bounding boxes and class probabilities. This approach contrasts with other methods that involve multiple stages to first propose regions and then classify them.

Fig : The model divides the image into an even grid and simultaneously predicts bounding boxes and class probabilities.

YOLO divides the input image into a grid. Each grid cell is responsible for detecting objects that have their center in that cell.

Each cell predicts multiple bounding boxes along with confidence scores that indicate the presence of a class-specific object and how accurate the box is.

The final output is a combination of the bounding box coordinates, confidence scores, and class probabilities.

APPLICATIONS :

Its ability to process images swiftly (up to 45 frames per second) makes it ideal for applications like autonomous driving, where it can detect pedestrians, vehicles, and traffic signs.

YOLO enhances surveillance systems by monitoring for unusual activities, identifying unauthorised entries, and tracking individuals or objects continuously.

Additionally, YOLO's effectiveness in sports analytics and agriculture monitoring via drones showcases its high efficiency in real-world applications where quick and accurate object detection is crucial.
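To make the grid decoding described above concrete, here is a minimal sketch assuming the original YOLO layout of a 7 x 7 grid, 2 boxes per cell, and 20 classes; the function and variable names are illustrative, not from any particular implementation, and a real pipeline would also apply non-maximum suppression to the result.

```python
# Minimal NumPy sketch: turn a YOLO-style S x S grid of raw predictions into
# a list of (cx, cy, w, h, score, class) detections.
import numpy as np

S, B, C = 7, 2, 20          # grid size, boxes per cell, number of classes

def decode_grid(pred, conf_thresh=0.25):
    """pred: (S, S, B*5 + C) array of raw per-cell predictions."""
    detections = []
    for row in range(S):
        for col in range(S):
            cell = pred[row, col]
            class_probs = cell[B * 5:]                 # class distribution shared by the cell
            for b in range(B):
                x, y, w, h, conf = cell[b * 5:(b + 1) * 5]
                # x, y are offsets inside the cell; convert to image-relative coordinates
                cx, cy = (col + x) / S, (row + y) / S
                scores = conf * class_probs            # class-specific confidence
                cls = int(np.argmax(scores))
                if scores[cls] >= conf_thresh:
                    detections.append((cx, cy, w, h, float(scores[cls]), cls))
    return detections

# Example with a random "network output" standing in for a real prediction map
boxes = decode_grid(np.random.rand(S, S, B * 5 + C))
```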
DEFORMABLE CONVOLUTIONAL NETWORKS : WORKING

Deformable convolution and deformable RoI (Region of Interest) pooling allow the sampling grid in CNNs to be augmented with learned offsets, improving the network's ability to handle geometric variations without extra supervision.

DEFORMABLE CONVOLUTION

This technique introduces additional learnable offsets into the regular grid sampling of the standard convolution process. These offsets allow the convolutional filters to adapt to different shapes and sizes of the input data.

Fig : Illustration of 3 × 3 deformable convolution.
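A minimal sketch of this idea using torchvision's DeformConv2d is shown below: a regular convolution predicts one (dx, dy) offset per kernel position, and the deformable convolution then samples the input on that shifted grid. The module names and tensor shapes here are illustrative assumptions, not the paper's exact architecture.

```python
# Deformable convolution sketch: offsets are predicted by an ordinary conv
# layer and fed to torchvision.ops.DeformConv2d.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableBlock(nn.Module):
    def __init__(self, in_ch, out_ch, k=3, pad=1):
        super().__init__()
        # one (dx, dy) pair per kernel position, per output location -> 2*k*k channels
        self.offset_pred = nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, padding=pad)
        self.deform_conv = DeformConv2d(in_ch, out_ch, kernel_size=k, padding=pad)

    def forward(self, x):
        offsets = self.offset_pred(x)           # learned sampling offsets
        return self.deform_conv(x, offsets)     # convolution on the deformed grid

x = torch.randn(1, 64, 32, 32)
y = DeformableBlock(64, 128)(x)                 # -> shape (1, 128, 32, 32)
```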

DEFORMABLE ROI POOLING

Similar to deformable convolution, deformable RoI pooling modifies the standard region of interest (RoI) pooling by adding learnable offsets, which allow for flexible and adaptive pooling operations over the regions proposed by object detectors.

Fig : Illustration of 3 × 3 deformable RoI pooling.
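The sketch below is a simplified, hedged take on the same idea (not the paper's exact implementation): an RoI is first pooled into k x k bins with roi_align, a small fully-connected head predicts one (dx, dy) offset per bin, and the feature map is re-sampled at the shifted bin centres with grid_sample. The gamma scale, the fc head, and sampling a single point per bin are illustrative simplifications.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import roi_align

class DeformableRoIPool(nn.Module):
    def __init__(self, channels, k=3, gamma=0.1):
        super().__init__()
        self.k, self.gamma = k, gamma
        self.offset_fc = nn.Linear(channels * k * k, 2 * k * k)

    def forward(self, feat, rois):
        # feat: (1, C, H, W) feature map; rois: (R, 5) rows of (batch_idx, x1, y1, x2, y2)
        pooled = roi_align(feat, rois, output_size=self.k)              # (R, C, k, k)
        offsets = self.offset_fc(pooled.flatten(1)).view(-1, self.k, self.k, 2)

        H, W = feat.shape[-2:]
        outputs = []
        for roi, off in zip(rois, offsets):
            x1, y1, x2, y2 = roi[1:].tolist()
            # regular bin centres of the RoI, normalised to [-1, 1] for grid_sample
            xs = torch.linspace(x1, x2, self.k, device=feat.device)
            ys = torch.linspace(y1, y2, self.k, device=feat.device)
            gy, gx = torch.meshgrid(ys, xs, indexing="ij")
            grid = torch.stack([gx / (W - 1), gy / (H - 1)], dim=-1) * 2 - 1
            # shift each bin by its learned offset, scaled by the RoI size
            scale = torch.tensor([(x2 - x1) / (W - 1), (y2 - y1) / (H - 1)],
                                 device=feat.device)
            grid = grid + self.gamma * off * scale
            outputs.append(F.grid_sample(feat, grid[None], align_corners=True))
        return torch.cat(outputs)                                       # (R, C, k, k)

# Example: 3 x 3 deformable pooling of two RoIs over a 64-channel feature map
feat = torch.randn(1, 64, 50, 50)
rois = torch.tensor([[0, 5.0, 5.0, 30.0, 40.0], [0, 10.0, 8.0, 45.0, 45.0]])
out = DeformableRoIPool(64)(feat, rois)                                 # (2, 64, 3, 3)
```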
EVALUATION METRICS

Mean Average Precision [mAP]

mAP is calculated based on the precision and recall for each object class across different confidence thresholds. Precision measures the accuracy of the predictions, while recall measures the ability of the model to detect all relevant instances in the dataset.

The mAP depends on the IoU metric, which quantifies the overlap between the predicted bounding box and the ground truth bounding box.

The Average Precision is calculated for each class independently, and the mean of these AP scores is taken to get the mAP, providing a single performance figure that considers both the detection and classification capabilities of the model across multiple classes.

Mean Intersection-over-Union [mIoU]

mIoU measures the accuracy of an object detection model by calculating the overlap between the predicted bounding boxes and the ground truth bounding boxes.

It is determined by the ratio of the intersection of the predicted bounding box and the ground truth bounding box to their union. This metric assesses how well the model identifies and localizes objects within an image.

Task Independent Detection Errors [TIDE]

A diagnostic tool that categorizes errors in object detection models into six distinct types: Classification, Localization, Both, Missed Ground Truth, Background, and Duplicate Detection errors.

It evaluates models by running them on test datasets, comparing their predictions with the ground truths, and then employing a structured framework to classify and measure different error types.

By providing detailed insights into error patterns, TIDE supports accurate diagnostics and ongoing enhancements in models, effectively improving detection accuracy by focusing on the most significant errors.
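Both mAP and mIoU rest on the same intersection-over-union ratio described above; the small sketch below computes it for two axis-aligned boxes given as (x1, y1, x2, y2). The function name and the example boxes are illustrative.

```python
# IoU = intersection area / union area for two (x1, y1, x2, y2) boxes.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # intersection rectangle (empty if the boxes do not overlap)
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # 25 / 175 ≈ 0.14
```

For the TIDE analysis itself, the reference implementation is distributed as the tidecv package. The call pattern below follows its published usage, but the file path is a placeholder and the exact API should be checked against the package documentation.

```python
# Sketch of running TIDE on COCO-format predictions (path is a placeholder).
from tidecv import TIDE, datasets

tide = TIDE()
tide.evaluate(datasets.COCO(),                      # ground-truth annotations
              datasets.COCOResult("results.json"),  # model predictions
              mode=TIDE.BOX)                        # bounding-box errors
tide.summarize()   # per-type error breakdown (Cls, Loc, Both, Miss, Bkg, Dupe)
tide.plot()        # summary plots of the error distribution
```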
HARDWARE REQUIREMENTS

GPU (Recommended): A Graphics Processing Unit (GPU) is highly recommended for running the code efficiently, since deep learning models benefit from the parallel processing capabilities of GPUs.

Sufficient RAM: At least 8 GB of RAM is adequate, but having more can improve performance.
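A quick way to confirm the GPU assumption before training is the short check below: it selects a CUDA device if PyTorch can see one and falls back to the CPU otherwise.

```python
# Pick a GPU if available, otherwise run on the CPU.
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Running on: {device}")
if device.type == "cuda":
    print(torch.cuda.get_device_name(0))
```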
