
Fast Methods For Deep Learning Based Object Detection

This document summarizes problems with the R-CNN object detection method and introduces Fast R-CNN and Faster R-CNN as improved methods. R-CNN training is slow and requires extracting deep learning features for each object proposal. Fast R-CNN improves on this by only extracting features once per image and using ROI pooling to classify and regress proposals. Faster R-CNN further speeds up detection by adding a Region Proposal Network to generate proposals, removing the need for an external proposal method. It enables end-to-end training of the whole system.

Uploaded by

seul alone
Copyright: © All Rights Reserved

Fast Methods for Deep Learning Based Object Detection
R-CNN: Problems

● Training is a multi-stage pipeline.
○ R-CNN first fine-tunes a ConvNet on object proposals using log loss.
○ Then, it fits SVMs to ConvNet features. These SVMs act as object detectors, replacing the softmax classifier learned by fine-tuning.
○ In the third training stage, bounding-box regressors are learned.
● Training is expensive in space and time.
○ For SVM and bounding-box regressor training, features are extracted from each object proposal in each image and written to disk.
○ With very deep networks, such as VGG16, this process takes 2.5 GPU-days for the 5k images of the VOC07 trainval set. These features require hundreds of gigabytes of storage.
● Object detection is slow.
○ At test time, features are extracted from each object proposal in each test image.
○ Detection with VGG16 takes 47 s / image (on a GPU).
Fast R-CNN
[Figure slides: images not preserved]
Fast R-CNN Training
[Figure slides: images not preserved]
Fast R-CNN

● Compute convolutional features only once per image.
● RoI Pooling layer extracts a fixed-length vector representation of each proposal.
● Classify and regress bounding boxes with a multi-task loss for end-to-end training.
Fast R-CNN: RoI Pooling
[Figure slides: images not preserved]
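The idea of RoI pooling can be sketched in a few lines of NumPy. This is an illustrative sketch, not the layer from the Fast R-CNN code base: the function name `roi_max_pool`, the inclusive `(x1, y1, x2, y2)` RoI convention in feature-map cells, and the 2x2 output size are assumptions made for the example. Whatever the RoI size, the result has the same fixed shape.

```python
import numpy as np

def roi_max_pool(feature_map, roi, output_size=(2, 2)):
    """Max-pool one RoI of a (C, H, W) feature map into a fixed (C, out_h, out_w) grid.

    roi is (x1, y1, x2, y2) in feature-map cells, inclusive (an assumed convention).
    """
    x1, y1, x2, y2 = roi
    out_h, out_w = output_size
    region = feature_map[:, y1:y2 + 1, x1:x2 + 1]
    _, h, w = region.shape

    # Split the region into out_h x out_w bins of roughly equal size.
    ys = np.linspace(0, h, out_h + 1)
    xs = np.linspace(0, w, out_w + 1)

    pooled = np.empty((feature_map.shape[0], out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        y_lo = int(np.floor(ys[i]))
        y_hi = max(int(np.ceil(ys[i + 1])), y_lo + 1)  # every bin covers at least one row
        for j in range(out_w):
            x_lo = int(np.floor(xs[j]))
            x_hi = max(int(np.ceil(xs[j + 1])), x_lo + 1)  # and at least one column
            pooled[:, i, j] = region[:, y_lo:y_hi, x_lo:x_hi].max(axis=(1, 2))
    return pooled

# Two RoIs of different sizes on the same feature map give same-shaped outputs.
features = np.arange(2 * 6 * 8, dtype=np.float64).reshape(2, 6, 8)
small = roi_max_pool(features, (1, 1, 4, 4))
large = roi_max_pool(features, (0, 0, 7, 5))
print(small.shape, large.shape)
```

Because the feature map is computed once and only the cheap pooling step runs per proposal, the per-proposal ConvNet passes of R-CNN are avoided.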
Fast R-CNN

● Instead of SVM + bounding-box regression:
○ Softmax classifier output
○ Bounding-box regression output
● Multi-task training: both outputs are trained jointly with a single combined loss.
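A minimal sketch of such a multi-task loss for a single proposal, assuming the form described in the Fast R-CNN paper: a log loss over K+1 classes (class 0 is background) plus a smooth L1 box loss that is only counted for non-background proposals. The function names and the weight `lam` are illustrative choices, not taken from the slides.

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear for |x| >= 1."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)

def multitask_loss(class_scores, true_class, box_pred, box_target, lam=1.0):
    """Classification + localization loss for one proposal.

    class_scores: (K+1,) raw scores, index 0 is background (assumed layout).
    box_pred:     (K+1, 4) predicted box offsets, one row per class.
    box_target:   (4,) regression target for the true class.
    """
    # Log loss on the softmax output (computed in a numerically stable way).
    shifted = class_scores - class_scores.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    cls_loss = -log_probs[true_class]

    # Background proposals have no ground-truth box, so they add no box loss.
    if true_class >= 1:
        loc_loss = smooth_l1(box_pred[true_class] - box_target).sum()
    else:
        loc_loss = 0.0
    return cls_loss + lam * loc_loss

scores = np.zeros(3)                      # uniform scores over background + 2 classes
offsets = np.zeros((3, 4))
offsets[1] = [0.5, 0.0, 0.0, 2.0]
print(multitask_loss(scores, 1, offsets, np.zeros(4)))
```

Since one scalar drives the gradient, the classifier, the box regressor, and the shared convolutional layers are all updated in the same training stage.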
Fast R-CNN

● Advantages
○ Training is single-stage, using a multi-task loss
○ Training can update all network layers
○ No disk storage is required for feature caching
○ More accurate: 66.9 mAP vs. 66.0 mAP
○ Faster training: 9.5 h vs. 84 h (8.8x)
○ Faster test time per image: 0.32 s vs. 47 s (146x)
● Problem
○ These test times do not include region proposals.
○ Test time with region proposals: 2 s vs. 50 s (25x)
● Solution
○ Make the CNN do region proposals too!
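The speedups quoted above follow directly from the timings on this slide. The short check below uses only those numbers; the dictionary keys are made up for the example. It also shows why proposals are the next bottleneck: of the 2 s per image, only 0.32 s is spent in the network.

```python
# Timings copied from the slide (R-CNN vs. Fast R-CNN).
rcnn = {"train_hours": 84, "test_s": 47, "test_with_proposals_s": 50}
fast = {"train_hours": 9.5, "test_s": 0.32, "test_with_proposals_s": 2}

# Speedup of Fast R-CNN over R-CNN for each measurement.
speedup = {key: rcnn[key] / fast[key] for key in rcnn}

# Fraction of Fast R-CNN test time spent on external region proposals.
proposal_share = (fast["test_with_proposals_s"] - fast["test_s"]) / fast["test_with_proposals_s"]

for key, value in speedup.items():
    print(f"{key}: {value:.1f}x")
print(f"share of test time spent on proposals: {proposal_share:.0%}")
```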
Faster R-CNN
● Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks (2015)
○ Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun
● Insert a Region Proposal Network (RPN) after the last convolutional layer.
● RPN trained to produce region proposals directly; no need for external region proposals!
● After the RPN, use RoI Pooling and a downstream classifier and bbox regressor, just like Fast R-CNN.
Faster R-CNN: RPN
● Slide a small window on the already computed feature map (FREE!).
● Build a small network for:
○ Classifying object or not-object, and
○ Regressing bbox locations
● Position of the sliding window provides localization information with reference to the image.
● Box regression provides finer localization information with reference to this sliding window.
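The two sources of localization described above can be sketched as follows: anchors are placed at every sliding-window position (coarse location from the window), then regression outputs shift and rescale them (fine location relative to the window). This is a sketch under assumptions: the stride of 16, the scales 128/256/512, and the aspect ratios 1:2, 1:1, 2:1 are the commonly cited Faster R-CNN defaults rather than values from these slides, and the function names are made up.

```python
import numpy as np

def make_anchors(feat_h, feat_w, stride=16, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (feat_h * feat_w * A, 4) anchors as (cx, cy, w, h) in image pixels."""
    shapes = []
    for s in scales:
        for r in ratios:
            # r is height / width; every anchor of scale s has area s * s.
            shapes.append((s / np.sqrt(r), s * np.sqrt(r)))
    shapes = np.array(shapes)                                # (A, 2)

    # The centre of each feature-map cell, mapped back to image coordinates.
    cx = (np.arange(feat_w) + 0.5) * stride
    cy = (np.arange(feat_h) + 0.5) * stride
    cxg, cyg = np.meshgrid(cx, cy)
    centers = np.stack([cxg.ravel(), cyg.ravel()], axis=1)   # (H*W, 2)

    num_shapes = len(shapes)
    return np.concatenate(
        [np.repeat(centers, num_shapes, axis=0), np.tile(shapes, (len(centers), 1))],
        axis=1,
    )

def decode(anchors, deltas):
    """Apply regression outputs (dx, dy, dw, dh) to anchors (cx, cy, w, h)."""
    cx = anchors[:, 0] + deltas[:, 0] * anchors[:, 2]
    cy = anchors[:, 1] + deltas[:, 1] * anchors[:, 3]
    w = anchors[:, 2] * np.exp(deltas[:, 2])
    h = anchors[:, 3] * np.exp(deltas[:, 3])
    return np.stack([cx, cy, w, h], axis=1)

anchors = make_anchors(feat_h=2, feat_w=3)
proposals = decode(anchors, np.zeros_like(anchors))  # zero deltas leave anchors unchanged
print(anchors.shape)
```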
Faster R-CNN: Training
● In the paper: Ugly pipeline
○ Use alternating optimization to train RPN, then Fast R-CNN with RPN proposals, etc.
○ More complex than it has to be
● Since publication: Joint training!
○ One network, four losses
■ RPN classification (anchor good / bad)
■ RPN regression (anchor -> proposal)
■ Fast R-CNN classification (over classes)
■ Fast R-CNN regression (proposal -> box)
How Many Anchors Do We Need?
How Many Proposals Do We Need?

● Fast R-CNN used 2000 proposals from selective search.
● Faster R-CNN needs only 300 proposals from the RPN.
● RPN is better than selective search
○ Deep learning vs. classical computer vision
○ Optimized for this task
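One way to cut the RPN output down to a few hundred proposals is to rank boxes by objectness score, suppress near-duplicates, and keep the top N. The slides do not say how the 300 are chosen, so treat this as an assumption: the sketch uses greedy non-maximum suppression with an IoU threshold of 0.7, and the function names are illustrative.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def select_proposals(boxes, scores, top_n=300, iou_threshold=0.7):
    """Greedy NMS: return indices of at most top_n boxes, highest score first."""
    order = np.argsort(-scores)
    keep = []
    while order.size > 0 and len(keep) < top_n:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        # Drop every remaining box that overlaps the kept one too much.
        order = rest[iou(boxes[best], boxes[rest]) <= iou_threshold]
    return keep

boxes = np.array([[0, 0, 10, 10], [0, 0, 10, 9], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(select_proposals(boxes, scores))
```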
How Much Data Do We Need?
Also Read:
R-FCN: Object Detection via Region-based Fully Convolutional Networks
https://arxiv.org/abs/1605.06409
Another Approach For Speeding Up Proposals
Just Don’t Do It
Just RPN From Faster R-CNN

● Much faster than Faster R-CNN!
● But the RPN has only an object / not-object classifier.
Add Classification!

● What about accuracy?
● How well does it handle different object scales?
Add More Scales!
Add More Classifiers
SSD: Single Shot MultiBox Detector
[Figure slides: images not preserved]
Why Does Stride Matter?
● Smaller stride means more scanned windows.
● Handles close objects better.
○ Each window needs enough default boxes to match objects accurately.
● Handles small objects better.
○ Better IoU with objects.
○ More positive windows per object.
● Too small a stride is bad.
○ Too many windows means too many false positives to filter.
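How quickly the window count grows as the stride shrinks can be checked with a small calculation. It assumes one window per feature-map cell on a square input, with the cell count rounded up; the 300-pixel input size and the function name are chosen only for illustration.

```python
def windows_per_image(image_size, stride):
    """Number of sliding-window positions on a square image, one per feature-map cell."""
    cells = -(-image_size // stride)  # ceiling division (assumes padded feature maps)
    return cells * cells

# Halving the stride roughly quadruples the number of windows to score and filter.
for stride in (32, 16, 8):
    print(f"stride {stride}: {windows_per_image(300, stride)} windows")
```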
Improving Accuracy

● Object detection data is unbalanced
○ 1-30 true positives per image.
○ 8,000-25,000 false positives per image.
● Solution
○ Resample at a fixed positive:negative ratio (1:3)
● Not all negatives are equal!
○ Some are harder than others
● Better Solution
○ Hard negative mining: keep only the worst-misclassified false positives, at the same fixed ratio.
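Hard negative mining at a 1:3 ratio can be sketched as a mask over candidate boxes: keep every positive, then keep only the negatives with the highest loss, up to three per positive. The function name and the per-box `losses` input are assumptions for the example; real detectors apply this per image or per batch inside the loss computation.

```python
import numpy as np

def hard_negative_mining(losses, is_positive, neg_pos_ratio=3):
    """Return a boolean mask of the boxes that should contribute to the loss.

    losses:      (N,) per-box classification loss.
    is_positive: (N,) boolean array, True where the box is matched to an object.
    """
    num_pos = int(is_positive.sum())
    num_neg = min(neg_pos_ratio * num_pos, int((~is_positive).sum()))

    # Rank negatives by loss; positives are pushed to the end of the ranking.
    neg_losses = np.where(is_positive, -np.inf, losses)
    hardest = np.argsort(-neg_losses)[:num_neg]

    keep = is_positive.copy()
    keep[hardest] = True
    return keep

losses = np.array([0.1, 2.0, 0.5, 3.0, 0.2, 0.05, 1.0, 0.3])
is_positive = np.array([True, False, False, False, False, False, False, False])
print(hard_negative_mining(losses, is_positive))
```

With one positive, only the three highest-loss negatives survive; the easy background boxes no longer dominate the gradient.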
Improving Accuracy

● Not enough data?
● Solution: Data augmentation
○ Random horizontal flip
○ Random crop
○ Random color distortion
○ Random expansion
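For detection, augmentation has to transform the boxes together with the image. A minimal sketch of the first item, random horizontal flip, assuming an (H, W, C) image array and boxes as (x1, y1, x2, y2) in pixels; the function name and signature are illustrative.

```python
import numpy as np

def random_horizontal_flip(image, boxes, rng, p=0.5):
    """Flip the image left-right with probability p and mirror the boxes to match.

    image: (H, W, C) array; boxes: (N, 4) array of (x1, y1, x2, y2) in pixels.
    """
    if rng.random() >= p:
        return image, boxes
    width = image.shape[1]
    flipped = image[:, ::-1]
    new_boxes = boxes.copy()
    # The old right edge becomes the new left edge, and vice versa.
    new_boxes[:, 0] = width - boxes[:, 2]
    new_boxes[:, 2] = width - boxes[:, 0]
    return flipped, new_boxes

rng = np.random.default_rng(0)
image = np.zeros((4, 10, 3))
boxes = np.array([[1.0, 0.0, 4.0, 2.0]])
flipped_image, flipped_boxes = random_horizontal_flip(image, boxes, rng, p=1.0)
print(flipped_boxes)
```

Random crop and random expansion need the same care: boxes must be shifted, clipped, or dropped to match the new image frame.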
How Much Does It Help?
Also Read:
YOLO9000: Better, Faster, Stronger
https://arxiv.org/abs/1612.08242
Speed/accuracy factors in object detectors

● Algorithm: Faster R-CNN / SSD / R-FCN / YOLO / ...
● Backbone: VGG16 / ResNet / MobileNet / ...
● Input size
● Many other hyperparameters...
Speed/accuracy trade-offs for modern convolutional object detectors (Google)
Frameworks

● Caffe
○ Faster R-CNN: https://github.com/rbgirshick/py-faster-rcnn
○ SSD: https://github.com/weiliu89/caffe/tree/ssd
● TensorFlow Object Detection API:
○ https://github.com/tensorflow/models/tree/master/research/object_detection
● Detectron:
○ https://github.com/facebookresearch/Detectron
● Many more re-implementations in different languages...
Honorable mentions

● VGG16: https://arxiv.org/abs/1409.1556
● ResNet: https://arxiv.org/abs/1512.03385
● Inception-ResNet: https://arxiv.org/abs/1602.07261
● ResNeXt: https://arxiv.org/abs/1611.05431
● Xception: https://arxiv.org/abs/1610.02357
● DenseNet: https://arxiv.org/abs/1608.06993
● MobileNet: https://arxiv.org/abs/1704.04861
● SqueezeNet: https://arxiv.org/abs/1602.07360
Looking for brilliant researchers

[email protected]
