Paper - Review - 2 - EfficientDet - Scalable and Efficient Object Detection
Problem Statement:
As one of the core applications in computer vision, object detection has become increasingly important in
scenarios that demand high accuracy, but have limited model size and latency, such as robotics and
driverless cars. Unfortunately, many current high-accuracy detectors are computationally expensive and
have large sizes.
Although previous works have aimed at achieving better efficiency, they usually do so by
sacrificing accuracy. Moreover, they typically target only a narrow range of resource requirements,
while real-world applications span a wide range of resource constraints.
Question: Is it possible to build a scalable detection architecture with both higher accuracy and better
efficiency across a wide spectrum of resource constraints?
Summary
To design accurate and efficient object detectors that can also adapt to a wide range of resource constraints,
EfficientDet was introduced. It builds upon the previous work on scaling neural networks (EfficientNet),
and incorporates some new features.
The authors incorporated two major features in the current model:
❖ A weighted bi-directional feature pyramid network (BiFPN) for easy and fast multi-scale
feature fusion. It learns the importance of different input features and repeatedly applies
top-down and bottom-up multi-scale feature fusion.
❖ A new compound scaling method that jointly scales the resolution, depth, and width of
the backbone, feature network, and box/class prediction networks.
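The weighted fusion in the first bullet can be sketched in a few lines. The paper's "fast normalized fusion" replaces a softmax over the learned weights with a cheap ReLU-plus-normalization; the function name and toy inputs below are illustrative, not from any real library:

```python
# Sketch of BiFPN's fast normalized fusion: O = sum_i (w_i / (eps + sum_j w_j)) * I_i,
# where the learned scalar weights w_i are kept non-negative via ReLU.
import numpy as np

def fast_normalized_fusion(inputs, weights, eps=1e-4):
    """Fuse same-shaped feature maps with learnable non-negative scalar weights."""
    w = np.maximum(weights, 0.0)   # ReLU keeps each weight non-negative
    w = w / (eps + w.sum())        # normalize so the fused output stays bounded
    return sum(wi, * [0]) if False else sum(wi * x for wi, x in zip(w, inputs))

# Two toy "feature maps" fused with equal weights -> approximately their average.
a, b = np.ones((2, 2)), 3 * np.ones((2, 2))
fused = fast_normalized_fusion([a, b], np.array([1.0, 1.0]))
```

Because the weights are normalized by their sum rather than a softmax, this fusion avoids the exponentials that made the softmax-based variant slower on hardware, at essentially no accuracy cost.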
EfficientDet achieves state-of-the-art accuracy with far fewer parameters and FLOPs than previous object
detectors, and is up to 3x to 8x faster on GPU/CPU. It also achieves better
performance on semantic segmentation.
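The compound scaling above can be made concrete. A single coefficient φ drives all of the scaling formulas in the paper (D0 corresponds to φ = 0); this sketch keeps the raw formulas and omits the paper's additional rounding of channel counts:

```python
# Sketch of EfficientDet's compound scaling: one coefficient phi jointly scales
# input resolution, BiFPN width/depth, and the depth of the box/class heads.
def efficientdet_config(phi):
    return {
        "input_resolution": 512 + phi * 128,          # grows linearly with phi
        "bifpn_width":      int(64 * (1.35 ** phi)),  # channels grow geometrically
        "bifpn_depth":      3 + phi,                  # BiFPN layers grow linearly
        "head_depth":       3 + phi // 3,             # box/class net layers
    }
```

For example, D0 (φ = 0) uses 512x512 inputs with a 3-layer, 64-channel BiFPN, while D3 (φ = 3) scales to 896x896 inputs and a 6-layer BiFPN, all from the same formulas rather than per-model hand tuning.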
Model Architecture
The backbone networks used are ImageNet
pretrained EfficientNets. The proposed BiFPN
serves as the feature network, which takes level
3–7 features {P3, P4, P5, P6, P7} from the
backbone network and repeatedly applies
top-down and bottom-up bidirectional feature
fusion. These fused features are fed to a class
and box network to produce object class and
bounding box predictions respectively. The class
and box network weights are shared across all
levels of features.
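The forward pass described above can be sketched shape-wise. Everything below is a stand-in (random "features", an identity BiFPN layer, 1x1 projections as heads); the point it illustrates is the pyramid levels P3-P7 and the single set of head weights shared across all levels:

```python
# Minimal shape-level sketch of the EfficientDet forward pass (no learning).
import numpy as np

rng = np.random.default_rng(0)

def backbone(image):
    """Stand-in for EfficientNet: emit features P3..P7 at strides 8..128."""
    h, w = image.shape[:2]
    return {f"P{l}": rng.standard_normal((h // 2**l, w // 2**l, 64))
            for l in range(3, 8)}

def bifpn_layer(feats):
    """Stand-in for one top-down + bottom-up fusion pass (identity here)."""
    return feats

# One set of head weights, applied to EVERY pyramid level (weight sharing).
num_classes, num_anchors = 90, 9
w_cls = rng.standard_normal((64, num_anchors * num_classes))
w_box = rng.standard_normal((64, num_anchors * 4))

def predict(feats):
    cls = {l: f @ w_cls for l, f in feats.items()}  # same w_cls at all levels
    box = {l: f @ w_box for l, f in feats.items()}  # same w_box at all levels
    return cls, box

image = np.zeros((512, 512, 3))
feats = backbone(image)
for _ in range(3):               # the BiFPN block is repeated several times
    feats = bifpn_layer(feats)
cls, box = predict(feats)
```

Sharing the class/box weights across levels keeps the heads small while still producing per-level predictions at every spatial scale.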