RF-DETR by Roboflow: Speed Meets Accuracy in Object Detection


Bhomik Sharma
APRIL 15, 2025
Computer Vision Object Detection Transformer Neural Networks

Object detection has come a long way, especially with the rise of
transformer-based models. RF-DETR, developed by Roboflow, is
one such model that offers both speed and accuracy. Using
Roboflow’s tools makes the process even easier. Their platform
handles everything from uploading and annotating data to exporting
it in the right format. This means less time setting things up and
more time training and improving your model.

In this blog, we’ll look at how RF-DETR works, examine its architecture, and fine-tune it to perform well on an underwater dataset. We will also work with tools like Supervision that help improve results through smart data handling and visualization.

1. Model variants, performance, and benchmarking

2. Architecture Overview

3. Inference Results

4. Fine-tuning on aquatic dataset

5. Key Takeaways

6. Conclusion

7. References

Model variants, performance, and benchmarking
RF-DETR is a real-time, transformer-based object detection model
architecture developed by Roboflow and released under the Apache
2.0 license.

RF-DETR can exceed 60 AP (Average Precision) on the Microsoft COCO benchmark while remaining competitive at base sizes. It also achieves state-of-the-art performance on RF100-VL, an object detection benchmark that measures a model’s domain adaptability to real-world problems. RF-DETR’s speed is comparable to that of current real-time object detection models.
RF-DETR is available in two model sizes: RF-DETR-Base (29M parameters) and RF-DETR-Large (128M parameters). The base variant is best for fast inference, while the large variant gives the most accurate predictions at the cost of longer compute time.
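To make that trade-off concrete, here is a minimal sketch of switching between the two variants, assuming Roboflow’s rfdetr Python package and its RFDETRBase / RFDETRLarge classes (check the official repository for the exact API):

# Minimal sketch: choosing an RF-DETR variant (assumed rfdetr package API).
from rfdetr import RFDETRBase, RFDETRLarge

# Base (~29M parameters): fastest inference, good accuracy.
model = RFDETRBase()

# Large (~128M parameters): most accurate predictions, slower to compute.
# model = RFDETRLarge()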

RF-DETR is small enough to run on the edge, making it ideal for deployments requiring strong accuracy and real-time performance.

Fig 1: RF-DETR performance on COCO and RF100-VL benchmarks

Model      | Params (M) | mAP COCO @0.50:0.95 | mAP RF100-VL avg @0.50 | mAP RF100-VL avg @0.50:0.95 | Total Latency (ms, T4, bs=1)
D-FINE-M   | 19.3       | 55.1                | N/A                    | N/A                         | 6.3
LW-DETR-M  | 28.3       | 52.5                | 84.0                   | 57.5                        | 6.0
YOLO11m    | 20.0       | 51.5                | 84.9                   | 59.7                        | 5.7
YOLOv8m    | 28.9       | 50.6                | 85.0                   | 59.8                        | 6.3
RF-DETR-B  | 29.0       | 53.3                | 86.7                   | 60.3                        | 6.0


Architecture Overview
CNNs remain the core component of many of the best real-time
object detection approaches, including models like D-FINE that
leverage both CNNs and Transformers in their architecture.

Since the introduction of RT-DETR in 2023, the DETR family of models, built on the transformer architecture, has shown results that match and even surpass CNN-based approaches on end-to-end object detection tasks by eliminating hand-designed components like anchor generation and non-maximum suppression (NMS), which were standard in frameworks like Faster R-CNN.
Despite the advantages offered by DETR models, they suffer from two significant limitations:

1. Slow Convergence
2. Poor Performance on Small Objects

Fig 2: Deformable DETR; RF-DETR is built on this architecture.

RF-DETR compensates for the above limitations using an architecture based on the Deformable DETR model. However, unlike Deformable DETR, which uses a multi-scale self-attention mechanism, RF-DETR extracts image feature maps from a single-scale backbone.

RF-DETR combines LW-DETR with a pre-trained DINOv2 backbone. Using the pre-trained DINOv2 backbone gives the model an exceptional ability to adapt to novel domains based on the knowledge stored in the pre-trained model.
Let’s examine the architectural details of LW-DETR, which RF-DETR adopts alongside DINOv2. The architectural details of DINOv2 are beyond the scope of this article; for those interested in its ideas and architecture, see our article on LearnOpenCV, which covers both a paper explanation and a road segmentation implementation.
LW-DETR

LW-DETR’s architecture consists of a simple stack of a ViT encoder, a projector, and a shallow DETR decoder. It explores the feasibility of plain ViT backbones and a DETR framework for real-time detection.

Encoder:

The paper’s authors used a vanilla ViT for the detection encoder. A plain ViT consists of a patchification layer and transformer encoder layers. A transformer encoder layer in the original ViT contains a global self-attention layer over all the tokens and an FFN layer. Global self-attention is computationally costly, and its time complexity is quadratic in the number of tokens.
Fig 3: ViT encoder

Hence, the authors instead introduced window self-attention to reduce the computational complexity. They also proposed aggregating multi-level feature maps, i.e., the intermediate and final feature maps in the encoder, to form stronger encoded feature maps.
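To see roughly what window self-attention saves, here is a small, self-contained PyTorch sketch (illustrative only, not the LW-DETR code) that partitions a token grid into non-overlapping windows so attention can be computed per window instead of over all tokens:

import torch

def window_partition(tokens, grid_h, grid_w, window):
    # tokens: (B, grid_h * grid_w, C) patch embeddings from the ViT patchification layer.
    # Assumes grid_h and grid_w are divisible by the window size.
    B, N, C = tokens.shape
    x = tokens.view(B, grid_h, grid_w, C)
    # Group the grid into non-overlapping (window x window) tiles.
    x = x.view(B, grid_h // window, window, grid_w // window, window, C)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window * window, C)
    return x  # (B * num_windows, window * window, C)

# Self-attention inside each window costs O((window^2)^2) per window,
# versus O(N^2) for global self-attention over all N tokens.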

Decoder:

The decoder is a stack of transformer decoder layers. Each layer consists of self-attention, cross-attention, and an FFN. The authors adopt deformable cross-attention for computational efficiency. DETR and its variants usually adopt six decoder layers, but the authors showed that using only three transformer decoder layers reduces the decoder time from 1.4 ms to 0.7 ms, which is significant compared to the 1.3 ms cost of the remaining parts of the tiny version of their approach.

They adopted a mixed query selection scheme to form the object queries as a combination of content queries and spatial queries. The content queries are learnable embeddings, similar to DETR. The spatial queries are based on a two-stage scheme: select the top-K features from the last layer of the Projector, predict bounding boxes from them, and transform the corresponding boxes into embeddings that serve as spatial queries.
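That two-stage spatial-query selection can be summarized with a rough PyTorch-style sketch (illustrative only; the heads score_head, box_head, and box_embed and the tensor shapes are assumptions, not LW-DETR’s actual code):

import torch

def select_spatial_queries(projector_feats, score_head, box_head, box_embed, k=300):
    # projector_feats: (B, N, C) features from the last layer of the Projector.
    scores = score_head(projector_feats).max(dim=-1).values       # (B, N) per-token objectness
    topk_idx = scores.topk(k, dim=1).indices                      # (B, K) indices of top-K features
    gather_idx = topk_idx.unsqueeze(-1).expand(-1, -1, projector_feats.size(-1))
    topk_feats = torch.gather(projector_feats, 1, gather_idx)     # (B, K, C) selected features
    boxes = box_head(topk_feats)                                  # (B, K, 4) predicted boxes
    spatial_queries = box_embed(boxes)                            # (B, K, C) boxes -> embeddings
    return spatial_queries, boxes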

Fig 4: C2f block (from YOLOv8)
Projector:

The projector connects the encoder and decoder. It takes the aggregated encoded feature maps from the encoder as input. The projector is a C2f block, as implemented in YOLOv8.
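For reference, here is a compact PyTorch sketch of a C2f-style block modeled on YOLOv8’s design (the channel split and the number of bottlenecks are illustrative, not the exact official implementation):

import torch
import torch.nn as nn

def conv_bn_act(c_in, c_out, k=1, s=1, p=0):
    # Convolution followed by batch norm and SiLU activation.
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, s, p, bias=False),
                         nn.BatchNorm2d(c_out), nn.SiLU())

class Bottleneck(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.conv1 = conv_bn_act(c, c, k=3, p=1)
        self.conv2 = conv_bn_act(c, c, k=3, p=1)

    def forward(self, x):
        return x + self.conv2(self.conv1(x))  # residual connection

class C2f(nn.Module):
    # Split features, run n bottlenecks, concatenate every intermediate map, then fuse.
    def __init__(self, c_in, c_out, n=2):
        super().__init__()
        self.c = c_out // 2
        self.cv1 = conv_bn_act(c_in, 2 * self.c)
        self.cv2 = conv_bn_act((2 + n) * self.c, c_out)
        self.blocks = nn.ModuleList(Bottleneck(self.c) for _ in range(n))

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))   # split into two halves
        for block in self.blocks:
            y.append(block(y[-1]))              # each bottleneck feeds on the previous output
        return self.cv2(torch.cat(y, dim=1))    # concatenate all branches and fuse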

For the large and x-large versions of LW-DETR, the projector is modified to output two-scale feature maps, and a multi-scale decoder is used accordingly. This projector contains two parallel C2f blocks: one processes 1/8-scale feature maps, obtained by upsampling the input through a deconvolution, and the other processes 1/32-scale feature maps, obtained by downsampling the input through a stride convolution.

Fig 5: Single-scale (a) and Multi-scale (b) Projector


Inference Results
Let’s examine how the model performs out of the box by writing a simple inference script provided by Roboflow.

Fig 6: Inference image before object detection

We will use the script below to detect objects in the image provided
above.

One thing to note is that we will use the Supervision library created and maintained by Roboflow. This library is easy to use and doesn’t require much overhead to understand the various functionalities it provides for object detection tasks. Whether you need to load your dataset from your hard drive, draw detections on an image or video, or count how many detections fall inside a zone, you can always count on Supervision!

Let’s begin some coding. 🙂

There are a few requirements to install before performing inference. If you are working in VS Code or a terminal, it is highly recommended that you create a virtual environment and work inside it for better consistency and fewer dependency issues.
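With the environment ready, a minimal setup and inference script could look like the sketch below (hedged: the rfdetr and supervision package APIs are assumed from their public documentation, and the image path and confidence threshold are placeholders):

# pip install rfdetr supervision   <- run inside the virtual environment
import supervision as sv
from PIL import Image
from rfdetr import RFDETRBase

image = Image.open("input.jpg")            # placeholder path to the test image
model = RFDETRBase()                       # loads the pre-trained COCO checkpoint
detections = model.predict(image, threshold=0.5)

# Draw bounding boxes and labels with Supervision, then save the result.
annotated = sv.BoxAnnotator().annotate(scene=image.copy(), detections=detections)
annotated = sv.LabelAnnotator().annotate(scene=annotated, detections=detections)
annotated.save("output.jpg")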
