Boosting R-CNN - Reweighting R-CNN Samples by RPN's Error For Underwater Object Detection
1. Introduction
Oceans account for 71% of the earth's total area and contain rich biological and mineral resources. As the resources on land have been heavily exploited, humans have turned their attention to ocean exploitation, which makes research on the oceans meaningful. Over the past few years, more and more researchers have considered applying underwater object detection (UOD) to autonomous underwater vehicles (AUVs) with visual systems to fulfill a series of underwater tasks such as marine organism capturing.

Generic Object Detection (GOD) has been researched for a long time and has obtained abundant achievements. However, GOD is not perfectly suitable for underwater environments, which bring new challenges to object detection (see Figure 1): (i) The images captured by the underwater visual system suffer from unbalanced light conditions and low contrast, which make the object boundaries hard to distinguish from the background. (ii) Aquatic organisms tend to live together, which causes severe occlusion. (iii) Aquatic organisms are good at hiding themselves: they have colors similar to the background, which makes them hard to recognize. Facing these new challenges, the boundaries between objects and background and the boundaries between different objects become vague, leading to the existence of vague objects in underwater environments.

Figure 1: The challenges of underwater environments. (a) Unbalanced light condition: complicated underwater terrains cause unbalanced light conditions. (b) Low contrast: the low-contrast image makes the boundaries of two holothurians blurred. (c) Severe occlusion: aquatic organisms tend to live together, causing occlusion. (d) Camouflage and mimicry: the starfish has a color similar to the environment, which makes it difficult to spot.

Existing works on UOD typically apply data augmentation methods [19, 30, 47] and use a strong feature extractor [14, 57] to improve the performance. However, these methods suffer from the following problems. (i) Previous underwater detectors receive the same supervision signal for all objects regardless of their vagueness. Thus, the classification score trained with a simple cross-entropy loss does not accurately reflect the vagueness of the objects, which causes false over-confident predictions.
However, accurately ranking the detection results is crucial for object detectors to achieve high performance. It is expected that a detector assigns low scores to detection results containing vague objects and high scores to results with clear objects. (ii) Previous underwater detectors are vulnerable to vague objects with blurred boundaries and colors similar to the background. That is because the gradient of the easy samples dominates the training of underwater detectors, which makes it difficult for detectors to learn the subtle differences between vague objects and the underwater background.

Different from existing UOD methods, we address the above problems through uncertainty modeling and hard example mining. We propose a two-stage detector named Boosting R-CNN (see Figure 2), which consists of three key components: RetinaRPN, a probabilistic inference pipeline, and boosting reweighting. Specifically, RetinaRPN generates proposals from backbone features with heavier heads that perform three tasks: objectness prediction, IoU prediction, and box localization. It combines the IoU prediction and the objectness as two indicators to model the prior uncertainty, in order to accurately measure the vagueness of the objects. With a proposed fast IoU loss, high-quality proposals can be obtained. Second, the probabilistic inference pipeline combines RetinaRPN's object prior and the R-CNN classification score to make a prediction, which uses the uncertainty from the first stage to improve the robustness of the detector. Third, boosting reweighting attaches more attention to hard examples whose priors are miscalculated by amplifying their loss according to the RPN's error. Since the final classification score of an object combines the RPN's prior and the R-CNN's score, the R-CNN trained with reweighted samples is robust to hard examples, modifying its score to correct the false positives and false negatives of RetinaRPN.

With these three components, our Boosting R-CNN can handle complicated underwater challenges and is robust to vague objects. Our method is evaluated on two underwater object detection datasets, UTDAC2020¹ and Brackish [35], not only achieving state-of-the-art performance but also maintaining a relatively high inference speed. Moreover, experiments on the Pascal VOC [13] and MS COCO [29] datasets show that Boosting R-CNN obtains favorable performance on general object detection. Our code is released at https://github.com/mousecpn/Boosting-R-CNN-Reweighting-R-CNN-Samples-by-RPN-s-Error-for-Underwater-Object-Detection.git.

¹ http://uodac.pcl.ac.cn/

2. Related Work
2.1. Object Detection
Existing object detection can be categorized into two mainstreams: two-stage and one-stage detectors. For two-stage detectors, the basic idea is to reduce the detection task to a classification problem [40]. In the first stage, the region proposal network (RPN) proposes candidate object bounding boxes, and RoI Pooling or RoI Align is leveraged to crop the features from the backbone and resize them to the same size. In the second stage, the R-CNN head performs the classification and regression tasks for all objects. One-stage detectors abandon the RPN and RoI Align, directly obtaining the coordinates of the bounding boxes and the classes of the objects. Nowadays, one-stage detectors can achieve the same level of performance as two-stage detectors. There are two branches of one-stage detectors: anchor-based methods and anchor-free methods. Early works on one-stage detectors are mostly anchor-based [32, 28]. Recently, some works rethink whether the anchor is necessary and propose designs that abandon the use of anchors [44, 59].

As the research on object detection goes deeper, researchers find that the concepts of one-stage and two-stage detectors are not entirely different. Some works leverage the advantages of two-stage detectors to enhance the performance of one-stage detectors. RefineDet [54] separates one-stage detection into two sub-modules: the anchor refinement module and the object detection module. AlignDet [8] uses deformable convolution (DCN) to imitate RoIAlign and obtain aligned features in the second stage. RepPoints [50] applies the idea of refinement and feature alignment to anchor-free detectors based on keypoint detection. Two-stage detectors are also nurtured by the achievements of one-stage detectors. CenterNet2 [58] finds that a strong anchor-free one-stage detector used as the RPN can predict an accurate object likelihood that informs the overall detection score. Combining the object likelihood of the RPN and the conditional classification score of the R-CNN achieves higher performance with fewer proposals, which reduces the inference cost. Our Boosting R-CNN is a probabilistic two-stage detector like CenterNet2. The difference is that we build a strong anchor-based RPN and apply a hard example mining mechanism based on the RPN's errors.

2.2. Hard Example Mining
Hard example mining methods aim to attach more attention to hard examples, relying on the hypothesis that training on hard examples leads to better performance. The first deep detector to use hard example mining is the Single Shot Detector [32], which chooses only the negative examples with the highest loss values. Online Hard Example Mining (OHEM) [42] considers both hard positive and negative examples for training. Considering the efficiency and memory problems of OHEM, IoU-based sampling [34] was proposed, associating the hardness of negative examples with their IoUs and sampling evenly across all IoU ranges. Focal loss [28] is a soft hard-example mining method that dynamically assigns more weight to hard examples based on the classification score. Prime Sample Attention (PISA) [3] proposes an IoU Hierarchical Local Rank for all samples, assigning higher weights to positive examples with higher IoUs. Different from the methods mentioned above, our two-stage Boosting R-CNN defines the hardness of the examples based on their prior probability from the proposed RetinaRPN. A soft reweighting mechanism is proposed to amplify the loss of the hard examples and shrink the loss of the easy examples.
2.3. Underwater Object Detection
As an indispensable technology for AUVs to perform multiple tasks under the water, underwater object detection has attracted a large amount of attention from researchers all around the world. For instance, Huang et al. [19] introduce perspective transformation, turbulence simulation, and illumination synthesis into data augmentation. Chen et al. [9] design a novel underwater salient detection model established by mathematically simulating the biological vision mechanism of aquatic animals. RoIMix [30] is a data augmentation method that applies mixup at the RoI level to imitate occlusion conditions. SWIPENET [7] takes full advantage of both high-resolution and semantic-rich hyper feature maps to improve the performance on small objects; besides, a novel sample-reweighted loss and a new training paradigm, CMA, are proposed, which are noise-immune. Poisson GAN [47] is also a data augmentation method, which pastes objects onto the underwater background by Poisson blending and uses a GAN to correct the artifacts. FERNet [14] consists of three modules: a composite connected backbone, a receptive field augmentation module, and a prediction refinement scheme. Composited FisherNet [57] targets underwater video object detection, leveraging the differences between foreground and background to extract salient features, and proposes an enhanced path aggregation network to solve the insufficient utilization of semantic information caused by linear up-sampling. RoIAttn [27] considers RoI patches as tokens and applies an external attention module on the RoIs to improve the performance of underwater object detection. Compared with the methods mentioned above, to the best of our knowledge, our idea of using the RPN's error for hard example mining has not been investigated by any existing underwater object detection approach.

3. Boosting R-CNN
3.1. Overview
Different from the vanilla two-stage detector Faster R-CNN, the proposed two-stage detector Boosting R-CNN has three key components: RetinaRPN, the probabilistic inference pipeline, and boosting reweighting. The pipeline of our Boosting R-CNN is shown in Figure 2. In detail, the backbone and the feature fusion neck (e.g., ResNet+PAFPN) first extract features from images. Second, RetinaRPN provides a series of high-quality proposals with corresponding prior probabilities. Third, boosting reweighting amplifies the classification loss of the hard examples whose priors are miscalculated, while decreasing the weight of the easy examples with accurately estimated priors. Fourth, the R-CNN head, which contains two fully-connected layers, is trained on the reweighted RoI samples. In the inference stage, the final score is the square root of the product of the prior and the classification score.

Figure 2: The overview of the proposed Boosting R-CNN. The backbone and the feature fusion neck first extract features from images. On each level Q3-Q7, the RetinaRPN head applies four shared convolution layers (H×W×256) followed by three parallel branches: objectness (H×W×A), IoU prediction (H×W×A), and box localization (H×W×4A). RetinaRPN provides a series of high-quality proposals with corresponding prior probability. Boosting reweighting amplifies the classification loss of the hard examples whose priors are miscalculated while decreasing the weight of the easy examples with accurately estimated priors. The R-CNN head, which contains two fully-connected layers, is trained on reweighted RoI samples. In the inference stage, the final score is the square root of the product of the prior and the classification score.

3.2. Backbone and Feature Fusion Neck
Given an image I \in \mathbb{R}^{3 \times H_0 \times W_0} (with RGB channels), a backbone (e.g., ResNet50) generates multi-scale feature maps \{x^l\}_{l=3}^{5} at C_3-C_5 (C_l has a resolution 2^l times smaller than the input). The multi-scale feature maps are then sent into the feature fusion neck.

PAFPN [31] is employed as the feature fusion neck. PAFPN contains two parts: the top-down path and the bottom-up path. In the top-down path, the high-level features are used to enhance the low-level features. Given the multi-scale feature maps \{x^l\}_{l=3}^{5} from the backbone, the output features \{p^l\}_{l=3}^{5} are computed as:

p^5 = conv(x^5),   (1)
p^4 = conv(x^4) + u(p^5),   (2)
p^3 = conv(x^3) + u(p^4),   (3)

where conv(\cdot) denotes a convolution layer and u(\cdot) denotes a 2x upsampling layer. In the bottom-up path, the low-level features are leveraged to augment the high-level features, yielding the feature maps \{q^l\}_{l=3}^{7}:

q^3 = conv(p^3),   (4)
q^4 = conv(p^4) + d(q^3),   (5)
q^5 = conv(p^5) + d(q^4),   (6)
q^6 = conv_s(q^5),   (7)
q^7 = conv_s(q^6),   (8)

where conv_s(\cdot) denotes a convolution layer with stride 2 and d(\cdot) denotes a 2x downsampling layer. The output multi-scale features \{q^l\}_{l=3}^{7} are fed into the detection head.
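To make the fusion in Eqs. (1)-(8) concrete, the following is a minimal PyTorch sketch of such a neck, assuming 256-channel inputs. The specific layer choices (3x3 convolutions, nearest-neighbor upsampling, and max-pooling as d(\cdot)) are illustrative assumptions, not the exact configuration of PAFPN [31] or of our released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PAFPNSketch(nn.Module):
    """Minimal sketch of Eqs. (1)-(8): top-down then bottom-up fusion."""
    def __init__(self, channels=256):
        super().__init__()
        # one 3x3 conv per level for each path (illustrative choice)
        self.td_convs = nn.ModuleList(nn.Conv2d(channels, channels, 3, padding=1) for _ in range(3))
        self.bu_convs = nn.ModuleList(nn.Conv2d(channels, channels, 3, padding=1) for _ in range(3))
        # stride-2 convs produce the extra levels q6, q7 (Eqs. (7)-(8))
        self.down6 = nn.Conv2d(channels, channels, 3, stride=2, padding=1)
        self.down7 = nn.Conv2d(channels, channels, 3, stride=2, padding=1)

    def forward(self, x3, x4, x5):
        # top-down path, Eqs. (1)-(3): upsample high levels, add to low levels
        p5 = self.td_convs[2](x5)
        p4 = self.td_convs[1](x4) + F.interpolate(p5, scale_factor=2, mode="nearest")
        p3 = self.td_convs[0](x3) + F.interpolate(p4, scale_factor=2, mode="nearest")
        # bottom-up path, Eqs. (4)-(6): downsample low levels, add to high levels
        q3 = self.bu_convs[0](p3)
        q4 = self.bu_convs[1](p4) + F.max_pool2d(q3, kernel_size=2)
        q5 = self.bu_convs[2](p5) + F.max_pool2d(q4, kernel_size=2)
        # extra levels, Eqs. (7)-(8)
        q6 = self.down6(q5)
        q7 = self.down7(q6)
        return q3, q4, q5, q6, q7
```

With inputs x3, x4, x5 at strides 8, 16, and 32, the outputs q3-q7 land at strides 8 through 128, matching the five levels consumed by the detection head.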
3.3. RetinaRPN
The RPN is responsible for providing proposals that potentially contain objects. Underwater images are blurred, low-contrast, and distorted, which makes it difficult to distinguish the objects from the background. Besides, under occlusion, the objectness trained with a simple cross-entropy loss in the vanilla RPN is not a good estimate of the proposal box localization accuracy. As a result, high-quality proposals may be filtered out by poorly regressed proposals with higher objectness. To obtain high-quality proposals with accurate prior probabilities, we aim to build a strong RPN inspired by the designs of current one-stage detectors, named the retina region proposal network (RetinaRPN).

Heavier Head. Instead of using one simple convolution layer as in the vanilla RPN, we use four convolution layers with group normalization. More convolution layers provide a more powerful capability to detect vague objects in blurred, low-contrast, and distorted underwater images.

Multi-Ratio Anchors. For each FPN level, we use anchors at three aspect ratios {1:2, 1:1, 2:1} with sizes {2^0, 2^{1/3}, 2^{2/3}} of 32^2 to 512^2 for FPN levels Q_3-Q_7. In total, there are A = 9 anchors per pixel. The anchor is an important prior for regressing and classifying aquatic organisms with vague boundaries.
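A quick sketch of how such an anchor set can be generated is given below; the ratio convention (ratio = h/w) and the exact per-level base sizes are assumptions for illustration.

```python
import itertools
import math

def anchor_sizes_per_level(base_sizes=(32, 64, 128, 256, 512),
                           ratios=(0.5, 1.0, 2.0),
                           octave_scales=(0.0, 1.0 / 3.0, 2.0 / 3.0)):
    """Sketch of the multi-ratio anchors: 3 ratios x 3 scales = A = 9 per pixel.

    Returns, for each FPN level Q3-Q7, a list of (width, height) pairs.
    """
    levels = []
    for base in base_sizes:                 # 32^2 ... 512^2 areas for Q3-Q7
        anchors = []
        for ratio, s in itertools.product(ratios, octave_scales):
            size = base * 2.0 ** s          # sizes {2^0, 2^(1/3), 2^(2/3)} of the base
            w = size * math.sqrt(1.0 / ratio)   # assumed convention: ratio = h / w
            h = size * math.sqrt(ratio)
            anchors.append((w, h))
        levels.append(anchors)
    return levels
```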
Loss Function. RetinaRPN performs three tasks: objectness prediction, box localization, and IoU prediction. The objectness branch is trained to predict whether there is an object in an anchor. We leverage the focal loss as the objectness loss:

L_{fl}(\hat{p}_i) = \begin{cases} -\alpha (1 - \hat{p}_i)^{\gamma} \log(\hat{p}_i), & y_i = 1, \\ -(1 - \alpha)\, \hat{p}_i^{\gamma} \log(1 - \hat{p}_i), & y_i = 0, \end{cases}   (9)

where \hat{p}_i is the predicted objectness of anchor i, \alpha and \gamma are the hyper-parameters of focal loss, and y_i is the label of anchor i. It is set to 1 if anchor i is a positive sample; otherwise, it is set to 0. As for the positive and negative sample assignment, anchors with an IoU over 0.5 with a ground-truth box are regarded as positive samples, while anchors with IoUs below 0.5 are regarded as negative samples.
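The objectness loss in Eq. (9) reduces to a few lines of PyTorch; the defaults alpha = 0.25 and gamma = 2 follow the common focal-loss setting [28] and are assumptions here, as is leaving the reduction over anchors to the caller.

```python
import torch

def objectness_focal_loss(p_hat, y, alpha=0.25, gamma=2.0, eps=1e-6):
    """Sketch of Eq. (9): focal loss on the RPN objectness branch.

    p_hat: predicted objectness in (0, 1), shape (N,)
    y:     binary anchor labels (1 = positive, 0 = negative), shape (N,)
    """
    p_hat = p_hat.clamp(eps, 1.0 - eps)  # avoid log(0)
    pos = -alpha * (1.0 - p_hat) ** gamma * torch.log(p_hat)
    neg = -(1.0 - alpha) * p_hat ** gamma * torch.log(1.0 - p_hat)
    # per-anchor losses; the normalization over anchors is not specified
    # in the text above, so it is left to the caller
    return torch.where(y == 1, pos, neg)
```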
The localization branch aims to output proposals refined from the anchors. Usually, the IoU loss is leveraged as the regression loss, L_{IoU}(\hat{b}_i) = 1 - g_i, where g_i denotes the IoU between the predicted box \hat{b}_i and its corresponding ground truth; the IoU is also the standard overlap metric of object detection. However, the convergence speed of the IoU loss is slow. In order to increase the convergence speed, an L2 loss is added to the IoU loss. The improved IoU loss can be rewritten as:

L'_{IoU}(\hat{b}_i) = 1 - g_i + \sum_{j \in \{x,y,w,h\}} \| \hat{t}_{i,j} - t^{*}_{i,j} \|_2^2,   (13)

\hat{t}_{i,x} = (\hat{x}_i - x^a_i) / w^a_i, \quad \hat{t}_{i,y} = (\hat{y}_i - y^a_i) / h^a_i,   (14)
\hat{t}_{i,w} = \log(\hat{w}_i / w^a_i), \quad \hat{t}_{i,h} = \log(\hat{h}_i / h^a_i),   (15)
t^{*}_{i,x} = (x^{*}_i - x^a_i) / w^a_i, \quad t^{*}_{i,y} = (y^{*}_i - y^a_i) / h^a_i,   (16)
t^{*}_{i,w} = \log(w^{*}_i / w^a_i), \quad t^{*}_{i,h} = \log(h^{*}_i / h^a_i),   (17)

where \{x^a_i, y^a_i, w^a_i, h^a_i\} are the coordinates of anchor i, \{\hat{x}_i, \hat{y}_i, \hat{w}_i, \hat{h}_i\} and \{x^{*}_i, y^{*}_i, w^{*}_i, h^{*}_i\} are the coordinates of the predicted box and its corresponding ground truth, and \{\hat{t}_{i,x}, \hat{t}_{i,y}, \hat{t}_{i,w}, \hat{t}_{i,h}\} and \{t^{*}_{i,x}, t^{*}_{i,y}, t^{*}_{i,w}, t^{*}_{i,h}\} denote the encodings of the four coordinates of the predicted box and the ground truth, respectively. This encoding method is the same as in [40].
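A small sketch of the encoding in Eqs. (14)-(17) follows, assuming boxes and anchors in center-size (x, y, w, h) format; the function name and tensor layout are illustrative.

```python
import torch

def encode_boxes(boxes, anchors):
    """Sketch of Eqs. (14)-(17): encode (x, y, w, h) boxes as deltas w.r.t. anchors.

    boxes, anchors: tensors of shape (N, 4) in center-size format (x, y, w, h).
    """
    tx = (boxes[:, 0] - anchors[:, 0]) / anchors[:, 2]   # Eq. (14)/(16), x
    ty = (boxes[:, 1] - anchors[:, 1]) / anchors[:, 3]   # Eq. (14)/(16), y
    tw = torch.log(boxes[:, 2] / anchors[:, 2])          # Eq. (15)/(17), w
    th = torch.log(boxes[:, 3] / anchors[:, 3])          # Eq. (15)/(17), h
    return torch.stack((tx, ty, tw, th), dim=1)
```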
However, the L2 loss is very vulnerable to outliers, which harms the regression accuracy. To solve this problem, we design the fast IoU loss (FIoU), inspired by [56]:

L_{FIoU}(\hat{b}_i) = g_i^{\eta} \left( 1 - g_i + \sum_{j \in \{x,y,w,h\}} \| \hat{t}_{i,j} - t^{*}_{i,j} \|_2^2 \right),   (18)

L_{loc\text{-}rpn} = \frac{1}{m} \sum_{i=1}^{m} L_{FIoU}(\hat{b}_i),   (19)

where \eta is a parameter that controls the degree of inhibition of outliers, and m is the number of positive samples. We add the IoU-weighted term g_i^{\eta} to alleviate the vulnerability to outliers. With the IoU-weighted term, low-quality samples with a high regression loss are filtered, because their weighted term becomes small. RetinaRPN thus focuses on the prime samples with moderate regression accuracy, which enhances the robustness to outliers while retaining fast convergence.

The IoU prediction branch is trained to predict the IoUs between the regressed boxes and their corresponding ground truths, and the cross entropy between the predicted and the real IoUs is used as the loss function. With the IoU prediction branch, the detector can incorporate uncertainty into the prior when objects are occluded in the underwater environment. In detail, the objectness denotes the likelihood that an anchor contains an object. Although the objectness trained with focal loss can effectively filter the negative samples, it also assigns a high value to proposals in which the object is severely covered by other objects. The IoU prediction predicts the IoU between the proposal and its ground truth and assigns a value to the object according to its level of occlusion. Combining the two indicators includes uncertainties from different perspectives and comprehensively models the prior probabilities of the proposals.
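A minimal sketch of Eqs. (18)-(19) is given below. Whether the IoU-weighted term g_i^{\eta} receives gradients is not specified above, so it is detached here as an assumption.

```python
import torch

def fast_iou_loss(t_hat, t_star, g, eta=2.0):
    """Sketch of Eqs. (18)-(19): fast IoU loss over positive samples.

    t_hat, t_star: encoded predicted/target deltas, shape (m, 4)
    g:             IoU between each predicted box and its ground truth, shape (m,)
    eta:           inhibition exponent; eta = 2 is the best setting in Fig. 4
    """
    l2 = ((t_hat - t_star) ** 2).sum(dim=1)          # squared delta errors, Eq. (18)
    weight = g.detach() ** eta                       # IoU-weighted term (assumed detached)
    per_sample = weight * (1.0 - g + l2)
    return per_sample.mean()                         # (1/m) * sum, Eq. (19)
```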
3.4. Probabilistic Inference Pipeline
For a two-stage detector, in the first stage, the RPN outputs K proposal boxes b_1, ..., b_K. For each proposal k \in \{1, ..., K\}, the RPN predicts a class-agnostic foreground prior probability P(O_k), where O_k = 1 denotes that proposal k is an object and O_k = 0 denotes background. This is realized by a binary classifier trained with a log-likelihood objective. In the second stage, high-scoring proposals are sampled to train the R-CNN head, a softmax classifier. The R-CNN learns to classify each proposal into one of the foreground classes or background. The output classification score of proposal k for class C_k can be seen as a conditional categorical probability P(C_k | O_k = 1) (C_k \in \mathcal{C} \cup \{bg\}, where \mathcal{C} is the set of classes and bg denotes background). However, in the inference stage, the final detection score directly uses the classification score of the R-CNN head, ignoring the prior probability from the RPN. Moreover, during the training of the R-CNN head, since the supervision signals of all proposals are equivalent with a softmax classifier regardless of the localization accuracy, the R-CNN head easily outputs false over-confident predictions. Thus, compared with using the conditional categorical probability P(C_k = c | O_k = 1), it is more reasonable to use the marginal probability P(C_k = c), c \in \mathcal{C}, as the final detection score. We set P(C_k = bg | O_k = 0) = 1 and P(C_k = c | O_k = 0) = 0, which means that it is impossible for the R-CNN head to reconsider a proposal as a positive sample if the RPN regards the proposal as a negative sample. The marginal probability P(C_k = c) can be written as:

P(C_k = c) = \sum_{u \in \{0,1\}} P(C_k = c | O_k = u)\, P(O_k = u)
           = P(C_k = c | O_k = 1) P(O_k = 1) + P(C_k = c | O_k = 0) P(O_k = 0)
           = P(C_k = c | O_k = 1) P(O_k = 1).   (22)

In the inference stage, the final score is the square root of the product of the prior and the classification score, namely:

s_k(c) = \sqrt{pr_k \cdot cls_k(c)},   (23)

where cls_k(c) is the classification score of sample k for class c in the R-CNN, pr_k is the prior from RetinaRPN, and s_k(c) is the final score. With the probabilistic inference pipeline, the detector takes the first-stage uncertainty into consideration when making the final predictions. Thus, compared with the conditional probability, the marginal probability is a better estimation of the detection box localization accuracy.
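The inference-time combination in Eqs. (22)-(23) reduces to a few tensor operations; the sketch below assumes the RPN prior and the R-CNN softmax scores are already gathered per proposal.

```python
import torch

def final_detection_scores(prior, cls_scores):
    """Sketch of Eqs. (22)-(23): combine the RPN prior with R-CNN scores.

    prior:      P(O_k = 1) from RetinaRPN, shape (K,)
    cls_scores: P(C_k = c | O_k = 1) softmax scores, shape (K, C)
    """
    marginal = cls_scores * prior.unsqueeze(1)   # Eq. (22): P(C_k = c)
    return torch.sqrt(marginal)                  # Eq. (23): square-root calibration
```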
3.5. Boosting Reweighting
There is a deficiency in the probabilistic inference pipeline above. In the original two-stage detector, the second stage makes predictions independently of the first stage. As a result, a low score for a high-quality sample in the first stage does not influence the final detection result, as long as the sample is selected as a proposal. However, in the probabilistic two-stage pipeline, when the RPN mistakenly generates a low prior for a high-quality positive proposal, it is hard to reconsider it as a high-confidence prediction, because the final score is the square root of the product of the prior and the classification score. In underwater environments, vague objects appear frequently as hard examples, and the RPN suffers severely from them.

To solve this problem, we hope that when the RPN miscalculates the prior of a proposal, the R-CNN can rectify the error. Thus, we propose a soft sampling strategy named boosting reweighting (BR, shown in Figure 3), which borrows the idea of reweighting from the boosting algorithm and fits well into existing frameworks. Different from vanilla Faster R-CNN, where the weights of all proposals are set to 1, BR attaches more attention to hard examples whose priors are miscalculated. In detail, for sample k, the weight is set according to the RPN's prior:

w_k = \begin{cases} (1 - prior_k)^{\omega}, & k \text{ is a positive sample}, \\ prior_k^{\omega}, & k \text{ is a negative sample}, \end{cases}   (24)

where \omega controls the strength of the reweighting. The classification loss of the R-CNN head then becomes:

L_{cls} = \frac{1}{K} \sum_{k=1}^{K} \sum_{c=1}^{C} w_k \cdot (-s_k^c \cdot \log(\hat{s}_k^c)),   (25)

where \hat{s}_k^c and s_k^c denote the predicted classification score and the label of sample k for class c, s_k^c \in \{0, 1\}, and K and C are the number of proposals in the second stage and the number of classes, respectively. Note that since the weighted terms are all smaller than 1, the total value of the classification loss shrinks, which shrinks the gradient. In order to keep the norm of the total loss unchanged, we normalize w to w':

w'_k = w_k \cdot \frac{\sum_{k=1}^{K} \sum_{c=1}^{C} (-s_k^c \cdot \log(\hat{s}_k^c))}{\sum_{k=1}^{K} \sum_{c=1}^{C} w_k \cdot (-s_k^c \cdot \log(\hat{s}_k^c))},   (26)

L'_{cls} = \frac{1}{K} \sum_{k=1}^{K} \sum_{c=1}^{C} w'_k \cdot (-s_k^c \cdot \log(\hat{s}_k^c)).   (27)

Figure 3: The overview of the proposed boosting reweighting. The patch size denotes the weight of the RoI samples.

When the detector encounters hard positive/negative samples, the priors from the RPN will be small/large. As a result, the weighted term (1 - prior(k))^{\omega} / prior(k)^{\omega} increases and amplifies the loss of the hard examples, while the loss of the easy samples is decreased.

BR can be seen as hard example mining. There are two works similar to our BR: OHEM and focal loss. OHEM is a bootstrapping method originally designed for Fast R-CNN (without an RPN): it performs a feedforward pass for all RoIs on the R-CNN and selects the hardest samples for training in a second feedforward pass. Our BR leverages the prior information from the RPN with only one feedforward pass, saving a large amount of memory and training time. Focal loss is designed for RetinaNet to solve the extreme imbalance between foreground and background. However, the NMS in the RPN and the bootstrapping mechanism in the second stage already alleviate the imbalance problem, which overlaps with the function of focal loss. Our BR is used in combination with NMS and bootstrapping. To avoid the shrinking of the loss, normalization is leveraged to redistribute the weight of each sample. Thus, BR aims to handle the problem of hard samples in the underwater environment instead of the foreground-background imbalance. Both OHEM and focal loss reweight the loss by the R-CNN's error, while our BR reweights the loss by the RPN's error. The R-CNN trained with BR excavates the subtle differences between aquatic organisms and the background and is robust to the samples to which the RPN is vulnerable. Thus, the R-CNN can rectify the RPN's error in the inference stage. The experiments show that BR is more compatible with the probabilistic inference pipeline than other methods.
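The reweighting of Eqs. (24)-(27) can be sketched as below. Treating the weights and the normalization factor as constants (detached) and placing background at the last class index are assumptions of this sketch, not a description of the released code.

```python
import torch
import torch.nn.functional as F

def boosting_reweighted_cls_loss(logits, labels, prior, omega=0.5, eps=1e-6):
    """Sketch of Eqs. (24)-(27): reweight R-CNN samples by the RPN's error.

    logits: R-CNN classification logits, shape (K, C+1); last class = background
    labels: target class indices, shape (K,)
    prior:  RetinaRPN prior P(O_k = 1) of each sampled RoI, shape (K,)
    omega:  reweighting exponent; omega = 0.5 is the best setting in Fig. 5
    """
    ce = F.cross_entropy(logits, labels, reduction="none")  # per-sample CE
    is_bg = labels == logits.shape[1] - 1
    # Eq. (24): hard positives have small priors, hard negatives large priors
    w = torch.where(is_bg, prior ** omega, (1.0 - prior) ** omega).detach()
    # Eq. (26): renormalize so the total loss magnitude is unchanged
    w = w * ce.detach().sum() / (w * ce.detach()).sum().clamp_min(eps)
    return (w * ce).mean()                                  # Eq. (27)
```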
Table 1
Comparisons with other object detection methods on UTDAC2020 dataset. The FPS is tested on a single Nvidia GTX 1080Ti
GPU. ‘*’ means that the model uses the second training recipe.
Method Backbone AP AP50 AP75 APS APM APL FPS
Two-Stage Detector:
Faster R-CNN w/ FPN [40] ResNet50 44.5 80.9 44.1 20.0 39.0 50.8 11.6
OHEM+Faster R-CNN w/ FPN [42] ResNet50 45.1 82.0 45.1 21.6 39.1 51.4 11.6
Cascade R-CNN [2] ResNet50 46.6 81.5 49.3 21.0 40.9 53.3 8.8
Libra R-CNN [34] ResNet50 45.8 82.0 46.4 20.1 40.2 52.3 11.0
Cascade RPN [45] ResNet50 46.5 79.5 41.2 20.4 38.6 47.7 8.3
Faster R-CNN w/ PAFPN [31] ResNet50 45.5 82.1 45.9 18.8 39.7 51.9 10.9
Double-Head [48] ResNet50 45.3 81.5 45.7 20.2 40.0 51.4 5.7
Dynamic R-CNN [51] ResNet50 45.6 80.1 47.3 19.0 39.7 52.1 12.1
Faster R-CNN w/ FPG [5] ResNet50 45.4 81.6 46.0 19.8 39.7 51.4 13.1
GRoIE [41] ResNet50 45.7 82.4 45.6 19.9 40.1 52.0 6.0
SABL+Faster R-CNN [46] ResNet50 46.6 81.6 48.2 19.6 40.4 53.4 9.9
PISA [3] ResNet50 46.3 82.1 47.4 20.8 40.8 52.6 10.3
Sparse R-CNN [43] ResNet50 37.4 70.4 35.9 17.7 33.3 43.0 10.8
DetectoRS [36] ResNet50 47.6 82.8 49.9 23.1 41.8 54.2 4.0
RoIAttn [27] ResNet50 46.0 82.0 47.5 22.9 40.5 52.2 8.8
CenterNet2 [58] ResNet50 47.2 81.6 49.8 18.2 41.3 53.4 14.2
CenterNet2* [58] ResNet50 48.9 83.0 52.6 21.7 43.5 55.2 14.2
One-Stage Detector:
SSD512 [32] VGG16 40.0 77.5 36.5 14.7 36.1 45.1 25.0
RetinaNet [28] ResNet50 43.9 80.4 42.9 18.1 38.2 50.1 11.4
FSAF [61] ResNet50 43.9 81.0 42.9 18.5 38.9 50.9 12.8
CenterNet [59] ResNet18 31.3 61.1 27.6 11.9 32.5 33.4 6.2
FCOS [44] ResNet50 43.9 81.1 43.0 19.9 38.2 50.4 12.7
RepPoints [50] ResNet50 44.0 80.5 43.0 18.7 38.5 50.3 11.1
FreeAnchor [55] ResNet50 46.3 82.3 46.9 21.0 40.5 52.6 11.4
RetinaNet w/ NASFPN [17] ResNet50 37.4 70.3 35.8 12.4 36.4 40.4 13.8
ATSS [53] ResNet50 46.2 82.5 46.9 19.7 41.4 52.4 11.8
PAA [20] ResNet50 47.5 83.1 49.7 19.5 42.4 53.6 6.6
AutoAssign [60] ResNet50 46.3 83.0 47.6 18.0 41.3 52.2 12.3
GFL [25] ResNet50 46.4 81.9 47.8 19.3 40.9 52.5 12.7
VFNet [52] ResNet50 44.0 79.3 44.1 18.8 38.1 50.4 10.5
Transformer:
Deformable DETR [62] ResNet50 46.6 84.1 47.0 24.1 42.4 51.9 7.6
Ours:
Boosting R-CNN ResNet50 48.5 82.4 52.5 21.1 42.4 55.0 13.5
Boosting R-CNN* ResNet50 51.4 85.5 56.8 23.8 45.8 57.8 13.5
Table 2
Comparisons with other object detection methods on the Brackish dataset. "Baseline" is the performance reported in the original paper of Brackish.
Method Backbone AP AP50
Baseline (YOLOv3) [35] DarkNet53 38.9 83.7
Faster R-CNN w/ FPN [21] ResNet50 79.3 97.4
Cascade R-CNN [2] ResNet50 80.7 96.9
RetinaNet [28] ResNet50 78.0 96.5
DetectoRS [36] ResNet50 81.6 97.0
CenterNet2 [58] ResNet50 79.3 97.4
Boosting R-CNN ResNet50 82.0 97.4

4. Experiments
4.1. Datasets
We conduct experiments on four challenging object detection datasets to validate the generalization performance of our method.

(i) UTDAC2020 is an underwater dataset from the Underwater Target Detection Algorithm Competition 2020. There are 5,168 training images and 1,293 validation images. It contains four classes: echinus, holothurian, starfish, and scallop. The images come in four resolutions: 3840×2160, 1920×1080, 720×405, and 586×480. We follow the COCO-style evaluation metric.

(ii) Brackish is an earlier underwater image dataset collected in temperate brackish waters. It contains six classes: bigfish, crab, jellyfish, shrimp, small fish, and starfish. There are 9,967, 1,467, and 1,468 images in the training, validation, and test sets, containing 25,613 annotations in total. The image size is 960×540. We follow the MS COCO-style AP[0.5:0.95:0.05] metric and the Pascal VOC-style AP50 metric, as in the original paper.

(iii) Pascal VOC is a generic object detection dataset, which contains 20 object categories. The dataset includes a VOC2007 part and a VOC2012 part. In the VOC2007 part, there are 9,963 annotated images, consisting of a trainval set (5,011 images) and a test set (4,952 images). In the VOC2012 part, there are 11,540 annotated images in the trainval set. We train our detector on the 07+12 trainval dataset and evaluate it on the 07 test set.

(iv) MS COCO is a generic object detection dataset, which contains 80 object categories. It contains 118k images for training (trainval), 5k images for validation (val), and 20k images for testing without provided annotations (test-dev). The final results are reported on the test-dev set.

4.2. Implementation Details
Our method is implemented on MMDetection [6]. There are two training recipes in the experiments. The first one is the default training recipe, which adopts the classic 1x training scheme (12 epochs). SGD is adopted as the optimizer, with a weight decay of 0.0001 and a momentum of 0.9. The initial learning rate is 0.005 and is divided by a factor of 10 at epochs 8 and 11. No extra data augmentation except traditional horizontal flipping is utilized. The second training recipe adopts the 3x training scheme (36 epochs) with crop and multi-scale augmentation. AdamW is leveraged as the optimizer with an initial learning rate of 0.0001 and a weight decay of 0.05. The learning rate is divided by a factor of 10 at epochs 24 and 33.

The method is trained on a single NVIDIA GTX 1080Ti GPU. During inference, we use a maximum of 256 proposal boxes in the second stage, which improves the inference speed. As for the balancing parameters of the losses, λ_obj-rpn, λ_loc-rpn, λ_iou-rpn, λ_cls, and λ_reg are set to 1, 2, 1, 2, and 2, respectively.
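For reference, the two recipes correspond to configurations like the following; this is an illustrative MMDetection-style sketch whose field values follow the text above, not the released configuration file.

```python
# First recipe (classic 1x schedule), MMDetection 2.x config style.
optimizer = dict(type='SGD', lr=0.005, momentum=0.9, weight_decay=0.0001)
lr_config = dict(policy='step', step=[8, 11])          # divide lr by 10 at epochs 8, 11
runner = dict(type='EpochBasedRunner', max_epochs=12)  # 12-epoch 1x schedule

# Second recipe (3x schedule with AdamW), per the text above:
# optimizer = dict(type='AdamW', lr=0.0001, weight_decay=0.05)
# lr_config = dict(policy='step', step=[24, 33])
# runner = dict(type='EpochBasedRunner', max_epochs=36)
```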
4.3. Comparisons with Other State-of-the-art Methods
We compare Boosting R-CNN against state-of-the-art methods on four object detection datasets in Tables 1, 2, 3, and 4.

4.3.1. Results on UTDAC2020
The experimental results on the UTDAC2020 dataset are shown in Table 1. CenterNet2* and Boosting R-CNN* denote that the models use multi-scale training and 3x training time. Besides, Deformable DETR is trained for 50 epochs with multi-scale training. All the detectors are implemented in MMDetection [6] except CenterNet2, which is officially implemented in Detectron2 [49].

As shown in Table 1, in the single-scale training setting, Boosting R-CNN achieves 48.5% AP, which is higher than DetectoRS (47.6% AP), PAA (47.5% AP), and CenterNet2 (47.2% AP). In the multi-scale training setting, Boosting R-CNN still surpasses CenterNet2 (51.4% AP vs. 48.9% AP). As a result, our Boosting R-CNN outperforms all the compared detectors and sets a new state of the art. As for the inference speed, Boosting R-CNN achieves 13.5 FPS, which is higher than most two-stage detectors including Faster R-CNN (11.6 FPS) but lower than CenterNet2 (14.2 FPS).
Table 4
Comparisons with other object detection methods on MS COCO dataset with the large backbone in single-scale testing. ‘*’ means
that our model uses the second training recipe. The performances of other methods are published in their papers, and they all
use 3x training time.
Method Backbone AP AP50 AP75 APS APM APL
Two-Stage Detector:
Faster R-CNN w/ FPN[40] ResNet101 36.2 59.1 39.0 18.2 39.0 48.2
Cascade R-CNN [2] ResNet101 42.8 62.1 46.3 23.7 45.5 55.2
Grid R-CNN [33] ResNet101 41.5 60.9 44.5 23.3 44.9 53.1
Libra R-CNN [34] ResNeXt101-64x4d 43.0 64.0 47.0 25.3 45.6 54.6
Double-Head [48] ResNet101 42.3 62.8 46.3 23.9 44.9 54.3
Dynamic R-CNN [51] ResNet101-DCN 46.9 65.9 51.3 28.1 49.6 60.0
BorderDet [37] ResNeXt101-64x4d-DCN 48.0 67.1 52.1 29.4 50.7 60.5
TridentNet [26] ResNet101-DCN 46.8 67.6 51.5 28.0 51.2 60.5
CPN [12] HG104 47.0 65.0 51.0 26.5 50.2 60.7
One-Stage Detector:
FCOS [44] ResNeXt101-64x4d-DCN 46.6 65.9 50.8 28.6 49.1 58.6
CornerNet [23] HG104 40.6 56.4 43.2 19.1 42.8 54.3
CenterNet [59] HG104 42.1 61.1 45.9 24.1 45.5 52.8
CentripetalNet [11] HG104 46.1 63.1 49.7 25.3 48.7 59.2
RetinaNet [28] ResNet101 39.1 59.1 42.3 21.8 42.7 50.2
FSAF [61] ResNeXt101-64x4d 42.9 63.8 46.3 26.6 46.2 52.7
RepPoints [50] ResNet101-DCN 45.0 66.1 49.0 26.6 48.6 57.5
RepPointsV2 ResNet101-DCN 48.1 67.5 51.8 28.7 50.9 60.8
FreeAnchor [55] ResNeXt101-32x8d 46.0 65.6 49.8 27.8 49.5 57.7
ATSS [53] ResNeXt101-32x8d-DCN 47.7 66.5 51.9 29.7 50.8 59.4
PAA [20] ResNeXt101-64x4d-DCN 49.0 67.8 53.3 30.2 52.8 62.2
AutoAssign [60] ResNeXt101-64x4d-DCN 49.5 68.7 54.0 29.9 52.6 62.0
GFL [25] ResNeXt101-32x4d-DCN 48.2 67.4 52.6 29.2 51.7 60.2
GFLV2 [24] Res2Net101-DCN 50.6 69.0 55.3 31.3 54.3 63.5
YOLOv4 [1] CSPDarkNet-53 43.5 65.7 47.3 26.7 46.7 53.3
Transformer:
DETR [4] ResNet101 43.5 63.8 46.4 21.9 48.0 61.8
Deformable DETR [62] ResNeXt101-64x4d-DCN 50.1 69.7 54.6 30.6 52.8 65.6
Ours:
Boosting R-CNN* ResNet50 44.4 63.9 48.2 26.9 47.0 54.8
Boosting R-CNN* Res2Net101-DCN 50.7 69.2 55.8 31.7 54.1 63.5
Figure 4: The choice of η in the FIoU loss (AP on UTDAC2020 for η ∈ {0, 0.5, 1, 2, 3}). η = 0 means dropping the IoU weighted term.

Figure 4 shows the choice of the hyper-parameter η in the fast IoU loss. If η is too large, the gradient is dominated by the easy samples with high IoUs. If η is too small, the loss lacks the ability to filter the outliers. When η is set to 2, the highest performance of 48.5% AP is obtained. When η is set to 0, which is equivalent to the fast IoU loss without the IoU weighted term, the performance is lower; thus, the training of the model suffers from the outliers.

4.4.2. Hard Example Mining
Table 6 shows the experiments on different hard example mining methods. Since our BR is a kind of hard example mining, it is necessary to compare it to other hard example mining methods, i.e., OHEM, PISA, and focal loss. We replace BR with these methods to evaluate their effectiveness and their compatibility with the probabilistic inference pipeline. "Cls. Loss" denotes the classification loss in the R-CNN head, and "Random" means randomly sampling positive and negative RoIs during training. The first row (47.9% AP) corresponds to the next-to-last row in Table 5. BR achieves the highest performance (48.3% AP). Using OHEM instead gives a relatively lower performance (47.5% AP). PISA severely harms the performance (46.9% AP). Focal loss also causes a severe performance decrease; when γ is set lower, which means that focal loss gets closer to the cross-entropy loss, the performance is restored. The experiments in Table 6 show that in the probabilistic pipeline, BR helps the R-CNN to correct the mistakes of the RPN and is more compatible than OHEM, PISA, and focal loss.

Figure 5 shows the experiment on the choice of ω in BR. In this experiment, PAFPN is not used. From the figure, it can be concluded that normalization improves the performance and shifts the optimal value of ω. The highest performance of 48.3% AP is obtained when ω is set to 0.5.

4.4.3. Anchor Assignment
In Table 7, we adopt other positive and negative anchor assignment strategies in our RetinaRPN. PAFPN and boosting reweighting are not leveraged in these experiments. Although ATSS [53], PAA [20], and OTA [16] achieve astonishing performance in one-stage detectors, they decrease the performance when they play the role of the RPN. The reason may be that the adaptive assignment provides over-confident priors, which decreases the recall. Our setting "(0.5, 0.5)" achieves the best performance in underwater object detection.
Table 5
A detailed ablation study of Boosting R-CNN. The first five columns denote the ablation studies of RetinaRPN. "4 L." denotes using 4 convolution layers with GN. "NA." denotes the number of anchors. "Reg Loss" denotes the regression loss in the RPN. "FL" denotes using focal loss and abandoning the bootstrapping in the RPN. "IoUp." means adding IoU prediction in the RPN with a cross-entropy loss. "Prob" denotes using the probabilistic inference pipeline. "BR" means boosting reweighting.

Row | 4 L. | NA. | Reg Loss | FL | IoUp. | Neck | Prob | BR | AP | AP50 | AP75
1 | | 3 | L1 | | | FPN | | | 44.5 | 80.9 | 44.1
2 | ✓ | 3 | L1 | | | FPN | | | 45.1 | 81.6 | 45.9
3 | | 3 | L1 | | | FPN | ✓ | | 45.3 | 81.6 | 45.8
4 | ✓ | 3 | L1 | ✓ | | FPN | | | 45.4 | 81.2 | 46.3
5 | ✓ | 3 | GIoU | ✓ | | FPN | | | 45.8 | 80.5 | 47.5
6 | ✓ | 9 | L1 | ✓ | | FPN | | | 46.7 | 80.0 | 49.0
7 | ✓ | 9 | GIoU | ✓ | | FPN | ✓ | | 46.8 | 82.2 | 48.8
8 | ✓ | 9 | L1 | ✓ | | FPN | ✓ | | 47.2 | 82.5 | 49.3
9 | ✓ | 9 | L1 | ✓ | ✓ | FPN | ✓ | | 47.5 | 83.0 | 50.3
10 | ✓ | 9 | GIoU | ✓ | ✓ | FPN | ✓ | | 47.6 | 82.7 | 50.4
11 | ✓ | 9 | CIoU | ✓ | ✓ | FPN | ✓ | | 47.6 | 82.8 | 50.1
12 | ✓ | 9 | F-EIoU | ✓ | ✓ | FPN | ✓ | | 47.7 | 83.0 | 49.8
13 | ✓ | 9 | FIoU | ✓ | ✓ | FPN | | | 46.9 | 81.3 | 49.6
14 | ✓ | 9 | FIoU | ✓ | ✓ | FPN | ✓ | | 47.9 | 82.8 | 50.7
15 | ✓ | 9 | FIoU | ✓ | ✓ | FPN | ✓ | ✓ | 48.3 | 82.6 | 51.8
16 | ✓ | 9 | FIoU | ✓ | ✓ | PAFPN | ✓ | ✓ | 48.5 | 82.4 | 52.5
Table 6
The ablation studies of hard example mining. "FL(α, γ)" denotes using focal loss with hyper-parameters α and γ.
Cls. Loss | Sampling | AP | AP50 | AP75
CE | Random | 47.9 | 82.8 | 50.7
CE | OHEM | 47.5 | 82.1 | 49.8
CE | PISA | 46.9 | 82.7 | 48.2
FL (0.25, 2) | Random | 44.7 | 79.7 | 45.8
FL (0.5, 1) | Random | 46.5 | 81.1 | 48.8
FL (0.5, 0.1) | Random | 47.0 | 82.0 | 48.9
FL (0.25, 2) | None | 46.7 | 80.2 | 49.7
CE | BR | 48.3 | 82.6 | 51.8

Table 7
Positive and negative assignment. "(0.5, 0.5)" denotes that the samples with IoUs over 0.5 are regarded as positive, while the samples with IoUs below 0.5 are regarded as negative. BR is not used in this experiment.
Assignment | AP | AP50 | AP75
ATSS | 46.5 | 81.9 | 48.1
PAA | 47.3 | 83.1 | 49.1
OTA | 46.4 | 81.1 | 48.8
(0.5, 0.5) | 47.9 | 82.8 | 50.7
4.5. Qualitative Comparisons
Figure 6 shows the qualitative comparison between Boosting R-CNN and other state-of-the-art methods on the UTDAC2020 dataset. We apply the detectors to some challenging cases, with the prediction score threshold set to 0.05. For clarity, in each image we visualize the prediction boxes with the top-k scores, where k is the number of ground-truth boxes in the image. The orange boxes denote predictions whose IoU with a certain ground truth is over 0.5 and higher than that of the other predictions. The blue boxes denote the unmatched predictions. Besides, we also print the missed ground-truth boxes in red. Thus, more blue boxes in an image suggest lower precision, and more red boxes suggest lower recall.

The first two rows denote the blurring and low-contrast conditions. Boosting R-CNN detects all the ground truths (no red box) in these images with the highest precision (only one blue box). The third row denotes the unbalanced light condition. ATSS, PAA, and DetectoRS all miss the echinus in the center, while our Boosting R-CNN does not miss any ground truth. The fourth row denotes the occlusion condition. Boosting R-CNN handles the case where a starfish covers a scallop, on which DetectoRS makes a mistake. The last two rows denote the mimicry condition. In the fifth case, the stone is very similar to the scallop, and Boosting R-CNN precisely distinguishes the scallop from the stone. Moreover, other detectors miss the small echinus in the top-left corner for lack of regression capacity; the proposed Boosting R-CNN accurately detects this echinus, which shows that RetinaRPN with the FIoU loss can provide high-quality proposals. In the last case, although the starfish hides in the waterweeds, Boosting R-CNN still successfully detects it.
Figure 6: Qualitative comparison results on the UTDAC2020 dataset. The orange boxes denote the matched predictions. The blue boxes denote the unmatched predictions. The red boxes denote the undetected ground truths. The first two rows denote the blurring and low contrast conditions. The third row denotes the unbalanced light condition. The fourth row denotes the occlusion condition. The last two rows denote the mimicry condition.
In these cases, RetinaRPN assigns too small a prior for the predictions. With the correction of the R-CNN, the missed ground truths can be detected by increasing the second-stage score.

5. Conclusion
Underwater object detection faces new challenges compared with generic object detection, such as blur, low contrast, occlusion, and mimicry. In this paper, we propose a brand-new two-stage underwater detector, Boosting R-CNN, to solve the problems mentioned above. First, the proposed RetinaRPN has a strong capacity to detect objects in blurred, low-contrast, and distorted images, and provides high-quality proposals with accurate estimations of the object prior probability under occlusion. Second, the proposed probabilistic inference pipeline helps the detector make predictions based on the uncertainties of the vague objects, resulting in a reasonable ranking of the prediction scores. Third, boosting reweighting is proposed to train the second stage with the error of the first stage; it is a kind of hard example mining and helps the second stage rectify the errors of the probabilistic pipeline. The experiments on two public underwater datasets demonstrate that Boosting R-CNN outperforms other state-of-the-art detectors in underwater object detection. The competitive performance on two public generic object detection datasets shows the generalization ability of Boosting R-CNN. Comprehensive ablation studies show the effectiveness of the proposed modules.

References
[1] Bochkovskiy, A., Wang, C.Y., Liao, H.Y.M., 2020. Yolov4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934.
[2] Cai, Z., Vasconcelos, N., 2018. Cascade r-cnn: Delving into high quality object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[3] Cao, Y., Chen, K., Loy, C.C., Lin, D., 2020. Prime sample attention in object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[4] Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S., 2020. End-to-end object detection with transformers, in: European Conference on Computer Vision.
[5] Chen, K., Cao, Y., Loy, C.C., Lin, D., Feichtenhofer, C., 2020a. Feature pyramid grids. arXiv preprint arXiv:2004.03580.
[6] Chen, K., Wang, J., Pang, J., Cao, Y., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Xu, J., et al., 2019a. Mmdetection: Open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155.
[7] Chen, L., Zhou, F., Wang, S., Dong, J., Li, N., Ma, H., Wang, X., Zhou, H., 2020b. Swipenet: Object detection in noisy underwater images. arXiv preprint arXiv:2010.10006.
[8] Chen, Y., Han, C., Wang, N., Zhang, Z., 2019b. Revisiting feature alignment for one-stage object detection. arXiv preprint arXiv:1908.01570.
[9] Chen, Z., Gao, H., Zhang, Z., Zhou, H., Wang, X., Tian, Y., 2020c. Underwater salient object detection by combining 2d and 3d visual features. Neurocomputing 391, 249–259.
[10] Dai, J., Li, Y., He, K., Sun, J., 2016. R-fcn: Object detection via region-based fully convolutional networks. Advances in Neural Information Processing Systems 29.
[11] Dong, Z., Li, G., Liao, Y., Wang, F., Ren, P., Qian, C., 2020. Centripetalnet: Pursuing high-quality keypoint pairs for object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10519–10528.
[12] Duan, K., Xie, L., Qi, H., Bai, S., Huang, Q., Tian, Q., 2020. Corner proposal network for anchor-free, two-stage object detection, in: European Conference on Computer Vision.
[13] Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A., 2010. The pascal visual object classes (voc) challenge. International Journal of Computer Vision 88, 303–338.
[14] Fan, B., Chen, W., Cong, Y., Tian, J., 2020. Dual refinement underwater object detection network, in: European Conference on Computer Vision.
[15] Fu, C.Y., Liu, W., Ranga, A., Tyagi, A., Berg, A.C., 2017. Dssd: Deconvolutional single shot detector. arXiv preprint arXiv:1701.06659.
[16] Ge, Z., Liu, S., Li, Z., Yoshie, O., Sun, J., 2021. Ota: Optimal transport assignment for object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 303–312.
[17] Ghiasi, G., Lin, T.Y., Le, Q.V., 2019. Nas-fpn: Learning scalable feature pyramid architecture for object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[18] Gidaris, S., Komodakis, N., 2015. Object detection via a multi-region and semantic segmentation-aware cnn model, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 1134–1142.
[19] Huang, H., Zhou, H., Yang, X., Zhang, L., Qi, L., Zang, A.Y., 2019. Faster r-cnn for marine organisms detection and recognition using data augmentation. Neurocomputing 337, 372–384.
[20] Kim, K., Lee, H.S., 2020. Probabilistic anchor assignment with iou prediction for object detection, in: European Conference on Computer Vision.
[21] Kim, S.W., Kook, H.K., Sun, J.Y., Kang, M.C., Ko, S.J., 2018. Parallel feature pyramid network for object detection, in: European Conference on Computer Vision, pp. 234–250.
[22] Kong, T., Sun, F., Yao, A., Liu, H., Lu, M., Chen, Y., 2017. Ron: Reverse connection with objectness prior networks for object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5936–5944.
[23] Law, H., Deng, J., 2018. Cornernet: Detecting objects as paired keypoints, in: European Conference on Computer Vision.
[24] Li, X., Wang, W., Hu, X., Li, J., Tang, J., Yang, J., 2021. Generalized focal loss v2: Learning reliable localization quality estimation for dense object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11632–11641.
[25] Li, X., Wang, W., Wu, L., Chen, S., Hu, X., Li, J., Tang, J., Yang, J., 2020. Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection. Advances in Neural Information Processing Systems.
[26] Li, Y., Chen, Y., Wang, N., Zhang, Z., 2019. Scale-aware trident networks for object detection, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6054–6063.
[27] Liang, X., Song, P., 2022. Excavating roi attention for underwater object detection, in: 2022 IEEE International Conference on Image Processing, IEEE.
[28] Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P., 2017. Focal loss for dense object detection, in: Proceedings of the IEEE/CVF International Conference on Computer Vision.
[29] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L., 2014. Microsoft coco: Common objects in context, in: European Conference on Computer Vision, Springer. pp. 740–755.
[30] Lin, W.H., Zhong, J.X., Liu, S., Li, T., Li, G., 2020. Roimix: Proposal-fusion among multiple images for underwater object detection, in: IEEE International Conference on Acoustics, Speech, and Signal Processing.
[31] Liu, S., Qi, L., Qin, H., Shi, J., Jia, J., 2018. Path aggregation network for instance segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[32] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C., 2016. Ssd: Single shot multibox detector, in: European Conference on Computer Vision, Springer.
[33] Lu, X., Li, B., Yue, Y., Li, Q., Yan, J., 2019. Grid r-cnn, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7363–7372.
[34] Pang, J., Chen, K., Shi, J., Feng, H., Ouyang, W., Lin, D., 2019. Libra r-cnn: Towards balanced learning for object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[35] Pedersen, M., Bruslund Haurum, J., Gade, R., Moeslund, T.B., 2019. Detection of marine animals in a new underwater dataset with varying visibility, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops.
[36] Qiao, S., Chen, L.C., Yuille, A., 2021. Detectors: Detecting objects with recursive feature pyramid and switchable atrous convolution, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[37] Qiu, H., Ma, Y., Li, Z., Liu, S., Sun, J., 2020. Borderdet: Border feature for dense object detection, in: European Conference on Computer Vision, Springer. pp. 549–564.
[38] Redmon, J., Divvala, S., Girshick, R., Farhadi, A., 2016. You only look once: Unified, real-time object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[39] Redmon, J., Farhadi, A., 2017. Yolo9000: Better, faster, stronger, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7263–7271.
[40] Ren, S., He, K., Girshick, R., Sun, J., 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems.
[41] Rossi, L., Karimi, A., Prati, A., 2020. A novel region of interest extraction layer for instance segmentation. arXiv preprint arXiv:2004.13665.
[42] Shrivastava, A., Gupta, A., Girshick, R., 2016. Training region-based object detectors with online hard example mining, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[43] Sun, P., Zhang, R., Jiang, Y., Kong, T., Xu, C., Zhan, W., Tomizuka, M., Li, L., Yuan, Z., Wang, C., et al., 2021. Sparse r-cnn: End-to-end object detection with learnable proposals, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[44] Tian, Z., Shen, C., Chen, H., He, T., 2019. Fcos: Fully convolutional one-stage object detection, in: Proceedings of the IEEE/CVF International Conference on Computer Vision.
[45] Vu, T., Jang, H., Pham, T.X., Yoo, C., 2019. Cascade rpn: Delving into high-quality region proposal network with adaptive convolution. Advances in Neural Information Processing Systems.
[46] Wang, J., Zhang, W., Cao, Y., Chen, K., Pang, J., Gong, T., Shi, J., Loy, C.C., Lin, D., 2020a. Side-aware boundary localization for more precise object detection, in: European Conference on Computer Vision.
[47] Wang, Z., Liu, C., Wang, S., Tang, T., Tao, Y., Yang, C., Li, H., Liu, X., Fan, X., 2020b. Udd: An underwater open-sea farm object detection dataset for underwater robot picking. arXiv preprint arXiv:2003.01446.
[48] Wu, Y., Chen, Y., Yuan, L., Liu, Z., Wang, L., Li, H., Fu, Y., 2020. Rethinking classification and localization for object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[49] Wu, Y., Kirillov, A., Massa, F., Lo, W.Y., Girshick, R., 2019. Detectron2. https://github.com/facebookresearch/detectron2.
[50] Yang, Z., Liu, S., Hu, H., Wang, L., Lin, S., 2019. Reppoints: Point set representation for object detection, in: Proceedings of the IEEE/CVF International Conference on Computer Vision.
[51] Zhang, H., Chang, H., Ma, B., Wang, N., Chen, X., 2020a. Dynamic r-cnn: Towards high quality object detection via dynamic training, in: European Conference on Computer Vision.
[52] Zhang, H., Wang, Y., Dayoub, F., Sunderhauf, N., 2021a. Varifocalnet: An iou-aware dense object detector, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[53] Zhang, S., Chi, C., Yao, Y., Lei, Z., Li, S.Z., 2020b. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[54] Zhang, S., Wen, L., Bian, X., Lei, Z., Li, S.Z., 2018. Single-shot refinement neural network for object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[55] Zhang, X., Wan, F., Liu, C., Ji, R., Ye, Q., 2019. Freeanchor: Learning to match anchors for visual object detection. Advances in Neural Information Processing Systems.
[56] Zhang, Y.F., Ren, W., Zhang, Z., Jia, Z., Wang, L., Tan, T., 2021b. Focal and efficient iou loss for accurate bounding box regression. arXiv preprint arXiv:2101.08158.
[57] Zhao, Z., Liu, Y., Sun, X., Liu, J., Yang, X., Zhou, C., 2021. Composited fishnet: Fish detection and species recognition from low-quality underwater videos. IEEE Transactions on Image Processing 30, 4719–4734.
[58] Zhou, X., Koltun, V., Krähenbühl, P., 2021. Probabilistic two-stage detection. arXiv preprint arXiv:2103.07461.
[59] Zhou, X., Wang, D., Krähenbühl, P., 2019. Objects as points. arXiv preprint arXiv:1904.07850.
[60] Zhu, B., Wang, J., Jiang, Z., Zong, F., Liu, S., Li, Z., Sun, J., 2020a. Autoassign: Differentiable label assignment for dense object detection. arXiv preprint arXiv:2007.03496.
[61] Zhu, C., He, Y., Savvides, M., 2019. Feature selective anchor-free module for single-shot object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[62] Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J., 2020b. Deformable detr: Deformable transformers for end-to-end object detection, in: International Conference on Learning Representations.

Pinhao Song received the B.E. degree in Mechanical Engineering in 2019 and is currently pursuing a master's degree in computer applied technology at Peking University. His current research interests include underwater object detection, generic object detection, and domain generalization.

Pengteng Li received the B.E. degree in Financial Engineering in 2020 and is currently pursuing a master's degree in computer applied technology at Shenzhen University. His current research interests include generic object detection and reinforcement learning.

Linhui Dai received the B.E. degree in Information System and Information Management in 2018 and is currently working toward the Ph.D. degree with the School of Electronics Engineering and Computer Science, Peking University, Beijing, China. Her current research interests include underwater object detection, open world object detection, and salient object detection.