Boosting R-CNN - Reweighting R-CNN Samples by RPN's Error For Underwater Object Detection
1. Introduction
Oceans account for 71% of the earth's total area and contain rich biological and mineral resources. As the resources on land have been heavily exploited, humans have turned their attention to ocean exploitation, which makes research on the oceans meaningful. Over the past few years, more and more researchers have considered applying underwater object detection (UOD) to autonomous underwater vehicles (AUVs) with visual systems to fulfill a series of underwater tasks such as marine organism capturing.

Generic Object Detection (GOD) has been researched for a long time and has obtained abundant achievements. However, GOD is not perfectly suitable for underwater environments, which bring new challenges to object detection (see Figure 1): (i) The images captured by the underwater visual system suffer from unbalanced light conditions and low contrast, which make the object boundaries hard to distinguish from the background. (ii) Aquatic organisms tend to live together, which causes severe occlusion. (iii) Aquatic organisms are good at hiding themselves: they have colors similar to the background, which makes them hard to recognize. Facing these new challenges, the boundaries between objects and background and the boundaries between different objects become vague, leading to the existence of vague objects in underwater environments.

Figure 1: The challenges of underwater environments. (a) Unbalanced light condition: complicated underwater terrains cause unbalanced light conditions. (b) Low contrast: the low-contrast image makes the boundaries of two holothurians blurred. (c) Severe occlusion: aquatic organisms tend to live together, causing occlusion. (d) Camouflage and mimicry: the starfish has a color similar to the environment, which makes it difficult to spot.

Existing works on UOD typically apply data augmentation methods [19, 30, 47] and use a strong feature extractor [14, 57] to improve the performance. However, these methods suffer from the following problems. (i) Previous underwater detectors receive the same supervision signal for all objects regardless of their vagueness. Thus, the classification score trained with a simple cross-entropy loss does not accurately reflect the vagueness of the objects, which causes false over-confident predictions.
However, accurately ranking the detection results is crucial for object detectors to achieve high performance. It is expected that a detector assigns low scores to detection results containing vague objects and high scores to results with clear objects. (ii) Previous underwater detectors are vulnerable to vague objects with blurred boundaries and colors similar to the background. That is because the gradient of the easy samples dominates the training of underwater detectors, which makes it difficult for detectors to learn the subtle differences between vague objects and the underwater background.

Different from existing UOD methods, we address the above problems through uncertainty modeling and hard example mining. We propose a two-stage detector named Boosting R-CNN (see Figure 2), which consists of three key components: RetinaRPN, a probabilistic inference pipeline, and boosting reweighting. Specifically, RetinaRPN generates proposals from backbone features with heavier heads that perform three tasks: objectness prediction, IoU prediction, and box localization. It combines the IoU prediction and the objectness as two indicators to model the prior uncertainty, in order to accurately measure the vagueness of the objects. With a proposed fast IoU loss, high-quality proposals can be obtained. Second, the probabilistic inference pipeline combines RetinaRPN's object prior and the R-CNN classification score to make a prediction, which uses the uncertainty from the first stage to improve the robustness of the detector. Third, boosting reweighting attaches more attention to hard examples whose priors are miscalculated by amplifying their loss according to the RPN's error. Since the final classification score of an object combines the RPN's prior and the R-CNN's score, the R-CNN trained with reweighted samples is robust to hard examples, modifying its score to correct the false positives and false negatives of RetinaRPN.

With these three components, our Boosting R-CNN can handle complicated underwater challenges and is robust to vague objects. Our method is evaluated on two underwater object detection datasets, UTDAC2020¹ and Brackish [35], not only achieving state-of-the-art performance but also maintaining a relatively high inference speed. Moreover, experiments on the Pascal VOC [13] and MS COCO [29] datasets show that Boosting R-CNN obtains favorable performance on general object detection. Our code is released at https://github.com/mousecpn/Boosting-R-CNN-Reweighting-R-CNN-Samples-by-RPN-s-Error-for-Underwater-Object-Detection.git.

¹ http://uodac.pcl.ac.cn/

2. Related Work
2.1. Object Detection
Existing object detection can be categorized into two mainstreams: two-stage and one-stage detectors. For two-stage detectors, the basic idea is to reduce the detection task to a classification problem [40]. In the first stage, the region proposal network (RPN) proposes candidate object bounding boxes, and RoI Pooling or RoI Align is leveraged to crop the features from the backbone and resize them to the same size. In the second stage, the R-CNN head performs the classification and regression tasks for all objects. One-stage detectors abandon the RPN and RoI Align, directly obtaining the coordinates of the bounding boxes and the classes of the objects. Nowadays, one-stage detectors can achieve the same level of performance as two-stage detectors. There are two branches of one-stage detectors: anchor-based methods and anchor-free methods. Early works on one-stage detectors are mostly anchor-based [32, 28]. Recently, some works rethink whether the anchor is necessary and propose designs that abandon the use of anchors [44, 59].

As the research on object detection goes deeper, researchers find that the concepts of one-stage and two-stage detectors are not entirely different. Some works leverage the advantages of two-stage detectors to enhance the performance of one-stage detectors. RefineDet [54] separates one-stage detection into two sub-modules: the anchor refinement module and the object detection module. AlignDet [8] uses deformable convolution (DCN) to imitate RoIAlign and obtain aligned features in the second stage. RepPoints [50] applies the idea of refinement and feature alignment to anchor-free detectors based on keypoint detection. Two-stage detectors are also nurtured by the achievements of one-stage detectors. CenterNet2 [58] finds that a strong anchor-free one-stage detector used as the RPN can predict an accurate object likelihood that informs the overall detection score. Combining the object likelihood of the RPN and the conditional classification score of the R-CNN achieves higher performance with fewer proposals, which reduces the inference cost. Our Boosting R-CNN is a probabilistic two-stage detector like CenterNet2. The difference is that we build a strong anchor-based RPN and apply a hard example mining mechanism based on the RPN's errors.

2.2. Hard Example Mining
Hard example mining methods aim to attach more attention to hard examples, relying on the hypothesis that training on hard examples leads to better performance. The first deep detector to use hard example mining is the Single Shot Detector [32], which chooses only the negative examples with the highest loss values. Online Hard Example Mining (OHEM) [42] considers both hard positive and negative examples for training. Considering the efficiency and memory problems of OHEM, IoU-based sampling [34] was proposed, associating the hardness of negative examples with their IoUs and sampling evenly across all IoU ranges. Focal loss [28] is a soft hard-example mining method that dynamically assigns more weight to hard examples based on the classification score. Prime Sample Attention (PISA) [3] proposes an IoU Hierarchical Local Rank for all samples, assigning higher weights to positive examples with higher IoUs. Different from the methods mentioned above, our two-stage Boosting R-CNN defines the hardness of the examples based on their prior probability from the proposed RetinaRPN. A soft reweighting mechanism is proposed to amplify the loss of the hard examples and shrink the loss of the easy examples.
2.3. Underwater Object Detection
As an indispensable technology for AUVs to perform multiple tasks under the water, underwater object detection has attracted a large amount of attention from researchers all around the world. For instance, Huang et al. [19] introduce perspective transformation, turbulence simulation, and illumination synthesis into data augmentation. Chen et al. [9] design a novel underwater salient detection model established by mathematically simulating the biological vision mechanism of aquatic animals. RoIMix [30] is a data augmentation method that applies mixup at the RoI level to imitate occlusion conditions. SWIPENET [7] takes full advantage of both high-resolution and semantic-rich hyper feature maps to improve the performance on small objects; besides, a novel sample-reweighted loss and a new training paradigm, CMA, are proposed, which are noise-immune. Poisson GAN [47] is also a data augmentation method, which pastes objects onto the underwater background by Poisson blending and uses a GAN to correct the artifacts. FERNet [14] consists of three modules: a composite connected backbone, a receptive field augmentation module, and a prediction refinement scheme. Composited FisherNet [57] targets underwater video object detection, leveraging the differences between foreground and background to extract salient features, and proposes an enhanced path aggregation network to solve the insufficient utilization of semantic information caused by linear up-sampling. RoIAttn [27] considers RoI patches as tokens and applies an external attention module on the RoIs to improve the performance of underwater object detection. Compared with the methods mentioned above, to the best of our knowledge, our idea of using the RPN's error for hard example mining has not been investigated by any existing underwater object detection approach.

3. Boosting R-CNN
3.1. Overview
Different from the vanilla two-stage detector Faster R-CNN, the proposed two-stage detector Boosting R-CNN has three key components: RetinaRPN, the probabilistic inference pipeline, and boosting reweighting. The pipeline of our Boosting R-CNN is shown in Figure 2. In detail, the backbone and the feature fusion neck (e.g., ResNet+PAFPN) first extract features from images. Second, RetinaRPN provides a series of high-quality proposals with corresponding prior probabilities. Third, boosting reweighting amplifies the classification loss of the hard examples whose priors are miscalculated, while decreasing the weight of the easy examples with accurately estimated priors. Fourth, the R-CNN head, which contains two fully-connected layers, is trained on the reweighted RoI samples. In the inference stage, the final score is the square root of the product of the prior and the classification score.

Figure 2: The overview of the proposed Boosting R-CNN. The backbone and the feature fusion neck first extract features from images. On each level Q3-Q7, the RetinaRPN head applies four shared convolution layers (H×W×256) followed by three parallel branches: objectness (H×W×A), IoU prediction (H×W×A), and box localization (H×W×4A). RetinaRPN provides a series of high-quality proposals with corresponding prior probability. Boosting reweighting amplifies the classification loss of the hard examples whose priors are miscalculated while decreasing the weight of the easy examples with accurately estimated priors. The R-CNN head, which contains two fully-connected layers, is trained on reweighted RoI samples. In the inference stage, the final score is the square root of the product of the prior and the classification score.

3.2. Backbone and Feature Fusion Neck
Given an image I \in \mathbb{R}^{3 \times H_0 \times W_0} (with RGB channels), a backbone (e.g., ResNet50) generates multi-scale feature maps \{x^l\}_{l=3}^{5} at C_3-C_5 (C_l has a resolution 2^l times smaller than the input). The multi-scale feature maps are then sent into the feature fusion neck.

PAFPN [31] is employed as the feature fusion neck. PAFPN contains two parts: the top-down path and the bottom-up path. In the top-down path, the high-level features are used to enhance the low-level features. Given the multi-scale feature maps \{x^l\}_{l=3}^{5} from the backbone, the output features \{p^l\}_{l=3}^{5} are computed as:

p^5 = conv(x^5),   (1)
p^4 = conv(x^4) + u(p^5),   (2)
p^3 = conv(x^3) + u(p^4),   (3)

where conv(\cdot) denotes a convolution layer and u(\cdot) denotes a 2x upsampling layer. In the bottom-up path, the low-level features are leveraged to augment the high-level features, yielding the feature maps \{q^l\}_{l=3}^{7}:

q^3 = conv(p^3),   (4)
q^4 = conv(p^4) + d(q^3),   (5)
q^5 = conv(p^5) + d(q^4),   (6)
q^6 = conv_s(q^5),   (7)
q^7 = conv_s(q^6),   (8)

where conv_s(\cdot) denotes a convolution layer with stride 2 and d(\cdot) denotes a 2x downsampling layer. The output multi-scale features \{q^l\}_{l=3}^{7} are fed into the detection head.
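To make the fusion in Eqs. (1)-(8) concrete, the following is a minimal PyTorch sketch of such a neck, assuming 256-channel inputs. The specific layer choices (3x3 convolutions, nearest-neighbor upsampling, and max-pooling as d(\cdot)) are illustrative assumptions, not the exact configuration of PAFPN [31] or of our released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PAFPNSketch(nn.Module):
    """Minimal sketch of Eqs. (1)-(8): top-down then bottom-up fusion."""
    def __init__(self, channels=256):
        super().__init__()
        # one 3x3 conv per level for each path (illustrative choice)
        self.td_convs = nn.ModuleList(nn.Conv2d(channels, channels, 3, padding=1) for _ in range(3))
        self.bu_convs = nn.ModuleList(nn.Conv2d(channels, channels, 3, padding=1) for _ in range(3))
        # stride-2 convs produce the extra levels q6, q7 (Eqs. (7)-(8))
        self.down6 = nn.Conv2d(channels, channels, 3, stride=2, padding=1)
        self.down7 = nn.Conv2d(channels, channels, 3, stride=2, padding=1)

    def forward(self, x3, x4, x5):
        # top-down path, Eqs. (1)-(3): upsample high levels, add to low levels
        p5 = self.td_convs[2](x5)
        p4 = self.td_convs[1](x4) + F.interpolate(p5, scale_factor=2, mode="nearest")
        p3 = self.td_convs[0](x3) + F.interpolate(p4, scale_factor=2, mode="nearest")
        # bottom-up path, Eqs. (4)-(6): downsample low levels, add to high levels
        q3 = self.bu_convs[0](p3)
        q4 = self.bu_convs[1](p4) + F.max_pool2d(q3, kernel_size=2)
        q5 = self.bu_convs[2](p5) + F.max_pool2d(q4, kernel_size=2)
        # extra levels, Eqs. (7)-(8)
        q6 = self.down6(q5)
        q7 = self.down7(q6)
        return q3, q4, q5, q6, q7
```

With inputs x3, x4, x5 at strides 8, 16, and 32, the outputs q3-q7 land at strides 8 through 128, matching the five levels consumed by the detection head.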
3.3. RetinaRPN
The RPN is responsible for providing proposals that potentially contain objects. Underwater images are blurred, low-contrast, and distorted, which makes it difficult to distinguish the objects from the background. Besides, under occlusion, the objectness trained with a simple cross-entropy loss in the vanilla RPN is not a good estimate of the proposal box localization accuracy. As a result, high-quality proposals may be filtered out by poorly regressed proposals with higher objectness. To obtain high-quality proposals with accurate prior probabilities, we aim to build a strong RPN inspired by the designs of current one-stage detectors, named the retina region proposal network (RetinaRPN).

Heavier Head. Instead of using one simple convolution layer as in the vanilla RPN, we use four convolution layers with group normalization. More convolution layers provide a more powerful capability to detect vague objects in blurred, low-contrast, and distorted underwater images.

Multi-Ratio Anchors. For each FPN level, we use anchors at three aspect ratios {1:2, 1:1, 2:1} with sizes {2^0, 2^{1/3}, 2^{2/3}} of 32^2 to 512^2 for FPN levels Q_3-Q_7. In total, there are A = 9 anchors per pixel. The anchor is an important prior for regressing and classifying aquatic organisms with vague boundaries.
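A quick sketch of how such an anchor set can be generated is given below; the ratio convention (ratio = h/w) and the exact per-level base sizes are assumptions for illustration.

```python
import itertools
import math

def anchor_sizes_per_level(base_sizes=(32, 64, 128, 256, 512),
                           ratios=(0.5, 1.0, 2.0),
                           octave_scales=(0.0, 1.0 / 3.0, 2.0 / 3.0)):
    """Sketch of the multi-ratio anchors: 3 ratios x 3 scales = A = 9 per pixel.

    Returns, for each FPN level Q3-Q7, a list of (width, height) pairs.
    """
    levels = []
    for base in base_sizes:                 # 32^2 ... 512^2 areas for Q3-Q7
        anchors = []
        for ratio, s in itertools.product(ratios, octave_scales):
            size = base * 2.0 ** s          # sizes {2^0, 2^(1/3), 2^(2/3)} of the base
            w = size * math.sqrt(1.0 / ratio)   # assumed convention: ratio = h / w
            h = size * math.sqrt(ratio)
            anchors.append((w, h))
        levels.append(anchors)
    return levels
```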
Loss Function. RetinaRPN performs three tasks: objectness prediction, box localization, and IoU prediction. The objectness branch is trained to predict whether there is an object in an anchor. We leverage the focal loss as the objectness loss:

L_{fl}(\hat{p}_i) = \begin{cases} -\alpha (1 - \hat{p}_i)^{\gamma} \log(\hat{p}_i), & y_i = 1, \\ -(1 - \alpha)\, \hat{p}_i^{\gamma} \log(1 - \hat{p}_i), & y_i = 0, \end{cases}   (9)

where \hat{p}_i is the predicted objectness of anchor i, \alpha and \gamma are the hyper-parameters of focal loss, and y_i is the label of anchor i. It is set to 1 if anchor i is a positive sample; otherwise, it is set to 0. As for the positive and negative sample assignment, anchors with an IoU over 0.5 with a ground-truth box are regarded as positive samples, while anchors with IoUs below 0.5 are regarded as negative samples.
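The objectness loss in Eq. (9) reduces to a few lines of PyTorch; the defaults alpha = 0.25 and gamma = 2 follow the common focal-loss setting [28] and are assumptions here, as is leaving the reduction over anchors to the caller.

```python
import torch

def objectness_focal_loss(p_hat, y, alpha=0.25, gamma=2.0, eps=1e-6):
    """Sketch of Eq. (9): focal loss on the RPN objectness branch.

    p_hat: predicted objectness in (0, 1), shape (N,)
    y:     binary anchor labels (1 = positive, 0 = negative), shape (N,)
    """
    p_hat = p_hat.clamp(eps, 1.0 - eps)  # avoid log(0)
    pos = -alpha * (1.0 - p_hat) ** gamma * torch.log(p_hat)
    neg = -(1.0 - alpha) * p_hat ** gamma * torch.log(1.0 - p_hat)
    # per-anchor losses; the normalization over anchors is not specified
    # in the text above, so it is left to the caller
    return torch.where(y == 1, pos, neg)
```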
The localization branch aims to output proposals refined from the anchors. Usually, the IoU loss is leveraged as the regression loss, L_{IoU}(\hat{b}_i) = 1 - g_i, where g_i denotes the IoU between the predicted box \hat{b}_i and its corresponding ground truth; the IoU is also the standard overlap metric of object detection. However, the convergence speed of the IoU loss is slow. In order to increase the convergence speed, an L2 loss is added to the IoU loss. The improved IoU loss can be rewritten as:

L'_{IoU}(\hat{b}_i) = 1 - g_i + \sum_{j \in \{x,y,w,h\}} \| \hat{t}_{i,j} - t^{*}_{i,j} \|_2^2,   (13)

\hat{t}_{i,x} = (\hat{x}_i - x^a_i) / w^a_i, \quad \hat{t}_{i,y} = (\hat{y}_i - y^a_i) / h^a_i,   (14)
\hat{t}_{i,w} = \log(\hat{w}_i / w^a_i), \quad \hat{t}_{i,h} = \log(\hat{h}_i / h^a_i),   (15)
t^{*}_{i,x} = (x^{*}_i - x^a_i) / w^a_i, \quad t^{*}_{i,y} = (y^{*}_i - y^a_i) / h^a_i,   (16)
t^{*}_{i,w} = \log(w^{*}_i / w^a_i), \quad t^{*}_{i,h} = \log(h^{*}_i / h^a_i),   (17)

where \{x^a_i, y^a_i, w^a_i, h^a_i\} are the coordinates of anchor i, \{\hat{x}_i, \hat{y}_i, \hat{w}_i, \hat{h}_i\} and \{x^{*}_i, y^{*}_i, w^{*}_i, h^{*}_i\} are the coordinates of the predicted box and its corresponding ground truth, and \{\hat{t}_{i,x}, \hat{t}_{i,y}, \hat{t}_{i,w}, \hat{t}_{i,h}\} and \{t^{*}_{i,x}, t^{*}_{i,y}, t^{*}_{i,w}, t^{*}_{i,h}\} denote the encodings of the four coordinates of the predicted box and the ground truth, respectively. This encoding method is the same as in [40].
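A small sketch of the encoding in Eqs. (14)-(17) follows, assuming boxes and anchors in center-size (x, y, w, h) format; the function name and tensor layout are illustrative.

```python
import torch

def encode_boxes(boxes, anchors):
    """Sketch of Eqs. (14)-(17): encode (x, y, w, h) boxes as deltas w.r.t. anchors.

    boxes, anchors: tensors of shape (N, 4) in center-size format (x, y, w, h).
    """
    tx = (boxes[:, 0] - anchors[:, 0]) / anchors[:, 2]   # Eq. (14)/(16), x
    ty = (boxes[:, 1] - anchors[:, 1]) / anchors[:, 3]   # Eq. (14)/(16), y
    tw = torch.log(boxes[:, 2] / anchors[:, 2])          # Eq. (15)/(17), w
    th = torch.log(boxes[:, 3] / anchors[:, 3])          # Eq. (15)/(17), h
    return torch.stack((tx, ty, tw, th), dim=1)
```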
However, the L2 loss is very vulnerable to outliers, which harms the regression accuracy. To solve this problem, we design the fast IoU loss (FIoU), inspired by [56]:

L_{FIoU}(\hat{b}_i) = g_i^{\eta} \left( 1 - g_i + \sum_{j \in \{x,y,w,h\}} \| \hat{t}_{i,j} - t^{*}_{i,j} \|_2^2 \right),   (18)

L_{loc\text{-}rpn} = \frac{1}{m} \sum_{i=1}^{m} L_{FIoU}(\hat{b}_i),   (19)

where \eta is a parameter that controls the degree of inhibition of outliers, and m is the number of positive samples. We add the IoU-weighted term g_i^{\eta} to alleviate the vulnerability to outliers. With the IoU-weighted term, low-quality samples with a high regression loss are filtered, because their weighted term becomes small. RetinaRPN thus focuses on the prime samples with moderate regression accuracy, which enhances the robustness to outliers while retaining fast convergence.

The IoU prediction branch is trained to predict the IoUs between the regressed boxes and their corresponding ground truths, and the cross entropy between the predicted and the real IoUs is used as the loss function. With the IoU prediction branch, the detector can incorporate uncertainty into the prior when objects are occluded in the underwater environment. In detail, the objectness denotes the likelihood that an anchor contains an object. Although the objectness trained with focal loss can effectively filter the negative samples, it also assigns a high value to proposals in which the object is severely covered by other objects. The IoU prediction predicts the IoU between the proposal and its ground truth and assigns a value to the object according to its level of occlusion. Combining the two indicators includes uncertainties from different perspectives and comprehensively models the prior probabilities of the proposals.
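A minimal sketch of Eqs. (18)-(19) is given below. Whether the IoU-weighted term g_i^{\eta} receives gradients is not specified above, so it is detached here as an assumption.

```python
import torch

def fast_iou_loss(t_hat, t_star, g, eta=2.0):
    """Sketch of Eqs. (18)-(19): fast IoU loss over positive samples.

    t_hat, t_star: encoded predicted/target deltas, shape (m, 4)
    g:             IoU between each predicted box and its ground truth, shape (m,)
    eta:           inhibition exponent; eta = 2 is the best setting in Fig. 4
    """
    l2 = ((t_hat - t_star) ** 2).sum(dim=1)          # squared delta errors, Eq. (18)
    weight = g.detach() ** eta                       # IoU-weighted term (assumed detached)
    per_sample = weight * (1.0 - g + l2)
    return per_sample.mean()                         # (1/m) * sum, Eq. (19)
```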
3.4. Probabilistic Inference Pipeline
For a two-stage detector, in the first stage, the RPN outputs K proposal boxes b_1, ..., b_K. For each proposal k \in \{1, ..., K\}, the RPN predicts a class-agnostic foreground prior probability P(O_k), where O_k = 1 denotes that proposal k is an object and O_k = 0 denotes background. This is realized by a binary classifier trained with a log-likelihood objective. In the second stage, high-scoring proposals are sampled to train the R-CNN head, a softmax classifier. The R-CNN learns to classify each proposal into one of the foreground classes or background. The output classification score of proposal k for class C_k can be seen as a conditional categorical probability P(C_k | O_k = 1) (C_k \in \mathcal{C} \cup \{bg\}, where \mathcal{C} is the set of classes and bg denotes background). However, in the inference stage, the final detection score directly uses the classification score of the R-CNN head, ignoring the prior probability from the RPN. Moreover, during the training of the R-CNN head, since the supervision signals of all proposals are equivalent with a softmax classifier regardless of the localization accuracy, the R-CNN head easily outputs false over-confident predictions. Thus, compared with using the conditional categorical probability P(C_k = c | O_k = 1), it is more reasonable to use the marginal probability P(C_k = c), c \in \mathcal{C}, as the final detection score. We set P(C_k = bg | O_k = 0) = 1 and P(C_k = c | O_k = 0) = 0, which means that it is impossible for the R-CNN head to reconsider a proposal as a positive sample if the RPN regards the proposal as a negative sample. The marginal probability P(C_k = c) can be written as:

P(C_k = c) = \sum_{u \in \{0,1\}} P(C_k = c | O_k = u)\, P(O_k = u)
           = P(C_k = c | O_k = 1) P(O_k = 1) + P(C_k = c | O_k = 0) P(O_k = 0)
           = P(C_k = c | O_k = 1) P(O_k = 1).   (22)

In the inference stage, the final score is the square root of the product of the prior and the classification score, namely:

s_k(c) = \sqrt{pr_k \cdot cls_k(c)},   (23)

where cls_k(c) is the classification score of sample k for class c in the R-CNN, pr_k is the prior from RetinaRPN, and s_k(c) is the final score. With the probabilistic inference pipeline, the detector takes the first-stage uncertainty into consideration when making the final predictions. Thus, compared with the conditional probability, the marginal probability is a better estimation of the detection box localization accuracy.
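The inference-time combination in Eqs. (22)-(23) reduces to a few tensor operations; the sketch below assumes the RPN prior and the R-CNN softmax scores are already gathered per proposal.

```python
import torch

def final_detection_scores(prior, cls_scores):
    """Sketch of Eqs. (22)-(23): combine the RPN prior with R-CNN scores.

    prior:      P(O_k = 1) from RetinaRPN, shape (K,)
    cls_scores: P(C_k = c | O_k = 1) softmax scores, shape (K, C)
    """
    marginal = cls_scores * prior.unsqueeze(1)   # Eq. (22): P(C_k = c)
    return torch.sqrt(marginal)                  # Eq. (23): square-root calibration
```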
3.5. Boosting Reweighting
There is a deficiency in the probabilistic inference pipeline above. In the original two-stage detector, the second stage makes predictions independently of the first stage. As a result, a low score for a high-quality sample in the first stage does not influence the final detection result, as long as the sample is selected as a proposal. However, in the probabilistic two-stage pipeline, when the RPN mistakenly generates a low prior for a high-quality positive proposal, it is hard to reconsider it as a high-confidence prediction, because the final score is the square root of the product of the prior and the classification score. In underwater environments, vague objects appear frequently as hard examples, and the RPN suffers severely from them.

To solve this problem, we hope that when the RPN miscalculates the prior of a proposal, the R-CNN can rectify the error. Thus, we propose a soft sampling strategy named boosting reweighting (BR, shown in Figure 3), which borrows the idea of reweighting from the boosting algorithm and fits well into existing frameworks. Different from vanilla Faster R-CNN, where the weights of all proposals are set to 1, BR attaches more attention to hard examples whose priors are miscalculated. In detail, for sample k, the weight is set according to the RPN's prior:

w_k = \begin{cases} (1 - prior_k)^{\omega}, & k \text{ is a positive sample}, \\ prior_k^{\omega}, & k \text{ is a negative sample}, \end{cases}   (24)

where \omega controls the strength of the reweighting. The classification loss of the R-CNN head then becomes:

L_{cls} = \frac{1}{K} \sum_{k=1}^{K} \sum_{c=1}^{C} w_k \cdot (-s_k^c \cdot \log(\hat{s}_k^c)),   (25)

where \hat{s}_k^c and s_k^c denote the predicted classification score and the label of sample k for class c, s_k^c \in \{0, 1\}, and K and C are the number of proposals in the second stage and the number of classes, respectively. Note that since the weighted terms are all smaller than 1, the total value of the classification loss shrinks, which shrinks the gradient. In order to keep the norm of the total loss unchanged, we normalize w to w':

w'_k = w_k \cdot \frac{\sum_{k=1}^{K} \sum_{c=1}^{C} (-s_k^c \cdot \log(\hat{s}_k^c))}{\sum_{k=1}^{K} \sum_{c=1}^{C} w_k \cdot (-s_k^c \cdot \log(\hat{s}_k^c))},   (26)

L'_{cls} = \frac{1}{K} \sum_{k=1}^{K} \sum_{c=1}^{C} w'_k \cdot (-s_k^c \cdot \log(\hat{s}_k^c)).   (27)

Figure 3: The overview of the proposed boosting reweighting. The patch size denotes the weight of the RoI samples.

When the detector encounters hard positive/negative samples, the priors from the RPN will be small/large. As a result, the weighted term (1 - prior(k))^{\omega} / prior(k)^{\omega} increases and amplifies the loss of the hard examples, while the loss of the easy samples is decreased.

BR can be seen as hard example mining. There are two works similar to our BR: OHEM and focal loss. OHEM is a bootstrapping method originally designed for Fast R-CNN (without an RPN): it performs a feedforward pass for all RoIs on the R-CNN and selects the hardest samples for training in a second feedforward pass. Our BR leverages the prior information from the RPN with only one feedforward pass, saving a large amount of memory and training time. Focal loss is designed for RetinaNet to solve the extreme imbalance between foreground and background. However, the NMS in the RPN and the bootstrapping mechanism in the second stage already alleviate the imbalance problem, which overlaps with the function of focal loss. Our BR is used in combination with NMS and bootstrapping. To avoid the shrinking of the loss, normalization is leveraged to redistribute the weight of each sample. Thus, BR aims to handle the problem of hard samples in the underwater environment instead of the foreground-background imbalance. Both OHEM and focal loss reweight the loss by the R-CNN's error, while our BR reweights the loss by the RPN's error. The R-CNN trained with BR excavates the subtle differences between aquatic organisms and the background and is robust to the samples to which the RPN is vulnerable. Thus, the R-CNN can rectify the RPN's error in the inference stage. The experiments show that BR is more compatible with the probabilistic inference pipeline than other methods.
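The reweighting of Eqs. (24)-(27) can be sketched as below. Treating the weights and the normalization factor as constants (detached) and placing background at the last class index are assumptions of this sketch, not a description of the released code.

```python
import torch
import torch.nn.functional as F

def boosting_reweighted_cls_loss(logits, labels, prior, omega=0.5, eps=1e-6):
    """Sketch of Eqs. (24)-(27): reweight R-CNN samples by the RPN's error.

    logits: R-CNN classification logits, shape (K, C+1); last class = background
    labels: target class indices, shape (K,)
    prior:  RetinaRPN prior P(O_k = 1) of each sampled RoI, shape (K,)
    omega:  reweighting exponent; omega = 0.5 is the best setting in Fig. 5
    """
    ce = F.cross_entropy(logits, labels, reduction="none")  # per-sample CE
    is_bg = labels == logits.shape[1] - 1
    # Eq. (24): hard positives have small priors, hard negatives large priors
    w = torch.where(is_bg, prior ** omega, (1.0 - prior) ** omega).detach()
    # Eq. (26): renormalize so the total loss magnitude is unchanged
    w = w * ce.detach().sum() / (w * ce.detach()).sum().clamp_min(eps)
    return (w * ce).mean()                                  # Eq. (27)
```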
Table 1
Comparisons with other object detection methods on UTDAC2020 dataset. The FPS is tested on a single Nvidia GTX 1080Ti
GPU. ‘*’ means that the model uses the second training recipe.
Method Backbone AP AP50 AP75 APS APM APL FPS
Two-Stage Detector:
Faster R-CNN w/ FPN [40] ResNet50 44.5 80.9 44.1 20.0 39.0 50.8 11.6
OHEM+Faster R-CNN w/ FPN [42] ResNet50 45.1 82.0 45.1 21.6 39.1 51.4 11.6
Cascade R-CNN [2] ResNet50 46.6 81.5 49.3 21.0 40.9 53.3 8.8
Libra R-CNN [34] ResNet50 45.8 82.0 46.4 20.1 40.2 52.3 11.0
Cascade RPN [45] ResNet50 46.5 79.5 41.2 20.4 38.6 47.7 8.3
Faster R-CNN w/ PAFPN [31] ResNet50 45.5 82.1 45.9 18.8 39.7 51.9 10.9
Double-Head [48] ResNet50 45.3 81.5 45.7 20.2 40.0 51.4 5.7
Dynamic R-CNN [51] ResNet50 45.6 80.1 47.3 19.0 39.7 52.1 12.1
Faster R-CNN w/ FPG [5] ResNet50 45.4 81.6 46.0 19.8 39.7 51.4 13.1
GRoIE [41] ResNet50 45.7 82.4 45.6 19.9 40.1 52.0 6.0
SABL+Faster R-CNN [46] ResNet50 46.6 81.6 48.2 19.6 40.4 53.4 9.9
PISA [3] ResNet50 46.3 82.1 47.4 20.8 40.8 52.6 10.3
Sparse R-CNN [43] ResNet50 37.4 70.4 35.9 17.7 33.3 43.0 10.8
DetectoRS [36] ResNet50 47.6 82.8 49.9 23.1 41.8 54.2 4.0
RoIAttn [27] ResNet50 46.0 82.0 47.5 22.9 40.5 52.2 8.8
CenterNet2 [58] ResNet50 47.2 81.6 49.8 18.2 41.3 53.4 14.2
CenterNet2* [58] ResNet50 48.9 83.0 52.6 21.7 43.5 55.2 14.2
One-Stage Detector:
SSD512 [32] VGG16 40.0 77.5 36.5 14.7 36.1 45.1 25.0
RetinaNet [28] ResNet50 43.9 80.4 42.9 18.1 38.2 50.1 11.4
FSAF [61] ResNet50 43.9 81.0 42.9 18.5 38.9 50.9 12.8
CenterNet [59] ResNet18 31.3 61.1 27.6 11.9 32.5 33.4 6.2
FCOS [44] ResNet50 43.9 81.1 43.0 19.9 38.2 50.4 12.7
RepPoints [50] ResNet50 44.0 80.5 43.0 18.7 38.5 50.3 11.1
FreeAnchor [55] ResNet50 46.3 82.3 46.9 21.0 40.5 52.6 11.4
RetinaNet w/ NASFPN [17] ResNet50 37.4 70.3 35.8 12.4 36.4 40.4 13.8
ATSS [53] ResNet50 46.2 82.5 46.9 19.7 41.4 52.4 11.8
PAA [20] ResNet50 47.5 83.1 49.7 19.5 42.4 53.6 6.6
AutoAssign [60] ResNet50 46.3 83.0 47.6 18.0 41.3 52.2 12.3
GFL [25] ResNet50 46.4 81.9 47.8 19.3 40.9 52.5 12.7
VFNet [52] ResNet50 44.0 79.3 44.1 18.8 38.1 50.4 10.5
Transformer:
Deformable DETR [62] ResNet50 46.6 84.1 47.0 24.1 42.4 51.9 7.6
Ours:
Boosting R-CNN ResNet50 48.5 82.4 52.5 21.1 42.4 55.0 13.5
Boosting R-CNN* ResNet50 51.4 85.5 56.8 23.8 45.8 57.8 13.5
Table 2
Comparisons with other object detection methods on the Brackish dataset. "Baseline" is the performance reported in the original paper of Brackish.
Method Backbone AP AP50
Baseline (YOLOv3) [35] DarkNet53 38.9 83.7
Faster R-CNN w/ FPN [21] ResNet50 79.3 97.4
Cascade R-CNN [2] ResNet50 80.7 96.9
RetinaNet [28] ResNet50 78.0 96.5
DetectoRS [36] ResNet50 81.6 97.0
CenterNet2 [58] ResNet50 79.3 97.4
Boosting R-CNN ResNet50 82.0 97.4

4. Experiments
4.1. Datasets
We conduct experiments on four challenging object detection datasets to validate the generalization performance of our method.

(i) UTDAC2020 is an underwater dataset from the Underwater Target Detection Algorithm Competition 2020. There are 5,168 training images and 1,293 validation images. It contains four classes: echinus, holothurian, starfish, and scallop. The images come in four resolutions: 3840×2160, 1920×1080, 720×405, and 586×480. We follow the COCO-style evaluation metric.

(ii) Brackish is an earlier underwater image dataset collected in temperate brackish waters. It contains six classes: bigfish, crab, jellyfish, shrimp, small fish, and starfish. There are 9,967, 1,467, and 1,468 images in the training, validation, and test sets, containing 25,613 annotations in total. The image size is 960×540. We follow the MS COCO-style AP[0.5:0.95:0.05] metric and the Pascal VOC-style AP50 metric, as in the original paper.

(iii) Pascal VOC is a generic object detection dataset, which contains 20 object categories. The dataset includes a VOC2007 part and a VOC2012 part. In the VOC2007 part, there are 9,963 annotated images, consisting of a trainval set (5,011 images) and a test set (4,952 images). In the VOC2012 part, there are 11,540 annotated images in the trainval set. We train our detector on the 07+12 trainval dataset and evaluate it on the 07 test set.

(iv) MS COCO is a generic object detection dataset, which contains 80 object categories. It contains 118k images for training (trainval), 5k images for validation (val), and 20k images for testing without provided annotations (test-dev). The final results are reported on the test-dev set.

4.2. Implementation Details
Our method is implemented on MMDetection [6]. There are two training recipes in the experiments. The first one is the default training recipe, which adopts the classic 1x training scheme (12 epochs). SGD is adopted as the optimizer, with a weight decay of 0.0001 and a momentum of 0.9. The initial learning rate is 0.005 and is divided by a factor of 10 at epochs 8 and 11. No extra data augmentation except traditional horizontal flipping is utilized. The second training recipe adopts the 3x training scheme (36 epochs) with crop and multi-scale augmentation. AdamW is leveraged as the optimizer with an initial learning rate of 0.0001 and a weight decay of 0.05. The learning rate is divided by a factor of 10 at epochs 24 and 33.

The method is trained on a single NVIDIA GTX 1080Ti GPU. During inference, we use a maximum of 256 proposal boxes in the second stage, which improves the inference speed. As for the balancing parameters of the losses, λ_obj-rpn, λ_loc-rpn, λ_iou-rpn, λ_cls, and λ_reg are set to 1, 2, 1, 2, and 2, respectively.
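For reference, the two recipes correspond to configurations like the following; this is an illustrative MMDetection-style sketch whose field values follow the text above, not the released configuration file.

```python
# First recipe (classic 1x schedule), MMDetection 2.x config style.
optimizer = dict(type='SGD', lr=0.005, momentum=0.9, weight_decay=0.0001)
lr_config = dict(policy='step', step=[8, 11])          # divide lr by 10 at epochs 8, 11
runner = dict(type='EpochBasedRunner', max_epochs=12)  # 12-epoch 1x schedule

# Second recipe (3x schedule with AdamW), per the text above:
# optimizer = dict(type='AdamW', lr=0.0001, weight_decay=0.05)
# lr_config = dict(policy='step', step=[24, 33])
# runner = dict(type='EpochBasedRunner', max_epochs=36)
```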
4.3. Comparisons with Other State-of-the-art Methods
We compare Boosting R-CNN against state-of-the-art methods on four object detection datasets in Tables 1, 2, 3, and 4.

4.3.1. Results on UTDAC2020
The experimental results on the UTDAC2020 dataset are shown in Table 1. CenterNet2* and Boosting R-CNN* denote that the models use multi-scale training and 3x training time. Besides, Deformable DETR is trained for 50 epochs with multi-scale training. All the detectors are implemented in MMDetection [6] except CenterNet2, which is officially implemented in Detectron2 [49].

As shown in Table 1, in the single-scale training setting, Boosting R-CNN achieves 48.5% AP, which is higher than DetectoRS (47.6% AP), PAA (47.5% AP), and CenterNet2 (47.2% AP). In the multi-scale training setting, Boosting R-CNN still surpasses CenterNet2 (51.4% AP vs. 48.9% AP). As a result, our Boosting R-CNN outperforms all the compared detectors and sets a new state of the art. As for the inference speed, Boosting R-CNN achieves 13.5 FPS, which is higher than most two-stage detectors including Faster R-CNN (11.6 FPS) but lower than CenterNet2 (14.2 FPS).
Table 4
Comparisons with other object detection methods on MS COCO dataset with the large backbone in single-scale testing. ‘*’ means
that our model uses the second training recipe. The performances of other methods are published in their papers, and they all
use 3x training time.
Method Backbone AP AP50 AP75 APS APM APL
Two-Stage Detector:
Faster R-CNN w/ FPN[40] ResNet101 36.2 59.1 39.0 18.2 39.0 48.2
Cascade R-CNN [2] ResNet101 42.8 62.1 46.3 23.7 45.5 55.2
Grid R-CNN [33] ResNet101 41.5 60.9 44.5 23.3 44.9 53.1
Libra R-CNN [34] ResNeXt101-64x4d 43.0 64.0 47.0 25.3 45.6 54.6
Double-Head [48] ResNet101 42.3 62.8 46.3 23.9 44.9 54.3
Dynamic R-CNN [51] ResNet101-DCN 46.9 65.9 51.3 28.1 49.6 60.0
BorderDet [37] ResNeXt101-64x4d-DCN 48.0 67.1 52.1 29.4 50.7 60.5
TridentNet [26] ResNet101-DCN 46.8 67.6 51.5 28.0 51.2 60.5
CPN [12] HG104 47.0 65.0 51.0 26.5 50.2 60.7
One-Stage Detector:
FCOS [44] ResNeXt101-64x4d-DCN 46.6 65.9 50.8 28.6 49.1 58.6
CornerNet [23] HG104 40.6 56.4 43.2 19.1 42.8 54.3
CenterNet [59] HG104 42.1 61.1 45.9 24.1 45.5 52.8
CentripetalNet [11] HG104 46.1 63.1 49.7 25.3 48.7 59.2
RetinaNet [28] ResNet101 39.1 59.1 42.3 21.8 42.7 50.2
FSAF [61] ResNeXt101-64x4d 42.9 63.8 46.3 26.6 46.2 52.7
RepPoints [50] ResNet101-DCN 45.0 66.1 49.0 26.6 48.6 57.5
RepPointsV2 ResNet101-DCN 48.1 67.5 51.8 28.7 50.9 60.8
FreeAnchor [55] ResNeXt101-32x8d 46.0 65.6 49.8 27.8 49.5 57.7
ATSS [53] ResNeXt101-32x8d-DCN 47.7 66.5 51.9 29.7 50.8 59.4
PAA [20] ResNeXt101-64x4d-DCN 49.0 67.8 53.3 30.2 52.8 62.2
AutoAssign [60] ResNeXt101-64x4d-DCN 49.5 68.7 54.0 29.9 52.6 62.0
GFL [25] ResNeXt101-32x4d-DCN 48.2 67.4 52.6 29.2 51.7 60.2
GFLV2 [24] Res2Net101-DCN 50.6 69.0 55.3 31.3 54.3 63.5
YOLOv4 [1] CSPDarkNet-53 43.5 65.7 47.3 26.7 46.7 53.3
Transformer:
DETR [4] ResNet101 43.5 63.8 46.4 21.9 48.0 61.8
Deformable DETR [62] ResNeXt101-64x4d-DCN 50.1 69.7 54.6 30.6 52.8 65.6
Ours:
Boosting R-CNN* ResNet50 44.4 63.9 48.2 26.9 47.0 54.8
Boosting R-CNN* Res2Net101-DCN 50.7 69.2 55.8 31.7 54.1 63.5
Figure 4: The choice of η in the FIoU loss (AP on UTDAC2020 for η ∈ {0, 0.5, 1, 2, 3}). η = 0 means dropping the IoU weighted term.

Figure 4 shows the choice of the hyper-parameter η in the fast IoU loss. If η is too large, the gradient is dominated by the easy samples with high IoUs. If η is too small, the loss lacks the ability to filter the outliers. When η is set to 2, the highest performance of 48.5% AP is obtained. When η is set to 0, which is equivalent to the fast IoU loss without the IoU weighted term, the performance is lower; thus, the training of the model suffers from the outliers.

4.4.2. Hard Example Mining
Table 6 shows the experiments on different hard example mining methods. Since our BR is a kind of hard example mining, it is necessary to compare it to other hard example mining methods, i.e., OHEM, PISA, and focal loss. We replace BR with these methods to evaluate their effectiveness and their compatibility with the probabilistic inference pipeline. "Cls. Loss" denotes the classification loss in the R-CNN head, and "Random" means randomly sampling positive and negative RoIs during training. The first row (47.9% AP) corresponds to the next-to-last row in Table 5. BR achieves the highest performance (48.3% AP). Using OHEM instead gives a relatively lower performance (47.5% AP). PISA severely harms the performance (46.9% AP). Focal loss also causes a severe performance decrease; when γ is set lower, which means that focal loss gets closer to the cross-entropy loss, the performance is restored. The experiments in Table 6 show that in the probabilistic pipeline, BR helps the R-CNN to correct the mistakes of the RPN and is more compatible than OHEM, PISA, and focal loss.

Figure 5 shows the experiment on the choice of ω in BR. In this experiment, PAFPN is not used. From the figure, it can be concluded that normalization improves the performance and shifts the optimal value of ω. The highest performance of 48.3% AP is obtained when ω is set to 0.5.

4.4.3. Anchor Assignment
In Table 7, we adopt other positive and negative anchor assignment strategies in our RetinaRPN. PAFPN and boosting reweighting are not leveraged in these experiments. Although ATSS [53], PAA [20], and OTA [16] achieve astonishing performance in one-stage detectors, they decrease the performance when they play the role of the RPN. The reason may be that the adaptive assignment provides over-confident priors, which decreases the recall. Our setting "(0.5, 0.5)" achieves the best performance in underwater object detection.
Table 5
A detailed ablation study of Boosting R-CNN. The first five columns denote the ablation studies of RetinaRPN. "4 L." denotes using 4 convolution layers with GN. "NA." denotes the number of anchors. "Reg Loss" denotes the regression loss in the RPN. "FL" denotes using focal loss and abandoning the bootstrapping in the RPN. "IoUp." means adding IoU prediction in the RPN with a cross-entropy loss. "Prob" denotes using the probabilistic inference pipeline. "BR" means boosting reweighting.

Row | 4 L. | NA. | Reg Loss | FL | IoUp. | Neck | Prob | BR | AP | AP50 | AP75
1 | | 3 | L1 | | | FPN | | | 44.5 | 80.9 | 44.1
2 | ✓ | 3 | L1 | | | FPN | | | 45.1 | 81.6 | 45.9
3 | | 3 | L1 | | | FPN | ✓ | | 45.3 | 81.6 | 45.8
4 | ✓ | 3 | L1 | ✓ | | FPN | | | 45.4 | 81.2 | 46.3
5 | ✓ | 3 | GIoU | ✓ | | FPN | | | 45.8 | 80.5 | 47.5
6 | ✓ | 9 | L1 | ✓ | | FPN | | | 46.7 | 80.0 | 49.0
7 | ✓ | 9 | GIoU | ✓ | | FPN | ✓ | | 46.8 | 82.2 | 48.8
8 | ✓ | 9 | L1 | ✓ | | FPN | ✓ | | 47.2 | 82.5 | 49.3
9 | ✓ | 9 | L1 | ✓ | ✓ | FPN | ✓ | | 47.5 | 83.0 | 50.3
10 | ✓ | 9 | GIoU | ✓ | ✓ | FPN | ✓ | | 47.6 | 82.7 | 50.4
11 | ✓ | 9 | CIoU | ✓ | ✓ | FPN | ✓ | | 47.6 | 82.8 | 50.1
12 | ✓ | 9 | F-EIoU | ✓ | ✓ | FPN | ✓ | | 47.7 | 83.0 | 49.8
13 | ✓ | 9 | FIoU | ✓ | ✓ | FPN | | | 46.9 | 81.3 | 49.6
14 | ✓ | 9 | FIoU | ✓ | ✓ | FPN | ✓ | | 47.9 | 82.8 | 50.7
15 | ✓ | 9 | FIoU | ✓ | ✓ | FPN | ✓ | ✓ | 48.3 | 82.6 | 51.8
16 | ✓ | 9 | FIoU | ✓ | ✓ | PAFPN | ✓ | ✓ | 48.5 | 82.4 | 52.5
Table 6
The ablation studies of hard example mining. "FL(α, γ)" denotes using focal loss with hyper-parameters α and γ.
Cls. Loss | Sampling | AP | AP50 | AP75
CE | Random | 47.9 | 82.8 | 50.7
CE | OHEM | 47.5 | 82.1 | 49.8
CE | PISA | 46.9 | 82.7 | 48.2
FL (0.25, 2) | Random | 44.7 | 79.7 | 45.8
FL (0.5, 1) | Random | 46.5 | 81.1 | 48.8
FL (0.5, 0.1) | Random | 47.0 | 82.0 | 48.9
FL (0.25, 2) | None | 46.7 | 80.2 | 49.7
CE | BR | 48.3 | 82.6 | 51.8

Table 7
Positive and negative assignment. "(0.5, 0.5)" denotes that the samples with IoUs over 0.5 are regarded as positive, while the samples with IoUs below 0.5 are regarded as negative. BR is not used in this experiment.
Assignment | AP | AP50 | AP75
ATSS | 46.5 | 81.9 | 48.1
PAA | 47.3 | 83.1 | 49.1
OTA | 46.4 | 81.1 | 48.8
(0.5, 0.5) | 47.9 | 82.8 | 50.7
4.5. Qualitative Comparisons
Figure 6 shows the qualitative comparison between Boosting R-CNN and other state-of-the-art methods on the UTDAC2020 dataset. We apply the detectors to some challenging cases, with the prediction score threshold set to 0.05. For clarity, in each image we visualize the prediction boxes with the top-k scores, where k is the number of ground-truth boxes in the image. The orange boxes denote predictions whose IoU with a certain ground truth is over 0.5 and higher than that of the other predictions. The blue boxes denote the unmatched predictions. Besides, we also print the missed ground-truth boxes in red. Thus, more blue boxes in an image suggest lower precision, and more red boxes suggest lower recall.

The first two rows denote the blurring and low-contrast conditions. Boosting R-CNN detects all the ground truths (no red box) in these images with the highest precision (only one blue box). The third row denotes the unbalanced light condition. ATSS, PAA, and DetectoRS all miss the echinus in the center, while our Boosting R-CNN does not miss any ground truth. The fourth row denotes the occlusion condition. Boosting R-CNN handles the case where a starfish covers a scallop, on which DetectoRS makes a mistake. The last two rows denote the mimicry condition. In the fifth case, the stone is very similar to the scallop, and Boosting R-CNN precisely distinguishes the scallop from the stone. Moreover, other detectors miss the small echinus in the top-left corner for lack of regression capacity; the proposed Boosting R-CNN accurately detects this echinus, which shows that RetinaRPN with the FIoU loss can provide high-quality proposals. In the last case, although the starfish hides in the waterweeds, Boosting R-CNN still successfully detects it.
Figure 6: Qualitative comparison results on the UTDAC2020 dataset. The orange boxes denote the matched predictions. The blue boxes denote the unmatched predictions. The red boxes denote the undetected ground truths. The first two rows denote the blurring and low contrast conditions. The third row denotes the unbalanced light condition. The fourth row denotes the occlusion condition. The last two rows denote the mimicry condition.
In these cases, RetinaRPN assigns too small a prior for the predictions. With the correction of the R-CNN, the missed ground truths can be detected by increasing the second-stage score.

5. Conclusion
Underwater object detection faces new challenges compared with generic object detection, such as blur, low contrast, occlusion, and mimicry. In this paper, we propose a brand-new two-stage underwater detector, Boosting R-CNN, to solve the problems mentioned above. First, the proposed RetinaRPN has a strong capacity to detect objects in blurred, low-contrast, and distorted images, and provides high-quality proposals with accurate estimations of the object prior probability under occlusion. Second, the proposed probabilistic inference pipeline helps the detector make predictions based on the uncertainties of the vague objects, resulting in a reasonable ranking of the prediction scores. Third, boosting reweighting is proposed to train the second stage with the error of the first stage; it is a kind of hard example mining and helps the second stage rectify the errors of the probabilistic pipeline. The experiments on two public underwater datasets demonstrate that Boosting R-CNN outperforms other state-of-the-art detectors in underwater object detection. The competitive performance on two public generic object detection datasets shows the generalization ability of Boosting R-CNN. Comprehensive ablation studies show the effectiveness of the proposed modules.

References
[1] Bochkovskiy, A., Wang, C.Y., Liao, H.Y.M., 2020. Yolov4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934.
[2] Cai, Z., Vasconcelos, N., 2018. Cascade r-cnn: Delving into high quality object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[3] Cao, Y., Chen, K., Loy, C.C., Lin, D., 2020. Prime sample attention in object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[4] Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S., 2020. End-to-end object detection with transformers, in: European Conference on Computer Vision.
[5] Chen, K., Cao, Y., Loy, C.C., Lin, D., Feichtenhofer, C., 2020a. Feature pyramid grids. arXiv preprint arXiv:2004.03580.
[6] Chen, K., Wang, J., Pang, J., Cao, Y., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Xu, J., et al., 2019a. Mmdetection: Open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155.
[7] Chen, L., Zhou, F., Wang, S., Dong, J., Li, N., Ma, H., Wang, X., Zhou, H., 2020b. Swipenet: Object detection in noisy underwater images. arXiv preprint arXiv:2010.10006.
[8] Chen, Y., Han, C., Wang, N., Zhang, Z., 2019b. Revisiting feature alignment for one-stage object detection. arXiv preprint arXiv:1908.01570.
[9] Chen, Z., Gao, H., Zhang, Z., Zhou, H., Wang, X., Tian, Y., 2020c. Underwater salient object detection by combining 2d and 3d visual features. Neurocomputing 391, 249–259.
[10] Dai, J., Li, Y., He, K., Sun, J., 2016. R-fcn: Object detection via region-based fully convolutional networks. Advances in Neural Information Processing Systems 29.
[11] Dong, Z., Li, G., Liao, Y., Wang, F., Ren, P., Qian, C., 2020. Centripetalnet: Pursuing high-quality keypoint pairs for object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10519–10528.
[12] Duan, K., Xie, L., Qi, H., Bai, S., Huang, Q., Tian, Q., 2020. Corner proposal network for anchor-free, two-stage object detection, in: European Conference on Computer Vision.
[13] Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A., 2010. The pascal visual object classes (voc) challenge. International Journal of Computer Vision 88, 303–338.
[14] Fan, B., Chen, W., Cong, Y., Tian, J., 2020. Dual refinement underwater object detection network, in: European Conference on Computer Vision.
[15] Fu, C.Y., Liu, W., Ranga, A., Tyagi, A., Berg, A.C., 2017. Dssd: Deconvolutional single shot detector. arXiv preprint arXiv:1701.06659.
[16] Ge, Z., Liu, S., Li, Z., Yoshie, O., Sun, J., 2021. Ota: Optimal transport assignment for object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 303–312.
[17] Ghiasi, G., Lin, T.Y., Le, Q.V., 2019. Nas-fpn: Learning scalable feature pyramid architecture for object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[18] Gidaris, S., Komodakis, N., 2015. Object detection via a multi-region and semantic segmentation-aware cnn model, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 1134–1142.
[19] Huang, H., Zhou, H., Yang, X., Zhang, L., Qi, L., Zang, A.Y., 2019. Faster r-cnn for marine organisms detection and recognition using data augmentation. Neurocomputing 337, 372–384.
[20] Kim, K., Lee, H.S., 2020. Probabilistic anchor assignment with iou prediction for object detection, in: European Conference on Computer Vision.
[21] Kim, S.W., Kook, H.K., Sun, J.Y., Kang, M.C., Ko, S.J., 2018. Parallel feature pyramid network for object detection, in: European Conference on Computer Vision, pp. 234–250.
[22] Kong, T., Sun, F., Yao, A., Liu, H., Lu, M., Chen, Y., 2017. Ron: Reverse connection with objectness prior networks for object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5936–5944.
[23] Law, H., Deng, J., 2018. Cornernet: Detecting objects as paired keypoints, in: European Conference on Computer Vision.
[24] Li, X., Wang, W., Hu, X., Li, J., Tang, J., Yang, J., 2021. Generalized focal loss v2: Learning reliable localization quality estimation for dense object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11632–11641.
[25] Li, X., Wang, W., Wu, L., Chen, S., Hu, X., Li, J., Tang, J., Yang, J., 2020. Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection. Advances in Neural Information Processing Systems.
[26] Li, Y., Chen, Y., Wang, N., Zhang, Z., 2019. Scale-aware trident networks for object detection, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6054–6063.
[27] Liang, X., Song, P., 2022. Excavating roi attention for underwater object detection, in: 2022 IEEE International Conference on Image Processing, IEEE.
[28] Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P., 2017. Focal loss for dense object detection, in: Proceedings of the IEEE/CVF International Conference on Computer Vision.
[29] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L., 2014. Microsoft coco: Common objects in context, in: European Conference on Computer Vision, Springer. pp. 740–755.
[30] Lin, W.H., Zhong, J.X., Liu, S., Li, T., Li, G., 2020. Roimix: Proposal-fusion among multiple images for underwater object detection, in: IEEE International Conference on Acoustics, Speech, and Signal Processing.
[31] Liu, S., Qi, L., Qin, H., Shi, J., Jia, J., 2018. Path aggregation network for instance segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[32] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C., 2016. Ssd: Single shot multibox detector, in: European Conference on Computer Vision, Springer.
[33] Lu, X., Li, B., Yue, Y., Li, Q., Yan, J., 2019. Grid r-cnn, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7363–7372.
[34] Pang, J., Chen, K., Shi, J., Feng, H., Ouyang, W., Lin, D., 2019. Libra r-cnn: Towards balanced learning for object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[35] Pedersen, M., Bruslund Haurum, J., Gade, R., Moeslund, T.B., 2019. Detection of marine animals in a new underwater dataset with varying visibility, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops.
[36] Qiao, S., Chen, L.C., Yuille, A., 2021. Detectors: Detecting objects with recursive feature pyramid and switchable atrous convolution, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[37] Qiu, H., Ma, Y., Li, Z., Liu, S., Sun, J., 2020. Borderdet: Border feature for dense object detection, in: European Conference on Computer Vision, Springer. pp. 549–564.
[38] Redmon, J., Divvala, S., Girshick, R., Farhadi, A., 2016. You only look once: Unified, real-time object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[39] Redmon, J., Farhadi, A., 2017. Yolo9000: Better, faster, stronger, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7263–7271.
[40] Ren, S., He, K., Girshick, R., Sun, J., 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems.
[41] Rossi, L., Karimi, A., Prati, A., 2020. A novel region of interest extraction layer for instance segmentation. arXiv preprint arXiv:2004.13665.
[42] Shrivastava, A., Gupta, A., Girshick, R., 2016. Training region-based object detectors with online hard example mining, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[43] Sun, P., Zhang, R., Jiang, Y., Kong, T., Xu, C., Zhan, W., Tomizuka, M., Li, L., Yuan, Z., Wang, C., et al., 2021. Sparse r-cnn: End-to-end object detection with learnable proposals, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[44] Tian, Z., Shen, C., Chen, H., He, T., 2019. Fcos: Fully convolutional one-stage object detection, in: Proceedings of the IEEE/CVF International Conference on Computer Vision.
[45] Vu, T., Jang, H., Pham, T.X., Yoo, C., 2019. Cascade rpn: Delving into high-quality region proposal network with adaptive convolution. Advances in Neural Information Processing Systems.
[46] Wang, J., Zhang, W., Cao, Y., Chen, K., Pang, J., Gong, T., Shi, J., Loy, C.C., Lin, D., 2020a. Side-aware boundary localization for more precise object detection, in: European Conference on Computer Vision.
[47] Wang, Z., Liu, C., Wang, S., Tang, T., Tao, Y., Yang, C., Li, H., Liu, X., Fan, X., 2020b. Udd: An underwater open-sea farm object detection dataset for underwater robot picking. arXiv preprint arXiv:2003.01446.
[48] Wu, Y., Chen, Y., Yuan, L., Liu, Z., Wang, L., Li, H., Fu, Y., 2020. Rethinking classification and localization for object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[49] Wu, Y., Kirillov, A., Massa, F., Lo, W.Y., Girshick, R., 2019. Detectron2. https://github.com/facebookresearch/detectron2.
[50] Yang, Z., Liu, S., Hu, H., Wang, L., Lin, S., 2019. Reppoints: Point set representation for object detection, in: Proceedings of the IEEE/CVF International Conference on Computer Vision.
[51] Zhang, H., Chang, H., Ma, B., Wang, N., Chen, X., 2020a. Dynamic r-cnn: Towards high quality object detection via dynamic training, in: European Conference on Computer Vision.
[52] Zhang, H., Wang, Y., Dayoub, F., Sunderhauf, N., 2021a. Varifocalnet: An iou-aware dense object detector, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[53] Zhang, S., Chi, C., Yao, Y., Lei, Z., Li, S.Z., 2020b. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[54] Zhang, S., Wen, L., Bian, X., Lei, Z., Li, S.Z., 2018. Single-shot refinement neural network for object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[55] Zhang, X., Wan, F., Liu, C., Ji, R., Ye, Q., 2019. Freeanchor: Learning to match anchors for visual object detection. Advances in Neural Information Processing Systems.
[56] Zhang, Y.F., Ren, W., Zhang, Z., Jia, Z., Wang, L., Tan, T., 2021b. Focal and efficient iou loss for accurate bounding box regression. arXiv preprint arXiv:2101.08158.
[57] Zhao, Z., Liu, Y., Sun, X., Liu, J., Yang, X., Zhou, C., 2021. Composited fishnet: Fish detection and species recognition from low-quality underwater videos. IEEE Transactions on Image Processing 30, 4719–4734.
[58] Zhou, X., Koltun, V., Krähenbühl, P., 2021. Probabilistic two-stage detection. arXiv preprint arXiv:2103.07461.
[59] Zhou, X., Wang, D., Krähenbühl, P., 2019. Objects as points. arXiv preprint arXiv:1904.07850.
[60] Zhu, B., Wang, J., Jiang, Z., Zong, F., Liu, S., Li, Z., Sun, J., 2020a. Autoassign: Differentiable label assignment for dense object detection. arXiv preprint arXiv:2007.03496.
[61] Zhu, C., He, Y., Savvides, M., 2019. Feature selective anchor-free module for single-shot object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[62] Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J., 2020b. Deformable detr: Deformable transformers for end-to-end object detection, in: International Conference on Learning Representations.

Pinhao Song received the B.E. degree in Mechanical Engineering in 2019 and is currently pursuing a master's degree in computer applied technology at Peking University. His current research interests include underwater object detection, generic object detection, and domain generalization.

Pengteng Li received the B.E. degree in Financial Engineering in 2020 and is currently pursuing a master's degree in computer applied technology at Shenzhen University. His current research interests include generic object detection and reinforcement learning.

Linhui Dai received the B.E. degree in Information System and Information Management in 2018 and is currently working toward the Ph.D. degree with the School of Electronics Engineering and Computer Science, Peking University, Beijing, China. Her current research interests include underwater object detection, open world object detection, and salient object detection.