Varifocal Net
To overcome these shortcomings, we naturally would like to ask: instead of predicting an additional localization accuracy score, can we merge it into the classification score? That is, can we predict a localization-aware, or IoU-aware, classification score (IACS) that simultaneously represents the presence of a certain object class and the localization accuracy of a generated bounding box?

In this paper, we answer the above question and make the following contributions. (1) We show that accurately ranking candidate detections is critical for high-performing dense object detectors, and that the IACS achieves a better ranking than other measures (Section 3). (2) We propose a new Varifocal Loss for training dense object detectors to regress the IACS. (3) We design a new star-shaped bounding box feature representation for computing the IACS and refining the bounding box. (4) We develop a new dense object detector based on FCOS [9]+ATSS [12] and the proposed components, named VarifocalNet or VFNet for short, to exploit the advantage of the IACS. An illustration of our method is shown in Figure 1.

The Varifocal Loss, inspired by the focal loss [8], is a dynamically scaled binary cross entropy loss. However, it supervises the dense object detector to regress continuous IACSs, and, more distinctively, it adopts an asymmetric training example weighting scheme. It down-weights only negative examples to address the class imbalance problem during training, and up-weights high-quality positive examples to generate prime detections. This focuses the training on high-quality positive examples, which is important for achieving high detection performance.

The star-shaped bounding box feature representation uses the features at nine fixed sampling points (yellow circles in Figure 1) to represent a bounding box via deformable convolution [13, 14]. Compared to the point feature used in most existing dense object detectors [7–9, 15], this representation captures the geometry of the bounding box and its nearby contextual information, which is essential for predicting an accurate IACS. It also enables us to effectively refine the initially generated coarse bounding box without losing efficiency.

To verify the effectiveness of the proposed modules, we build the VFNet on top of FCOS+ATSS and evaluate it on the COCO benchmark [16]. Experiments show that our VFNet consistently exceeds the strong baseline by ∼2.0 AP with different backbones, and that our best model, VFNet-X-1200 with Res2Net-101-DCN, reaches a single-model single-scale 55.1 AP on COCO test-dev, surpassing previously published best single-model single-scale results.

2. Related Work

Object Detection. Currently popular object detectors can be categorized by whether or not they use anchor boxes. While popular two-stage methods [3, 4] and multi-stage methods [17] usually employ anchors to generate object proposals for downstream classification and regression, anchor-based one-stage methods [6–8, 12, 18, 19] directly classify and refine anchor boxes without object proposal generation. More recently, anchor-free detectors have attracted substantial attention due to their novelty and simplicity. One kind formulates object detection as a key-point or semantic-point detection problem, including CornerNet [20], CenterNet [21], ExtremeNet [22], ObjectsAsPoints [23] and RepPoints [24]. Another type of anchor-free detector is similar to anchor-based one-stage methods but removes the anchor boxes: it classifies each point on the feature pyramids [25] into foreground classes or background, and directly predicts the distances from a foreground point to the four sides of the ground-truth bounding box to produce the detection. Popular methods include DenseBox [26], FSAF [27], FoveaBox [15], FCOS [9], and SAPD [28]. We build our VFNet on the ATSS [12] version of FCOS due to its simplicity, high efficiency and excellent performance.

Detection Ranking Measures. In addition to the classification score, other detection ranking measures have been proposed. IoU-Net [10] adopts an additional network to predict the IoU and uses it to rank bounding boxes in NMS, but it still selects the classification score as the final detection score. Fitness NMS [29], IoU-aware RetinaNet [11] and [30] are similar to IoU-Net in essence, except that they multiply the predicted IoU or IoU-based ranking scores with the classification score as the ranking basis. Instead of predicting an IoU-based score, FCOS [9] predicts a centerness score to suppress low-quality detections.

By contrast, we predict only the IACS as the ranking score. This avoids the overhead of an additional network and the possibly worse ranking basis that results from multiplying imperfect localization and classification scores.

Encoding the Bounding Box. Extracting discriminative features to represent a bounding box is important for downstream classification and regression in object detection. In two-stage and multi-stage methods, RoI Pooling [2, 3] or RoIAlign [4] is widely employed to extract bounding box features, but applying these operations in dense object detectors is time-consuming. Instead, one-stage detectors generally use point features as the bounding box descriptor [7–9] for efficiency. However, such local features fail to capture the geometry of the bounding box and essential contextual information.

Alternatively, HSD [31] and RepPoints [24] extract features at learned semantic points with deformable convolution to encode a bounding box. However, learning to localize semantic points is challenging due to the lack of strong supervision, and the prediction of semantic points also aggravates the computation burden.
In comparison, our proposed star-shaped bounding box representation uses the features at nine fixed sampling points to describe a bounding box. It is simple, efficient, and yet able to capture the geometry of the bounding box and the spatial context cues around it.

Generalized Focal Loss. The most similar work to ours is a concurrent one, the Generalized Focal Loss (GFL) [32]. The GFL extends the focal loss [8] to a continuous version and trains a detector to predict a joint representation of localization quality and classification. We emphasize first that our varifocal loss is a distinct function from the GFL: it weights positive and negative examples asymmetrically, whereas the GFL deals with them equally, and experimental results show that our varifocal loss performs better than the GFL. Moreover, we propose a star-shaped bounding box feature representation to facilitate the prediction of the IACS, and further improve the object localization accuracy through a bounding box refinement step, neither of which is considered in the GFL.

3. Motivation

In this section, we investigate the performance upper bound of a popular anchor-free dense object detector, FCOS [9], identify its main performance hindrance and show the importance of producing the IoU-aware classification score as the ranking criterion.

FCOS is built on FPN [25] and its detection head has three branches. One predicts the classification score for each point on the feature map, one regresses the distances from the point to the four sides of a bounding box, and another predicts the centerness score, which is multiplied by the classification score to rank the bounding boxes in NMS. Figure 2 shows an example of the output from the FCOS head. In this paper, we actually study the ATSS version of FCOS (FCOS+ATSS), in which the Adaptive Training Sample Selection (ATSS) mechanism [12] is used to define foreground and background points on the feature pyramids during training. We refer the reader to [12] for more details.

Figure 2: An example of the output from the FCOS head, which includes a classification score, a bounding box and a centerness score.

To investigate the performance upper bound of the FCOS+ATSS (trained on COCO train2017 [16]), we alternately replace the predicted classification score, the distance offsets and the centerness score with corresponding ground-truth values for foreground points before NMS, and evaluate the detection performance in terms of AP [16] on COCO val2017. For the classification score vector, we implement two options, that is, replacing its element at the ground-truth label position with a value of 1.0 or with the IoU between the predicted bounding box and the ground-truth one (termed the gt IoU). We also consider replacing the centerness score with the gt IoU in addition to its true value.

Table 1: Performance of the FCOS+ATSS on COCO val2017 with different oracle predictions. w/ctr means using the centerness score in inference; gt ctr, gt ctr iou, gt bbox, gt cls and gt cls iou mean replacing the corresponding prediction with the ground-truth centerness, the gt IoU (as centerness), the ground-truth box, 1.0 at the ground-truth class, and the gt IoU at the ground-truth class, respectively.

FCOS+ATSS   (1)   (2)   (3)   (4)   (5)   (6)   (7)   (8)   (9)   (10)
w/ctr              X     X     X           X           X           X
gt ctr                   X
gt ctr iou                     X
gt bbox                              X     X
gt cls                                           X     X
gt cls iou                                                    X     X
AP          38.5  39.2  41.1  43.5  56.1  56.3  43.1  58.1  74.7  67.4

The results are shown in Table 1. We can see that the original FCOS+ATSS achieves 39.2 AP. When using the ground-truth centerness score (gt ctr) in inference, unexpectedly, only about 2.0 AP is gained. Similarly, replacing the predicted centerness score with the gt IoU (gt ctr iou) only achieves 43.5 AP. This indicates that ranking detections by the product of the classification score and either the predicted centerness score or an IoU score cannot bring a significant performance gain.

By contrast, the FCOS+ATSS with ground-truth bounding boxes (gt bbox) achieves 56.1 AP even without the centerness score (no w/ctr) in inference. If the classification score is instead set to 1.0 at the ground-truth label position (gt cls), whether or not the centerness score is used becomes important (43.1 AP vs 58.1 AP), because the centerness score can differentiate accurate and inaccurate boxes to some extent.

The most surprising result is the one obtained by replacing the classification score of the ground-truth class with the gt IoU (gt cls iou). Without the centerness score, this case achieves an impressive 74.7 AP, significantly higher than the other cases. This in fact reveals that, for most objects, accurately localized bounding boxes already exist in the large candidate pool. The key to achieving excellent detection performance is to accurately select those high-quality detections from the pool, and these results show that replacing the classification score of the ground-truth class with the gt IoU is the most promising selection measure. We refer to the element of such a score vector as the IoU-aware Classification Score (IACS).
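To make the gt cls iou oracle concrete, the sketch below (not code from the paper; the box_iou helper, the corner box format and the COCO class count of 80 are assumptions made for illustration) builds the IACS-style score vector for one foreground point: the gt IoU at the ground-truth class index, zero elsewhere.

```python
import numpy as np

def box_iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2); assumed helper."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def iacs_oracle(pred_box, gt_box, gt_label, num_classes=80):
    """gt cls iou oracle: gt IoU at the ground-truth class, 0 elsewhere."""
    scores = np.zeros(num_classes, dtype=np.float32)
    scores[gt_label] = box_iou(pred_box, gt_box)
    return scores

# A well-localized candidate of class 17 receives a high ranking score ...
print(iacs_oracle([10, 10, 50, 50], [12, 10, 52, 50], gt_label=17)[17])  # ~0.90
# ... while a poorly localized candidate of the same class is ranked low.
print(iacs_oracle([10, 10, 30, 30], [12, 10, 52, 50], gt_label=17)[17])  # ~0.22
```

Ranking candidates by such a vector is exactly what yields the 74.7 AP oracle result above.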
4. VarifocalNet

Based on the discovery above, we propose to learn the IoU-aware classification score (IACS) to rank detections. To this end, we build a new dense object detector, coined VarifocalNet or VFNet, based on the FCOS+ATSS with the centerness branch removed. Compared with the FCOS+ATSS, it has three new components: the varifocal loss, the star-shaped bounding box feature representation and the bounding box refinement.

4.1. IACS – IoU-Aware Classification Score

We define the IACS as a scalar element of a classification score vector, in which the value at the ground-truth class label position is the IoU between the predicted bounding box and its ground truth, and 0 at the other positions.

4.2. Varifocal Loss

We design the novel Varifocal Loss for training a dense object detector to predict the IACS. Since it is inspired by the focal loss [8], we first briefly review the focal loss. The focal loss is designed to address the extreme imbalance between foreground and background classes during the training of dense object detectors. It is defined as:

\mathrm{FL}(p, y) = \begin{cases} -\alpha (1-p)^{\gamma} \log(p) & \text{if } y = 1 \\ -(1-\alpha)\, p^{\gamma} \log(1-p) & \text{otherwise,} \end{cases} \quad (1)

where y ∈ {±1} specifies the ground-truth class and p ∈ [0, 1] is the predicted probability for the foreground class. As shown in Equation 1, the modulating factor ((1 − p)^γ for the foreground class and p^γ for the background class) reduces the loss contribution from easy examples and relatively increases the importance of mis-classified examples. Thus, the focal loss prevents the vast number of easy negatives from overwhelming the detector during training and focuses the detector on a sparse set of hard examples.

We borrow the example weighting idea from the focal loss to address the class imbalance problem when training a dense object detector to regress the continuous IACS. However, unlike the focal loss, which deals with positives and negatives equally, we treat them asymmetrically. Our varifocal loss is also based on the binary cross entropy loss and is defined as:

\mathrm{VFL}(p, q) = \begin{cases} -q \left( q \log(p) + (1-q) \log(1-p) \right) & q > 0 \\ -\alpha\, p^{\gamma} \log(1-p) & q = 0, \end{cases} \quad (2)

where p is the predicted IACS and q is the target score. For a foreground point, q for its ground-truth class is set to the IoU between the generated bounding box and its ground truth (the gt IoU) and to 0 otherwise, whereas for a background point, the target q for all classes is 0 (see Figure 1).

As Equation 2 shows, the varifocal loss only reduces the loss contribution from negative examples (q = 0), by scaling their losses with a factor of p^γ, and does not down-weight positive examples (q > 0) in the same way. This is because positive examples are extremely rare compared with negatives and we should preserve their precious learning signals. On the other hand, inspired by PISA [33] and [34], we weight each positive example with its training target q. If a positive example has a high gt IoU, its contribution to the loss will thus be relatively large. This focuses the training on those high-quality positive examples, which are more important for achieving a higher AP than the low-quality ones.

To balance the losses between positive and negative examples, we add an adjustable scaling factor α to the negative loss term.
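For reference, a minimal element-wise NumPy sketch of Equation 2 follows. It is not the authors' MMDetection implementation; the clipping epsilon and the defaults α = 0.75, γ = 2.0 (the best values found in Section 5.1.1) are choices made only for illustration.

```python
import numpy as np

def varifocal_loss(p, q, alpha=0.75, gamma=2.0, eps=1e-12):
    """Element-wise varifocal loss (Equation 2).

    p: predicted IACS in (0, 1), shape (N, C)
    q: target score, the gt IoU at the gt class of a foreground point, 0 elsewhere
    """
    p = np.clip(p, eps, 1.0 - eps)
    bce = -(q * np.log(p) + (1.0 - q) * np.log(1.0 - p))  # binary cross entropy
    loss = np.where(
        q > 0,
        q * bce,                                     # positives: weighted by the target q
        alpha * (p ** gamma) * (-np.log(1.0 - p)),   # negatives: down-weighted by p^gamma
    )
    return loss.sum()

# Toy check: a high-quality positive (q = 0.9) contributes more than a low-quality one (q = 0.3).
p = np.array([[0.6, 0.6]]); q = np.array([[0.9, 0.3]])
print(varifocal_loss(p, q))
```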
4.3. Star-Shaped Box Feature Representation

We design a star-shaped bounding box feature representation for IACS prediction. It uses the features at nine fixed sampling points (yellow circles in Figure 1) to represent a bounding box with the deformable convolution [13, 14]. This new representation can capture the geometry of a bounding box and its nearby contextual information, which is essential for encoding the misalignment between the predicted bounding box and the ground-truth one.

Specifically, given a sampling location (x, y) on the image plane (or a projected point on the feature map), we first regress an initial bounding box from it with a 3×3 convolution. Following FCOS, this bounding box is encoded by a 4D vector (l', t', r', b') denoting the distances from the location (x, y) to the left, top, right and bottom sides of the bounding box, respectively. With this distance vector, we heuristically select nine sampling points at (x, y), (x−l', y), (x, y−t'), (x+r', y), (x, y+b'), (x−l', y−t'), (x+r', y−t'), (x−l', y+b') and (x+r', y+b'), and then map them onto the feature map. Their relative offsets to the projected point of (x, y) serve as the offsets of the deformable convolution [13, 14], and the features at these nine projected points are then convolved by the deformable convolution to represent a bounding box. Since these points are selected manually, without an additional prediction burden, our new representation is computationally efficient.
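The bookkeeping for turning the nine star points into deformable-convolution offsets could look like the sketch below. It is an assumption-laden illustration: the stride handling, the point ordering and the exact offset channel convention of the deformable-convolution operator are not fixed by the text above and differ between implementations.

```python
def star_offsets(x, y, l, t, r, b, stride):
    """Nine star sampling points around (x, y) given box distances (l', t', r', b'),
    returned as offsets w.r.t. the regular 3x3 grid on the feature map."""
    # Nine points on the image plane: corners, side midpoints and the centre.
    points = [
        (x - l, y - t), (x,     y - t), (x + r, y - t),
        (x - l, y    ), (x,     y    ), (x + r, y    ),
        (x - l, y + b), (x,     y + b), (x + r, y + b),
    ]
    # Regular 3x3 grid cells a standard convolution would sample, in the same order.
    base = [(dx, dy) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    offsets = []
    for (px, py), (bx, by) in zip(points, base):
        # Project the point onto the feature map and take its shift from the grid cell.
        fx, fy = (px - x) / stride, (py - y) / stride
        offsets.append((fy - by, fx - bx))  # (dy, dx) pairs; channel order is implementation-dependent
    return offsets

# Example: a point at (100, 60) with distances l'=40, t'=20, r'=24, b'=36 on an FPN level of stride 8.
print(star_offsets(100, 60, 40, 20, 24, 36, stride=8))
```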
4.4. Bounding Box Refinement

We further improve the object localization accuracy through a bounding box refinement step. Bounding box refinement is a common technique in object detection [17, 35]; however, it is not widely adopted in dense object detectors due to the lack of an efficient and discriminative object descriptor. With our new star representation, we can now adopt it in dense object detectors without losing efficiency.

We model the bounding box refinement as a residual learning problem. For an initially regressed bounding box (l', t', r', b'), we first extract the star-shaped representation to encode it. Then, based on this representation, we learn four distance scaling factors (∆l, ∆t, ∆r, ∆b) to scale the initial distance vector, so that the refined bounding box represented by (l, t, r, b) = (∆l×l', ∆t×t', ∆r×r', ∆b×b') is closer to the ground truth.
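A minimal sketch of this residual refinement step is given below, assuming a simple (x1, y1, x2, y2) corner output and illustrative function names rather than the paper's code.

```python
def refine_box(x, y, dists, scales):
    """Scale the initial distances (l', t', r', b') at point (x, y) by the learned
    factors (dl, dt, dr, db) and return the refined box as (x1, y1, x2, y2)."""
    (l0, t0, r0, b0), (dl, dt, dr, db) = dists, scales
    l, t, r, b = dl * l0, dt * t0, dr * r0, db * b0  # refined distance vector
    return x - l, y - t, x + r, y + b

# Initial box regressed at (100, 60) ...
print(refine_box(100, 60, (40, 20, 24, 36), (1.0, 1.0, 1.0, 1.0)))   # (60, 40, 124, 96)
# ... and the refined box after predicting scaling factors close to 1.
print(refine_box(100, 60, (40, 20, 24, 36), (1.1, 0.9, 1.05, 1.0)))  # (56.0, 42.0, 125.2, 96.0)
```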
Figure 3: The network architecture of our VFNet. The VFNet is built on the FPN (P3-P7). Its head consists of two subnetworks, one for regressing the initial bounding box and refining it, and the other for predicting the IoU-aware classification score based on a star-shaped bounding box feature representation (Star Dconv). H×W denotes the size of the feature map.

4.5. VarifocalNet

Attaching the above three components to the FCOS network architecture and removing the original centerness branch, we get the VarifocalNet. Figure 3 illustrates the network architecture of the VFNet. The backbone and FPN parts of the VFNet are the same as those of FCOS; the difference lies in the head structure. The VFNet head consists of two subnetworks. The localization subnet performs bounding box regression and subsequent refinement. It takes as input the feature map from each level of the FPN and first applies three 3×3 conv layers with ReLU activations, producing a feature map with 256 channels. One branch of the localization subnet convolves this feature map again and then outputs a 4D distance vector (l', t', r', b') per spatial location, which represents the initial bounding box. Given the initial box and the feature map, the other branch applies a star-shaped deformable convolution to the nine feature sampling points and produces the distance scaling factors (∆l, ∆t, ∆r, ∆b), which are multiplied by the initial distance vector to generate the refined bounding box (l, t, r, b).

The other subnet aims to predict the IACS. It has a similar structure to the localization subnet (the refinement branch), except that it outputs a vector of C (the class number) elements per spatial location, where each element jointly represents the object presence confidence and the localization accuracy.

4.6. Loss Function and Inference

Loss Function. The training of our VFNet is supervised by the loss function:

\mathrm{Loss} = \frac{1}{N_{pos}} \sum_{i} \sum_{c} \mathrm{VFL}(p_{c,i}, q_{c,i}) + \frac{\lambda_0}{N_{pos}} \sum_{i} q_{c^*,i}\, L_{bbox}(bbox'_i, bbox^*_i) + \frac{\lambda_1}{N_{pos}} \sum_{i} q_{c^*,i}\, L_{bbox}(bbox_i, bbox^*_i) \quad (3)

where p_{c,i} and q_{c,i} denote the predicted and target IACS, respectively, for class c at location i on each FPN feature level. L_{bbox} is the GIoU loss [36], and bbox'_i, bbox_i and bbox^*_i denote the initial, refined and ground-truth bounding boxes, respectively. We weight L_{bbox} with the training target q_{c^*,i}, which is the gt IoU for foreground points and 0 otherwise, following FCOS. λ_0 and λ_1 are the balance weights for L_{bbox} and are empirically set to 1.5 and 2.0, respectively, in this paper. N_{pos} is the number of foreground points and is used to normalize the total loss. As mentioned in Section 3, we employ the ATSS [12] to define foreground and background points during training.
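The sketch below assembles Equation 3 from per-point arrays. It is illustrative only: the vfl and giou_loss arguments stand for the varifocal-loss sketch above and an assumed per-box GIoU-loss helper, and the array layout is an assumption, not the actual training pipeline.

```python
def vfnet_loss(p, q, init_boxes, refined_boxes, gt_boxes, fg_mask,
               vfl, giou_loss, lambda0=1.5, lambda1=2.0):
    """Total training loss of Equation 3 (illustrative sketch).

    p, q:       (N, C) predicted and target IACS over all points and classes
    *_boxes:    (N, 4) initial, refined and ground-truth boxes per point
    fg_mask:    (N,) boolean mask of foreground points
    vfl:        classification loss, e.g. the varifocal_loss sketch above
    giou_loss:  assumed helper returning a per-point GIoU loss, shape (N,)
    """
    n_pos = max(int(fg_mask.sum()), 1)
    cls_term = vfl(p, q) / n_pos          # first term of Eq. 3
    q_star = q.max(axis=1)                # gt IoU per foreground point, 0 for background
    init_term = (q_star * giou_loss(init_boxes, gt_boxes)).sum() / n_pos
    refine_term = (q_star * giou_loss(refined_boxes, gt_boxes)).sum() / n_pos
    return cls_term + lambda0 * init_term + lambda1 * refine_term
```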
Inference. The inference of the VFNet is straightforward. It simply involves forwarding an input image through the network and an NMS post-processing step to remove redundant detections.
5. Experiments

Dataset and Evaluation Metrics. We evaluate the VFNet on the challenging MS COCO 2017 benchmark [16]. Following common practice [3, 8, 9, 12], we train detectors on the train2017 split, report ablation results on the val2017 split, and compare with other detectors on the test-dev split by uploading the results to the evaluation server. We adopt the standard COCO-style Average Precision (AP) as the evaluation metric.

Implementation and Training Details. We implement the VFNet with MMDetection [37]. Unless otherwise specified, we adopt the default hyper-parameters used in MMDetection. The initial learning rate is set to 0.01 and we employ the linear warm-up policy [38] to start the training, with the warm-up ratio set to 0.1. We use 8 V100 GPUs for training with a total batch size of 16 (2 images per GPU) in both the ablation studies and the performance comparison.

For the ablation studies on val2017, ResNet-50 [39] is used as the backbone network and the 1x training schedule (12 epochs) [37] is adopted. Input images are resized to a maximum scale of 1333×800 without changing the aspect ratio. Only random horizontal image flipping is used for data augmentation.

For the performance comparison with the state of the art on test-dev, we train the VFNet with different backbone networks, including ones with deformable convolution layers [13, 14] (denoted as DCN) inserted. When DCN is used in the backbone, we also insert it into the last layers before the star deformable convolution in the VFNet head. The 2x (24 epochs) training scheme and multi-scale training (MSTrain) are adopted, where the maximum image scale for each iteration is randomly selected from a scale range. In fact, we apply two image scale ranges in our experiments. For a fair comparison with the baseline, we use the scale range 1333×[640:800]; out of curiosity, we also experiment with a wider scale range, 1333×[480:960]. Note that even when MSTrain is employed, we keep the maximum image scale at 1333×800 in inference, although a larger scale performs slightly better (about 0.4 AP gain with a 1333×900 scale).

Inference Details. In inference, we forward the input image, resized to a maximum scale of 1333×800, through the network and obtain the estimated bounding boxes with their corresponding IACSs. We first filter out bounding boxes with p_max ≤ 0.05 and select at most 1k top-scoring detections per FPN level. The selected detections are then merged and redundant detections are removed by NMS with a threshold of 0.6 to yield the final results.
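The same post-processing can be summarized as the sketch below, keeping the thresholds stated above (score 0.05, at most 1k detections per FPN level, NMS IoU threshold 0.6); the nms helper and the per-level data layout are assumptions, not the MMDetection code path.

```python
import numpy as np

def postprocess(level_boxes, level_scores, nms, score_thr=0.05,
                max_per_level=1000, iou_thr=0.6):
    """Merge per-FPN-level detections as described above.

    level_boxes:  list of (Ni, 4) arrays, one per FPN level
    level_scores: list of (Ni,) arrays of IACS values (max over classes)
    nms:          assumed helper (boxes, scores, iou_thr) -> kept indices
    """
    boxes, scores = [], []
    for b, s in zip(level_boxes, level_scores):
        keep = s > score_thr                     # drop low-confidence boxes
        b, s = b[keep], s[keep]
        order = np.argsort(-s)[:max_per_level]   # at most 1k top-scoring per level
        boxes.append(b[order]); scores.append(s[order])
    boxes, scores = np.concatenate(boxes), np.concatenate(scores)
    keep = nms(boxes, scores, iou_thr)           # remove redundant detections
    return boxes[keep], scores[keep]
```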
5.1. Ablation Study

5.1.1 Varifocal Loss

We first investigate the effect of the hyper-parameters of the varifocal loss on the detection performance. There are two hyper-parameters: α, for balancing the losses between positive and negative examples, and γ, for down-weighting the losses of the easy negative examples. Table 2 shows the performance of the VFNet when varying α from 0.5 to 1.5 and γ from 1.0 to 3.0 (only the results obtained with the optimal α are shown). Similar results above 41.2 AP are achieved, showing that our varifocal loss is quite robust to different settings of (α, γ). Among them, α = 0.75 and γ = 2.0 work best (41.6 AP), and we adopt these two values for all the following experiments.

γ     α     q weighting   AP    AP50   AP75
1.0   0.50      X         41.2  59.2   44.7
1.5   0.75      X         41.5  59.7   45.1
2.0   0.75      X         41.6  59.5   45.0
2.0   0.75                41.2  59.1   44.4
2.5   1.25      X         41.5  59.4   45.2
3.0   1.00      X         41.3  59.0   44.7

Table 2: Performance of the VFNet when changing the hyper-parameters (α, γ) of the varifocal loss. q weighting means weighting the loss of a positive example with its learning target q.

We also investigate the effect of weighting the loss of a positive example with its training target q, termed q weighting. The fourth row in Table 2 shows the performance of the optimal setting of (α, γ) without q weighting, where a 0.4 AP drop is observed (41.2 AP vs. 41.6 AP). This confirms the positive effect of q weighting.

5.1.2 Individual Component Contribution

We study the impact of the individual components of our method; the results are shown in Table 3. The first row shows the performance of the raw VFNet (FCOS+ATSS without the centerness branch) trained with the focal loss, which achieves 39.0 AP. Replacing the focal loss with our varifocal loss improves the performance to 40.1 AP, which is 0.9 AP higher than the FCOS+ATSS. Adding the star-shaped representation and the bounding box refinement modules further boosts the performance to 40.7 AP and 41.6 AP, respectively. These results verify the effectiveness of the three modules in our VFNet.

VFL   Star Dconv   BBox Refinement   AP    AP50   AP75
                                     39.0  57.7   41.8
 X                                   40.1  58.5   43.4
 X        X                          40.7  59.0   44.0
 X        X             X            41.6  59.5   45.0
FCOS+ATSS                            39.2  57.3   42.4

Table 3: Individual contribution of the components in our method. The first row corresponds to the raw VFNet trained with the focal loss [8].
Method Backbone FPS AP AP50 AP75 APS APM APL
Anchor-based multi-stage:
Faster R-CNN [3] X-101 40.3 62.7 44.0 24.4 43.7 49.8
Libra R-CNN [40] R-101 41.1 62.1 44.7 23.4 43.7 52.5
Mask R-CNN [4] X-101 41.4 63.4 45.2 24.5 44.9 51.8
R-FCN [41] R-101 41.4 63.4 45.2 24.5 44.9 51.8
TridentNet [42] R-101 42.7 63.6 46.5 23.9 46.6 56.6
Cascade R-CNN [17] R-101 42.8 62.1 46.3 23.7 45.5 55.2
SNIP [43] R-101 43.4 65.5 48.4 27.2 46.5 54.9
Anchor-based one-stage:
SSD512 [7] R-101 31.2 50.4 33.3 10.2 34.5 49.8
YOLOv3 [6] DarkNet-53 33.0 57.9 34.4 18.3 35.4 41.9
DSSD513 [44] R-101 33.2 53.3 35.2 13.0 35.4 51.1
RefineDet [35] R-101 36.4 57.5 39.5 16.6 39.9 51.4
RetinaNet [8] R-101 39.1 59.1 42.3 21.8 42.7 50.2
FreeAnchor [18] R-101 43.1 62.2 46.4 24.5 46.1 54.8
GFL [32] R-101-DCN 47.3 66.3 51.4 28.0 51.1 59.2
GFL [32] X-101-32x4d-DCN 48.2 67.4 52.6 29.2 51.7 60.2
EfficientDet-D6 [45] B6 5.3† 51.7 71.2 56.0 34.1 55.2 64.1
EfficientDet-D7 [45] B6 3.8† 52.2 71.4 56.3 34.8 55.5 64.6
Anchor-free key-point:
ExtremeNet [22] Hourglass-104 40.2 55.5 43.2 20.4 43.2 53.1
CornerNet [20] Hourglass-104 40.5 56.5 43.1 19.4 42.7 53.9
Grid R-CNN [46] X-101 43.2 63.0 46.6 25.1 46.5 55.2
CenterNet [21] Hourglass-104 44.9 62.4 48.1 25.6 47.4 57.4
RepPoints [24] R-101-DCN 45.0 66.1 49.0 26.6 48.6 57.5
Anchor-free one-stage:
FoveaBox [15] X-101 42.1 61.9 45.2 24.9 46.8 55.6
FSAF [27] X-101-64x4d 42.9 63.8 46.3 26.6 46.2 52.7
FCOS [9] R-101 43.0 61.7 46.3 26.0 46.8 55.0
SAPD [28] R-101 43.5 63.6 46.5 24.9 46.8 54.6
SAPD [28] R-101-DCN 46.0 65.9 49.6 26.3 49.2 59.6
Baseline:
ATSS [12] R-101 17.5 43.6 62.1 47.4 26.1 47.0 53.6
ATSS [12] X-101-64x4d 8.9 45.6 64.6 49.7 28.5 48.9 55.6
ATSS [12] R-101-DCN 13.7 46.3 64.7 50.4 27.7 49.8 58.4
ATSS [12] X-101-64x4d-DCN 6.9 47.7 66.5 51.9 29.7 50.8 59.4
Ours:
VFNet R-50 19.3 44.3/44.8 62.5/63.1 48.1/48.7 26.7/27.2 47.3/48.1 54.3/54.8
VFNet R-101 15.6 46.0/46.7 64.2/64.9 50.0/50.8 27.5/28.4 49.4/50.2 56.9/57.6
VFNet X-101-32x4d 13.1 46.7/47.6 65.2/66.1 50.8/51.8 28.3/29.4 50.1/50.9 57.3/58.4
VFNet X-101-64x4d 9.2 47.4/48.5 65.8/67.0 51.5/52.6 29.5/30.1 50.7/51.7 58.1/59.7
VFNet R2-101 [47] 13.0 48.4/49.3 66.9/67.6 52.6/53.5 30.3/30.5 52.0/53.1 59.2/60.5
VFNet R-50-DCN 16.3 47.3/48.0 65.6/66.4 51.4/52.3 28.4/29.0 50.3/51.2 59.4/60.4
VFNet R-101-DCN 12.6 48.4/49.2 66.7/67.5 52.6/53.7 28.9/29.7 51.7/52.6 61.0/62.4
VFNet X-101-32x4d-DCN 10.1 49.2/50.0 67.8/68.5 53.6/54.4 30.0/30.4 52.6/53.2 62.1/62.9
VFNet X-101-64x4d-DCN 6.7 49.9/50.8 68.5/69.3 54.3/55.3 30.7/31.6 53.1/54.2 62.8/64.4
VFNet R2-101-DCN [47] 10.3 50.4/51.3 68.9/69.7 54.7/55.8 31.2/31.9 53.7/54.7 63.3/64.4
VFNet-X-800 R2-101-DCN [47] 8.0 53.7 71.6 58.7 34.4 57.5 67.5
VFNet-X-1200 R2-101-DCN [47] 4.2 55.1 73.0 60.1 37.4 58.2 67.0
Table 4: Performance (single-model single-scale) comparison with state-of-the-art detectors on MS COCO test-dev. VFNet consistently outperforms the strong baseline ATSS by ∼2.0 AP. Our best model, VFNet-X-1200, reaches 55.1 AP, achieving a new state of the art. 'R': ResNet. 'X': ResNeXt. 'R2': Res2Net. 'DCN': deformable convolution network. '/' separates results for the MSTrain image scale ranges 1333×[640:800] / 1333×[480:960]. FPS values marked with † are taken from the corresponding papers.
5.2. Comparison with State-of-the-Art

We compare our VFNet with other detectors on COCO test-dev. We select ATSS [12] as our baseline since it has similar performance to the FCOS+ATSS. Table 4 presents the results. Compared with the strong baseline ATSS, our VFNet achieves gains of ∼2.0 AP with different backbones, e.g. 46.0 AP vs. 43.6 AP with the ResNet-101 backbone. This validates the contributions of our method. Compared to the concurrent work GFL [32] (whose MSTrain scale range is 1333×[480:800]), our VFNet is consistently better by a considerable margin. Meanwhile, our model trained with Res2Net-101-DCN [47] achieves a single-model single-scale AP of 51.3, surpassing almost all recent state-of-the-art detectors.

We also report the inference speed of the VFNet in terms of frames per second (FPS) on an Nvidia V100 GPU. Since it is difficult to obtain the speed of all the listed detectors under exactly the same settings, we only compare the VFNet with the baseline ATSS. Our VFNet is very efficient, e.g. achieving 44.8 AP at 19.3 FPS, and only incurs a small additional computation overhead compared to the baseline.

5.3. VarifocalNet-X

To push the envelope of the VFNet, we also implement some extensions to the original VFNet. This version is called VFNet-X and the extensions include:

PAFPN. We replace the FPN with the PAFPN [48], and apply DCN and group normalization (GN) [49] in it.

More and Wider Conv Layers. We stack 4 convolution layers in the detection head, instead of the 3 layers in the original VFNet, and increase the original 256 feature channels to 384 channels.

RandomCrop and Cutout. We employ random crop and cutout [50] as additional data augmentation methods.

Wider MSTrain Scale Range and Longer Training. We adopt a wider MSTrain scale range, from 750×500 to 2100×1400, and initially train the VFNet-X for 41 epochs.

SWA. We apply the technique of stochastic weight averaging (SWA) [51] in training the VFNet-X, which brings a 1.2 AP gain. Specifically, after the initial 41-epoch training of VFNet-X, we further train it for another 18 epochs using a cyclic learning rate schedule and then simply average those 18 checkpoints as our final model, as sketched below.
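A minimal sketch of that averaging step, assuming the 18 checkpoints are plain PyTorch state dicts saved to disk (the file names are placeholders, not the actual training pipeline):

```python
import torch

def average_checkpoints(paths):
    """Average the model weights of several checkpoints (SWA-style)."""
    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg[k] += v.float()
    return {k: v / len(paths) for k, v in avg.items()}

# e.g. the 18 checkpoints saved during the cyclic-learning-rate epochs (placeholder paths).
swa_weights = average_checkpoints([f"epoch_{i}.pth" for i in range(42, 60)])
```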
The performance of VFNet-X on COCO test-dev is shown in the last rows of Table 4. When the inference scale 1333×800 and soft-NMS [52] are adopted, VFNet-X-800 achieves 53.7 AP, while simply increasing the image scale to 1800×1200 lets VFNet-X-1200 reach a new state-of-the-art 55.1 AP, surpassing prior detectors by a large margin. Qualitative detection examples of applying this model to COCO test-dev can be found in Figure 4.

5.4. Generality and Superiority of Varifocal Loss

To verify the generality of our varifocal loss, we apply it to some existing popular dense object detectors, including RetinaNet [8], FoveaBox [15], RepPoints [24] and ATSS [12], and evaluate the performance on val2017. We simply replace the focal loss (FL) [8] used in these detectors (ResNet-50 backbone) with our varifocal loss for training. For comparison, we also train them with the generalized focal loss (GFL) [32].

Method                  AP    AP50   AP75
RetinaNet [8] + FL      36.5  55.5   38.8
RetinaNet [8] + GFL     37.3  56.4   40.0
RetinaNet [8] + VFL     37.4  56.5   40.2
FoveaBox [15] + FL      36.3  56.3   38.3
FoveaBox [15] + GFL     36.9  56.0   39.7
FoveaBox [15] + VFL     37.2  56.2   39.8
RepPoints [24] + FL     38.3  59.2   41.1
RepPoints [24] + GFL    39.2  59.8   42.5
RepPoints [24] + VFL    39.7  59.8   43.1
ATSS [12] + FL          39.3  57.5   42.5
ATSS [12] + GFL         39.8  57.7   43.2
ATSS [12] + VFL         40.2  58.2   44.0
VFNet + FL              40.0  58.0   43.2
VFNet + GFL             41.1  58.9   42.2
VFNet + VFL             41.6  59.5   45.0

Table 5: Comparison of performance when applying the focal loss (FL) [8], the generalized focal loss (GFL) [32] and our varifocal loss (VFL) to existing popular dense object detectors and to our VFNet.

Table 5 shows the results. Our varifocal loss improves RetinaNet, FoveaBox and ATSS consistently by 0.9 AP; for RepPoints, the gain increases to 1.4 AP. This shows that our varifocal loss can easily bring a considerable performance boost to existing dense object detectors. Compared to the GFL, our varifocal loss performs better in all cases, evidencing its superiority.

Additionally, we train our VFNet with the FL and the GFL for further comparison. The results are shown in the last section of Table 5, where the consistent advantage of our varifocal loss over the FL and the GFL can be observed.

6. Conclusion

In this paper, we propose to learn the IACS for ranking detections. We first show the importance of producing the IACS to rank bounding boxes and then develop a dense object detector, VarifocalNet, to exploit the advantage of the IACS. In particular, we design a varifocal loss for training the detector to predict the IACS, and a star-shaped bounding box feature representation for IACS prediction and bounding box refinement. Experiments on the MS COCO benchmark verify the effectiveness of our methods and show that our VarifocalNet achieves new state-of-the-art performance among various object detectors.
Figure 4: Detection examples of applying our best model on COCO test-dev. The score threshold for visualization is 0.3.
References

[1] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Region-based convolutional networks for accurate object detection and segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015.
[2] Ross Girshick. Fast R-CNN. In ICCV, 2015.
[3] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[4] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, 2017.
[5] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In CVPR, 2016.
[6] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
[7] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. SSD: Single shot multibox detector. In ECCV, 2016.
[8] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, 2017.
[9] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. FCOS: Fully convolutional one-stage object detection. In ICCV, 2019.
[10] Borui Jiang, Ruixuan Luo, Jiayuan Mao, Tete Xiao, and Yuning Jiang. Acquisition of localization confidence for accurate object detection. In ECCV, 2018.
[11] Shengkai Wu, Xiaoping Li, and Xinggang Wang. IoU-aware single-stage object detector for accurate localization. Image and Vision Computing, 2020.
[12] Shifeng Zhang, Cheng Chi, Yongqiang Yao, Zhen Lei, and Stan Z Li. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In CVPR, 2020.
[13] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In ICCV, 2017.
[14] Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. Deformable ConvNets v2: More deformable, better results. In CVPR, 2019.
[15] Tao Kong, Fuchun Sun, Huaping Liu, Yuning Jiang, Lei Li, and Jianbo Shi. FoveaBox: Beyound anchor-based object detection. IEEE Transactions on Image Processing, 2020.
[16] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[17] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into high quality object detection. In CVPR, 2018.
[18] Xiaosong Zhang, Fang Wan, Chang Liu, Rongrong Ji, and Qixiang Ye. FreeAnchor: Learning to match anchors for visual object detection. In NIPS, 2019.
[19] Jiaqi Wang, Kai Chen, Shuo Yang, Chen Change Loy, and Dahua Lin. Region proposal by guided anchoring. In CVPR, 2019.
[20] Hei Law and Jia Deng. CornerNet: Detecting objects as paired keypoints. In ECCV, 2018.
[21] Kaiwen Duan, Song Bai, Lingxi Xie, Honggang Qi, Qingming Huang, and Qi Tian. CenterNet: Keypoint triplets for object detection. In ICCV, 2019.
[22] Xingyi Zhou, Jiacheng Zhuo, and Philipp Krahenbuhl. Bottom-up object detection by grouping extreme and center points. In CVPR, 2019.
[23] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points. arXiv preprint arXiv:1904.07850, 2019.
[24] Ze Yang, Shaohui Liu, Han Hu, Liwei Wang, and Stephen Lin. RepPoints: Point set representation for object detection. In ICCV, 2019.
[25] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
[26] Lichao Huang, Yi Yang, Yafeng Deng, and Yinan Yu. DenseBox: Unifying landmark localization with end to end object detection. arXiv preprint arXiv:1509.04874, 2015.
[27] Chenchen Zhu, Yihui He, and Marios Savvides. Feature selective anchor-free module for single-shot object detection. In CVPR, 2019.
[28] Chenchen Zhu, Fangyi Chen, Zhiqiang Shen, and Marios Savvides. Soft anchor-point object detection. In ECCV, 2020.
[29] Lachlan Tychsen-Smith and Lars Petersson. Improving object localization with fitness NMS and bounded IoU loss. In CVPR, 2018.
[30] Zhiyu Tan, Xuecheng Nie, Qi Qian, Nan Li, and Hao Li. Learning to rank proposals for object detection. In ICCV, 2019.
[31] Jiale Cao, Yanwei Pang, Jungong Han, and Xuelong Li. Hierarchical shot detector. In ICCV, 2019.
[32] Xiang Li, Wenhai Wang, Lijun Wu, Shuo Chen, Xiaolin Hu, Jun Li, Jinhui Tang, and Jian Yang. Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection. arXiv preprint arXiv:2006.04388, 2020.
[33] Yuhang Cao, Kai Chen, Chen Change Loy, and Dahua Lin. Prime sample attention in object detection. In CVPR, 2020.
[34] Shengkai Wu and Xiaoping Li. IoU-balanced loss functions for single-stage object detection. arXiv preprint arXiv:1908.05641, 2019.
[35] Shifeng Zhang, Longyin Wen, Xiao Bian, Zhen Lei, and Stan Z Li. Single-shot refinement neural network for object detection. In CVPR, 2018.
[36] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In CVPR, 2019.
[37] Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, et al. MMDetection: Open MMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155, 2019.
[38] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
[39] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[40] Jiangmiao Pang, Kai Chen, Jianping Shi, Huajun Feng, Wanli Ouyang, and Dahua Lin. Libra R-CNN: Towards balanced learning for object detection. In CVPR, 2019.
[41] Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-FCN: Object detection via region-based fully convolutional networks. In NIPS, 2016.
[42] Yanghao Li, Yuntao Chen, Naiyan Wang, and Zhaoxiang Zhang. Scale-aware trident networks for object detection. In ICCV, 2019.
[43] Bharat Singh and Larry S Davis. An analysis of scale invariance in object detection - SNIP. In CVPR, 2018.
[44] Cheng-Yang Fu, Wei Liu, Ananth Ranga, Ambrish Tyagi, and Alexander C Berg. DSSD: Deconvolutional single shot detector. CoRR, 2017.
[45] Mingxing Tan, Ruoming Pang, and Quoc V Le. EfficientDet: Scalable and efficient object detection. In CVPR, 2020.
[46] Xin Lu, Buyu Li, Yuxin Yue, Quanquan Li, and Junjie Yan. Grid R-CNN. In CVPR, 2019.
[47] Shanghua Gao, Ming-Ming Cheng, Kai Zhao, Xin-Yu Zhang, Ming-Hsuan Yang, and Philip HS Torr. Res2Net: A new multi-scale backbone architecture. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[48] Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. Path aggregation network for instance segmentation. In CVPR, 2018.
[49] Yuxin Wu and Kaiming He. Group normalization. In ECCV, 2018.
[50] Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.
[51] Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. Averaging weights leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407, 2018.
[52] Navaneeth Bodla, Bharat Singh, Rama Chellappa, and Larry S Davis. Soft-NMS: Improving object detection with one line of code. In ICCV, 2017.