Chen Dense Learning Based Semi-Supervised Object Detection CVPR 2022 Paper
Binghui Chen¹, Pengyu Li¹, Xiang Chen¹, Biao Wang¹, Lei Zhang², Xian-Sheng Hua¹
¹Alibaba Group, ²The Hong Kong Polytechnic University
[email protected], [email protected], [email protected], [email protected]
[email protected], [email protected]
Abstract
have been reported, and how to handle the dense pseudo-labels predicted by anchor-free detectors remains a challenging problem.

To address the above-mentioned challenges, in this paper we propose a DenSe Learning (DSL) algorithm for anchor-free SSOD¹. Specifically, to perform careful label guidance for dense learning, we first present an Adaptive Filtering (AF) strategy to partition pseudo-labels into three fine-grained parts: background, foreground, and ignorable regions. Then we refine these pseudo-labels by using a MetaNet so as to remove the classification false-positives, which have high prediction scores but are actually assigned to the wrong category. Considering that the correctness of pseudo-labels determines the performance of SSOD models, we introduce an Aggregated Teacher (AT) to further enhance the stability and quality of the estimated pseudo-labels. Moreover, to improve the model generalization capability, we learn from shuffled image patches and regularize the uncertainty of dense feature maps to make them consistent among image scales. The main contributions of this paper are summarized as follows:

• A simple yet effective DenSe Learning (DSL) method is developed to improve the utilization of large-scale unlabeled data for SSOD. To the best of our knowledge, this is the first anchor-free method for SSOD.

• An Adaptive Filtering (AF) strategy is proposed to assign fine-grained pseudo-labels to each pixel; an Aggregated Teacher (AT) is introduced to enhance the stability and quality of estimated pseudo-labels; and learning from shuffled patches and uncertainty-consistency regularization among scales are employed to improve the model generalization performance.

Extensive experiments conducted on MS-COCO [27] and PASCAL-VOC [8] demonstrate that the proposed DSL method achieves significant performance improvements over existing state-of-the-art SSOD methods.

¹ In this paper, we employ FCOS [44] as our baseline detector.

2. Related Work

Semi-Supervised Learning for Image Classification. Recently, semi-supervised learning (SSL) has achieved significant progress in image classification with the rapid development of deep learning techniques. SSL aims to employ a large amount of unlabeled data to learn robust and discriminative classification boundaries. Specifically, self-ensembling is used in [19] to stabilize the learning targets for unlabeled data. A new measure of local smoothness of the conditional label distribution is proposed in [32] for improving the SSL learning performance. Mean teacher is employed in [42] to produce accurate labels instead of label ensembles. Generally speaking, the above consistency-based methods apply perturbations to the input image and then minimize the differences between their output predictions. These methods have proved to be effective at smoothing the feature manifold, and consequently improving the generalization performance of models. There are also other techniques that utilize the unlabeled data to improve image classification, including self-training [6, 20, 23, 46], data augmentation [2, 37] and so on. Though many SSL methods have been proposed for image classification, it is not trivial to transfer them to the task of object detection due to the complex architectural design and multi-task learning (classification and regression) nature of object detectors.

Object Detection is a fundamental task in computer vision. Current CNN-based object detectors can be categorized into anchor-based and anchor-free methods. Faster R-CNN [36] is a well-known and representative two-stage anchor-based detector. It consists of a region proposal network (RPN) and a region-wise prediction network (R-CNN) for detecting objects. Many works [1, 3, 4, 21, 24, 43] have been proposed to improve the performance of Faster R-CNN. For anchor-free object detection, the state-of-the-art methods [13, 18, 30, 35, 44] mostly regard the center (e.g., the center point or part) of an object as foreground to define positives, and then predict the distances from positives to the four sides of the object bounding box (BBox). For example, FCOS [44] takes all the pixels inside the BBox as positives, and uses these four distances and a centerness score to detect objects. CSP [30] defines only the center point of the object box as positive to detect pedestrians with fixed aspect ratio. FoveaBox [18] regards pixels in the middle part of an object as positives and learns four distances to perform detection. Without the need to set anchors, anchor-free detectors are much easier and more flexible to deploy in real applications.

Semi-Supervised Object Detection (SSOD). SSOD aims to improve the performance of object detectors by using larger-scale unlabeled data. Since the manual annotation of object labels is very expensive, producing pseudo-labels for unlabeled data is very attractive. In [34, 39, 52], the pseudo-labels are produced by ensembling the predictions from different data augmentations. STAC [38] uses both weak and strong augmentations for model training, where strong augmentations are only applied to unlabeled data while weak augmentations are used to produce stable pseudo-labels. UBA [31] employs the EMA teacher [42] for producing more accurate pseudo-labels. ISMT [48] fuses the current pseudo-labels with history labels via NMS, and uses multiple detection heads to improve the accuracy of pseudo-labels. Instant-Teaching [51] combines more powerful augmentations like Mixup and Mosaic into the training stage. Humble-Teacher [41] uses plenty of proposals and soft pseudo-labels for the unlabeled data. Certainty-aware pseudo-labels are tailored in [22] for object detection. E2E [47] uses a soft teacher mechanism for training with the unlabeled data. Almost all the above methods are built upon anchor-based detectors, e.g., Faster R-CNN, which are not convenient to deploy in real applications with limited resources. Therefore, in this work we develop, for the first time to the best of our knowledge, an anchor-free SSOD method.

Figure 2. The pipeline of our proposed DenSe Learning (DSL) based SSOD method. The training data contain both labeled and unlabeled images. During each training iteration, a teacher model is employed to produce pseudo-labels for weakly augmented unlabeled images. In anchor-free detectors like FCOS [44], each spatial location of the dense predictions is assigned one label, and the model performance is sensitive to noisy pseudo-labels. To alleviate this problem, an Adaptive Filtering strategy is proposed to split the pseudo-labels into three types: background, foreground and ignorable regions. Moreover, there exist some false positive cases, which have high scores but are obviously wrong predictions. Thus, a MetaNet is proposed to refine these cases. To improve the model generalization capability, unlabeled images are patch-shuffled and consistency regularizations are applied on these images with different scales. For improving the stability and quality of pseudo-labels, the teacher model is updated by the student models via aggregation, called Aggregated Teacher. After obtaining the fine-grained pixel-wise pseudo-labels, the detector can be optimized by the final loss, which is the sum of L_s, L_u and L_scale.

3. Methods

3.1. Preliminary

For the convenience of expression, we first provide some notations for the SSOD task. Suppose that we have two sets of data, a labeled set X = \{X_i\}_{i=1}^{N_l} and an unlabeled set U = \{U_i\}_{i=1}^{N_u}, where N_l and N_u are the numbers of labeled and unlabeled images, respectively, and N_u ≫ N_l. Each labeled image has annotations of category p^* ∈ [0, C − 1] (C is the number of foreground classes) and annotations of bounding box (BBox) t^*. In an image, each region annotated by a BBox and a class label is called an instance. Without loss of generality, we take the anchor-free FCOS [44] detector as our baseline, which is composed of a ResNet50 [9] backbone, an FPN [26] neck and a dense head. To use both labeled and unlabeled data for training, the overall loss can be defined as follows:

L = L_{s} + \alpha L_{u}    (1)

where L_s and L_u denote the supervised loss and the unsupervised loss, respectively, and α is the hyper-parameter that controls the contribution of unlabeled data.

Both the supervised and unsupervised losses are normalized by the corresponding number of positive pixels in each mini-batch as follows:

L_{s} = \frac{1}{N_{pos}} \sum_{i} \sum_{h,w} \Big( L_{cls}(X_{i,h,w}) + \mathbbm{1}_{\{p^{*}_{h,w} \in [0,C-1]\}} L_{reg}(X_{i,h,w}) + \mathbbm{1}_{\{p^{*}_{h,w} \in [0,C-1]\}} L_{center}(X_{i,h,w}) \Big)    (2)

L_{u} = \frac{1}{N_{pos}} \sum_{i} \sum_{h,w} \Big( L_{cls}(U_{i,h,w}) + \mathbbm{1}_{\{\bar{p}^{*}_{h,w} \in [0,C-1]\}} L_{reg}(U_{i,h,w}) + \mathbbm{1}_{\{\bar{p}^{*}_{h,w} \in [0,C-1]\}} L_{center}(U_{i,h,w}) \Big)    (3)

where N_pos is the number of positive pixels in one mini-batch, X_{i,h,w} is the predicted vector at spatial location (h, w) of the i-th image, and \bar{p}^{*}_{h,w} is the corresponding estimated pseudo-label at location (h, w). L_cls, L_reg and L_center are the default losses used in FCOS [44]. \mathbbm{1}_{\{·\}} is the indicator function, which outputs 1 if the condition {·} is satisfied and 0 otherwise.

In this paper, we propose a DenSe Learning (DSL) algorithm for bridging the gap between SSOD and anchor-free detectors. The pipeline of our DSL method is illustrated in Figure 2. It is mainly composed of an Adaptive Filtering (AF) strategy, a MetaNet, an Aggregated Teacher (AT) and an Uncertainty-Consistency regularization term, which are introduced in detail in the following sections.
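As a rough illustration of Eqs. 1–3, the NumPy sketch below combines per-pixel loss values into L = L_s + αL_u. The function names (`dense_loss`, `overall_loss`), the array layout, the use of label C for background and the guard against N_pos = 0 are assumptions made for this example, not the authors' implementation; the per-pixel values of L_cls, L_reg and L_center are taken as given inputs rather than computed by FCOS heads.

```python
import numpy as np

def dense_loss(l_cls, l_reg, l_center, labels, num_classes):
    """Eq. 2 / Eq. 3: sum per-pixel losses and normalise by N_pos.

    l_cls, l_reg, l_center: per-pixel loss values, all of the same shape.
    labels: integer labels of that shape; values in [0, C-1] are foreground
            and the value C is background (assumed encoding).
    """
    # Indicator 1{p* in [0, C-1]}: True only on foreground pixels.
    fg = (labels >= 0) & (labels <= num_classes - 1)
    # N_pos, with an assumed guard so an all-background batch does not divide by zero.
    n_pos = max(int(fg.sum()), 1)
    total = l_cls.sum() + (fg * l_reg).sum() + (fg * l_center).sum()
    return total / n_pos

def overall_loss(l_s, l_u, alpha):
    """Eq. 1: L = L_s + alpha * L_u."""
    return l_s + alpha * l_u

# Toy usage with C = 2 foreground classes, so label 2 means background.
labels = np.array([[0, 2], [2, 1]])
l_s = dense_loss(np.ones((2, 2)), np.full((2, 2), 0.5),
                 np.full((2, 2), 0.25), labels, num_classes=2)
print(overall_loss(l_s, l_u=1.0, alpha=3.0))
```

Note that in this form the classification term is summed over every pixel, while the regression and centerness terms only count foreground pixels, which is why the quality of dense pseudo-labels matters so much for L_u.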
Figure 3. The distributions of TP+, TP- and BG when using 10% labeled data on COCO. 'TP+' means that the estimated instance has the same class ID as the ground-truth (GT) and the IoU of the BBox is above 0.5. 'TP-' means that the estimated instance has the same class ID as GT but the IoU of the BBox is below 0.5. 'BG' means that the estimated instance belongs to the background or has a wrong class ID.

3.2. Adaptive Filtering Strategy

The FCOS [44] detector reduces the dependency on predefined anchors by introducing dense pixel-wise supervision. Though this is helpful for easy deployment in actual applications, the performance of the model is sensitive to the quality of pixel-wise labels. Because the predicted pseudo-labels in SSOD will have noise no matter how powerful the detector is, the pixel-wise supervision for FCOS should be treated prudently. To this end, we propose an Adaptive Filtering (AF) strategy to carefully handle the pseudo-labels for dense learning.

To exploit the unlabeled data, we need to assign a pseudo-label to each pixel in the output dense tensor. As shown in Figure 3, however, the TP+, TP- and BG instances coexist with each other, and their distributions are complex. If we simply use a single threshold to define foreground and background, many instances will be assigned wrong labels, resulting in heavy noise and damaging the learning of an accurate detector. For example, if we set a relatively high threshold of 0.4 to define the positive instances, many TP+ and TP- instances will be wrongly assigned to the background. Conversely, if we set a relatively low threshold of 0.1 to define the background instances, many BG instances will be wrongly assigned to the foreground. Therefore, we propose to use multiple thresholds {τ_1, τ_2} to partition the estimated instances into three parts: background, ignorable region and foreground:

\bar{p}^{*}_{h,w} = \begin{cases} \text{Foreground: } [0, \cdots, C-1], & p_{h,w} \geq \tau_{2}, \\ \text{Ignorable Region: } [-1], & \tau_{1} < p_{h,w} < \tau_{2}, \\ \text{Background: } [C], & p_{h,w} \leq \tau_{1}, \end{cases}    (4)

where p_{h,w} is the predicted score at location (h, w) (if not specified, it is the product of the classification score and the centerness score), and \bar{p}^{*}_{h,w} is the corresponding pseudo-label. Different from foreground and background regions, we skip the gradient computation and propagation for ignorable regions:

L_{u} = \frac{1}{N_{pos}} \sum_{i} \sum_{h,w} \Big( \mathbbm{1}_{\{\bar{p}^{*}_{h,w} \geq 0\}} L_{cls}(U_{i,h,w}) + \mathbbm{1}_{\{\bar{p}^{*}_{h,w} \in [0,C-1]\}} L_{reg}(U_{i,h,w}) + \mathbbm{1}_{\{\bar{p}^{*}_{h,w} \in [0,C-1]\}} L_{center}(U_{i,h,w}) \Big).    (5)

τ_1 in Eq. 4 is used to filter out the background and thus it is relatively easy to set. We set τ_1 = 0.1 throughout our experiments. τ_2 is employed to filter out the foreground and it is harder to set for different classes. We propose to use a class-adaptive τ_2^k instead of a fixed τ_2:

\tau_{2}^{k} = \Big( \frac{\sum_{h,w} \mathbbm{1}_{\{\bar{p}^{*}_{h,w} == k\}} \, p_{h,w}}{N_{pos}} \Big)^{\beta} \tau,    (6)

where τ_2^k is the threshold for the k-th class, β = 0.7 is used to control the degree of focus on tail classes, and τ = 0.35 is used as a fixed reference threshold.

Remarks: Different from anchor-based detectors, anchor-free detectors predict each pixel as either background or foreground, and compute gradients for all of them. However, for unlabeled data, instances with scores within the interval [τ_1, τ_2^k] are noisy and confusing, and treating them as either foreground or background will degrade the detection performance. Therefore, in anchor-free SSOD we should explicitly set multiple fine-grained thresholds to identify not only the background and foreground but also the ignorable regions. The proposed AF strategy handles this problem and assigns fine-grained, multi-level labels to the dense pixels, as illustrated in Figure 2. We experimentally demonstrate that the AF strategy is very important for anchor-free SSOD.

Figure 4. (a) The estimated classification-false-positive instances, which have high scores but are obvious false predictions in category. (b) Our proposed MetaNet for refining the pseudo-labels of instances. '√' and '×' mean retention and deletion, respectively.

3.3. MetaNet

Though AF has the ability to improve the quality of pseudo-labels for dense learning, there still exist some classification-false-positive instances, which have high
scores but are obvious false predictions, as shown in Figure 4(a). In order to handle these instances, we resort to using a MetaNet, as shown in Figure 4(b). We use a ResNet50 to implement the MetaNet. Before DSL training, we first pass all the labeled instances into the MetaNet and compute the following class-wise proxies m_k:

m_{k} = \frac{\sum_{i} f_{i,k}}{N_{k}},    (7)

where f_{i,k} is the 1-D feature vector of the i-th instance belonging to the k-th class, and N_k is the number of instances of the k-th class. After obtaining the class-wise proxies, we refine the pseudo-labels by computing the cosine distance between the feature vector of the unlabeled instance and the corresponding class proxy vector. If the distance is smaller than a threshold d = 0.6, we will change the label 'Foreground' of this instance to the label 'Ignorable Region'.

Remarks: MetaNet is employed to rectify the predicted foreground class labels of those error-prone instances. It only performs the meta update step and thus can work in a plug-and-play manner. The computation of MetaNet only involves the class proxy update on the labeled instances without gradient back-propagation, and thus it is fast and the cost is negligible compared with the training of DSL. With the help of stable class proxies, we can successfully remove many classification-false-positive instances.

3.4. Aggregated Teacher

In pseudo-label based methods, the stability and quality of the predicted pseudo-labels are important to the final performance. Therefore, almost all the existing anchor-based methods [22, 31, 41, 47, 48] employ an EMA Teacher to improve the quality of pseudo-labels for the unlabeled data. As illustrated in Figure 5(a), EMA is usually performed in the following manner:

\theta'^{t} = \epsilon \theta'^{t-1} + (1 - \epsilon) \theta^{t},    (8)

where ϵ is a smoothing hyperparameter, t denotes the iteration, and θ and θ′ are the parameters of the student and teacher models, respectively.

Figure 5. The illustration of (a) EMA Teacher and (b) our Aggregated Teacher. EMA Teacher performs aggregation only over parameters, while our Aggregated Teacher performs aggregation over both parameters and layers.

EMA update aims to obtain a more stable and powerful teacher model via the ensemble of students. However, such an update in Eq. 8 might still be coarse and weak because it only aggregates parameters in the same layer at different iterations, without considering the correlation across layers. To further enhance the capability of the teacher model, motivated by the dense aggregation mechanism [12, 49, 50], we introduce an Aggregated Teacher (AT), which performs not only parameter aggregation across time but also recurrent layer aggregation across layers, as illustrated in Figure 5(b). Specifically, for parameter aggregation, we still adopt the existing EMA update as in Eq. 8. For layer aggregation, to avoid a heavy parameter overhead, we follow the recurrent learning [11, 25, 50] and use a recurrent layer aggregation mechanism as below:

x_{l+1} = \theta_{l+1}[x_{l} + h_{l}] + x_{l},    (9)
h_{l+1} = g_{2}[g_{1}[\theta_{l+1}[x_{l} + h_{l}]] + h_{l}],    (10)

where x_l is the l-th layer's tensor in the CNN and θ_l denotes the corresponding convolution parameters. h_l is the hidden state tensor for the l-th layer, and h_1 is initialized with zero. g_1 and g_2 are the corresponding 1 × 1 and 3 × 3 Conv layers used for recurrent computing, which are parameter-shared across the adjacent layers within the same stage. ∗[·] indicates the convolution operation between input tensor '·' and parameter '∗'. By using the recurrent mechanism, the number of introduced parameters is negligible. One can see from Eq. 9 that it will degrade to the default residual unit of ResNet when the hidden state h_l is removed. In other words, the recurrent layer aggregation can be easily applied to the current residual CNN models. Moreover, since the neck and heads in the detector are very shallow, we only perform layer aggregation over the backbone.

Remarks: Since the parameter aggregation in EMA Teacher treats each layer independently, the relationship between layers might be destroyed during aggregation, and thus one aggregated layer may not work well with the adjacent ones. Therefore, layer aggregation is considered in our model. By explicitly using the hidden state to connect the current layer with the previous layers, the knowledge propagation will be more stable and accurate. Moreover, the shared recurrent layers impose regularization over the propagated information. Compared with EMA Teacher, the Aggregated Teacher is able to produce more stable and accurate pseudo-labels for dense learning.

3.5. Uncertainty Consistency

By using the proposed AF, MetaNet and AT, the dense pixel-wise pseudo-labels can be obtained to supervise the learning of SSOD models by optimizing the loss L_u. In order to further improve the generalization capability of the SSOD model, we propose to regularize the uncertainty consistency over the unlabeled images. From Figure 6, one
can see that the input consists of a pair of images: the Strong & Patch Augmented image (U_sp) and the corresponding Down-sampled image (U_d). The downsampling ratio is set to r = 2 in producing U_d. By patch shuffle augmentation, we randomly crop an image into several parts along the horizontal or vertical directions and then shuffle these parts (the detailed algorithm can be found in Algorithm 1). Both images are fed into our detector, producing dense score maps at different scale levels. (In FCOS, there are 5 levels, i.e., v ∈ [1, · · · , 5].)

Algorithm 1: Patch Shuffle
Input: Unlabeled image U;
Output: Patch shuffled image U_p;
Initialization: U^0 = U, total iteration number J;
for j = 0, · · · , J − 1 do
    (1) Mode m: randomly select a mode from ['horizontal', 'vertical'];
    (2) Normalized size s: randomly generate s from interval [0, 1];
    (3) Crop U^j into two parts based on mode m and normalized size s;
    (4) Shuffle the order of the two parts, and concatenate them into a new image Û^j;
    (5) U^{j+1} = Û^j;
end

Figure 6. Illustration of the uncertainty consistency regularization among scales. The input images come from the same unlabeled image U_i.

To improve the generalization performance of SSOD, we adopt the following regularization loss:

L_{scale} = \sum_{v=1}^{4} \| p^{v}[U_{d}] - p^{v+1}[U_{sp}] \|_{2}^{2},    (11)

where p^v[U_*] indicates the score map p^v derived from image U_*. Since the downsampling ratio is r = 2, p^v[U_d] has the same resolution as p^{v+1}[U_sp], and they are constrained to be consistent.

Remarks: The output dense score maps reveal the uncertainty or the reliability of the predicted label for each pixel. The lower the score is, the higher the uncertainty that the pixel belongs to a foreground object. Data uncertainty has been widely used to indicate the data importance in previous works [6, 10, 15, 16, 45]. In this paper, we regularize the uncertainty consistency. Patch shuffle is used to reduce the dependency of foreground objects on their surrounding contexts, improving the model robustness to context variations. In addition, to ensure consistent outputs among scales, L_scale is defined to improve the model robustness to object scale variations.

So far, all the components of our DSL have been described, and the overall pipeline is shown in Figure 2.

4. Experiments

Datasets & Evaluation Metrics: We conduct experiments on the popular object detection benchmarks, including MS-COCO [27] and PASCAL-VOC [8]. MS-COCO contains more than 118k labeled images, and there are about 850k instances from 80 classes. In addition, there are 123k unlabeled images provided for semi-supervised learning. VOC07 contains 5,011 training images from 20 classes, while VOC12 has 11,540 training images.

On MS-COCO, we follow the settings in STAC [38] and evaluate with both the protocols of Partially Labeled Data and Fully Labeled Data. The former randomly samples 1%, 2%, 5% and 10% of the training data as labeled data, and treats the remainder as unlabeled data. (For this protocol, we create 3 data folds and report the mean results over them.) The latter uses all the training data as labeled data and the additional unlabeled data as unlabeled samples. We adopt the mean average precision AP50:90 (denoted by mAP) as the evaluation metric.

For experiments on PASCAL-VOC07, following STAC [38], we use the VOC07 training set as the labeled data, and either the VOC12 training set alone or the VOC12 training set together with the images from the same 20 classes in MS-COCO (denoted by COCO20) as the unlabeled data. We adopt the VOC default AP50 metric and the COCO default mAP metric as the evaluation metrics.

Implementation Details: We adopt the popular anchor-free detector FCOS [44] with ResNet50 [9] as backbone, and FPN [26] as neck and dense heads. Images in MS-COCO are resized to have shorter edge 800, or 640 if the longer edge is less than 1,333. Images in PASCAL-VOC are resized to have shorter edge 600, or 480 if the longer edge is less than 1,000. For fair comparison, following [31, 38], in all experiments, random flip is used as weak augmentation, while strong augmentation includes random flip, color jittering and cutout. The iteration number J is set to 2 in Patch Shuffle. For training configurations, the learning rate starts from
0.01 and is divided by 10 at 16 and 22 epochs. The max epoch is 24. α is set to 3 and 1 for the partially and fully labeled protocols, respectively, and 2.5 for VOC. ϵ is set to 0.99. For parameter τ_2^k, we set it within the range [0.25, 0.35]. All of our experiments are based on PyTorch [33] and MMDetection [7]. We use 8 NVIDIA V100 GPUs with 32G memory per GPU. For each GPU, we randomly sample 2 images from the labeled set and the unlabeled set with ratio 1:1.

Table 1. The mAP performance (%) of competing methods on the MS-COCO [27] dataset. The used protocol is Partially Labeled Data. † means that the method uses a larger batch size 32 or 40, and ‡ indicates that strong augmentation is applied on the labeled data. Note that †, ‡ are not the default settings in STAC [38] but they will improve the performance of both supervised baseline and SSOD. 'Supervised' means that only the corresponding labeled data are used for training, and this is set as the baseline for SSOD.

Type          Methods            Deployment  1%             2%             5%             10%
Anchor-based  Supervised [38]    Hard        9.05 ± 0.16    12.70 ± 0.15   18.47 ± 0.22   23.86 ± 0.81
Anchor-based  CSD [14]           Hard        11.12 ± 0.15   14.15 ± 0.13   18.79 ± 0.13   24.50 ± 0.15
Anchor-based  STAC [38]          Hard        13.97 ± 0.35   18.25 ± 0.25   24.38 ± 0.12   28.64 ± 0.21
Anchor-based  IT [51]            Hard        16.00 ± 0.20   20.70 ± 0.30   25.50 ± 0.05   29.45 ± 0.15
Anchor-based  ISMT [48]          Hard        18.88 ± 0.74   22.43 ± 0.56   26.37 ± 0.24   30.53 ± 0.52
Anchor-based  Humble [41]        Hard        16.96 ± 0.38   21.72 ± 0.24   27.70 ± 0.15   31.60 ± 0.28
Anchor-based  UB† [31]           Hard        20.75 ± 0.12   24.30 ± 0.97   28.27 ± 0.11   31.50 ± 0.10
Anchor-based  E2E†‡ [47]         Hard        20.46 ± 0.39   -              30.74 ± 0.08   34.04 ± 0.14
Anchor-free   Supervised (Ours)  Easy        9.53 ± 0.23    11.71 ± 0.26   18.74 ± 0.18   23.70 ± 0.22
Anchor-free   DSL (Ours)         Easy        22.03 ± 0.28   25.19 ± 0.37   30.87 ± 0.24   36.22 ± 0.18

Table 2. The mAP performance (%) of competing methods on the MS-COCO [27] dataset. The used protocol is Fully Labeled Data. Each entry reads: supervised baseline → SSOD result (gain).

Type          Methods      Deployment  100%
Anchor-based  STAC [38]    Hard        37.6 → 39.2 (+1.6)
Anchor-based  ISMT [48]    Hard        37.8 → 39.6 (+1.8)
Anchor-based  UB† [31]     Hard        40.2 → 41.3 (+1.1)
Anchor-based  E2E†‡ [47]   Hard        40.9 → 44.5 (+3.6)
Anchor-free   DSL (Ours)   Easy        40.2 → 43.8 (+3.6)
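The Patch Shuffle procedure of Algorithm 1, with the iteration number J = 2 used in the experiments, might be sketched in NumPy as follows. This is a minimal sketch under stated assumptions rather than the authors' code: the function name `patch_shuffle`, the mapping of 'horizontal' to a split along the width axis and 'vertical' to a split along the height axis, and the rounding of the normalized size s to an integer cut position are all choices made for this example.

```python
import numpy as np

def patch_shuffle(image, num_iters=2, rng=None):
    """Sketch of Algorithm 1 for an H x W (x C) array; num_iters plays the role of J."""
    rng = np.random.default_rng() if rng is None else rng
    out = image
    for _ in range(num_iters):
        # (1) Mode m. Assumed mapping: 'horizontal' splits along the width (axis 1),
        #     'vertical' splits along the height (axis 0).
        mode = rng.choice(["horizontal", "vertical"])
        axis = 1 if mode == "horizontal" else 0
        # (2) Normalised size s drawn uniformly from [0, 1).
        s = rng.uniform(0.0, 1.0)
        # (3) Crop the current image into two parts at the rounded cut position.
        cut = int(round(s * out.shape[axis]))
        first, second = np.split(out, [cut], axis=axis)
        # (4) Swap the order of the two parts and concatenate them.
        out = np.concatenate([second, first], axis=axis)
        # (5) The result becomes the input of the next iteration.
    return out

# Toy usage on a 4 x 6 "image" whose pixel values are all distinct.
img = np.arange(24).reshape(4, 6)
shuffled = patch_shuffle(img, num_iters=2, rng=np.random.default_rng(0))
print(shuffled.shape)
```

Because each iteration only swaps two crops, the output keeps the input's size and pixel content; only the spatial arrangement, and hence the context around each object, changes.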
Table 3. The results (%) of competing methods on the PASCAL-VOC [8] dataset. The performances are evaluated on the VOC07 test set.

                                             Unlabeled: VOC12     Unlabeled: VOC12 + COCO20
Type          Methods            Deployment  AP50     AP50:90     AP50     AP50:90
Anchor-based  Supervised [38]    Hard        72.75    42.04       72.75    42.04
Anchor-based  CSD [14]           Hard        74.7     -           75.1     -
Anchor-based  STAC [38]          Hard        77.45    44.64       79.08    46.01
Anchor-based  IT [51]            Hard        78.3     48.7        79       49.7
Anchor-based  ISMT [48]          Hard        77.23    46.23       77.75    49.59
Anchor-based  UB† [31]           Hard        77.37    48.69       78.82    50.34
Anchor-free   Supervised (Ours)  Easy        69.6     45.9        69.6     45.9
Anchor-free   DSL (Ours)         Easy        80.7     56.8        82.1     59.8
Table 4. Effectiveness of each component of the proposed DSL method. '+' means training by the proposed method.

Methods            mAP
Supervised         23.7
+ AF               32.2
+ MetaNet          32.5
+ AT               34.5
+ Patch-Shuffle    34.9
+ L_scale          36.2

Table 5. Ablation studies on Adaptive Filtering.

Methods    Single threshold              AF (fixed τ_2^k)        AF
           0.05    0.1     0.2     0.3   0.2     0.3     0.4
mAP        27.1    28.8    30.7    27.5  34.3    36.0    35.6    36.2

Table 6. Ablation studies on Aggregated Teacher. 'LA' means layer aggregation.

Methods    No teacher    + EMA    + LA    AT
mAP        33.0          34.1     35.0    36.2

Table 7. Ablation studies on loss weight α for unlabeled data. 'fail' means that the training loss will easily get to 'nan'.

α      1       2       3       4
mAP    33.9    35.4    36.2    fail

single threshold strategy as reference, where instances are regarded as foreground if their scores are above the threshold and background otherwise. One can see that the single threshold strategy cannot achieve satisfactory performance. The best result is only 30.7 mAP when the threshold is set to 0.2, indicating that many instances are wrongly labeled by a single threshold. In contrast, by using our multi-level threshold strategy, i.e., AF, the performance can be significantly improved: even with a fixed τ_2^k = 0.3, the result improves to 36.0 mAP; and when the adaptive τ_2^k is used for each class, it further improves to 36.2 mAP, showing the effectiveness and importance of our AF strategy.

Ablation studies on AT. From Table 6, one can see that layer aggregation (LA) achieves a higher performance gain than EMA because it considers the fine-grained relationships across layers, while EMA simply aggregates layer-wise parameters independently so that the relationships between layers can be harmed. In addition, by employing both EMA and LA, our AT can further improve the performance to 36.2 mAP. This implies that aggregations over parameters and layers are actually complementary.

Ablation studies on loss weight α. From Table 7, one can see that the performance peaks around α = 3. An overly large weight such as α = 4 lets the unlabeled images dominate the training, and hence reduces the stability of the model.

Discussions. In anchor-based SSOD, the negative/ignorable instances have been implicitly handled by the label assigner and sampler, and we only need to consider how to recall the foreground instances via a threshold. In contrast, in anchor-free SSOD the multi-level pseudo-labels should be explicitly considered due to the pixel-wise gradient propagation. This is demonstrated by our AF strategy in Table 5. Moreover, without the help of predefined anchors for scale variances, FPN [26] with a dense head has been widely used in anchor-free detectors to address the scaling issue. Thus L_scale can be generally adopted and regarded as a default trick in anchor-free SSOD, and this is verified to be effective in Table 4. In summary, most of our techniques are proposed by considering the special characteristics of anchor-free detectors, and our work in this paper makes the first step towards anchor-free SSOD.

5. Conclusion

In this paper, we made the first attempt, to the best of our knowledge, to bridge the gap between SSOD and anchor-free detectors, and developed a DSL based SSOD method. The DSL was built upon several novel techniques, such as Adaptive Filtering, Aggregated Teacher and uncertainty regularization. Our experiments showed that the proposed DSL outperformed the state-of-the-art SSOD methods by a large margin on both COCO and VOC datasets. We expect that our work can inspire more in-depth explorations of anchor-free SSOD methods.
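For readers who want a concrete picture of the Adaptive Filtering rule from Section 3.2 (Eqs. 4 and 6), the NumPy sketch below uses the hyper-parameter values stated in the paper (τ_1 = 0.1, τ = 0.35, β = 0.7, and τ_2^k kept within [0.25, 0.35]). It is an illustration under assumptions, not the authors' code: the function names are hypothetical, clipping is only one possible reading of "set within the range [0.25, 0.35]", and because Eq. 6 depends on the pseudo-labels it helps define, the sketch assumes that preliminary labels are supplied by the caller.

```python
import numpy as np

def class_adaptive_thresholds(scores, prelim_labels, num_classes,
                              beta=0.7, tau=0.35, lo=0.25, hi=0.35):
    """Eq. 6: tau_2^k = (sum of scores of pixels labelled k / N_pos)^beta * tau."""
    fg = (prelim_labels >= 0) & (prelim_labels < num_classes)
    n_pos = max(int(fg.sum()), 1)          # assumed guard against N_pos = 0
    tau2 = np.empty(num_classes)
    for k in range(num_classes):
        ratio = scores[prelim_labels == k].sum() / n_pos
        tau2[k] = ratio ** beta * tau
    # Assumed interpretation of the reported range: clip into [lo, hi].
    return np.clip(tau2, lo, hi)

def adaptive_filter(scores, pred_classes, tau2, num_classes, tau1=0.1):
    """Eq. 4: foreground keeps its class id, ignorable is -1, background is C."""
    thr = tau2[pred_classes]                # per-pixel class-specific threshold
    labels = np.full(scores.shape, -1, dtype=np.int64)   # default: ignorable
    labels[scores <= tau1] = num_classes    # background
    fg = scores >= thr
    labels[fg] = pred_classes[fg]           # foreground
    return labels

# Toy usage with C = 2 classes and hand-picked class thresholds.
scores = np.array([0.05, 0.20, 0.30, 0.90])
pred_classes = np.array([0, 1, 0, 1])
print(adaptive_filter(scores, pred_classes, np.array([0.35, 0.25]), num_classes=2))
```

Pixels labelled -1 would then be excluded from the classification term through the indicator in Eq. 5, which is what separates this scheme from the single-threshold baseline in Table 5.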
References

[1] Sean Bell, C Lawrence Zitnick, Kavita Bala, and Ross Girshick. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2874–2883, 2016.
[2] David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin Raffel. Mixmatch: A holistic approach to semi-supervised learning. arXiv preprint arXiv:1905.02249, 2019.
[3] Zhaowei Cai, Quanfu Fan, Rogerio S Feris, and Nuno Vasconcelos. A unified multi-scale deep convolutional neural network for fast object detection. In European conference on computer vision, pages 354–370. Springer, 2016.
[4] Zhaowei Cai and Nuno Vasconcelos. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6154–6162, 2018.
[5] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European conference on computer vision, pages 213–229. Springer, 2020.
[6] Binghui Chen and Weihong Deng. Weakly-supervised deep self-learning for face recognition. In 2016 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2016.
[7] Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, Zheng Zhang, Dazhi Cheng, Chenchen Zhu, Tianheng Cheng, Qijie Zhao, Buyu Li, Xin Lu, Rui Zhu, Yue Wu, Jifeng Dai, Jingdong Wang, Jianping Shi, Wanli Ouyang, Chen Change Loy, and Dahua Lin. MMDetection: Open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155, 2019.
[8] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International journal of computer
tection. Advances in neural information processing systems, 32:10759–10768, 2019.
[15] Alex Kendall and Yarin Gal. What uncertainties do we need in bayesian deep learning for computer vision? arXiv preprint arXiv:1703.04977, 2017.
[16] Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7482–7491, 2018.
[17] Kang Kim and Hee Seok Lee. Probabilistic anchor assignment with iou prediction for object detection. In ECCV, 2020.
[18] Tao Kong, Fuchun Sun, Huaping Liu, Yuning Jiang, Lei Li, and Jianbo Shi. Foveabox: Beyound anchor-based object detection. IEEE Transactions on Image Processing, 29:7389–7398, 2020.
[19] Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242, 2016.
[20] Dong-Hyun Lee et al. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, volume 3, page 896, 2013.
[21] Hyungtae Lee, Sungmin Eum, and Heesung Kwon. Me r-cnn: Multi-expert r-cnn for object detection. IEEE Transactions on Image Processing, 29:1030–1044, 2019.
[22] Hengduo Li, Zuxuan Wu, Abhinav Shrivastava, and Larry S Davis. Rethinking pseudo labels for semi-supervised object detection. arXiv preprint arXiv:2106.00168, 2021.
[23] Xinzhe Li, Qianru Sun, Yaoyao Liu, Qin Zhou, Shibao Zheng, Tat-Seng Chua, and Bernt Schiele. Learning to self-train for semi-supervised few-shot classification. Advances in Neural Information Processing Systems, 32:10276–10286, 2019.
[24] Yanghao Li, Yuntao Chen, Naiyan Wang, and Zhaoxiang
vision, 88(2):303–338, 2010. 2, 6, 8 Zhang. Scale-aware trident networks for object detection.
[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. In Proceedings of the IEEE/CVF International Conference
Deep residual learning for image recognition. In Proceed- on Computer Vision, pages 6054–6063, 2019. 2
ings of the IEEE conference on computer vision and pattern [25] Tsungnan Lin, Bill G Horne, Peter Tino, and C Lee Giles.
recognition, pages 770–778, 2016. 3, 6 Learning long-term dependencies in narx recurrent neu-
[10] Jay Heo, Hae Beom Lee, Saehoon Kim, Juho Lee, ral networks. IEEE Transactions on Neural Networks,
Kwang Joon Kim, Eunho Yang, and Sung Ju Hwang. 7(6):1329–1338, 1996. 5
Uncertainty-aware attention for reliable interpretation and [26] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He,
prediction. arXiv preprint arXiv:1805.09653, 2018. 6 Bharath Hariharan, and Serge Belongie. Feature pyra-
[11] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term mid networks for object detection. In Proceedings of the
memory. Neural computation, 9(8):1735–1780, 1997. 5 IEEE conference on computer vision and pattern recogni-
[12] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kil- tion, pages 2117–2125, 2017. 3
ian Q Weinberger. Densely connected convolutional net- [27] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays,
works. In Proceedings of the IEEE conference on computer Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence
vision and pattern recognition, pages 4700–4708, 2017. 5 Zitnick. Microsoft coco: Common objects in context. In
[13] Lichao Huang, Yi Yang, Yafeng Deng, and Yinan Yu. Dense- European conference on computer vision, pages 740–755.
box: Unifying landmark localization with end to end object Springer, 2014. 1, 2, 6, 7, 8
detection. arXiv preprint arXiv:1509.04874, 2015. 2 [28] Songtao Liu, Zeming Li, and Jian Sun. Self-emd: Self-
[14] Jisoo Jeong, Seungeui Lee, Jeesoo Kim, and Nojun Kwak. supervised object detection without imagenet. arXiv preprint
Consistency-based semi-supervised learning for object de- arXiv:2011.13677, 2020. 1
4823
[29] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian [42] Antti Tarvainen and Harri Valpola. Mean teachers are better
Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C role models: Weight-averaged consistency targets improve
Berg. Ssd: Single shot multibox detector. In European con- semi-supervised deep learning results. Advances in Neural
ference on computer vision, pages 21–37. Springer, 2016. 7 Information Processing Systems, 30, 2017. 2
[30] Wei Liu, Shengcai Liao, Weiqiang Ren, Weidong Hu, and [43] Wanxin Tian, Zixuan Wang, Haifeng Shen, Weihong Deng,
Yinan Yu. High-level semantic feature detection: A new Yiping Meng, Binghui Chen, Xiubao Zhang, Yuan Zhao,
perspective for pedestrian detection. In Proceedings of and Xiehe Huang. Learning better features for face detec-
the IEEE/CVF Conference on Computer Vision and Pattern tion with feature fusion and segmentation supervision. arXiv
Recognition, pages 5187–5196, 2019. 2 preprint arXiv:1811.08557, 2018. 2
[31] Yen-Cheng Liu, Chih-Yao Ma, Zijian He, Chia-Wen Kuo, [44] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. Fcos:
Kan Chen, Peizhao Zhang, Bichen Wu, Zsolt Kira, and Peter Fully convolutional one-stage object detection. In Proceed-
Vajda. Unbiased teacher for semi-supervised object detec- ings of the IEEE/CVF international conference on computer
tion. arXiv preprint arXiv:2102.09480, 2021. 1, 2, 5, 6, 7, vision, pages 9627–9636, 2019. 1, 2, 3, 4, 6
8 [45] Zhenyu Wang, Yali Li, Ye Guo, Lu Fang, and Shengjin
[32] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Wang. Data-uncertainty guided multi-phase learning for
Shin Ishii. Virtual adversarial training: a regularization semi-supervised object detection. In Proceedings of the
method for supervised and semi-supervised learning. IEEE IEEE/CVF Conference on Computer Vision and Pattern
transactions on pattern analysis and machine intelligence, Recognition, pages 4568–4577, 2021. 6
41(8):1979–1993, 2018. 2 [46] Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V
[33] Pytorch. https://pytorch.org/. 7 Le. Self-training with noisy student improves imagenet clas-
[34] Ilija Radosavovic, Piotr Dollár, Ross Girshick, Georgia sification. In Proceedings of the IEEE/CVF Conference on
Gkioxari, and Kaiming He. Data distillation: Towards omni- Computer Vision and Pattern Recognition, pages 10687–
supervised learning. In Proceedings of the IEEE conference 10698, 2020. 2
on computer vision and pattern recognition, pages 4119– [47] Mengde Xu, Zheng Zhang, Han Hu, Jianfeng Wang, Lijuan
4128, 2018. 2 Wang, Fangyun Wei, Xiang Bai, and Zicheng Liu. End-to-
[35] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali end semi-supervised object detection with soft teacher. arXiv
Farhadi. You only look once: Unified, real-time object de- preprint arXiv:2106.09018, 2021. 3, 5, 7
tection. In Proceedings of the IEEE conference on computer [48] Qize Yang, Xihan Wei, Biao Wang, Xian-Sheng Hua, and
vision and pattern recognition, pages 779–788, 2016. 2 Lei Zhang. Interactive self-training with mean teachers
[36] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. for semi-supervised object detection. In Proceedings of
Faster r-cnn: Towards real-time object detection with region the IEEE/CVF Conference on Computer Vision and Pattern
proposal networks. Advances in neural information process- Recognition, pages 5941–5950, 2021. 1, 2, 5, 7, 8
ing systems, 28:91–99, 2015. 1, 2, 7 [49] Fisher Yu, Dequan Wang, Evan Shelhamer, and Trevor
[37] Kihyuk Sohn, David Berthelot, Chun-Liang Li, Zizhao Darrell. Deep layer aggregation. In Proceedings of the
Zhang, Nicholas Carlini, Ekin D Cubuk, Alex Kurakin, Han IEEE conference on computer vision and pattern recogni-
Zhang, and Colin Raffel. Fixmatch: Simplifying semi- tion, pages 2403–2412, 2018. 5
supervised learning with consistency and confidence. arXiv [50] Jingyu Zhao, Yanwen Fang, and Guodong Li. Recurrence
preprint arXiv:2001.07685, 2020. 2 along depth: Deep convolutional neural networks with re-
[38] Kihyuk Sohn, Zizhao Zhang, Chun-Liang Li, Han Zhang, current layer aggregation. Advances in Neural Information
Chen-Yu Lee, and Tomas Pfister. A simple semi-supervised Processing Systems, 34, 2021. 5
learning framework for object detection. arXiv preprint [51] Qiang Zhou, Chaohui Yu, Zhibin Wang, Qi Qian, and Hao
arXiv:2005.04757, 2020. 1, 2, 6, 7, 8 Li. Instant-teaching: An end-to-end semi-supervised object
[39] Xiaolin Song, Binghui Chen, Pengyu Li, Biao Wang, and detection framework. In Proceedings of the IEEE/CVF Con-
Honggang Zhang. Prnet++: Learning towards general- ference on Computer Vision and Pattern Recognition, pages
ized occluded pedestrian detection via progressive refine- 4081–4090, 2021. 1, 2, 7, 8
ment network. Neurocomputing, 2022. 2 [52] Barret Zoph, Golnaz Ghiasi, Tsung-Yi Lin, Yin Cui, Hanx-
[40] Peize Sun, Rufeng Zhang, Yi Jiang, Tao Kong, Chen- iao Liu, Ekin Dogus Cubuk, and Quoc Le. Rethinking pre-
feng Xu, Wei Zhan, Masayoshi Tomizuka, Lei Li, Zehuan training and self-training. Advances in Neural Information
Yuan, Changhu Wang, et al. Sparse r-cnn: End-to-end ob- Processing Systems, 33, 2020. 2
ject detection with learnable proposals. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 14454–14463, 2021. 1
[41] Yihe Tang, Weifeng Chen, Yijun Luo, and Yuting Zhang.
Humble teachers teach better students for semi-supervised
object detection. In Proceedings of the IEEE/CVF Confer-
ence on Computer Vision and Pattern Recognition, pages
3132–3141, 2021. 3, 5, 7
4824