Improving Convolutional Networks with Self-Calibrated Convolutions
Jiang-Jiang Liu1∗ Qibin Hou2∗ Ming-Ming Cheng1 Changhu Wang3 Jiashi Feng2
1 CS, Nankai University    2 NUS    3 ByteDance AI Lab
https://mmcheng.net/scconv/
[Figure 2 diagram: the input X (C × H × W) is split into X1 and X2 (each C/2 × H × W); the first pathway produces Y1 from X1 via the self-calibration operation, the second produces Y2 = F1(X2) = X2 ∗ K1, and Y1 and Y2 are concatenated into the output Y (C × H × W).]
Figure 2. Schematic illustration of the proposed self-calibrated convolutions. As can be seen, in self-calibrated convolutions, the original filters are separated into four portions, each of which is in charge of a different functionality. This makes self-calibrated convolutions quite different from traditional convolutions or grouped convolutions, which are performed in a homogeneous way. More details about the self-calibration operation can be found in Sec. 3.1.
an input X = [x1, x2, . . . , xC] ∈ R^(C×H×W) to an output Y = [y1, y2, . . . , yĈ] ∈ R^(Ĉ×Ĥ×Ŵ). Note that we omit the spatial size of the filters and the bias term for notational convenience. Given the above notations, the output feature map at channel i can be written as

    y_i = k^i ∗ X = Σ_{j=1}^{C} k^i_j ∗ x_j,    (1)

where '∗' denotes convolution and k^i = [k^i_1, k^i_2, . . . , k^i_C]. As can be seen above, each output feature map is computed by summation through all channels, and all of them are produced uniformly by repeating Eqn. 1 multiple times. In this way, the convolutional filters can learn similar patterns. Moreover, the fields-of-view for each spatial location in the convolutional feature transformation are mainly controlled by the predefined kernel size, and networks composed of a stack of such convolutional layers are also short of the large receptive fields needed to capture enough high-level semantics [46, 45]. Both of these shortcomings may lead to feature maps that are less discriminative. To alleviate these issues, we propose the self-calibrated convolution, which is elaborated below.
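As a quick sanity check of Eqn. 1 (our own illustration, not part of the paper), the PyTorch snippet below verifies that one output channel of a standard convolution equals the sum of per-channel convolutions with the corresponding filter slices; the shapes are arbitrary.

import torch
import torch.nn.functional as F
from torch import nn

# Eqn. 1: y_i = sum_j k^i_j * x_j for a standard convolution (bias omitted).
C, H, W = 4, 8, 8
conv = nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False)
x = torch.randn(1, C, H, W)
y = conv(x)

i = 0  # recompute output channel i explicitly from the filter slices k^i_j
y_i = sum(F.conv2d(x[:, j:j + 1],                  # single input channel x_j
                   conv.weight[i:i + 1, j:j + 1],  # filter slice k^i_j
                   padding=1)
          for j in range(C))
print(torch.allclose(y[:, i:i + 1], y_i, atol=1e-6))  # True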
3.1. Self-Calibrated Convolutions

In grouped convolutions, the feature transformation process is performed homogeneously and individually in multiple parallel branches, and the outputs from each branch are concatenated as the final output. Similar to grouped convolutions, the proposed self-calibrated convolutions also split the learnable convolutional filters into multiple portions; differently, however, each portion of filters is not treated equally but is responsible for a special functionality.

3.1.1 Overview

The workflow of the proposed design is illustrated in Figure 2. In our approach, we consider a simple case where the input channel number C is identical to the output channel number Ĉ, i.e., Ĉ = C. Thus, in the following, we use C in place of Ĉ for notational convenience. Given a group of filter sets K with shape (C, C, kh, kw), where kh and kw are respectively the spatial height and width, we first uniformly separate it into four portions, each of which is in charge of a different functionality. Without loss of generality, suppose C can be divided by 2. After separation, we have four portions of filters, denoted by {Ki}_{i=1}^{4}, each with shape (C/2, C/2, kh, kw).

Given the four portions of filters, we then uniformly split the input X into two portions {X1, X2}, each of which is sent into a special pathway for collecting different types of contextual information. In the first pathway, we utilize {K2, K3, K4} to perform the self-calibration operation upon X1, yielding Y1. In the second pathway, we perform a simple convolution operation, Y2 = F1(X2) = X2 ∗ K1, which targets retaining the original spatial context. Both intermediate outputs {Y1, Y2} are then concatenated together as the output Y. In what follows, we describe in detail how to perform the self-calibration operation in the first pathway.

3.1.2 Self-Calibration

To efficiently and effectively gather informative contextual information for each spatial location, we propose to conduct the convolutional feature transformation in two different scale spaces: an original scale space, in which feature maps share the same resolution as the input, and a small latent space after down-sampling. The embeddings after transformation in the small latent space are used as references to guide the feature transformation process in the original feature space because of their large fields-of-view.

Self-Calibration: Given the input X1, we adopt average pooling with filter size r × r and stride r as follows:

    T1 = AvgPool_r(X1).    (2)
The feature transformation on T1 is then conducted using K2:

    X1′ = Up(F2(T1)) = Up(T1 ∗ K2),    (3)

where Up(·) is a bilinear interpolation operator that maps the intermediate references from the small scale space back to the original feature space. The calibration operation can then be formulated as

    Y1′ = F3(X1) · σ(X1 + X1′),    (4)

where F3(X1) = X1 ∗ K3, σ is the sigmoid function, and '·' denotes element-wise multiplication. As shown in Eqn. 4, we use X1′ as residuals to form the weights for calibration, which is found beneficial. The final output after calibration can be written as follows:

    Y1 = F4(Y1′) = Y1′ ∗ K4.    (5)
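To make the two pathways and Eqns. 2-5 concrete, here is a minimal PyTorch sketch of a self-calibrated convolution. It follows the description above; the class and layer names are ours, and normalization, strides, and initialization are omitted, so it should be read as an illustration rather than the official implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfCalibratedConv(nn.Module):
    """Sketch of a self-calibrated convolution (Figure 2, Eqns. 2-5)."""

    def __init__(self, channels, kernel_size=3, r=4):
        super().__init__()
        c, pad = channels // 2, kernel_size // 2
        self.r = r
        # The four filter portions K1..K4, each working on C/2 channels.
        self.k1 = nn.Conv2d(c, c, kernel_size, padding=pad, bias=False)  # F1: second pathway
        self.k2 = nn.Conv2d(c, c, kernel_size, padding=pad, bias=False)  # F2: latent (down-sampled) space
        self.k3 = nn.Conv2d(c, c, kernel_size, padding=pad, bias=False)  # F3: original scale space
        self.k4 = nn.Conv2d(c, c, kernel_size, padding=pad, bias=False)  # F4: final transform

    def forward(self, x):
        x1, x2 = torch.chunk(x, 2, dim=1)                      # split X into X1, X2
        # First pathway: self-calibration on X1.
        t1 = F.avg_pool2d(x1, self.r, stride=self.r)           # Eqn. 2
        x1p = F.interpolate(self.k2(t1), size=x1.shape[2:],
                            mode='bilinear', align_corners=False)  # Eqn. 3: Up(F2(T1))
        y1p = self.k3(x1) * torch.sigmoid(x1 + x1p)            # Eqn. 4: calibration
        y1 = self.k4(y1p)                                      # Eqn. 5
        # Second pathway: plain convolution keeping the original spatial context.
        y2 = self.k1(x2)
        return torch.cat([y1, y2], dim=1)                      # Y = [Y1, Y2]

# Shape check: the module maps C x H x W to C x H x W, like a vanilla convolution.
out = SelfCalibratedConv(64)(torch.randn(2, 64, 56, 56))
assert out.shape == (2, 64, 56, 56)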
Advantages: The advantages of the proposed self-calibration operation are three-fold. First of all, compared to conventional convolutions, by employing the calibration operation shown in Eqn. 4, each spatial location is allowed not only to adaptively consider its surrounding informative context, with the embeddings from the latent space functioning as scalars on the responses from the original scale space, but also to model inter-channel dependencies. Thus, the fields-of-view of convolutional layers with self-calibration can be effectively enlarged. As shown in Figure 3, convolutional layers with self-calibration encode larger but more accurate discriminative regions. Second, instead of collecting global context, the self-calibration operation only considers the context around each spatial location, avoiding contaminating information from irrelevant regions to some extent. As can be seen in the right two columns of Figure 6, convolutions with self-calibration can accurately locate the target objects when visualizing the final score layer. Third, the self-calibration operation encodes multi-scale information for each spatial location.

Relation to Attention-Based Modules: Our work is also quite different from existing methods that rely on add-on attention blocks, such as the SE block [16], the GE block [15], or CBAM [38]. Those methods require additional learnable parameters, while our self-calibrated convolutions internally change the way the convolutional filters of a convolutional layer are exploited and hence require no additional learnable parameters. Moreover, though the GE block [15] encodes spatial information in a lower-dimension space as we do, it does not explicitly preserve the spatial information from the original scale space. In the following experiment section, we will show that, without any extra learnable parameters, our self-calibrated convolutions can yield significant improvements over baselines and other attention-based approaches on image classification. Furthermore, our self-calibrated convolutions are complementary to attention and thus can also benefit from add-on attention modules.

4. Experiments

4.1. Implementation Details
We implement our approach using the publicly available PyTorch framework1. For fair comparison, we adopt the official classification framework to perform all classification experiments unless specially declared. We report results on the ImageNet dataset [30]. The size of the input images is 224 × 224; they are randomly cropped from resized images as done in [40]. We use SGD to optimize all models. The weight decay and momentum are set to 0.0001 and 0.9, respectively. Four Tesla V100 GPUs are used and the mini-batch size is set to 256 (64 per GPU). By default, we train all models for 100 epochs with an initial learning rate of 0.1, which is divided by 10 after every 30 epochs. In testing, we report the accuracy results on a single 224 × 224 center crop from an image with the shorter side resized to 256, as in [40]. Note that models in all ablation comparisons share the same running environment and hyper-parameters except for the network structures themselves. All models in Table 1 are trained under the same strategy and tested under the same settings.

1 https://pytorch.org
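For readers who want to reproduce this schedule, the recipe above maps onto a standard PyTorch training setup as sketched below (our own illustration; the data pipeline and the SCNet model definition are omitted).

import torch
from torchvision.models import resnet50

model = resnet50()  # stand-in for SCNet-50; the actual backbone is defined separately

# SGD with the reported hyper-parameters: lr 0.1, momentum 0.9, weight decay 1e-4,
# mini-batch 256, 100 epochs, learning rate divided by 10 every 30 epochs.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(100):
    # for images, targets in train_loader:   # 256 images per iteration (64 per GPU)
    #     loss = criterion(model(images), targets)
    #     optimizer.zero_grad(); loss.backward(); optimizer.step()
    scheduler.step()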
Network            Params   MAdds   FLOPs   Top-1   Top-5
50-layer
ResNet [12]        25.6M    4.1G    8.2G    76.4    93.0
SCNet              25.6M    4.0G    7.9G    77.8    93.9
ResNeXt [40]       25.0M    4.3G    8.5G    77.4    93.4
ResNeXt 2x40d      25.4M    4.2G    8.3G    76.8    93.3
SCNeXt             25.0M    4.3G    8.5G    78.3    94.0
SE-ResNet [16]     28.1M    4.1G    8.2G    77.2    93.4
SE-SCNet           28.1M    4.0G    7.9G    78.2    93.9
101-layer
ResNet [12]        44.5M    7.8G    15.7G   78.0    93.9
SCNet              44.6M    7.2G    14.4G   78.9    94.3
ResNeXt [40]       44.2M    8.0G    16.0G   78.5    94.2
SCNeXt             44.2M    8.0G    15.9G   79.2    94.4
SE-ResNet [16]     49.3M    7.9G    15.7G   78.4    94.2
SE-SCNet           49.3M    7.2G    14.4G   78.9    94.3
Table 1. Comparisons on the ImageNet-1K dataset when the proposed structure is utilized in different classification frameworks. We report single-crop accuracy rates (%).

[Figure 4 panels: feature-map visualizations for the images Binoculars, Lhasa, Accordion, Daisy, Gondola, and Thatch; columns alternate ResNet (res3) and SCNet (res3).]
Figure 4. Visualizations of feature maps from the side outputs at res3 of different networks (ResNet v.s. SCNet). We use 50-layer settings for both networks.

4.2. Results on ImageNet

We conduct ablation experiments to verify the importance of each component in our proposed architecture and compare with existing attention-based approaches on the ImageNet-1K classification dataset [30].

4.2.1 Ablation Analysis

Generalization Ability: To demonstrate the generalization ability of the proposed structure, we consider three widely used classification architectures as baselines, including ResNet [12], ResNeXt [40], and SE-ResNet [16]. The corresponding networks with self-calibrated convolutions are named SCNet, SCNeXt, and SE-SCNet, respectively. Following the default version of ResNeXt [40] (32 × 4d), we set the bottleneck width to 4 in SCNeXt. We also adjust the cardinality of each group convolution according to our structure to ensure that the capacity of SCNeXt is close to that of ResNeXt. For SE-SCNet, we apply the SE module to SCNet in the same way as [16].

In Table 1, we show the results produced by both the 50- and 101-layer versions of each model. Compared to the original ResNet-50 architecture, SCNet-50 has an improvement of 1.4% in accuracy (77.8% vs. 76.4%). Moreover, the improvement by SCNet-50 (1.4%) is also higher than that by ResNeXt-50 (1.0%) and SE-ResNet-50 (0.8%). This demonstrates that self-calibrated convolutions perform much better than increasing the cardinality or introducing the SE module [16]. When the networks go deeper, a similar phenomenon can also be observed.

Another way to investigate the generalization ability of the proposed structure is to see how it behaves as a backbone on other vision tasks, such as object detection and instance segmentation. We will give more experimental comparisons in the next subsection.

[Figure 5 plot: top-1 error (%) of the auxiliary classifiers versus training epoch, with train and validation curves for ResNet-50 and SCNet-50.]
Figure 5. Auxiliary loss curves for both ResNet-50 and SCNet-50. We add the auxiliary loss after res3. As can be seen, SCNet (red lines) works much better than ResNet (cyan lines). This demonstrates that self-calibrated convolutions work better for networks with lower depth.

Self-Calibrated Convolution v.s. Vanilla Convolution: To further investigate the effectiveness of the proposed self-calibrated convolutions compared to the vanilla convolutions, we add side supervisions (auxiliary losses) as done in [21] to both ResNet-50 and SCNet-50 after one intermediate stage, namely res3. Results from side outputs can reflect how a network performs when the depth varies and how strong the feature representations at different levels are. The top-1 accuracy results from the side supervision at res3 are depicted in Figure 5. It is obvious that the side results from SCNet-50 are much better than those from ResNet-50. This phenomenon indirectly indicates that networks with the proposed self-calibrated convolutions can generate richer and more discriminative feature representations than those with the vanilla convolutions. To further demonstrate this, we show some visualizations from the score layers of the side outputs in Figure 4. Apparently, SCNet can more precisely and integrally locate the target objects even at a lower depth of the network. In Sec. 4.3, we will give more demonstrations of this by applying both convolutions to different vision tasks.
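The side-supervision setup described above can be sketched as follows in PyTorch (our own rough illustration, using torchvision's ResNet-50 as the trunk; 'res3' corresponds to layer2 in torchvision naming, and the auxiliary head design here is an assumption rather than the paper's exact one).

import torch
import torch.nn as nn
from torchvision.models import resnet50

class AuxSupervisedResNet(nn.Module):
    """ResNet-50 with an auxiliary classifier attached after the res3 stage."""

    def __init__(self, num_classes=1000):
        super().__init__()
        net = resnet50()
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.layer1, self.layer2 = net.layer1, net.layer2   # res2, res3
        self.layer3, self.layer4 = net.layer3, net.layer4   # res4, res5
        self.avgpool, self.fc = net.avgpool, net.fc
        # Auxiliary head on the res3 feature map (512 channels for ResNet-50).
        self.aux_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                      nn.Linear(512, num_classes))

    def forward(self, x):
        x = self.layer2(self.layer1(self.stem(x)))   # features up to res3
        aux_logits = self.aux_head(x)                # side output used for the auxiliary loss
        x = self.layer4(self.layer3(x))
        logits = self.fc(torch.flatten(self.avgpool(x), 1))
        return logits, aux_logits

# During training, both outputs receive a cross-entropy loss, e.g.
# loss = ce(logits, y) + aux_weight * ce(aux_logits, y)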
[Figure 6 panels: Grad-CAM attention maps for the images Leonberg, Obelisk, Paddlewheel, Sunglass, and Neck brace; rows correspond to ResNet, ResNeXt, SE-ResNet, and SCNet.]
Figure 6. Visualization of attention maps generated by Grad-CAM [31]. It is obvious that our SCNet can more precisely locate the foreground objects than other networks, no matter how large they are and what shape they have. This heavily relies on our self-calibration operation, which benefits adaptively capturing rich context information. We use 50-layer settings for all networks.

Attention Comparisons: To show why the proposed self-calibrated convolution is helpful for classification networks, we adopt Grad-CAM [31] as an attention extraction tool to visualize the attentions produced by ResNet-50, ResNeXt-50, SE-ResNet-50, and SCNet-50, as shown in Figure 6. It can be clearly seen that the attentions produced by SCNet-50 can more precisely locate the target objects and do not expand to the background areas too much. When the target objects are small, the attentions by our network are also better confined to the semantic regions compared to those produced by the other three networks. This suggests that our self-calibrated convolution is helpful for discovering more integral target objects even when their sizes are small.
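For reference, the kind of Grad-CAM visualization used here can be produced with a few lines of hook-based PyTorch code; the sketch below is our own minimal version (random input, torchvision ResNet-50) rather than the authors' visualization script.

import torch
import torch.nn.functional as F
from torchvision.models import resnet50

model = resnet50().eval()
feats, grads = {}, {}
stage = model.layer4  # last residual stage; hook its activations and gradients
stage.register_forward_hook(lambda m, i, o: feats.update(a=o))
stage.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

img = torch.randn(1, 3, 224, 224)          # a real, normalized image in practice
scores = model(img)
scores[0, scores.argmax()].backward()      # gradient of the top-1 class score

weights = grads['a'].mean(dim=(2, 3), keepdim=True)            # channel-wise weights
cam = F.relu((weights * feats['a']).sum(dim=1, keepdim=True))  # weighted activation map
cam = F.interpolate(cam, size=img.shape[2:], mode='bilinear', align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)       # normalize for display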
Design Choices: As demonstrated in Sec. 3.1, we introduce the down-sampling operation to achieve self-calibration, which has been proven useful for improving CNNs. Here, we investigate how the down-sampling rate in self-calibrated convolutions influences the classification performance. In Table 2, we show the performance with different down-sampling rates used in self-calibrated convolutions. As can be seen, when no down-sampling operation is adopted (r = 1), the result is already much better than that of the original ResNet-50 (77.38% v.s. 76.40%). As the down-sampling rate increases, better performance can be achieved. Specifically, when the down-sampling rate is set to 4, we reach a top-1 accuracy of 77.81%. Note that we do not use larger down-sampling rates, as the resolution of the last residual blocks is already very small (e.g., 7 × 7). Furthermore, we find that taking the feature maps at lower resolution (after F2) as residuals by adding an identity connection, as shown in Figure 2, is also helpful for better performance. Discarding the extra identity connection leads to a performance decrease to 77.48%.

Average Pooling vs. Max Pooling: In addition to the above design choices, we also investigate the influence of different pooling types on the performance. In our experiments, we replace all the average pooling operators in self-calibrated convolutions with max pooling operators and observe the performance difference. With all other configurations unchanged, as shown in Table 2, using the max pooling operator yields a performance decrease of about 0.3% in top-1 accuracy (77.81 vs. 77.53). We argue that this may be due to the fact that, unlike max pooling, average pooling builds connections among all locations within the pooling window, which can better capture local contextual information.
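In terms of the SelfCalibratedConv sketch given in Sec. 3.1, both ablations above are one-line changes (again, this mirrors our sketch, not the official code): the down-sampling rate r of Eqn. 2 is a constructor argument, and swapping F.avg_pool2d for F.max_pool2d reproduces the pooling variant.

import torch

# Down-sampling rate ablation: r = 1 disables down-sampling; larger r enlarges
# the field-of-view of the latent-space branch. Output resolution is unchanged.
for r in (1, 2, 4):
    conv = SelfCalibratedConv(64, r=r)         # class from the sketch in Sec. 3.1
    out = conv(torch.randn(2, 64, 56, 56))
    assert out.shape == (2, 64, 56, 56)

# Pooling ablation: inside forward(), replace
#   t1 = F.avg_pool2d(x1, self.r, stride=self.r)
# with
#   t1 = F.max_pool2d(x1, self.r, stride=self.r)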
Discussion: According to the above ablation experiments, introducing self-calibrated convolutions is helpful for classification networks such as ResNet and ResNeXt. However, note that exploring the optimal architecture setting is beyond the scope of this paper; this paper only provides a preliminary study on how to improve the vanilla convolutions. We encourage readers to further investigate more effective structures. In the next subsection, we will show how our approach behaves as a pretrained backbone when applied to popular vision tasks.

Model       DS Rate (r)   Identity   Pooling   Top-1 Accuracy
ResNet      -             -          -         76.40%
ResNeXt     -             -          -         77.40%
SE-ResNet   -             -          AVG       77.20%
SCNet       1             ✓          -         77.38%
SCNet       2             ✓          AVG       77.48%
SCNet       4             ✗          AVG       77.48%
SCNet       4             ✓          MAX       77.53%
SCNet       4             ✓          AVG       77.81%
SCNeXt      4             ✓          AVG       78.30%
Table 2. Ablation experiments on the design choices of SCNet. 'Identity' refers to the corresponding component with the same name as in Figure 2. 'DS Rate' is the down-sampling rate in Eqn. 2. We also show results under two types of pooling operations: average pooling (AVG) and max pooling (MAX).

Network              Params   MAdds   Top-1   Top-5
ResNet [12]          25.6M    4.1G    76.4    93.0
ResNeXt [40]         25.0M    4.3G    77.4    93.4
SE-ResNet [16]       28.1M    4.1G    77.2    93.4
ResNet + CBAM [38]   28.1M    4.1G    77.3    93.6
GCNet [3]            28.1M    4.1G    77.7    93.7
ResNet + GALA [25]   29.4M    4.1G    77.3    93.6
ResNet + AA [1]      28.1M    4.1G    77.7    93.6
ResNet + GE [15]†    31.2M    4.1G    78.0    93.6
SCNet                25.6M    4.0G    77.8    93.9
SCNet†               25.6M    4.0G    78.2    94.0
SE-SCNet             28.1M    4.0G    78.2    93.9
GE-SCNet             31.1M    4.0G    78.3    94.0
Table 3. Comparisons with prior attention-based approaches on the ImageNet-1K dataset. All approaches are based on the ResNet-50 baseline. We report single-crop accuracy rates (%) and show complexity comparisons as well. '†' means models trained with 300 epochs.
4.2.2 Comparisons with Attention-Based Approaches

Here, we benchmark the proposed SCNet against existing attention-based approaches, including CBAM [38], SENet [16], GALA [25], AA [1], and GE [15], on the ResNet-50 architecture. The comparison results can be found in Table 3. It can be easily seen that most attention or non-local based approaches require additional learnable parameters to build their corresponding modules, which are then plugged into building blocks. Quite differently, our approach does not rely on any extra learnable parameters but only heterogeneously exploits the convolutional filters. Our results are obviously better than those of all the other approaches. It should also be mentioned that the proposed self-calibrated convolutions are compatible with the above-mentioned attention-based approaches. For example, when adding GE blocks to each building block of SCNet as done in [15], we gain a further 0.5% boost in accuracy. This also indicates that our approach is different from this kind of add-on module.
4.3. Applications

In this subsection, we investigate the generalization capability of the proposed approach by applying it to popular vision tasks as backbones, including object detection, instance segmentation, and human keypoint detection.

4.3.1 Object Detection

Network Settings: In the object detection task, we take the widely used Faster R-CNN architecture [29] with feature pyramid networks (FPNs) [23] as baselines. We adopt the widely used mmdetection framework2 [4] to run all our experiments. As done in previous work [23, 11], we train each model using the union of the 80k COCO train images and 35k images from the validation set (trainval35k) [24] and report results on the remaining 5k validation images (minival).

2 https://github.com/open-mmlab/mmdetection

We set hyper-parameters strictly following the Faster R-CNN work [29] and its FPN version [23]. Images are all resized so that their shorter edges have 800 pixels. We use 8 Tesla V100 GPUs to train each model and the mini-batch size is set to 16, i.e., 2 images per GPU. The initial learning rate is set to 0.02 and we use the 2× training schedule to train each model. Weight decay and momentum are set to 0.0001 and 0.9, respectively. We report the results using the standard COCO metrics, including AP (mean Average Precision averaged over different IoU thresholds), AP0.5, AP0.75, and APS, APM, APL (AP at different scales). Both 50-layer and 101-layer backbones are adopted.

Detection Results: In the top part of Table 4, we show experimental results on object detection when different classification backbones are used. Taking Faster R-CNN [29] as an example, adopting ResNet-50-FPN as the backbone gives an AP score of 37.6, while replacing ResNet-50 with SCNet-50 yields a large improvement of 3.2 (40.8 v.s. 37.6). More interestingly, Faster R-CNN with the SCNet-50 backbone performs even better than that with ResNeXt-50 (40.8 v.s. 38.2). This indicates that the proposed way of leveraging convolutional filters is much more efficient than directly grouping the filters. This may be because the proposed self-calibrated convolutions contain the adaptive response calibration operation, which helps more precisely locate the exact positions of target objects, as shown in Figure 6. In addition, from Table 4, we can observe that using deeper backbones leads to a similar phenomenon as above (ResNet-101-FPN: 39.9 → SCNet-101-FPN: 42.0).
Backbone            AP     AP0.5   AP0.75   APS    APM    APL
Object Detection (Faster R-CNN)
ResNet-50-FPN       37.6   59.4    40.4     21.9   41.2   48.4
SCNet-50-FPN        40.8   62.7    44.5     24.4   44.8   53.1
ResNeXt-50-FPN      38.2   60.1    41.4     22.2   41.7   49.2
SCNeXt-50-FPN       40.4   62.8    43.7     23.4   43.5   52.8
ResNet-101-FPN      39.9   61.2    43.5     23.5   43.9   51.7
SCNet-101-FPN       42.0   63.7    45.5     24.4   46.3   54.6
ResNeXt-101-FPN     40.5   62.1    44.2     23.2   44.4   52.9
SCNeXt-101-FPN      42.0   64.1    45.7     25.5   46.1   54.2
Instance Segmentation (Mask R-CNN)
ResNet-50-FPN       35.0   56.5    37.4     18.3   38.2   48.3
SCNet-50-FPN        37.2   59.9    39.5     17.8   40.3   54.2
ResNeXt-50-FPN      35.5   57.6    37.6     18.6   38.7   48.7
SCNeXt-50-FPN       37.5   60.3    40.0     18.2   40.5   55.0
ResNet-101-FPN      36.7   58.6    39.3     19.3   40.3   50.9
SCNet-101-FPN       38.4   61.0    41.0     18.2   41.6   56.6
ResNeXt-101-FPN     37.3   59.5    39.8     19.9   40.6   51.2
SCNeXt-101-FPN      38.2   61.2    40.8     18.8   41.4   56.1
Table 4. Comparisons with state-of-the-art approaches on the COCO minival dataset. All results are based on single-model test and the same hyper-parameters. For object detection, AP refers to box IoU, while for instance segmentation AP refers to mask IoU.

Backbone     Scale       AP     AP.5   AP.75   APm    APl
ResNet-50    256 × 192   70.6   88.9   78.2    67.2   77.4
SCNet-50     256 × 192   72.1   89.4   79.8    69.0   78.7
ResNet-50    384 × 288   71.9   89.2   78.6    67.7   79.6
SCNet-50     384 × 288   74.4   89.7   81.4    70.7   81.7
ResNet-101   256 × 192   71.6   88.9   79.3    68.5   78.2
SCNet-101    256 × 192   72.6   89.4   80.4    69.4   79.4
ResNet-101   384 × 288   73.9   89.6   80.5    70.3   81.1
SCNet-101    384 × 288   74.8   89.6   81.8    71.2   81.9
Table 5. Experiments on keypoint detection [24]. We report results on the COCO val2017 set using the OKS-based mAP and take the state-of-the-art method [39] as our baseline. Two different input sizes (256 × 192 and 384 × 288) are considered as in [39].
4.3.2 Instance Segmentation

For instance segmentation, we use the same hyper-parameters and datasets as in Mask R-CNN [11] for a fair comparison. All experiments in this part are based on the mmdetection framework [4].

We compare the SCNet version of Mask R-CNN to the ResNet version at the bottom of Table 4. Because we have introduced the object detection results in detail, here we only report the results using mask APs. As can be seen, the ResNet-50-FPN and ResNeXt-50-FPN versions of Mask R-CNN have 35.0 and 35.5 mask AP, respectively. When SCNet is taken into account, the corresponding results are improved by 2.2 and 2.0 mask AP, respectively. Similar results can also be observed when adopting deeper backbones. This suggests that our self-calibrated convolutions are also helpful for instance segmentation.

4.3.3 Keypoint Detection

At last, we apply SCNet to human keypoint detection and report results on the COCO keypoint detection dataset [24]. We take the state-of-the-art method [39] as our baseline. We only replace the backbone ResNet in [39] with SCNet and keep all other training and test settings3 unchanged. We evaluate the results on the COCO val2017 set using the standard OKS-based mAP, where OKS (object keypoint similarity) defines the similarity between different human poses. A Faster R-CNN object detector [29] with a detection AP of 56.4 for the 'person' category on the COCO val2017 set is adopted for detection in the test phase, as in [39].

Table 5 shows the comparisons. As can be seen, simply replacing ResNet-50 with SCNet-50 improves the AP score by 1.5% for the 256 × 192 input size and 2.5% for the 384 × 288 input size. These results demonstrate that introducing the proposed self-calibration operation in convolutional layers benefits human keypoint detection. When using deeper networks as backbones, we also obtain more than 1% performance gain in AP, as shown in Table 5.

3 https://github.com/Microsoft/human-pose-estimation.pytorch
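For readers unfamiliar with the metric, the OKS used above is the standard COCO keypoint similarity; we restate its usual definition here for reference (this formula is quoted from the COCO evaluation protocol, not from this paper):

    OKS = Σ_i exp(−d_i² / (2 s² k_i²)) δ(v_i > 0) / Σ_i δ(v_i > 0),

where d_i is the Euclidean distance between the i-th predicted and ground-truth keypoints, v_i is the visibility flag of the ground-truth keypoint, s is the object scale, and k_i is a per-keypoint constant controlling the falloff.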
5. Conclusions and Future Work

This paper presents a new self-calibrated convolution, which is able to heterogeneously exploit the convolutional filters nested in a convolutional layer. To promote the filters to be of diverse patterns, we introduce the adaptive response calibration operation. The proposed self-calibrated convolutions can be easily embedded into modern classification networks. Experiments on a large-scale image classification dataset demonstrate that building multi-scale feature representations inside building blocks greatly improves the prediction accuracy. To investigate the generalization ability of our approach, we apply it to multiple popular vision tasks and find substantial improvements over the baseline models. We hope the idea of heterogeneously exploiting convolutional filters can provide the vision community with a different perspective on network architecture design.

Acknowledgement. This research was partly supported by the Major Project for New Generation of AI under Grant No. 2018AAA01004, NSFC (61620106008), the national youth talent support program, and the Tianjin Natural Science Foundation (18ZXZNGX00110). Part of this work was done when Jiang-Jiang Liu interned at ByteDance AI Lab.

References

[1] Irwan Bello, Barret Zoph, Ashish Vaswani, Jonathon Shlens, and Quoc V Le. Attention augmented convolutional networks. arXiv preprint arXiv:1904.09925, 2019. 2, 7
[2] Ali Borji, Ming-Ming Cheng, Qibin Hou, Huaizu Jiang, and Jia Li. Salient object detection: A survey. Computational Visual Media, 5(2):117–150, 2019. 1
[3] Yue Cao, Jiarui Xu, Stephen Lin, Fangyun Wei, and Han Hu. Gcnet: Non-local networks meet squeeze-excitation networks and beyond. arXiv preprint arXiv:1904.11492, 2019. 1, 2, 7
[4] Kai Chen, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jianping Shi, Wanli Ouyang, Chen Change Loy, and Dahua Lin. mmdetection. https://github.com/open-mmlab/mmdetection, 2018. 7, 8
[5] Liang-Chieh Chen, Maxwell Collins, Yukun Zhu, George Papandreou, Barret Zoph, Florian Schroff, Hartwig Adam, and Jon Shlens. Searching for efficient multi-scale architectures for dense image prediction. In NeurIPS, pages 8699–8710, 2018. 2
[6] Yunpeng Chen, Haoqi Fan, Bing Xu, Zhicheng Yan, Yannis Kalantidis, Marcus Rohrbach, Shuicheng Yan, and Jiashi Feng. Drop an octave: Reducing spatial redundancy in convolutional neural networks with octave convolution. In ICCV, pages 3435–3444, 2019. 2
[7] Yunpeng Chen, Jianan Li, Huaxin Xiao, Xiaojie Jin, Shuicheng Yan, and Jiashi Feng. Dual path networks. In NeurIPS, pages 4467–4475, 2017. 1, 2
[8] Deng-Ping Fan, Ming-Ming Cheng, Jiang-Jiang Liu, Shang-Hua Gao, Qibin Hou, and Ali Borji. Salient objects in clutter: Bringing salient object detection to the foreground. In ECCV, pages 186–202, 2018. 1
[9] Shang-Hua Gao, Ming-Ming Cheng, Kai Zhao, Xin-Yu Zhang, Ming-Hsuan Yang, and Philip Torr. Res2net: A new multi-scale backbone architecture. IEEE TPAMI, pages 1–1, 2020. 4
[10] Shiming Ge, Xin Jin, Qiting Ye, Zhao Luo, and Qiang Li. Image editing by object-aware optimal boundary searching and mixed-domain composition. Computational Visual Media, 4(1):71–82, 2018. 1
[11] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In ICCV, pages 2980–2988. IEEE, 2017. 1, 7, 8
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016. 1, 2, 4, 5, 7
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In ECCV, pages 630–645. Springer, 2016. 2
[14] Qibin Hou, Ming-Ming Cheng, Xiaowei Hu, Ali Borji, Zhuowen Tu, and Philip Torr. Deeply supervised salient object detection with short connections. IEEE TPAMI, 41(4):815–828, 2019. 2
[15] Jie Hu, Li Shen, Samuel Albanie, Gang Sun, and Andrea Vedaldi. Gather-excite: Exploiting feature context in convolutional neural networks. In NeurIPS, pages 9401–9411, 2018. 1, 2, 4, 7
[16] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, pages 7132–7141, 2018. 1, 2, 4, 5, 7
[17] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In CVPR, pages 4700–4708, 2017. 2
[18] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015. 2
[19] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In NeurIPS, pages 1097–1105, 2012. 2
[20] Thuc Trinh Le, Andrés Almansa, Yann Gousseau, and Simon Masnou. Object removal from complex videos using a few annotations. Computational Visual Media, 5(3):267–291, 2019. 1
[21] Chen-Yu Lee, Saining Xie, Patrick Gallagher, Zhengyou Zhang, and Zhuowen Tu. Deeply-supervised nets. In Artificial Intelligence and Statistics, pages 562–570, 2015. 5
[22] Yi Li, Zhanghui Kuang, Yimin Chen, and Wayne Zhang. Data-driven neuron allocation for scale aggregation networks. In CVPR, pages 11526–11534, 2019. 2
[23] Tsung-Yi Lin, Piotr Dollár, Ross B Girshick, Kaiming He, Bharath Hariharan, and Serge J Belongie. Feature pyramid networks for object detection. In CVPR, 2017. 1, 7
[24] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014. 7, 8
[25] Drew Linsley, Dan Shiebler, Sven Eberhardt, and Thomas Serre. Learning what and where to attend. In ICLR, 2019. 2, 7
[26] Hanxiao Liu, Karen Simonyan, and Yiming Yang. Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018. 1
[27] Jiang-Jiang Liu, Qibin Hou, Ming-Ming Cheng, Jiashi Feng, and Jianmin Jiang. A simple pooling-based design for real-time salient object detection. In CVPR, pages 3917–3926, 2019. 1
[28] Chao Peng, Xiangyu Zhang, Gang Yu, Guiming Luo, and Jian Sun. Large kernel matters–improve semantic segmentation by global convolutional network. In CVPR, pages 4353–4361, 2017. 2
[29] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE TPAMI, 39(6):1137–1149, 2016. 1, 7, 8
[30] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015. 1, 5
[31] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In ICCV, pages 618–626, 2017. 1, 6
[32] Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Math-
ieu, Rob Fergus, and Yann LeCun. Overfeat: Integrated
recognition, localization and detection using convolutional
networks. In ICLR, 2014. 2
[33] Karen Simonyan and Andrew Zisserman. Very deep convo-
lutional networks for large-scale image recognition. In ICLR,
2015. 2
[34] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and
Alexander A Alemi. Inception-v4, inception-resnet and the
impact of residual connections on learning. In AAAI, vol-
ume 4, page 12, 2017. 1, 2
[35] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet,
Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent
Vanhoucke, and Andrew Rabinovich. Going deeper with
convolutions. In CVPR, pages 1–9, 2015. 2
[36] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon
Shlens, and Zbigniew Wojna. Rethinking the inception ar-
chitecture for computer vision. In CVPR, pages 2818–2826,
2016. 2
[37] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaim-
ing He. Non-local neural networks. In CVPR, pages 7794–
7803, 2018. 1, 2
[38] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In
So Kweon. Cbam: Convolutional block attention module.
In ECCV, pages 3–19, 2018. 1, 2, 4, 7
[39] Bin Xiao, Haiping Wu, and Yichen Wei. Simple baselines
for human pose estimation and tracking. In ECCV, 2018. 1,
8
[40] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and
Kaiming He. Aggregated residual transformations for deep
neural networks. In CVPR, pages 5987–5995, 2017. 1, 2, 4,
5, 7
[41] Fisher Yu and Vladlen Koltun. Multi-scale context aggrega-
tion by dilated convolutions. In ICLR, 2016. 2
[42] Fisher Yu, Dequan Wang, Evan Shelhamer, and Trevor Dar-
rell. Deep layer aggregation. In CVPR, pages 2403–2412,
2018. 2
[43] Sergey Zagoruyko and Nikos Komodakis. Wide residual net-
works. arXiv preprint arXiv:1605.07146, 2016. 1, 2
[44] Xingcheng Zhang, Zhizhong Li, Chen Change Loy, and
Dahua Lin. Polynet: A pursuit of structural diversity in very
deep networks. In CVPR, pages 718–726, 2017. 2
[45] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang
Wang, and Jiaya Jia. Pyramid scene parsing network. In
CVPR, 2017. 1, 2, 3
[46] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva,
and Antonio Torralba. Learning deep features for discrimi-
native localization. In CVPR, pages 2921–2929, 2016. 3
[47] Barret Zoph and Quoc V Le. Neural architecture search with
reinforcement learning. arXiv preprint arXiv:1611.01578,
2016. 1
[48] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V
Le. Learning transferable architectures for scalable image
recognition. In CVPR, pages 8697–8710, 2018. 1, 2