
Improving Convolutional Networks with Self-Calibrated Convolutions

Jiang-Jiang Liu 1*    Qibin Hou 2*    Ming-Ming Cheng 1    Changhu Wang 3    Jiashi Feng 2
1 CS, Nankai University    2 NUS    3 ByteDance AI Lab
https://mmcheng.net/scconv/

Abstract

Recent advances on CNNs are mostly devoted to designing more complex architectures to enhance their representation learning capacity. In this paper, we consider improving the basic convolutional feature transformation process of CNNs without tuning the model architectures. To this end, we present a novel self-calibrated convolution that explicitly expands the fields-of-view of each convolutional layer through internal communications and hence enriches the output features. In particular, unlike standard convolutions that fuse spatial and channel-wise information using small kernels (e.g., 3 × 3), our self-calibrated convolution adaptively builds long-range spatial and inter-channel dependencies around each spatial location through a novel self-calibration operation. Thus, it can help CNNs generate more discriminative representations by explicitly incorporating richer information. Our self-calibrated convolution design is simple and generic, and can be easily applied to augment standard convolutional layers without introducing extra parameters and complexity. Extensive experiments demonstrate that when applying our self-calibrated convolution to different backbones, the baseline models can be significantly improved in a variety of vision tasks, including image recognition, object detection, instance segmentation, and keypoint detection, with no need to change the network architectures. We hope this work can provide future research with a promising way of designing novel convolutional feature transformations for improving convolutional networks. Code is available on the project page.

Figure 1. Visualizations of feature activation maps learned by different networks through Grad-CAM [31] (columns: ResNet-50, ResNeXt-50, SE-ResNet-50, Ours). All the networks are trained on ImageNet [30]. Our results are obtained from ResNet-50 with the proposed self-calibrated convolution. From the activation maps, one can observe that residual networks [12, 40] with conventional (grouped) convolutions and even SE blocks [16] fail to capture the whole discriminative regions, due to the limited receptive fields of their convolutional layers. In contrast, self-calibrated convolutions help our model capture the whole discriminative regions.

1. Introduction

Deep neural networks trained on large-scale image classification datasets (e.g., ImageNet [30]) are usually adopted as backbones to extract strong representative features for down-stream tasks, such as object detection [23, 29, 2, 8], segmentation [45, 11], and human keypoint detection [11, 39]. A good classification network often has strong feature transformation capability and therefore provides powerful representations that benefit down-stream tasks [20, 10, 27]. Hence, it is highly desirable to enhance the feature transformation capability of convolutional networks.

In the literature, an effective way to generate rich representations is to use powerful hand-designed network architectures, such as residual networks (ResNets) [12] and their diverse variants [40, 43, 34, 7], or to design networks with AutoML techniques [47, 26]. Recently, some methods attempt to do so by incorporating either attention mechanisms [38, 48, 16, 15] or non-local blocks [37, 3] into mature networks to model the interdependencies among spatial locations, channels, or both. The common idea behind these methods is to adjust the network architecture so that it produces richer feature representations, which requires considerable human labor.

* Authors contributed equally.
In this paper, rather than designing complex network architectures to strengthen feature representations, we introduce the self-calibrated convolution as an efficient way to help convolutional networks learn discriminative representations by augmenting the basic convolutional transformation per layer. Similar to grouped convolutions, it separates the convolutional filters of a specific layer into multiple portions, but unevenly: the filters within each portion are leveraged in a heterogeneous way. Specifically, instead of performing all the convolutions over the input homogeneously in the original space, self-calibrated convolutions first transform the inputs into low-dimensional embeddings through down-sampling. The low-dimensional embeddings transformed by one filter portion are adopted to calibrate the convolutional transformations of the filters within another portion. Benefiting from such heterogeneous convolutions and between-filter communication, the receptive field for each spatial location can be effectively enlarged.

As an augmented version of the standard convolution, our self-calibrated convolution offers two advantages. First, it enables each spatial location to adaptively encode informative context from a long-range region, breaking the tradition of convolutions operating within small regions (e.g., 3 × 3). This makes the feature representations produced by our self-calibrated convolution more discriminative. In Figure 1, we visualize the feature activation maps produced by ResNets with different types of convolutions [12, 40]. As can be seen, ResNet with self-calibrated convolutions can more accurately and integrally locate the target objects. Second, the proposed self-calibrated convolution is generic and can be easily applied to standard convolutional layers without introducing any parameter or complexity overhead or changing the hyper-parameters.

To demonstrate the effectiveness of the proposed self-calibrated convolution, we first apply it to the large-scale image classification problem. We take the residual network [12] and its variants [40, 16] as baselines, which gain large improvements in top-1 accuracy with comparable model parameters and computational cost. In addition to image classification, we also conduct extensive experiments to demonstrate the generalization capability of the proposed self-calibrated convolution in several vision applications, including object detection, instance segmentation, and keypoint detection. Experiments show that the baseline results can be greatly improved by using the proposed self-calibrated convolutions for all three tasks.

2. Related Work

In this section, we briefly review recent representative work on architecture design and long-range dependency building for convolutional networks.

Architecture Design: In recent years, remarkable progress has been made in the field of novel architecture design [33, 35, 32, 44]. As an early work, VGGNet [33] builds deeper networks using convolutional filters with a smaller kernel size (3 × 3) compared to AlexNet [19], yielding better performance while using fewer parameters. ResNets [12, 13] improve the sequential structure by introducing residual connections and using batch normalization [18], making it possible to build very deep networks. ResNeXt [40] and Wide ResNet [43] extend ResNet by grouping 3 × 3 convolutional layers or increasing their widths. GoogLeNet [35] and Inceptions [36, 34] utilize carefully designed Inception modules with multiple parallel paths of specialized filters (3 × 3, etc.) for feature transformations. NASNet [48] learns to construct model architectures by exploring a predefined search space, enabling transferability. DenseNet [17] and DLA [42] aggregate features through complicated bottom-up skip connections. Dual Path Networks (DPNs) [7] exploit both residual and dense connections to build strong feature representations. SENet [16] introduces a squeeze-and-excitation operation to explicitly model the interdependencies between channels.

Long-Range Dependency Modeling: Building long-range dependencies is helpful in most computer vision tasks. One successful example is SENet [16], which adopts Squeeze-and-Excitation blocks to build interdependencies among the channel dimensions. Later work, like GENet [15], CBAM [38], GCNet [3], GALA [25], AA [1], and NLNet [37], further extends this idea by introducing spatial attention mechanisms or designing advanced attention blocks. Another way to model long-range dependencies is to exploit spatial pooling or convolutional operators with large kernel windows. Typical examples like PSPNet [45] adopt multiple spatial pooling operators of different sizes to capture multi-scale context. There are also many works [28, 14, 41, 5, 22] that leverage large convolutional kernels or dilated convolutions for long-range context aggregation. Our work is also different from the Octave convolution [6], which aims at reducing spatial redundancy and computation cost.

Different from all the above-mentioned approaches, which focus on tuning network architectures or adding hand-designed blocks to improve convolutional networks, our approach considers exploiting the convolutional filters in convolutional layers more efficiently and designing powerful feature transformations to generate more expressive feature representations.
3. Method

Figure 2. Schematic illustration of the proposed self-calibrated convolutions. The original filters are separated into four portions, each of which is in charge of a different functionality; the input X (C × H × W) is split into X_1 and X_2 (each C/2 × H × W), processed by the two pathways, and concatenated into the output Y. (Legend of the diagram: r× down-/up-sampling, element-wise summation, element-wise multiplication, sigmoid, identity, split, and concatenation.) This makes self-calibrated convolutions quite different from traditional convolutions or grouped convolutions, which are performed in a homogeneous way. More details about the self-calibration operation can be found in Sec. 3.1.

A conventional 2D convolutional layer F is associated with a group of filter sets K = [k_1, k_2, ..., k_Ĉ], where k_i denotes the i-th set of filters with size C, and transforms an input X = [x_1, x_2, ..., x_C] ∈ R^{C×H×W} into an output Y = [y_1, y_2, ..., y_Ĉ] ∈ R^{Ĉ×Ĥ×Ŵ}. Note that we omit the spatial size of the filters and the bias term for notational convenience. Given the above notations, the output feature map at channel i can be written as

    y_i = k_i ∗ X = \sum_{j=1}^{C} k_i^j ∗ x_j,    (1)

where '∗' denotes convolution and k_i = [k_i^1, k_i^2, ..., k_i^C].

As can be seen above, each output feature map is computed by summation over all input channels, and all of them are produced uniformly by repeating Eqn. (1) multiple times. In this way, the convolutional filters tend to learn similar patterns. Moreover, the field-of-view of each spatial location in the convolutional feature transformation is mainly controlled by the predefined kernel size, and networks composed of a stack of such convolutional layers also lack the large receptive fields needed to capture enough high-level semantics [46, 45]. Both shortcomings may lead to feature maps that are less discriminative. To alleviate these issues, we propose the self-calibrated convolution, which is elaborated below.

3.1. Self-Calibrated Convolutions

In grouped convolutions, the feature transformation process is performed homogeneously and individually in multiple parallel branches, and the outputs from each branch are concatenated as the final output. Similar to grouped convolutions, the proposed self-calibrated convolutions also split the learnable convolutional filters into multiple portions; differently, each portion of filters is not treated equally but is responsible for a special functionality.

3.1.1 Overview

The workflow of the proposed design is illustrated in Figure 2. In our approach, we consider a simple case where the input channel number C is identical to the output channel number Ĉ, i.e., Ĉ = C. Thus, in the following, we use C to replace Ĉ for notational convenience. Given a group of filter sets K with shape (C, C, k_h, k_w), where k_h and k_w are respectively the spatial height and width, we first uniformly separate it into four portions, each of which is in charge of a different functionality. Without loss of generality, suppose C is divisible by 2. After separation, we have four portions of filters, denoted by {K_i}_{i=1}^{4}, each of which has shape (C/2, C/2, k_h, k_w).

Given the four portions of filters, we then uniformly split the input X into two portions {X_1, X_2}, each of which is sent into a separate pathway for collecting different types of contextual information. In the first pathway, we utilize {K_2, K_3, K_4} to perform the self-calibration operation upon X_1, yielding Y_1. In the second pathway, we perform a simple convolution operation, Y_2 = F_1(X_2) = X_2 ∗ K_1, which aims at retaining the original spatial context. Both intermediate outputs {Y_1, Y_2} are then concatenated together as the output Y. In what follows, we describe in detail how to perform the self-calibration operation in the first pathway.

3.1.2 Self-Calibration

To efficiently and effectively gather informative contextual information for each spatial location, we propose to conduct the convolutional feature transformation in two different scale spaces: an original scale space, in which feature maps share the same resolution as the input, and a small latent space obtained after down-sampling. The embeddings after transformation in the small latent space are used as references to guide the feature transformation process in the original feature space because of their large fields-of-view.

Self-Calibration: Given the input X_1, we adopt average pooling with filter size r × r and stride r:

    T_1 = AvgPool_r(X_1).    (2)

The feature transformation on T_1 is performed based on K_2:

    X_1' = Up(F_2(T_1)) = Up(T_1 ∗ K_2),    (3)

where Up(·) is a bilinear interpolation operator that maps the intermediate references from the small scale space back to the original feature space. Now, the calibration operation can be formulated as

    Y_1' = F_3(X_1) · σ(X_1 + X_1'),    (4)

where F_3(X_1) = X_1 ∗ K_3, σ is the sigmoid function, and '·' denotes element-wise multiplication. As shown in Eqn. (4), we use X_1' as residuals to form the weights for calibration, which we found beneficial. The final output after calibration can be written as

    Y_1 = F_4(Y_1') = Y_1' ∗ K_4.    (5)
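Eqns. (2)-(5) translate directly into a few standard layers. The following PyTorch sketch is an unofficial reading of the self-calibration pathway described above; the code released on the project page may differ in details such as normalization, bias terms, and strides, and the class name SCConv and the argument pooling_r are our own.

import torch
import torch.nn as nn
import torch.nn.functional as F


class SCConv(nn.Module):
    """Self-calibration pathway over one half of the channels (Eqns. 2-5).

    Unofficial sketch: `channels` is C/2, i.e., the width of the X_1 pathway.
    The BatchNorm layers are an assumption, not taken from the paper text.
    """

    def __init__(self, channels, pooling_r=4):
        super().__init__()
        # K_2: transformation in the down-sampled (latent) space, Eqns. (2)-(3).
        self.k2 = nn.Sequential(
            nn.AvgPool2d(kernel_size=pooling_r, stride=pooling_r),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        # K_3: transformation in the original scale space, Eqn. (4).
        self.k3 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        # K_4: final transformation after calibration, Eqn. (5).
        self.k4 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x1):
        identity = x1
        # Eqn. (3): transform in the latent space, then upsample back (Up).
        ref = F.interpolate(self.k2(x1), size=identity.shape[2:],
                            mode='bilinear', align_corners=False)
        # Eqn. (4): calibration weights from the residual, applied to F_3(X_1).
        weight = torch.sigmoid(identity + ref)
        out = self.k3(x1) * weight
        # Eqn. (5): final convolution with K_4.
        return self.k4(out)


x1 = torch.randn(2, 32, 56, 56)
y1 = SCConv(32, pooling_r=4)(x1)   # same shape as x1

A complete self-calibrated layer would additionally split the C input channels into X_1 and X_2, apply this module to X_1, run a plain 3 × 3 convolution (K_1) over X_2, and concatenate the two outputs, as sketched in Figure 2.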
Advantages: The advantages of the proposed self-calibration operation are three-fold. First, compared to conventional convolutions, by employing the calibration operation in Eqn. (4), each spatial location is allowed not only to adaptively consider its surrounding informative context, with the embeddings from the latent space acting as scalars on the responses from the original scale space, but also to model inter-channel dependencies. Thus, the field-of-view of a convolutional layer with self-calibration can be effectively enlarged. As shown in Figure 3, convolutional layers with self-calibration encode larger but more accurate discriminative regions. Second, instead of collecting global context, the self-calibration operation only considers the context around each spatial location, avoiding, to some extent, contamination from irrelevant regions. As can be seen in the right two columns of Figure 6, convolutions with self-calibration can accurately locate the target objects when visualizing the final score layer. Third, the self-calibration operation encodes multi-scale information, which is highly desired by object detection related tasks. We will give more experimental analysis in Sec. 4.

Figure 3. Visual comparisons of the intermediate feature maps produced by different settings of ResNet-50 (example classes: tiger shark, stingray, redshank, sax, cowboy hat). The feature maps are selected from the 3 × 3 convolutional layer in the last building block. The top row uses traditional convolutions, while the bottom row uses the proposed self-calibrated convolutions (SC-Conv). It is obvious that ResNet-50 with self-calibrated convolutions captures richer context information.

3.2. Instantiations

To demonstrate the performance of the proposed self-calibrated convolutions, we take several variants of the residual networks [12, 40, 16] as exemplars. Both 50- and 101-layer bottleneck structures are considered. For simplicity, we only replace the convolutional operation in the 3 × 3 convolutional layer of each building block with our self-calibrated convolutions and keep all relevant hyper-parameters unchanged (a sketch of such a building block is given at the end of this subsection). By default, the down-sampling rate r in self-calibrated convolutions is set to 4.

Relation to Grouped Convolutions: Grouped convolutions adopt the split-transform-merge strategy, in which individual convolutional transformations are conducted homogeneously in multiple parallel branches [40] or in a hierarchical way [9]. Unlike grouped convolutions, our self-calibrated convolutions exploit different portions of the convolutional filters in a heterogeneous way. Thus, each spatial location during the transformation can fuse information from two different spatial scale spaces through the self-calibration operation, which largely increases the field-of-view when applied to convolutional layers and hence results in more discriminative feature representations.

Relation to Attention-Based Modules: Our work is also quite different from existing methods that rely on add-on attention blocks, such as the SE block [16], the GE block [15], or CBAM [38]. Those methods require additional learnable parameters, while our self-calibrated convolutions internally change the way the convolutional filters of a layer are exploited and hence require no additional learnable parameters. Moreover, though the GE block [15] encodes spatial information in a lower-dimensional space as we do, it does not explicitly preserve the spatial information from the original scale space. In the following experiment section, we will show that, without any extra learnable parameters, our self-calibrated convolutions yield significant improvements over baselines and other attention-based approaches on image classification. Furthermore, our self-calibrated convolutions are complementary to attention and can thus also benefit from add-on attention modules.
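As mentioned in Sec. 3.2, instantiating SCNet only requires swapping the 3 × 3 convolution of each residual bottleneck for the self-calibrated version applied to half of the channels. Below is a hedged sketch of such a building block, reusing the SCConv module and imports from the listing in Sec. 3.1.2; the exact channel split, normalization, and shortcut handling are our assumptions rather than the official implementation.

class SCBottleneck(nn.Module):
    """ResNet-style bottleneck whose 3x3 stage is replaced by SC-Conv.

    Unofficial sketch: the reduced features are split into two halves; one
    half goes through a plain 3x3 convolution (K_1), the other through the
    self-calibration pathway (K_2, K_3, K_4).
    """

    def __init__(self, in_channels, mid_channels, out_channels, pooling_r=4):
        super().__init__()
        half = mid_channels // 2
        self.conv1_a = nn.Sequential(   # 1x1 reduction for the X_1 pathway
            nn.Conv2d(in_channels, half, 1, bias=False),
            nn.BatchNorm2d(half), nn.ReLU(inplace=True))
        self.conv1_b = nn.Sequential(   # 1x1 reduction for the X_2 pathway
            nn.Conv2d(in_channels, half, 1, bias=False),
            nn.BatchNorm2d(half), nn.ReLU(inplace=True))
        self.scconv = SCConv(half, pooling_r=pooling_r)   # K_2, K_3, K_4
        self.k1 = nn.Sequential(                          # K_1 on X_2
            nn.Conv2d(half, half, 3, padding=1, bias=False),
            nn.BatchNorm2d(half), nn.ReLU(inplace=True))
        self.conv3 = nn.Sequential(     # 1x1 expansion after concatenation
            nn.Conv2d(2 * half, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        y1 = self.relu(self.scconv(self.conv1_a(x)))   # first pathway
        y2 = self.k1(self.conv1_b(x))                  # second pathway
        out = self.conv3(torch.cat([y1, y2], dim=1))
        # Identity shortcut; assumes matching channel counts and stride 1.
        return self.relu(out + x)


block = SCBottleneck(256, 128, 256)
y = block(torch.randn(1, 256, 56, 56))

A down-sampling block would additionally stride the 3 × 3 convolutions and project the shortcut, exactly as in standard ResNets.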
4. Experiments

4.1. Implementation Details

We implement our approach using the publicly available PyTorch framework (https://pytorch.org). For fair comparison, we adopt the official classification framework to perform all classification experiments unless specially declared. We report results on the ImageNet dataset [30]. The input images are of size 224 × 224, randomly cropped from resized images as done in [40]. We use SGD to optimize all models. The weight decay and momentum are set to 0.0001 and 0.9, respectively. Four Tesla V100 GPUs are used, and the mini-batch size is set to 256 (64 per GPU). By default, we train all models for 100 epochs with an initial learning rate of 0.1, which is divided by 10 after every 30 epochs. For testing, we report the accuracy on a single 224 × 224 center crop from an image whose shorter side is resized to 256, as in [40]. Note that the models in all ablation comparisons share the same running environment and hyper-parameters except for the network structures themselves. All models in Table 1 are trained under the same strategy and tested under the same settings.

4.2. Results on ImageNet

We conduct ablation experiments to verify the importance of each component in our proposed architecture and compare with existing attention-based approaches on the ImageNet-1K classification dataset [30].

4.2.1 Ablation Analysis

Generalization Ability: To demonstrate the generalization ability of the proposed structure, we consider three widely used classification architectures as baselines: ResNet [12], ResNeXt [40], and SE-ResNet [16]. The corresponding networks with self-calibrated convolutions are named SCNet, SCNeXt, and SE-SCNet, respectively. Following the default version of ResNeXt [40] (32 × 4d), we set the bottleneck width to 4 in SCNeXt. We also adjust the cardinality of each group convolution according to our structure to ensure that the capacity of SCNeXt is close to that of ResNeXt. For SE-SCNet, we apply the SE module to SCNet in the same way as [16].

Network          Params   MAdds   FLOPs   Top-1   Top-5
50-layer
ResNet [12]      25.6M    4.1G    8.2G    76.4    93.0
SCNet            25.6M    4.0G    7.9G    77.8    93.9
ResNeXt [40]     25.0M    4.3G    8.5G    77.4    93.4
ResNeXt 2x40d    25.4M    4.2G    8.3G    76.8    93.3
SCNeXt           25.0M    4.3G    8.5G    78.3    94.0
SE-ResNet [16]   28.1M    4.1G    8.2G    77.2    93.4
SE-SCNet         28.1M    4.0G    7.9G    78.2    93.9
101-layer
ResNet [12]      44.5M    7.8G    15.7G   78.0    93.9
SCNet            44.6M    7.2G    14.4G   78.9    94.3
ResNeXt [40]     44.2M    8.0G    16.0G   78.5    94.2
SCNeXt           44.2M    8.0G    15.9G   79.2    94.4
SE-ResNet [16]   49.3M    7.9G    15.7G   78.4    94.2
SE-SCNet         49.3M    7.2G    14.4G   78.9    94.3
Table 1. Comparisons on the ImageNet-1K dataset when the proposed structure is utilized in different classification frameworks. We report single-crop accuracy rates (%).

In Table 1, we show the results produced by both the 50- and 101-layer versions of each model. Compared to the original ResNet-50 architecture, SCNet-50 brings an improvement of 1.4% in accuracy (77.8% vs. 76.4%). Moreover, the improvement by SCNet-50 (1.4%) is also higher than those by ResNeXt-50 (1.0%) and SE-ResNet-50 (0.8%). This demonstrates that self-calibrated convolutions perform much better than increasing the cardinality or introducing the SE module [16]. When the networks go deeper, a similar phenomenon can be observed.

Another way to investigate the generalization ability of the proposed structure is to examine its behavior as a backbone on other vision tasks, such as object detection and instance segmentation. We give more experimental comparisons in the next subsection.

Self-Calibrated Convolution v.s. Vanilla Convolution: To further investigate the effectiveness of the proposed self-calibrated convolutions compared to vanilla convolutions, we add side supervision (auxiliary losses) as done in [21] to both ResNet-50 and SCNet-50 after one intermediate stage, namely res3. Results from side outputs can reflect how a network performs when the depth varies and how strong the feature representations at different levels are. The top-1 accuracy results from the side supervision at res3 are depicted in Figure 5. It is obvious that the side results from SCNet-50 are much better than those from ResNet-50. This indirectly indicates that networks with the proposed self-calibrated convolutions generate richer and more discriminative feature representations than those with vanilla convolutions. To further demonstrate this, we show some visualizations from the score layers of the side outputs in Figure 4. Apparently, SCNet can more precisely and integrally locate the target objects even at a lower depth of the network. In Sec. 4.3, we will give more demonstrations of this by applying both convolutions to different vision tasks.

Figure 4. Visualizations of feature maps from the side outputs at res3 of different networks (ResNet vs. SCNet; example classes include binoculars, Lhasa, accordion, daisy, gondola, and thatch). We use 50-layer settings for both networks.

Figure 5. Auxiliary loss curves (top-1 error vs. training epoch, with train and val curves for each model) for ResNet-50 and SCNet-50. We add the auxiliary loss after res3. As can be seen, SCNet (red lines) works much better than ResNet (cyan lines). This demonstrates that self-calibrated convolutions work better for networks with lower depth.

Attention Comparisons: To show why the proposed self-calibrated convolution is helpful for classification networks, we adopt Grad-CAM [31] as an attention extraction tool to visualize the attention maps produced by ResNet-50, ResNeXt-50, SE-ResNet-50, and SCNet-50, as shown in Figure 6. It can be clearly seen that the attention maps produced by SCNet-50 more precisely locate the target objects and do not expand too much into the background areas. When the target objects are small, the attention maps from our network are also better confined to the semantic regions than those produced by the other three networks. This suggests that our self-calibrated convolution is helpful for discovering more integral target objects even when their sizes are small.

Figure 6. Visualization of attention maps generated by Grad-CAM [31] for ResNet, ResNeXt, SE-ResNet, and SCNet (example classes: Leonberg, obelisk, paddlewheel, sunglass, neck brace). It is obvious that our SCNet can more precisely locate the foreground objects than the other networks, no matter how large they are or what shape they have. This heavily relies on our self-calibration operation, which helps adaptively capture rich context information. We use 50-layer settings for all networks.

Design Choices: As demonstrated in Sec. 3.1, we introduce the down-sampling operation to achieve self-calibration, which has proven useful for improving CNNs. Here, we investigate how the down-sampling rate in self-calibrated convolutions influences the classification performance. In Table 2, we show the performance with different down-sampling rates used in self-calibrated convolutions. As can be seen, when no down-sampling operation is adopted (r = 1), the result is already much better than the original ResNet-50 (77.38% vs. 76.40%). As the down-sampling rate increases, better performance can be achieved. Specifically, when the down-sampling rate is set to 4, we reach a top-1 accuracy of 77.81%. Note that we do not use larger down-sampling rates because the resolution of the last residual blocks is already very small (e.g., 7 × 7). Furthermore, we find that taking the feature maps at lower resolution (after F_2) as residuals, by adding an identity connection as shown in Figure 2, is also helpful for better performance. Discarding this extra identity connection decreases performance to 77.48%.

Average Pooling vs. Max Pooling: In addition to the above design choices, we also investigate the influence of different pooling types on the performance. In our experiments, we replace all the average pooling operators in self-calibrated convolutions with max pooling operators and observe the performance difference. With all other configurations unchanged, as shown in Table 2, using the max pooling operator yields a performance decrease of about 0.3% in top-1 accuracy (77.81 vs. 77.53). We argue that this may be because, unlike max pooling, average pooling builds connections among all locations within the pooling window, which better captures local contextual information.

Discussion: According to the above ablation experiments, introducing self-calibrated convolutions is helpful for classification networks, such as ResNet and ResNeXt. However, note that exploring the optimal architecture setting is beyond the scope of this paper; this paper provides a preliminary study of how to improve vanilla convolutions, and we encourage readers to further investigate more effective structures. In the next subsection, we will show how our approach behaves as a pretrained backbone when applied to popular vision tasks.
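For reference, the ablations in Table 2 correspond to small variants of the unofficial SCConv sketch from Sec. 3.1.2; the exact knobs in the released code may be named differently.

sc_default = SCConv(64, pooling_r=4)   # default: r = 4, average pooling
sc_no_down = SCConv(64, pooling_r=1)   # r = 1: no down-sampling
# MAX-pooling variant: swap nn.AvgPool2d for nn.MaxPool2d inside self.k2.
# 'Identity' ablation: use torch.sigmoid(ref) instead of
# torch.sigmoid(identity + ref) in forward(), dropping the identity path
# shown in Figure 2.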
Model       DS Rate (r)   Identity   Pooling   Top-1 Accuracy
ResNet      -             -          -         76.40%
ResNeXt     -             -          -         77.40%
SE-ResNet   -             -          AVG       77.20%
SCNet       1             yes        -         77.38%
SCNet       2             yes        AVG       77.48%
SCNet       4             no         AVG       77.48%
SCNet       4             yes        MAX       77.53%
SCNet       4             yes        AVG       77.81%
SCNeXt      4             yes        AVG       78.30%
Table 2. Ablation experiments on the design choices of SCNet. 'Identity' refers to the component with the same name in Figure 2. 'DS Rate' is the down-sampling rate r in Eqn. (2). We also show results under two types of pooling operations: average pooling (AVG) and max pooling (MAX).

Network                Params   MAdds   Top-1   Top-5
ResNet [12]            25.6M    4.1G    76.4    93.0
ResNeXt [40]           25.0M    4.3G    77.4    93.4
SE-ResNet [16]         28.1M    4.1G    77.2    93.4
ResNet + CBAM [38]     28.1M    4.1G    77.3    93.6
GCNet [3]              28.1M    4.1G    77.7    93.7
ResNet + GALA [25]     29.4M    4.1G    77.3    93.6
ResNet + AA [1]        28.1M    4.1G    77.7    93.6
ResNet + GE [15]†      31.2M    4.1G    78.0    93.6
SCNet                  25.6M    4.0G    77.8    93.9
SCNet†                 25.6M    4.0G    78.2    94.0
SE-SCNet               28.1M    4.0G    78.2    93.9
GE-SCNet               31.1M    4.0G    78.3    94.0
Table 3. Comparisons with prior attention-based approaches on the ImageNet-1K dataset. All approaches are based on the ResNet-50 baseline. We report single-crop accuracy rates (%) and show complexity comparisons as well. '†' means models trained for 300 epochs.

4.2.2 Comparisons with Attention-Based Approaches

Here, we benchmark the proposed SCNet against existing attention-based approaches, including CBAM [38], SENet [16], GALA [25], AA [1], and GE [15], on the ResNet-50 architecture. The comparison results can be found in Table 3. It can easily be seen that most attention- or non-local-based approaches require additional learnable parameters to build their corresponding modules, which are then plugged into building blocks. Quite differently, our approach does not rely on any extra learnable parameters but only heterogeneously exploits the convolutional filters. Our results are clearly better than those of all the other approaches. It should also be mentioned that the proposed self-calibrated convolutions are compatible with the above-mentioned attention-based approaches. For example, when adding GE blocks to each building block of SCNet, as done in [15], we gain another 0.5% boost in accuracy. This also indicates that our approach is different from this kind of add-on module.

4.3. Applications

In this subsection, we investigate the generalization capability of the proposed approach by applying it to popular vision tasks as backbones, including object detection, instance segmentation, and human keypoint detection.

4.3.1 Object Detection

Network Settings: In the object detection task, we take the widely used Faster R-CNN architecture [29] with feature pyramid networks (FPNs) [23] as baselines. We adopt the widely used mmdetection framework [4] (https://github.com/open-mmlab/mmdetection) to run all our experiments. As done in previous work [23, 11], we train each model using the union of the 80k COCO train images and 35k images from the validation set (trainval35k) [24] and report results on the remaining 5k validation images (minival).

We set hyper-parameters strictly following the Faster R-CNN work [29] and its FPN version [23]. Images are resized so that their shorter edges are 800 pixels. We use 8 Tesla V100 GPUs to train each model, and the mini-batch size is set to 16, i.e., 2 images per GPU. The initial learning rate is set to 0.02, and we use the 2× training schedule to train each model. Weight decay and momentum are set to 0.0001 and 0.9, respectively. We report the results using the standard COCO metrics, including AP (mean Average Precision averaged over different IoU thresholds), AP_0.5, AP_0.75, and AP_S, AP_M, AP_L (AP at different scales). Both 50-layer and 101-layer backbones are adopted.

Detection Results: In the top part of Table 4, we show experimental results on object detection when different classification backbones are used. Taking Faster R-CNN [29] as an example, adopting ResNet-50-FPN as the backbone gives an AP score of 37.6, while replacing ResNet-50 with SCNet-50 yields a large improvement of 3.2 (40.8 vs. 37.6). More interestingly, Faster R-CNN with the SCNet-50 backbone performs even better than with ResNeXt-50 (40.8 vs. 38.2). This indicates that the proposed way of leveraging convolutional filters is much more efficient than directly grouping the filters. This may be because the proposed self-calibrated convolutions contain the adaptive response calibration operation, which helps more precisely locate the exact positions of target objects, as shown in Figure 6. In addition, from Table 4, we can observe that using deeper backbones leads to a similar phenomenon (ResNet-101-FPN: 39.9 → SCNet-101-FPN: 42.0).
Backbone            AP     AP_0.5   AP_0.75   AP_S   AP_M   AP_L
Object Detection (Faster R-CNN)
ResNet-50-FPN       37.6   59.4     40.4      21.9   41.2   48.4
SCNet-50-FPN        40.8   62.7     44.5      24.4   44.8   53.1
ResNeXt-50-FPN      38.2   60.1     41.4      22.2   41.7   49.2
SCNeXt-50-FPN       40.4   62.8     43.7      23.4   43.5   52.8
ResNet-101-FPN      39.9   61.2     43.5      23.5   43.9   51.7
SCNet-101-FPN       42.0   63.7     45.5      24.4   46.3   54.6
ResNeXt-101-FPN     40.5   62.1     44.2      23.2   44.4   52.9
SCNeXt-101-FPN      42.0   64.1     45.7      25.5   46.1   54.2
Instance Segmentation (Mask R-CNN)
ResNet-50-FPN       35.0   56.5     37.4      18.3   38.2   48.3
SCNet-50-FPN        37.2   59.9     39.5      17.8   40.3   54.2
ResNeXt-50-FPN      35.5   57.6     37.6      18.6   38.7   48.7
SCNeXt-50-FPN       37.5   60.3     40.0      18.2   40.5   55.0
ResNet-101-FPN      36.7   58.6     39.3      19.3   40.3   50.9
SCNet-101-FPN       38.4   61.0     41.0      18.2   41.6   56.6
ResNeXt-101-FPN     37.3   59.5     39.8      19.9   40.6   51.2
SCNeXt-101-FPN      38.2   61.2     40.8      18.8   41.4   56.1
Table 4. Comparisons with state-of-the-art approaches on the COCO minival dataset. All results are based on single-model testing and the same hyper-parameters. For object detection, AP refers to box IoU, while for instance segmentation AP refers to mask IoU.

4.3.2 Instance Segmentation

For instance segmentation, we use the same hyper-parameters and datasets as in Mask R-CNN [11] for a fair comparison. The results for all experiments in this part are based on the mmdetection framework [4].

We compare the SCNet version of Mask R-CNN to the ResNet version at the bottom of Table 4. Because we have already introduced the object detection results in detail, here we only report the results using mask AP. As can be seen, the ResNet-50-FPN and ResNeXt-50-FPN versions of Mask R-CNN achieve 35.0 and 35.5 mask AP, respectively. When SCNet is used instead, the corresponding results are improved by 2.2 and 2.0 mask AP, respectively. Similar results can also be observed when adopting deeper backbones. This suggests that our self-calibrated convolutions are also helpful for instance segmentation.

4.3.3 Keypoint Detection

Finally, we apply SCNet to human keypoint detection and report results on the COCO keypoint detection dataset [24]. We take the state-of-the-art method [39] as our baseline. We only replace the backbone ResNet in [39] with SCNet and keep all other training and test settings (https://github.com/Microsoft/human-pose-estimation.pytorch) unchanged. We evaluate the results on the COCO val2017 set using the standard OKS-based mAP, where OKS (object keypoint similarity) defines the similarity between different human poses. A Faster R-CNN object detector [29] with a detection AP of 56.4 for the 'person' category on the COCO val2017 set is adopted for detection in the test phase, as in [39].

Backbone     Scale       AP     AP_0.5   AP_0.75   AP_M   AP_L
ResNet-50    256 × 192   70.6   88.9     78.2      67.2   77.4
SCNet-50     256 × 192   72.1   89.4     79.8      69.0   78.7
ResNet-50    384 × 288   71.9   89.2     78.6      67.7   79.6
SCNet-50     384 × 288   74.4   89.7     81.4      70.7   81.7
ResNet-101   256 × 192   71.6   88.9     79.3      68.5   78.2
SCNet-101    256 × 192   72.6   89.4     80.4      69.4   79.4
ResNet-101   384 × 288   73.9   89.6     80.5      70.3   81.1
SCNet-101    384 × 288   74.8   89.6     81.8      71.2   81.9
Table 5. Experiments on keypoint detection [24]. We report results on the COCO val2017 set using the OKS-based mAP and take the state-of-the-art method [39] as our baseline. Two different input sizes (256 × 192 and 384 × 288) are considered, as in [39].

Table 5 shows the comparisons. As can be seen, simply replacing ResNet-50 with SCNet-50 improves the AP score by 1.5% for the 256 × 192 input size and by 2.5% for the 384 × 288 input size. These results demonstrate that introducing the proposed self-calibration operation in convolutional layers benefits human keypoint detection. When using deeper networks as backbones, we also obtain more than 1% performance gain in AP, as shown in Table 5.

5. Conclusions and Future Work

This paper presents a new self-calibrated convolution, which heterogeneously exploits the convolutional filters nested in a convolutional layer. To promote filters with diverse patterns, we introduce the adaptive response calibration operation. The proposed self-calibrated convolutions can be easily embedded into modern classification networks. Experiments on a large-scale image classification dataset demonstrate that building multi-scale feature representations within building blocks greatly improves the prediction accuracy. To investigate the generalization ability of our approach, we apply it to multiple popular vision tasks and find substantial improvements over the baseline models. We hope the idea of heterogeneously exploiting convolutional filters can provide the vision community with a different perspective on network architecture design.

Acknowledgement. This research was partly supported by the Major Project for New Generation of AI under Grant No. 2018AAA01004, NSFC (61620106008), the national youth talent support program, and the Tianjin Natural Science Foundation (18ZXZNGX00110). Part of this work was done when Jiang-Jiang Liu interned at ByteDance AI Lab.
References

[1] Irwan Bello, Barret Zoph, Ashish Vaswani, Jonathon Shlens, and Quoc V. Le. Attention augmented convolutional networks. arXiv preprint arXiv:1904.09925, 2019.
[2] Ali Borji, Ming-Ming Cheng, Qibin Hou, Huaizu Jiang, and Jia Li. Salient object detection: A survey. Computational Visual Media, 5(2):117-150, 2019.
[3] Yue Cao, Jiarui Xu, Stephen Lin, Fangyun Wei, and Han Hu. GCNet: Non-local networks meet squeeze-excitation networks and beyond. arXiv preprint arXiv:1904.11492, 2019.
[4] Kai Chen, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jianping Shi, Wanli Ouyang, Chen Change Loy, and Dahua Lin. mmdetection. https://github.com/open-mmlab/mmdetection, 2018.
[5] Liang-Chieh Chen, Maxwell Collins, Yukun Zhu, George Papandreou, Barret Zoph, Florian Schroff, Hartwig Adam, and Jon Shlens. Searching for efficient multi-scale architectures for dense image prediction. In NeurIPS, pages 8699-8710, 2018.
[6] Yunpeng Chen, Haoqi Fan, Bing Xu, Zhicheng Yan, Yannis Kalantidis, Marcus Rohrbach, Shuicheng Yan, and Jiashi Feng. Drop an octave: Reducing spatial redundancy in convolutional neural networks with octave convolution. In ICCV, pages 3435-3444, 2019.
[7] Yunpeng Chen, Jianan Li, Huaxin Xiao, Xiaojie Jin, Shuicheng Yan, and Jiashi Feng. Dual path networks. In NeurIPS, pages 4467-4475, 2017.
[8] Deng-Ping Fan, Ming-Ming Cheng, Jiang-Jiang Liu, Shang-Hua Gao, Qibin Hou, and Ali Borji. Salient objects in clutter: Bringing salient object detection to the foreground. In ECCV, pages 186-202, 2018.
[9] Shang-Hua Gao, Ming-Ming Cheng, Kai Zhao, Xin-Yu Zhang, Ming-Hsuan Yang, and Philip Torr. Res2Net: A new multi-scale backbone architecture. IEEE TPAMI, 2020.
[10] Shiming Ge, Xin Jin, Qiting Ye, Zhao Luo, and Qiang Li. Image editing by object-aware optimal boundary searching and mixed-domain composition. Computational Visual Media, 4(1):71-82, 2018.
[11] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, pages 2980-2988, 2017.
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In ECCV, pages 630-645, 2016.
[14] Qibin Hou, Ming-Ming Cheng, Xiaowei Hu, Ali Borji, Zhuowen Tu, and Philip Torr. Deeply supervised salient object detection with short connections. IEEE TPAMI, 41(4):815-828, 2019.
[15] Jie Hu, Li Shen, Samuel Albanie, Gang Sun, and Andrea Vedaldi. Gather-excite: Exploiting feature context in convolutional neural networks. In NeurIPS, pages 9401-9411, 2018.
[16] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, pages 7132-7141, 2018.
[17] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In CVPR, pages 4700-4708, 2017.
[18] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[19] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In NeurIPS, pages 1097-1105, 2012.
[20] Thuc Trinh Le, Andrés Almansa, Yann Gousseau, and Simon Masnou. Object removal from complex videos using a few annotations. Computational Visual Media, 5(3):267-291, 2019.
[21] Chen-Yu Lee, Saining Xie, Patrick Gallagher, Zhengyou Zhang, and Zhuowen Tu. Deeply-supervised nets. In Artificial Intelligence and Statistics, pages 562-570, 2015.
[22] Yi Li, Zhanghui Kuang, Yimin Chen, and Wayne Zhang. Data-driven neuron allocation for scale aggregation networks. In CVPR, pages 11526-11534, 2019.
[23] Tsung-Yi Lin, Piotr Dollár, Ross B. Girshick, Kaiming He, Bharath Hariharan, and Serge J. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
[24] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[25] Drew Linsley, Dan Shiebler, Sven Eberhardt, and Thomas Serre. Learning what and where to attend. In ICLR, 2019.
[26] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018.
[27] Jiang-Jiang Liu, Qibin Hou, Ming-Ming Cheng, Jiashi Feng, and Jianmin Jiang. A simple pooling-based design for real-time salient object detection. In CVPR, pages 3917-3926, 2019.
[28] Chao Peng, Xiangyu Zhang, Gang Yu, Guiming Luo, and Jian Sun. Large kernel matters - improve semantic segmentation by global convolutional network. In CVPR, pages 4353-4361, 2017.
[29] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE TPAMI, 39(6):1137-1149, 2016.
[30] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. IJCV, 115(3):211-252, 2015.
[31] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In ICCV, pages 618-626, 2017.
[32] Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014.
[33] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[34] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A. Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, 2017.
[35] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, pages 1-9, 2015.
[36] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, pages 2818-2826, 2016.
[37] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, pages 7794-7803, 2018.
[38] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. CBAM: Convolutional block attention module. In ECCV, pages 3-19, 2018.
[39] Bin Xiao, Haiping Wu, and Yichen Wei. Simple baselines for human pose estimation and tracking. In ECCV, 2018.
[40] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In CVPR, pages 5987-5995, 2017.
[41] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. In ICLR, 2016.
[42] Fisher Yu, Dequan Wang, Evan Shelhamer, and Trevor Darrell. Deep layer aggregation. In CVPR, pages 2403-2412, 2018.
[43] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
[44] Xingcheng Zhang, Zhizhong Li, Chen Change Loy, and Dahua Lin. PolyNet: A pursuit of structural diversity in very deep networks. In CVPR, pages 718-726, 2017.
[45] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In CVPR, 2017.
[46] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In CVPR, pages 2921-2929, 2016.
[47] Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.
[48] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning transferable architectures for scalable image recognition. In CVPR, pages 8697-8710, 2018.
