Attentional Feature Fusion
Yimian Dai1 Fabian Gieseke2,3 Stefan Oehmcke3 Yiquan Wu1 Kobus Barnard4
1 College of Electronic and Information Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, China
2 Department of Information Systems, University of Münster, Münster, Germany
3 Department of Computer Science, University of Copenhagen, Copenhagen, Denmark
4 Department of Computer Science, University of Arizona, Tucson, AZ, USA
Abstract

Feature fusion, the combination of features from different layers or branches, is an omnipresent part of modern network architectures. It is often implemented via simple operations, such as summation or concatenation, but this might not be the best choice. In this work, we propose a uniform and general scheme, namely attentional feature fusion, which is applicable for most common scenarios, including feature fusion induced by short and long skip connections as well as within Inception layers. To better fuse features of inconsistent semantics and scales, we propose a multi-scale channel attention module, which addresses issues that arise when fusing features given at different scales. We also demonstrate that the initial integration of feature maps can become a bottleneck and that this issue can be alleviated by adding another level of attention, which we refer to as iterative attentional feature fusion. With fewer layers or parameters, our models outperform state-of-the-art networks on both the CIFAR-100 and ImageNet datasets, which suggests that more sophisticated attention mechanisms for feature fusion hold great potential to consistently yield better results compared to their direct counterparts. Our codes and trained models are available online at https://github.com/YimianDai/open-aff.

1. Introduction

Convolutional neural networks (CNNs) have seen a significant improvement of their representation power by going deeper [12], going wider [38, 49], increasing cardinality [47], and refining features dynamically [16], corresponding to advances in many computer vision tasks.

Apart from these strategies, in this paper, we investigate a different component of the network, feature fusion, to further boost the representation power of CNNs. Whether explicit or implicit, intentional or unintentional, feature fusion is omnipresent in modern network architectures and has been studied extensively in the previous literature [38, 36, 12, 30, 23]. For instance, in the InceptionNet family [38, 39, 37], the outputs of filters with multiple sizes on the same level are fused to handle the large variation of object size. In Residual Networks (ResNet) [12, 13] and its follow-ups [49, 47], the identity mapping features and residual learning features are fused as the output via short skip connections, enabling the training of very deep networks. In Feature Pyramid Networks (FPN) [23] and U-Net [30], low-level features and high-level features are fused via long skip connections to obtain high-resolution and semantically strong features, which are vital for semantic segmentation and object detection. However, despite its prevalence in modern networks, most works on feature fusion focus on constructing sophisticated pathways to combine features in different kernels, groups, or layers. The feature fusion method itself has rarely been addressed and is usually implemented via simple operations such as addition or concatenation, which merely offer a fixed linear aggregation of feature maps and are entirely unaware of whether this combination is suitable for specific objects.

Recently, Selective Kernel Networks (SKNet) [21] and ResNeSt [51] have been proposed to render a dynamic weighted averaging of features from multiple kernels or groups in the same layer based on the global channel attention mechanism [16]. Although such attention-based methods present nonlinear approaches for feature fusion, they still suffer from the following shortcomings:

1. Limited scenarios: SKNet and ResNeSt only focus on the soft feature selection in the same layer, whereas the cross-layer fusion in skip connections has not been addressed, leaving their schemes quite heuristic. Despite having different scenarios, all kinds of feature fusion implementations face the same challenge in essence, that is, how to integrate features of different scales for better performance. A module that can overcome the semantic inconsistency and effectively integrate features of different scales should be able to consistently
improve the quality of fused features in various network scenarios. However, so far, there is still a lack of a generalized approach that can unify different feature fusion scenarios in a consistent manner.

2. Unsophisticated initial integration: To feed the received features into the attention module, SKNet introduces another phase of feature fusion in an involuntary but inevitable way, which we call initial integration and which is implemented by addition. Therefore, besides the design of the attention module, the initial integration approach, as its input, also has a large impact on the quality of the fusion weights. Considering that the features may have a large inconsistency in scale and semantic level, an unsophisticated initial integration strategy that ignores this issue can become a bottleneck.

3. Biased context aggregation scale: The fusion weights in SKNet and ResNeSt are generated via the global channel attention mechanism [16], which is preferred for information that distributes more globally. However, objects in the image can have an extremely large variation in size. Numerous studies have emphasized this issue when designing CNNs, i.e., that the receptive fields of predictors should match the object scale range [52, 33, 34, 22]. Therefore, merely aggregating contextual information on a global scale is too biased and weakens the features of small objects. This gives rise to the question of whether a network can dynamically and adaptively fuse the received features in a contextual scale-aware way.

Motivated by the above observations, we present the attentional feature fusion (AFF) module, trying to answer the question of what a unified approach for all kinds of feature fusion scenarios should look like, and to address the problems of contextual aggregation and initial integration. The AFF framework generalizes attention-based feature fusion from the same-layer scenario to cross-layer scenarios including short and long skip connections, and even the initial integration inside AFF itself. It provides a universal and consistent way to improve the performance of various networks, e.g., InceptionNet, ResNet, ResNeXt [47], and FPN, by simply replacing existing feature fusion operators with the proposed AFF module. Moreover, the AFF framework supports gradually refining the initial integration, namely the input of the fusion weight generator, by iteratively integrating the received features with another AFF module, which we refer to as iterative attentional feature fusion (iAFF).

To alleviate the problems arising from scale variation and small objects, we advocate the idea that attention modules should also aggregate contextual information from different receptive fields for objects of different scales. More specifically, we propose the Multi-Scale Channel Attention Module (MS-CAM), a simple yet effective scheme to remedy the feature inconsistency across different scales for attentional feature fusion. Our key observation is that scale is not an issue exclusive to spatial attention, and that channel attention can also have scales other than the global one by varying the spatial pooling size. By aggregating the multi-scale context information along the channel dimension, MS-CAM can simultaneously emphasize large objects that distribute more globally and highlight small objects that distribute more locally, facilitating the network to recognize and detect objects under extreme scale variation.

2. Related Work

2.1. Multi-scale Attention Mechanism

The scale variation of objects is one of the key challenges in computer vision. To remedy this issue, an intuitive way is to leverage multi-scale image pyramids [29, 2], in which objects are recognized at multiple scales and the predictions are combined using non-maximum suppression. The other line of effort aims to exploit the inherent multi-scale, hierarchical feature pyramid of CNNs to approximate image pyramids, in which features from multiple layers are fused to obtain semantic features with high resolutions [11, 30, 23].

The attention mechanism in deep learning, which mimics the human visual attention mechanism [5, 8], was originally developed on a global scale. For example, the matrix multiplication in self-attention draws global dependencies of each word in a sentence [41] or each pixel in an image [7, 44, 1]. The Squeeze-and-Excitation Networks (SENet) squeeze global spatial information into a channel descriptor to capture channel-wise dependencies [16]. Recently, researchers have started to take the scale issue of attention mechanisms into account. Similar to the above-mentioned approaches handling scale variation in CNNs, multi-scale attention mechanisms are achieved by either feeding multi-scale features into an attention module or combining feature contexts of multiple scales inside an attention module. In the first type, the features at multiple scales or their concatenated result are fed into the attention module to generate multi-scale attention maps, while the scale of feature context aggregation inside the attention module remains single [2, 4, 45, 6, 35, 40]. The second type, which is also referred to as multi-scale spatial attention, aggregates feature contexts by convolutional kernels of different sizes [20] or from a pyramid [20, 43] inside the attention module.

The proposed MS-CAM follows the idea of ParseNet [25] in combining local and global features in CNNs, and the idea of spatial attention in aggregating multi-scale feature contexts inside the attention module, but differs in at least two important aspects: 1) MS-CAM puts forward the scale issue in channel attention and is realized by point-wise convolutions rather than kernels of different sizes. 2) Instead of operating in the backbone network, MS-CAM aggregates local and global feature contexts inside the channel attention module.
To the best of our knowledge, multi-scale channel attention has never been discussed before.

2.2. Skip Connections in Deep Learning

Skip connections have been an essential component in modern convolutional networks. Short skip connections, namely the identity mapping shortcuts added inside Residual blocks, provide an alternative path for the gradient to flow without interruption during backpropagation [12, 47, 49]. Long skip connections help the network to obtain semantic features with high resolutions by bridging features of finer details from lower layers and high-level semantic features of coarse resolutions [17, 23, 30, 26]. Despite being used to combine features in various pathways [9], the fusion of connected features is usually implemented via addition or concatenation, which allocate fixed weights to the features regardless of the variance of their contents. Recently, a few attention-based methods, e.g., Global Attention Upsample (GAU) [20] and Skip Attention (SA) [48], have been proposed to use high-level features as guidance to modulate the low-level features in long skip connections. However, the fusion weights for the modulated features are still fixed.

To the best of our knowledge, it is the Highway Networks that first introduced a selection mechanism in short skip connections [36]. To some extent, the attentional skip connections proposed in this paper can be viewed as their follow-up, but differ in three points: 1) Highway Networks employ a simple fully connected layer that can only generate a scalar fusion weight, while our proposed MS-CAM generates fusion weights of the same size as the feature maps, enabling dynamic soft selections in an element-wise way. 2) Highway Networks only use one input feature to generate the weight, while our AFF module is aware of both features. 3) We point out the importance of initial feature integration, and the iAFF module is proposed as a solution.

3. Multi-scale Channel Attention

3.1. Revisiting Channel Attention in SENet

Given an intermediate feature X ∈ R^{C×H×W} with C channels and feature maps of size H × W, the channel attention weights w ∈ R^C in SENet can be computed as

w = σ(g(X)) = σ(B(W2 δ(B(W1 g(X))))),   (1)

where g(X) ∈ R^C denotes the global feature context and g(X) = 1/(H×W) Σ_{i=1}^{H} Σ_{j=1}^{W} X[:, i, j] is the global average pooling (GAP). δ denotes the Rectified Linear Unit (ReLU) [27], B denotes Batch Normalization (BN) [18], and σ is the Sigmoid function.

The GAP squeezes each feature map of size H × W into a scalar. This extremely coarse descriptor prefers to emphasize large objects that distribute globally and can potentially wipe out most of the image signal present in a small object. However, detecting very small objects stands out as the key performance bottleneck of state-of-the-art networks [34]. For example, the difficulty of COCO is largely due to the fact that most object instances are smaller than 1% of the image area [24, 33]. Therefore, global channel attention might not be the best choice. Multi-scale feature contexts should be aggregated inside the attention module to alleviate the problems arising from scale variation and small object instances.

3.2. Aggregating Local and Global Contexts

In this part, we depict the proposed multi-scale channel attention module (MS-CAM) in detail. The key idea is that the channel attention can be implemented at multiple scales by varying the spatial pooling size. To keep it as lightweight as possible, we merely add the local context to the global context inside the attention module. We choose the point-wise convolution (PWConv) as the local channel context aggregator, which only exploits point-wise channel interactions at each spatial position. To save parameters, the local channel context L(X) ∈ R^{C×H×W} is computed via a bottleneck structure as follows:

L(X) = B(PWConv2(δ(B(PWConv1(X))))).   (2)

The kernel sizes of PWConv1 and PWConv2 are C/r × C × 1 × 1 and C × C/r × 1 × 1, respectively. It is noteworthy that L(X) has the same shape as the input feature, which can preserve and highlight the subtle details in the low-level features. Given the global channel context g(X) and the local channel context L(X), the refined feature X′ ∈ R^{C×H×W} produced by MS-CAM can be obtained as follows:

X′ = X ⊗ M(X) = X ⊗ σ(L(X) ⊕ g(X)),   (3)

where M(X) ∈ R^{C×H×W} denotes the attentional weights generated by MS-CAM, ⊕ denotes the broadcasting addition, and ⊗ denotes the element-wise multiplication.

[Figure: schema of MS-CAM. The input X (C × H × W) passes through a global branch (GlobalAvgPooling followed by a point-wise convolution bottleneck with BN and ReLU, yielding a C × 1 × 1 context) and a local branch (the same point-wise bottleneck without pooling, yielding a C × H × W context); the two contexts are added.]
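For concreteness, Eqs. (2) and (3) can be sketched in a few lines of code. The authors' release is based on MXNet/GluonCV, so the PyTorch framework, class name, and default reduction ratio below are illustrative assumptions rather than the exact implementation:

```python
import torch
import torch.nn as nn


class MSCAM(nn.Module):
    """Multi-scale channel attention: local (point-wise) plus global (GAP) contexts."""

    def __init__(self, channels: int, r: int = 4):
        super().__init__()
        inter = max(channels // r, 1)
        # Global channel context g(X): GAP, then a point-wise bottleneck (cf. Eq. (1)).
        self.global_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, inter, kernel_size=1),
            nn.BatchNorm2d(inter),
            nn.ReLU(inplace=True),
            nn.Conv2d(inter, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
        )
        # Local channel context L(X): the same point-wise bottleneck without pooling,
        # so the output keeps the C x H x W shape (Eq. (2)).
        self.local_att = nn.Sequential(
            nn.Conv2d(channels, inter, kernel_size=1),
            nn.BatchNorm2d(inter),
            nn.ReLU(inplace=True),
            nn.Conv2d(inter, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = self.global_att(x)       # (N, C, 1, 1), broadcast over H x W
        l = self.local_att(x)        # (N, C, H, W)
        return self.sigmoid(l + g)   # M(X) = sigma(L(X) (+) g(X)), Eq. (3)


# Refinement as in Eq. (3): X' = X (x) M(X)
# x_refined = x * MSCAM(channels=64)(x)
```

Used as a standalone refinement block, the weights are simply multiplied onto the input; in Section 4 the same weight map instead gates the fusion of two inputs.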
4. Attentional Feature Fusion

4.1. Unification of Feature Fusion Scenarios

Given two feature maps X, Y ∈ R^{C×H×W}, by default, we assume Y is the feature map with the larger receptive field. More specifically,

1. same-layer scenario: X is the output of a 3 × 3 kernel and Y is the output of a 5 × 5 kernel in InceptionNet;

2. short skip connection scenario: X is the identity mapping, and Y is the learned residual in a ResNet block;

3. long skip connection scenario: X is the low-level feature map, and Y is the high-level semantic feature map in a feature pyramid.

Based on the multi-scale channel attention module M, Attentional Feature Fusion (AFF) can be expressed as

Z = M(X ⊎ Y) ⊗ X + (1 − M(X ⊎ Y)) ⊗ Y,   (4)

where Z ∈ R^{C×H×W} is the fused feature and ⊎ denotes the initial feature integration. In this subsection, for the sake of simplicity, we choose the element-wise summation as the initial integration. The AFF is illustrated in Fig. 2(a), where the dashed line denotes 1 − M(X ⊎ Y). It should be noted that the fusion weights M(X ⊎ Y) consist of real numbers between 0 and 1, and so does 1 − M(X ⊎ Y), which enables the network to conduct a soft selection or weighted averaging between X and Y.

Figure 2: Illustration of the proposed AFF and iAFF.

We summarize different formulations of feature fusion in deep networks in Table 1. G denotes the global attention mechanism. Although there are many implementation differences among the approaches for various feature fusion scenarios, once they are abstracted into mathematical forms, these differences in details disappear. Therefore, it is possible to unify these feature fusion scenarios with a carefully designed approach, thereby improving the performance of all networks by replacing the original fusion operations with this unified approach.

From Table 1, it can be further seen that, apart from the implementation of the weight generation module G, the state-of-the-art fusion schemes mainly differ in two crucial points: (a) Context-awareness level. Linear approaches like addition and concatenation are entirely context-unaware. Feature refinement and modulation are non-linear, but only partially aware of the input feature maps; in most cases, they only exploit the high-level feature map. Fully context-aware approaches utilize both input feature maps for guidance at the cost of raising the initial integration issue. (b) Refinement vs. modulation vs. selection. The sum of the weights applied to the two feature maps in soft selection approaches is bound to 1, while this is not the case for refinement and modulation.

Figure 3: The schema of the proposed AFF-Inception module, AFF-ResBlock, and AFF-FPN. The blue and red lines denote channel expansion and upsampling, respectively.

4.2. Iterative Attentional Feature Fusion

Unlike partially context-aware approaches [20], fully context-aware methods have an inevitable issue, namely how to initially integrate the input features. As the input of the attention module, the initial integration quality may profoundly affect the final fusion weights. Since it is still a feature fusion problem, an intuitive way is to have another attention module fuse the input features. We call this two-stage approach iterative Attentional Feature Fusion (iAFF), which is illustrated in Fig. 2(b). The initial integration X ⊎ Y in Eq. (4) can then be reformulated as

X ⊎ Y = M(X + Y) ⊗ X + (1 − M(X + Y)) ⊗ Y.   (5)

4.3. Examples: InceptionNet, ResNet, and FPN

To validate the proposed AFF/iAFF as a uniform and general scheme, we choose ResNet, FPN, and InceptionNet as examples for the most common scenarios: short and long skip connections as well as same-layer fusion. It is straightforward to apply AFF/iAFF to existing networks by replacing the original addition or concatenation. Specifically, we replace the concatenation in the InceptionNet module as well as the addition in the ResNet block (ResBlock) and FPN to obtain the attentional networks, which we call the AFF-Inception module, AFF-ResBlock, and AFF-FPN, respectively. This replacement and the schemes of our proposed architectures are shown in Fig. 3. The iAFF is a particular case of AFF, so it does not need another illustration.
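A hedged sketch of Eqs. (4) and (5), built on top of the MS-CAM sketch from Section 3 (again PyTorch for illustration; the class and argument names are assumptions, not the authors' released MXNet code):

```python
import torch
import torch.nn as nn


class AFF(nn.Module):
    """Attentional Feature Fusion, Eq. (4): Z = M(X+Y) * X + (1 - M(X+Y)) * Y."""

    def __init__(self, channels: int, r: int = 4):
        super().__init__()
        self.att = MSCAM(channels, r)       # MSCAM as sketched in Section 3

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        w = self.att(x + y)                 # initial integration via element-wise summation
        return w * x + (1.0 - w) * y        # soft selection between X and Y


class iAFF(nn.Module):
    """Iterative AFF: a second attention stage refines the initial integration, Eq. (5)."""

    def __init__(self, channels: int, r: int = 4):
        super().__init__()
        self.att_inner = MSCAM(channels, r)
        self.att_outer = MSCAM(channels, r)

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        w1 = self.att_inner(x + y)
        xi = w1 * x + (1.0 - w1) * y        # refined initial integration X (+) Y
        w2 = self.att_outer(xi)
        return w2 * x + (1.0 - w2) * y      # Eq. (4) applied to the refined integration


# Example: turning a ResBlock into an AFF-ResBlock amounts to replacing
#   out = identity + residual
# with
#   out = aff(identity, residual)
```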
Table 1: A brief overview of different feature fusion strategies in deep networks.

| Context-aware | Type | Formulation | Scenario & Reference | Example |
|---|---|---|---|---|
| None | Addition | X + Y | Short Skip [12, 13], Long Skip [26, 23] | ResNet, FPN |
| None | Concatenation | W_A X_{:,i,j} + W_B Y_{:,i,j} | Same Layer [38], Long Skip [30, 17] | InceptionNet, U-Net |
| Partially | Refinement | X + G(Y) ⊗ Y | Short Skip [16, 15, 46, 28] | SENet |
| Partially | Modulation | G(Y) ⊗ X + Y | Long Skip [20] | GAU |
| Partially | Soft Selection | G(X) ⊗ X + (1 − G(X)) ⊗ Y | Short Skip [36] | Highway Networks |
| Fully | Modulation | G(X, Y) ⊗ X + Y | Long Skip [48] | SA |
| Fully | Soft Selection | G(X + Y) ⊗ X + (1 − G(X + Y)) ⊗ Y | Same Layer [21, 51] | SKNet |
| Fully | Soft Selection | M(X ⊎ Y) ⊗ X + (1 − M(X ⊎ Y)) ⊗ Y | Same Layer, Short Skip, Long Skip | ours |

5. Experiments
Table 2: Experimental settings for the networks integrated with the proposed AFF/iAFF.

| Task | Dataset | Host Network | Fusing Scenario | r | Epochs | Batch Size | Optimizer | Learning Rate | Learning Rate Mode | Initialization |
|---|---|---|---|---|---|---|---|---|---|---|
| Image Classification | CIFAR-100 | Inception-ResNet-20-b | Same Layer | 4 | 400 | 128 | Nesterov | 0.2 | Step, γ = 0.1 | Kaiming |
| Image Classification | CIFAR-100 | ResNet-20-b | Short Skip | 4 | 400 | 128 | Nesterov | 0.2 | Step, γ = 0.1 | Kaiming |
| Image Classification | CIFAR-100 | ResNeXt-38-32x4d | Short Skip | 16 | 400 | 128 | Nesterov | 0.2 | Step, γ = 0.1 | Xavier |
| Image Classification | ImageNet | ResNet-50 | Short Skip | 16 | 160 | 128 | Nesterov | 0.075 | Cosine | Kaiming |
| Semantic Segmentation | StopSign | ResNet-20-b + FPN | Long Skip | 4 | 300 | 32 | AdaGrad | 0.01 | Poly | Kaiming |
Table 3: Comparison of contextual aggregation scales in attentional feature fusion given the same parameter budget. The results suggest that a mix of scales should always be preferred inside the channel attention module.

| Aggregation Scale | InceptionNet on CIFAR-100 (b = 1 / 2 / 3 / 4) | ResNet on CIFAR-100 (b = 1 / 2 / 3 / 4) | ResNet + FPN on StopSign (b = 1 / 2 / 3 / 4) | ResNet on ImageNet |
|---|---|---|---|---|
| Global + Global | 0.735 / 0.766 / 0.775 / 0.789 | 0.754 / 0.796 / 0.811 / 0.821 | 0.911 / 0.923 / 0.936 / 0.939 | 0.777 |
| Local + Local | 0.746 / 0.771 / 0.785 / 0.787 | 0.754 / 0.794 / 0.808 / 0.814 | 0.895 / 0.919 / 0.921 / 0.924 | 0.780 |
| Global + Local | 0.756 / 0.784 / 0.794 / 0.801 | 0.763 / 0.804 / 0.816 / 0.826 | 0.924 / 0.935 / 0.939 / 0.944 | 0.784 |
Table 4: Comparison of the context-awareness level and the feature integration strategy in feature fusion given the same parameter budget. The results suggest that a fully context-aware and selective strategy should always be preferred for feature fusion. If optimization poses no problem, the iterative attentional feature fusion should be adopted without hesitation for better performance.

| Fusion Type | Context | Strategy | InceptionNet (Same Layer) (b = 1 / 2 / 3 / 4) | ResNet (Short Skip) (b = 1 / 2 / 3 / 4) | ResNet + FPN (Long Skip) (b = 1 / 2 / 3 / 4) |
|---|---|---|---|---|---|
| Add | None | \ | 0.720 / 0.753 / 0.771 / 0.782 | 0.740 / 0.786 / 0.797 / 0.808 | 0.895 / 0.920 / 0.925 / 0.928 |
| Concatenation | None | \ | 0.725 / 0.749 / 0.772 / 0.779 | 0.742 / 0.782 / 0.793 / 0.798 | 0.897 / 0.909 / 0.925 / 0.939 |
| MS-GAU | Partially | Modulation | 0.751 / 0.774 / 0.788 / 0.795 | 0.766 / 0.803 / 0.815 / 0.819 | 0.917 / 0.926 / 0.937 / 0.941 |
| MS-SENet | Partially | Refinement | 0.752 / 0.780 / 0.790 / 0.798 | 0.765 / 0.799 / 0.814 / 0.820 | 0.915 / 0.929 / 0.940 / 0.940 |
| MS-SA | Fully | Modulation | 0.756 / 0.779 / 0.790 / 0.798 | 0.761 / 0.801 / 0.814 / 0.822 | 0.920 / 0.932 / 0.938 / 0.941 |
| AFF (ours) | Fully | Selection | 0.756 / 0.784 / 0.794 / 0.801 | 0.763 / 0.804 / 0.816 / 0.826 | 0.924 / 0.935 / 0.939 / 0.944 |
| iAFF (ours) | Fully | Selection | 0.774 / 0.801 / 0.808 / 0.814 | 0.772 / 0.807 / 0.822 / — | 0.927 / 0.938 / 0.945 / 0.953 |
[...] and another level of attentional feature fusion can further improve the performance. However, this improvement may be obtained at the cost of increasing the difficulty of optimization. We notice that when the network depth increases as b changes from 3 to 4, the performance of iAFF-ResNet did not improve but degraded.

5.1.3 Impact on Localization and Small Objects

To study the impact of the proposed MS-CAM on object localization and small object recognition, we apply Grad-CAM [32] to ResNet-50, SENet-50, and AFF-ResNet-50 and visualize the results for images from the ImageNet dataset, which are illustrated in Fig. 6. Given a specific class, the Grad-CAM results clearly show the network's attended regions. Here, we show the heatmaps of the predicted class, and a wrongly predicted image is marked with a cross symbol. The predicted class names and their softmax scores are also shown at the bottom of the heatmaps.

From the upper part of Fig. 6, it can be seen clearly that the attended regions of the AFF-ResNet-50 highly overlap with the labeled objects, which shows that it learns well to localize objects and exploit the features in object regions. On the contrary, the localization capacity of the baseline ResNet-50 is relatively poor, misplacing the center of the attended regions in many cases. Although SENet-50 is able to locate the true objects, the attended regions are over-large, including many background components. This is because SENet-50 only utilizes the global channel attention, which is biased towards the context of a global scale, whereas the proposed MS-CAM also aggregates the local channel context, which helps the network to attend to the objects with less background clutter and is also beneficial to small object recognition. In the bottom half of Fig. 6, we can clearly see that AFF-ResNet-50 predicts correctly on the small-scale objects, while ResNet-50 fails in most cases.

Figure 6: Network visualization with Grad-CAM. The comparison results suggest that the proposed MS-CAM is beneficial to object localization and small object recognition. [The figure shows input images and Grad-CAM heatmaps from ResNet-50, SENet-50, and AFF-ResNet-50 for ImageNet classes such as Backpack, Basketball, Bathing Cap, Bee, Goldfish, Koala, Screwdriver, and Volleyball (upper half) and Ant, Chain Saw, Hamster, iPod, Lipstick, Plastic Bag, Scorpion, and Tick (lower half), with the predicted class and softmax score under each heatmap and wrong predictions marked.]

5.2. Comparison with State-of-the-Art Networks

To show that the network performance can be improved by replacing the original fusion operations with the proposed attentional feature fusion, we compare the AFF and iAFF modules with other attention modules based on the same host networks in different feature fusion scenarios. Fig. 7 illustrates the comparison results with a gradual increase in network depth for all networks. It can be seen that: 1) Comparing SKNet / SENet / GAU-FPN with AFF-InceptionNet / AFF-ResNet / AFF-FPN, our AFF- or iAFF-integrated networks are better in all scenarios, which shows that our (iterative) attentional feature fusion approach not only has superior performance, but also good generality.
Figure 7: Comparison with the baselines and other state-of-the-art networks with a gradual increase of network depth. [Three panels plot Accuracy against Network Parameters: (a) InceptionNet (same layer) on CIFAR-100, with SKNet, AFF-InceptionNet, and iAFF-InceptionNet; (b) ResNet (short skip connection) on CIFAR-100, with SENet, AFF-ResNet, and iAFF-ResNet; (c) FPN (long skip connection) on StopSign, with GAU-FPN, AFF-FPN, and iAFF-FPN.]
Table 5: Comparison on ImageNet.

| Architecture | top-1 err. | Params |
|---|---|---|
| ResNet-101 [12] | 23.2 | 42.5 M |
| Efficient-Channel-Attention-Net-101 [42] | 21.4 | 42.5 M |
| Attention-Augmented-ResNet-101 [1] | 21.3 | 45.4 M |
| SENet-101 [16] | 20.9 | 49.4 M |
| Gather-Excite-θ+-ResNet-101 [15] | 20.7 | 58.4 M |
| Local-Importance-Pooling-ResNet-101 [10] | 20.7 | 42.9 M |
| AFF-ResNet-50 (ours) | 20.9 | 30.3 M |
| AFF-ResNeXt-50-32x4d (ours) | 20.8 | 29.9 M |
| iAFF-ResNet-50 (ours) | 20.4 | 35.1 M |
| iAFF-ResNeXt-50-32x4d (ours) | 20.2 | 34.7 M |

Acknowledgement

The authors would like to thank the editor and anonymous reviewers for their helpful comments and suggestions, and also thank @takedarts on GitHub for pointing out the bug in our CIFAR-100 code. This work was supported in part by the National Natural Science Foundation of China under Grant No. 61573183, the Open Project Program of the National Laboratory of Pattern Recognition (NLPR) under Grant No. 201900029, the Nanjing University of Aeronautics and Astronautics PhD short-term visiting scholar project under Grant No. 180104DF03, the Excellent Chinese and Foreign Youth Exchange Program, China Association for Science and Technology, and the China Scholarship Council under Grant No. 201806830039.
References

[1] Irwan Bello, Barret Zoph, Ashish Vaswani, Jonathon Shlens, and Quoc V. Le. Attention augmented convolutional networks. In 2019 IEEE International Conference on Computer Vision (ICCV), Seoul, Korea (South), pages 3286–3295, October 2019.

[2] Liang-Chieh Chen, Yi Yang, Jiang Wang, Wei Xu, and Alan L. Yuille. Attention to scale: Scale-aware semantic image segmentation. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, pages 3640–3649, 2016.

[3] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. In Neural Information Processing Systems, Workshop on Machine Learning Systems, volume abs/1512.01274, 2015.

[4] Xiao Chu, Wei Yang, Wanli Ouyang, Cheng Ma, Alan L. Yuille, and Xiaogang Wang. Multi-context attention for human pose estimation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, pages 5669–5678. IEEE Computer Society, 2017.

[5] Deng-Ping Fan, Wenguan Wang, Ming-Ming Cheng, and Jianbing Shen. Shifting more attention to video salient object detection. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, pages 8554–8564. Computer Vision Foundation / IEEE, 2019.

[6] Yang Feng, Deqian Kong, Ping Wei, Hongbin Sun, and Nanning Zheng. A benchmark dataset and multi-scale attention network for semantic traffic light detection. In 2019 IEEE Intelligent Transportation Systems Conference (ITSC), Auckland, New Zealand, pages 1–8. IEEE, 2019.

[7] Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing Lu. Dual attention network for scene segmentation. In 2019 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, pages 3146–3154, 2019.

[8] Keren Fu, Deng-Ping Fan, Ge-Peng Ji, and Qijun Zhao. JL-DCF: Joint learning and densely-cooperative fusion framework for RGB-D salient object detection. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, pages 3049–3059. IEEE, 2020.

[9] Keren Fu, Qijun Zhao, Irene Yu-Hua Gu, and Jie Yang. Deepside: A general deep framework for salient object detection. Neurocomputing, 356:69–82, Sep 2019.

[10] Ziteng Gao, Limin Wang, and Gangshan Wu. LIP: Local importance-based pooling. In 2019 IEEE International Conference on Computer Vision (ICCV), Seoul, Korea (South), pages 3354–3363. IEEE, 2019.

[11] Bharath Hariharan, Pablo Andrés Arbeláez, Ross B. Girshick, and Jitendra Malik. Hypercolumns for object segmentation and fine-grained localization. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, pages 447–456. IEEE Computer Society, 2015.

[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, pages 770–778, 2016.

[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In 14th European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, pages 630–645, 2016.

[14] Tong He, Zhi Zhang, Hang Zhang, Zhongyue Zhang, Junyuan Xie, and Mu Li. Bag of tricks for image classification with convolutional neural networks. In 2019 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, pages 558–567, 2019.

[15] Jie Hu, Li Shen, Samuel Albanie, Gang Sun, and Andrea Vedaldi. Gather-excite: Exploiting feature context in convolutional neural networks. In Annual Conference on Neural Information Processing Systems (NeurIPS) 2018, Montréal, Canada, pages 9423–9433, 2018.

[16] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, pages 7132–7141, 2018.

[17] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, pages 2261–2269, 2017.

[18] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In the 32nd International Conference on Machine Learning (ICML), Lille, France, pages 448–456, 2015.

[19] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

[20] Hanchao Li, Pengfei Xiong, Jie An, and Lingxue Wang. Pyramid attention network for semantic segmentation. In British Machine Vision Conference (BMVC) 2018, Newcastle, UK, pages 1–13, 2018.

[21] Xiang Li, Wenhai Wang, Xiaolin Hu, and Jian Yang. Selective kernel networks. In 2019 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, pages 510–519, 2019.

[22] Yanghao Li, Yuntao Chen, Naiyan Wang, and Zhao-Xiang Zhang. Scale-aware trident networks for object detection. In 2019 IEEE International Conference on Computer Vision (ICCV), Seoul, Korea (South), pages 6053–6062, 2019.

[23] Tsung-Yi Lin, Piotr Dollár, Ross B. Girshick, Kaiming He, Bharath Hariharan, and Serge J. Belongie. Feature pyramid networks for object detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, pages 936–944, 2017.

[24] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In 13th European Conference on Computer Vision (ECCV), Zurich, Switzerland, pages 740–755, Cham, 2014.

[25] Wei Liu, Andrew Rabinovich, and Alexander C. Berg. ParseNet: Looking wider to see better. CoRR, abs/1506.04579, 2015.
[26] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, pages 3431–3440, 2015.

[27] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In the 27th International Conference on Machine Learning (ICML), Haifa, Israel, ICML'10, pages 807–814, USA, 2010.

[28] Jongchan Park, Sanghyun Woo, Joon-Young Lee, and In So Kweon. BAM: Bottleneck attention module. In British Machine Vision Conference (BMVC) 2018, Newcastle, UK, pages 1–14, 2018.

[29] Shaoqing Ren, Kaiming He, Ross B. Girshick, Xiangyu Zhang, and Jian Sun. Object detection networks on convolutional feature maps. IEEE Trans. Pattern Anal. Mach. Intell., 39(7):1476–1481, 2017.

[30] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In 18th International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Munich, Germany, pages 234–241, 2015.

[31] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

[32] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. International Journal of Computer Vision, 128(2):336–359, 2020.

[33] Bharat Singh and Larry S. Davis. An analysis of scale invariance in object detection - SNIP. In 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, pages 3578–3587, June 2018.

[34] Bharat Singh, Mahyar Najibi, and Larry S. Davis. SNIPER: Efficient multi-scale training. In Annual Conference on Neural Information Processing Systems (NeurIPS) 2018, Montréal, Canada, pages 9333–9343, 2018.

[35] Ashish Sinha and Jose Dolz. Multi-scale self-guided attention for medical image segmentation. IEEE Journal of Biomedical and Health Informatics, pages 1–14, Apr 2020.

[36] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Training very deep networks. In Annual Conference on Neural Information Processing Systems (NeurIPS) 2015, Montreal, Quebec, Canada, pages 2377–2385, 2015.

[37] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A. Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, California, USA, pages 4278–4284, 2017.

[38] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, pages 1–9, 2015.

[39] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, pages 2818–2826, 2016.

[40] Andrew Tao, Karan Sapra, and Bryan Catanzaro. Hierarchical multi-scale attention for semantic segmentation. arXiv preprint arXiv:2005.10821, 2020.

[41] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Annual Conference on Neural Information Processing Systems (NeurIPS) 2017, Long Beach, CA, USA, pages 5998–6008, 2017.

[42] Qilong Wang, Banggu Wu, Pengfei Zhu, Peihua Li, Wangmeng Zuo, and Qinghua Hu. ECA-Net: Efficient channel attention for deep convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, pages 11534–11542, 2020.

[43] Wenguan Wang, Shuyang Zhao, Jianbing Shen, Steven C. H. Hoi, and Ali Borji. Salient object detection with pyramid attention and salient edges. In 2019 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, pages 1448–1457. Computer Vision Foundation / IEEE, 2019.

[44] Xiaolong Wang, Ross B. Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, pages 7794–7803, 2018.

[45] Yi Wang, Haoran Dou, Xiaowei Hu, Lei Zhu, Xin Yang, Ming Xu, Jing Qin, Pheng-Ann Heng, Tianfu Wang, and Dong Ni. Deep attentive features for prostate segmentation in 3D transrectal ultrasound. IEEE Transactions on Medical Imaging, 38(12):2768–2778, Apr 2019.

[46] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. CBAM: Convolutional block attention module. In 15th European Conference on Computer Vision (ECCV), Munich, Germany, pages 3–19, 2018.

[47] Saining Xie, Ross B. Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, pages 5987–5995, 2017.

[48] Weitao Yuan, Shengbei Wang, Xiangrui Li, Masashi Unoki, and Wenwu Wang. A skip attention mechanism for monaural singing voice separation. IEEE Signal Processing Letters, 26(10):1481–1485, 2019.

[49] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Richard C. Wilson, Edwin R. Hancock, and William A. P. Smith, editors, British Machine Vision Conference (BMVC) 2016, York, UK. BMVA Press, 2016.

[50] Hongyi Zhang, Moustapha Cissé, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In 6th International Conference on Learning Representations (ICLR), Vancouver, BC, Canada. OpenReview.net, 2018.

[51] Hang Zhang, Chongruo Wu, Zhongyue Zhang, Yi Zhu, Zhi Zhang, Haibin Lin, Yue Sun, Tong He, Jonas Mueller, R. Manmatha, Mu Li, and Alexander Smola. ResNeSt: Split-attention networks. arXiv e-prints, page arXiv:2004.08955, Apr. 2020.
[52] Shifeng Zhang, Xiangyu Zhu, Zhen Lei, Hailin Shi, Xiaobo Wang, and Stan Z. Li. S3FD: Single shot scale-invariant face detector. In 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, Oct 2017.

Appendix

Implementation Details

All network architectures in this work are implemented based on MXNet [3] and GluonCV [14]. Since most of the experimental architectures cannot take advantage of pre-trained weights, each implementation is trained from scratch for fairness. We have introduced most of the experimental settings in Table 2 of the manuscript. Here, in the supplemental document, we introduce the remaining settings that were not mentioned before.

For the experiments on the CIFAR-100 dataset, the weight decay is 1e-4, and we decay the learning rate by a factor of 0.1 at epochs 300 and 350.

For the experiments on ImageNet, we use the label smoothing trick and a cosine annealing schedule for the learning rate, without weight decay.

For the semantic segmentation experiment, the StopSign dataset is a subset of the COCO dataset [24], which has a large scale variation issue, as shown in Fig. 8. We use the cross entropy as the loss function and the mean intersection over union (mIoU) as the evaluation metric.

It should be noted that the proposed networks in Table 5 and Table 6 are trained with mixup [50]. The rest of the experiments, including all the ablation studies and the experimental results in Figure 7 (in the manuscript), are trained without mixup.

Local and Global Fusion Strategies

We also investigate the fusion strategy for the local and global contexts inside the attention module. We explored four strategies, as shown in Fig. 9, in which:

1. Half-AFF, AFF, and Iterative AFF apply addition to fuse the local and global contexts, which allocates the same weight (a constant 0.5) to the local and global contexts.

2. Concat-AFF concatenates the local and global contexts followed by a point-wise convolution, in which the fusing weights are learned during training and fixed after training.

3. Recursive AFF allocates dynamic fusion weights to the local and global contexts during inference based on the proposed MS-CAM.

Table 6 provides the experimental results of these modules on CIFAR-100, from which it can be seen that the iterative AFF (iAFF) module presented in the manuscript achieves the best performance. On the contrary, the Recursive AFF, which can dynamically allocate fusion weights for the local and global contexts, is almost the worst among these modules. We believe the reason is that Recursive AFF has two successive nested Sigmoid functions (see Fig. 9(d)), which increases the difficulty of optimization due to the Sigmoid's saturating form, whereas the iterative AFF presented in the manuscript does not suffer from this problem.

AFF and Concat-AFF have very similar performance. Therefore, for simplicity, we choose the squeeze-and-excitation form (the current MS-CAM module) instead of the Inception-style form (Concat-AFF) for the proposed attentional feature fusion. In future work, we will investigate their performance difference on larger datasets like ImageNet. However, this point is not the main issue that we would like to discuss in the manuscript, so we did not include this part there.

Analysis of FLOPs

The point-wise convolutions inside our multi-scale channel attention module bring additional FLOPs, but at a marginal level, not a significant magnitude. The FLOPs of our AFF-ResNet-50 are 4.3 GFLOPs, and the FLOPs of ResNet-50 in our implementation are 4.1 GFLOPs. Actually, depending on how many tricks are used in ResNet, the FLOPs of ResNet-50 can vary from 3.9 GFLOPs to 4.3 GFLOPs [14]. Therefore, taking ResNet-50 vs. our AFF-ResNet-50 as an example, integrating the AFF module only brings an additional 4.88% FLOPs, from 4.1 GFLOPs to 4.3 GFLOPs. Considering the performance boost by the AFF module, we think the additional 4.88% FLOPs is a good trade-off.

Given an output channel number C and an output feature map of size H × W, if the input and output channel numbers are the same, the FLOPs of a 3 × 3 convolution layer are 18C²HW (multiplications and additions), and a ResBlock consists of two or three convolution layers. Meanwhile, the FLOPs of the two point-wise convolutions of a bottleneck structure are (2/r)C²HW, where r = 4 or r = 16 depending on the dataset and network. Therefore, compared with the FLOPs of the convolutions in the host network, the FLOPs brought by the AFF module are marginal.

In Table 7, we list the FLOPs of the convolutions in a BasicResBlock / BottleneckResBlock, the FLOPs of the point-wise convolutions in our AFF module, and the relative increase in percentage. It can be seen that the maximum additional FLOPs brought by the AFF module is around 7.7% if we use the AFF module in every ResBlock from beginning to end. However, it is not necessary to replace every ResBlock with an AFF-ResBlock. In our AFF-ResNet, we do this replacement [...]
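As a quick sanity check of the bookkeeping above (an illustrative computation only; the exact per-block numbers in Table 7 depend on the block type, stage resolution, and where AFF is inserted):

```python
# FLOPs of one 3x3 convolution with equal input/output channels: ~18 * C^2 * H * W.
# FLOPs of the two point-wise convolutions in the MS-CAM bottleneck: ~(2 / r) * C^2 * H * W.
def conv3x3_flops(c: int, h: int, w: int) -> float:
    return 18.0 * c * c * h * w


def mscam_pw_flops(c: int, h: int, w: int, r: int = 4) -> float:
    return (2.0 / r) * c * c * h * w


c, h, w, r = 64, 32, 32, 4
basic_block = 2 * conv3x3_flops(c, h, w)     # a basic ResBlock: two 3x3 convolutions
extra = mscam_pw_flops(c, h, w, r)           # point-wise convolutions added by one AFF module
print(f"relative increase per basic block: {100 * extra / basic_block:.2f}%")  # ~1.39% here

# Network-level figure quoted above for AFF-ResNet-50 vs. ResNet-50:
print(f"overall increase: {100 * (4.3 - 4.1) / 4.1:.2f}%")  # ~4.88%
```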
Figure 8: Illustration for the StopSign dataset
Table 6: Results for the ablation study on the fusion manner of the local and global channel contexts on CIFAR-100
Figure 9: Architectures for the ablation study on the fusion manner of the local and global channel contexts.
Table 7: Additional FLOPs brought by the proposed AFF module in an AFF-ResBlock.