Attentional Feature Fusion
Yimian Dai1 Fabian Gieseke2,3 Stefan Oehmcke3 Yiquan Wu1 Kobus Barnard4
1 College of Electronic and Information Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, China
2 Department of Information Systems, University of Münster, Münster, Germany
3 Department of Computer Science, University of Copenhagen, Copenhagen, Denmark
4 Department of Computer Science, University of Arizona, Tucson, AZ, USA
Abstract

Feature fusion, the combination of features from different layers or branches, is an omnipresent part of modern network architectures. It is often implemented via simple operations, such as summation or concatenation, but this might not be the best choice. In this work, we propose a uniform and general scheme, namely attentional feature fusion, which is applicable for most common scenarios, including feature fusion induced by short and long skip connections as well as within Inception layers. To better fuse features of inconsistent semantics and scales, we propose a multi-scale channel attention module, which addresses issues that arise when fusing features given at different scales. We also demonstrate that the initial integration of feature maps can become a bottleneck and that this issue can be alleviated by adding another level of attention, which we refer to as iterative attentional feature fusion. With fewer layers or parameters, our models outperform state-of-the-art networks on both the CIFAR-100 and ImageNet datasets, which suggests that more sophisticated attention mechanisms for feature fusion hold great potential to consistently yield better results compared to their direct counterparts. Our codes and trained models are available online at https://github.com/YimianDai/open-aff.

1. Introduction

Convolutional neural networks (CNNs) have seen a significant improvement of their representation power by going deeper [12], going wider [38, 49], increasing cardinality [47], and refining features dynamically [16], corresponding to advances in many computer vision tasks.

Apart from these strategies, in this paper, we investigate a different component of the network, feature fusion, to further boost the representation power of CNNs. Whether explicit or implicit, intentional or unintentional, feature fusion is omnipresent in modern network architectures and has been studied extensively in the previous literature [38, 36, 12, 30, 23]. For instance, in the InceptionNet family [38, 39, 37], the outputs of filters with multiple sizes on the same level are fused to handle the large variation of object size. In Residual Networks (ResNet) [12, 13] and its follow-ups [49, 47], the identity mapping features and residual learning features are fused as the output via short skip connections, enabling the training of very deep networks. In Feature Pyramid Networks (FPN) [23] and U-Net [30], low-level features and high-level features are fused via long skip connections to obtain high-resolution and semantically strong features, which are vital for semantic segmentation and object detection. However, despite its prevalence in modern networks, most works on feature fusion focus on constructing sophisticated pathways to combine features in different kernels, groups, or layers. The feature fusion method itself has rarely been addressed and is usually implemented via simple operations such as addition or concatenation, which merely offer a fixed linear aggregation of feature maps and are entirely unaware of whether this combination is suitable for specific objects.

Recently, Selective Kernel Networks (SKNet) [21] and ResNeSt [51] have been proposed to render a dynamic weighted averaging of features from multiple kernels or groups in the same layer based on the global channel attention mechanism [16]. Although such attention-based methods present nonlinear approaches for feature fusion, they still suffer from the following shortcomings:

1. Limited scenarios: SKNet and ResNeSt only focus on the soft feature selection in the same layer, whereas the cross-layer fusion in skip connections has not been addressed, leaving their schemes quite heuristic. Despite having different scenarios, all kinds of feature fusion implementations face the same challenge in essence, that is, how to integrate features of different scales for better performance. A module that can overcome the semantic inconsistency and effectively integrate features of different scales should be able to consistently
improve the quality of fused features in various network scenarios. However, so far, there is still a lack of a generalized approach that can unify different feature fusion scenarios in a consistent manner.

2. Unsophisticated initial integration: To feed the received features into the attention module, SKNet introduces another phase of feature fusion in an involuntary but inevitable way, which we call initial integration and which is implemented by addition. Therefore, besides the design of the attention module, the initial integration approach, as its input, also has a large impact on the quality of the fusion weights. Considering that the features may have a large inconsistency in scale and semantic level, an unsophisticated initial integration strategy that ignores this issue can become a bottleneck.

3. Biased context aggregation scale: The fusion weights in SKNet and ResNeSt are generated via the global channel attention mechanism [16], which is preferred for information that distributes more globally. However, objects in the image can have an extremely large variation in size. Numerous studies have emphasized this issue when designing CNNs, i.e., that the receptive fields of predictors should match the object scale range [52, 33, 34, 22]. Therefore, merely aggregating contextual information on a global scale is too biased and weakens the features of small objects. This gives rise to the question of whether a network can dynamically and adaptively fuse the received features in a contextual scale-aware way.

Motivated by the above observations, we present the attentional feature fusion (AFF) module, trying to answer the question of what a unified approach for all kinds of feature fusion scenarios should look like, and to address the problems of contextual aggregation and initial integration. The AFF framework generalizes attention-based feature fusion from the same-layer scenario to cross-layer scenarios including short and long skip connections, and even the initial integration inside AFF itself. It provides a universal and consistent way to improve the performance of various networks, e.g., InceptionNet, ResNet, ResNeXt [47], and FPN, by simply replacing existing feature fusion operators with the proposed AFF module. Moreover, the AFF framework supports gradually refining the initial integration, namely the input of the fusion weight generator, by iteratively integrating the received features with another AFF module, which we refer to as iterative attentional feature fusion (iAFF).

To alleviate the problems arising from scale variation and small objects, we advocate the idea that attention modules should also aggregate contextual information from different receptive fields for objects of different scales. More specifically, we propose the Multi-Scale Channel Attention Module (MS-CAM), a simple yet effective scheme to remedy the feature inconsistency across different scales for attentional feature fusion. Our key observation is that scale is not an issue exclusive to spatial attention, and that channel attention can also have scales other than the global one by varying the spatial pooling size. By aggregating the multi-scale context information along the channel dimension, MS-CAM can simultaneously emphasize large objects that distribute more globally and highlight small objects that distribute more locally, facilitating the network to recognize and detect objects under extreme scale variation.

2. Related Work

2.1. Multi-scale Attention Mechanism

The scale variation of objects is one of the key challenges in computer vision. To remedy this issue, an intuitive way is to leverage multi-scale image pyramids [29, 2], in which objects are recognized at multiple scales and the predictions are combined using non-maximum suppression. The other line of effort aims to exploit the inherent multi-scale, hierarchical feature pyramid of CNNs to approximate image pyramids, in which features from multiple layers are fused to obtain semantic features with high resolutions [11, 30, 23].

The attention mechanism in deep learning, which mimics the human visual attention mechanism [5, 8], was originally developed on a global scale. For example, the matrix multiplication in self-attention draws global dependencies of each word in a sentence [41] or each pixel in an image [7, 44, 1]. The Squeeze-and-Excitation Networks (SENet) squeeze global spatial information into a channel descriptor to capture channel-wise dependencies [16]. Recently, researchers have started to take the scale issue of attention mechanisms into account. Similar to the above-mentioned approaches handling scale variation in CNNs, multi-scale attention mechanisms are achieved by either feeding multi-scale features into an attention module or combining feature contexts of multiple scales inside an attention module. In the first type, the features at multiple scales or their concatenated result are fed into the attention module to generate multi-scale attention maps, while the scale of feature context aggregation inside the attention module remains single [2, 4, 45, 6, 35, 40]. The second type, which is also referred to as multi-scale spatial attention, aggregates feature contexts by convolutional kernels of different sizes [20] or from a pyramid [20, 43] inside the attention module.

The proposed MS-CAM follows the idea of ParseNet [25] in combining local and global features in CNNs, and the idea of spatial attention in aggregating multi-scale feature contexts inside the attention module, but differs in at least two important aspects: 1) MS-CAM puts forward the scale issue in channel attention and is realized by point-wise convolutions rather than kernels of different sizes. 2) Instead of operating in the backbone network, MS-CAM aggregates local and global feature contexts inside the channel attention module.
To the best of our knowledge, multi-scale channel attention has never been discussed before.

2.2. Skip Connections in Deep Learning

Skip connections have been an essential component in modern convolutional networks. Short skip connections, namely the identity mapping shortcuts added inside Residual blocks, provide an alternative path for the gradient to flow without interruption during backpropagation [12, 47, 49]. Long skip connections help the network to obtain semantic features with high resolutions by bridging features of finer details from lower layers and high-level semantic features of coarse resolutions [17, 23, 30, 26]. Despite being used to combine features in various pathways [9], the fusion of connected features is usually implemented via addition or concatenation, which allocate fixed weights to the features regardless of the variance of their contents. Recently, a few attention-based methods, e.g., Global Attention Upsample (GAU) [20] and Skip Attention (SA) [48], have been proposed to use high-level features as guidance to modulate the low-level features in long skip connections. However, the fusion weights for the modulated features are still fixed.

To the best of our knowledge, it is the Highway Networks that first introduced a selection mechanism in short skip connections [36]. To some extent, the attentional skip connections proposed in this paper can be viewed as their follow-up, but differ in three points: 1) Highway Networks employ a simple fully connected layer that can only generate a scalar fusion weight, while our proposed MS-CAM generates fusion weights of the same size as the feature maps, enabling dynamic soft selections in an element-wise way. 2) Highway Networks only use one input feature to generate the weight, while our AFF module is aware of both features. 3) We point out the importance of initial feature integration, and the iAFF module is proposed as a solution.

3. Multi-scale Channel Attention

3.1. Revisiting Channel Attention in SENet

Given an intermediate feature X ∈ R^{C×H×W} with C channels and feature maps of size H × W, the channel attention weights w ∈ R^C in SENet can be computed as

w = σ(g(X)) = σ(B(W2 δ(B(W1 g(X))))),   (1)

where g(X) ∈ R^C denotes the global feature context and g(X) = 1/(H×W) Σ_{i=1}^{H} Σ_{j=1}^{W} X[:, i, j] is the global average pooling (GAP). δ denotes the Rectified Linear Unit (ReLU) [27], B denotes Batch Normalization (BN) [18], and σ is the Sigmoid function.

The GAP squeezes each feature map of size H × W into a scalar. This extremely coarse descriptor prefers to emphasize large objects that distribute globally and can potentially wipe out most of the image signal present in a small object. However, detecting very small objects stands out as the key performance bottleneck of state-of-the-art networks [34]. For example, the difficulty of COCO is largely due to the fact that most object instances are smaller than 1% of the image area [24, 33]. Therefore, global channel attention might not be the best choice. Multi-scale feature contexts should be aggregated inside the attention module to alleviate the problems arising from scale variation and small object instances.

3.2. Aggregating Local and Global Contexts

In this part, we depict the proposed multi-scale channel attention module (MS-CAM) in detail. The key idea is that the channel attention can be implemented at multiple scales by varying the spatial pooling size. To keep it as lightweight as possible, we merely add the local context to the global context inside the attention module. We choose the point-wise convolution (PWConv) as the local channel context aggregator, which only exploits point-wise channel interactions at each spatial position. To save parameters, the local channel context L(X) ∈ R^{C×H×W} is computed via a bottleneck structure as follows:

L(X) = B(PWConv2(δ(B(PWConv1(X))))).   (2)

The kernel sizes of PWConv1 and PWConv2 are C/r × C × 1 × 1 and C × C/r × 1 × 1, respectively. It is noteworthy that L(X) has the same shape as the input feature, which can preserve and highlight the subtle details in the low-level features. Given the global channel context g(X) and the local channel context L(X), the refined feature X′ ∈ R^{C×H×W} produced by MS-CAM can be obtained as follows:

X′ = X ⊗ M(X) = X ⊗ σ(L(X) ⊕ g(X)),   (3)

where M(X) ∈ R^{C×H×W} denotes the attentional weights generated by MS-CAM, ⊕ denotes the broadcasting addition, and ⊗ denotes the element-wise multiplication.

[Figure: schema of MS-CAM. The input X (C × H × W) passes through a global branch (GlobalAvgPooling followed by a point-wise convolution bottleneck with BN and ReLU, yielding a C × 1 × 1 context) and a local branch (the same point-wise bottleneck without pooling, yielding a C × H × W context); the two contexts are added.]
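For concreteness, Eqs. (2) and (3) can be sketched in a few lines of code. The authors' release is based on MXNet/GluonCV, so the PyTorch framework, class name, and default reduction ratio below are illustrative assumptions rather than the exact implementation:

```python
import torch
import torch.nn as nn


class MSCAM(nn.Module):
    """Multi-scale channel attention: local (point-wise) plus global (GAP) contexts."""

    def __init__(self, channels: int, r: int = 4):
        super().__init__()
        inter = max(channels // r, 1)
        # Global channel context g(X): GAP, then a point-wise bottleneck (cf. Eq. (1)).
        self.global_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, inter, kernel_size=1),
            nn.BatchNorm2d(inter),
            nn.ReLU(inplace=True),
            nn.Conv2d(inter, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
        )
        # Local channel context L(X): the same point-wise bottleneck without pooling,
        # so the output keeps the C x H x W shape (Eq. (2)).
        self.local_att = nn.Sequential(
            nn.Conv2d(channels, inter, kernel_size=1),
            nn.BatchNorm2d(inter),
            nn.ReLU(inplace=True),
            nn.Conv2d(inter, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = self.global_att(x)       # (N, C, 1, 1), broadcast over H x W
        l = self.local_att(x)        # (N, C, H, W)
        return self.sigmoid(l + g)   # M(X) = sigma(L(X) (+) g(X)), Eq. (3)


# Refinement as in Eq. (3): X' = X (x) M(X)
# x_refined = x * MSCAM(channels=64)(x)
```

Used as a standalone refinement block, the weights are simply multiplied onto the input; in Section 4 the same weight map instead gates the fusion of two inputs.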
4. Attentional Feature Fusion

4.1. Unification of Feature Fusion Scenarios

Given two feature maps X, Y ∈ R^{C×H×W}, by default, we assume Y is the feature map with the larger receptive field. More specifically,

1. same-layer scenario: X is the output of a 3 × 3 kernel and Y is the output of a 5 × 5 kernel in InceptionNet;

2. short skip connection scenario: X is the identity mapping, and Y is the learned residual in a ResNet block;

3. long skip connection scenario: X is the low-level feature map, and Y is the high-level semantic feature map in a feature pyramid.

Based on the multi-scale channel attention module M, Attentional Feature Fusion (AFF) can be expressed as

Z = M(X ⊎ Y) ⊗ X + (1 − M(X ⊎ Y)) ⊗ Y,   (4)

where Z ∈ R^{C×H×W} is the fused feature and ⊎ denotes the initial feature integration. In this subsection, for the sake of simplicity, we choose the element-wise summation as the initial integration. The AFF is illustrated in Fig. 2(a), where the dashed line denotes 1 − M(X ⊎ Y). It should be noted that the fusion weights M(X ⊎ Y) consist of real numbers between 0 and 1, and so does 1 − M(X ⊎ Y), which enables the network to conduct a soft selection or weighted averaging between X and Y.

Figure 2: Illustration of the proposed AFF and iAFF.

We summarize different formulations of feature fusion in deep networks in Table 1. G denotes the global attention mechanism. Although there are many implementation differences among the approaches for various feature fusion scenarios, once they are abstracted into mathematical forms, these differences in details disappear. Therefore, it is possible to unify these feature fusion scenarios with a carefully designed approach, thereby improving the performance of all networks by replacing the original fusion operations with this unified approach.

From Table 1, it can be further seen that, apart from the implementation of the weight generation module G, the state-of-the-art fusion schemes mainly differ in two crucial points: (a) Context-awareness level. Linear approaches like addition and concatenation are entirely context-unaware. Feature refinement and modulation are non-linear, but only partially aware of the input feature maps; in most cases, they only exploit the high-level feature map. Fully context-aware approaches utilize both input feature maps for guidance at the cost of raising the initial integration issue. (b) Refinement vs. modulation vs. selection. The sum of the weights applied to the two feature maps in soft selection approaches is bound to 1, while this is not the case for refinement and modulation.

Figure 3: The schema of the proposed AFF-Inception module, AFF-ResBlock, and AFF-FPN. The blue and red lines denote channel expansion and upsampling, respectively.

4.2. Iterative Attentional Feature Fusion

Unlike partially context-aware approaches [20], fully context-aware methods have an inevitable issue, namely how to initially integrate the input features. As the input of the attention module, the initial integration quality may profoundly affect the final fusion weights. Since it is still a feature fusion problem, an intuitive way is to have another attention module fuse the input features. We call this two-stage approach iterative Attentional Feature Fusion (iAFF), which is illustrated in Fig. 2(b). The initial integration X ⊎ Y in Eq. (4) can then be reformulated as

X ⊎ Y = M(X + Y) ⊗ X + (1 − M(X + Y)) ⊗ Y.   (5)

4.3. Examples: InceptionNet, ResNet, and FPN

To validate the proposed AFF/iAFF as a uniform and general scheme, we choose ResNet, FPN, and InceptionNet as examples for the most common scenarios: short and long skip connections as well as same-layer fusion. It is straightforward to apply AFF/iAFF to existing networks by replacing the original addition or concatenation. Specifically, we replace the concatenation in the InceptionNet module as well as the addition in the ResNet block (ResBlock) and FPN to obtain the attentional networks, which we call the AFF-Inception module, AFF-ResBlock, and AFF-FPN, respectively. This replacement and the schemes of our proposed architectures are shown in Fig. 3. The iAFF is a particular case of AFF, so it does not need another illustration.
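A hedged sketch of Eqs. (4) and (5), built on top of the MS-CAM sketch from Section 3 (again PyTorch for illustration; the class and argument names are assumptions, not the authors' released MXNet code):

```python
import torch
import torch.nn as nn


class AFF(nn.Module):
    """Attentional Feature Fusion, Eq. (4): Z = M(X+Y) * X + (1 - M(X+Y)) * Y."""

    def __init__(self, channels: int, r: int = 4):
        super().__init__()
        self.att = MSCAM(channels, r)       # MSCAM as sketched in Section 3

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        w = self.att(x + y)                 # initial integration via element-wise summation
        return w * x + (1.0 - w) * y        # soft selection between X and Y


class iAFF(nn.Module):
    """Iterative AFF: a second attention stage refines the initial integration, Eq. (5)."""

    def __init__(self, channels: int, r: int = 4):
        super().__init__()
        self.att_inner = MSCAM(channels, r)
        self.att_outer = MSCAM(channels, r)

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        w1 = self.att_inner(x + y)
        xi = w1 * x + (1.0 - w1) * y        # refined initial integration X (+) Y
        w2 = self.att_outer(xi)
        return w2 * x + (1.0 - w2) * y      # Eq. (4) applied to the refined integration


# Example: turning a ResBlock into an AFF-ResBlock amounts to replacing
#   out = identity + residual
# with
#   out = aff(identity, residual)
```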
Table 1: A brief overview of different feature fusion strategies in deep networks.

| Context-aware | Type | Formulation | Scenario & Reference | Example |
|---|---|---|---|---|
| None | Addition | X + Y | Short Skip [12, 13], Long Skip [26, 23] | ResNet, FPN |
| None | Concatenation | W_A X_{:,i,j} + W_B Y_{:,i,j} | Same Layer [38], Long Skip [30, 17] | InceptionNet, U-Net |
| Partially | Refinement | X + G(Y) ⊗ Y | Short Skip [16, 15, 46, 28] | SENet |
| Partially | Modulation | G(Y) ⊗ X + Y | Long Skip [20] | GAU |
| Partially | Soft Selection | G(X) ⊗ X + (1 − G(X)) ⊗ Y | Short Skip [36] | Highway Networks |
| Fully | Modulation | G(X, Y) ⊗ X + Y | Long Skip [48] | SA |
| Fully | Soft Selection | G(X + Y) ⊗ X + (1 − G(X + Y)) ⊗ Y | Same Layer [21, 51] | SKNet |
| Fully | Soft Selection | M(X ⊎ Y) ⊗ X + (1 − M(X ⊎ Y)) ⊗ Y | Same Layer, Short Skip, Long Skip | ours |

5. Experiments
Table 2: Experimental settings for the networks integrated with the proposed AFF/iAFF.

| Task | Dataset | Host Network | Fusing Scenario | r | Epochs | Batch Size | Optimizer | Learning Rate | Learning Rate Mode | Initialization |
|---|---|---|---|---|---|---|---|---|---|---|
| Image Classification | CIFAR-100 | Inception-ResNet-20-b | Same Layer | 4 | 400 | 128 | Nesterov | 0.2 | Step, γ = 0.1 | Kaiming |
| Image Classification | CIFAR-100 | ResNet-20-b | Short Skip | 4 | 400 | 128 | Nesterov | 0.2 | Step, γ = 0.1 | Kaiming |
| Image Classification | CIFAR-100 | ResNeXt-38-32x4d | Short Skip | 16 | 400 | 128 | Nesterov | 0.2 | Step, γ = 0.1 | Xavier |
| Image Classification | ImageNet | ResNet-50 | Short Skip | 16 | 160 | 128 | Nesterov | 0.075 | Cosine | Kaiming |
| Semantic Segmentation | StopSign | ResNet-20-b + FPN | Long Skip | 4 | 300 | 32 | AdaGrad | 0.01 | Poly | Kaiming |
Table 3: Comparison of contextual aggregation scales in attentional feature fusion given the same parameter budget. The results suggest that a mix of scales should always be preferred inside the channel attention module.

| Aggregation Scale | InceptionNet on CIFAR-100 (b = 1 / 2 / 3 / 4) | ResNet on CIFAR-100 (b = 1 / 2 / 3 / 4) | ResNet + FPN on StopSign (b = 1 / 2 / 3 / 4) | ResNet on ImageNet |
|---|---|---|---|---|
| Global + Global | 0.735 / 0.766 / 0.775 / 0.789 | 0.754 / 0.796 / 0.811 / 0.821 | 0.911 / 0.923 / 0.936 / 0.939 | 0.777 |
| Local + Local | 0.746 / 0.771 / 0.785 / 0.787 | 0.754 / 0.794 / 0.808 / 0.814 | 0.895 / 0.919 / 0.921 / 0.924 | 0.780 |
| Global + Local | 0.756 / 0.784 / 0.794 / 0.801 | 0.763 / 0.804 / 0.816 / 0.826 | 0.924 / 0.935 / 0.939 / 0.944 | 0.784 |
Table 4: Comparison of the context-awareness level and the feature integration strategy in feature fusion given the same parameter budget. The results suggest that a fully context-aware and selective strategy should always be preferred for feature fusion. If optimization poses no problem, the iterative attentional feature fusion should be adopted without hesitation for better performance.

| Fusion Type | Context | Strategy | InceptionNet (Same Layer) (b = 1 / 2 / 3 / 4) | ResNet (Short Skip) (b = 1 / 2 / 3 / 4) | ResNet + FPN (Long Skip) (b = 1 / 2 / 3 / 4) |
|---|---|---|---|---|---|
| Add | None | \ | 0.720 / 0.753 / 0.771 / 0.782 | 0.740 / 0.786 / 0.797 / 0.808 | 0.895 / 0.920 / 0.925 / 0.928 |
| Concatenation | None | \ | 0.725 / 0.749 / 0.772 / 0.779 | 0.742 / 0.782 / 0.793 / 0.798 | 0.897 / 0.909 / 0.925 / 0.939 |
| MS-GAU | Partially | Modulation | 0.751 / 0.774 / 0.788 / 0.795 | 0.766 / 0.803 / 0.815 / 0.819 | 0.917 / 0.926 / 0.937 / 0.941 |
| MS-SENet | Partially | Refinement | 0.752 / 0.780 / 0.790 / 0.798 | 0.765 / 0.799 / 0.814 / 0.820 | 0.915 / 0.929 / 0.940 / 0.940 |
| MS-SA | Fully | Modulation | 0.756 / 0.779 / 0.790 / 0.798 | 0.761 / 0.801 / 0.814 / 0.822 | 0.920 / 0.932 / 0.938 / 0.941 |
| AFF (ours) | Fully | Selection | 0.756 / 0.784 / 0.794 / 0.801 | 0.763 / 0.804 / 0.816 / 0.826 | 0.924 / 0.935 / 0.939 / 0.944 |
| iAFF (ours) | Fully | Selection | 0.774 / 0.801 / 0.808 / 0.814 | 0.772 / 0.807 / 0.822 / — | 0.927 / 0.938 / 0.945 / 0.953 |
[...] and another level of attentional feature fusion can further improve the performance. However, this improvement may be obtained at the cost of increasing the difficulty of optimization. We notice that when the network depth increases as b changes from 3 to 4, the performance of iAFF-ResNet did not improve but degraded.

5.1.3 Impact on Localization and Small Objects

To study the impact of the proposed MS-CAM on object localization and small object recognition, we apply Grad-CAM [32] to ResNet-50, SENet-50, and AFF-ResNet-50 and visualize the results for images from the ImageNet dataset, which are illustrated in Fig. 6. Given a specific class, the Grad-CAM results clearly show the network's attended regions. Here, we show the heatmaps of the predicted class, and a wrongly predicted image is marked with a cross symbol. The predicted class names and their softmax scores are also shown at the bottom of the heatmaps.

From the upper part of Fig. 6, it can be seen clearly that the attended regions of the AFF-ResNet-50 highly overlap with the labeled objects, which shows that it learns well to localize objects and exploit the features in object regions. On the contrary, the localization capacity of the baseline ResNet-50 is relatively poor, misplacing the center of the attended regions in many cases. Although SENet-50 is able to locate the true objects, the attended regions are over-large, including many background components. This is because SENet-50 only utilizes the global channel attention, which is biased towards the context of a global scale, whereas the proposed MS-CAM also aggregates the local channel context, which helps the network to attend to the objects with less background clutter and is also beneficial to small object recognition. In the bottom half of Fig. 6, we can clearly see that AFF-ResNet-50 predicts correctly on the small-scale objects, while ResNet-50 fails in most cases.

Figure 6: Network visualization with Grad-CAM. The comparison results suggest that the proposed MS-CAM is beneficial to object localization and small object recognition. [The figure shows input images and Grad-CAM heatmaps from ResNet-50, SENet-50, and AFF-ResNet-50 for ImageNet classes such as Backpack, Basketball, Bathing Cap, Bee, Goldfish, Koala, Screwdriver, and Volleyball (upper half) and Ant, Chain Saw, Hamster, iPod, Lipstick, Plastic Bag, Scorpion, and Tick (lower half), with the predicted class and softmax score under each heatmap and wrong predictions marked.]

5.2. Comparison with State-of-the-Art Networks

To show that the network performance can be improved by replacing the original fusion operations with the proposed attentional feature fusion, we compare the AFF and iAFF modules with other attention modules based on the same host networks in different feature fusion scenarios. Fig. 7 illustrates the comparison results with a gradual increase in network depth for all networks. It can be seen that: 1) Comparing SKNet / SENet / GAU-FPN with AFF-InceptionNet / AFF-ResNet / AFF-FPN, our AFF- or iAFF-integrated networks are better in all scenarios, which shows that our (iterative) attentional feature fusion approach not only has superior performance, but also good generality.
Figure 7: Comparison with the baselines and other state-of-the-art networks with a gradual increase of network depth. [Three panels plot Accuracy against Network Parameters: (a) InceptionNet (same layer) on CIFAR-100, with SKNet, AFF-InceptionNet, and iAFF-InceptionNet; (b) ResNet (short skip connection) on CIFAR-100, with SENet, AFF-ResNet, and iAFF-ResNet; (c) FPN (long skip connection) on StopSign, with GAU-FPN, AFF-FPN, and iAFF-FPN.]
Table 5: Comparison on ImageNet.

| Architecture | top-1 err. | Params |
|---|---|---|
| ResNet-101 [12] | 23.2 | 42.5 M |
| Efficient-Channel-Attention-Net-101 [42] | 21.4 | 42.5 M |
| Attention-Augmented-ResNet-101 [1] | 21.3 | 45.4 M |
| SENet-101 [16] | 20.9 | 49.4 M |
| Gather-Excite-θ+-ResNet-101 [15] | 20.7 | 58.4 M |
| Local-Importance-Pooling-ResNet-101 [10] | 20.7 | 42.9 M |
| AFF-ResNet-50 (ours) | 20.9 | 30.3 M |
| AFF-ResNeXt-50-32x4d (ours) | 20.8 | 29.9 M |
| iAFF-ResNet-50 (ours) | 20.4 | 35.1 M |
| iAFF-ResNeXt-50-32x4d (ours) | 20.2 | 34.7 M |

Acknowledgement

The authors would like to thank the editor and anonymous reviewers for their helpful comments and suggestions, and also thank @takedarts on GitHub for pointing out the bug in our CIFAR-100 code. This work was supported in part by the National Natural Science Foundation of China under Grant No. 61573183, the Open Project Program of the National Laboratory of Pattern Recognition (NLPR) under Grant No. 201900029, the Nanjing University of Aeronautics and Astronautics PhD short-term visiting scholar project under Grant No. 180104DF03, the Excellent Chinese and Foreign Youth Exchange Program, China Association for Science and Technology, and the China Scholarship Council under Grant No. 201806830039.
References

[1] Irwan Bello, Barret Zoph, Ashish Vaswani, Jonathon Shlens, and Quoc V. Le. Attention augmented convolutional networks. In 2019 IEEE International Conference on Computer Vision (ICCV), Seoul, Korea (South), pages 3286–3295, October 2019.

[2] Liang-Chieh Chen, Yi Yang, Jiang Wang, Wei Xu, and Alan L. Yuille. Attention to scale: Scale-aware semantic image segmentation. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, pages 3640–3649, 2016.

[3] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. In Neural Information Processing Systems, Workshop on Machine Learning Systems, volume abs/1512.01274, 2015.

[4] Xiao Chu, Wei Yang, Wanli Ouyang, Cheng Ma, Alan L. Yuille, and Xiaogang Wang. Multi-context attention for human pose estimation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, pages 5669–5678. IEEE Computer Society, 2017.

[5] Deng-Ping Fan, Wenguan Wang, Ming-Ming Cheng, and Jianbing Shen. Shifting more attention to video salient object detection. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, pages 8554–8564. Computer Vision Foundation / IEEE, 2019.

[6] Yang Feng, Deqian Kong, Ping Wei, Hongbin Sun, and Nanning Zheng. A benchmark dataset and multi-scale attention network for semantic traffic light detection. In 2019 IEEE Intelligent Transportation Systems Conference (ITSC), Auckland, New Zealand, pages 1–8. IEEE, 2019.

[7] Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing Lu. Dual attention network for scene segmentation. In 2019 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, pages 3146–3154, 2019.

[8] Keren Fu, Deng-Ping Fan, Ge-Peng Ji, and Qijun Zhao. JL-DCF: Joint learning and densely-cooperative fusion framework for RGB-D salient object detection. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, pages 3049–3059. IEEE, 2020.

[9] Keren Fu, Qijun Zhao, Irene Yu-Hua Gu, and Jie Yang. Deepside: A general deep framework for salient object detection. Neurocomputing, 356:69–82, Sep 2019.

[10] Ziteng Gao, Limin Wang, and Gangshan Wu. LIP: Local importance-based pooling. In 2019 IEEE International Conference on Computer Vision (ICCV), Seoul, Korea (South), pages 3354–3363. IEEE, 2019.

[11] Bharath Hariharan, Pablo Andrés Arbeláez, Ross B. Girshick, and Jitendra Malik. Hypercolumns for object segmentation and fine-grained localization. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, pages 447–456. IEEE Computer Society, 2015.

[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, pages 770–778, 2016.

[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In 14th European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, pages 630–645, 2016.

[14] Tong He, Zhi Zhang, Hang Zhang, Zhongyue Zhang, Junyuan Xie, and Mu Li. Bag of tricks for image classification with convolutional neural networks. In 2019 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, pages 558–567, 2019.

[15] Jie Hu, Li Shen, Samuel Albanie, Gang Sun, and Andrea Vedaldi. Gather-excite: Exploiting feature context in convolutional neural networks. In Annual Conference on Neural Information Processing Systems (NeurIPS) 2018, Montréal, Canada, pages 9423–9433, 2018.

[16] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, pages 7132–7141, 2018.

[17] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, pages 2261–2269, 2017.

[18] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In the 32nd International Conference on Machine Learning (ICML), Lille, France, pages 448–456, 2015.

[19] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

[20] Hanchao Li, Pengfei Xiong, Jie An, and Lingxue Wang. Pyramid attention network for semantic segmentation. In British Machine Vision Conference (BMVC) 2018, Newcastle, UK, pages 1–13, 2018.

[21] Xiang Li, Wenhai Wang, Xiaolin Hu, and Jian Yang. Selective kernel networks. In 2019 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, pages 510–519, 2019.

[22] Yanghao Li, Yuntao Chen, Naiyan Wang, and Zhao-Xiang Zhang. Scale-aware trident networks for object detection. In 2019 IEEE International Conference on Computer Vision (ICCV), Seoul, Korea (South), pages 6053–6062, 2019.

[23] Tsung-Yi Lin, Piotr Dollár, Ross B. Girshick, Kaiming He, Bharath Hariharan, and Serge J. Belongie. Feature pyramid networks for object detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, pages 936–944, 2017.

[24] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In 13th European Conference on Computer Vision (ECCV), Zurich, Switzerland, pages 740–755, Cham, 2014.

[25] Wei Liu, Andrew Rabinovich, and Alexander C. Berg. ParseNet: Looking wider to see better. CoRR, abs/1506.04579, 2015.
[26] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, pages 3431–3440, 2015.

[27] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In the 27th International Conference on Machine Learning (ICML), Haifa, Israel, ICML'10, pages 807–814, USA, 2010.

[28] Jongchan Park, Sanghyun Woo, Joon-Young Lee, and In So Kweon. BAM: Bottleneck attention module. In British Machine Vision Conference (BMVC) 2018, Newcastle, UK, pages 1–14, 2018.

[29] Shaoqing Ren, Kaiming He, Ross B. Girshick, Xiangyu Zhang, and Jian Sun. Object detection networks on convolutional feature maps. IEEE Trans. Pattern Anal. Mach. Intell., 39(7):1476–1481, 2017.

[30] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In 18th International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Munich, Germany, pages 234–241, 2015.

[31] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

[32] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. International Journal of Computer Vision, 128(2):336–359, 2020.

[33] Bharat Singh and Larry S. Davis. An analysis of scale invariance in object detection - SNIP. In 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, pages 3578–3587, June 2018.

[34] Bharat Singh, Mahyar Najibi, and Larry S. Davis. SNIPER: Efficient multi-scale training. In Annual Conference on Neural Information Processing Systems (NeurIPS) 2018, Montréal, Canada, pages 9333–9343, 2018.

[35] Ashish Sinha and Jose Dolz. Multi-scale self-guided attention for medical image segmentation. IEEE Journal of Biomedical and Health Informatics, pages 1–14, Apr 2020.

[36] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Training very deep networks. In Annual Conference on Neural Information Processing Systems (NeurIPS) 2015, Montreal, Quebec, Canada, pages 2377–2385, 2015.

[37] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A. Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, California, USA, pages 4278–4284, 2017.

[38] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, pages 1–9, 2015.

[39] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, pages 2818–2826, 2016.

[40] Andrew Tao, Karan Sapra, and Bryan Catanzaro. Hierarchical multi-scale attention for semantic segmentation. arXiv preprint arXiv:2005.10821, 2020.

[41] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Annual Conference on Neural Information Processing Systems (NeurIPS) 2017, Long Beach, CA, USA, pages 5998–6008, 2017.

[42] Qilong Wang, Banggu Wu, Pengfei Zhu, Peihua Li, Wangmeng Zuo, and Qinghua Hu. ECA-Net: Efficient channel attention for deep convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, pages 11534–11542, 2020.

[43] Wenguan Wang, Shuyang Zhao, Jianbing Shen, Steven C. H. Hoi, and Ali Borji. Salient object detection with pyramid attention and salient edges. In 2019 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, pages 1448–1457. Computer Vision Foundation / IEEE, 2019.

[44] Xiaolong Wang, Ross B. Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, pages 7794–7803, 2018.

[45] Yi Wang, Haoran Dou, Xiaowei Hu, Lei Zhu, Xin Yang, Ming Xu, Jing Qin, Pheng-Ann Heng, Tianfu Wang, and Dong Ni. Deep attentive features for prostate segmentation in 3D transrectal ultrasound. IEEE Transactions on Medical Imaging, 38(12):2768–2778, Apr 2019.

[46] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. CBAM: Convolutional block attention module. In 15th European Conference on Computer Vision (ECCV), Munich, Germany, pages 3–19, 2018.

[47] Saining Xie, Ross B. Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, pages 5987–5995, 2017.

[48] Weitao Yuan, Shengbei Wang, Xiangrui Li, Masashi Unoki, and Wenwu Wang. A skip attention mechanism for monaural singing voice separation. IEEE Signal Processing Letters, 26(10):1481–1485, 2019.

[49] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Richard C. Wilson, Edwin R. Hancock, and William A. P. Smith, editors, British Machine Vision Conference (BMVC) 2016, York, UK. BMVA Press, 2016.

[50] Hongyi Zhang, Moustapha Cissé, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In 6th International Conference on Learning Representations (ICLR), Vancouver, BC, Canada. OpenReview.net, 2018.

[51] Hang Zhang, Chongruo Wu, Zhongyue Zhang, Yi Zhu, Zhi Zhang, Haibin Lin, Yue Sun, Tong He, Jonas Mueller, R. Manmatha, Mu Li, and Alexander Smola. ResNeSt: Split-attention networks. arXiv e-prints, page arXiv:2004.08955, Apr. 2020.
[52] Shifeng Zhang, Xiangyu Zhu, Zhen Lei, Hailin Shi, Xiaobo Wang, and Stan Z. Li. S3FD: Single shot scale-invariant face detector. In 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, Oct 2017.

Appendix

Implementation Details

All network architectures in this work are implemented based on MXNet [3] and GluonCV [14]. Since most of the experimental architectures cannot take advantage of pre-trained weights, each implementation is trained from scratch for fairness. We have introduced most of the experimental settings in Table 2 of the manuscript. Here, in the supplemental document, we introduce the remaining settings that were not mentioned before.

For the experiments on the CIFAR-100 dataset, the weight decay is 1e-4, and we decay the learning rate by a factor of 0.1 at epochs 300 and 350.

For the experiments on ImageNet, we use the label smoothing trick and a cosine annealing schedule for the learning rate, without weight decay.

For the semantic segmentation experiment, the StopSign dataset is a subset of the COCO dataset [24], which has a large scale variation issue, as shown in Fig. 8. We use the cross entropy as the loss function and the mean intersection over union (mIoU) as the evaluation metric.

It should be noted that the proposed networks in Table 5 and Table 6 are trained with mixup [50]. The rest of the experiments, including all the ablation studies and the experimental results in Figure 7 (in the manuscript), are trained without mixup.

Local and Global Fusion Strategies

We also investigate the fusion strategy for the local and global contexts inside the attention module. We explored four strategies, as shown in Fig. 9, in which:

1. Half-AFF, AFF, and Iterative AFF apply addition to fuse the local and global contexts, which allocates the same weight (a constant 0.5) to the local and global contexts.

2. Concat-AFF concatenates the local and global contexts followed by a point-wise convolution, in which the fusing weights are learned during training and fixed after training.

3. Recursive AFF allocates dynamic fusion weights to the local and global contexts during inference based on the proposed MS-CAM.

Table 6 provides the experimental results of these modules on CIFAR-100, from which it can be seen that the iterative AFF (iAFF) module presented in the manuscript achieves the best performance. On the contrary, the Recursive AFF, which can dynamically allocate fusion weights for the local and global contexts, is almost the worst among these modules. We believe the reason is that Recursive AFF has two successive nested Sigmoid functions (see Fig. 9(d)), which increases the difficulty of optimization due to the Sigmoid's saturating form, whereas the iterative AFF presented in the manuscript does not suffer from this problem.

AFF and Concat-AFF have very similar performance. Therefore, for simplicity, we choose the squeeze-and-excitation form (the current MS-CAM module) instead of the Inception-style form (Concat-AFF) for the proposed attentional feature fusion. In future work, we will investigate their performance difference on larger datasets like ImageNet. However, this point is not the main issue that we would like to discuss in the manuscript, so we did not include this part there.

Analysis of FLOPs

The point-wise convolutions inside our multi-scale channel attention module bring additional FLOPs, but at a marginal level, not a significant magnitude. The FLOPs of our AFF-ResNet-50 are 4.3 GFLOPs, and the FLOPs of ResNet-50 in our implementation are 4.1 GFLOPs. Actually, depending on how many tricks are used in ResNet, the FLOPs of ResNet-50 can vary from 3.9 GFLOPs to 4.3 GFLOPs [14]. Therefore, taking ResNet-50 vs. our AFF-ResNet-50 as an example, integrating the AFF module only brings an additional 4.88% FLOPs, from 4.1 GFLOPs to 4.3 GFLOPs. Considering the performance boost by the AFF module, we think the additional 4.88% FLOPs is a good trade-off.

Given an output channel number C and an output feature map of size H × W, if the input and output channel numbers are the same, the FLOPs of a 3 × 3 convolution layer are 18C²HW (multiplications and additions), and a ResBlock consists of two or three convolution layers. Meanwhile, the FLOPs of the two point-wise convolutions of a bottleneck structure are (2/r)C²HW, where r = 4 or r = 16 depending on the dataset and network. Therefore, compared with the FLOPs of the convolutions in the host network, the FLOPs brought by the AFF module are marginal.

In Table 7, we list the FLOPs of the convolutions in a BasicResBlock / BottleneckResBlock, the FLOPs of the point-wise convolutions in our AFF module, and the relative increase in percentage. It can be seen that the maximum additional FLOPs brought by the AFF module is around 7.7% if we use the AFF module in every ResBlock from beginning to end. However, it is not necessary to replace every ResBlock with an AFF-ResBlock. In our AFF-ResNet, we do this replacement [...]
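As a quick sanity check of the bookkeeping above (an illustrative computation only; the exact per-block numbers in Table 7 depend on the block type, stage resolution, and where AFF is inserted):

```python
# FLOPs of one 3x3 convolution with equal input/output channels: ~18 * C^2 * H * W.
# FLOPs of the two point-wise convolutions in the MS-CAM bottleneck: ~(2 / r) * C^2 * H * W.
def conv3x3_flops(c: int, h: int, w: int) -> float:
    return 18.0 * c * c * h * w


def mscam_pw_flops(c: int, h: int, w: int, r: int = 4) -> float:
    return (2.0 / r) * c * c * h * w


c, h, w, r = 64, 32, 32, 4
basic_block = 2 * conv3x3_flops(c, h, w)     # a basic ResBlock: two 3x3 convolutions
extra = mscam_pw_flops(c, h, w, r)           # point-wise convolutions added by one AFF module
print(f"relative increase per basic block: {100 * extra / basic_block:.2f}%")  # ~1.39% here

# Network-level figure quoted above for AFF-ResNet-50 vs. ResNet-50:
print(f"overall increase: {100 * (4.3 - 4.1) / 4.1:.2f}%")  # ~4.88%
```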
Figure 8: Illustration for the StopSign dataset
Table 6: Results for the ablation study on the fusion manner of the local and global channel contexts on CIFAR-100
Figure 9: Architectures for the ablation study on the fusion manner of the local and global channel contexts.
Table 7: Additional FLOPs brought by the proposed AFF module in an AFF-ResBlock.