3.1. Overall Pipeline
Figure 1 illustrates the overall architecture of our proposed system, which is designed to fuse multi-scale features and strengthen decision-making. The framework builds on the Oriented R-CNN baseline and integrates our contributions: the consolidated multi-scale feature enhancement module (CMFEM), the efficient atrous channel-wise attention (EACA), and the sparsely gated mixture of heterogeneous experts head (MOHEH) module. The CMFEM is paired with the Path Aggregation Feature Pyramid Network (PAFPN) and consists of multi-scale feature integration, inter-stage channel-wise attention, and residual aggregation with the original features; it is responsible for the multi-scale refinement and fusion of hierarchical feature representations. The MOHEH module improves the decision aggregation process by deploying diverse expert head structures for class-specific and regression predictions, following a sparsely gated mixture-of-experts approach.
Given an input remote sensing image, we first process it through the ResNet50 backbone and take C2, C3, C4, and C5 as the multi-level features extracted from the backbone. These hierarchical features are then passed to the PAFPN for multi-scale fusion, enhancing the feature maps for subsequent refinement.
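For concreteness, the following is a minimal sketch of how the C2 to C5 feature maps can be obtained from a ResNet50 backbone, assuming a PyTorch/torchvision implementation; the input size and node names are illustrative, not the authors' exact setup:

```python
import torch
from torchvision.models import resnet50
from torchvision.models.feature_extraction import create_feature_extractor

# Hypothetical sketch: tap the four residual stages of ResNet50 as C2-C5.
backbone = resnet50(weights=None)
extractor = create_feature_extractor(
    backbone,
    return_nodes={"layer1": "C2", "layer2": "C3", "layer3": "C4", "layer4": "C5"},
)

image = torch.randn(1, 3, 1024, 1024)   # dummy remote sensing image
feats = extractor(image)
for name, f in feats.items():
    print(name, tuple(f.shape))
# C2: (1, 256, 256, 256), C3: (1, 512, 128, 128),
# C4: (1, 1024, 64, 64),  C5: (1, 2048, 32, 32)
```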
Next, the output feature maps from the PAFPN serve as input to the CMFEM. Since feature fusion requires feature maps to share the same scale and channel dimension, the CMFEM first adjusts the PAFPN output at each level to an intermediate scale using either downsampling or upsampling. In addition, the feature maps are standardized to a uniform channel dimension (denoted as C, as visualized in CMFEM in
Figure 1), ensuring consistency and compatibility for the fusion process. These harmonized feature maps are then aggregated within CMFEM and passed to the Efficient Atrous Channel-Wise Attention (EACA) module, which refines the features by adaptively emphasizing crucial channels and enhancing their representation.
The refined feature maps generated by the CMFEM and EACA are forwarded to the Oriented Region Proposal Network (Oriented RPN), which generates oriented proposals tailored to objects with arbitrary rotations. The proposals then go through a rotated RoI Align operation that aligns and prepares them for decision-making.
Finally, the aligned proposal features are processed by the MOHEH module, which uses a top-k gating mechanism to dynamically aggregate the prediction results of different detection head architectures. This design integrates classification and regression predictions through a mixture-of-experts approach, markedly improving the model's prediction accuracy.
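As a rough illustration of the top-k gating idea only (not the authors' MOHEH implementation, whose heterogeneous expert head structures are described later in the paper), the sketch below aggregates the outputs of several stand-in linear expert heads; `num_experts`, `k`, and the head shapes are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKGate(nn.Module):
    """Minimal sketch of sparse top-k gating over a set of expert heads."""
    def __init__(self, in_dim, out_dim, num_experts=4, k=2):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(in_dim, out_dim) for _ in range(num_experts)])
        self.gate = nn.Linear(in_dim, num_experts)
        self.k = k

    def forward(self, x):                                 # x: (N, in_dim) pooled RoI features
        logits = self.gate(x)                             # (N, num_experts)
        topk_val, topk_idx = logits.topk(self.k, dim=-1)  # keep the k best experts per RoI
        weights = F.softmax(topk_val, dim=-1)             # renormalize over selected experts
        # For clarity, every expert runs here; a truly sparse version would
        # route each RoI only to its selected experts.
        all_out = torch.stack([e(x) for e in self.experts], dim=1)   # (N, E, out_dim)
        picked = all_out.gather(
            1, topk_idx.unsqueeze(-1).expand(-1, -1, all_out.size(-1)))
        return (weights.unsqueeze(-1) * picked).sum(dim=1)           # (N, out_dim)
```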
3.2. Consolidated Multi-Scale Feature Enhancement Module
The FPN is adopted in the baseline model to enhance and fuse the multi-scale feature representation. Because remote sensing images contain objects of widely varying scales, we conjecture that a single top-down pathway cannot adequately capture the diverse and complex spatial hierarchies of such data. Like the FPN, PAFPN has a top-down pathway to aggregate the feature representations learned by the CNN backbone, but it adds a bottom-up aggregation pathway that further augments the feature representation. We therefore replace the original FPN with PAFPN in the baseline model and observe an immediate mAP gain after the replacement.
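If the baseline is built with an MMDetection/MMRotate-style configuration (an assumption on our part; the paper does not state the toolbox), the swap amounts to changing the neck type, for example:

```python
# Sketch of replacing the FPN neck with PAFPN in an MMDetection-style config.
neck = dict(
    type='PAFPN',                        # was: type='FPN'
    in_channels=[256, 512, 1024, 2048],  # channel sizes of C2-C5 from ResNet50
    out_channels=256,
    num_outs=5,
)
```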
Although replacing the FPN with PAFPN improves the detection performance of the Oriented R-CNN detector, the following limitations remain and can be further addressed:
(1) The fusion strategy of PAFPN is not tailored to remote sensing images. Remote sensing images contain many objects of various scales and shapes, so the fusion at each feature level needs further augmentation to capture finer spatial details and richer semantics.
(2) The fusion process does not fully exploit every scale, which limits the model's ability to leverage the multi-scale information intrinsic to remote sensing images.
(3) The feature correlations across multiple scales are inadequately explored.
Therefore, we propose a novel CMFEM to aggregate and augment the multi-scale features from the PAFPN’s output, and
Figure 2 presents the comprehensive architecture of CMFEM, which has the following three major improvements:
(1) Following the top-down and bottom-up feature aggregation, we incorporate a feature fusion step that fuses the feature maps from all scales.
(2) We introduce an efficient and computationally friendly channel attention module that extracts inter-scale correlations and effectively harnesses multi-scale contextual information.
(3) To preserve the original feature representation acquired by PAFPN, we integrate a residual pathway that maintains the valuable features learned throughout the network.
For feature maps $P_i$ (for $i = 2, \dots, 6$), we define resizing operations $T_i(\cdot)$ that use nearest neighbor interpolation (NNI) or adaptive max pooling (AMP) to match the scale of $P_4$ and then a convolution to modify the channel dimensions. Specifically,

$$
T_i(P_i) =
\begin{cases}
\mathrm{Conv}\!\left(\mathrm{AMP}\!\left(P_i,\ 2^{\,4-i}\right)\right), & i < 4, \\
\mathrm{Conv}\!\left(P_i\right), & i = 4, \\
\mathrm{Conv}\!\left(\mathrm{NNI}\!\left(P_i,\ 2^{\,i-4}\right)\right), & i > 4.
\end{cases}
$$

Here, NNI denotes nearest neighbor interpolation and AMP denotes adaptive max pooling. The factor $2^{\,|i-4|}$ is the scaling parameter necessary to match the spatial dimensions of $P_4$. Each transformation aligns the feature map sizes with $P_4$ while maintaining crucial spatial and feature details. The transformed maps $T_i(P_i)$ are then aggregated to form

$$
F = \sum_{i=2}^{6} T_i(P_i),
$$

which undergoes further processing in the EACA module to enhance essential channels. The output $Q$ from the EACA is adjusted in spatial dimensions to match each $P_i$ through upscaling or downscaling as $R_i(Q)$, and the final feature maps are computed as

$$
P_i' = P_i + R_i(Q), \qquad i = 2, \dots, 6,
$$

ensuring coherent feature integration and enhancement across scales.
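The following is a minimal PyTorch sketch of this resize-fuse-redistribute flow, assuming 256-channel PAFPN outputs and the notation above; the 1x1 alignment convolutions and the use of summation for aggregation are our assumptions rather than the authors' exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CMFEMFusion(nn.Module):
    """Sketch: resize P2-P6 to the scale of P4, sum them into F, refine with
    EACA into Q, then redistribute Q residually to every level."""
    def __init__(self, channels=256, eaca=None):
        super().__init__()
        self.align = nn.ModuleList([nn.Conv2d(channels, channels, 1) for _ in range(5)])
        self.eaca = eaca if eaca is not None else nn.Identity()  # stand-in for EACA

    def forward(self, feats):                         # feats = [P2, P3, P4, P5, P6]
        ref = feats[2].shape[-2:]                     # spatial size of P4
        resized = []
        for conv, p in zip(self.align, feats):
            if p.shape[-1] > ref[-1]:                 # P2, P3: downscale with AMP
                t = F.adaptive_max_pool2d(p, ref)
            elif p.shape[-1] < ref[-1]:               # P5, P6: upscale with NNI
                t = F.interpolate(p, size=ref, mode='nearest')
            else:                                     # P4: keep as-is
                t = p
            resized.append(conv(t))                   # T_i(P_i)
        fused = torch.stack(resized).sum(dim=0)       # F = sum_i T_i(P_i)
        refined = self.eaca(fused)                    # Q = EACA(F)
        outs = []
        for p in feats:                               # P_i' = P_i + R_i(Q)
            if p.shape[-1] < ref[-1]:
                r = F.adaptive_max_pool2d(refined, p.shape[-2:])
            else:
                r = F.interpolate(refined, size=p.shape[-2:], mode='nearest')
            outs.append(p + r)
        return outs
```

With five 256-channel input maps, the module returns five maps of exactly the same shapes as its inputs, which keeps it drop-in compatible with the downstream Oriented RPN.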
Table 1 presents the detailed parameters and settings of the CMFEM modules. To enhance the multi-scale feature representation, all output feature maps from the PAFPN are resized to an intermediate channel dimension of $C$. This dimension corresponds to the channel size of $P_4$, the middle feature map among the five feature maps ($P_2$ to $P_6$). This configuration ensures seamless feature fusion across scales by aligning all feature maps to the size and channel dimension of $P_4$. Nearest neighbor interpolation (NNI) is used for upscaling and adaptive max pooling (AMP) for downscaling, reducing the loss of feature information during resizing. The residual pathway combines the refined outputs with the original feature maps via element-wise addition to preserve the integrity of the learned features. Furthermore, the output channel dimensions of $P_2'$ to $P_6'$ are standardized to 256 with 1x1 convolutions, following the baseline design and ensuring compatibility with the Oriented RPN, which requires uniform dimensions for effective proposal generation.
3.3. Efficient Atrous Channel-Wise Attention
Our proposed approach leverages a novel channel-wise attention module inspired by the Efficient Channel Attention Network (ECANet) to streamline multi-scale feature fusion and reduce redundancy, improving our detection framework’s efficacy. The efficient atrous channel-wise attention (EACA) module selects features by generating the channel-wise feature descriptors and focusing on essential channels, as shown in
Figure 3.
Suppose $X \in \mathbb{R}^{C \times H \times W}$ is the input feature map. The feature descriptors $p \in \mathbb{R}^{C}$ and $q \in \mathbb{R}^{C}$ are obtained using global average pooling (GAP) and global max pooling (GMP), respectively, and can be expressed as

$$
p = \mathrm{GAP}(X), \qquad q = \mathrm{GMP}(X).
$$
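In PyTorch terms, and assuming descriptors of length C, these two pooling steps amount to the following (tensor sizes are illustrative):

```python
import torch
import torch.nn.functional as F

X = torch.randn(1, 256, 64, 64)              # input feature map with C = 256
p = F.adaptive_avg_pool2d(X, 1).flatten(1)   # global average pooling -> (1, 256)
q = F.adaptive_max_pool2d(X, 1).flatten(1)   # global max pooling     -> (1, 256)
```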
The inter-enhancement module is shown in
Figure 4. Each descriptor is then processed through a set of 1D atrous convolutions, each specified by a unique atrous rate:

$$
p_k = \mathrm{Conv1D}_{r_k}(p), \qquad q_k = \mathrm{Conv1D}_{r_k}(q), \qquad k = 1, \dots, D,
$$

where $D$ is the total number of atrous rates used and $r_k$ is the atrous rate of the $k$-th convolution.
The outputs from these convolutions, $p_k$ and $q_k$, are then summed to form the enhanced feature sets $\hat{p}$ and $\hat{q}$:

$$
\hat{p} = \sum_{k=1}^{D} p_k, \qquad \hat{q} = \sum_{k=1}^{D} q_k.
$$
The combined features are processed through a sigmoid activation function $\sigma(\cdot)$ to compute the attention weights $w$:

$$
w = \sigma\!\left(\hat{p} + \hat{q}\right).
$$
Finally, these attention weights modulate the input feature map $X$ to produce the enhanced output feature map $X'$:

$$
X' = w \odot X,
$$

where $\odot$ denotes channel-wise multiplication.
In the schematic of the inter-enhancement module, the average-pooled and max-pooled feature vectors $p$ and $q$ derived from the input feature map are fed into a dedicated sequence of atrous convolutions to generate the sets $\{p_k\}$ and $\{q_k\}$.
Figure 4 presents a visualization of the inter-enhancement module, which uses atrous convolutions with atrous rates 1 and 2. To create the feature vectors $p_1$ and $p_2$, respectively, $p$ is passed through two one-dimensional convolutional layers, each with a kernel size of 3; one layer has an atrous rate of 1 and the other an atrous rate of 2. A comparable pair of atrous convolutions is applied to $q$ to produce $q_1$ and $q_2$. When the atrous rate equals 1, the operation is equivalent to an ordinary convolution, which preserves the original fine-grained feature representation. The atrous convolution layers allow the module to capture information at multiple scales and effectively expand the receptive field without introducing extra computational cost.
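Putting the above together, a minimal PyTorch sketch of the EACA attention path (kernel size 3 and atrous rates 1 and 2, as in Figure 4) is given below; sharing one convolution branch between $p$ and $q$ and omitting the residual connection are simplifications on our part:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EACA(nn.Module):
    """Sketch of the EACA attention path following the equations above."""
    def __init__(self, rates=(1, 2), kernel_size=3):
        super().__init__()
        # One set of 1D atrous convolutions, applied to both descriptors here for
        # brevity (separate p- and q-branches with comparable layers are equally
        # plausible). padding = rate * (kernel_size - 1) // 2 keeps the length fixed.
        self.convs = nn.ModuleList([
            nn.Conv1d(1, 1, kernel_size, dilation=r,
                      padding=r * (kernel_size - 1) // 2, bias=False)
            for r in rates
        ])

    def _enhance(self, d):                             # d: (N, 1, C)
        return sum(conv(d) for conv in self.convs)     # hat{p} or hat{q}

    def forward(self, x):                              # x: (N, C, H, W)
        p = F.adaptive_avg_pool2d(x, 1).squeeze(-1).transpose(1, 2)  # GAP -> (N, 1, C)
        q = F.adaptive_max_pool2d(x, 1).squeeze(-1).transpose(1, 2)  # GMP -> (N, 1, C)
        w = torch.sigmoid(self._enhance(p) + self._enhance(q))       # attention weights w
        w = w.transpose(1, 2).unsqueeze(-1)                          # (N, C, 1, 1)
        return x * w                                                  # X' = w (.) X
```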
Figure 5 presents a visualization of the one-dimensional atrous convolution. One-dimensional atrous convolution adapts the conventional convolution to enlarge the receptive field over sequential data without adding computational overhead. By inserting gaps between the kernel elements, it captures long-range relationships while preserving the sequence length. Such capabilities make atrous convolutions valuable for feature extraction in sequence modeling, enabling more effective learning from data with complex dynamics. The input to the inter-enhancement module consists of one-dimensional average-pooled and max-pooled features, which can be regarded as a sequential feature representation; we can therefore use one-dimensional atrous convolution to integrate contextual information across various scales. Because the average-pooled and max-pooled features come from fused feature maps, the one-dimensional convolution can efficiently differentiate salient and non-salient signals across these fused feature descriptors. The spaced kernel taps of atrous convolution provide a broader view of the input features, which is crucial for processing fused feature maps because it helps to preserve and highlight pivotal features that might otherwise be diluted during fusion. Consequently, one-dimensional atrous convolution not only maintains the integrity of significant features but also strengthens the model's ability to interpret sophisticated patterns and anomalies in the data, which is why we adopt it in the inter-enhancement module.
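As a quick check of this length-preserving behavior (the values below are illustrative), a kernel-3 convolution with dilation 2 and padding 2 covers five positions while keeping a 256-length descriptor unchanged:

```python
import torch
import torch.nn as nn

# Length-preserving 1D atrous convolution: kernel 3, dilation 2, padding 2
# gives a receptive field of 5 positions without changing the sequence length.
conv = nn.Conv1d(in_channels=1, out_channels=1, kernel_size=3, dilation=2, padding=2)
seq = torch.randn(1, 1, 256)        # e.g., a pooled 256-channel descriptor
print(conv(seq).shape)              # torch.Size([1, 1, 256])
```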
Table 2 lists the parameters and settings of the efficient atrous channel-wise attention (EACA) module. First, the input feature maps ($X$) are pooled into the feature descriptors ($p$ and $q$) using GAP and GMP. These descriptors are then processed by three 1D atrous convolutions with different atrous rates while keeping the dimension constant. After that, the outputs of the two 1D atrous convolution branches (one for $p$ and one for $q$) are summed, activated with a sigmoid to generate the attention weights ($w$), and applied to refine the input feature map. Lastly, a residual connection combines the refined output with the original input, so the final output feature maps of the EACA module have the same dimensions as the input.
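Assuming the EACA sketch given above, the residual combination and the shape-preserving behavior described for Table 2 can be exercised as follows:

```python
eaca = EACA(rates=(1, 2))        # sketch module defined earlier
x = torch.randn(2, 256, 64, 64)
out = x + eaca(x)                # refined map plus residual connection
print(out.shape)                 # torch.Size([2, 256, 64, 64]), same shape as the input
```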