3.1. Overall Pipeline
Figure 1 illustrates the overall architecture of our proposed system, which is designed to fuse multi-scale features and strengthen decision-making. The framework builds on the Oriented R-CNN baseline and integrates our contributions: the consolidated multi-scale feature enhancement module (CMFEM), the efficient atrous channel-wise attention (EACA), and the sparsely gated mixture of heterogeneous experts head (MOHEH) module. The CMFEM is paired with the Path Aggregation Feature Pyramid Network (PAFPN) and consists of multi-scale feature integration, inter-stage channel-wise attention, and residual aggregation with the original features; it is responsible for the multi-scale refinement and fusion of hierarchical feature representations. The MOHEH module improves the decision aggregation process by deploying diverse expert head structures for class-specific and regression predictions, following a sparsely gated mixture-of-experts approach.
Given an input remote sensing image, we first process it through the ResNet50 backbone and take C2, C3, C4, and C5 as the multi-level features extracted from the backbone. These hierarchical features are then passed to the PAFPN for multi-scale fusion, enhancing the feature maps for subsequent refinement.
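For concreteness, the following is a minimal sketch of how the C2 to C5 feature maps can be obtained from a ResNet50 backbone, assuming a PyTorch/torchvision implementation; the input size and node names are illustrative, not the authors' exact setup:

```python
import torch
from torchvision.models import resnet50
from torchvision.models.feature_extraction import create_feature_extractor

# Hypothetical sketch: tap the four residual stages of ResNet50 as C2-C5.
backbone = resnet50(weights=None)
extractor = create_feature_extractor(
    backbone,
    return_nodes={"layer1": "C2", "layer2": "C3", "layer3": "C4", "layer4": "C5"},
)

image = torch.randn(1, 3, 1024, 1024)   # dummy remote sensing image
feats = extractor(image)
for name, f in feats.items():
    print(name, tuple(f.shape))
# C2: (1, 256, 256, 256), C3: (1, 512, 128, 128),
# C4: (1, 1024, 64, 64),  C5: (1, 2048, 32, 32)
```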
Next, the output feature maps from the PAFPN serve as input to the CMFEM. Since feature fusion requires feature maps to share the same scale and channel dimension, the CMFEM first adjusts the PAFPN output at each level to an intermediate scale using either downsampling or upsampling. In addition, the feature maps are standardized to a uniform channel dimension (denoted as C, as visualized in CMFEM in
Figure 1), ensuring consistency and compatibility for the fusion process. These harmonized feature maps are then aggregated within CMFEM and passed to the Efficient Atrous Channel-Wise Attention (EACA) module, which refines the features by adaptively emphasizing crucial channels and enhancing their representation.
The refined feature maps generated by the CMFEM and EACA are forwarded to the Oriented Region Proposal Network (Oriented RPN), which generates oriented proposals tailored to objects with arbitrary rotations. The proposals then go through a rotated RoI Align operation that aligns and prepares them for decision-making.
Finally, the aligned proposal features are processed by the MOHEH module, which uses a top-k gating mechanism to dynamically aggregate the prediction results of different detection head architectures. This design integrates classification and regression predictions through a mixture-of-experts approach, markedly improving the model's prediction accuracy.
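As a rough illustration of the top-k gating idea only (not the authors' MOHEH implementation, whose heterogeneous expert head structures are described later in the paper), the sketch below aggregates the outputs of several stand-in linear expert heads; `num_experts`, `k`, and the head shapes are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKGate(nn.Module):
    """Minimal sketch of sparse top-k gating over a set of expert heads."""
    def __init__(self, in_dim, out_dim, num_experts=4, k=2):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(in_dim, out_dim) for _ in range(num_experts)])
        self.gate = nn.Linear(in_dim, num_experts)
        self.k = k

    def forward(self, x):                                 # x: (N, in_dim) pooled RoI features
        logits = self.gate(x)                             # (N, num_experts)
        topk_val, topk_idx = logits.topk(self.k, dim=-1)  # keep the k best experts per RoI
        weights = F.softmax(topk_val, dim=-1)             # renormalize over selected experts
        # For clarity, every expert runs here; a truly sparse version would
        # route each RoI only to its selected experts.
        all_out = torch.stack([e(x) for e in self.experts], dim=1)   # (N, E, out_dim)
        picked = all_out.gather(
            1, topk_idx.unsqueeze(-1).expand(-1, -1, all_out.size(-1)))
        return (weights.unsqueeze(-1) * picked).sum(dim=1)           # (N, out_dim)
```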
3.2. Consolidated Multi-Scale Feature Enhancement Module
The FPN is adopted in the baseline model to enhance and fuse the multi-scale feature representation. Because remote sensing images contain objects of widely varying scales, we conjecture that a single top-down pathway cannot adequately capture the diverse and complex spatial hierarchies of such data. Like the FPN, PAFPN has a top-down pathway to aggregate the feature representations learned by the CNN backbone, but it adds a bottom-up aggregation pathway that further augments the feature representation. We therefore replace the original FPN with PAFPN in the baseline model and observe an immediate mAP gain after the replacement.
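If the baseline is built with an MMDetection/MMRotate-style configuration (an assumption on our part; the paper does not state the toolbox), the swap amounts to changing the neck type, for example:

```python
# Sketch of replacing the FPN neck with PAFPN in an MMDetection-style config.
neck = dict(
    type='PAFPN',                        # was: type='FPN'
    in_channels=[256, 512, 1024, 2048],  # channel sizes of C2-C5 from ResNet50
    out_channels=256,
    num_outs=5,
)
```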
Although replacing the FPN with PAFPN improves the detection performance of the Oriented R-CNN detector, the following limitations remain and can be further addressed:
(1) The fusion strategy of PAFPN is not tailored to remote sensing images. Remote sensing images contain many objects of various scales and shapes, so the fusion at each feature level needs further augmentation to capture finer spatial details and richer semantics.
(2) The fusion process does not fully exploit every scale, which limits the model's ability to leverage the multi-scale information intrinsic to remote sensing images.
(3) The feature correlations across multiple scales are inadequately explored.
Therefore, we propose a novel CMFEM to aggregate and augment the multi-scale features from the PAFPN’s output, and
Figure 2 presents the comprehensive architecture of CMFEM, which has the following three major improvements:
(1) Following the top-down and bottom-up feature aggregation, we incorporate a feature fusion step that fuses the feature maps from all scales.
(2) We introduce an efficient and computationally friendly channel attention module that extracts inter-scale correlations and effectively harnesses multi-scale contextual information.
(3) To preserve the original feature representation acquired by PAFPN, we integrate a residual pathway that maintains the valuable features learned throughout the network.
For feature maps $P_i$ (for $i = 2, \dots, 6$), we define resizing operations $T_i(\cdot)$ that use nearest neighbor interpolation (NNI) or adaptive max pooling (AMP) to match the scale of $P_4$ and then a convolution to modify the channel dimensions. Specifically,

$$
T_i(P_i) =
\begin{cases}
\mathrm{Conv}\!\left(\mathrm{AMP}\!\left(P_i,\ 2^{\,4-i}\right)\right), & i < 4, \\
\mathrm{Conv}\!\left(P_i\right), & i = 4, \\
\mathrm{Conv}\!\left(\mathrm{NNI}\!\left(P_i,\ 2^{\,i-4}\right)\right), & i > 4.
\end{cases}
$$

Here, NNI denotes nearest neighbor interpolation and AMP denotes adaptive max pooling. The factor $2^{\,|i-4|}$ is the scaling parameter necessary to match the spatial dimensions of $P_4$. Each transformation aligns the feature map sizes with $P_4$ while maintaining crucial spatial and feature details. The transformed maps $T_i(P_i)$ are then aggregated to form

$$
F = \sum_{i=2}^{6} T_i(P_i),
$$

which undergoes further processing in the EACA module to enhance essential channels. The output $Q$ from the EACA is adjusted in spatial dimensions to match each $P_i$ through upscaling or downscaling as $R_i(Q)$, and the final feature maps are computed as

$$
P_i' = P_i + R_i(Q), \qquad i = 2, \dots, 6,
$$

ensuring coherent feature integration and enhancement across scales.
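The following is a minimal PyTorch sketch of this resize-fuse-redistribute flow, assuming 256-channel PAFPN outputs and the notation above; the 1x1 alignment convolutions and the use of summation for aggregation are our assumptions rather than the authors' exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CMFEMFusion(nn.Module):
    """Sketch: resize P2-P6 to the scale of P4, sum them into F, refine with
    EACA into Q, then redistribute Q residually to every level."""
    def __init__(self, channels=256, eaca=None):
        super().__init__()
        self.align = nn.ModuleList([nn.Conv2d(channels, channels, 1) for _ in range(5)])
        self.eaca = eaca if eaca is not None else nn.Identity()  # stand-in for EACA

    def forward(self, feats):                         # feats = [P2, P3, P4, P5, P6]
        ref = feats[2].shape[-2:]                     # spatial size of P4
        resized = []
        for conv, p in zip(self.align, feats):
            if p.shape[-1] > ref[-1]:                 # P2, P3: downscale with AMP
                t = F.adaptive_max_pool2d(p, ref)
            elif p.shape[-1] < ref[-1]:               # P5, P6: upscale with NNI
                t = F.interpolate(p, size=ref, mode='nearest')
            else:                                     # P4: keep as-is
                t = p
            resized.append(conv(t))                   # T_i(P_i)
        fused = torch.stack(resized).sum(dim=0)       # F = sum_i T_i(P_i)
        refined = self.eaca(fused)                    # Q = EACA(F)
        outs = []
        for p in feats:                               # P_i' = P_i + R_i(Q)
            if p.shape[-1] < ref[-1]:
                r = F.adaptive_max_pool2d(refined, p.shape[-2:])
            else:
                r = F.interpolate(refined, size=p.shape[-2:], mode='nearest')
            outs.append(p + r)
        return outs
```

With five 256-channel input maps, the module returns five maps of exactly the same shapes as its inputs, which keeps it drop-in compatible with the downstream Oriented RPN.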
Table 1 presents the detailed parameters and settings of the CMFEM modules. To enhance the multi-scale feature representation, all output feature maps from the PAFPN are resized to an intermediate channel dimension of $C$. This dimension corresponds to the channel size of $P_4$, the middle feature map among the five feature maps ($P_2$ to $P_6$). This configuration ensures seamless feature fusion across scales by aligning all feature maps to the size and channel dimension of $P_4$. Nearest neighbor interpolation (NNI) is used for upscaling and adaptive max pooling (AMP) for downscaling, reducing the loss of feature information during resizing. The residual pathway combines the refined outputs with the original feature maps via element-wise addition to preserve the integrity of the learned features. Furthermore, the output channel dimensions of $P_2'$ to $P_6'$ are standardized to 256 with 1x1 convolutions, following the baseline design and ensuring compatibility with the Oriented RPN, which requires uniform dimensions for effective proposal generation.
3.3. Efficient Atrous Channel-Wise Attention
Our proposed approach leverages a novel channel-wise attention module inspired by the Efficient Channel Attention Network (ECANet) to streamline multi-scale feature fusion and reduce redundancy, improving our detection framework’s efficacy. The efficient atrous channel-wise attention (EACA) module selects features by generating the channel-wise feature descriptors and focusing on essential channels, as shown in
Figure 3.
Suppose $X \in \mathbb{R}^{C \times H \times W}$ is the input feature map. The feature descriptors $p \in \mathbb{R}^{C}$ and $q \in \mathbb{R}^{C}$ are obtained using global average pooling (GAP) and global max pooling (GMP), respectively, and can be expressed as

$$
p = \mathrm{GAP}(X), \qquad q = \mathrm{GMP}(X).
$$
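In PyTorch terms, and assuming descriptors of length C, these two pooling steps amount to the following (tensor sizes are illustrative):

```python
import torch
import torch.nn.functional as F

X = torch.randn(1, 256, 64, 64)              # input feature map with C = 256
p = F.adaptive_avg_pool2d(X, 1).flatten(1)   # global average pooling -> (1, 256)
q = F.adaptive_max_pool2d(X, 1).flatten(1)   # global max pooling     -> (1, 256)
```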
The inter-enhancement module is shown in
Figure 4. Each descriptor is then processed through a set of 1D atrous convolutions, each specified by a unique atrous rate:

$$
p_k = \mathrm{Conv1D}_{r_k}(p), \qquad q_k = \mathrm{Conv1D}_{r_k}(q), \qquad k = 1, \dots, D,
$$

where $D$ is the total number of atrous rates used and $r_k$ is the atrous rate of the $k$-th convolution.
The outputs from these convolutions, $p_k$ and $q_k$, are then summed to form the enhanced feature sets $\hat{p}$ and $\hat{q}$:

$$
\hat{p} = \sum_{k=1}^{D} p_k, \qquad \hat{q} = \sum_{k=1}^{D} q_k.
$$
The combined features are processed through a sigmoid activation function $\sigma(\cdot)$ to compute the attention weights $w$:

$$
w = \sigma\!\left(\hat{p} + \hat{q}\right).
$$
Finally, these attention weights modulate the input feature map $X$ to produce the enhanced output feature map $X'$:

$$
X' = w \odot X,
$$

where $\odot$ denotes channel-wise multiplication.
In the schematic of the inter-enhancement module, the average-pooled and max-pooled feature vectors $p$ and $q$ derived from the input feature map are fed into a dedicated sequence of atrous convolutions to generate the sets $\{p_k\}$ and $\{q_k\}$.
Figure 4 presents a visualization of the inter-enhancement module, which uses atrous convolutions with atrous rates 1 and 2. To create the feature vectors $p_1$ and $p_2$, respectively, $p$ is passed through two one-dimensional convolutional layers, each with a kernel size of 3; one layer has an atrous rate of 1 and the other an atrous rate of 2. A comparable pair of atrous convolutions is applied to $q$ to produce $q_1$ and $q_2$. When the atrous rate equals 1, the operation is equivalent to an ordinary convolution, which preserves the original fine-grained feature representation. The atrous convolution layers allow the module to capture information at multiple scales and effectively expand the receptive field without introducing extra computational cost.
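Putting the above together, a minimal PyTorch sketch of the EACA attention path (kernel size 3 and atrous rates 1 and 2, as in Figure 4) is given below; sharing one convolution branch between $p$ and $q$ and omitting the residual connection are simplifications on our part:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EACA(nn.Module):
    """Sketch of the EACA attention path following the equations above."""
    def __init__(self, rates=(1, 2), kernel_size=3):
        super().__init__()
        # One set of 1D atrous convolutions, applied to both descriptors here for
        # brevity (separate p- and q-branches with comparable layers are equally
        # plausible). padding = rate * (kernel_size - 1) // 2 keeps the length fixed.
        self.convs = nn.ModuleList([
            nn.Conv1d(1, 1, kernel_size, dilation=r,
                      padding=r * (kernel_size - 1) // 2, bias=False)
            for r in rates
        ])

    def _enhance(self, d):                             # d: (N, 1, C)
        return sum(conv(d) for conv in self.convs)     # hat{p} or hat{q}

    def forward(self, x):                              # x: (N, C, H, W)
        p = F.adaptive_avg_pool2d(x, 1).squeeze(-1).transpose(1, 2)  # GAP -> (N, 1, C)
        q = F.adaptive_max_pool2d(x, 1).squeeze(-1).transpose(1, 2)  # GMP -> (N, 1, C)
        w = torch.sigmoid(self._enhance(p) + self._enhance(q))       # attention weights w
        w = w.transpose(1, 2).unsqueeze(-1)                          # (N, C, 1, 1)
        return x * w                                                  # X' = w (.) X
```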
Figure 5 presents a visualization of the one-dimensional atrous convolution. One-dimensional atrous convolution adapts the conventional convolution to enlarge the receptive field over sequential data without adding computational overhead. By inserting gaps between the kernel elements, it captures long-range relationships while preserving the sequence length. Such capabilities make atrous convolutions valuable for feature extraction in sequence modeling, enabling more effective learning from data with complex dynamics. The input to the inter-enhancement module consists of one-dimensional average-pooled and max-pooled features, which can be regarded as a sequential feature representation; we can therefore use one-dimensional atrous convolution to integrate contextual information across various scales. Because the average-pooled and max-pooled features come from fused feature maps, the one-dimensional convolution can efficiently differentiate salient and non-salient signals across these fused feature descriptors. The spaced kernel taps of atrous convolution provide a broader view of the input features, which is crucial for processing fused feature maps because it helps to preserve and highlight pivotal features that might otherwise be diluted during fusion. Consequently, one-dimensional atrous convolution not only maintains the integrity of significant features but also strengthens the model's ability to interpret sophisticated patterns and anomalies in the data, which is why we adopt it in the inter-enhancement module.
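As a quick check of this length-preserving behavior (the values below are illustrative), a kernel-3 convolution with dilation 2 and padding 2 covers five positions while keeping a 256-length descriptor unchanged:

```python
import torch
import torch.nn as nn

# Length-preserving 1D atrous convolution: kernel 3, dilation 2, padding 2
# gives a receptive field of 5 positions without changing the sequence length.
conv = nn.Conv1d(in_channels=1, out_channels=1, kernel_size=3, dilation=2, padding=2)
seq = torch.randn(1, 1, 256)        # e.g., a pooled 256-channel descriptor
print(conv(seq).shape)              # torch.Size([1, 1, 256])
```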
Table 2 lists the parameters and settings of the efficient atrous channel-wise attention (EACA) module. First, the input feature maps ($X$) are pooled into the feature descriptors ($p$ and $q$) using GAP and GMP. These descriptors are then processed by three 1D atrous convolutions with different atrous rates while keeping the dimension constant. After that, the outputs of the two 1D atrous convolution branches (one for $p$ and one for $q$) are summed, activated with a sigmoid to generate the attention weights ($w$), and applied to refine the input feature map. Lastly, a residual connection combines the refined output with the original input, so the final output feature maps of the EACA module have the same dimensions as the input.
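Assuming the EACA sketch given above, the residual combination and the shape-preserving behavior described for Table 2 can be exercised as follows:

```python
eaca = EACA(rates=(1, 2))        # sketch module defined earlier
x = torch.randn(2, 256, 64, 64)
out = x + eaca(x)                # refined map plus residual connection
print(out.shape)                 # torch.Size([2, 256, 64, 64]), same shape as the input
```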