3.1. LMNet
A hybrid model, LMNet, based on YOLOv5 and an RT3DsAM, is proposed in this study. It extracts features from small aerial targets using a high-resolution representation. The residual transformer 3D-spatial attention establishes global long-distance dependencies and takes contextual information into account to improve the accuracy of small-target recognition. The overall structure of LMNet consists of a ladder-type backbone network, the path aggregation network (PAN) [26] neck, and the detection head. To obtain multi-scale, high-resolution fused features, we first design a ladder-type backbone network based on the high-resolution encoder of the high-resolution network (HRNet) [27] and use it as the YOLOv5 backbone. We then connect an RT3DsAM to the fourth stage's 1/8, 1/16, and 1/32 resolution feature maps and feed the outputs into the PAN neck. Finally, the soft non-maximum suppression (Soft-NMS) [28] algorithm is used to post-process the detection results, reducing the missed-detection rate when many tiny targets overlap. The LMNet structure is shown in Figure 2, where each black rectangle represents a bottleneck, each blue rectangle represents a basic block, and each yellow rectangle represents a residual transformer 3D-spatial attention module.
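For reference, the linear score-decay variant of Soft-NMS can be sketched as follows; this is a minimal single-class illustration of the idea in [28] (the helper names and thresholds are ours), not the exact post-processing routine used in our pipeline.

```python
import torch

def box_iou(a, b):
    """Pairwise IoU between boxes in (x1, y1, x2, y2) format."""
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    lt = torch.max(a[:, None, :2], b[None, :, :2])
    rb = torch.min(a[:, None, 2:], b[None, :, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    return inter / (area_a[:, None] + area_b[None, :] - inter)

def soft_nms(boxes, scores, iou_thresh=0.3, score_thresh=0.001):
    """Linear Soft-NMS: overlapping boxes are down-weighted by (1 - IoU)
    instead of being discarded outright, which helps when small targets overlap."""
    boxes, scores = boxes.clone(), scores.clone()
    kept_boxes, kept_scores = [], []
    while scores.numel() > 0:
        top = torch.argmax(scores)
        kept_boxes.append(boxes[top])
        kept_scores.append(scores[top].clone())
        ious = box_iou(boxes[top].unsqueeze(0), boxes).squeeze(0)
        decay = torch.where(ious > iou_thresh, 1.0 - ious, torch.ones_like(ious))
        scores = scores * decay            # the selected box decays to 0 (IoU with itself is 1)
        keep_mask = scores > score_thresh
        boxes, scores = boxes[keep_mask], scores[keep_mask]
    if kept_boxes:
        return torch.stack(kept_boxes), torch.stack(kept_scores)
    return boxes.new_zeros((0, 4)), scores.new_zeros((0,))
```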
YOLO [29,30,31,32] is a classical single-stage target detection algorithm. YOLOv5 was developed from YOLOv4 [32] and YOLOv3 [31] and turns the detection problem into a regression problem. Unlike two-stage detection networks, it does not extract regions of interest but directly regresses the bounding-box coordinates and class probabilities, which makes it faster than Faster R-CNN. YOLOv5 has four versions: YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. YOLOv5s is the lightest version, with the fewest parameters and the fastest detection speed. The network structure is shown in Figure 3. The CBS block contains a convolution layer, batch normalization, and a SiLU activation function; the CSP1_x block contains CBS blocks and x residual connection units; the CSP2_x block contains x CBS blocks; and the SPPF mainly consists of three MaxPool layers.
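For illustration, a CBS block of this kind can be expressed as a small PyTorch module; this is a generic sketch (the default 3 × 3 kernel and the layer names are assumptions), not YOLOv5's exact implementation.

```python
import torch.nn as nn

class CBS(nn.Module):
    """Conv + BatchNorm + SiLU, the basic building block referred to as CBS."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        padding = kernel_size // 2                      # keep spatial size when stride = 1
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))
```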
The YOLOv5 model is composed of four parts: input, backbone, neck, and head. First, YOLOv5's input adopts mosaic data augmentation, which enriches the dataset via random scaling, random cropping, and random layout. Random scaling, in particular, adds many small targets, making the network more robust. Second, as can be observed in Figure 3, the backbone network has a relatively simple structure. In terms of feature extraction and fusion, it is not as well integrated as HRNet's backbone and lacks fusion between high- and low-resolution features. Third, inspired by the path aggregation network (PANet) [26], the neck is designed as a feature pyramid network (FPN) + PAN structure. Finally, three prediction branches are designed; the prediction information includes the target coordinates, category, and confidence. The post-processing of detected objects adopts weighted non-maximum suppression.
U-Net [33], SegNet [34], DeconvNet [35], Hourglass [36], and other mainstream backbone networks typically reduce resolution first and then increase it. They encode the image as a low-resolution representation, convolutionally connect the high- and low-resolution representations, and then restore the high-resolution representation from the low-resolution one. HRNet, on the other hand, connects high- and low-resolution subnetworks in parallel and performs iterative multi-scale fusion to obtain spatially accurate results. The proposed model in this paper makes use of this high-resolution design. Researchers in the field of small-target detection have combined HRNet with small-target detection and achieved good results. Wang et al. [37] presented a small-object detection method for remote sensing images based on candidate-region feature alignment, which alleviated, to some extent, the problem of small targets in UAV optical remote sensing images. For detecting traffic signs in bad weather, Zhou et al. [38] proposed a parallel fusion attention network built on HRNet, in which repeated multi-scale fusion of high- and low-resolution representations provides more information and improves accuracy.
In order to solve the problem of detecting small driver helmets in aerial photography, we need to extract more information from the limited resolution. We therefore design LMNet: in the feature-extraction stage, a ladder-type backbone network is designed that better retains high-resolution image information, allows high- and low-resolution information to interact and fuse fully, and is more sensitive to object location information.
Figure 4 depicts the ladder-type backbone network. To reduce redundant parameters, we use a bottleneck only in the first stage of the network and a basic block in the remaining stages, which significantly reduces the number of parameters and speeds up model inference. In the ladder-type backbone network, multi-scale fusion relies on strided convolution, upsampling, downsampling, and summation operations. We use the resolution fusion in Stage3 to illustrate the network's feature fusion. Stage2 outputs three resolution representations $\{H_x,\ x = 1, 2, 3\}$, while Stage3 outputs three corresponding resolution representations $\{H'_r,\ r = 1, 2, 3\}$; the fusion is given by Formula (1):

$$H'_r = \sum_{x=1}^{3} T_{x,r}(H_x), \quad r = 1, 2, 3, \tag{1}$$

where r represents the resolution index and $T_{x,r}(\cdot)$ represents the transformation function. When cross-stage fusion is performed, such as from Stage3 to Stage4, an additional output is computed as shown in Formula (2):

$$H'_4 = \sum_{x=1}^{3} T_{x,4}(H_x). \tag{2}$$

In Formulas (1) and (2), x is the resolution index of the input $H_x$ and r is the resolution index of the output. If x = r, then $T_{x,r}(H_x) = H_x$. If x > r, $T_{x,r}$ upsamples the input $H_x$ and adjusts the number of channels with a 1 × 1 convolution. If x < r, $T_{x,r}$ downsamples the input $H_x$ with (r − x) stride-2 convolutions. By performing feature fusion between the different resolution branches, we finally take the 1/8, 1/16, and 1/32 resolution feature maps as the output of the ladder-type backbone network and introduce an RT3DsAM on each branch, effectively improving the detection accuracy of small targets in aerial photography.
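A minimal sketch of such a transformation function, assuming HRNet-style building blocks (nearest-neighbor upsampling, 3 × 3 stride-2 convolutions, batch normalization), is shown below; the layer choices are illustrative rather than the exact layers used in LMNet.

```python
import torch.nn as nn

def make_transform(in_ch, out_ch, x, r):
    """Build T_{x,r}: identity when x == r, channel adjustment + upsampling when x > r,
    and (r - x) stride-2 convolutions when x < r (a larger index means lower resolution)."""
    if x == r:
        return nn.Identity()
    if x > r:  # input is lower resolution: adjust channels, then upsample by 2^(x - r)
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.Upsample(scale_factor=2 ** (x - r), mode='nearest'),
        )
    # x < r: input is higher resolution, downsample with (r - x) stride-2 convolutions
    layers, ch = [], in_ch
    for step in range(r - x):
        nxt = out_ch if step == r - x - 1 else ch
        layers += [nn.Conv2d(ch, nxt, kernel_size=3, stride=2, padding=1, bias=False),
                   nn.BatchNorm2d(nxt), nn.ReLU(inplace=True)]
        ch = nxt
    return nn.Sequential(*layers)
```

The fused representation $H'_r$ is then the element-wise sum of the transformed branches, as in Formula (1).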
In the motorcycle-driver helmet detection stage, we treat the motorcycle and driver as a single target to overcome the problem of misclassifying pedestrians. Motorcycles are divided into electric motorcycles, fuel motorcycles, and electric bicycles, whose appearances, sizes, and postures differ greatly from vehicle to vehicle; in addition, during riding, the appearances and postures of tricycles, bicycles, and motorcycles are similar, resulting in small differences between classes. As a result, large intra-class gaps and small inter-class gaps can lead to a large number of false positives and low detection accuracy. To address these sample flaws, we build an RT3DsAM that uses channel global self-attention to capture long-distance dependencies.
In general, attention mechanisms are divided into two types: hard attention and soft attention. The hard attention mechanism selects the most useful part of the input features, whereas the soft attention mechanism learns a weighting vector to weight all of the features. Soft attention is commonly used in image classification and object detection. For example, Wang et al. [39] used scale attention to weight the outputs of convolutions with different filter sizes. A lightweight channel attention was proposed by squeezing and exciting channel features. Furthermore, Woo et al. [22] created the CBAM, which serially generates attention feature maps in the channel and spatial dimensions and multiplies them with the feature map to produce the final feature map, improving object detection and image classification performance. In this paper, we design an RT3DsAM that takes into account not only the adaptive recalibration of the input feature maps but also the missing correlations between deep abstract positional pixel information, focusing on the adaptive selection of high-level semantic information and the refinement of learned small-target features.
Given the input feature $F \in \mathbb{R}^{C \times H \times W}$, we generate a one-dimensional channel residual transformer attention $M_c \in \mathbb{R}^{C \times 1 \times 1}$ and a three-dimensional spatial attention $M_s \in \mathbb{R}^{C \times H \times W}$, as shown in Figure 5, where H, W, and C indicate height, width, and channel, respectively. The two attention modules are used for global long-distance self-attention modeling and for layer-wise self-selection of spatial information, respectively. The overall attention calculation is summarized in Formulas (3)–(5): the input is first rescaled by the channel attention, $F' = M_c(F) \otimes F$, and the result is then rescaled by the 3D spatial attention, $F'' = M_s(F') \otimes F'$, where $\otimes$ represents element-wise multiplication, Cat(·) is concatenation along the channel dimension, and Conv is a 2D convolution. Details of the attention calculation are given in the next two paragraphs.
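Schematically, and assuming the module names `RCTAM` and `SpatialAttention3D` for the two components (both are sketched in later code snippets, and both already include the rescaling by their attention map), the overall RT3DsAM can be wired as a small PyTorch module:

```python
import torch.nn as nn

class RT3DsAM(nn.Module):
    """Sketch of the overall attention: channel attention first, then 3D spatial attention."""
    def __init__(self, channels):
        super().__init__()
        self.rctam = RCTAM(channels)               # residual channel transformer attention (sketched later)
        self.sdam = SpatialAttention3D(channels)   # 3D spatial attention (sketched later)

    def forward(self, f):          # f: (B, C, H, W)
        f1 = self.rctam(f)         # F'  = M_c(F)  * F  (rescaling done inside RCTAM)
        f2 = self.sdam(f1)         # F'' = M_s(F') * F'
        return f2
```

This mirrors the order described above: the channel attention establishes the global long-distance dependencies, and the 3D spatial attention then selects spatial information layer by layer.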
The RCTAM aims to emphasize the extraction of information from the global image that is useful for feature representation and final classification detection, and to establish self-attention among such features. To accomplish this goal, we need a channel global-information-driven function that maps the input features to the target weight vector, which means that the target considers not only the helmet but also the rider and the type of vehicle being ridden. This reduces the intra-class gap while increasing the inter-class gap. The RCTAM is shown in Figure 6. To generate the target-wide summary statistic $z^{avg}$, adaptive global average pooling operates on each feature over the spatial dimensions H × W. The pooled output for the nth channel of input height and width H × W is calculated as shown in Formula (6):

$$z_n^{avg} = \mathrm{AvgPool}(F_n), \tag{6}$$

where $F_n$ is the input feature matrix of the nth channel and AvgPool(·) denotes the adaptive global average pooling operation. According to reference [22], which studied this in CBAM, global max pooling plays an important role as supplementary global information for global average pooling, and it is also used in this paper. The corresponding statistic $z^{max}$ for the nth channel is calculated as shown in Formula (7):

$$z_n^{max} = \mathrm{MaxPool}(F_n), \tag{7}$$

where MaxPool(·) denotes the adaptive global max pooling operation.
To fully capture the interaction between global high-dimensional information and to establish correlations between cross-channel position pixels and cross-object position awareness, a residual transformer block (RTB), consisting of two convolution layers wrapped around a nonlinearity and multi-head self-attention, is applied to the pooled descriptors. The RTB is shown in Figure 7 and the multi-head self-attention (MHSA) is shown in Figure 8. The first convolution layer is a dimensionality-reduction layer parameterized by $W_1 \in \mathbb{R}^{C/r \times C}$ with reduction ratio r, followed by a rectified linear unit (ReLU) [40] $\delta$. The second layer is a multi-head self-attention layer parameterized by $T$. The third is a dimensionality-increasing layer parameterized by $W_2 \in \mathbb{R}^{C \times C/r}$. The reduction ratio r is set to 1 in our experiments.
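A minimal sketch of the RTB is given below, assuming a residual connection around the reduce/attend/expand path, 1 × 1 convolutions for the reduction and expansion layers, and a 4 × 4 pooled input; these choices are illustrative rather than confirmed details of LMNet. The `MHSA2d` module is sketched after the discussion of Figure 8.

```python
import torch.nn as nn

class RTB(nn.Module):
    """Residual transformer block sketch: 1x1 reduce -> ReLU -> MHSA -> 1x1 expand,
    wrapped in a residual connection (the residual wiring is an assumption).
    Reduction ratio r = 1, as stated in the text."""
    def __init__(self, channels, pooled_size=4, heads=4, reduction=1):
        super().__init__()
        mid = channels // reduction
        self.reduce = nn.Conv2d(channels, mid, kernel_size=1)       # W1
        self.relu = nn.ReLU(inplace=True)                           # delta
        self.mhsa = MHSA2d(mid, pooled_size, pooled_size, heads)    # T (sketched below)
        self.expand = nn.Conv2d(mid, channels, kernel_size=1)       # W2

    def forward(self, z):                      # z: (B, C, h, w) pooled descriptor
        out = self.expand(self.mhsa(self.relu(self.reduce(z))))
        return out + z                         # residual connection (assumed)
```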
To build the RTB-based channel global self-attention and reduce information loss, we share the parameters $\{W_1, T, W_2\}$ of the RTB between the outputs of global adaptive average pooling and global adaptive max pooling. The two RTB outputs, $RTB(z^{avg})$ and $RTB(z^{max})$, are then concatenated along the channel dimension and processed by a 2D convolution:

$$M_c(F) = \sigma\big(Conv\big(Cat\big(RTB(z^{avg}), RTB(z^{max})\big)\big)\big),$$

where $\sigma$ refers to the sigmoid function and Cat(·) denotes the concatenation operator. Conv is a convolution with stride 1, padding 0, and a 4 × 4 kernel. The final output of the RCTAM is obtained by rescaling the input features with the output activation $M_c(F)$:

$$F' = M_c(F) \otimes F,$$

where $\otimes$ refers to the pixel-wise multiplication between each channel's feature map $F_n$ and its scalar weight $M_{c,n}$.
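Putting these pieces together, the RCTAM could be sketched as follows; the 4 × 4 pooled size is an assumption, chosen so that the 4 × 4 convolution described above collapses the concatenated RTB outputs into one weight per channel.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RCTAM(nn.Module):
    """Sketch of the residual channel transformer attention module."""
    def __init__(self, channels, pooled_size=4):
        super().__init__()
        self.pooled_size = pooled_size
        self.rtb = RTB(channels, pooled_size)    # shared parameters for avg and max branches
        # 4 x 4 kernel (when pooled_size = 4), stride 1, padding 0, as described in the text
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=pooled_size, stride=1, padding=0)

    def forward(self, x):                                # x: (B, C, H, W)
        z_avg = F.adaptive_avg_pool2d(x, self.pooled_size)
        z_max = F.adaptive_max_pool2d(x, self.pooled_size)
        y = torch.cat([self.rtb(z_avg), self.rtb(z_max)], dim=1)   # concatenate on channels
        m_c = torch.sigmoid(self.fuse(y))                # (B, C, 1, 1) channel attention
        return m_c * x                                   # rescale the input features
```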
In Figure 8, ⊕ and ⊗ represent element-wise summation and matrix multiplication, respectively, and 1 × 1 denotes a pointwise convolution. Following reference [41], we use four heads in the multi-head self-attention in this paper. The highlighted blue boxes represent the position encodings and the value projection, in addition to the use of multiple heads. The feature map after global adaptive pooling is sent to all2all attention, and $R_h$ and $R_w$ encode the height and width of the input feature map, respectively. The attention logits are $qk^{\top} + qr^{\top}$, where q, k, v, and r represent the query, key, value, and position encoding, respectively. Similarly, we use relative-distance position encodings [42,43,44]. According to the studies [43,44,45], relative-distance-aware position encodings are better suited for vision tasks: attention takes into account not only the content information but also the relative distances between features at different locations, and can thus effectively associate information across objects with positional awareness [41].
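The MHSA of Figure 8 can be sketched along the lines of a BoTNet-style block [41]; the sketch below is simplified (the per-head learned $R_h$/$R_w$ encodings, the scaled logits, and the `MHSA2d` name are assumptions), but it shows the $qk^{\top} + qr^{\top}$ logits described above.

```python
import torch
import torch.nn as nn

class MHSA2d(nn.Module):
    """Multi-head self-attention over a small 2D feature map with learned relative
    height/width position encodings. The attention logits are q k^T + q r^T."""
    def __init__(self, dim, height, width, heads=4):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.dim_head = heads, dim // heads
        self.q = nn.Conv2d(dim, dim, 1)
        self.k = nn.Conv2d(dim, dim, 1)
        self.v = nn.Conv2d(dim, dim, 1)
        self.rel_h = nn.Parameter(torch.randn(1, heads, self.dim_head, height, 1))
        self.rel_w = nn.Parameter(torch.randn(1, heads, self.dim_head, 1, width))

    def forward(self, x):                                   # x: (B, C, H, W), H/W must match init
        b, c, h, w = x.shape
        def split(t):                                       # (B, C, H, W) -> (B, heads, d, HW)
            return t.view(b, self.heads, self.dim_head, h * w)
        q, k, v = split(self.q(x)), split(self.k(x)), split(self.v(x))
        r = (self.rel_h + self.rel_w).view(1, self.heads, self.dim_head, h * w)
        content = torch.einsum('bhdi,bhdj->bhij', q, k)                              # q k^T
        position = torch.einsum('bhdi,bhdj->bhij', q, r.expand(b, -1, -1, -1))       # q r^T
        attn = torch.softmax((content + position) / self.dim_head ** 0.5, dim=-1)
        out = torch.einsum('bhij,bhdj->bhdi', attn, v)      # weighted sum of values
        return out.reshape(b, c, h, w)
```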
3DsAM aims to further mine the spatial correlation between the max- and average-pooled information, enhance the spatial information of pixels whose labels belong to the same category in the neighborhood, and suppress pixels with labels of different classes. Therefore, the ideal output of 3DsAM is a feature matrix with the same height and width as the input that carries adaptively selected information through 3D spatial attention. It first obtains detailed spatial information about intra- and inter-class objects from two channels, then establishes a spatial attention map for each input channel, and finally forms a 3D spatial attention that adaptively adjusts the weights layer by layer. Figure 9 shows the 3DsAM. As in CBAM [22], we apply global max pooling and global average pooling to the input across channels:

$$F^{max}(i, j) = \max_{1 \le c \le C} F'_c(i, j), \qquad F^{avg}(i, j) = \frac{1}{C}\sum_{c=1}^{C} F'_c(i, j),$$

where $F'_c(i, j)$ is the value at position (i, j) of the cth channel. Then, the two outputs are concatenated as the input of a new convolutional layer followed by a sigmoid activation function:

$$M_s(F') = \sigma\big(Conv\big(Cat\big(F^{max}, F^{avg}\big)\big)\big),$$

where Conv is a convolution with stride 1, padding 0, and kernel size 1 × 1, and $M_s(F')$ is the 3D attention map obtained by activating the convolved features with the sigmoid function $\sigma$. The final output of 3DsAM is obtained by rescaling the input features $F'$ with the output activation $M_s(F')$:

$$F'' = M_s(F') \otimes F',$$

where $\otimes$ denotes the spatial-wise multiplication of each channel between the feature map $F'$ and the 3D spatial attention map $M_s(F')$.
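A minimal sketch of the 3DsAM is given below; mapping the two pooled maps to C output channels with the 1 × 1 convolution (so that the attention map is C × H × W rather than 1 × H × W) is our reading of the module and should be treated as an assumption.

```python
import torch
import torch.nn as nn

class SpatialAttention3D(nn.Module):
    """Sketch of 3DsAM: cross-channel max and mean maps are concatenated and a
    1x1 convolution expands them to a C x H x W attention map."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(2, channels, kernel_size=1, stride=1, padding=0)

    def forward(self, x):                                 # x: (B, C, H, W)
        max_map, _ = torch.max(x, dim=1, keepdim=True)    # (B, 1, H, W) cross-channel max
        avg_map = torch.mean(x, dim=1, keepdim=True)      # (B, 1, H, W) cross-channel mean
        m_s = torch.sigmoid(self.conv(torch.cat([max_map, avg_map], dim=1)))  # (B, C, H, W)
        return m_s * x                                    # spatial-wise rescaling of each channel
```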
3.2. ESRGAN
The ESRGAN [25] model is an improvement of the image super-resolution generative adversarial network (SRGAN) [46]. Starting from SRGAN, the generator network, the discriminator's identification target, and the loss function are adjusted and optimized, respectively, which significantly improves the performance of the SRGAN algorithm. The generator network structure of ESRGAN is shown in Figure 10.
Unlike SRGAN, ESRGAN replaces the residual modules with dense connection blocks. Each dense connection block is an improved residual module, distinguished by its multi-layer residual (residual-in-residual) structure, which uses a deeper convolutional neural network to improve performance. To address the increased computation this introduces, the ESRGAN algorithm adopts a strategy similar to that of the fast super-resolution convolutional neural network (FSRCNN) [47]. At the same time, ESRGAN points out that batch normalization (BN) tends to produce artifacts when training deep GANs, so the backbone network abandons BN.
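For reference, one dense connection block of this residual-in-residual structure can be sketched as follows, following the commonly published ESRGAN configuration (64 feature channels, 32 growth channels, LeakyReLU, no BN, residual scaling); the exact hyperparameters are assumptions rather than values taken from this paper.

```python
import torch
import torch.nn as nn

class ResidualDenseBlock(nn.Module):
    """One dense block: five 3x3 convolutions with dense connections,
    LeakyReLU activations, no batch normalization, and a scaled residual."""
    def __init__(self, channels=64, growth=32, res_scale=0.2):
        super().__init__()
        self.res_scale = res_scale
        self.convs = nn.ModuleList(
            nn.Conv2d(channels + i * growth, growth if i < 4 else channels, 3, 1, 1)
            for i in range(5)
        )
        self.lrelu = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, x):
        feats = [x]
        for i, conv in enumerate(self.convs):
            out = conv(torch.cat(feats, dim=1))
            if i < 4:
                out = self.lrelu(out)
                feats.append(out)
        return x + self.res_scale * out        # residual connection with scaling
```

An RRDB then stacks three such blocks and wraps them in another scaled residual connection.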
ESRGAN also improves the discriminator. In the original SRGAN algorithm, a reconstructed image that is judged to be false does not necessarily mean the reconstruction is not realistic enough; the discriminator may simply fail to correctly identify the content of the image. Conversely, if the output value for a reconstructed image is high but the corresponding value for the original image is even higher, the reconstruction still needs to be improved. Therefore, instead of evaluating the reconstructed image and the original image independently, ESRGAN passes both through the discriminator and subtracts the outputs. As a result, when the output value of a reconstructed image is very low but the output value of its corresponding original image is even lower, the adversarial loss is greatly reduced compared with SRGAN and has little effect on the generator; conversely, if the output value of the reconstructed image is high but the output value of the original image is even higher, the loss still increases, so that the generator is pushed to produce a more realistic reconstruction. These steps can be represented as

$$D_{Ra}(x_f, x_r) = \sigma\big(C(x_f) - E[C(x_r)]\big), \qquad D_{Ra}(x_r, x_f) = \sigma\big(C(x_r) - E[C(x_f)]\big),$$

where the reconstructed image $x_f$ and the original image $x_r$ are input into the discriminator network in a random order, and $C(x_f)$ and $C(x_r)$ are the raw discriminator outputs used to judge them as true. The function $\sigma(\cdot)$ is the sigmoid function, which maps the output to the range 0 to 1, and $E[\cdot]$ denotes the expectation. In this process, the discriminator does not know the input order of the images; it only computes the probability that each image is real and then subtracts them. $D_{Ra}(x_f, x_r)$ denotes the pair in which the reconstructed image comes first and the real image second; in the ideal state (the reconstructed image from the generator makes the discriminator unable to distinguish true from false), this value is close to 1. $D_{Ra}(x_r, x_f)$ denotes the pair in which the real image comes first and the reconstructed image second; by subtracting the two probabilities and applying the sigmoid, this value is close to 0 in the ideal state. The $L_G^{Ra}$ of Formula (18) is the mathematical expression of the ESRGAN generator's adversarial loss function. The total loss of the ESRGAN discriminator can be expressed as

$$L_D^{Ra} = -E_{x_r}\big[\log\big(D_{Ra}(x_r, x_f)\big)\big] - E_{x_f}\big[\log\big(1 - D_{Ra}(x_f, x_r)\big)\big].$$
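A sketch of how these relativistic quantities turn into trainable losses is given below; the function names are ours, and the binary cross-entropy formulation follows the standard ESRGAN recipe rather than anything specific to this paper.

```python
import torch
import torch.nn.functional as F

def relativistic_d_loss(c_real, c_fake):
    """Discriminator loss sketch: real images should score higher than the average fake,
    and fakes lower than the average real. c_real/c_fake are raw (pre-sigmoid) outputs."""
    d_real = c_real - c_fake.mean()          # D_Ra(x_r, x_f) before the sigmoid
    d_fake = c_fake - c_real.mean()          # D_Ra(x_f, x_r) before the sigmoid
    return (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) +
            F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))

def relativistic_g_loss(c_real, c_fake):
    """Generator adversarial loss sketch: pushes reconstructed images to score
    higher than the average real image (the L_G^Ra term of Formula (18))."""
    d_real = c_real - c_fake.mean()
    d_fake = c_fake - c_real.mean()
    return (F.binary_cross_entropy_with_logits(d_real, torch.zeros_like(d_real)) +
            F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake)))
```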
ESRGAN also includes an L1 loss term to further improve reconstruction accuracy. To reconstruct more accurate low-level details, the total loss function of the ESRGAN generator is defined as

$$L_G = L_{percep} + \lambda L_G^{Ra} + \eta L_1,$$

where $\lambda$ and $\eta$ are balancing coefficients. $L_{percep}$ is the structural (perceptual) loss of ESRGAN, which is roughly the same as that of SRGAN; the difference is that the visual geometry group (VGG) network [48] feature extractor used for this loss removes the output activation function, i.e., features are taken before activation. In this paper, we use the weights trained on the DIV2K and Flickr2K (DF2K) and OutdoorScene (OST) training sets to reconstruct the original image at twice its resolution.
Figure 11 shows that the rider and helmet have more realistic textures and sharper edges, which provides more features for subsequent network learning.