SuperYOLO: Super Resolution Assisted Object Detection in Multimodal Remote Sensing Imagery
Abstract—Accurately and promptly detecting multiscale small objects that contain only tens of pixels in remote sensing images (RSI) remains challenging. Most of the existing solutions primarily design complex deep neural networks to learn strong feature representations for objects separated from the background, which often results in a heavy computation burden. In this article, we propose an accurate yet fast object detection method for RSI, named SuperYOLO, which fuses multimodal data and performs high-resolution (HR) object detection on multiscale objects by utilizing assisted super resolution (SR) learning while considering both the detection accuracy and the computation cost. First, we utilize a symmetric compact multimodal fusion (MF) module to extract supplementary information from various data for improving small object detection in RSI. Furthermore, we design a simple and flexible SR branch to learn HR feature representations that can discriminate small objects from vast backgrounds with low-resolution (LR) input, thus further improving the detection accuracy. Moreover, to avoid introducing additional computation, the SR branch is discarded in the inference stage, and the computation of the network model is reduced due to the LR input. Experimental results show that, on the widely used VEDAI RS dataset, SuperYOLO achieves an accuracy of 75.09% (in terms of mAP50), which is more than 10% higher than SOTA large models such as YOLOv5l, YOLOv5x, and the RS-designed YOLOrs. Meanwhile, the parameter size and GFLOPs of SuperYOLO are about 18× and 3.8× smaller than those of YOLOv5x. Our proposed model shows a favorable accuracy–speed tradeoff compared with state-of-the-art models. The code will be open-sourced at https://github.com/icey-zhang/SuperYOLO.

Index Terms—Feature fusion, multimodal remote sensing image, object detection, super resolution (SR).

I. INTRODUCTION

Deep neural network (DNN)-based object detection frameworks [1], [2], [3], [4], [5] have been proposed, updated, and optimized in computer vision. The remarkable accuracy enhancement of DNN-based object detection frameworks can be attributed to the application of large-scale natural datasets with accurate annotations [6], [7], [8].

Compared with natural scenarios, there are several vital challenges for accurate object detection in remote sensing images (RSIs). First, the number of labeled samples is relatively small, which limits the training of DNNs to achieve high detection accuracy. Second, the size of objects in RSI is much smaller, accounting for merely tens of pixels in relation to the complicated and broad backgrounds [9], [10]. Moreover, the scale of those objects is diverse across multiple categories [11]. As shown in Fig. 1(a), the object car is considerably small within a vast area. As shown in Fig. 1(b), the objects have large-scale variations; for example, the scale of a car is smaller than that of a camping vehicle.

Currently, most object detection techniques are solely designed and applied for a single modality, such as red-green-blue (RGB) and infrared (IR) [12], [13]. Consequently, the capability of object detection to recognize objects on the Earth's surface remains insufficient due to the deficiency of complementary information between different modalities [14]. As imaging technology flourishes, RSIs collected from multiple modalities become available and provide an opportunity to improve detection accuracy. For example, as shown in Fig. 1, the fusion of two different modalities (RGB and IR) can effectively enhance the detection accuracy.
Fig. 1. Visual comparison of the RGB image, IR image, and ground truth (GT). The IR image provides vital complementary information for resolving the challenges in RGB detection. The object car in (a) is considerably small within a vast area. In (b), the objects have large-scale variations; for example, the scale of a car is smaller than that of a camping vehicle. The fusion of the RGB and IR modalities effectively enhances detection performance.
models, we choose the small-size YOLOv5s [19] structure as our detection baseline. It can reduce deployment costs and facilitate rapid deployment of the model. Considering the high-resolution (HR) retention requirements for small objects, we remove the Focus module in the baseline YOLOv5s model, which not only benefits localizing small dense objects but also enhances the detection performance. Considering the complementary characteristics of different modalities, we propose a multimodal fusion (MF) scheme to improve the detection performance for RSI. We evaluate different fusion alternatives (pixel-level or feature-level) and choose pixel-level fusion for its low computation cost.

Lastly and most importantly, we develop an SR assurance module to guide the network to generate HR features that are capable of identifying small objects in vast backgrounds, thereby reducing false alarms induced by background-contaminated objects in RSI. Nevertheless, a naive SR solution can significantly increase the computation cost. Therefore, we engage the auxiliary SR branch only in the training process and remove it in the inference stage, facilitating spatial information extraction in HR without increasing the computation cost.

In summary, this article makes the following contributions.

1) We propose a computation-friendly pixel-level fusion method to combine inner information bidirectionally in a symmetric and compact manner. It efficiently decreases the computation cost without sacrificing accuracy compared with feature-level fusion.
2) We introduce an assisted SR branch into multimodal object detection for the first time. Our approach not only makes a breakthrough in limited detection performance but also paves a more flexible way to study outstanding HR feature representations that are capable of discriminating small objects from vast backgrounds with low-resolution (LR) input.
3) Considering the demand for high-quality results and low computation cost, the SR module, functioning as an auxiliary task, is removed during the inference stage without introducing additional computation. The SR branch is general and extensible and can be inserted into existing fully convolutional network (FCN) frameworks.
4) The proposed SuperYOLO markedly improves the performance of object detection, outperforming SOTA detectors in real-time multimodal object detection. Our proposed model shows a favorable accuracy–speed tradeoff compared with state-of-the-art models.

II. RELATED WORK

A. Object Detection With Multimodal Data

Recently, multimodal data have been widely leveraged in numerous practical application scenarios, including visual question answering [20], auto-pilot vehicles [21], saliency detection [22], and remote sensing classification [23]. It is found that combining the internal information of multimodal data can efficiently transfer complementary features and avoid the information of a single modality from being omitted. In the field of RSI processing, there exist various modalities (e.g., RGB, synthetic aperture radar (SAR), Light Detection and Ranging (LiDAR), IR, panchromatic (PAN), and multispectral (MS) images) from diverse sensors, which can be fused with complementary characteristics to enhance the performance of various tasks [24], [25], [26]. For example, the additional IR modality [27] captures longer thermal wavelengths to improve detection under difficult weather
Fig. 2. Overview of the proposed SuperYOLO framework. Our new contributions include: 1) removal of the Focus module to preserve HR; 2) MF; and
3) assisted SR branch. The architecture is optimized in terms of mean square error (mse) loss for the SR branch and task-specific loss for object detection.
During the training stage, the SR branch guides the related learning of the spatial dimension to enhance the HR information preservation for the backbone.
During the test stage, the SR branch is removed to accelerate the inference speed equal to the baseline.
conditions. Manish et al. [27] proposed a real-time framework for object detection in multimodal remote sensing imaging, in which the extended version conducted mid-level fusion and merged data from multiple modalities. Although multisensor fusion can enhance the detection performance, as shown in Fig. 1, its limited detection accuracy and computing speed can hardly meet the requirements of real-time detection tasks.

The fusion methods are primarily grouped into three strategies, i.e., pixel-level fusion, feature-level fusion, and decision-level fusion [28]. Decision-level fusion methods fuse the detection results during the last stage, which may consume enormous computation resources due to repeated calculations for the different multimodal branches. In the field of remote sensing, feature-level fusion methods with multiple branches are mainly adopted. The multimodal images are input into parallel branches to extract the respective independent features of the different modalities, and then these features are combined by operations such as an attention module or simple concatenation. The parallel branches bring repeated computation as the number of modalities increases, which is unfriendly to real-time tasks in remote sensing.

In contrast, the adoption of pixel-level fusion methods can reduce unnecessary computation. In this article, our proposed SuperYOLO fuses the modalities at the pixel level to significantly reduce the computation cost and designs operations in the spatial and channel domains to extract inner information from the different modalities, which helps enhance detection accuracy.

B. Super Resolution in Object Detection

In the recent literature, the performance of small object detection can be improved by multiscale feature learning [29], [30] and context-based detection [31]. These methods enhance the information representation ability of the network at different scales but ignore the preservation of HR contextual information. Conducted as a preprocessing step, SR has proven to be effective and efficient in various object detection tasks [32], [33]. Shermeyer and Van Etten [34] quantified its effect on the detection performance of satellite imaging over multiple resolutions of RSI. Based on generative adversarial networks (GANs), Courtrai et al. [35] utilized SR to generate HR images that were fed into the detector to improve its detection performance. Rabbi et al. [36] leveraged a Laplacian operator to extract edges from the input image to enhance the capability of reconstructing HR images, thus improving performance in object localization and classification. Ji et al. [37] introduced a cycle-consistent GAN structure as an SR network and modified the faster R-CNN architecture to detect vehicles from enhanced images produced by the SR network. In these works, the adoption of the SR structure has effectively addressed the challenges regarding small objects. However, compared with single detection models, additional computation is introduced, which is attributed to the enlarged scale of the input image in the HR design.

Recently, Wang et al. [38] proposed an SR module that can maintain HR representations with LR input while reducing the model computation in segmentation tasks. Inspired by Wang et al. [38], we design an SR assisted branch. In contrast
Fig. 3. Backbone structure of YOLOv5s. The low-level texture and high-level semantic features are extracted by stacked CSP, CBS, and SPP structures.
to the aforementioned works, in which the SR is realized at the input stage, the assisted SR module guides the learning of high-quality HR representations for the detector, which not only strengthens the response of small dense objects but also improves the performance of object detection in the spatial space. Moreover, the SR module is removed in the inference stage to avoid extra computation.

III. BASELINE ARCHITECTURE

As shown in Fig. 2, the baseline YOLOv5 network consists of two main components: the backbone and the head (including the neck). The backbone is designed to extract low-level texture and high-level semantic features. Next, these hint features are fed to the head, which constructs an enhanced feature pyramid network from top to bottom to transfer robust semantic features and from bottom to top to propagate a strong response of local texture and pattern features. This resolves the scale variation issue of the objects by yielding enhanced detection across diverse scales.

In Fig. 3, CSPNet [39] is utilized as the backbone to extract the feature information, consisting of numerous simple Convolution-Batch-normalization-SiLU (CBS) components and cross stage partial (CSP) modules. The CBS is composed of a convolution, batch normalization, and the SiLU activation function [40]. The CSP duplicates the feature map of the previous layer into two branches and then halves the channel numbers through 1 × 1 convolutions, by which the computation is reduced. Of the two copies of the feature map, one is connected to the end of the stage, and the other is sent into ResNet blocks or CBS blocks as the input. Finally, the two copies of the feature map are concatenated to combine the features, which is followed by a CBS block. The spatial pyramid pooling (SPP) module [41] is composed of parallel max-pooling layers with different kernel sizes and is utilized to extract multiscale deep features. The low-level texture and high-level semantic features are extracted by stacked CSP, CBS, and SPP structures.
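To make the block descriptions above concrete, the following PyTorch-style sketch shows one common way to assemble CBS, CSP, and SPP blocks of this kind. It is a simplified illustration based on the description in this section; the class names, channel counts, kernel sizes, and the use of stacked CBS blocks inside the CSP module are our assumptions, not the exact YOLOv5s implementation.

import torch
import torch.nn as nn

class CBS(nn.Module):
    # Convolution -> Batch normalization -> SiLU activation.
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class CSP(nn.Module):
    # Duplicate the feature map into two halved-channel copies (1x1 convs),
    # process one copy with stacked blocks, then concatenate and merge.
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        c_half = c_out // 2
        self.split1 = CBS(c_in, c_half, 1)
        self.split2 = CBS(c_in, c_half, 1)
        self.blocks = nn.Sequential(*[CBS(c_half, c_half, 3) for _ in range(n)])
        self.merge = CBS(2 * c_half, c_out, 1)

    def forward(self, x):
        return self.merge(torch.cat((self.blocks(self.split1(x)), self.split2(x)), dim=1))

class SPP(nn.Module):
    # Parallel max-pooling with different kernel sizes to gather multiscale context.
    def __init__(self, c_in, c_out, kernels=(5, 9, 13)):
        super().__init__()
        c_half = c_in // 2
        self.reduce = CBS(c_in, c_half, 1)
        self.pools = nn.ModuleList(nn.MaxPool2d(k, stride=1, padding=k // 2) for k in kernels)
        self.merge = CBS(c_half * (len(kernels) + 1), c_out, 1)

    def forward(self, x):
        x = self.reduce(x)
        return self.merge(torch.cat([x] + [p(x) for p in self.pools], dim=1))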
Limitation 1: It is worth mentioning that the Focus module is introduced to decrease the amount of computation. As shown in Fig. 2 (bottom left), inputs are partitioned into individual pixels, reconstructed at intervals, and finally concatenated along the channel dimension. The inputs are thus resized to a smaller scale to reduce the computation cost and accelerate the network training and inference speed. However, this may sacrifice object detection accuracy to a certain extent, especially for small objects that are vulnerable to resolution.

Limitation 2: It is known that the backbone of YOLO employs deep convolutional neural networks to extract hierarchical features with a stride of 2, through which the size of the extracted features is halved. Hence, the feature size retained for multiscale detection is far smaller than that of the original input image. For example, when the input image size is 608, the sizes of the output features for the last detection layers are 76, 38, and 19, respectively. LR features may result in missing some small objects.

IV. SUPERYOLO ARCHITECTURE

As summarized in Fig. 2, we introduce three new contributions in our SuperYOLO network architecture. First, we remove the Focus module in the backbone and replace it with an MF module to avoid resolution degradation and, thus, accuracy degradation. Second, we explore different fusion methods and choose the computation-efficient pixel-level fusion to fuse the RGB and IR modalities and refine dissimilar and complementary information. Finally, we add an assisted SR module in the training stage, which reconstructs the HR images to guide the related backbone learning in the spatial dimension and, thus, maintain HR information. In the inference stage, the SR branch is discarded to avoid introducing additional computation overhead.
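As a high-level illustration of this training-versus-inference design, the forward logic can be sketched as follows. This is not the authors' released code; the component classes (fusion, backbone, head, sr_branch) are placeholders for the modules described in Sections III and IV.

import torch.nn as nn

class SuperYOLOSketch(nn.Module):
    # Hypothetical skeleton: pixel-level MF -> backbone -> detection head,
    # with an auxiliary SR branch used only while training.
    def __init__(self, fusion, backbone, head, sr_branch):
        super().__init__()
        self.fusion = fusion        # pixel-level multimodal fusion (Section IV-B)
        self.backbone = backbone    # CSP/CBS/SPP feature extractor (Section III)
        self.head = head            # YOLOv5 detection head
        self.sr_branch = sr_branch  # auxiliary SR encoder-decoder (Section IV-C)

    def forward(self, x_rgb, x_ir):
        fused = self.fusion(x_rgb, x_ir)
        feats = self.backbone(fused)      # multilevel features
        detections = self.head(feats)
        if self.training:
            # SR reconstruction supervises HR feature learning; dropped at inference.
            sr_image = self.sr_branch(feats)
            return detections, sr_image
        return detections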
A. Focus Removal

As presented in Section III and Fig. 2 (bottom left), the Focus module in the YOLOv5 backbone partitions images at intervals in the spatial domain and then reorganizes the new image to resize the input images. Specifically, this operation collects a value from every group of pixels in an image and then reconstructs them to obtain smaller complementary images. The size of the rebuilt images decreases as the number of channels increases. As a result, it causes resolution degradation and spatial information loss for small targets. Considering that the detection of small targets depends more heavily on higher resolution, the Focus module is abandoned and replaced by an MF module (as shown in Fig. 4) to prevent the resolution from being degraded.
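For reference, the interval slicing performed by the Focus module can be sketched as follows. This is a minimal reimplementation of the space-to-depth idea described above, not the YOLOv5 source; it makes explicit that the spatial size is halved while the channel count is quadrupled, which is why SuperYOLO removes the module.

import torch

def focus_slice(x):
    # x: (B, C, H, W) -> (B, 4C, H/2, W/2).
    # Every second pixel in each direction forms one sub-image, and the four
    # complementary sub-images are stacked along the channel axis.
    return torch.cat(
        (x[..., ::2, ::2], x[..., 1::2, ::2], x[..., ::2, 1::2], x[..., 1::2, 1::2]),
        dim=1,
    )

x = torch.randn(1, 3, 512, 512)
print(focus_slice(x).shape)  # torch.Size([1, 12, 256, 256])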
B. Multimodal Fusion

The more information is utilized to distinguish objects, the better the performance that can be achieved in object detection. MF is an effective path for merging different information from various sensors. Decision-, feature-, and pixel-level fusions are the three mainstream fusion methods, which can be deployed at different depths of the network. Since decision-level fusion requires enormous computation, it is not considered in SuperYOLO.

We propose a pixel-level MF to extract the shared and special information from the different modalities. The MF can combine multimodal inner information bidirectionally in a symmetric and compact manner. As shown in Fig. 4, for the pixel-level fusion, we first normalize an input RGB image and an input IR image to the interval [0, 1]. The input modalities X_RGB, X_IR ∈ R^{C×H×W} are subsampled to I_RGB, I_IR ∈ R^{C×(H/n)×(W/n)}, which are fed to SE blocks that extract inner information in the channel domain [42] to generate F_RGB and F_IR

F_RGB = SE(I_RGB), F_IR = SE(I_IR). (1)

Then, the attention maps that reveal the inner relationship of the different modalities in the spatial domain are defined as

m_IR = f_1(F_IR), m_RGB = f_2(F_RGB) (2)

where f_1 and f_2 represent 1 × 1 convolutions for the RGB and IR modalities, respectively. Here, ⊗ denotes element-wise matrix multiplication. The inner spatial information between the different modalities is produced by

F_in1 = m_RGB ⊗ F_RGB, F_in2 = m_IR ⊗ F_IR. (3)

To incorporate internal inner-view information and spatial texture information, the features are added to the original input modalities and then fed into 1 × 1 convolutions. The full features are

F_ful1 = f_3(F_in1 + I_RGB), F_ful2 = f_4(F_in2 + I_IR) (4)

where f_3 and f_4 represent 1 × 1 convolutions. Finally, the features are fused by

F_o = SE(Concat(F_ful1, F_ful2)) (5)

where Concat(·) denotes the concatenation operation along the channel axis. The result is then fed to the backbone to produce multilevel features. Note that X is subsampled to 1/n of the original image size to accomplish the SR module discussed in Section IV-C and to accelerate the training process. X represents the RGB or IR modality, and the sampled image, denoted as I ∈ R^{C×(H/n)×(W/n)}, is generated by

I = D(X) (6)

where D(·) represents the n-times downsampling operation using bilinear interpolation.
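A compact PyTorch-style sketch of this pixel-level fusion, following (1)–(5), is given below. It is an illustrative reimplementation rather than the authors' released code: the squeeze-and-excitation (SE) layer is a standard formulation, the reduction ratio and the final projection to the backbone width are our assumptions, and the inputs are assumed to already be bilinearly downsampled by a factor n as in (6) and normalized to [0, 1].

import torch
import torch.nn as nn

class SE(nn.Module):
    # Squeeze-and-excitation: channel-wise reweighting [42].
    def __init__(self, c, r=1):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(c, max(c // r, 1), 1), nn.ReLU(inplace=True),
            nn.Conv2d(max(c // r, 1), c, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.fc(x)

class PixelLevelMF(nn.Module):
    # Pixel-level multimodal fusion of RGB and IR inputs, Eqs. (1)-(5).
    def __init__(self, c=3, c_out=32):
        super().__init__()
        self.se_rgb, self.se_ir = SE(c), SE(c)
        self.f1, self.f2 = nn.Conv2d(c, c, 1), nn.Conv2d(c, c, 1)
        self.f3, self.f4 = nn.Conv2d(c, c, 1), nn.Conv2d(c, c, 1)
        self.se_out = SE(2 * c)
        self.proj = nn.Conv2d(2 * c, c_out, 1)  # projection to the backbone width (our addition)

    def forward(self, i_rgb, i_ir):
        f_rgb, f_ir = self.se_rgb(i_rgb), self.se_ir(i_ir)      # Eq. (1)
        m_ir, m_rgb = self.f1(f_ir), self.f2(f_rgb)             # Eq. (2)
        f_in1, f_in2 = m_rgb * f_rgb, m_ir * f_ir               # Eq. (3), element-wise
        f_ful1 = self.f3(f_in1 + i_rgb)                         # Eq. (4)
        f_ful2 = self.f4(f_in2 + i_ir)
        f_o = self.se_out(torch.cat((f_ful1, f_ful2), dim=1))   # Eq. (5)
        return self.proj(f_o)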
Fig. 5. SR structure of SuperYOLO. The SR structure can be regarded as a simple encoder–decoder model. The low- and high-level features of the backbone are selected to fuse local texture patterns and semantic information, respectively.
C. Super Resolution

As mentioned in Section III, the feature size retained for multiscale detection in the backbone is far smaller than that of the original input image. Most of the existing methods conduct upsampling operations to recover the feature size. Unfortunately, this approach has produced limited success due to the information loss in texture and pattern, which explains why it is inappropriate to employ this operation to detect small targets that require HR preservation in RSI.

To address this issue, as shown in Fig. 2, we introduce an auxiliary SR branch. First, the introduced branch shall facilitate the extraction of HR information in the backbone and achieve satisfactory performance. Second, the branch should not add computation that reduces the inference speed. It shall realize a tradeoff between accuracy and computation time during the inference stage. Inspired by the study of Wang et al. [38], where the proposed SR succeeded in facilitating segmentation tasks without additional requirements, we introduce a simple and effective branch named SR into the framework. Our proposal can improve detection accuracy without computation and memory overload, especially under circumstances of LR input.

Specifically, the SR structure can be regarded as a simple encoder–decoder model. We select the backbone's low- and high-level features to fuse local textures and patterns, and semantic information, respectively. As depicted in Fig. 4, we select the results of the fourth and ninth modules as the low- and high-level features, respectively. The encoder integrates the low-level feature and the high-level feature generated by the backbone. As illustrated in Fig. 5, in the encoder, the first CR module is applied to the low-level feature. For the high-level feature, we use an upsampling operation to match the spatial size of the low-level feature, and then we use a concatenation operation and two CR modules to merge the low- and high-level features. The CR module consists of a convolution and a ReLU. For the decoder, the LR feature is upscaled to the HR space, in which the SR module's output size is twice that of the input image. As illustrated in Fig. 5, the decoder is implemented using three deconvolutional layers. The SR branch guides the related learning of the spatial dimension and transfers it to the main branch, thereby improving the performance of object detection. In addition, we introduce EDSR [43] as our encoder structure to explore the SR performance and its influence on detection performance.

To present a more visually interpretable description, we visualize the features of the backbones of YOLOv5s, YOLOv5x, and SuperYOLO in Fig. 6. The features are upsampled to the same scale as the input image for comparison. By comparing the pairwise images of (c), (f), and (i); (d), (g), and (j); and (e), (h), and (k) in Fig. 6, it can be observed that SuperYOLO contains clearer object structures with higher resolution with the assistance of the SR branch. Eventually, we obtain a high-quality HR representation with the SR branch and utilize the head of YOLOv5 to detect small objects.
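A minimal PyTorch-style sketch of such an encoder–decoder SR branch is shown below (CR modules, upsample-and-concatenate merging, and a three-layer deconvolutional decoder). The channel widths and the assumption that the low-level feature sits at 1/4 of the network input resolution (so that three ×2 deconvolutions produce an output twice the input size) are illustrative choices, not the exact configuration of the paper, and the optional EDSR encoder is omitted.

import torch
import torch.nn as nn

class CR(nn.Module):
    # Convolution followed by ReLU.
    def __init__(self, c_in, c_out):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.body(x)

class SRBranch(nn.Module):
    # Encoder: merge a low-level and a high-level backbone feature.
    # Decoder: three stride-2 deconvolutions lifting the merged feature to 2x the input image size.
    def __init__(self, c_low=64, c_high=256, c_mid=64, c_img=3):
        super().__init__()
        self.cr_low = CR(c_low, c_mid)
        self.cr_merge = nn.Sequential(CR(c_mid + c_high, c_mid), CR(c_mid, c_mid))
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(c_mid, c_mid, 2, stride=2), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(c_mid, c_mid, 2, stride=2), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(c_mid, c_img, 2, stride=2),
        )

    def forward(self, feat_low, feat_high):
        low = self.cr_low(feat_low)
        high = nn.functional.interpolate(feat_high, size=low.shape[-2:], mode="bilinear", align_corners=False)
        merged = self.cr_merge(torch.cat((low, high), dim=1))
        # Used only during training; the whole branch is dropped at inference.
        return self.decoder(merged)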
D. Loss Function

The overall loss of our network consists of two components: the detection loss L_o and the SR construction loss L_s, which can be expressed as

L_total = c_1 L_o + c_2 L_s (7)

where c_1 and c_2 are coefficients that balance the two training tasks. The L1 loss (rather than the L2 loss) [44] is used to calculate the SR construction loss L_s between the input image X and the SR result S, whose expression is written as

L_s = ||S − X||_1. (8)

The detection loss involves three components [19]: the loss of judging whether there is an object L_obj, the loss of object location L_loc, and the loss of object classification L_cls, which are used to evaluate the loss of the prediction as

L_o = λ_loc Σ_{l=0}^{2} a_l L_loc + λ_obj Σ_{l=0}^{2} b_l L_obj + λ_cls Σ_{l=0}^{2} c_l L_cls (9)

where l represents the layer of the output in the head; a_l, b_l, and c_l are the weights of the different layers for the three loss functions; and the weights λ_loc, λ_obj, and λ_cls regulate the error emphasis among box coordinates, box dimensions, objectness, no-objectness, and classification.
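Assuming three output layers as in (9), the total training objective can be sketched as follows. This is a schematic combination under stated assumptions: the individual YOLOv5 loss terms are represented by placeholder callables, and all layer weights and lambdas default to illustrative values of 1.0 rather than the values actually used in training.

import torch

def sr_loss(sr_out, hr_target):
    # Eq. (8): L1 reconstruction loss between the SR output and the HR input image.
    return torch.mean(torch.abs(sr_out - hr_target))

def detection_loss(preds, targets, loc_fn, obj_fn, cls_fn,
                   a=(1.0, 1.0, 1.0), b=(1.0, 1.0, 1.0), c=(1.0, 1.0, 1.0),
                   lam_loc=1.0, lam_obj=1.0, lam_cls=1.0):
    # Eq. (9): weighted sum of localization, objectness, and classification losses
    # over the three detection layers.
    loss = 0.0
    for l, (p, t) in enumerate(zip(preds, targets)):
        loss = loss + lam_loc * a[l] * loc_fn(p, t)
        loss = loss + lam_obj * b[l] * obj_fn(p, t)
        loss = loss + lam_cls * c[l] * cls_fn(p, t)
    return loss

def total_loss(det_loss, s_loss, c1=1.0, c2=1.0):
    # Eq. (7): balance the detection and SR reconstruction objectives.
    return c1 * det_loss + c2 * s_loss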
Fig. 6. Feature-level visualization of backbone for YOLOv5s, YOLOv5x, and SuperYOLO with the same input: (a) RGB input, (b) IR input,
(c)–(e) features of YOLOv5s, (f)–(h) features of YOLOv5x, and (i)–(k) features of SuperYOLO. The features are upsampled to the same scale as the
input image for comparison. (c), (f), and (i) Features in the first layer. (d), (g), and (j) Low-level features. (e), (h), and (k) High-level features in layers at the
same depth.
V. EXPERIMENTAL RESULTS

A. Dataset

The popular Vehicle Detection in Aerial Imagery (VEDAI) dataset [45] is used in the experiments, which contains cropped images obtained from the much larger Utah Automated Geographic Reference Center (AGRC) dataset. Each image collected from the same altitude in AGRC has approximately 16 000 × 16 000 pixels, with a resolution of about 12.5 cm × 12.5 cm per pixel. RGB and IR are the two modalities for each image of the same scene. The VEDAI dataset consists of 1246 smaller images that cover diverse backgrounds involving grass, highway, mountains, and urban areas. All images are of size 1024 × 1024 or 512 × 512. The task is to detect 11 classes of different vehicles, such as car, pickup, camping, and truck.

B. Implementation Details

Our proposed framework is implemented in PyTorch and runs on a workstation with an NVIDIA 3090 GPU. The VEDAI dataset is used to train our SuperYOLO. Following [27], the VEDAI dataset is devised for tenfold cross-validation. In each split, 1089 images are used for training, and another 121 images are used for testing. The ablation experiments are conducted on the first fold of data, while the comparisons with previous methods are performed on the ten folds by averaging their results. The annotations for each object in an image contain the coordinates of the bounding box center, the orientation of the object with respect to the positive x-axis, the four corners of the bounding box, the class ID, a binary flag identifying whether an object is occluded, and another binary flag identifying whether an object is cropped. We do not consider classes with fewer than 50 instances in the dataset, such as plane, motorcycle, and bus. Thus, the annotations of the VEDAI dataset are converted to YOLOv5 format, and we transfer the IDs of the classes of interest to 0, 1, . . . , 7, i.e., N = 8. Then, the center coordinates of the bounding box are normalized, and the absolute coordinates are transformed into relative coordinates. Similarly, the length and width of the bounding box are normalized to [0, 1]. To realize the SR assisted branch, the input images of the network are downsampled from 1024 × 1024 to 512 × 512 during the training process. In the test process, the image size is 512 × 512, which is consistent with the input of the other compared algorithms. In addition, data are augmented with hue saturation value (HSV), multiscale, translation, left-right flip, and mosaic. The augmentation strategy is canceled in the test stage. The standard stochastic gradient descent (SGD) [46] is used to train the network with a momentum of 0.937, a weight decay of 0.0005 for the Nesterov accelerated gradients utilized, and a batch size of 2. The learning rate is set to 0.01 initially. The entire training process involves 300 epochs.
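The conversion to YOLO-format labels described above amounts to expressing each box by its class ID and normalized center and size; a small sketch of that normalization is given below. The helper function and its argument order are illustrative, not the actual conversion script.

def to_yolo_label(class_id, cx, cy, box_w, box_h, img_w=1024, img_h=1024):
    # Normalize absolute center coordinates and box size to [0, 1],
    # as expected by YOLOv5-style label files: "class cx cy w h".
    return (
        class_id,
        cx / img_w,
        cy / img_h,
        box_w / img_w,
        box_h / img_h,
    )

# Example: a 40 x 20-pixel car centered at (512, 256) in a 1024 x 1024 image.
print(to_yolo_label(0, 512, 256, 40, 20))  # (0, 0.5, 0.25, 0.0390625, 0.01953125)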
C. Accuracy Metrics

The accuracy assessment measures the agreements and differences between the detection result and the reference mask. The recall, precision, and mean average precision (mAP) are used as accuracy metrics to evaluate the performance of the compared methods. The precision and recall metrics are defined as

Precision = TP / (TP + FP) (10)

Recall = TP / (TP + FN) (11)

where true positives (TP) and true negatives (TN) denote correct predictions, and false positives (FP) and false negatives (FN) denote incorrect outcomes.
TABLE I
COMPARISON RESULTS OF MODEL SIZE AND INFERENCE ABILITY IN DIFFERENT BASELINE YOLO FRAMEWORKS ON THE FIRST FOLD OF THE VEDAI VALIDATION SET

TABLE II
INFLUENCE OF REMOVING THE FOCUS MODULE IN THE NETWORK ON THE FIRST FOLD OF THE VEDAI VALIDATION SET

TABLE III
COMPARISON RESULT OF PIXEL- AND FEATURE-LEVEL FUSIONS IN YOLOV5S (NO FOCUS) FOR THE MULTIMODAL DATASET ON THE FIRST FOLD OF THE VEDAI VALIDATION SET

The precision and recall are correlated with the commission and omission errors, respectively. The mAP is a comprehensive indicator obtained by averaging AP values, which uses an integral method to calculate the area enclosed by the precision–recall curve and the coordinate axes over all categories. Hence, the mAP can be calculated by

mAP = (Σ AP) / N = (Σ ∫_0^1 p(r) dr) / N (12)

where p denotes the precision, r denotes the recall, and N is the number of categories.

The giga floating-point operations per second (GFLOPs) and the parameter size are used to measure the model complexity and computation cost. In addition, PSNR and SSIM are used for image quality evaluation of the SR branch. Generally, higher PSNR and SSIM values represent better quality of the generated image.
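The metrics in (10)–(12) can be computed directly from the TP/FP/FN counts and the precision–recall curve; a minimal sketch is shown below, using trapezoidal numerical integration of the PR curve in place of the exact interpolation used by the evaluation toolkit.

import numpy as np

def precision_recall(tp, fp, fn):
    # Eqs. (10) and (11).
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(recalls, precisions):
    # Eq. (12), per class: area under the precision-recall curve,
    # approximated here with trapezoidal integration over recall.
    order = np.argsort(recalls)
    return np.trapz(np.asarray(precisions)[order], np.asarray(recalls)[order])

def mean_average_precision(ap_per_class):
    # Eq. (12): average the per-class AP values over the N categories.
    return float(np.mean(ap_per_class))

print(precision_recall(tp=80, fp=20, fn=10))  # (0.8, 0.888...)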
D. Ablation Study

First, we verify the effectiveness of our proposed method by designing a series of ablation experiments conducted on the first fold of the validation set.

1) Validation of the Baseline Framework: In Table I, the model size and inference ability of different base frameworks are evaluated in terms of the number of layers, parameter size, and GFLOPs. The detection performance of these models is measured by mAP50, i.e., the detection metric of mAP at an intersection over union (IOU) of 0.5. Although YOLOv4 achieves the best detection performance, it has 169 more layers than YOLOv5s (393 versus 224), its parameter size (params) is 7.4 times larger than that of YOLOv5s (52.5M versus 7.1M), and its GFLOPs are 7.2 times higher than those of YOLOv5s (38.2 versus 5.3). With respect to YOLOv5s, although its mAP is slightly lower than those of YOLOv4 and YOLOv5m, its number of layers, parameter size, and GFLOPs are much smaller than those of the other models. Therefore, it is easier to deploy YOLOv5s on board to achieve real-time performance in practical applications. The above facts verify the rationality of YOLOv5s as the baseline detection framework.

2) Impact of Removing the Focus Module: As presented in Section IV-A, the Focus module reduces the resolution of input images, which encumbers the detection performance of small objects in RSI. To investigate the influence of the Focus module, we conduct experiments on the four YOLOv5 network frameworks: YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. Note that the results here are collected after the concatenation pixel-level fusion of the RGB and IR modalities. As listed in Table II, after removing the Focus module, we observe a noticeable improvement in the detection performance of YOLOv5s (62.2%→69.5% in mAP50), YOLOv5m (64.5%→72.2%), YOLOv5l (63.7%→72.5%), and YOLOv5x (64.0%→69.2%). This is because, by removing the Focus module, not only can the resolution degradation be avoided, but the spatial interval information can also be retained for small objects in RSI, thereby reducing the missing errors of object detection. Generally, removing the Focus module brings more than a 5% improvement in the detection performance (mAP50) of the whole frameworks.

Meanwhile, we notice that this removal increases the inference computation cost (GFLOPs) in YOLOv5s (5.3→20.4), YOLOv5m (16.1→63.6), YOLOv5l (36.7→145), and YOLOv5x (69.7→276.6). However, the GFLOPs of YOLOv5s-noFocus (20.4) are smaller than those of YOLOv3 (52.8), YOLOv4 (38.2), and YOLOrs (46.4), as shown in Table I. The parameters of these models are slightly reduced after removing the Focus module. In summary, in order to retain resolution and better detect smaller objects, priority shall be given to detection accuracy, for which a convolution operation is adopted to replace the Focus module.

3) Comparison of Different Fusion Methods: To evaluate the influence of the devised fusion methods, we compare five fusion results on YOLOv5s-noFocus, as presented in Table III.
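For reference, the parameter sizes reported throughout these comparisons can be obtained directly from a model definition; a one-line helper of this kind is a common sketch (not part of the paper's tooling).

def count_parameters(model):
    # Total number of learnable parameters (reported in millions in the tables).
    return sum(p.numel() for p in model.parameters() if p.requires_grad)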
TABLE IV
INFLUENCE OF DIFFERENT RESOLUTIONS FOR THE INPUT IMAGE ON NETWORK PERFORMANCE ON THE FIRST FOLD OF THE VEDAI VALIDATION SET

TABLE V
ABLATION EXPERIMENT RESULTS ABOUT THE INFLUENCE OF THE SR BRANCH ON DETECTION PERFORMANCE ON THE FIRST FOLD OF THE VEDAI VALIDATION SET

TABLE VI
EFFECTIVE VALIDATION OF THE SR BRANCH FOR THE DIFFERENT BASELINES ON THE FIRST FOLD OF THE VEDAI VALIDATION SET

does not require a lot of manpower to refine the design of the detection network. The SR branch is general and extensible and can be utilized in the existing FCN framework.

Fig. 8. Visual results of object detection using different methods involving YOLOv4, YOLOv5s, YOLOv5m, and the proposed SuperYOLO. The red circles represent the false alarms, the yellow ones denote the FP detection results, and the blue ones are FN detection results. (a)–(e) Different images in the VEDAI dataset.
module to slim the input image, but results in poor detection performance, especially for small objects. SuperYOLO performs 18.30% better in mAP50 than YOLOv5s. Our proposed SuperYOLO shows a favorable speed–accuracy tradeoff compared with the state-of-the-art models.

F. Generalization to Single-Modal Remote Sensing Images

At present, although there are massive multimodal images in remote sensing, labeled datasets for object detection tasks are lacking due to the expensive cost of manual annotation. To validate the generalization of our proposed network, we compare SuperYOLO with different one- or two-stage methods using data from a single modality, including the large-scale Dataset for Object Detection in Aerial images (DOTA), the object DetectIon in Optical Remote sensing images (DIOR) dataset, and the Northwestern Polytechnical University Very-High-Resolution 10-class (NWPU VHR-10) dataset.

1) DOTA: The DOTA dataset was proposed in 2018 for object detection in remote sensing. It contains 2806 large images and 188 282 instances, which are divided into 15 categories. The size of each original image is 4000 × 4000, and the images are cropped into 1024 × 1024 pixels with an overlap of 200 pixels in the experiment. We select half of the original images as the training set, 1/6 as the validation set, and 1/3 as the testing set. The size of the image is fixed to 512 × 512.
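The DOTA preprocessing described above (cropping large scenes into fixed-size tiles with overlap) can be sketched as follows; the 1024-pixel tile size and 200-pixel overlap follow the text, while the helper function itself is illustrative rather than the actual preprocessing script.

def tile_image(img_h, img_w, tile=1024, overlap=200):
    # Return the top-left corners of overlapping crops covering an img_h x img_w scene.
    stride = tile - overlap
    ys = list(range(0, max(img_h - tile, 0) + 1, stride))
    xs = list(range(0, max(img_w - tile, 0) + 1, stride))
    # Make sure the right/bottom borders are covered.
    if ys[-1] + tile < img_h:
        ys.append(img_h - tile)
    if xs[-1] + tile < img_w:
        xs.append(img_w - tile)
    return [(y, x) for y in ys for x in xs]

print(len(tile_image(4000, 4000)))  # 25 crops for a 4000 x 4000 DOTA image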
TABLE VII
CLASSWISE AVERAGE PRECISION (AP), MEAN AVERAGE PRECISION (mAP50), PARAMETERS, AND GFLOPS FOR THE PROPOSED SUPERYOLO, YOLOV3, YOLOV4, YOLOV5S-X, YOLORS, YOLO-FINE, AND YOLOFUSION, INCLUDING UNIMODAL AND MULTIMODAL CONFIGURATIONS, ON THE VEDAI DATASET. * REPRESENTS USING PRETRAINED WEIGHTS
2) NWPU VHR-10: The NWPU VHR-10 dataset was proposed in 2016. It contains 800 images, of which 650 contain objects, so we use 520 images as the training set and 130 images as the testing set. The dataset contains ten categories, and the size of the image is fixed to 512 × 512.

3) DIOR: The DIOR dataset was proposed in 2020 for the task of object detection and involves 23 463 images and 192 472 instances. The size of each image is 800 × 800. We choose 11 725 images as the training set and 11 738 images as the testing set. The size of the image is fixed to 512 × 512.

The training strategy is modified to accommodate the new datasets. The entire training process involves 150 epochs for the NWPU and DIOR datasets and 100 epochs for DOTA. The batch size is 16 for DOTA and DIOR and 8 for NWPU. To verify the superiority of the SuperYOLO proposed in this article, we selected 11 generic methods for comparison: one-stage algorithms (YOLOv3 [47], FCOS [53], ATSS [54], RetinaNet [51], and GFL [52]); a two-stage method (Faster R-CNN [5]); lightweight models (MobileNetV2 [55] and ShuffleNet [56]); a distillation-based method (ARSD [59]); and remote sensing designed approaches (FMSSD [58] and O2DNet [57]).

As presented in Table VIII, our SuperYOLO achieves the optimal detection results (69.99%, 93.30%, and 71.82% mAP50), and its model parameters (7.70M, 7.68M, and 7.70M) and GFLOPs (20.89, 20.86, and 20.93) are much smaller than those of other SOTA detectors, regardless of whether they are two-stage, one-stage, lightweight, or distillation-based methods. The PANet structure and three detectors are responsible for enhancing small-, middle-, and large-scale target detection in consideration of the big objects, such as playgrounds, in these three datasets. Hence, the model parameters of SuperYOLO are larger than those in Table VII. We also compare two detectors designed for RSI, FMSSD [58] and O2DNet [57].
TABLE VIII
PERFORMANCE OF DIFFERENT ALGORITHMS ON THE DOTA, NWPU, AND DIOR TESTING SETS
Although these models have performance close to that of our lightweight model, their much larger parameter sizes and GFLOPs impose a massive cost in computation resources. Hence, our model achieves a better balance between detection efficiency and efficacy.

VI. CONCLUSION AND FUTURE WORK

In this article, we have presented SuperYOLO, a real-time lightweight network that is built on top of the widely used YOLOv5s to improve the detection performance of small objects in RSI. First, we have modified the baseline network by removing the Focus module to avoid resolution degradation, through which the baseline is significantly improved and the missing error of small objects is overcome. Second, we have conducted research on the fusion of multimodalities to improve the detection performance based on mutual information. Lastly and most importantly, we have introduced a simple and flexible SR branch that facilitates the backbone in constructing HR representation features, by which small objects can be easily recognized from vast backgrounds with merely LR input required. We remove the SR branch in the inference stage, accomplishing the detection without changing the original structure of the network and thus keeping the same GFLOPs. With the joint contributions of these ideas, the proposed SuperYOLO achieves 75.09% mAP50 with lower computation cost on the VEDAI dataset, which is 18.30% higher than that of YOLOv5s and more than 12.44% higher than that of YOLOv5x.

The performance and inference ability of our proposal highlight the value of SR in remote sensing tasks, paving the way for future studies of multimodal object detection. Our future interests will focus on the design of a low-parameter model to extract HR features, thereby further satisfying real-time and high-accuracy requirements.

REFERENCES

[1] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 580–587.
[2] R. Girshick, “Fast R-CNN,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1440–1448.
[3] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 779–788.
[4] P. Tang, X. Wang, X. Bai, and W. Liu, “Multiple instance detection network with online instance classifier refinement,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 3059–3067.
[5] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, Jun. 2017.
[6] D. Jia, D. Wei, R. Socher, J. Lili, K. Li, and F. Li, “ImageNet: A large-scale hierarchical image database,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2009, pp. 248–255.
[7] T.-Y. Lin et al., “Microsoft COCO: Common objects in context,” in Proc. Eur. Conf. Comput. Vis., Sep. 2014, pp. 740–755.
[8] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The PASCAL visual object classes (VOC) challenge,” Int. J. Comput. Vis., vol. 88, no. 2, pp. 303–338, Jun. 2010.
[9] Z. Zheng, Y. Zhong, J. Wang, and A. Ma, “Foreground-aware relation network for geospatial object segmentation in high spatial resolution remote sensing imagery,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 4096–4105.
[10] J. Pang, C. Li, J. Shi, Z. Xu, and H. Feng, “R²-CNN: Fast tiny object detection in large-scale remote sensing images,” IEEE Trans. Geosci. Remote Sens., vol. 57, no. 8, pp. 5512–5524, Aug. 2019.
[11] Z. Deng, H. Sun, S. Zhou, J. Zhao, L. Lei, and H. Zou, “Multi-scale object detection in remote sensing imagery with convolutional neural networks,” ISPRS J. Photogramm. Remote Sens., vol. 145, pp. 3–22, Apr. 2018.
[12] J. Ding, N. Xue, Y. Long, G.-S. Xia, and Q. Lu, “Learning RoI transformer for oriented object detection in aerial images,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 2844–2853.
[13] Z. Liu, H. Wang, H. Weng, and L. Yang, “Ship rotated bounding box space for ship extraction from high-resolution optical satellite images with complex backgrounds,” IEEE Geosci. Remote Sens. Lett., vol. 13, no. 8, pp. 1074–1078, Aug. 2016.
[14] D. Hong et al., “More diverse means better: Multimodal deep learning meets remote-sensing imagery classification,” IEEE Trans. Geosci. Remote Sens., vol. 59, no. 5, pp. 4340–4354, May 2021.
[15] Z. Wang, K. Jiang, P. Yi, Z. Han, and Z. He, “Ultra-dense GAN for satellite imagery super-resolution,” Neurocomputing, vol. 398, pp. 328–337, Jul. 2020.
[16] M. T. Razzak, G. Mateo-García, G. Lecuyer, L. Gómez-Chova, Y. Gal, and F. Kalaitzis, “Multi-spectral multi-image super-resolution of Sentinel-2 with radiometric consistency losses and its effect on building delineation,” ISPRS J. Photogramm. Remote Sens., vol. 195, pp. 1–13, Jan. 2023.
[17] K. Jiang, Z. Wang, P. Yi, G. Wang, T. Lu, and J. Jiang, “Edge-enhanced GAN for remote sensing image superresolution,” IEEE Trans. Geosci. Remote Sens., vol. 57, no. 8, pp. 5799–5812, Jun. 2019.
[18] Y. Xiao, X. Su, Q. Yuan, D. Liu, H. Shen, and L. Zhang, “Satellite video super-resolution via multiscale deformable convolution alignment and temporal grouping projection,” IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5610819.
[19] (2021). Ultralytics/Yolov5: v5.0. [Online]. Available: https://github.com/ultralytics/yolov5
[20] S. Zhang, M. Chen, J. Chen, F. Zou, Y.-F. Li, and P. Lu, “Multimodal feature-wise co-attention method for visual question answering,” Inf. Fusion, vol. 73, pp. 1–10, Sep. 2021.
[21] Y.-T. Chen, J. Shi, Z. Ye, C. Mertz, D. Ramanan, and S. Kong, “Multimodal object detection via probabilistic ensembling,” 2021, arXiv:2104.02904.
[22] Q. Chen et al., “EF-Net: A novel enhancement and fusion network for RGB-D saliency detection,” Pattern Recognit., vol. 112, Apr. 2021, Art. no. 107740.
[23] H. Zhu, M. Ma, W. Ma, L. Jiao, and B. Hou, “A spatial-channel progressive fusion ResNet for remote sensing classification,” Inf. Fusion, vol. 70, no. 1, pp. 72–87, 2020.
[24] Y. Sun, Z. Fu, C. Sun, Y. Hu, and S. Zhang, “Deep multimodal fusion network for semantic segmentation using remote sensing image and LiDAR data,” IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5404418.
[25] W. Li, Y. Gao, M. Zhang, R. Tao, and Q. Du, “Asymmetric feature fusion network for hyperspectral and SAR image classification,” IEEE Trans. Neural Netw. Learn. Syst., early access, Feb. 18, 2022, doi: 10.1109/TNNLS.2022.3149394.
[26] Y. Gao et al., “Hyperspectral and multispectral classification for coastal wetland using depthwise feature interaction network,” IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5512615.
[27] M. Sharma et al., “YOLOrs: Object detection in multimodal remote sensing imagery,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 14, pp. 1497–1508, 2021.
[28] L. Gómez-Chova, D. Tuia, G. Moser, and G. Camps-Valls, “Multimodal classification of remote sensing images: A review and future directions,” Proc. IEEE, vol. 103, no. 9, pp. 1560–1584, Sep. 2015.
[29] T.-Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2117–2125.
[30] C. Li, T. Yang, S. Zhu, C. Chen, and S. Guan, “Density map guided object detection in aerial images,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2020, pp. 190–191.
[31] C. Chen, M.-Y. Liu, O. Tuzel, and J. Xiao, “R-CNN for small object detection,” in Proc. Asian Conf. Comput. Vis. Cham, Switzerland: Springer, 2017, pp. 214–230.
[32] J. Noh, W. Bae, W. Lee, J. Seo, and G. Kim, “Better to follow, follow to be better: Towards precise supervision of feature super-resolution for small object detection,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 9725–9734.
[33] M. Haris, G. Shakhnarovich, and N. Ukita, “Task-driven super resolution: Object detection in low-resolution images,” 2018, arXiv:1803.11316.
[34] J. Shermeyer and A. Van Etten, “The effects of super-resolution on object detection performance in satellite imagery,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2019, pp. 1432–1441.
[35] L. Courtrai, M.-T. Pham, and S. Lefèvre, “Small object detection in remote sensing images based on super-resolution with auxiliary generative adversarial networks,” Remote Sens., vol. 12, no. 19, p. 3152, Sep. 2020.
[36] J. Rabbi, N. Ray, M. Schubert, S. Chowdhury, and D. Chao, “Small-object detection in remote sensing images with end-to-end edge-enhanced GAN and object detector network,” Remote Sens., vol. 12, no. 9, p. 1432, May 2020.
[37] H. Ji, Z. Gao, T. Mei, and B. Ramesh, “Vehicle detection in remote sensing images leveraging on simultaneous super-resolution,” IEEE Geosci. Remote Sens. Lett., vol. 17, no. 4, pp. 676–680, Apr. 2020.
[38] L. Wang, D. Li, Y. Zhu, L. Tian, and Y. Shan, “Dual super-resolution learning for semantic segmentation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 3773–3782.
[39] C.-Y. Wang, H.-Y. Mark Liao, Y.-H. Wu, P.-Y. Chen, J.-W. Hsieh, and I.-H. Yeh, “CSPNet: A new backbone that can enhance learning capability of CNN,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2020, pp. 1571–1580.
[40] S. Elfwing, E. Uchibe, and K. Doya, “Sigmoid-weighted linear units for neural network function approximation in reinforcement learning,” Neural Netw., vol. 107, pp. 3–11, Nov. 2018.
[41] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 9, pp. 1904–1916, Sep. 2014.
[42] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 7132–7141.
[43] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee, “Enhanced deep residual networks for single image super-resolution,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jul. 2017, pp. 136–144.
[44] H. Zhao, O. Gallo, I. Frosio, and J. Kautz, “Loss functions for image restoration with neural networks,” IEEE Trans. Comput. Imag., vol. 3, no. 1, pp. 47–57, Mar. 2017.
[45] S. Razakarivony and F. Jurie, “Vehicle detection in aerial imagery: A small target detection benchmark,” J. Vis. Commun. Image Represent., vol. 34, pp. 187–203, Jan. 2016.
[46] L. Bottou, “Large-scale machine learning with stochastic gradient descent,” in Proc. 19th Int. Conf. Comput. Statist., 2010, pp. 177–186.
[47] J. Redmon and A. Farhadi, “YOLOv3: An incremental improvement,” 2018, arXiv:1804.02767.
[48] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, “YOLOv4: Optimal speed and accuracy of object detection,” 2020, arXiv:2004.10934.
[49] M.-T. Pham, L. Courtrai, C. Friguet, S. Lefèvre, and A. Baussard, “YOLO-fine: One-stage detector of small objects under various backgrounds in remote sensing images,” Remote Sens., vol. 12, no. 15, p. 2501, Aug. 2020.
[50] F. Qingyun and W. Zhaokui, “Cross-modality attentive feature fusion for object detection in multispectral remote sensing imagery,” Pattern Recognit., vol. 130, Oct. 2022, Art. no. 108786.
[51] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar, “Focal loss for dense object detection,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2980–2988.
[52] X. Li et al., “Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection,” in Proc. Adv. Neural Inf. Process. Syst., vol. 33, 2020, pp. 21002–21012.
[53] Z. Tian, C. Shen, H. Chen, and T. He, “FCOS: Fully convolutional one-stage object detection,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 9627–9636.
[54] S. Zhang, C. Chi, Y. Yao, Z. Lei, and S. Z. Li, “Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 9759–9768.
[55] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “MobileNetV2: Inverted residuals and linear bottlenecks,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 4510–4520.
[56] X. Zhang, X. Zhou, M. Lin, and J. Sun, “ShuffleNet: An extremely efficient convolutional neural network for mobile devices,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 6848–6856.
[57] H. Wei, Y. Zhang, Z. Chang, H. Li, H. Wang, and X. Sun, “Oriented objects as pairs of middle lines,” ISPRS J. Photogramm. Remote Sens., vol. 169, pp. 268–279, Nov. 2020.
[58] P. Wang, X. Sun, W. Diao, and K. Fu, “FMSSD: Feature-merged single-shot detection for multiscale objects in large-scale remote sensing imagery,” IEEE Trans. Geosci. Remote Sens., vol. 58, no. 5, pp. 3377–3390, Dec. 2020.
[59] Y. Yang et al., “Adaptive knowledge distillation for lightweight remote sensing object detectors optimizing,” IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5623715.

Jiaqing Zhang received the B.E. degree in telecommunications engineering from Ningbo University, Ningbo, Zhejiang, China, in 2019. She is currently pursuing the Ph.D. degree with the Image Coding and Processing Center, State Key Laboratory of Integrated Services Networks, Xidian University, Xi’an, China. Her research interests include multimodal image processing, remote sensing object detection, and network compression.
Jie Lei (Member, IEEE) received the M.S. degree in telecommunication and information systems and the Ph.D. degree in signal and information processing from Xidian University, Xi’an, China, in 2006 and 2010, respectively. He was a Visiting Scholar with the Department of Computer Science, University of California at Los Angeles, Los Angeles, CA, USA, from 2014 to 2015. He is currently a Professor with the School of Telecommunications Engineering, Xidian University, where he is also a member of the Image Coding and Processing Center, State Key Laboratory of Integrated Services Networks. He is also with the Science and Technology on Electro-Optic Control Laboratory, Luoyang, China. His research interests focus on image and video processing, computer vision, and customized computing for big-data applications.

Yunsong Li (Member, IEEE) received the M.S. degree in telecommunication and information systems and the Ph.D. degree in signal and information processing from Xidian University, Xi’an, China, in 1999 and 2002, respectively. He joined the School of Telecommunications Engineering, Xidian University, in 1999, where he is currently a Professor. He is currently the Director of the Image Coding and Processing Center, State Key Laboratory of Integrated Services Networks, Xidian University. His research interests focus on image and video processing and high-performance computing.