Single-Stream CNN With Learnable Architecture For Multisource Remote Sensing Data
Abstract— In this article, we propose an efficient and generalizable framework based on a deep convolutional neural network (CNN) for the joint classification of multisource remote sensing (RS) data. While recent methods are mostly based on multistream architectures, we use group convolution (GConv) to construct equivalent network architectures efficiently within a single-stream network. Based on a recent technique called dynamic grouping convolution (DGConv), we further propose a network module named separable DGConv (SepDGConv) that makes the GConv hyperparameters, and thus the overall network architecture, learnable during network training. In the experiments, the proposed method is applied to the residual network (ResNet) and UNet, and the adjusted networks are verified on three very diverse benchmark datasets (i.e., the Houston2018 data, the Berlin data, and the MUUFL Gulfport Hyperspectral and LiDAR Airborne Data Set (MUUFL) data). Experimental results demonstrate the effectiveness of the proposed single-stream CNNs; in particular, SepG-ResNet18 improves the state-of-the-art overall classification accuracy (OA) on the hyperspectral–synthetic aperture radar (HS–SAR) Berlin dataset from 62.23% to 68.21%. The experiments yield two interesting findings. First, using DGConv generally reduces the test OA variance. Second, a multistream architecture is harmful to model performance if imposed on the first few layers, but becomes beneficial if applied to deeper layers. Altogether, these findings imply that the multistream architecture, instead of being a strictly necessary component of deep learning models for multisource RS data, essentially plays the role of a model regularizer. Our code is publicly available at https://github.com/yyyyangyi/CNNs-for-Multi-Source-Remote-Sensing-Data-Fusion. We hope our work can inspire novel research in the future.

Index Terms— Classification, convolutional neural networks (CNNs), dynamic grouping convolution (DGConv), multisource remote sensing (RS) data, network architecture, segmentation.

Manuscript received August 31, 2021; revised January 3, 2022 and March 10, 2022; accepted April 6, 2022. Date of publication April 21, 2022; date of current version May 4, 2022. This work was supported by the National Key Research and Development Program of China under Grant 2018YFB0505300. (Corresponding author: Yi Yang.) Yi Yang and Fuhu Ren are with the Center for Data Science, Peking University, Beijing 100871, China (e-mail: [email protected]; [email protected]). Daoye Zhu is with the Center for Data Science, Peking University, Beijing 100871, China, and also with the Laboratory of Interdisciplinary Spatial Analysis, University of Cambridge, Cambridge CB3 9EP, U.K. (e-mail: [email protected]). Tengteng Qu, Qiangyu Wang, and Chengqi Cheng are with the College of Engineering, Peking University, Beijing 100871, China (e-mail: [email protected]; [email protected]; [email protected]). Digital Object Identifier 10.1109/TGRS.2022.3169163

I. INTRODUCTION

REMOTE sensing (RS) plays an important role in Earth observation and supports applications such as environmental monitoring [1], precision agriculture [2], and so on. One fundamental yet challenging task in RS is land-use/land-cover (LULC) classification, which aims to assign one semantic category to each pixel of an RS image acquired over some region of interest.

Nowadays, diverse sensor technologies make it possible to measure different aspects of scenes and objects from the air, including sensors for multispectral (MS) optical imaging, hyperspectral (HS) imaging, synthetic aperture radar (SAR), and light detection and ranging (LiDAR). Different sensors bring diverse and complementary information [3]. For example, MS optical imagery contains spatial information, such as object shape and spatial relationships. HS data provide detailed spectral information on LULC and ground objects. While HS imagery cannot be used to differentiate objects composed of the same material, such as roofs and roads both made of concrete, LiDAR data capture the elevation distribution and can thus be used to distinguish roofs from roads. SAR data can provide additional structural information about Earth's surface. The availability of multisource, multimodal RS data makes it possible to integrate rich information to improve LULC classification performance, and considerable effort has been invested in recent years into the joint analysis of multisource RS data for LULC.

A. Related Work

Conventionally, a multisource RS data analysis workflow contains two phases: a feature extraction phase and a feature fusion phase. In the feature extraction phase, different feature extractors are applied to different data modalities, while in the feature fusion phase, the high-level features obtained from the previous phase are fused by certain algorithms and fed to LULC classifiers. For example, both [4] and [5] extract morphological attribute profiles from HS and LiDAR data and use feature stacking as the fusion technique. In [6], morphological extinction profiles are extracted separately from HS and LiDAR data, and the features are further fused using orthogonal total variation component analysis (OTVCA). Gu et al. [7] extracted manually engineered features from MS and LiDAR data and used multiple kernel learning as a fusion strategy to train a support vector machine classifier. In [8], MAPPER [9] is used as a feature extractor for MS optical and SAR data, and the features are further fused with manifold alignment. Note that, in the literature, data fusion also refers to a data processing technique that integrates multisource data into one data modality, while, in our paper, we use this term
to express the same meaning as "joint classification/analysis" of multisource RS data.

Meanwhile, deep learning (DL) [10], one of the most notable recent advances in computer vision (CV), has attracted attention from both the CV and RS communities. In particular, the convolutional neural network (CNN) [11] is a DL-based model that significantly outperforms traditional methods in image classification and segmentation. Most RS data are also presented in the form of images, and CNNs have shown remarkable success in analyzing MS [12]–[14], HS [15]–[17], LiDAR [18], [19], and SAR data [20], [21].

Besides being applied to individual data sources, CNNs are adopted as backbone models for multisource RS data classification in many recent works. As an early attempt, [22] designs a two-branch CNN for the joint analysis of HS-LiDAR data, with one branch for each sensor, achieving promising classification accuracy. Xu et al. [23] proposed another two-branch CNN for HS-LiDAR data, with a different design of the HS feature extraction branch. In [24], a three-branch CNN is proposed to fuse MS, HS, and LiDAR data. Hong et al. [25] further extended the scope of deep multibranch networks by allowing either a CNN or a fully connected neural network to be a feature extraction branch. See Section II-A for a formal definition of a network branch.

Based on the multibranch architecture, some of the latest papers devote themselves to further improving model performance by introducing various novel modules to the network. Hong et al. [26] used self-adversarial modules, interactive learning modules, and label propagation modules to build a deep CNN for semisupervised multimodal learning. In [27], Gram matrices are utilized to improve the preservation of multisource complementary information in a two-branch CNN for HS and LiDAR data fusion. In [28], a two-branch CNN for the joint classification of HS and LiDAR is proposed, where, in the feature extraction branches, Octave convolutional layers are used to reduce feature redundancy from low-frequency data components, and, in the fusion subnetwork, fractional Gabor convolution is utilized to obtain multiscale and multidirectional spatial features.

When multistream architectures are used as mentioned earlier, the data fusion strategy is also known as the "late-fusion" scheme, because the features are kept sensor-specific until the last few layers, i.e., the fusion subnetwork. A contrasting strategy is "early fusion." As the name suggests, early fusion means that either the feature extraction branches are very shallow or there is certain information exchange between different branches, making the features no longer strictly sensor-specific. In [29], a coupled CNN is proposed to fuse HS and LiDAR data, where a weight-sharing technique is utilized in the intermediate layers of the proposed two-branch CNN, so that each feature extraction branch contains only very few separate layers. Hazirbas et al. [30] proposed a fully convolutional network named FuseNet with two branches for the semantic labeling of indoor scenes on red-green-blue-depth (RGB-D) data. To fuse features from the RGB branch and the depth branch, the authors propose a fusion block, which adds the output of each block of the depth branch to the RGB branch. In [31], a more detailed comparison between FuseNet-based early fusion and multibranch late fusion is made, and Audebert et al. [31] found that neither strategy consistently outperforms the other across different datasets. In [32], MS optical data and a digital surface model (DSM) band are jointly classified within a single-stream CNN with depth-wise convolution.

B. Challenges

It can be summarized from the abovementioned literature that, when designing a data fusion CNN, the following two principles are usually followed: 1) different branches are strictly separated, so that low-level features are sensor-specific, and 2) the number of branches is set equal or proportional to the number of data sources.

Despite the achieved success, it is far from fully understood why these empirical principles work. In particular, it may be helpful for further improving data fusion models if we can gain insight into the following two problems.

1) How to Find the Optimal Number of Branches?: Model performance and efficiency are both closely related to the number of branches, which is often treated as a hyperparameter and defined by human experts. This can very likely lead the network to a suboptimal solution, because experts cannot know the optimal setting with confidence, and, in fact, there has been no agreement on an optimal choice of these hyperparameters. On the one hand, while it is possible to find an optimal number of branches by trial for small models, for typical modern CNNs, which are very large (∼100 layers [33]), manual tuning is no longer feasible. On the other hand, in a CNN, convolution layers at different depths learn features of different semantic meanings, and it can be very difficult to find the optimal network depth at which sensor-specific features should be fused. It is, therefore, desirable that hyperparameters, such as the number of branches and the branch depth, can be found automatically.

2) Which Works—Specificity or Regularization?: A multistream CNN has fewer parameters than its dense counterpart; in the latter, there are additional parameters connecting the different branches. In the DL community, reducing model parameters is known as an effective regularization technique, which improves a model's test performance [34]. For multisource data fusion, while it is generally assumed that sensor-specific features are beneficial, the effects of regularization have not been studied in isolation.

C. Method Overview

To address the aforementioned challenges, we aim, in this article, to develop a framework that allows CNN architectures to be learned from data within a single-stream network for multisource RS data fusion.

We notice that any multistream architecture can be equivalently expressed by group convolution (GConv) within a single-stream architecture (see Section II-A). Two parameters control GConv and need to be specified for each layer: the total number of groups and the number of feature maps in each group. The recently proposed dynamic grouping convolution (DGConv) [35] enables these two parameters to be learned in an end-to-end manner via network training.
Fig. 1. Illustration of the differences among (a) the multistream architecture, (b) the GConv architecture, and (c) the proposed DGConv architecture.
Originally proposed for efficient architecture design, DGConv itself does not ensure sensor-specific features and, thus, cannot be directly used to approximate and study multistream models for multisource RS data fusion. In our paper, we propose necessary modifications to DGConv, based on which we further design CNN blocks and single-stream architectures with simultaneous feature extraction and fusion for the joint classification of multisource RS data. Fig. 1(c) illustrates such a CNN model. More specifically, the contributions of this paper can be highlighted as follows.

1) A modified DGConv module, which we name separable DGConv (SepDGConv), is proposed to automatically learn a GConv structure within single-stream neural networks. SepDGConv is theoretically compatible with any CNN architecture.

2) Based on the proposed SepDGConv module, deep single-stream CNN models are proposed with reference to typical architectures in the CV area. The proposed CNNs show promising classification performance on various benchmark multisource RS datasets.

3) Experimental results suggest that using a densely connected network to jointly extract features from multiple data modalities actually improves the final classification performance, and that using SepDGConv in deeper layers, which contain more parameters, also helps improve classification accuracy. This finding is very interesting, because it suggests that regularization contributes more to model performance improvement than sensor specificity.

4) To the best of our knowledge, this is the first time that single-stream CNNs for multisource RS data fusion are systematically studied and compared with state-of-the-art (SOTA) multibranch models.

The remainder of this article is organized as follows. Preliminaries on GConv and DGConv are given in Section II, and the proposed SepDGConv layer and CNN architectures are introduced in Section III. The experimental results and analysis are presented in Section IV. Finally, Section V summarizes the article with some important conclusions and hints at potential future research directions.

II. PRELIMINARIES

In this section, first, we show how we can use GConv to construct a single-stream CNN that is equivalent to a multistream one. Second, we briefly introduce DGConv as well as groupable networks (G-Nets), a family of architectures using DGConv.

A. GConv as Multistream Conv

A CNN branch/stream consists of a sequence of convolution/normalization/activation/pooling layers, and, in a multistream architecture, the network branches usually play the role of feature extractors. Fig. 1(a) shows a typical two-branch CNN, with HS data fed to one branch (blue) and LiDAR data to the other (yellow). The output features of the multistream feature extractors are sensor-specific, because the LiDAR data never enter the HS branch, and vice versa. These features are further fed into a fusion subnetwork, which usually consists of one single branch.
The fusion subnetwork merges the sensor-specific features and produces the final classification. Formally, a regular convolution layer with a k × k kernel computes

O(i, j) = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} \omega(m, n)\, F(i + m, j + n)    (1)

where F and O denote the input and output features and ω denotes the convolution kernel, so that every output channel depends on every input channel. In GConv, the input and output channels are divided into G groups, and (1) is applied within each group independently:

O_g(i, j) = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} \omega_g(m, n)\, F_g(i + m, j + n), \quad g = 1, \ldots, G    (2)

so that channels in different groups never interact. Aligning the group boundaries with the sensor boundaries of the stacked multisource input, therefore, reproduces a multistream architecture within a single stream, as illustrated in Fig. 1(b).
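To make the equivalence concrete, the following minimal sketch (PyTorch assumed; channel sizes are illustrative, not taken from the paper) builds a two-branch layer and a single grouped convolution with matched weights and checks that their outputs coincide.

```python
# A minimal sketch, assuming PyTorch and illustrative channel sizes: a
# two-branch layer and one grouped convolution compute identical outputs.
import torch
import torch.nn as nn

x_hs = torch.randn(1, 32, 64, 64)      # stand-in HS feature maps
x_lidar = torch.randn(1, 32, 64, 64)   # stand-in LiDAR feature maps

# Multistream: one 3x3 convolution per sensor branch.
branch_hs = nn.Conv2d(32, 32, kernel_size=3, padding=1, bias=False)
branch_lidar = nn.Conv2d(32, 32, kernel_size=3, padding=1, bias=False)

# Single stream: one GConv (groups=2) over the stacked channels.
gconv = nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=2, bias=False)
with torch.no_grad():                  # copy branch weights into the groups
    gconv.weight[:32].copy_(branch_hs.weight)
    gconv.weight[32:].copy_(branch_lidar.weight)

y_multi = torch.cat([branch_hs(x_hs), branch_lidar(x_lidar)], dim=1)
y_single = gconv(torch.cat([x_hs, x_lidar], dim=1))
print(torch.allclose(y_multi, y_single, atol=1e-6))  # True
```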
1) Definition: Formally, DGConv is defined as

O(i, j) = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} \bigl(U \odot \omega(m, n)\bigr)\, F(i + m, j + n)    (3)

where U ∈ {0, 1}^{C_out × C_in} and ⊙ denotes the element-wise product. U_i, the ith row of U, is a binary vector that indicates which input channels are involved in the computation of the ith output channel. The definition is reasonable, as many convolution operations can be regarded as special cases of DGConv. For instance, DGConv becomes regular convolution (1) if we let U be a matrix of ones, as illustrated in Fig. 2(a). DGConv becomes depth-wise convolution if we let U be an identity matrix, as illustrated in Fig. 2(b). DGConv can also represent GConv (2) if we take U to be a block-diagonal matrix of ones and zeros, as illustrated in Fig. 2(c).

2) Learning the Relationship Matrix U: While (3) is representative, such a definition results in the following two difficulties in estimating U. First, the introduction of U adds many additional parameters to the network, which makes the learning process more difficult. Second, U takes binary values of 0 and 1, and it is widely known that optimization problems involving discrete values are generally very hard to solve.

To address the first issue, U is decomposed into a set of small matrices, and learnable parameters are designed to generate this set of small matrices. Consider a simple yet quite general case, where U is a square matrix with C_in = C_out = 2^K, K being an integer. Then, a set of 2 × 2 matrices U_1, ..., U_i, ..., U_K can be defined, and U can be reconstructed as

U = U_1 \otimes \cdots \otimes U_i \otimes \cdots \otimes U_K    (4)

where ⊗ denotes the Kronecker product and i ∈ {1, ..., K}. Each small matrix U_i is further represented by a binary parameter g_i ∈ {0, 1}:

U_i = g_i \mathbf{1} + (1 - g_i) I    (5)

where 1 denotes a 2 × 2 constant matrix of ones and I denotes the 2 × 2 constant identity matrix. Thus, each 2^K × 2^K relationship matrix U can be constructed from a vector g ∈ R^K, and the number of parameters to be learned is thereby reduced exponentially.

To address the second issue, a learnable gate vector g̃, taking continuous values, is introduced to generate the binary vector g as follows:

g = \mathrm{sign}(\tilde{g})    (6)

where sign(·) represents the step function

\mathrm{sign}(x) = \begin{cases} 0, & x < 0 \\ 1, & x \ge 0 \end{cases}    (7)

Altogether, combining (4)–(7), the 2^K × 2^K binary relationship matrix U is constructed from a continuous vector g̃ of length K:

g = \mathrm{sign}(\tilde{g}), \quad U = \bigl(g_1 \mathbf{1} + (1 - g_1) I\bigr) \otimes \cdots \otimes \bigl(g_K \mathbf{1} + (1 - g_K) I\bigr).    (8)

Backpropagation through the non-differentiable sign(·) can be done with the straight-through estimator proposed for quantized neural networks [38], and automatic gradient computation for the rest is supported by most modern DL programming frameworks.

As an example, a convolution layer with C_in = C_out = 8, as shown in Fig. 2, has K = 3, and the relationship matrix U is of shape 8 × 8. U for the regular convolution in Fig. 2(a) can be expressed as U = 1 ⊗ 1 ⊗ 1, with g = (1, 1, 1). Similarly, for the depth-wise convolution in Fig. 2(b), g = (0, 0, 0) and U = I ⊗ I ⊗ I. For the GConv illustrated in Fig. 2(c), g = (1, 1, 0) and U = 1 ⊗ 1 ⊗ I. For the DGConv shown in Fig. 2(d), g = (0, 0, 1) and U = I ⊗ I ⊗ 1.
C. G-Nets

G-Nets [35] refer to architectures using DGConv. In particular, Zhang et al. [35] experimented with G-ResNet50, which is based on ResNet50 [39].

ResNet50 uses the Bottleneck as its building block. The Bottleneck block consists, in order, of one 1 × 1 convolution layer, one 3 × 3 convolution layer, and one more 1 × 1 convolution layer, as shown in Fig. 3(c). In its DGConv version, the middle 3 × 3 convolution is replaced with a 3 × 3 DGConv. G-ResNet50 consists of four Bottleneck blocks, with the number of output channels for each block being [256, 512, 1024, 2048], respectively.

III. METHOD

While DGConv enables the automatic learning of GConv hyperparameters, the learning outcome does not lead to a network with sensor-specific branches, as we will see in the following, and, thus, it cannot be directly used to approximate multibranch CNNs for multisource RS data fusion. To address this issue, based on DGConv, we propose SepDGConv and the separable G-Net (SepG-Net), which make it possible for the learned architecture to contain sensor-specific branches.

A. Blocks With SepDGConv

First, reconsider the Bottleneck block with DGConv. If sensor specificity is to be preserved, then it is necessary that every convolution layer in a block uses GConv, as in (2). In both the regular and the DGConv Bottleneck blocks, the first and last convolution layers use 1 × 1 regular convolution. Therefore, in our SepDGConv Bottleneck, we use depth-wise convolution instead of regular convolution; i.e., we set the number of groups G equal to the number of that layer's feature maps. Fig. 3(d) shows a SepDGConv Bottleneck block.

Second, consider the DoubleConv block, which is used as the building block by two very popular CNN models: ResNet18 and UNet [40]. A DoubleConv block consists of two consecutive 3 × 3 convolution layers, as shown in Fig. 3(a). In the residual network (ResNet) family, the BasicBlock has a structure very similar to that of DoubleConv, except for its additional residual connection. As long as there is no confusion, we also use DoubleConv to refer to the BasicBlock in ResNets. To impose sensor specificity, in our SepDGConv DoubleConv, we replace both convolution layers with DGConv layers, as shown in Fig. 3(b).
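As a structural illustration of the DoubleConv variant in Fig. 3(b), the sketch below uses ordinary grouped convolutions with a fixed group count where SepDGConv would learn the grouping; the class name and sizes are ours, not from the released code.

```python
# A sketch of a grouped DoubleConv/BasicBlock: both 3x3 convolutions are
# grouped; a fixed group count stands in for the learned DGConv grouping.
import torch
import torch.nn as nn

class SepDoubleConv(nn.Module):
    def __init__(self, channels, groups):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=groups, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, groups=groups, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # BasicBlock variant: residual connection around the two convolutions.
        return self.relu(self.body(x) + x)

block = SepDoubleConv(64, groups=4)
print(block(torch.randn(2, 64, 32, 32)).shape)  # torch.Size([2, 64, 32, 32])
```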
In practice, C_out often differs from C_in: C_out > C_in appears when a layer has more feature maps than its previous layer, while C_out < C_in is often used in the upsampling layers of a segmentation model. Formally, we define the expansion case, C_out/C_in = r, and the reduction case, C_in/C_out = r, with r ≥ 2 being an integer. Our strategy is to expand or reduce the shape of the relationship matrix by matrix multiplication and Kronecker products with identity matrices and vectors of ones.

In the expansion case, we first construct a matrix Ũ ∈ R^{C_in × C_in} as described in Section II-B. Recall that U(i, j) = 1 if the jth input channel is involved in the computation of the ith output channel; otherwise, U(i, j) = 0. Hence, we duplicate each row of Ũ r times and stack the copies to get U:

U = (I \otimes \mathbf{1}_r)\, \tilde{U}    (9)

The reduction case proceeds analogously, and when C_in or C_out is not a power of 2, a larger relationship matrix is constructed, of which we use only the first C_out rows and C_in columns in our computation, ignoring the remaining entries.
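A toy numerical check of the expansion rule (9), with assumed sizes C_in = 4 and r = 2, makes the row duplication explicit:

```python
# A sketch of (9): rows of U~ are replicated r times via (I (x) 1_r) U~, so
# each new output channel inherits the grouping of its source row.
import torch

C_in, r = 4, 2                                             # C_out = 8 (toy sizes)
U_tilde = torch.kron(torch.eye(2), torch.ones(2, 2))       # 4x4 two-group matrix
expander = torch.kron(torch.eye(C_in), torch.ones(r, 1))   # (I (x) 1_r), shape (8, 4)
U = expander @ U_tilde                                     # shape (C_out, C_in)
print(U.shape)                                             # torch.Size([8, 4])
```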
C. Regularization in SepDGConv

In SepDGConv, U is responsible for the regularization effect. The learned U is expected to divide the network into multiple groups and, thus, to be sparse. Recall that U is multiplied with the network parameters ω in (3); hence, a sparse U essentially wipes out a certain number of parameters, which regularizes the model by controlling model complexity. GConv can regularize the model in the same way.

Consider again the example in Fig. 2. For the GConv illustrated in Fig. 2(c), U has half of its entries equal to zero; compared with the regular convolution in Fig. 2(a), this wipes out half of the layer's parameters. The U's learned in SepDGConv tend to be even more sparse (see the experimental results in Section IV-B), which, therefore, reduces the model complexity and usually regularizes the model more than GConv.
Fig. 4. Neural network architectures using SepDGConv. (a) SepG-ResNet18. (b) SepG-ResNet50. (c) SepG-UNet. In (b), ×x means that x consecutive blocks are drawn as one in the illustration. In (c), gray lines represent the skip connections used in UNet, and the joining nodes denote concatenation along the channel axis.
D. SepG-Nets

Using the SepDGConv DoubleConv, we can build SepG-ResNet18, as shown in Fig. 4(a), and the separable group UNet (SepG-UNet), as shown in Fig. 4(c). In SepG-ResNet18, we follow convention and build the network with four layers, each having two DoubleConv blocks; the number of output channels for each layer is [64, 128, 256, 512], respectively. In SepG-UNet, we have eight DoubleConv blocks, with output channel numbers [128, 256, 512, 1024, 512, 256, 128, 64].

SepG-ResNet50 consists of four layers, composed in order of [2, 3, 5, 2] SepDGConv Bottleneck blocks. The number of output channels for each block is the same as in G-ResNet50, being [256, 512, 1024, 2048], respectively. The architecture of SepG-ResNet50 is illustrated in Fig. 4(b).
IV. EXPERIMENTS AND DISCUSSION

We experiment with SepG-ResNet18, SepG-ResNet50, and SepG-UNet on three diverse datasets, i.e., the Houston2018 dataset, the Berlin dataset, and the MUUFL Gulfport Hyperspectral and LiDAR Airborne Data Set (MUUFL) dataset, and compare with baseline models as well as SOTA models on these datasets. In this section, first, we describe the datasets. Second, we present our experimental results. Third, we analyze the role of SepDGConv in the entire model by an ablation analysis and report the changes in classification performance. Fourth, we compare different convolution strategies, in particular GConv and SepDGConv, to isolate the effect of sensor specificity. Finally, we discuss the results and findings of our experiments.

A. Datasets

1) Houston2018 Dataset: Houston2018 is an HS-LiDAR-RGB dataset. Acquired by the National Center for Airborne Laser Mapping at the University of Houston, Houston, TX, USA, it covers the University of Houston campus and its surrounding urban areas. The dataset consists of MS-LiDAR, HS, and MS optical RS data, containing 7, 48, and 3 channels, respectively. The HS data cover a 380–1050-nm spectral range, while the laser wavelengths of the three LiDAR sensors are 1550, 1064, and 532 nm. The MS-LiDAR data also contain a digital elevation model (DEM) and a DSM derived from the point clouds. This dataset was originally provided in the 2018 GRSS Data Fusion Contest; the paper [3] reports the outcome of the Contest and contains a more detailed description of the Houston2018 dataset. We resample the imagery at a 0.5-m GSD, so that the size of each image channel is 2404×8344 pixels. The ground truth contains 20 classes; the number of samples in each class is shown in Table I.
Fig. 5. Results on the Houston2018 dataset. (a) MS optical image. (b) Test ground truth labels. (c) Prediction map by UNet. (d) Prediction map by SepG-UNet. Zoomed-in views of (a), (c), and (d) are shown in the bottom row.
#total_samples represents the total number of samples in the training set.

Metrics: To evaluate the classification results, for each classifier, we report the F1-score (F1) for each class and three criteria widely used in the literature to evaluate overall performance: the average accuracy (AA), the overall accuracy (OA), and the Kappa coefficient (κ).
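For reference, the three overall criteria can be computed from a confusion matrix as sketched below; the row/column convention (rows as reference labels, columns as predictions) is our assumption.

```python
# A sketch of OA, AA, and Cohen's kappa from a class confusion matrix.
import numpy as np

def oa_aa_kappa(conf):                 # conf: (n_classes, n_classes) counts
    n = conf.sum()
    oa = np.trace(conf) / n                                  # overall accuracy
    aa = np.mean(np.diag(conf) / conf.sum(axis=1))           # mean per-class accuracy
    pe = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / n**2  # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa

print(oa_aa_kappa(np.array([[50, 10], [5, 35]])))  # (0.85, 0.854..., 0.693...)
```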
Baseline Models: All baseline models discussed in the following, except DCNN [44] on the Houston2018 dataset, are reproduced and reevaluated under our experimental environment. We are not able to reproduce DCNN, because the paper [44] does not provide enough details of the model, so we directly cite the results reported in [44].

1) Experiments on Houston2018 Dataset: We follow [24] and treat LULC classification on the Houston2018 dataset as a semantic segmentation problem, experimenting with UNet, a very commonly studied semantic segmentation model. We run basic UNet on the dataset as a baseline and examine the performance of SepG-UNet, in which SepDGConv layers replace the regular convolution layers of the baseline model. We also compare with the first- and second-place methods of the 2018 Data Fusion Contest, both of which are multistream models: the fusion fully convolutional network (Fusion-FCN) [24] and DCNN [44].

a) Implementation details: We use image tiles of shape 58×128×128. In the training phase, we use a spatial stride of 64×64 pixels to extract training samples from the 58×1202×4768 data, while in the test phase, the stride is 128×128. We use the Adam optimizer [45] to train both UNet and SepG-UNet, with the optimizer hyperparameters β1 and β2 set to their default values. We set the initial learning rate to 0.001 and train the networks for 300 epochs. We use three GPUs in parallel to train the networks, with the batch size set to 12. To ensure reproducibility, we use transpose convolution in the upsampling modules of UNet and SepG-UNet. We add a mask to the loss function so that pixels with the label class "undefined" are not counted in the training loss.
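With cross-entropy, such a mask is equivalent to the standard ignore-index mechanism, as the following sketch shows (the sentinel value and shapes are assumptions, not taken from the paper):

```python
# A sketch of masking "undefined" pixels out of the segmentation loss.
import torch
import torch.nn.functional as F

UNDEFINED = 255                        # assumed sentinel label for unannotated pixels
logits = torch.randn(12, 20, 128, 128)            # 12 tiles, 20 LULC classes
labels = torch.randint(0, 20, (12, 128, 128))
labels[:, :16, :] = UNDEFINED                     # pretend a strip is unlabeled

# Pixels labeled UNDEFINED contribute nothing to the loss or its gradient.
loss = F.cross_entropy(logits, labels, ignore_index=UNDEFINED)
```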
b) Results: Quantitative classification results of SepG-UNet, UNet, Fusion-FCN, and DCNN are shown in Table IV, while Fig. 5(c) and (d) shows the classification maps of UNet and SepG-UNet alongside the ground truth labels. As shown in Table IV, basic UNet, which does not have a multistream architecture, already largely outperforms the 51.52% OA obtained by Fusion-FCN and the 51.2% obtained by DCNN, and with the proposed SepDGConv, SepG-UNet further improves the OA to 63.66%. From Fig. 5, it can be seen that both UNet and SepG-UNet output meaningful prediction maps. According to Table IV, using SepDGConv reduces the test OA variance, which is reflected in Fig. 5(c) and (d): SepG-UNet gives a generally less noisy classification map than basic UNet.

In Fig. 8(a), we plot the learned number of groups and the sparsity of the relationship matrix U for each SepDGConv layer in SepG-UNet. Here, the sparsity of U is defined as the ratio between the number of 0's and the total number of entries in U. The sparsity plot shows that SepDGConv generates an architecture in which dense and sparse connections appear alternately. While, in the InConv block, the learned U's are mostly sparse, the sparsities quickly drop to 0 in the Down1 block in all five replicas, which suggests that the sensor-specific branches learned in SepG-UNet are very shallow and that feature fusion probably begins in a very early stage.
TABLE IV
MODEL PERFORMANCE ON HOUSTON2018 DATASET
Fig. 6. Results on Berlin dataset. (a) HS false color image. (b) Test ground truth labels. Prediction maps by (c) ResNet18, (d) SepG-ResNet18, (e) ResNet50,
and (f) SepG-ResNet50.
In the #Groups plot, we can see that SepDGConv learns more groups in the middle layers, where there are more feature maps and more parameters. This is consistent with the behavior of DGConv reported in [35].
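Both quantities plotted in Fig. 8 can be read off a relationship matrix directly; a minimal sketch with an assumed block-diagonal, equal-size-group U:

```python
# A sketch of the two Fig. 8 quantities: sparsity of U and number of groups.
import torch

U = torch.kron(torch.eye(4), torch.ones(2, 2))   # 8x8, four groups of two channels

sparsity = (U == 0).float().mean().item()        # zeros / total entries
n_groups = int(U.shape[0] / U[0].sum().item())   # equal-size groups assumed
print(sparsity, n_groups)                        # 0.75 4
```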
2) Experiments on Berlin Dataset: In the Berlin dataset, the ground truth labels are sparse, so we extract small image patches as training and testing samples, with the center of each image patch aligned with one labeled pixel.
TABLE V
MODEL PERFORMANCE ON BERLIN DATASET
Hence, LULC classification on the Berlin dataset becomes an image classification task. We select ResNet18 and ResNet50 as baselines and experiment with SepG-ResNet18 and SepG-ResNet50, where the regular convolution layers in the baseline models are replaced with SepDGConv layers. In addition, we compare with the shared and specific feature learning model (S2FL) [42], which achieves SOTA performance on this dataset.

a) Implementation details: For SepG-ResNet18, ResNet18, SepG-ResNet50, and ResNet50, we use image patches of 17 × 17 as training and test samples. All models are trained on a single GPU using a batch size of 64. To train SepG-ResNet18 and ResNet18, we use stochastic gradient descent (SGD) with momentum as the optimizer, with the momentum parameter set to 0.9. We train the networks for 300 epochs, with the initial learning rate set to 0.001. For SepG-ResNet50 and ResNet50, we use Adam as the optimizer, with default algorithm parameters. We train both networks for 400 epochs; the initial learning rate is set to 0.001 and decayed to 0.0001 at the 300th epoch.

b) Results: The quantitative classification results of SepG-ResNet18, ResNet18, SepG-ResNet50, ResNet50, and S2FL are shown in Table V, while Fig. 6 shows the classification maps of our models alongside the ground truth labels. While S2FL is not DL-based, it is essentially a multistream model. Our experimental results show that the ResNet-based methods generally outperform S2FL. SepG-ResNet18 surpasses its baseline model and improves the SOTA OA on the Berlin dataset to 68.21%, while SepG-ResNet50 obtains a marginally lower classification accuracy than basic ResNet50. We will see in Section IV-C that the best performance with ResNet50 is achieved when some but not all convolution layers are replaced with SepDGConv. Test variance reduction is also observed for the SepDGConv models. Fig. 6 agrees visually with the quantitative results.

The SepDGConv group structure plots for SepG-ResNet18 and SepG-ResNet50 are shown in Fig. 8(b) and (c). The sparsity plot shows that both learned architectures are generally sparse; however, the sparsity drop in shallow layers, which suggests early fusion, is also present here. The #Groups plot shows that InConv learns ∼10 groups in SepG-ResNet18 but only two to four groups in SepG-ResNet50, contrary to the empirical principle that the number of groups should equal the number of sensors. Besides, for both models, there are generally more groups in the last two blocks than in the previous blocks. As, in ResNets, there are more feature maps and, thus, more parameters in Layer3 and Layer4, this result is also consistent with SepG-UNet.

3) Experiments on MUUFL Dataset: For the MUUFL dataset, we follow most studies on it and use classification models. We experiment with ResNet18, ResNet50, and their SepDGConv derivatives. We also compare our results with those of the following multistream methods: OTVCA [6] and the two-branch CNN (TB-CNN) [23].

a) Implementation details: For SepG-ResNet18 and ResNet18, we use image patches of 11 × 11 as training and test samples, while for SepG-ResNet50 and ResNet50, we use image patches of size 17 × 17. As previous studies do not use a fixed training set, we make a random train-test split in each replica under a different random seed. We follow [28] and fix the training set size to 100, with the rest used as the test set. All models are trained on a single GPU.

For SepG-ResNet18 and ResNet18, we use He et al.'s initialization [46] for the convolution filters and zero initialization for the last batch normalization layer in each residual branch [47]. We use SGD as the optimizer, with the momentum parameter set to 0.9, and train the networks for 300 epochs. The initial learning rate is set to 0.02, with a schedule that decreases it to 0.002 at the 200th epoch and further to 0.0002 at the 240th epoch. For both models, the batch size is set to 48. For SepG-ResNet50 and ResNet50, He's initialization and zero batch norm initialization are also used. We use Adam as the optimizer, with default algorithm parameters, and train both networks for 400 epochs. The initial learning rate is set to 0.01, with a schedule that decreases it to 0.001 and 0.0001 at epochs 300 and 350, respectively. Both models are trained using a batch size of 64.
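The stated ResNet18 schedule maps directly onto standard PyTorch components, as sketched below; the class count is an assumption for illustration.

```python
# A sketch of the MUUFL ResNet18 training schedule: SGD with momentum 0.9,
# lr 0.02 decayed tenfold at epochs 200 and 240, 300 epochs in total.
import torch
from torchvision.models import resnet18

model = resnet18(num_classes=11)        # class count assumed for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.02, momentum=0.9)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[200, 240], gamma=0.1)

for epoch in range(300):
    # ... one training epoch over 11x11 patches goes here ...
    scheduler.step()                    # 0.02 -> 0.002 @200 -> 0.0002 @240
```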
b) Results: The quantitative classification results of SepG-ResNet18, SepG-ResNet50, ResNet18, ResNet50, and the compared methods are shown in Table VI. Fig. 7 shows the classification maps of these models alongside the ground truth labels.
Fig. 7. Results on MUUFL dataset. (a) HS false color image. (b) Test ground truth labels. Prediction maps by (c) ResNet18, (d) SepG-ResNet18,
(e) ResNet50, and (f) SepG-ResNet50.
TABLE VI
MODEL PERFORMANCE ON MUUFL DATASET
According to Table VI, the ResNets generally achieve better performance than OTVCA and TB-CNN. On the MUUFL dataset, however, neither SepG-ResNet18 nor SepG-ResNet50 surpasses its corresponding baseline model. Again, we will see in Section IV-C that, by using SepDGConv in some but not all convolution layers, both models can obtain better performance and outperform the baseline models. In Fig. 7, the output classification maps of the ResNet50s are generally noisier than those of the ResNet18s, which suggests that the ResNet50s probably overfit, because the dataset is relatively small.

The SepDGConv group structures learned on the MUUFL dataset are shown in Fig. 8(d) and (e). Both early fusion and more groups in deeper layers are consistently observed.

C. Ablation Analysis

To investigate the performance improvement of SepDGConv, we remove SepDGConv block-by-block from the abovementioned models and analyze the influence of SepDGConv's usage on the overall performance. In particular, as the separate convolution groups learned in a SepG-Net represent sensor-specific branches, we hope to shed light on whether such sensor specificity is important for multisource RS data fusion.

We design the ablation experiments based on the optimal brain damage (OBD) theory in DL [48], according to which, if an important module in a deep neural network is removed, a significant performance drop should be observed. Concretely, for each SepDGConv model, we run one forward pass and one backward pass. In the forward pass, we change SepDGConv back to regular convolution in the following blocks, in order: [InConv, Layer1, Layer2, Layer3, Layer4] for SepG-ResNet, and [InConv, Down1, Down2, Down3, Down4, Up1, Up2, Up3, Up4] for SepG-UNet. In the backward pass, the order is reversed. The obtained models are retrained. For each model, the baseline, the SepDGConv derivative, the forward pass, and the backward pass are trained under one same configuration of hyperparameters.
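The forward and backward passes thus enumerate a simple family of configurations; the sketch below spells out that enumeration (build_sepg_resnet is a hypothetical constructor, not a function from the released code).

```python
# A sketch of the ablation order: configuration i keeps regular convolution
# in the first i blocks (forward pass) or the last i blocks (backward pass).
BLOCKS = ["InConv", "Layer1", "Layer2", "Layer3", "Layer4"]

def ablation_configs(reverse=False):
    order = BLOCKS[::-1] if reverse else BLOCKS
    regular = set()
    for block in order:
        regular.add(block)
        yield {b: ("regular" if b in regular else "sepdgconv") for b in BLOCKS}

for cfg in ablation_configs():          # each configuration is retrained
    print(cfg)  # e.g. model = build_sepg_resnet(cfg)  (hypothetical helper)
```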
The results of the ablation experiments are shown in Fig. 9. We run each model five times with random seeds 42–46, the same as in the main experiments mentioned earlier. The value of each data point in Fig. 9 is the average OA of the five replicas, while each vertical bar represents the standard deviation of the five OA values. The red line represents OA changes in a forward pass, while the blue line represents OA changes in a backward pass.
Fig. 8. Number of groups and sparsity of U in each layer of the following SepDGConv-based models on different datasets: (a) SepG-UNet trained on
Houston2018 dataset. (b) SepG-ResNet18 trained on Berlin dataset. (c) SepG-ResNet50 trained on Berlin dataset. (d) SepG-ResNet18 trained on MUUFL
dataset. (e) SepG-ResNet50 trained on MUUFL dataset.
A data point in a forward pass means that all SepDGConvs in blocks up to and including this point are replaced with regular convolution. For example, in Fig. 9(b), a point at Layer1 represents results from ResNet18s that have regular convolution in InConv and Layer1 and SepDGConv in Layer2–4. Similarly, a data point in a backward pass means that all SepDGConvs in blocks behind and including this point are replaced with regular convolution, while the blocks previous to this point keep using SepDGConv.

In the forward passes, a performance gain is consistently observed in the early stage of each pass. For example, in Fig. 9(b), on the Berlin dataset, SepG-ResNet18's performance improves as SepDGConv is removed from InConv, and an OA of 69.76% is obtained, surpassing both the original SepG-Net and the baseline.
Fig. 9. Classification performance of various models obtained via ablation analysis. The arrows indicate the order of removal of SepDGConv modules,
while blue bars represent the number of parameters in the corresponding block. (a) Performance of SepG-UNets on Houston2018 dataset. (b) Performance
of SepG-ResNet18s on Berlin dataset. (c) Performance of SepG-ResNet50s on Berlin dataset. (d) Performance of SepG-ResNet18s on MUUFL dataset.
(e) Performance of SepG-ResNet50s on MUUFL dataset.
The same performance gain can be observed in Fig. 9(d) and (e) when SepDGConv in InConv is replaced by regular convolution. According to the OBD theory, this phenomenon implies that, rather than being beneficial, imposing a multistream architecture on shallow layers actually harms model performance. For SepG-ResNet50 on the Berlin dataset, as shown in Fig. 9(c), while there is an initial performance loss, the OA goes up to 68.58% as soon as the SepDGConv layers in both InConv and Layer1 are removed, which means that, at this point, the shallow layers of the model are densely connected rather than forming a multistream, sensor-specific architecture. A very similar loss-gain curve is observed for SepG-UNet, as shown in Fig. 9(a). Hence, the experimental results of SepG-ResNet50 on the Berlin dataset and SepG-UNet on the Houston2018 dataset both support our finding that, for the
first few blocks, dense convolution is better than multistream convolution.

In the backward passes, a performance loss is observed as SepDGConv in the middle and deep layers is removed. For UNet, the performance loss occurs as the backward pass goes through the middle blocks, from Up1 to Down4, as shown in Fig. 9(a). For the ResNets, we observe an OA drop when SepDGConv in the last two blocks, Layer4 and Layer3, is replaced, as shown in Fig. 9(b)–(d).

Furthermore, based on the distribution of the number of parameters in the studied models, shown as the blue histograms in Fig. 9, we summarize the performance loss in the middle and last layers into a more general phenomenon, i.e., performance loss in wide layers. Wide layers refer to layers with more feature map channels and, thus, more parameters. It can be seen from Fig. 9 that, for UNet, the wide layers are Down3–Up2, exactly where we observe the performance loss, and, for the ResNets, the wide layers are Layer3–Layer4, where the performance loss in the backward pass is also observed. Such performance loss implies that SepDGConv is beneficial to model performance if applied to wide layers.

Finally, we revisit the performance gain observed in the early stage of the forward passes. This phenomenon occurs in the first few layers of the studied models, where there are fewer parameters, as shown in Fig. 9. These layers are also called narrow layers, in accordance with "wide layers." Thus, we can say that there is a performance gain if we replace SepDGConv with regular convolution in narrow layers. Such performance gain implies that SepDGConv is harmful to model performance if applied to narrow, usually shallow, layers. Besides, in Fig. 9(b)–(e), a performance gain is observed in the backward passes going from Layer2 to InConv. This is another clue that dense convolution in narrow layers is more favorable to model performance.

To summarize, in the ablation analysis, we observe a model performance gain if we replace SepDGConv with regular convolution in narrow layers, usually the first few layers of a model, and a model performance loss if we replace SepDGConv with regular convolution in wide layers, usually the last few layers of a model. These findings imply that the multistream architecture is harmful to model performance if used in narrow layers, but becomes beneficial if applied to wide layers.

D. Comparing Different Convolution Strategies

The results of our ablation analysis suggest that it is better to use regular convolution than SepDGConv in the first few layers of a CNN. This indicates that sensor specificity may not play an important role in data fusion models, because, with densely connected convolution, the model no longer has sensor-specific features. However, SepDGConv also regularizes the model, and the effect of sensor specificity is not yet isolated from regularization. In this section, we further compare models using SepDGConv with another two convolution strategies that impose sensor-specific multibranch architectures: GConv and fixed groupable convolution (FGConv).

As mentioned in Section II-A, by using GConv for the first l layers and setting the number of groups to a fixed number G_0, a CNN with G_0 sensor-specific branches, each of depth l, can be built. In the following experiments, we use this strategy to construct models that have strictly sensor-specific branches, so that we can better observe the effect of sensor specificity. In particular, for the ResNets, we use GConv for [InConv, Layer1–4], and, for the UNets, we use GConv for [InConv, Down1–4, Up1–4].

The role of FGConv is to further isolate sensor specificity from regularization. Using SepDGConv does not guarantee sensor specificity in deep layers, because the number of groups it learns varies from layer to layer; as for regularization, SepDGConv usually has a larger regularization effect than GConv, because SepDGConv reduces more model parameters to 0. The idea of FGConv is to combine SepDGConv and GConv so that any layer using FGConv has at least G_0 groups, while maintaining the strong regularization effect of SepDGConv. In particular, GConv can be equivalently expressed by a relationship matrix U_0 [Fig. 2(c)], and, in FGConv, we construct a new relationship matrix U_F for a layer, using that layer's learned SepDGConv relationship matrix U and a specified GConv matrix U_0:

U_F = U_0 \odot U    (16)

where ⊙ denotes the element-wise product. U_0 is constructed using the prespecified group number G_0 and is fixed, while U is still learned from the data. In the experiments, FGConv is used in [InConv, Layer1–4] for the ResNets and in [InConv, Down1–4, Up1–4] for the UNets.

For GConv, we make sure that the group division at the input layer leads to sensor specificity by manually designing the relationship matrix U_0. For the Houston2018 dataset, U_0 divides the 58-channel input data into four groups, mapping [3 MS, 48 HS, 3 LiDAR, 4 DEM/DSM] to [16, 16, 16, 16] feature maps. For the Berlin dataset, U_0 maps [244 HS, 4 SAR] to [32, 32] feature maps. For the MUUFL dataset, U_0 maps [64 HS, 2 LiDAR] to [32, 32] feature maps. The same U_0 is applied to FGConv.
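A sketch of how such a U_0 and the FGConv mask (16) can be assembled, using the Berlin input layer as the example; the helper name is ours, and the learned U is replaced by a placeholder.

```python
# A sketch of the FGConv mask (16): the learned U can only refine, never
# merge, the pre-specified sensor grouping encoded in U0.
import torch

def gconv_matrix(group_sizes_in, group_sizes_out):
    """Block-diagonal U0 mapping sensor-aligned input groups to output groups."""
    blocks = [torch.ones(o, i) for i, o in zip(group_sizes_in, group_sizes_out)]
    return torch.block_diag(*blocks)

# Berlin dataset input layer: [244 HS, 4 SAR] -> [32, 32] feature maps.
U0 = gconv_matrix([244, 4], [32, 32])        # shape (64, 248)
U = torch.ones_like(U0)                      # placeholder for a learned SepDGConv U
U_F = U0 * U                                 # element-wise product, as in (16)
print(U_F.shape)                             # torch.Size([64, 248])
```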
The experimental results are shown in Tables VII and VIII. The performance of the baseline models and the SepG models is cited from above, while the "Ablation" column refers to the model's best performance obtained in the ablation study. We report the average and standard deviation of the OA over five runs, using random seeds 42–46, the same as mentioned earlier.

Consider first the ResNets; see Table VII. First, it is consistently observed that the baseline models marginally outperform the corresponding GConv models. This indicates that sensor-specific multibranching does not necessarily help improve model performance. Second, in three out of four experiments, i.e., except ResNet50 on the Berlin dataset, FGConv outperforms GConv. This supports our basic assumption that automatically learning the GConv hyperparameters benefits model performance. Third, in all four experiments, neither GConv nor FGConv outperforms the best architecture previously found in the ablation study, where the model's first few SepDGConv layers are replaced with regular convolution. The models found in the ablation study do not have sensor-specific branches, so this result is consistent with the first observation.
TABLE VII
DIFFERENT CONVOLUTION STRATEGIES ON ResNets
TABLE VIII
DIFFERENT CONVOLUTION STRATEGIES ON UNet, HOUSTON2018 DATASET
Then, consider the UNets on the Houston2018 dataset; see Table VIII. While, consistent with the second observation above, FGConv outperforms GConv, for the UNets, we observe GConv outperforming the baseline, SepDGConv, and Ablation. To find an explanation for this, we add a False-GConv experiment, where, at the input layer, we use a group division of [14, 16, 14, 14] instead of [3, 48, 3, 4], so that the four branches are no longer sensor-specific. False-GConv outperforms GConv while achieving slightly lower accuracy than FGConv. This implies that GConv and FGConv probably gain their performance from the setting G = 4; again, this experiment supports our earlier finding that sensor specificity is not necessarily helpful.

E. Discussion

1) Multistream as Regularization: Theoretically, both SepDGConv and human-designed multistream deep neural networks can be regarded as regularization techniques, because they impose certain constraints on the network architecture to reduce overfitting and to improve model performance [34], [49]. Based on our experiments and the ablation analysis mentioned earlier, we attribute the model performance of SepDGConv in multisource RS data fusion to regularization, for we have observed the following two very important signatures of regularization.

1) Variance Reduction: It is known that regularized models can generalize better, which means that they should have lower test variance. In our experiments, as discussed in Section IV-B, except for ResNet18 on the MUUFL dataset, all SepDGConv models have less test OA variance than their corresponding baseline models.

2) Over-Regularization: If a simple model is regularized too much, the model capacity can be reduced too much to fit the data, and, as a result, the overall model performance is harmed. In our ablation analysis, as shown in Fig. 9, we find that imposing multistream SepDGConv on shallow layers leads to model performance loss. The most probable reason is that these shallow, narrow layers themselves do not have many parameters, and, using SepDGConv, they are over-regularized, leading to underfitting. On the other hand, the middle layers of UNet and the last few layers of the ResNets are much wider and have many more parameters than the first few layers; hence, using SepDGConv on these wide layers very likely just achieves the expected regularization effect, improving the models' performance.

As we experiment with three different models on three very diverse datasets, our two findings mentioned earlier are highly generalizable, which provides strong clues that multistream architectures actually play the role of a model regularizer. Yet, regularization itself is a complex technique, and its effect is always coupled with various aspects of model optimization and generalization; hence, there are probably many other factors to explore that contribute to the phenomena we have observed. For example, a very recent paper [50] finds that one same model trained separately on different sources of data acquired over the same area (RGB and SAR in their case) can end up with very similar model parameter distributions. This finding suggests that there could be feature redundancy in the shallow layers of multistream architectures. Nevertheless, we hope our work sheds some light on the mechanisms behind neural network architecture design for multisource RS data and inspires novel research.

2) Possible Improvements on GConv: Our results suggest that models with regular convolution, such as ResNet18, can obtain classification results at least comparable with SOTA methods, and that, in shallow layers, dense regular convolution should be used; together, these advocate single-stream deep CNN models for the joint classification of multisource RS data. To automatically learn grouped convolution in wide layers and utilize the regularization effect, it would be desirable for SepDGConv to also learn dense convolution for narrow layers; thus, there is still room for improvement in SepDGConv. Besides, the restrictions SepDGConv puts on the relationship matrix U are strong, and, in practice, we may need to construct U's with more flexible structure. We hope novel research in constructing and learning the relationship matrix U can lead to better single-stream CNN architectures for multisource RS data.

3) Toward Better Performance: Our work makes it possible to build deep, single-stream networks for multisource RS data. On the one hand, modern techniques that boost model
3) Toward Better Performance: Our work makes it possible to build deep, single-stream networks for multisource RS data. On the one hand, modern techniques that boost model performance are more easily applied in a unified network. On the other hand, designing sufficiently large models is beneficial to, and probably necessary for, solving large-scale, real-world RS problems.
In this article, we experiment and compare only with basic models, leaving unexplored techniques such as ensembling and postprocessing, as well as more complex modules such as attention mechanisms, that could further improve performance. For example, [28] combines Octave convolution and fractional Gabor convolution in a network that achieves the state-of-the-art OA of 89.90% on the MUUFL dataset. We believe that such advanced modules, and others to come, can be implemented more easily in a single-branch network.

V. CONCLUSION

In this article, we have investigated the potential of single-stream models for the joint classification of multisource RS data. To enable a multistream network structure to be learned automatically within a single-stream architecture, we propose the SepDGConv module based on the GConv and DGConv techniques. With reference to modern deep CNN architectures, we then propose several DL models with SepDGConv: SepG-ResNet18, SepG-ResNet50, and SepG-UNet. The proposed models are verified on three benchmark datasets with diverse data modalities, yielding promising classification results, which indicates the effectiveness and generalizability of the proposed single-stream networks for multisource RS data joint classification. Furthermore, we analyze the usage of SepDGConv in different parts of the models and find that: 1) using SepDGConv generally reduces model variance; 2) using SepDGConv in narrow layers, usually the first few layers, harms model performance; and 3) using SepDGConv in wide layers, usually the last few layers, improves model performance. These findings imply that the sensor-specific multistream architecture essentially plays the role of a model regularizer and is not strictly necessary for multisource RS data fusion. We hope our work can inspire novel flexible and generalizable models for multisource RS data analysis.
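In line with findings 2) and 3), such modules can be applied selectively. The sketch below is a hedged illustration rather than the exact SepG-ResNet18 recipe: it uses plain GConv as a stand-in for SepDGConv and replaces only the 3 × 3 convolutions in the wide, late stages of a torchvision-style ResNet-18 [39], leaving the narrow early layers dense. The helper name replace_with_groups and the group count are our illustrative choices.

import torch.nn as nn
from torchvision.models import resnet18

def replace_with_groups(module: nn.Module, groups: int) -> None:
    """Recursively swap dense 3x3 convolutions for grouped ones."""
    for name, child in module.named_children():
        if (isinstance(child, nn.Conv2d) and child.kernel_size == (3, 3)
                and child.in_channels % groups == 0
                and child.out_channels % groups == 0):
            setattr(module, name, nn.Conv2d(
                child.in_channels, child.out_channels, kernel_size=3,
                stride=child.stride, padding=child.padding,
                groups=groups, bias=child.bias is not None))
        else:
            replace_with_groups(child, groups)

model = resnet18(num_classes=20)             # e.g., 20 LULC classes
replace_with_groups(model.layer3, groups=4)  # regularize deep, wide stages only;
replace_with_groups(model.layer4, groups=4)  # layer1/layer2 stay dense

Restricting the grouped (multistream-like) structure to the deep stages follows the ablation result that over-regularizing the narrow first layers causes underfitting, while regularizing the wide last layers improves performance.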
REFERENCES
[1] J. Li, Y. Pei, S. Zhao, R. Xiao, X. Sang, and C. Zhang, "A review of remote sensing for environmental monitoring in China," Remote Sens., vol. 12, no. 7, p. 1130, Apr. 2020.
[2] R. P. Sishodia, R. L. Ray, and S. K. Singh, "Applications of remote sensing in precision agriculture: A review," Remote Sens., vol. 12, no. 19, p. 3136, Sep. 2020.
[3] Y. Xu et al., "Advanced multi-sensor optical remote sensing for urban land use and land cover classification: Outcome of the 2018 IEEE GRSS data fusion contest," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 12, no. 6, pp. 1709–1724, Jun. 2019.
[4] M. Pedergnana, P. R. Marpu, M. D. Mura, J. A. Benediktsson, and L. Bruzzone, "Classification of remote sensing optical and LiDAR data using extended attribute profiles," IEEE J. Sel. Topics Signal Process., vol. 6, no. 7, pp. 856–865, Nov. 2012.
[5] M. Khodadadzadeh, J. Li, S. Prasad, and A. Plaza, "Fusion of hyperspectral and LiDAR remote sensing data using multiple feature learning," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 8, no. 6, pp. 2971–2983, Jun. 2015.
[6] B. Rasti, P. Ghamisi, and R. Gloaguen, "Hyperspectral and LiDAR fusion using extinction profiles and total variation component analysis," IEEE Trans. Geosci. Remote Sens., vol. 55, no. 7, pp. 3997–4007, Jul. 2017.
[7] Y. Gu, Q. Wang, X. Jia, and J. A. Benediktsson, "A novel MKL model of integrating LiDAR data and MSI for urban area classification," IEEE Trans. Geosci. Remote Sens., vol. 53, no. 10, pp. 5312–5326, Oct. 2015.
[8] J. Hu, D. Hong, and X. X. Zhu, "MIMA: MAPPER-induced manifold alignment for semi-supervised fusion of optical image and polarimetric SAR data," IEEE Trans. Geosci. Remote Sens., vol. 57, no. 11, pp. 9025–9040, Nov. 2019.
[9] G. Singh et al., "Topological methods for the analysis of high dimensional data sets and 3D object recognition," PBG Eurographics, vol. 2, pp. 1–10, Sep. 2007.
[10] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, pp. 436–444, May 2015.
[11] Y. LeCun, K. Kavukcuoglu, and C. Farabet, "Convolutional networks and applications in vision," in Proc. IEEE Int. Symp. Circuits Syst., May 2010, pp. 253–256.
[12] D. Marmanis, M. Datcu, T. Esch, and U. Stilla, "Deep learning earth observation classification using ImageNet pretrained networks," IEEE Geosci. Remote Sens. Lett., vol. 13, no. 1, pp. 105–109, Jan. 2016.
[13] E. Maggiori, Y. Tarabalka, G. Charpiat, and P. Alliez, "Convolutional neural networks for large-scale remote-sensing image classification," IEEE Trans. Geosci. Remote Sens., vol. 55, no. 2, pp. 645–657, Feb. 2017.
[14] X. Yuan, J. Shi, and L. Gu, "A review of deep learning methods for semantic segmentation of remote sensing imagery," Expert Syst. Appl., vol. 169, May 2021, Art. no. 114417.
[15] W. Hu, Y. Huang, L. Wei, F. Zhang, and H. Li, "Deep convolutional neural networks for hyperspectral image classification," J. Sensors, vol. 2015, pp. 1–12, Jan. 2015.
[16] W. Li, G. Wu, F. Zhang, and Q. Du, "Hyperspectral image classification using deep pixel-pair features," IEEE Trans. Geosci. Remote Sens., vol. 55, no. 2, pp. 844–853, Feb. 2017.
[17] H. Lee and H. Kwon, "Going deeper with contextual CNN for hyperspectral image classification," IEEE Trans. Image Process., vol. 26, no. 10, pp. 4843–4855, Oct. 2017.
[18] X. He, A. Wang, P. Ghamisi, G. Li, and Y. Chen, "LiDAR data classification using spatial transformation and CNN," IEEE Geosci. Remote Sens. Lett., vol. 16, no. 1, pp. 125–129, Jan. 2018.
[19] S. Pan et al., "Land-cover classification of multispectral LiDAR data using CNN with optimized hyper-parameters," ISPRS J. Photogramm. Remote Sens., vol. 166, pp. 241–254, Aug. 2020.
[20] J. Zhao, W. Guo, S. Cui, Z. Zhang, and W. Yu, "Convolutional neural network for SAR image classification at patch level," in Proc. IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), 2016, pp. 945–948.
[21] M. Ma, J. Chen, W. Liu, and W. Yang, "Ship classification and detection based on CNN using GF-3 SAR images," Remote Sens., vol. 10, no. 12, p. 2043, Dec. 2018.
[22] Y. Chen, C. Li, P. Ghamisi, X. Jia, and Y. Gu, "Deep fusion of remote sensing data for accurate classification," IEEE Geosci. Remote Sens. Lett., vol. 14, no. 8, pp. 1253–1257, Aug. 2017.
[23] X. Xu, W. Li, Q. Ran, Q. Du, L. Gao, and B. Zhang, "Multisource remote sensing data classification based on convolutional neural network," IEEE Trans. Geosci. Remote Sens., vol. 56, no. 2, pp. 937–949, Feb. 2018.
[24] Y. Xu, B. Du, and L. Zhang, "Multi-source remote sensing data classification via fully convolutional networks and post-classification processing," in Proc. IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), Jul. 2018, pp. 3852–3855.
[25] D. Hong et al., "More diverse means better: Multimodal deep learning meets remote-sensing imagery classification," IEEE Trans. Geosci. Remote Sens., vol. 59, no. 5, pp. 4340–4354, Apr. 2021.
[26] D. Hong, N. Yokoya, G.-S. Xia, J. Chanussot, and X. X. Zhu, "X-ModalNet: A semi-supervised deep cross-modal network for classification of remote sensing data," ISPRS J. Photogramm. Remote Sens., vol. 167, pp. 12–23, Sep. 2020.
[27] M. Zhang, W. Li, R. Tao, H. Li, and Q. Du, "Information fusion for classification of hyperspectral and LiDAR data using IP-CNN," IEEE Trans. Geosci. Remote Sens., vol. 60, pp. 1–12, 2022.
[28] X. Zhao, R. Tao, W. Li, W. Philips, and W. Liao, "Fractional Gabor convolutional network for multisource remote sensing data classification," IEEE Trans. Geosci. Remote Sens., vol. 60, pp. 1–18, 2022.
[29] R. Hang, Z. Li, P. Ghamisi, D. Hong, G. Xia, and Q. Liu, "Classification of hyperspectral and LiDAR data using coupled CNNs," IEEE Trans. Geosci. Remote Sens., vol. 58, no. 7, pp. 4939–4950, Jul. 2020.
[30] C. Hazirbas, L. Ma, C. Domokos, and D. Cremers, "FuseNet: Incorporating depth into semantic segmentation via fusion-based CNN architecture," in Proc. Asian Conf. Comput. Vis. Cham, Switzerland: Springer, 2016, pp. 213–228.
[31] N. Audebert, B. Le Saux, and S. Lefèvre, "Beyond RGB: Very high resolution urban remote sensing with multimodal deep networks," ISPRS J. Photogramm. Remote Sens., vol. 140, pp. 20–32, Jun. 2018.
[32] K. Chen et al., "Effective fusion of multi-modal data with group convolutions for semantic segmentation of aerial imagery," in Proc. IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), Jul. 2019, pp. 3911–3914.
[33] A. Khan, A. Sohail, U. Zahoora, and A. S. Qureshi, "A survey of the recent architectures of deep convolutional neural networks," Artif. Intell. Rev., vol. 53, no. 8, pp. 5455–5516, 2020, doi: 10.1007/s10462-020-09825-6.
[34] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA, USA: MIT Press, 2016.
[35] Z. Zhang et al., "Differentiable learning-to-group channels via groupable convolutional neural networks," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 3542–3551.
[36] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), Stateline, NV, USA, vol. 25, Dec. 2012, pp. 1097–1105.
[37] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, "Aggregated residual transformations for deep neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 1492–1500.
[38] P. Yin, J. Lyu, S. Zhang, S. Osher, Y. Qi, and J. Xin, "Understanding straight-through estimator in training activation quantized neural nets," in Proc. Int. Conf. Learn. Represent., 2019, pp. 1–30.
[39] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2016, pp. 770–778.
[40] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent. Cham, Switzerland: Springer, 2015, pp. 234–241.
[41] A. Okujeni, S. van der Linden, and P. Hostert, "Berlin-urban-gradient dataset 2009—An EnMAP preparatory flight campaign," GFZ (German Research Centre for Geosciences) Data Services, Potsdam, Germany, Tech. Rep., 2016, doi: 10.2312/enmap.2016.002.
[42] D. Hong, J. Hu, J. Yao, J. Chanussot, and X. X. Zhu, "Multimodal remote sensing benchmark datasets for land cover classification with a shared and specific feature learning model," ISPRS J. Photogramm. Remote Sens., vol. 178, pp. 68–80, Aug. 2021.
[43] X. Du and A. Zare, "Technical report: Scene label ground truth map for MUUFL Gulfport data set," Univ. Florida, Gainesville, FL, USA, Tech. Rep. 20170417, Apr. 2017. [Online]. Available: http://ufdc.ufl.edu/IR00009711/00001
[44] D. Cerra et al., "Combining deep and shallow neural networks with ad hoc detectors for the classification of complex multi-modal urban scenes," in Proc. IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), Jul. 2018, pp. 3856–3859.
[45] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2014, arXiv:1412.6980.
[46] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1026–1034.
[47] P. Goyal et al., "Accurate, large minibatch SGD: Training ImageNet in 1 hour," 2017, arXiv:1706.02677.
[48] Y. LeCun, J. S. Denker, and S. A. Solla, "Optimal brain damage," in Proc. Adv. Neural Inf. Process. Syst., 1990, pp. 598–605.
[49] J. Kukačka, V. Golkov, and D. Cremers, "Regularization for deep learning: A taxonomy," 2017, arXiv:1710.10686.
[50] Z. Zheng, A. Ma, L. Zhang, and Y. Zhong, "Deep multisensor learning for missing-modality all-weather mapping," ISPRS J. Photogramm. Remote Sens., vol. 174, pp. 254–264, Apr. 2021.

Yi Yang received the B.E. degree in surveying and mapping engineering from Tongji University, Shanghai, China, in 2018, and the M.S. degree in data science (computer science and technology) from Peking University, Beijing, China, in 2021. He is expected to pursue a second master's degree at Imperial College London, London, U.K. His research interests include geospatial big data, remote sensing image processing, and pattern recognition.

Daoye Zhu received the M.S. degree in cartography and geographical information system from the State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing (LIESMARS), Wuhan University, Wuhan, China, in 2018. He is currently pursuing the Ph.D. degree in data science (computer science and technology) with Peking University, Beijing, China. In 2020, he joined the University of Cambridge, Cambridge, U.K., as a Visiting Scholar. His research interests include spatial analysis, remote sensing intelligent interpretation, spatial data fusion, and geographic information engineering.

Tengteng Qu received the B.E. degree in surveying engineering and the Ph.D. degree in photogrammetry engineering and remote sensing from Tongji University, Shanghai, China, in 2011 and 2017, respectively. From 2018 to 2021, she was a Post-Doctoral Researcher with Peking University, Beijing, China, where she is currently a Research Assistant Professor with the College of Engineering. Since 2019, she has been serving as a Standard Expert of geographic information in the IEEE Standards Association, the Open Geospatial Consortium (OGC), and the International Organization for Standardization (ISO). Her research interests include geospatial big data, global subdivision grids, and synthetic aperture radar remote sensing.

Qiangyu Wang received the B.S. degree in computer science and technology from the Shandong University of Science and Technology, Qingdao, China, in 2012, the M.Sc. degree in advanced computer science from Newcastle University, Newcastle upon Tyne, U.K., in 2014, and the Ph.D. degree in computer architecture from the China University of Mining and Technology (Beijing), Beijing, China, in 2019. He is currently a Post-Doctoral Researcher with the School of Engineering, Peking University, Beijing. His research interests include spatio-temporal grids for big data, deep learning, and computer vision.

Fuhu Ren received the B.S. degree in geology, the M.S. degree in remote sensing and geographic information system, and the Ph.D. degree in geography from Peking University, Beijing, China, in 1984, 1988, and 1991, respectively. He was a Post-Doctoral Researcher with the University of Tokyo, Tokyo, Japan, for two years, and a UN Researcher with the United Nations Centre for Regional Development (UNCRD), Nagoya, Japan, for three years. He is currently a Professor and the Executive Director of the Collaborative Innovation Center for Geospatial Big Data, Peking University. He is an Expert representing the Standardization Administration of the People's Republic of China (SAC) in Working Group 9 of ISO/TC 211 for the development of relevant standards on DGGS. His research interests include discrete global grid systems (DGGS), spatial-temporal analysis, and remote sensing cloud computing.

Chengqi Cheng received the Ph.D. degree from Peking University, Beijing, China, in 1989. He is currently a Professor with the College of Engineering, Peking University. He established the Collaborative Innovation Center for Geospatial Data, Peking University, which has been involved in the IEEE Standards Association Corporate Program. His research interests include global subdivision models and geographic information system applications.