Single-Stream CNN With Learnable Architecture For Multisource Remote Sensing Data
Abstract— In this article, we propose an efficient and generalizable framework based on a deep convolutional neural network (CNN) for the joint classification of multisource remote sensing (RS) data. While recent methods are mostly based on multistream architectures, we use group convolution (GConv) to construct equivalent network architectures efficiently within a single-stream network. Based on a recent technique called dynamic grouping convolution (DGConv), we further propose a network module named separable DGConv (SepDGConv) that makes the GConv hyperparameters, and thus the overall network architecture, learnable during network training. In the experiments, the proposed method is applied to the residual network (ResNet) and UNet, and the adjusted networks are verified on three very diverse benchmark datasets (i.e., the Houston2018 data, the Berlin data, and the MUUFL Gulfport Hyperspectral and LiDAR Airborne Data Set (MUUFL) data). Experimental results demonstrate the effectiveness of the proposed single-stream CNNs; in particular, SepG-ResNet18 improves the state-of-the-art overall classification accuracy (OA) on the hyperspectral–synthetic aperture radar (HS–SAR) Berlin dataset from 62.23% to 68.21%. The experiments yield two interesting findings. First, using DGConv generally reduces the test OA variance. Second, a multistream architecture is harmful to model performance if imposed on the first few layers, but becomes beneficial if applied to deeper layers. Altogether, these findings imply that the multistream architecture, instead of being a strictly necessary component of deep learning models for multisource RS data, essentially plays the role of a model regularizer. Our code is publicly available at https://github.com/yyyyangyi/CNNs-for-Multi-Source-Remote-Sensing-Data-Fusion. We hope our work can inspire novel research in the future.

Index Terms— Classification, convolutional neural networks (CNNs), dynamic grouping convolution (DGConv), multisource remote sensing (RS) data, network architecture, segmentation.

Manuscript received August 31, 2021; revised January 3, 2022 and March 10, 2022; accepted April 6, 2022. Date of publication April 21, 2022; date of current version May 4, 2022. This work was supported by the National Key Research and Development Program of China under Grant 2018YFB0505300. (Corresponding author: Yi Yang.) Yi Yang and Fuhu Ren are with the Center for Data Science, Peking University, Beijing 100871, China (e-mail: [email protected]; [email protected]). Daoye Zhu is with the Center for Data Science, Peking University, Beijing 100871, China, and also with the Laboratory of Interdisciplinary Spatial Analysis, University of Cambridge, Cambridge CB3 9EP, U.K. (e-mail: [email protected]). Tengteng Qu, Qiangyu Wang, and Chengqi Cheng are with the College of Engineering, Peking University, Beijing 100871, China (e-mail: [email protected]; [email protected]; [email protected]). Digital Object Identifier 10.1109/TGRS.2022.3169163

I. INTRODUCTION

REMOTE sensing (RS) plays an important role in Earth observation and supports applications such as environmental monitoring [1], precision agriculture [2], and so on. One fundamental yet challenging task in RS is land-use/land-cover (LULC) classification, which aims to assign one semantic category to each pixel of an RS image acquired over some region of interest.

Nowadays, diverse sensor technologies make it possible to measure different aspects of scenes and objects from the air, including sensors for multispectral (MS) optical imaging, hyperspectral (HS) imaging, synthetic aperture radar (SAR), and light detection and ranging (LiDAR). Different sensors bring diverse and complementary information [3]. For example, MS optical imagery contains spatial information, such as object shape and spatial relationships. HS data provide detailed spectral information on LULC and ground objects. While HS imagery cannot be used to differentiate objects composed of the same material, such as roofs and roads both made of concrete, LiDAR data capture the elevation distribution and can thus be used to distinguish roofs from roads. SAR data can provide additional structural information about Earth's surface. The availability of multisource, multimodal RS data makes it possible to integrate rich information to improve LULC classification performance, and considerable effort has been invested in recent years into the joint analysis of multisource RS data for LULC.

A. Related Work

Conventionally, a multisource RS data analysis workflow contains two phases: a feature extraction phase and a feature fusion phase. In the feature extraction phase, different feature extractors are applied to different data modalities, while in the feature fusion phase, the high-level features obtained from the previous phase are fused by certain algorithms and fed to LULC classifiers. For example, both [4] and [5] extract morphological attribute profiles from HS and LiDAR data and use feature stacking as the fusion technique. In [6], morphological extinction profiles are extracted separately from HS and LiDAR data, and the features are further fused using orthogonal total variation component analysis (OTVCA). Gu et al. [7] extracted manually engineered features from MS and LiDAR data and used multiple kernel learning as a fusion strategy to train a support vector machine classifier. In [8], MAPPER [9] is used as a feature extractor for MS optical and SAR data, and the features are further fused with manifold alignment. Note that, in the literature, data fusion also refers to a data processing technique that integrates multisource data into one data modality, while, in our paper, we use this term
to express the same meaning as "joint classification/analysis" of multisource RS data.

Meanwhile, deep learning (DL) [10], one of the most notable recent advances in computer vision (CV), has attracted attention from both the CV and RS communities. In particular, the convolutional neural network (CNN) [11] is a DL-based model that significantly outperforms traditional methods in image classification and segmentation. Most RS data are also presented in the form of images, and CNNs have shown remarkable success in analyzing MS [12]–[14], HS [15]–[17], LiDAR [18], [19], and SAR data [20], [21].

Besides being applied to individual data sources, CNNs are adopted as backbone models for multisource RS data classification in many recent works. As an early attempt, [22] designs a two-branch CNN for the joint analysis of HS-LiDAR data, with one branch for each sensor, achieving promising classification accuracy. Xu et al. [23] proposed another two-branch CNN for HS-LiDAR data, with a different design of the HS feature extraction branch. In [24], a three-branch CNN is proposed to fuse MS, HS, and LiDAR data. Hong et al. [25] further extended the scope of deep multibranch networks by allowing either a CNN or a fully connected neural network to be a feature extraction branch. See Section II-A for a formal definition of a network branch.

Based on the multibranch architecture, some of the latest papers devote themselves to further improving model performance by introducing various novel modules to the network. Hong et al. [26] used self-adversarial modules, interactive learning modules, and label propagation modules to build a deep CNN for semisupervised multimodal learning. In [27], Gram matrices are utilized to improve the preservation of multisource complementary information in a two-branch CNN for HS and LiDAR data fusion. In [28], a two-branch CNN for the joint classification of HS and LiDAR is proposed, where, in the feature extraction branches, Octave convolutional layers are used to reduce feature redundancy from low-frequency data components, and, in the fusion subnetwork, fractional Gabor convolution is utilized to obtain multiscale and multidirectional spatial features.

When multistream architectures are used as mentioned earlier, the data fusion strategy is also known as the "late-fusion" scheme, because the features are kept sensor-specific until the last few layers, i.e., the fusion subnetwork. A contrasting strategy is "early fusion." As the name suggests, early fusion means that either the feature extraction branches are very shallow or there is certain information exchange between different branches, making the features no longer strictly sensor-specific. In [29], a coupled CNN is proposed to fuse HS and LiDAR data, where a weight-sharing technique is utilized in the intermediate layers of the proposed two-branch CNN, so that each feature extraction branch contains only very few separate layers. Hazirbas et al. [30] proposed a fully convolutional network named FuseNet with two branches for the semantic labeling of indoor scenes on red-green-blue-depth (RGB-D) data. To fuse features from the RGB branch and the depth branch, the authors propose a fusion block, which adds the output of each block of the depth branch to the RGB branch. In [31], a more detailed comparison between FuseNet-based early fusion and multibranch late fusion is made, and Audebert et al. [31] found that neither strategy consistently outperforms the other across different datasets. In [32], MS optical data and a digital surface model (DSM) band are jointly classified within a single-stream CNN with depth-wise convolution.

B. Challenges

It can be summarized from the abovementioned literature that, when designing a data fusion CNN, the following two principles are usually followed: 1) different branches are strictly separated, so that low-level features are sensor-specific, and 2) the number of branches is set equal or proportional to the number of data sources.

Despite the achieved success, it is far from fully understood why these empirical principles work. In particular, it may be helpful for further improving data fusion models if we can gain insight into the following two problems.

1) How to Find the Optimal Number of Branches?: Model performance and efficiency are both closely related to the number of branches, which is often treated as a hyperparameter and defined by human experts. This can very likely lead the network to a suboptimal solution, because experts cannot know the optimal setting with confidence, and, in fact, there has been no agreement on an optimal choice of these hyperparameters. On the one hand, while it is possible to find an optimal number of branches by trial for small models, for typical modern CNNs, which are very large (∼100 layers [33]), manual tuning is no longer feasible. On the other hand, in a CNN, convolution layers at different depths learn features of different semantic meanings, and it can be very difficult to find the optimal network depth at which sensor-specific features should be fused. It is, therefore, desirable that hyperparameters, such as the number of branches and the branch depth, can be found automatically.

2) Which Works—Specificity or Regularization?: A multistream CNN has fewer parameters than its dense counterpart; in the latter, there are additional parameters connecting the different branches. In the DL community, reducing model parameters is known as an effective regularization technique, which improves a model's test performance [34]. For multisource data fusion, while it is generally assumed that sensor-specific features are beneficial, the effects of regularization have not been studied in isolation.

C. Method Overview

To address the aforementioned challenges, we aim, in this article, to develop a framework that allows CNN architectures to be learned from data within a single-stream network for multisource RS data fusion.

We notice that any multistream architecture can be equivalently expressed by group convolution (GConv) within a single-stream architecture (see Section II-A). Two parameters control GConv and need to be specified for each layer: the total number of groups and the number of feature maps in each group. The recently proposed dynamic grouping convolution (DGConv) [35] enables these two parameters to be learned in an end-to-end manner via network training.
Fig. 1. Illustration of the differences among (a) the multistream architecture, (b) the GConv architecture, and (c) the proposed DGConv architecture.
Originally proposed for efficient architecture design, DGConv itself does not ensure sensor-specific features and, thus, cannot be directly used to approximate and study multistream models for multisource RS data fusion. In our paper, we propose necessary modifications to DGConv, based on which we further design CNN blocks and single-stream architectures with simultaneous feature extraction and fusion for the joint classification of multisource RS data. Fig. 1(c) illustrates such a CNN model. More specifically, the contributions of this paper can be highlighted as follows.

1) A modified DGConv module, which we name separable DGConv (SepDGConv), is proposed to automatically learn a GConv structure within single-stream neural networks. SepDGConv is theoretically compatible with any CNN architecture.

2) Based on the proposed SepDGConv module, deep single-stream CNN models are proposed with reference to typical architectures in the CV area. The proposed CNNs show promising classification performance on various benchmark multisource RS datasets.

3) Experimental results suggest that using a densely connected network to jointly extract features from multiple data modalities actually improves the final classification performance, and that using SepDGConv in deeper layers, which contain more parameters, also helps improve classification accuracy. This finding is very interesting, because it suggests that regularization contributes more to model performance improvement than sensor specificity.

4) To the best of our knowledge, this is the first time that single-stream CNNs for multisource RS data fusion are systematically studied and compared with state-of-the-art (SOTA) multibranch models.

The remainder of this article is organized as follows. Preliminaries on GConv and DGConv are given in Section II, and the proposed SepDGConv layer and CNN architectures are introduced in Section III. The experimental results and analysis are presented in Section IV. Finally, Section V summarizes the article with some important conclusions and hints at potential future research directions.

II. PRELIMINARIES

In this section, first, we show how we can use GConv to construct a single-stream CNN that is equivalent to a multistream one. Second, we briefly introduce DGConv as well as groupable networks (G-Nets), a family of architectures using DGConv.

A. GConv as Multistream Conv

A CNN branch/stream consists of a sequence of convolution/normalization/activation/pooling layers, and, in a multistream architecture, the network branches usually play the role of feature extractors. Fig. 1(a) shows a typical two-branch CNN, with HS data fed to one branch (blue) and LiDAR data to the other (yellow). The output features of the multistream feature extractors are sensor-specific, because the LiDAR data never enter the HS branch, and vice versa. These features are further fed into a fusion subnetwork, which usually consists of one single branch.
The fusion subnetwork merges the sensor-specific features and produces the final classification. Formally, a regular convolution layer with a k × k kernel computes

O(i, j) = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} \omega(m, n)\, F(i + m, j + n)    (1)

where F and O denote the input and output features and ω denotes the convolution kernel, so that every output channel depends on every input channel. In GConv, the input and output channels are divided into G groups, and (1) is applied within each group independently:

O_g(i, j) = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} \omega_g(m, n)\, F_g(i + m, j + n), \quad g = 1, \ldots, G    (2)

so that channels in different groups never interact. Aligning the group boundaries with the sensor boundaries of the stacked multisource input, therefore, reproduces a multistream architecture within a single stream, as illustrated in Fig. 1(b).
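To make the equivalence concrete, the following minimal sketch (PyTorch assumed; channel sizes are illustrative, not taken from the paper) builds a two-branch layer and a single grouped convolution with matched weights and checks that their outputs coincide.

```python
# A minimal sketch, assuming PyTorch and illustrative channel sizes: a
# two-branch layer and one grouped convolution compute identical outputs.
import torch
import torch.nn as nn

x_hs = torch.randn(1, 32, 64, 64)      # stand-in HS feature maps
x_lidar = torch.randn(1, 32, 64, 64)   # stand-in LiDAR feature maps

# Multistream: one 3x3 convolution per sensor branch.
branch_hs = nn.Conv2d(32, 32, kernel_size=3, padding=1, bias=False)
branch_lidar = nn.Conv2d(32, 32, kernel_size=3, padding=1, bias=False)

# Single stream: one GConv (groups=2) over the stacked channels.
gconv = nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=2, bias=False)
with torch.no_grad():                  # copy branch weights into the groups
    gconv.weight[:32].copy_(branch_hs.weight)
    gconv.weight[32:].copy_(branch_lidar.weight)

y_multi = torch.cat([branch_hs(x_hs), branch_lidar(x_lidar)], dim=1)
y_single = gconv(torch.cat([x_hs, x_lidar], dim=1))
print(torch.allclose(y_multi, y_single, atol=1e-6))  # True
```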
1) Definition: Formally, DGConv is defined as

O(i, j) = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} \bigl(U \odot \omega(m, n)\bigr)\, F(i + m, j + n)    (3)

where U ∈ {0, 1}^{C_out × C_in} and ⊙ denotes the element-wise product. U_i, the ith row of U, is a binary vector that indicates which input channels are involved in the computation of the ith output channel. The definition is reasonable, as many convolution operations can be regarded as special cases of DGConv. For instance, DGConv becomes regular convolution (1) if we let U be a matrix of ones, as illustrated in Fig. 2(a). DGConv becomes depth-wise convolution if we let U be an identity matrix, as illustrated in Fig. 2(b). DGConv can also represent GConv (2) if we take U to be a block-diagonal matrix of ones and zeros, as illustrated in Fig. 2(c).

2) Learning the Relationship Matrix U: While (3) is representative, such a definition results in the following two difficulties in estimating U. First, the introduction of U adds many additional parameters to the network, which makes the learning process more difficult. Second, U takes binary values of 0 and 1, and it is widely known that optimization problems involving discrete values are generally very hard to solve.

To address the first issue, U is decomposed into a set of small matrices, and learnable parameters are designed to generate this set of small matrices. Consider a simple yet quite general case, where U is a square matrix with C_in = C_out = 2^K, K being an integer. Then, a set of 2 × 2 matrices U_1, ..., U_i, ..., U_K can be defined, and U can be reconstructed as

U = U_1 \otimes \cdots \otimes U_i \otimes \cdots \otimes U_K    (4)

where ⊗ denotes the Kronecker product and i ∈ {1, ..., K}. Each small matrix U_i is further represented by a binary parameter g_i ∈ {0, 1}:

U_i = g_i \mathbf{1} + (1 - g_i) I    (5)

where 1 denotes a 2 × 2 constant matrix of ones and I denotes the 2 × 2 constant identity matrix. Thus, each 2^K × 2^K relationship matrix U can be constructed from a vector g ∈ R^K, and the number of parameters to be learned is thereby reduced exponentially.

To address the second issue, a learnable gate vector g̃, taking continuous values, is introduced to generate the binary vector g as follows:

g = \mathrm{sign}(\tilde{g})    (6)

where sign(·) represents the step function

\mathrm{sign}(x) = \begin{cases} 0, & x < 0 \\ 1, & x \ge 0 \end{cases}    (7)

Altogether, combining (4)–(7), the 2^K × 2^K binary relationship matrix U is constructed from a continuous vector g̃ of length K:

g = \mathrm{sign}(\tilde{g}), \quad U = \bigl(g_1 \mathbf{1} + (1 - g_1) I\bigr) \otimes \cdots \otimes \bigl(g_K \mathbf{1} + (1 - g_K) I\bigr).    (8)

Backpropagation through the non-differentiable sign(·) can be done with the straight-through estimator proposed for quantized neural networks [38], and automatic gradient computation for the rest is supported by most modern DL programming frameworks.

As an example, a convolution layer with C_in = C_out = 8, as shown in Fig. 2, has K = 3, and the relationship matrix U is of shape 8 × 8. U for the regular convolution in Fig. 2(a) can be expressed as U = 1 ⊗ 1 ⊗ 1, with g = (1, 1, 1). Similarly, for the depth-wise convolution in Fig. 2(b), g = (0, 0, 0) and U = I ⊗ I ⊗ I. For the GConv illustrated in Fig. 2(c), g = (1, 1, 0) and U = 1 ⊗ 1 ⊗ I. For the DGConv shown in Fig. 2(d), g = (0, 0, 1) and U = I ⊗ I ⊗ 1.
C. G-Nets

G-Nets [35] refer to architectures using DGConv. In particular, Zhang et al. [35] experimented with G-ResNet50, which is based on ResNet50 [39].

ResNet50 uses the Bottleneck as its building block. The Bottleneck block consists, in order, of one 1 × 1 convolution layer, one 3 × 3 convolution layer, and one more 1 × 1 convolution layer, as shown in Fig. 3(c). In its DGConv version, the middle 3 × 3 convolution is replaced with a 3 × 3 DGConv. G-ResNet50 consists of four Bottleneck blocks, with the number of output channels for each block being [256, 512, 1024, 2048], respectively.

III. METHOD

While DGConv enables the automatic learning of GConv hyperparameters, the learning outcome does not lead to a network with sensor-specific branches, as we will see in the following, and, thus, it cannot be directly used to approximate multibranch CNNs for multisource RS data fusion. To address this issue, based on DGConv, we propose SepDGConv and the separable G-Net (SepG-Net), which make it possible for the learned architecture to contain sensor-specific branches.

A. Blocks With SepDGConv

First, reconsider the Bottleneck block with DGConv. If sensor specificity is to be preserved, then it is necessary that every convolution layer in a block uses GConv, as in (2). In both the regular and the DGConv Bottleneck blocks, the first and last convolution layers use 1 × 1 regular convolution. Therefore, in our SepDGConv Bottleneck, we use depth-wise convolution instead of regular convolution; i.e., we set the number of groups G equal to the number of that layer's feature maps. Fig. 3(d) shows a SepDGConv Bottleneck block.

Second, consider the DoubleConv block, which is used as the building block by two very popular CNN models: ResNet18 and UNet [40]. A DoubleConv block consists of two consecutive 3 × 3 convolution layers, as shown in Fig. 3(a). In the residual network (ResNet) family, the BasicBlock has a structure very similar to that of DoubleConv, except for its additional residual connection. As long as there is no confusion, we also use DoubleConv to refer to the BasicBlock in ResNets. To impose sensor specificity, in our SepDGConv DoubleConv, we replace both convolution layers with DGConv layers, as shown in Fig. 3(b).
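As a structural illustration of the DoubleConv variant in Fig. 3(b), the sketch below uses ordinary grouped convolutions with a fixed group count where SepDGConv would learn the grouping; the class name and sizes are ours, not from the released code.

```python
# A sketch of a grouped DoubleConv/BasicBlock: both 3x3 convolutions are
# grouped; a fixed group count stands in for the learned DGConv grouping.
import torch
import torch.nn as nn

class SepDoubleConv(nn.Module):
    def __init__(self, channels, groups):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=groups, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, groups=groups, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # BasicBlock variant: residual connection around the two convolutions.
        return self.relu(self.body(x) + x)

block = SepDoubleConv(64, groups=4)
print(block(torch.randn(2, 64, 32, 32)).shape)  # torch.Size([2, 64, 32, 32])
```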
In practice, C_out often differs from C_in: C_out > C_in appears when a layer has more feature maps than its previous layer, while C_out < C_in is often used in the upsampling layers of a segmentation model. Formally, we define the expansion case, C_out/C_in = r, and the reduction case, C_in/C_out = r, with r ≥ 2 being an integer. Our strategy is to expand or reduce the shape of the relationship matrix by matrix multiplication and Kronecker products with identity matrices and vectors of ones.

In the expansion case, we first construct a matrix Ũ ∈ R^{C_in × C_in} as described in Section II-B. Recall that U(i, j) = 1 if the jth input channel is involved in the computation of the ith output channel; otherwise, U(i, j) = 0. Hence, we duplicate each row of Ũ r times and stack the copies to get U:

U = (I \otimes \mathbf{1}_r)\, \tilde{U}    (9)

The reduction case proceeds analogously, and when C_in or C_out is not a power of 2, a larger relationship matrix is constructed, of which we use only the first C_out rows and C_in columns in our computation, ignoring the remaining entries.
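A toy numerical check of the expansion rule (9), with assumed sizes C_in = 4 and r = 2, makes the row duplication explicit:

```python
# A sketch of (9): rows of U~ are replicated r times via (I (x) 1_r) U~, so
# each new output channel inherits the grouping of its source row.
import torch

C_in, r = 4, 2                                             # C_out = 8 (toy sizes)
U_tilde = torch.kron(torch.eye(2), torch.ones(2, 2))       # 4x4 two-group matrix
expander = torch.kron(torch.eye(C_in), torch.ones(r, 1))   # (I (x) 1_r), shape (8, 4)
U = expander @ U_tilde                                     # shape (C_out, C_in)
print(U.shape)                                             # torch.Size([8, 4])
```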
C. Regularization in SepDGConv

In SepDGConv, U is responsible for the regularization effect. The learned U is expected to divide the network into multiple groups and, thus, to be sparse. Recall that U is multiplied with the network parameters ω in (3); hence, a sparse U essentially wipes out a certain number of parameters, which regularizes the model by controlling model complexity. GConv can regularize the model in the same way.

Consider again the example in Fig. 2. For the GConv illustrated in Fig. 2(c), U has half of its entries equal to zero; compared with the regular convolution in Fig. 2(a), this wipes out half of the layer's parameters. The U's learned in SepDGConv tend to be even more sparse (see the experimental results in Section IV-B), which, therefore, reduces the model complexity and usually regularizes the model more than GConv.
Fig. 4. Neural network architectures using SepDGConv. (a) SepG-ResNet18. (b) SepG-ResNet50. (c) SepG-UNet. In (b), ×x means that x consecutive blocks are drawn as one in the illustration. In (c), gray lines represent the skip connections used in UNet, and the joining nodes denote concatenation along the channel axis.
D. SepG-Nets

Using the SepDGConv DoubleConv, we can build SepG-ResNet18, as shown in Fig. 4(a), and the separable group UNet (SepG-UNet), as shown in Fig. 4(c). In SepG-ResNet18, we follow convention and build the network with four layers, each having two DoubleConv blocks; the number of output channels for each layer is [64, 128, 256, 512], respectively. In SepG-UNet, we have eight DoubleConv blocks, with output channel numbers [128, 256, 512, 1024, 512, 256, 128, 64].

SepG-ResNet50 consists of four layers, composed in order of [2, 3, 5, 2] SepDGConv Bottleneck blocks. The number of output channels for each block is the same as in G-ResNet50, being [256, 512, 1024, 2048], respectively. The architecture of SepG-ResNet50 is illustrated in Fig. 4(b).
IV. EXPERIMENTS AND DISCUSSION

We experiment with SepG-ResNet18, SepG-ResNet50, and SepG-UNet on three diverse datasets, i.e., the Houston2018 dataset, the Berlin dataset, and the MUUFL Gulfport Hyperspectral and LiDAR Airborne Data Set (MUUFL) dataset, and compare with baseline models as well as SOTA models on these datasets. In this section, first, we describe the datasets. Second, we present our experimental results. Third, we analyze the role of SepDGConv in the entire model by an ablation analysis and report the changes in classification performance. Fourth, we compare different convolution strategies, in particular GConv and SepDGConv, to isolate the effect of sensor specificity. Finally, we discuss the results and findings of our experiments.

A. Datasets

1) Houston2018 Dataset: Houston2018 is an HS-LiDAR-RGB dataset. Acquired by the National Center for Airborne Laser Mapping at the University of Houston, Houston, TX, USA, it covers the University of Houston campus and its surrounding urban areas. The dataset consists of MS-LiDAR, HS, and MS optical RS data, containing 7, 48, and 3 channels, respectively. The HS data cover a 380–1050-nm spectral range, while the laser wavelengths of the three LiDAR sensors are 1550, 1064, and 532 nm. The MS-LiDAR data also contain a digital elevation model (DEM) and a DSM derived from the point clouds. This dataset was originally provided in the 2018 GRSS Data Fusion Contest; the paper [3] reports the outcome of the Contest and contains a more detailed description of the Houston2018 dataset. We resample the imagery at a 0.5-m GSD, so that the size of each image channel is 2404×8344 pixels. The ground truth contains 20 classes; the number of samples in each class is shown in Table I.
Fig. 5. Results on the Houston2018 dataset. (a) MS optical image. (b) Test ground truth labels. (c) Prediction map by UNet. (d) Prediction map by SepG-UNet. Zoomed-in views of (a), (c), and (d) are shown in the bottom row.
#total_samples represents the total number of samples in the training set.

Metrics: To evaluate the classification results, for each classifier, we report the F1-score (F1) for each class and three criteria widely used in the literature to evaluate overall performance: the average accuracy (AA), the overall accuracy (OA), and the Kappa coefficient (κ).
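For reference, the three overall criteria can be computed from a confusion matrix as sketched below; the row/column convention (rows as reference labels, columns as predictions) is our assumption.

```python
# A sketch of OA, AA, and Cohen's kappa from a class confusion matrix.
import numpy as np

def oa_aa_kappa(conf):                 # conf: (n_classes, n_classes) counts
    n = conf.sum()
    oa = np.trace(conf) / n                                  # overall accuracy
    aa = np.mean(np.diag(conf) / conf.sum(axis=1))           # mean per-class accuracy
    pe = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / n**2  # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa

print(oa_aa_kappa(np.array([[50, 10], [5, 35]])))  # (0.85, 0.854..., 0.693...)
```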
Baseline Models: All baseline models discussed in the following, except DCNN [44] on the Houston2018 dataset, are reproduced and reevaluated under our experimental environment. We are not able to reproduce DCNN, because the paper [44] does not provide enough details of the model, so we directly cite the results reported in [44].

1) Experiments on Houston2018 Dataset: We follow [24] and treat LULC classification on the Houston2018 dataset as a semantic segmentation problem, experimenting with UNet, a very commonly studied semantic segmentation model. We run basic UNet on the dataset as a baseline and examine the performance of SepG-UNet, in which SepDGConv layers replace the regular convolution layers of the baseline model. We also compare with the first- and second-place methods of the 2018 Data Fusion Contest, both of which are multistream models: the fusion fully convolutional network (Fusion-FCN) [24] and DCNN [44].

a) Implementation details: We use image tiles of shape 58×128×128. In the training phase, we use a spatial stride of 64×64 pixels to extract training samples from the 58×1202×4768 data, while in the test phase, the stride is 128×128. We use the Adam optimizer [45] to train both UNet and SepG-UNet, with the optimizer hyperparameters β1 and β2 set to their default values. We set the initial learning rate to 0.001 and train the networks for 300 epochs. We use three GPUs in parallel to train the networks, with the batch size set to 12. To ensure reproducibility, we use transpose convolution in the upsampling modules of UNet and SepG-UNet. We add a mask to the loss function so that pixels with the label class "undefined" are not counted in the training loss.
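With cross-entropy, such a mask is equivalent to the standard ignore-index mechanism, as the following sketch shows (the sentinel value and shapes are assumptions, not taken from the paper):

```python
# A sketch of masking "undefined" pixels out of the segmentation loss.
import torch
import torch.nn.functional as F

UNDEFINED = 255                        # assumed sentinel label for unannotated pixels
logits = torch.randn(12, 20, 128, 128)            # 12 tiles, 20 LULC classes
labels = torch.randint(0, 20, (12, 128, 128))
labels[:, :16, :] = UNDEFINED                     # pretend a strip is unlabeled

# Pixels labeled UNDEFINED contribute nothing to the loss or its gradient.
loss = F.cross_entropy(logits, labels, ignore_index=UNDEFINED)
```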
b) Results: Quantitative classification results of SepG-UNet, UNet, Fusion-FCN, and DCNN are shown in Table IV, while Fig. 5(c) and (d) shows the classification maps of UNet and SepG-UNet alongside the ground truth labels. As shown in Table IV, basic UNet, which does not have a multistream architecture, already largely outperforms the 51.52% OA obtained by Fusion-FCN and the 51.2% obtained by DCNN, and with the proposed SepDGConv, SepG-UNet further improves the OA to 63.66%. From Fig. 5, it can be seen that both UNet and SepG-UNet output meaningful prediction maps. According to Table IV, using SepDGConv reduces the test OA variance, which is reflected in Fig. 5(c) and (d): SepG-UNet gives a generally less noisy classification map than basic UNet.

In Fig. 8(a), we plot the learned number of groups and the sparsity of the relationship matrix U for each SepDGConv layer in SepG-UNet. Here, the sparsity of U is defined as the ratio between the number of 0's and the total number of entries in U. The sparsity plot shows that SepDGConv generates an architecture in which dense and sparse connections appear alternately. While, in the InConv block, the learned U's are mostly sparse, the sparsities quickly drop to 0 in the Down1 block in all five replicas, which suggests that the sensor-specific branches learned in SepG-UNet are very shallow and that feature fusion probably begins in a very early stage.
TABLE IV
MODEL PERFORMANCE ON HOUSTON2018 DATASET
Fig. 6. Results on Berlin dataset. (a) HS false color image. (b) Test ground truth labels. Prediction maps by (c) ResNet18, (d) SepG-ResNet18, (e) ResNet50,
and (f) SepG-ResNet50.
In the #Groups plot, we can see that SepDGConv learns more groups in the middle layers, where there are more feature maps and more parameters. This is consistent with the behavior of DGConv reported in [35].
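Both quantities plotted in Fig. 8 can be read off a relationship matrix directly; a minimal sketch with an assumed block-diagonal, equal-size-group U:

```python
# A sketch of the two Fig. 8 quantities: sparsity of U and number of groups.
import torch

U = torch.kron(torch.eye(4), torch.ones(2, 2))   # 8x8, four groups of two channels

sparsity = (U == 0).float().mean().item()        # zeros / total entries
n_groups = int(U.shape[0] / U[0].sum().item())   # equal-size groups assumed
print(sparsity, n_groups)                        # 0.75 4
```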
2) Experiments on Berlin Dataset: In the Berlin dataset, the ground truth labels are sparse, so we extract small image patches as training and testing samples, with the center of each image patch aligned with one labeled pixel.
TABLE V
MODEL PERFORMANCE ON BERLIN DATASET
Hence, LULC classification on the Berlin dataset becomes an image classification task. We select ResNet18 and ResNet50 as baselines and experiment with SepG-ResNet18 and SepG-ResNet50, where the regular convolution layers in the baseline models are replaced with SepDGConv layers. In addition, we compare with the shared and specific feature learning model (S2FL) [42], which achieves SOTA performance on this dataset.

a) Implementation details: For SepG-ResNet18, ResNet18, SepG-ResNet50, and ResNet50, we use image patches of 17 × 17 as training and test samples. All models are trained on a single GPU using a batch size of 64. To train SepG-ResNet18 and ResNet18, we use stochastic gradient descent (SGD) with momentum as the optimizer, with the momentum parameter set to 0.9. We train the networks for 300 epochs, with the initial learning rate set to 0.001. For SepG-ResNet50 and ResNet50, we use Adam as the optimizer, with default algorithm parameters. We train both networks for 400 epochs; the initial learning rate is set to 0.001 and decayed to 0.0001 at the 300th epoch.

b) Results: The quantitative classification results of SepG-ResNet18, ResNet18, SepG-ResNet50, ResNet50, and S2FL are shown in Table V, while Fig. 6 shows the classification maps of our models alongside the ground truth labels. While S2FL is not DL-based, it is essentially a multistream model. Our experimental results show that the ResNet-based methods generally outperform S2FL. SepG-ResNet18 surpasses its baseline model and improves the SOTA OA on the Berlin dataset to 68.21%, while SepG-ResNet50 obtains a marginally lower classification accuracy than basic ResNet50. We will see in Section IV-C that the best performance with ResNet50 is achieved when some but not all convolution layers are replaced with SepDGConv. Test variance reduction is also observed for the SepDGConv models. Fig. 6 agrees visually with the quantitative results.

The SepDGConv group structure plots for SepG-ResNet18 and SepG-ResNet50 are shown in Fig. 8(b) and (c). The sparsity plot shows that both learned architectures are generally sparse; however, the sparsity drop in shallow layers, which suggests early fusion, is also present here. The #Groups plot shows that InConv learns ∼10 groups in SepG-ResNet18 but only two to four groups in SepG-ResNet50, contrary to the empirical principle that the number of groups should equal the number of sensors. Besides, for both models, there are generally more groups in the last two blocks than in the previous blocks. As, in ResNets, there are more feature maps and, thus, more parameters in Layer3 and Layer4, this result is also consistent with SepG-UNet.

3) Experiments on MUUFL Dataset: For the MUUFL dataset, we follow most studies on it and use classification models. We experiment with ResNet18, ResNet50, and their SepDGConv derivatives. We also compare our results with those of the following multistream methods: OTVCA [6] and the two-branch CNN (TB-CNN) [23].

a) Implementation details: For SepG-ResNet18 and ResNet18, we use image patches of 11 × 11 as training and test samples, while for SepG-ResNet50 and ResNet50, we use image patches of size 17 × 17. As previous studies do not use a fixed training set, we make a random train-test split in each replica under a different random seed. We follow [28] and fix the training set size to 100, with the rest used as the test set. All models are trained on a single GPU.

For SepG-ResNet18 and ResNet18, we use He et al.'s initialization [46] for the convolution filters and zero initialization for the last batch normalization layer in each residual branch [47]. We use SGD as the optimizer, with the momentum parameter set to 0.9, and train the networks for 300 epochs. The initial learning rate is set to 0.02, with a schedule that decreases it to 0.002 at the 200th epoch and further to 0.0002 at the 240th epoch. For both models, the batch size is set to 48. For SepG-ResNet50 and ResNet50, He's initialization and zero batch norm initialization are also used. We use Adam as the optimizer, with default algorithm parameters, and train both networks for 400 epochs. The initial learning rate is set to 0.01, with a schedule that decreases it to 0.001 and 0.0001 at epochs 300 and 350, respectively. Both models are trained using a batch size of 64.
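The stated ResNet18 schedule maps directly onto standard PyTorch components, as sketched below; the class count is an assumption for illustration.

```python
# A sketch of the MUUFL ResNet18 training schedule: SGD with momentum 0.9,
# lr 0.02 decayed tenfold at epochs 200 and 240, 300 epochs in total.
import torch
from torchvision.models import resnet18

model = resnet18(num_classes=11)        # class count assumed for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.02, momentum=0.9)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[200, 240], gamma=0.1)

for epoch in range(300):
    # ... one training epoch over 11x11 patches goes here ...
    scheduler.step()                    # 0.02 -> 0.002 @200 -> 0.0002 @240
```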
b) Results: The quantitative classification results of SepG-ResNet18, SepG-ResNet50, ResNet18, ResNet50, and the compared methods are shown in Table VI. Fig. 7 shows the classification maps of these models alongside the ground truth labels.
Fig. 7. Results on MUUFL dataset. (a) HS false color image. (b) Test ground truth labels. Prediction maps by (c) ResNet18, (d) SepG-ResNet18,
(e) ResNet50, and (f) SepG-ResNet50.
TABLE VI
MODEL PERFORMANCE ON MUUFL DATASET
According to Table VI, the ResNets generally achieve better performance than OTVCA and TB-CNN. On the MUUFL dataset, however, neither SepG-ResNet18 nor SepG-ResNet50 surpasses its corresponding baseline model. Again, we will see in Section IV-C that, by using SepDGConv in some but not all convolution layers, both models can obtain better performance and outperform the baseline models. In Fig. 7, the output classification maps of the ResNet50s are generally noisier than those of the ResNet18s, which suggests that the ResNet50s probably overfit, because the dataset is relatively small.

The SepDGConv group structures learned on the MUUFL dataset are shown in Fig. 8(d) and (e). Both early fusion and more groups in deeper layers are consistently observed.

C. Ablation Analysis

To investigate the performance improvement of SepDGConv, we remove SepDGConv block-by-block from the abovementioned models and analyze the influence of SepDGConv's usage on the overall performance. In particular, as the separate convolution groups learned in a SepG-Net represent sensor-specific branches, we hope to shed light on whether such sensor specificity is important for multisource RS data fusion.

We design the ablation experiments based on the optimal brain damage (OBD) theory in DL [48], according to which, if an important module in a deep neural network is removed, a significant performance drop should be observed. Concretely, for each SepDGConv model, we run one forward pass and one backward pass. In the forward pass, we change SepDGConv back to regular convolution in the following blocks, in order: [InConv, Layer1, Layer2, Layer3, Layer4] for SepG-ResNet, and [InConv, Down1, Down2, Down3, Down4, Up1, Up2, Up3, Up4] for SepG-UNet. In the backward pass, the order is reversed. The obtained models are retrained. For each model, the baseline, the SepDGConv derivative, the forward pass, and the backward pass are trained under one same configuration of hyperparameters.
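The forward and backward passes thus enumerate a simple family of configurations; the sketch below spells out that enumeration (build_sepg_resnet is a hypothetical constructor, not a function from the released code).

```python
# A sketch of the ablation order: configuration i keeps regular convolution
# in the first i blocks (forward pass) or the last i blocks (backward pass).
BLOCKS = ["InConv", "Layer1", "Layer2", "Layer3", "Layer4"]

def ablation_configs(reverse=False):
    order = BLOCKS[::-1] if reverse else BLOCKS
    regular = set()
    for block in order:
        regular.add(block)
        yield {b: ("regular" if b in regular else "sepdgconv") for b in BLOCKS}

for cfg in ablation_configs():          # each configuration is retrained
    print(cfg)  # e.g. model = build_sepg_resnet(cfg)  (hypothetical helper)
```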
The results of the ablation experiments are shown in Fig. 9. We run each model five times with random seeds 42–46, the same as in the main experiments mentioned earlier. The value of each data point in Fig. 9 is the average OA of the five replicas, while each vertical bar represents the standard deviation of the five OA values. The red line represents OA changes in a forward pass, while the blue line represents OA changes in a backward pass.
Fig. 8. Number of groups and sparsity of U in each layer of the following SepDGConv-based models on different datasets: (a) SepG-UNet trained on
Houston2018 dataset. (b) SepG-ResNet18 trained on Berlin dataset. (c) SepG-ResNet50 trained on Berlin dataset. (d) SepG-ResNet18 trained on MUUFL
dataset. (e) SepG-ResNet50 trained on MUUFL dataset.
A data point in a forward pass means that all SepDGConvs in blocks up to and including this point are replaced with regular convolution. For example, in Fig. 9(b), a point at Layer1 represents results from ResNet18s that have regular convolution in InConv and Layer1 and SepDGConv in Layer2–4. Similarly, a data point in a backward pass means that all SepDGConvs in blocks behind and including this point are replaced with regular convolution, while the blocks previous to this point keep using SepDGConv.

In the forward passes, a performance gain is consistently observed in the early stage of each pass. For example, in Fig. 9(b), on the Berlin dataset, SepG-ResNet18's performance improves as SepDGConv is removed from InConv, and an OA of 69.76% is obtained, surpassing both the original SepG-Net and the baseline.
Fig. 9. Classification performance of various models obtained via ablation analysis. The arrows indicate the order of removal of SepDGConv modules,
while blue bars represent the number of parameters in the corresponding block. (a) Performance of SepG-UNets on Houston2018 dataset. (b) Performance
of SepG-ResNet18s on Berlin dataset. (c) Performance of SepG-ResNet50s on Berlin dataset. (d) Performance of SepG-ResNet18s on MUUFL dataset.
(e) Performance of SepG-ResNet50s on MUUFL dataset.
The same performance gain can be observed in Fig. 9(d) and (e) when SepDGConv in InConv is replaced by regular convolution. According to the OBD theory, this phenomenon implies that, rather than being beneficial, imposing a multistream architecture on shallow layers actually harms model performance. For SepG-ResNet50 on the Berlin dataset, as shown in Fig. 9(c), while there is an initial performance loss, the OA goes up to 68.58% as soon as the SepDGConv layers in both InConv and Layer1 are removed, which means that, at this point, the shallow layers of the model are densely connected rather than forming a multistream, sensor-specific architecture. A very similar loss-gain curve is observed for SepG-UNet, as shown in Fig. 9(a). Hence, the experimental results of SepG-ResNet50 on the Berlin dataset and SepG-UNet on the Houston2018 dataset both support our finding that, for the
first few blocks, dense convolution is better than multistream convolution.

In the backward passes, a performance loss is observed as SepDGConv in the middle and deep layers is removed. For UNet, the performance loss occurs as the backward pass goes through the middle blocks, from Up1 to Down4, as shown in Fig. 9(a). For the ResNets, we observe an OA drop when SepDGConv in the last two blocks, Layer4 and Layer3, is replaced, as shown in Fig. 9(b)–(d).

Furthermore, based on the distribution of the number of parameters in the studied models, shown as the blue histograms in Fig. 9, we summarize the performance loss in the middle and last layers into a more general phenomenon, i.e., performance loss in wide layers. Wide layers refer to layers with more feature map channels and, thus, more parameters. It can be seen from Fig. 9 that, for UNet, the wide layers are Down3–Up2, exactly where we observe the performance loss, and, for the ResNets, the wide layers are Layer3–Layer4, where the performance loss in the backward pass is also observed. Such performance loss implies that SepDGConv is beneficial to model performance if applied to wide layers.

Finally, we revisit the performance gain observed in the early stage of the forward passes. This phenomenon occurs in the first few layers of the studied models, where there are fewer parameters, as shown in Fig. 9. These layers are also called narrow layers, in accordance with "wide layers." Thus, we can say that there is a performance gain if we replace SepDGConv with regular convolution in narrow layers. Such performance gain implies that SepDGConv is harmful to model performance if applied to narrow, usually shallow, layers. Besides, in Fig. 9(b)–(e), a performance gain is observed in the backward passes going from Layer2 to InConv. This is another clue that dense convolution in narrow layers is more favorable to model performance.

To summarize, in the ablation analysis, we observe a model performance gain if we replace SepDGConv with regular convolution in narrow layers, usually the first few layers of a model, and a model performance loss if we replace SepDGConv with regular convolution in wide layers, usually the last few layers of a model. These findings imply that the multistream architecture is harmful to model performance if used in narrow layers, but becomes beneficial if applied to wide layers.

D. Comparing Different Convolution Strategies

The results of our ablation analysis suggest that it is better to use regular convolution than SepDGConv in the first few layers of a CNN. This indicates that sensor specificity may not play an important role in data fusion models, because, with densely connected convolution, the model no longer has sensor-specific features. However, SepDGConv also regularizes the model, and the effect of sensor specificity is not yet isolated from regularization. In this section, we further compare models using SepDGConv with another two convolution strategies that impose sensor-specific multibranch architectures: GConv and fixed groupable convolution (FGConv).

As mentioned in Section II-A, by using GConv for the first l layers and setting the number of groups to a fixed number G_0, a CNN with G_0 sensor-specific branches, each of depth l, can be built. In the following experiments, we use this strategy to construct models that have strictly sensor-specific branches, so that we can better observe the effect of sensor specificity. In particular, for the ResNets, we use GConv for [InConv, Layer1–4], and, for the UNets, we use GConv for [InConv, Down1–4, Up1–4].

The role of FGConv is to further isolate sensor specificity from regularization. Using SepDGConv does not guarantee sensor specificity in deep layers, because the number of groups it learns varies from layer to layer; as for regularization, SepDGConv usually has a larger regularization effect than GConv, because SepDGConv reduces more model parameters to 0. The idea of FGConv is to combine SepDGConv and GConv so that any layer using FGConv has at least G_0 groups, while maintaining the strong regularization effect of SepDGConv. In particular, GConv can be equivalently expressed by a relationship matrix U_0 [Fig. 2(c)], and, in FGConv, we construct a new relationship matrix U_F for a layer, using that layer's learned SepDGConv relationship matrix U and a specified GConv matrix U_0:

U_F = U_0 \odot U    (16)

where ⊙ denotes the element-wise product. U_0 is constructed using the prespecified group number G_0 and is fixed, while U is still learned from the data. In the experiments, FGConv is used in [InConv, Layer1–4] for the ResNets and in [InConv, Down1–4, Up1–4] for the UNets.

For GConv, we make sure that the group division at the input layer leads to sensor specificity by manually designing the relationship matrix U_0. For the Houston2018 dataset, U_0 divides the 58-channel input data into four groups, mapping [3 MS, 48 HS, 3 LiDAR, 4 DEM/DSM] to [16, 16, 16, 16] feature maps. For the Berlin dataset, U_0 maps [244 HS, 4 SAR] to [32, 32] feature maps. For the MUUFL dataset, U_0 maps [64 HS, 2 LiDAR] to [32, 32] feature maps. The same U_0 is applied to FGConv.
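A sketch of how such a U_0 and the FGConv mask (16) can be assembled, using the Berlin input layer as the example; the helper name is ours, and the learned U is replaced by a placeholder.

```python
# A sketch of the FGConv mask (16): the learned U can only refine, never
# merge, the pre-specified sensor grouping encoded in U0.
import torch

def gconv_matrix(group_sizes_in, group_sizes_out):
    """Block-diagonal U0 mapping sensor-aligned input groups to output groups."""
    blocks = [torch.ones(o, i) for i, o in zip(group_sizes_in, group_sizes_out)]
    return torch.block_diag(*blocks)

# Berlin dataset input layer: [244 HS, 4 SAR] -> [32, 32] feature maps.
U0 = gconv_matrix([244, 4], [32, 32])        # shape (64, 248)
U = torch.ones_like(U0)                      # placeholder for a learned SepDGConv U
U_F = U0 * U                                 # element-wise product, as in (16)
print(U_F.shape)                             # torch.Size([64, 248])
```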
The experimental results are shown in Tables VII and VIII. The performance of the baseline models and the SepG models is cited from above, while the "Ablation" column refers to the model's best performance obtained in the ablation study. We report the average and standard deviation of the OA over five runs, using random seeds 42–46, the same as mentioned earlier.

Consider first the ResNets; see Table VII. First, it is consistently observed that the baseline models marginally outperform the corresponding GConv models. This indicates that sensor-specific multibranching does not necessarily help improve model performance. Second, in three out of four experiments, i.e., except ResNet50 on the Berlin dataset, FGConv outperforms GConv. This supports our basic assumption that automatically learning the GConv hyperparameters benefits model performance. Third, in all four experiments, neither GConv nor FGConv outperforms the best architecture previously found in the ablation study, where the model's first few SepDGConv layers are replaced with regular convolution. The models found in the ablation study do not have sensor-specific branches, so this result is consistent with the first observation.
TABLE VII
DIFFERENT CONVOLUTION STRATEGIES ON ResNets
TABLE VIII
DIFFERENT CONVOLUTION STRATEGIES ON UNet, HOUSTON2018 DATASET
Then, consider the UNets on the Houston2018 dataset; see Table VIII. While, consistent with the second observation above, FGConv outperforms GConv, for the UNets, we observe GConv outperforming the baseline, SepDGConv, and Ablation. To find an explanation for this, we add a False-GConv experiment, where, at the input layer, we use a group division of [14, 16, 14, 14] instead of [3, 48, 3, 4], so that the four branches are no longer sensor-specific. False-GConv outperforms GConv while achieving slightly lower accuracy than FGConv. This implies that GConv and FGConv probably gain their performance from the setting G = 4; again, this experiment supports our earlier finding that sensor specificity is not necessarily helpful.

E. Discussion

1) Multistream as Regularization: Theoretically, both SepDGConv and human-designed multistream deep neural networks can be regarded as regularization techniques, because they impose certain constraints on the network architecture to reduce overfitting and to improve model performance [34], [49]. Based on our experiments and the ablation analysis mentioned earlier, we attribute the model performance of SepDGConv in multisource RS data fusion to regularization, for we have observed the following two very important signatures of regularization.

1) Variance Reduction: It is known that regularized models can generalize better, which means that they should have lower test variance. In our experiments, as discussed in Section IV-B, except for ResNet18 on the MUUFL dataset, all SepDGConv models have less test OA variance than their corresponding baseline models.

2) Over-Regularization: If a simple model is regularized too much, the model capacity can be reduced too much to fit the data, and, as a result, the overall model performance is harmed. In our ablation analysis, as shown in Fig. 9, we find that imposing multistream SepDGConv on shallow layers leads to model performance loss. The most probable reason is that these shallow, narrow layers themselves do not have many parameters, and, using SepDGConv, they are over-regularized, leading to underfitting. On the other hand, the middle layers of UNet and the last few layers of the ResNets are much wider and have many more parameters than the first few layers; hence, using SepDGConv on these wide layers very likely just achieves the expected regularization effect, improving the models' performance.

As we experiment with three different models on three very diverse datasets, our two findings mentioned earlier are highly generalizable, which provides strong clues that multistream architectures actually play the role of a model regularizer. Yet, regularization itself is a complex technique, and its effect is always coupled with various aspects of model optimization and generalization; hence, there are probably many other factors to explore that contribute to the phenomena we have observed. For example, a very recent paper [50] finds that one same model trained separately on different sources of data acquired over the same area (RGB and SAR in their case) can end up with very similar model parameter distributions. This finding suggests that there could be feature redundancy in the shallow layers of multistream architectures. Nevertheless, we hope our work sheds some light on the mechanisms behind neural network architecture design for multisource RS data and inspires novel research.

2) Possible Improvements on GConv: Our results suggest that models with regular convolution, such as ResNet18, can obtain classification results at least comparable with SOTA methods, and that, in shallow layers, dense regular convolution should be used; together, these advocate single-stream deep CNN models for the joint classification of multisource RS data. To automatically learn grouped convolution in wide layers and utilize the regularization effect, it would be desirable for SepDGConv to also learn dense convolution for narrow layers; thus, there is still room for improvement in SepDGConv. Besides, the restrictions SepDGConv puts on the relationship matrix U are strong, and, in practice, we may need to construct U's with more flexible structure. We hope novel research in constructing and learning the relationship matrix U can lead to better single-stream CNN architectures for multisource RS data.

3) Toward Better Performance: Our work makes it possible to build deep, single-stream networks for multisource RS data. On the one hand, modern techniques that boost model
3) Toward Better Performance: Our work makes it possible to build deep, single-stream networks for multisource RS data. On the one hand, modern techniques that boost model performance are more easily applied in a unified network. On the other hand, designing sufficiently large models is beneficial to, and probably necessary for, solving large-scale, real-world RS problems.
In this article, we experiment and compare only with basic models, leaving unexplored techniques such as ensembling and postprocessing, as well as more complex modules such as attention mechanisms, that could further improve performance. For example, [28] combines Octave convolution and fractional Gabor convolution in a network that achieves the state-of-the-art OA of 89.90% on the MUUFL dataset. We believe that such advanced modules, and others to come, can be implemented more easily in a single-branch network.

V. CONCLUSION

In this article, we have investigated the potential of single-stream models for the joint classification of multisource RS data. To enable a multistream network structure to be learned automatically within a single-stream architecture, we propose the SepDGConv module based on the GConv and DGConv techniques. With reference to modern deep CNN architectures, we then propose several DL models with SepDGConv: SepG-ResNet18, SepG-ResNet50, and SepG-UNet. The proposed models are verified on three benchmark datasets with diverse data modalities, yielding promising classification results, which indicates the effectiveness and generalizability of the proposed single-stream networks for multisource RS data joint classification. Furthermore, we analyze the usage of SepDGConv in different parts of the models and find that: 1) using SepDGConv generally reduces model variance; 2) using SepDGConv in narrow layers, usually the first few layers, harms model performance; and 3) using SepDGConv in wide layers, usually the last few layers, improves model performance. These findings imply that the sensor-specific multistream architecture essentially plays the role of a model regularizer and is not strictly necessary for multisource RS data fusion. We hope our work can inspire novel flexible and generalizable models for multisource RS data analysis.
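In line with findings 2) and 3), such modules can be applied selectively. The sketch below is a hedged illustration rather than the exact SepG-ResNet18 recipe: it uses plain GConv as a stand-in for SepDGConv and replaces only the 3 × 3 convolutions in the wide, late stages of a torchvision-style ResNet-18 [39], leaving the narrow early layers dense. The helper name replace_with_groups and the group count are our illustrative choices.

import torch.nn as nn
from torchvision.models import resnet18

def replace_with_groups(module: nn.Module, groups: int) -> None:
    """Recursively swap dense 3x3 convolutions for grouped ones."""
    for name, child in module.named_children():
        if (isinstance(child, nn.Conv2d) and child.kernel_size == (3, 3)
                and child.in_channels % groups == 0
                and child.out_channels % groups == 0):
            setattr(module, name, nn.Conv2d(
                child.in_channels, child.out_channels, kernel_size=3,
                stride=child.stride, padding=child.padding,
                groups=groups, bias=child.bias is not None))
        else:
            replace_with_groups(child, groups)

model = resnet18(num_classes=20)             # e.g., 20 LULC classes
replace_with_groups(model.layer3, groups=4)  # regularize deep, wide stages only;
replace_with_groups(model.layer4, groups=4)  # layer1/layer2 stay dense

Restricting the grouped (multistream-like) structure to the deep stages follows the ablation result that over-regularizing the narrow first layers causes underfitting, while regularizing the wide last layers improves performance.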
REFERENCES
[1] J. Li, Y. Pei, S. Zhao, R. Xiao, X. Sang, and C. Zhang, "A review of remote sensing for environmental monitoring in China," Remote Sens., vol. 12, no. 7, p. 1130, Apr. 2020.
[2] R. P. Sishodia, R. L. Ray, and S. K. Singh, "Applications of remote sensing in precision agriculture: A review," Remote Sens., vol. 12, no. 19, p. 3136, Sep. 2020.
[3] Y. Xu et al., "Advanced multi-sensor optical remote sensing for urban land use and land cover classification: Outcome of the 2018 IEEE GRSS data fusion contest," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 12, no. 6, pp. 1709–1724, Jun. 2019.
[4] M. Pedergnana, P. R. Marpu, M. D. Mura, J. A. Benediktsson, and L. Bruzzone, "Classification of remote sensing optical and LiDAR data using extended attribute profiles," IEEE J. Sel. Topics Signal Process., vol. 6, no. 7, pp. 856–865, Nov. 2012.
[5] M. Khodadadzadeh, J. Li, S. Prasad, and A. Plaza, "Fusion of hyperspectral and LiDAR remote sensing data using multiple feature learning," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 8, no. 6, pp. 2971–2983, Jun. 2015.
[6] B. Rasti, P. Ghamisi, and R. Gloaguen, "Hyperspectral and LiDAR fusion using extinction profiles and total variation component analysis," IEEE Trans. Geosci. Remote Sens., vol. 55, no. 7, pp. 3997–4007, Jul. 2017.
[7] Y. Gu, Q. Wang, X. Jia, and J. A. Benediktsson, "A novel MKL model of integrating LiDAR data and MSI for urban area classification," IEEE Trans. Geosci. Remote Sens., vol. 53, no. 10, pp. 5312–5326, Oct. 2015.
[8] J. Hu, D. Hong, and X. X. Zhu, "MIMA: MAPPER-induced manifold alignment for semi-supervised fusion of optical image and polarimetric SAR data," IEEE Trans. Geosci. Remote Sens., vol. 57, no. 11, pp. 9025–9040, Nov. 2019.
[9] G. Singh et al., "Topological methods for the analysis of high dimensional data sets and 3D object recognition," PBG Eurographics, vol. 2, pp. 1–10, Sep. 2007.
[10] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, pp. 436–444, May 2015.
[11] Y. LeCun, K. Kavukcuoglu, and C. Farabet, "Convolutional networks and applications in vision," in Proc. IEEE Int. Symp. Circuits Syst., May 2010, pp. 253–256.
[12] D. Marmanis, M. Datcu, T. Esch, and U. Stilla, "Deep learning earth observation classification using ImageNet pretrained networks," IEEE Geosci. Remote Sens. Lett., vol. 13, no. 1, pp. 105–109, Jan. 2016.
[13] E. Maggiori, Y. Tarabalka, G. Charpiat, and P. Alliez, "Convolutional neural networks for large-scale remote-sensing image classification," IEEE Trans. Geosci. Remote Sens., vol. 55, no. 2, pp. 645–657, Feb. 2017.
[14] X. Yuan, J. Shi, and L. Gu, "A review of deep learning methods for semantic segmentation of remote sensing imagery," Expert Syst. Appl., vol. 169, May 2021, Art. no. 114417.
[15] W. Hu, Y. Huang, L. Wei, F. Zhang, and H. Li, "Deep convolutional neural networks for hyperspectral image classification," J. Sensors, vol. 2015, pp. 1–12, Jan. 2015.
[16] W. Li, G. Wu, F. Zhang, and Q. Du, "Hyperspectral image classification using deep pixel-pair features," IEEE Trans. Geosci. Remote Sens., vol. 55, no. 2, pp. 844–853, Feb. 2017.
[17] H. Lee and H. Kwon, "Going deeper with contextual CNN for hyperspectral image classification," IEEE Trans. Image Process., vol. 26, no. 10, pp. 4843–4855, Oct. 2017.
[18] X. He, A. Wang, P. Ghamisi, G. Li, and Y. Chen, "LiDAR data classification using spatial transformation and CNN," IEEE Geosci. Remote Sens. Lett., vol. 16, no. 1, pp. 125–129, Jan. 2018.
[19] S. Pan et al., "Land-cover classification of multispectral LiDAR data using CNN with optimized hyper-parameters," ISPRS J. Photogramm. Remote Sens., vol. 166, pp. 241–254, Aug. 2020.
[20] J. Zhao, W. Guo, S. Cui, Z. Zhang, and W. Yu, "Convolutional neural network for SAR image classification at patch level," in Proc. IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), 2016, pp. 945–948.
[21] M. Ma, J. Chen, W. Liu, and W. Yang, "Ship classification and detection based on CNN using GF-3 SAR images," Remote Sens., vol. 10, no. 12, p. 2043, Dec. 2018.
[22] Y. Chen, C. Li, P. Ghamisi, X. Jia, and Y. Gu, "Deep fusion of remote sensing data for accurate classification," IEEE Geosci. Remote Sens. Lett., vol. 14, no. 8, pp. 1253–1257, Aug. 2017.
[23] X. Xu, W. Li, Q. Ran, Q. Du, L. Gao, and B. Zhang, "Multisource remote sensing data classification based on convolutional neural network," IEEE Trans. Geosci. Remote Sens., vol. 56, no. 2, pp. 937–949, Feb. 2018.
[24] Y. Xu, B. Du, and L. Zhang, "Multi-source remote sensing data classification via fully convolutional networks and post-classification processing," in Proc. IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), Jul. 2018, pp. 3852–3855.
[25] D. Hong et al., "More diverse means better: Multimodal deep learning meets remote-sensing imagery classification," IEEE Trans. Geosci. Remote Sens., vol. 59, no. 5, pp. 4340–4354, Apr. 2021.
[26] D. Hong, N. Yokoya, G.-S. Xia, J. Chanussot, and X. X. Zhu, "X-ModalNet: A semi-supervised deep cross-modal network for classification of remote sensing data," ISPRS J. Photogramm. Remote Sens., vol. 167, pp. 12–23, Sep. 2020.
[27] M. Zhang, W. Li, R. Tao, H. Li, and Q. Du, "Information fusion for classification of hyperspectral and LiDAR data using IP-CNN," IEEE Trans. Geosci. Remote Sens., vol. 60, pp. 1–12, 2022.
[28] X. Zhao, R. Tao, W. Li, W. Philips, and W. Liao, "Fractional Gabor convolutional network for multisource remote sensing data classification," IEEE Trans. Geosci. Remote Sens., vol. 60, pp. 1–18, 2022.
[29] R. Hang, Z. Li, P. Ghamisi, D. Hong, G. Xia, and Q. Liu, "Classification of hyperspectral and LiDAR data using coupled CNNs," IEEE Trans. Geosci. Remote Sens., vol. 58, no. 7, pp. 4939–4950, Jul. 2020.
[30] C. Hazirbas, L. Ma, C. Domokos, and D. Cremers, "FuseNet: Incorporating depth into semantic segmentation via fusion-based CNN architecture," in Proc. Asian Conf. Comput. Vis. Cham, Switzerland: Springer, 2016, pp. 213–228.
[31] N. Audebert, B. Le Saux, and S. Lefèvre, "Beyond RGB: Very high resolution urban remote sensing with multimodal deep networks," ISPRS J. Photogramm. Remote Sens., vol. 140, pp. 20–32, Jun. 2018.
[32] K. Chen et al., "Effective fusion of multi-modal data with group convolutions for semantic segmentation of aerial imagery," in Proc. IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), Jul. 2019, pp. 3911–3914.
[33] A. Khan, A. Sohail, U. Zahoora, and A. S. Qureshi, "A survey of the recent architectures of deep convolutional neural networks," Artif. Intell. Rev., vol. 53, no. 8, pp. 5455–5516, 2020, doi: 10.1007/s10462-020-09825-6.
[34] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA, USA: MIT Press, 2016.
[35] Z. Zhang et al., "Differentiable learning-to-group channels via groupable convolutional neural networks," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 3542–3551.
[36] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), Stateline, NV, USA, vol. 25, Dec. 2012, pp. 1097–1105.
[37] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, "Aggregated residual transformations for deep neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 1492–1500.
[38] P. Yin, J. Lyu, S. Zhang, S. Osher, Y. Qi, and J. Xin, "Understanding straight-through estimator in training activation quantized neural nets," in Proc. Int. Conf. Learn. Represent., 2019, pp. 1–30.
[39] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2016, pp. 770–778.
[40] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent. Cham, Switzerland: Springer, 2015, pp. 234–241.
[41] A. Okujeni, S. van der Linden, and P. Hostert, "Berlin-urban-gradient dataset 2009—An EnMAP preparatory flight campaign," GFZ (German Research Centre for Geosciences) Data Services, Potsdam, Germany, Tech. Rep., 2016, doi: 10.2312/enmap.2016.002.
[42] D. Hong, J. Hu, J. Yao, J. Chanussot, and X. X. Zhu, "Multimodal remote sensing benchmark datasets for land cover classification with a shared and specific feature learning model," ISPRS J. Photogramm. Remote Sens., vol. 178, pp. 68–80, Aug. 2021.
[43] X. Du and A. Zare, "Technical report: Scene label ground truth map for MUUFL Gulfport data set," Univ. Florida, Gainesville, FL, USA, Tech. Rep. 20170417, Apr. 2017. [Online]. Available: http://ufdc.ufl.edu/IR00009711/00001
[44] D. Cerra et al., "Combining deep and shallow neural networks with ad hoc detectors for the classification of complex multi-modal urban scenes," in Proc. IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), Jul. 2018, pp. 3856–3859.
[45] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2014, arXiv:1412.6980.
[46] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1026–1034.
[47] P. Goyal et al., "Accurate, large minibatch SGD: Training ImageNet in 1 hour," 2017, arXiv:1706.02677.
[48] Y. LeCun, J. S. Denker, and S. A. Solla, "Optimal brain damage," in Proc. Adv. Neural Inf. Process. Syst., 1990, pp. 598–605.
[49] J. Kukačka, V. Golkov, and D. Cremers, "Regularization for deep learning: A taxonomy," 2017, arXiv:1710.10686.
[50] Z. Zheng, A. Ma, L. Zhang, and Y. Zhong, "Deep multisensor learning for missing-modality all-weather mapping," ISPRS J. Photogramm. Remote Sens., vol. 174, pp. 254–264, Apr. 2021.

Yi Yang received the B.E. degree in surveying and mapping engineering from Tongji University, Shanghai, China, in 2018, and the M.S. degree in data science (computer science and technology) from Peking University, Beijing, China, in 2021. He is expected to pursue a second master's degree at Imperial College London, London, U.K. His research interests include geospatial big data, remote sensing image processing, and pattern recognition.

Daoye Zhu received the M.S. degree in cartography and geographical information system from the State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing (LIESMARS), Wuhan University, Wuhan, China, in 2018. He is currently pursuing the Ph.D. degree in data science (computer science and technology) with Peking University, Beijing, China. In 2020, he joined the University of Cambridge, Cambridge, U.K., as a Visiting Scholar. His research interests include spatial analysis, remote sensing intelligent interpretation, spatial data fusion, and geographic information engineering.

Tengteng Qu received the B.E. degree in surveying engineering and the Ph.D. degree in photogrammetry engineering and remote sensing from Tongji University, Shanghai, China, in 2011 and 2017, respectively. From 2018 to 2021, she was a Post-Doctoral Researcher with Peking University, Beijing, China, where she is currently a Research Assistant Professor with the College of Engineering. Since 2019, she has been serving as a Standard Expert of geographic information in the IEEE Standards Association, the Open Geospatial Consortium (OGC), and the International Organization for Standardization (ISO). Her research interests include geospatial big data, global subdivision grids, and synthetic aperture radar remote sensing.

Qiangyu Wang received the B.S. degree in computer science and technology from the Shandong University of Science and Technology, Qingdao, China, in 2012, the M.Sc. degree in advanced computer science from Newcastle University, Newcastle upon Tyne, U.K., in 2014, and the Ph.D. degree in computer architecture from the China University of Mining and Technology (Beijing), Beijing, China, in 2019. He is currently a Post-Doctoral Researcher with the School of Engineering, Peking University, Beijing. His research interests include spatio-temporal grids for big data, deep learning, and computer vision.

Fuhu Ren received the B.S. degree in geology, the M.S. degree in remote sensing and geographic information system, and the Ph.D. degree in geography from Peking University, Beijing, China, in 1984, 1988, and 1991, respectively. He was a Post-Doctoral Researcher with the University of Tokyo, Tokyo, Japan, for two years, and a UN Researcher with the United Nations Centre for Regional Development (UNCRD), Nagoya, Japan, for three years. He is currently a Professor and the Executive Director of the Collaborative Innovation Center for Geospatial Big Data, Peking University. He is an Expert representing the Standardization Administration of the People's Republic of China (SAC) in Working Group 9 of ISO/TC 211 for the development of relevant standards on DGGS. His research interests include discrete global grid systems (DGGS), spatial-temporal analysis, and remote sensing cloud computing.

Chengqi Cheng received the Ph.D. degree from Peking University, Beijing, China, in 1989. He is currently a Professor with the College of Engineering, Peking University. He established the Collaborative Innovation Center for Geospatial Data, Peking University, which has been involved in the IEEE Standards Association Corporate Program. His research interests include global subdivision models and geographic information system applications.