Paper 3

Scene understanding of high resolution aerial images is of great importance for the task of automated monitoring in various remote sensing applications. Due to the large within-class and small between-class variance in pixel values of objects of interest, this remains a challenging task. In recent years, deep convolutional neural networks have started being used in remote sensing applications and demonstrate state of the art performance for pixel level classification of objects. Here we propose a reliable framework for performant results for the task of semantic segmentation of monotemporal very high resolution aerial images. Our framework consists of a novel deep learning architecture, ResUNet-a, and a novel loss function based on the Dice loss. ResUNet-a uses a UNet encoder/decoder backbone, in combination with residual connections, atrous convolutions, pyramid scene parsing pooling and multi-tasking inference. ResUNet-a infers sequentially the boundary of the objects, the distance transform of the segmentation mask, the segmentation mask and a colored reconstruction of the input. Each of the tasks is conditioned on the inference of the previous ones, thus establishing a conditioned relationship between the various tasks, as this is described through the architecture's computation graph. We analyse the performance of several flavours of the Generalized Dice loss for semantic segmentation, and we introduce a novel variant loss function for semantic segmentation of objects that has excellent convergence properties and behaves well even under the presence of highly imbalanced classes. The performance of our modeling framework is evaluated on the ISPRS 2D Potsdam dataset. Results show state-of-the-art performance with an average F1 score of 92.9% over all classes for our best model.

Keywords: Convolutional neural network, Loss function, Architecture, Data augmentation, Very high spatial resolution
https://doi.org/10.1016/j.isprsjprs.2020.01.013
[…] specified parameters (Blaschke et al., 2014). Popular image segmentation algorithms in remote sensing include watershed segmentation (Vincent and Soille, 1991), multi-resolution segmentation (Baatz and Schäpe, 2000) and mean-shift segmentation (Comaniciu and Meer, 2002). In addition, GEOBIA also allows the computation of additional attributes related to the texture, context, and shape of the objects, which can then be added to the classification feature set. However, there is no universally-accepted method to identify the segmentation parameters that provide optimal pixel grouping, which implies that GEOBIA is still highly interactive and includes subjective trial-and-error methods and arbitrary decisions. Furthermore, image segmentation might fail to simultaneously address the wide range of object sizes that one typically encounters in urban landscapes, ranging from finely structured objects such as cars and trees to larger objects such as buildings. Another drawback is that GEOBIA relies on pre-selected features for which the maximum attainable accuracy is a priori unknown. While several methods have been devised to extract and select features, these methods are not themselves learned from the data, and are thus potentially sub-optimal.

In recent years, deep learning methods and Convolutional Neural Networks (CNNs) in particular (LeCun et al., 1989) have surpassed traditional methods in various computer vision tasks, such as object detection, semantic, and instance segmentation (see Rawat and Wang, 2017, for a comprehensive review). Some of the key advantages of CNN-based algorithms are that they provide end-to-end solutions that require minimal feature engineering and offer greater generalization capabilities. They also perform object-based classification, i.e., they take into account features that characterize entire image objects, thereby reducing the salt-and-pepper effect that affects conventional classifiers.

Our approach to annotate image pixels with class labels is object-based, that is, the algorithm extracts characteristic features from whole (or parts of) objects that exist in images, such as cars, trees, or corners of buildings, and assigns a vector of class probabilities to each pixel. In contrast, using standard classifiers such as random forests, the probability of each class per pixel is based on features inherent in the spectral signature only. Features based on spectral signatures contain less information than features based on objects. For example, looking at a car we understand not only its spectral features (color) but also how these vary, as well as the extent these occupy in an image. In addition, we understand that it is more probable for a car to be surrounded by pixels belonging to a road, and less probable to be surrounded by pixels belonging to buildings. In the field of computer vision, there is a vast literature on various modules used in convolutional neural networks that make use of this idea of "per object classification". These modules, such as atrous convolutions (Chen et al., 2016) and pyramid pooling (He et al., 2014; Zhao et al., 2017a), boost the algorithmic performance on semantic segmentation tasks. In addition, after the residual networks era (He et al., 2015) it is now possible to train deeper neural networks, avoiding to a great extent the problem of vanishing (or exploding) gradients.

Here, we introduce a novel Fully Convolutional Network (FCN) for semantic segmentation, termed ResUNet-a. This network combines ideas distilled from computer vision applications of deep learning, and demonstrates competitive performance. In addition, we describe a modeling framework consisting of a new loss function that behaves well for semantic segmentation problems with class imbalance as well as for regression problems. In summary, the main contributions of this paper are the following:

1. A novel architecture for understanding and labeling very high resolution images for the task of semantic segmentation. The architecture uses a UNet (Ronneberger et al., 2015) encoder/decoder backbone, in combination with residual connections (He et al., 2016), atrous convolutions (Chen et al., 2016, 2017), pyramid scene parsing pooling (Zhao et al., 2017a) and multi-tasking inference (Ruder, 2017). We present two variants of the basic architecture, a single-task and a multi-task one.
2. We analyze the performance of various flavours of the Dice coefficient for semantic segmentation. Based on our findings, we introduce a variant of the Dice loss function that speeds up the convergence of semantic segmentation tasks and improves performance. Our results indicate that the new loss function behaves well even when there is a large class imbalance. This loss can also be used for continuous variables when the target domain of values is in the range [0, 1].

In addition, we also present a data augmentation methodology, where the input is viewed at multiple scales during training by the algorithm, that improves performance and avoids overfitting. The performance of ResUNet-a was tested using the Potsdam data set made available through the ISPRS competition (ISPRS). Validation results show that ResUNet-a achieves state-of-the-art results.

This article is organized as follows. In Section 2 we provide a short review of related work on the topic of semantic segmentation, focused on the field of remote sensing. In Section 3, we detail the model architecture and the modeling framework. Section 4 describes the data set we used for training our algorithm. In Section 5 we provide an experimental analysis that justifies the design choices for our modeling framework. Finally, Section 6 presents the performance evaluation of our algorithm and comparison with other published results. Readers are referred to Section A for a description of our software implementation and hardware configurations, and to Section C for the full error maps on unseen test data.

2. Related work

The task of semantic segmentation has attracted significant interest in recent years, not only in the computer vision community but also in other disciplines (e.g. biomedical imaging, remote sensing) where automated annotation of images is an important process. In particular, specialized techniques have been developed across different disciplines, since there are task-specific peculiarities that the community of computer vision does not have to address (and vice versa).

Starting from the computer vision community, when first introduced, Fully Convolutional Networks (hereafter FCN) for semantic segmentation (Long et al., 2014) improved the state of the art by a significant margin (20% relative improvement over the state of the art on the PASCAL VOC (Everingham et al., 2010) 2011 and 2012 test sets). The authors replaced the last fully connected layers with convolutional layers. The original resolution was achieved with a combination of upsampling and skip connections. Additional improvements have been presented with the use of deeplab models (Chen et al., 2016, 2017), which first showcased the importance of atrous convolutions for the task of semantic segmentation. Their model also uses a conditional random field as a post-processing step in order to refine the final segmentation. A significant contribution in the field of computer vision came from the community of biomedical imaging and in particular the U-Net architecture (Ronneberger et al., 2015), which introduced the encoder-decoder paradigm for upsampling gradually from lower size features to the original image size. Currently, the state of the art on the computer vision datasets is considered to be Mask R-CNN (He et al., 2017), which performs various tasks (object localization, semantic segmentation, instance segmentation, pose estimation etc.). A key element of the success of this architecture is its multitasking nature.

One of the major advantages of CNNs over traditional classification methods (e.g. random forests) is their ability to process input data at multiple context levels. This is achieved through the downsampling operations that summarize features. However, this advantage in feature extraction needs to be matched with a proper upsampling method, to retain information from all spatial resolution contexts and produce fine boundary layers. There has been a quick uptake of the approach in
the remote sensing community, and various solutions based on deep learning have been presented recently (e.g. Sherrah, 2016; Audebert et al., 2016, 2017, 2018; Längkvist et al., 2016; Li et al., 2015; Li and Shao, 2014; Volpi and Tuia, 2017; Liu et al., 2018; Liu et al., 2017b, 2018a, 2017a; Pan et al., 2018b; Marmanis et al., 2016, 2018; Wen et al., 2017; Zhao et al., 2017b). A comprehensive review of deep learning applications in the field of remote sensing can be found in Zhu et al. (2017), Ma et al. (2019), Gu et al. (2019).

Discussing in more detail some of the most relevant approaches to our work, Sherrah (2016) utilized the FCN architecture with a novel no-downsampling approach based on atrous convolutions to mitigate this problem. The summary pooling operation was traded for atrous convolutions, for filter processing at different scales. The best performing architectures from their experiments were the ones using pretrained convolution networks. The loss used was categorical cross-entropy.

Liu et al. (2017a) introduced the Hourglass-shape network for semantic segmentation on VHR images, which included an encoder-decoder style network utilizing inception-like modules. Their encoder-decoder style departed from the UNet backbone, in that they did not use features from all spatial contexts of the encoder in the decoder branch. Also, their decoder branch is not symmetric to the encoder. The building blocks of the encoder are inception modules. Feature upsampling takes place with the use of transpose convolutions. The loss used was weighted binary cross entropy.

Emphasizing the importance of using the information from the boundaries of objects, Marmanis et al. (2018) utilized the Holistically-Nested Edge Detection network (Xie and Tu, 2015, HED) for predicting boundaries of objects. The loss used for the boundaries was a Euclidean distance regression loss. The estimated boundaries were then concatenated with image features and provided as input into another CNN segmentation network, for the final classification of pixels. For the CNN segmentation network, they experimented with two architectures, the SegNet (Badrinarayanan et al., 2015) and a Fully Convolutional Network presented in Marmanis et al. (2016) that uses weights from pretrained architectures. One of the key differences of our approach for boundary detection from Marmanis et al. (2018) is that the boundary prediction happens at the end of our architecture; therefore the request for boundary prediction affects all features, since the boundaries are strongly correlated with the extent of the predicted classes. In contrast, in Marmanis et al. (2018), the boundaries are fed as input to the segmentation branch of their network, i.e. the segmentation part of their network uses them as additional input. Another difference is that we do not use weights from pretrained networks.

Pan et al. (2018b) presented the Dense Pyramid Network. The authors incorporated group convolutions to process the Digital Surface Model independently from the true orthophoto, presenting an interesting data fusion approach. The channels created from their initial group convolutions were shuffled, in order to enhance the information flow between channels. The authors utilized a DenseNet (Huang et al., 2016) architecture as their feature extractor. In addition, a Pyramid Pooling layer was used at the end of their encoder branch, before constructing the final segmentation classes. In order to overcome the class imbalance problem, they chose to use the Focal loss function (Lin et al., 2017). In comparison with our work, the authors did not use a symmetric encoder-decoder architecture. The building blocks of their model were DenseNet units, which are known to be more efficient than standard residual units (Huang et al., 2016). The pyramid pooling operator used at the end of their architecture, before the final segmentation map, is at different scales than the one used in ResUNet-a.

Liu et al. (2018) introduced the CASIA network, which consists of a pretrained deep encoder, a set of self-cascaded convolutional units and a decoder part. The encoder part is deeper than the decoder part. The upscaling of the lower level features takes place with a resize operation followed by a convolutional residual correction term. The self-cascaded units consist of a sequential multi-context aggregation layer that aggregates features from higher receptive fields to local receptive fields. In an idea similar to our approach, the CASIA network uses features from multiple contexts; however, these are evaluated at a different depth of the network and fused together in a completely different way. The architecture achieved state of the art performance on the ISPRS Potsdam and Vaihingen data. The loss function they used was the normalized cross entropy.

3. The ResUNet-a framework

In this section, we introduce the architecture of ResUNet-a in full detail (Section 3.1), a novel loss function designed to achieve faster convergence and higher performance (Section 3.2), the data augmentation methodology (Section 3.3), as well as the methodology we followed for performing inference on large images (Section 3.4). The training strategy and software implementation characteristics can be found in Appendix A.

3.1. Architecture

Our architecture combines the following set of modules encoded in our models:

1. A UNet (Ronneberger et al., 2015) backbone architecture, i.e., the encoder-decoder paradigm, is selected for smooth and gradual transitions from the image to the segmentation mask.
2. To achieve consistent training as the depth of the network increases, the building blocks of the UNet architecture were replaced with modified residual blocks of convolutional layers (He et al., 2016). Residual blocks remove to a great extent the problem of vanishing and exploding gradients that is present in deep architectures.
3. For better understanding across scales, multiple parallel atrous convolutions (Chen et al., 2016, 2017) with different dilation rates are employed within each residual building block. Although it is not completely clear why atrous convolutions perform well, the intuition behind their usage is that they increase the receptive field of each layer. The rationale of using these multiple-scale layers is to extract object features at various receptive field scales. The hope is that this will improve performance by identifying correlations between objects at different locations in the image.
4. In order to enhance the performance of the network by including background context information, we use the pyramid scene parsing pooling (Zhao et al., 2017a) layer. In shallow architectures, where the last layer of the encoder has a size no less than 16x16 pixels, we use this layer in two locations within the architecture: after the encoder part (i.e., middle of the network) and at the second last layer before the creation of the segmentation mask. For deeper architectures, we use this layer only close to the last output layer.
5. In addition to the standard architecture that has a single segmentation mask layer as output, we also present two models where we perform multi-task learning. The algorithm learns simultaneously four complementary tasks. The first is the segmentation mask. The second is the common boundary between the segmentation masks, which is known to improve performance for semantic segmentation (Bertasius et al., 2015; Marmanis et al., 2018). The third is the distance transform¹ (Borgefors, 1986) of the segmentation mask. The fourth is the actual colored image, in HSV color space. That is, the identity transform of the content, but in a different color space.

Footnote 1: The result of the distance transform on a binary segmentation mask is a gray level image that takes values in the range [0, 1], where each pixel value corresponds to the distance to the closest boundary. In OpenCV this transform is encoded in cv::distance_transform.
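Since both auxiliary labels are derived from the segmentation mask alone, their construction is short. The following is a minimal sketch using OpenCV and NumPy; the function names and the morphological-gradient boundary extraction are our illustrative choices, not the authors' Appendix B code:

```python
import cv2
import numpy as np

def distance_label(mask):
    # Distance transform of a binary {0,1} mask, rescaled so that the
    # label lives in [0, 1] as described in footnote 1.
    dist = cv2.distanceTransform(mask.astype(np.uint8), cv2.DIST_L2, 3)
    return dist / dist.max() if dist.max() > 0 else dist

def boundary_label(mask, width=3):
    # Boundaries as a thin binary band via a morphological gradient
    # (dilation minus erosion); no separate annotation is needed.
    kernel = np.ones((width, width), np.uint8)
    m = mask.astype(np.uint8)
    return (cv2.dilate(m, kernel) - cv2.erode(m, kernel)).astype(np.float32)
```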
We term our network ResUNet-a because it consists of residual building blocks with multiple atrous convolutions and a UNet backbone architecture. We present two basic architectures, ResUNet-a d6 and ResUNet-a d7, that differ in their depth, i.e. the total number of layers. In ResUNet-a d6 the encoder part consists of six ResBlock-a building blocks followed by a PSPPooling layer. In ResUNet-a d7 the encoder consists of seven ResBlock-a building blocks. For each of the d6 or d7 models, there are also three different output possibilities: a single task semantic segmentation layer, a multi-task layer (mtsk), and a conditioned multi-task output layer (cmtsk). The difference between the mtsk and cmtsk output layers is how the various complementary tasks (i.e. the boundary, the distance map, and the color) are used for the determination of the main target task, which is the semantic segmentation prediction. In the following we present these models in detail, starting from the basic ResUNet-a d6.

3.1.1. ResUNet-a

The ResUNet-a d6 network consists of stacked layers of modified residual building blocks (ResBlock-a), in an encoder-decoder style (UNet). The input is initially subjected to a convolution layer of kernel size (1, 1) to increase the number of features to the desired initial filter size. A (1, 1) convolution layer was used in order to avoid any information loss from the initial image by summarizing features across pixels with a larger kernel. Then follow the residual blocks. In each residual block (Fig. 1b), we used as many as three parallel atrous convolutions in addition to the standard set of two convolutions of the residual network architecture, i.e., there were up to four parallel branches of sets of two stacked convolutional layers. After the convolutions, the output is added to the initial input in the spirit of residual building blocks. We decided to sum the various atrous branches (instead of concatenating them) because it is known that residual blocks of two successive convolutional layers demonstrate a constant condition number of the Hessian of the loss function, irrespective of the depth of the network (Li et al., 2016). Therefore the summation scheme is easier to train (in comparison with the concatenation of features). In the encoder part of the network, the output of each of the residual blocks is downsampled with a convolution of kernel size of one and stride of two. At the end of both the encoder and the decoder part, there exists a PSPPooling operator (Zhao et al., 2017a). In the PSPPooling operator (Fig. 1c), the initial input is split in channel (feature) space into 4 equal partitions. Then we perform a max pooling operation in successive splits of the input layer, in 1, 4, 16 and 64 partitions. Note that in the middle layer (Layer 13 has size [batch size] × 1024 × 8 × 8), the split of 64 corresponds to the actual total size of the input (so we have no additional gain with respect to max pooling from the last split). In Fig. 1a we present the full architecture of ResUNet-a (see also Table 1). In the decoder part, the upsampling is being done with the use […]
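To make the ResBlock-a description concrete, here is a minimal PyTorch sketch (the authors' implementation details are in Appendix A; the pre-activation ordering of normalization and ReLU is our assumption):

```python
import torch
import torch.nn as nn

class ResBlockA(nn.Module):
    # Parallel atrous branches (two stacked 3x3 convolutions each, one
    # branch per dilation rate) whose outputs are summed with the identity.
    def __init__(self, channels, dilations=(1, 3, 15, 31)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.BatchNorm2d(channels), nn.ReLU(),
                nn.Conv2d(channels, channels, 3, padding=d, dilation=d),
                nn.BatchNorm2d(channels), nn.ReLU(),
                nn.Conv2d(channels, channels, 3, padding=d, dilation=d),
            ) for d in dilations
        ])

    def forward(self, x):
        # Summing (rather than concatenating) keeps the block residual and,
        # per the text, easier to train.
        return x + sum(branch(x) for branch in self.branches)
```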
Fig. 1. Overview of the ResUNet-a d6 network. (a) The left (downward) branch is the encoder part of the architecture. The right (upward) branch is the decoder. The last convolutional layer has as many channels as there are distinct classes. (b) Building block of the ResUNet-a network. Each unit within the residual block has the same number of filters as all other units. Here d1, …, dn designate different dilation rates. (c) Pyramid scene parsing pooling layer. Pooling takes place in 1/1, 1/2, 1/4 and 1/8 portions of the original image.
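One plausible reading of the PSPPooling layer of Fig. 1c, as a PyTorch sketch: the channels are split into four groups, each group is max-pooled over a progressively finer grid (1, 4, 16 and 64 cells, i.e. 1/1 down to 1/8 portions of the image), broadcast back to the input resolution, and fused with the input by a 1x1 convolution. Everything beyond the quoted text (nearest-neighbour upsampling, the final fusion convolution) is an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PSPPooling(nn.Module):
    def __init__(self, channels, grids=(1, 2, 4, 8)):
        super().__init__()
        self.grids = grids  # pooling over 1/1, 1/2, 1/4, 1/8 portions
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x):
        # Channels assumed divisible by four for the equal split.
        h, w = x.shape[-2:]
        parts = torch.chunk(x, len(self.grids), dim=1)
        pooled = [
            F.interpolate(F.adaptive_max_pool2d(p, (g, g)),
                          size=(h, w), mode='nearest')
            for p, g in zip(parts, self.grids)
        ]
        # Keep the original features alongside their pooled summaries.
        return self.fuse(torch.cat([x] + pooled, dim=1))
```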
Table 1
Details of the ResUNet-a layers for the d6 model. Here f stands for the number of output channels (or features; the input number of features is deduced from the previous layers), k is the convolution kernel size, d is the dilation rate, and s the stride of the convolution operation. In all convolution operations we used appropriate zero padding to keep the dimensions of the produced feature maps equal to the input feature map (unless downsampling).

Layer #  Layer Type
1   Conv2D(f = 32, k = 1, d = 1, s = 1)
2   ResBlock-a(f = 32, k = 3, d = {1,3,15,31}, s = 1)
3   Conv2D(f = 64, k = 1, d = 1, s = 2)
4   ResBlock-a(f = 64, k = 3, d = {1,3,15,31}, s = 1)
5   Conv2D(f = 128, k = 1, d = 1, s = 2)
6   ResBlock-a(f = 128, k = 3, d = {1,3,15}, s = 1)
7   Conv2D(f = 256, k = 1, d = 1, s = 2)
8   ResBlock-a(f = 256, k = 3, d = {1,3,15}, s = 1)
9   Conv2D(f = 512, k = 1, d = 1, s = 2)
10  ResBlock-a(f = 512, k = 3, d = 1, s = 1)
11  Conv2D(f = 1024, k = 1, d = 1, s = 2)
12  ResBlock-a(f = 1024, k = 3, d = 1, s = 1)
13  PSPPooling
14  UpSample(f = 512)
15  Combine(f = 512, Layers 14 & 10)
16  ResBlock-a(f = 512, k = 3, d = 1, s = 1)
17  UpSample(f = 256)
18  Combine(f = 256, Layers 17 & 8)
19  ResBlock-a(f = 256, k = 3, d = 1, s = 1)
20  UpSample(f = 128)
21  Combine(f = 128, Layers 20 & 6)
22  ResBlock-a(f = 128, k = 3, d = 1, s = 1)
23  UpSample(f = 64)
24  Combine(f = 64, Layers 23 & 4)
25  ResBlock-a(f = 64, k = 3, d = 1, s = 1)
26  UpSample(f = 32)
27  Combine(f = 32, Layers 26 & 2)
28  ResBlock-a(f = 32, k = 3, d = 1, s = 1)
29  Combine(f = 32, Layers 28 & 1)
30  PSPPooling
31  Conv2D(f = NClasses, k = 1, d = 1, s = 1)
32  Softmax(dim = 1)

Table 2
Details of the Combine(Input1, Input2) layer.

Layer #  Layer Type
1   Input1
2   ReLU(Input1)
3   Concat(Layer 2, Input2)
4   Conv2DN(k = 1, d = 1, s = 1)

Table 3
Details of the replacement of the middle PSPPooling layer (Layer 13 from Table 1) for the ResUNet-a d7 model.

Layer #  Layer Type
Input  Layer 12
A   Conv2D(f = 2048, k = 1, d = 1, s = 2)(input)
B   ResBlock-a(f = 2048, k = 3, d = 1, s = 1)(Layer A)
C   MaxPooling(kernel = 2, stride = 2)(Layer B)
D   UpSample(Layer C)
E   Concat(Layer D, Layer B)
F   Conv2D(f = 2048, k = 1)(Layer E)

[…] block at a lower resolution. The output of this layer is subjected to a MaxPooling2D(kernel = 2, stride = 2) operation, the output of which is rescaled to its original size and then concatenated with the original input layer. This operation is followed by a standard convolution that brings the total number of features (i.e. the number of channels) to their original number before the concatenation. In version 2 (hereafter d7v2), again the Layer 12 is replaced with a standard resnet block. However, now the MaxPooling operation following this layer is replaced with a smaller PSPPooling layer that has three parallel branches, performing pooling in 1/1, 1/2, 1/4 scales of the original filter (Fig. 1c). The reason for this is that the filters in the middle of the d7 network cannot sustain 4 parallel pooling operations due to their small size (therefore, we remove the 1/8 scale pooling), for an initial input image of size 256x256.

With regards to the model complexity, ResUNet-a d6 has 52 M trainable parameters for an initial filter size of 32. ResUNet-a d7, which has greater depth, has 160 M parameters for the same initial filter size. The number of parameters remains almost identical for the case of the multi-task models as well.
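Read literally, the Combine unit of Table 2, which fuses upsampled decoder features with the corresponding encoder skip connection, amounts to very little code. A sketch follows; we read Conv2DN as a 1x1 convolution followed by a normalization layer, which is an assumption:

```python
import torch
import torch.nn as nn

class Combine(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        self.norm = nn.BatchNorm2d(out_channels)

    def forward(self, input1, input2):
        # ReLU(Input1), concatenate with Input2, then Conv2DN (Table 2).
        x = torch.cat([torch.relu(input1), input2], dim=1)
        return self.norm(self.conv(x))
```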
3.1.2. Multitasking ResUNet-a

This version of ResUNet-a replaces the last layer (Layer 31) with a multitasking block (Fig. 2). The multiple tasks are complementary. These are (a) the prediction of the semantic segmentation mask, (b) the detection of the common boundaries between classes, (c) the reconstruction of the distance map and (d) the reconstruction of the original image in HSV color space. Our choice of using a different color space than the original input was guided by the principle that: (a) we wanted to avoid the identity transform in order to exclude the algorithm recovering trivial solutions and (b) the HSV (or HSL) colorspace matches closely the human perception of color (Vadivel et al., 2005). It is important to note that these additional labels are derived using standard computer vision libraries from the initial image and segmentation mask, without the need for additional information (e.g. separately annotated boundaries). A software implementation for this is given in Appendix B. The idea here is that all these tasks are complementary and should help the target task that we are after. Indeed, the distance map provides information on the topological connectivity of the segmentation mask as well as the extent of the objects (for example, if we have an image with a "car" (object class) on a "road" (another object class), then the ground truth of the mask of the "road" will have a hole exactly at the location of the pixels corresponding to the "car" object). The boundary helps in better understanding the extent of the segmentation mask. Finally, the colorspace transformation provides additional information on the correlation between color variations and object extent. It also helps to keep "alive" the information of the fine details of the original image to its full extent until the final output layer. The rationale here is similar to the idea behind the concatenation of higher order features (first layers) with lower order features that exists in the UNet backbone architecture: the encoder layers have finer details about the original image the closer they are to the original input. Hence, the reason for concatenating them with the layers of the decoder is to keep the fine details necessary until the final layer of the network that is ultimately responsible for the creation of the segmentation mask. By demanding the network to be able to reconstruct the original image, we are making sure that all fine details are preserved² (an example of input image, ground truth and inference for all the tasks in the conditioned multitasking setting can be seen in Fig. 13).

Footnote 2: However, the color reconstruction on its own does not guarantee that the network learns meaningful correlations between classes and colors.

We present two flavours of the algorithm whose main difference is how the various tasks are used for the target output that we are interested in. In the simple multi-task block (bottom right block of Fig. 2), the four tasks are produced simultaneously and independently. That is, there is no direct usage of the three complementary tasks (boundary, distance, and color) in the construction of the target task, which is the segmentation. The motivation here is that the different tasks will force the algorithm to identify new meaningful features that are correlated with the output we are interested in and can help the performance of the algorithm for semantic segmentation. For the distance map, as well as the color reconstruction, we do not use the PSPPooling layer. This is because it tends to produce large squared areas with the same values (due to the pooling operation) and the depth of the convolution layers in the logits is not sufficient to diminish this (Fig. 14).

The second version of the algorithm uses a conditioned inference methodology. That is, the network graph is constructed in such a way so as to take advantage of the inference of the previous layers (top right block of Fig. 2). We first predict the distance map. The distance map is then concatenated with the output of the PSPPooling layer and is used to calculate the boundary logits. Then both the distance map and the prediction of the boundary are concatenated with the PSPPooling layer and the result is provided as input to the segmentation logits for the final prediction.

3.2. Loss function

In this section, we introduce a new variant of the family of Dice loss functions for semantic segmentation and regression problems. The Dice family of losses is by no means the only option for the task of semantic segmentation. Other interesting loss functions for the task of semantic segmentation are the focal loss (Lin et al., 2017; see also Pan et al., 2018b for an application on VHR images), the boundary loss (Kervadec et al., 2018), and the Focal Tversky loss (Abraham and Khan, 2018). A list of many other available loss functions can be found in Taghanaki et al. (2019).

3.2.1. Introducing the Tanimoto loss with complement

When it comes to semantic segmentation tasks, there are various options for the loss function. The Dice coefficient (Dice, 1945; Sørensen, 1948), generalized for fuzzy binary vectors in a multiclass context (Milletari et al., 2016; see also Crum et al., 2006; Sudre et al., 2017), is a popular choice among practitioners. It has been shown that it can increase performance over the cross entropy loss (Novikov et al., 2017). The Dice coefficient can be generalized to continuous binary vectors in two ways: either by the summation of probabilities in the denominator or by the summation of their squared values. In the literature, there are at least three definitions which are equivalent (Crum et al., 2006; Milletari et al., 2016; Drozdzal et al., 2016; Sudre et al., 2017):

$$D_1(\mathbf{p}, \mathbf{l}) = \frac{2\sum_i p_i l_i}{\sum_i (p_i + l_i)} \quad (1)$$

$$D_2(\mathbf{p}, \mathbf{l}) = \frac{2\sum_i p_i l_i}{\sum_i (p_i^2 + l_i^2)} \quad (2)$$

$$D_3(\mathbf{p}, \mathbf{l}) = \frac{\sum_i p_i l_i}{\sum_i (p_i^2 + l_i^2) - \sum_i p_i l_i} \quad (3)$$

where $\mathbf{p} \equiv \{p_i\}$, $p_i \in [0, 1]$, is a continuous variable representing the vector of probabilities for the i-th pixel, and $\mathbf{l} \equiv \{l_i\}$ are the corresponding ground truth labels. For binary vectors, $l_i \in \{0, 1\}$. In the following we will represent (where appropriate), for simplicity, the set of vector coordinates, $\mathbf{p} \equiv \{p_i\}$, with their corresponding tensor index notation, i.e. $\mathbf{p} \equiv \{p_i\} \equiv p_i$.

These three definitions are numerically equivalent, in the sense that they map the vectors $(p_i, l_i)$ to the continuous domain $[0, 1]$, i.e. $D(p_i, l_i)\colon \mathbb{R}^2 \to [0, 1]$. The gradients, however, of these loss functions behave differently for gradient based optimization, i.e., for deep learning applications, as demonstrated in Milletari et al. (2016). In the remainder of this paper, we call Dice loss the loss function with the functional form with the summation of probabilities and labels in the denominator (Eq. (1)). We also use the name Tanimoto for the $D_3$ loss function (Eq. (3)) and designate it with the letter $T \equiv D_3$.

We found empirically that the loss functions containing squares in the denominator behave better in pointing to the ground truth irrespective of the random initial configuration of weights. In addition, we found that we can achieve faster training convergence by complementing the loss with a dual form that measures the overlap area of the complement of the regions of interest. That is, if $p_i$ measures the probability of the i-th pixel to belong to class $l_i$, the complement loss is defined as $T(1 - p_i, 1 - l_i)$, where the subtraction is performed element-wise, e.g. $1 - p_i = \{1 - p_1, 1 - p_2, \ldots, 1 - p_n\}$ etc. The intuition behind the usage of the complement in the loss function comes from the fact that the numerator of the Dice coefficient, $\sum_i p_i l_i$, can be viewed as an inner product between the probability vector, $\mathbf{p} = \{p_i\}$, and the ground truth label vector, $\mathbf{l} = \{l_i\}$. Then, the part of the probabilities vector, $p_i$, that corresponds to the elements of the label vector, $l_i$, that have zero entries does not alter the value of the inner product³.

Footnote 3: As a simple example, consider four dimensional vectors, say $\mathbf{p} = (p_1, p_2, p_3, p_4)$ and $\mathbf{l} = (1, 1, 0, 0)$. The value of the inner product term is $\mathbf{p} \cdot \mathbf{l} = p_1 + p_2$, and therefore the information contained in the $p_3$ and $p_4$ entries is not apparent to the numerator of the loss. The complement inner product provides information for these terms: $(1 - \mathbf{p}) \cdot (1 - \mathbf{l}) = (1 - \mathbf{p}) \cdot (0, 0, 1, 1) = 2 - (p_3 + p_4)$.
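The footnote example can be verified numerically in a couple of lines (the probability values are arbitrary):

```python
import numpy as np

p = np.array([0.9, 0.8, 0.7, 0.6])  # arbitrary probabilities
l = np.array([1.0, 1.0, 0.0, 0.0])  # labels with two zero entries

print(p @ l)              # 1.7 = p1 + p2: blind to p3 and p4
print((1 - p) @ (1 - l))  # 0.7 = 2 - (p3 + p4): recovered by the complement
```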
Fig. 3. Contour plots of the gradient flow (top row) and Laplacian operator (bottom row) for the various versions of the Dice loss functions (Eqs. (1)–(3)), $D_i$, as well as the functional forms with complements, $\tilde{D}_i$. The black dot corresponds to the ground truth value (1, 0). From left to right, for the top row, we have the gradient flow of the generalized loss functions $D_1, \tilde{D}_1, D_2, \tilde{D}_2$ and $D_3, \tilde{D}_3$. The bottom panels are the corresponding Laplacian operators of these. The numerical values of the isocontours on the images describe numerically the colorscheme, with darker values corresponding to smaller values.

We therefore propose that the best flow of gradients (hence faster training) is achieved using as a loss function the average of $T(p_i, l_i)$ with its complement, $T(1 - p_i, 1 - l_i)$:

$$\tilde{T}(p_i, l_i) = \frac{T(p_i, l_i) + T(1 - p_i, 1 - l_i)}{2}. \quad (4)$$

3.2.2. Experimental comparison with other Dice loss functions

In order to justify these choices, we present an example with a single 2D ground truth vector, $\mathbf{l} = (1, 0)$, and a vector of probabilities $\mathbf{p} = (p_x, p_y) \in [0, 1]^2$. We consider the following six loss functions:

1. The Dice coefficient, $D_1(p_i, l_i)$ (Eq. (1)).
2. The Dice coefficient with its complement: $\tilde{D}_1(p_i, l_i) = (D_1(p_i, l_i) + D_1(1 - p_i, 1 - l_i))/2$.
3. The Dice coefficient $D_2(p_i, l_i)$ (Eq. (2)).
4. The Dice coefficient with its complement, $\tilde{D}_2$.
5. The Tanimoto coefficient, $T(p_i, l_i)$ (Eq. (3)).
6. The Tanimoto coefficient with its complement, $\tilde{T}(p_i, l_i)$ (Eq. (4)).

In Fig. 3 we plot the gradient field of the various flavours of the family of Dice loss functions (top panels), as well as the Laplacians of these (i.e. their 2nd order derivatives, bottom panels). The ground truth is marked with a black dot. What is important in these plots is that for a random initialization of the weights of a neural network, the loss function will take a (random) value in the area within $[0, 1]^2$. The quality of the loss function then, as a suitable criterion for training deep learning models, is whether the gradients, from every point of the area in the plot, direct the solution towards the ground truth point. Intuitively, we also expect that the behavior of the gradients is even better if the local extremum of the loss at the ground truth is also a local extremum of the Laplacian of the loss. As is evident from the bottom panels of Fig. 3, this is not the case for all loss functions.

In more detail, in Fig. 3, we plot the gradient field of the Dice loss functions and the corresponding Laplacian fields. In the top row are shown the gradient fields of the three different functional forms of the Dice loss and the forms with their complements. From left to right we have the Dice coefficient based loss with summation of probabilities in the denominator, $D_1(\mathbf{p}, \mathbf{l})$, its complement, $\tilde{D}_1(\mathbf{p}, \mathbf{l})$, the Dice loss with summation of squares in the denominator, $D_2(\mathbf{p}, \mathbf{l})$, its complement, $\tilde{D}_2(\mathbf{p}, \mathbf{l})$, and the third form of the Dice loss with summation of squares that also includes a subtraction term, $D_3(\mathbf{p}, \mathbf{l})$, and its complement, $\tilde{D}_3(\mathbf{p}, \mathbf{l})$. From the gradient flow of the $D_1$ loss, it is evident that for a random initialization of the network weights (which is the case in deep learning) that corresponds to some random point $(p_x, p_y)$ of the loss landscape, the gradients of the loss with respect to $p_x, p_y$ will not necessarily direct to the ground truth point at (1, 0). In this respect, the generalized Dice loss with complement, $\tilde{D}_1$, behaves better. However, the gradient flow lines do not pass through the ground truth point for all possible pairs of values $(p_x, p_y)$. For the case of the loss functions $D_2, D_3$ and their complements, the gradient flow lines pass through the ground truth point, but these are not straight lines. Their forms with complement, $\tilde{D}_2, \tilde{D}_3$, have gradient lines flowing straight towards the ground truth irrespective of the (random) initialization point. The Laplacians of these loss functions are in the corresponding bottom panels of Fig. 3. It is clear that the extremum of the Laplacian operator is closer to the ground truth values only for the cases where we consider the loss functions with complement. Interestingly, the Laplacian of the Tanimoto functional form ($\tilde{D}_3$) has extremum values closer to the ground truth point in comparison with the $\tilde{D}_2$ functional form.

In summary, the Tanimoto loss with complement has gradient flow lines that are straight lines (geodesics, i.e. they follow the shortest path) pointing to the ground truth from any random initialization point, and the second order derivative has an extremum at the location of the ground truth. This demonstrates, in our opinion, the superiority of the Tanimoto with complement as a loss function, among the family of loss functions based on the Dice coefficient, for training deep learning models.

3.2.3. Tanimoto with complement as a regression loss

It should be stressed that, if we restrict the output of the neural network to the range [0, 1] (with the use of softmax or sigmoid activations), then the Tanimoto loss can also be used to recover continuous variables in the range [0, 1]. In Fig. 4 we present an example of this, for a ground truth vector of $\mathbf{l} = (0.25, 0.85)$. In the top panels, we plot the gradient flow of the Tanimoto (left) and Tanimoto with complement (right) functions. In the bottom panels, we plot the corresponding functions obtained after applying the Laplacian operator to the loss functions. This is an appealing property for the case of multi-task learning, where one of the complementary tasks targets a continuous variable. The reason is that the gradients of these components will have a similar magnitude scale and the training will be equally balanced across all complementary tasks. In contrast, when we use functions of different functional form for different tasks in the optimization, we have to explicitly balance the gradients of the different components, with the […]
Fig. 6. Example of data augmentation on image patches of size 256 × 256 (ground sampling distance 10 cm – FoV × 4 dataset). Top row: original image; subsequent rows: random rotations with respect to a (random) center and at a random scale (zoom in/out). Reflect padding was used to fill the missing values of the image after the transformation.

[…] et al., 2016; Waldner et al., 2017).

Practically, we perform multiple overlapping window passes over the whole tile and store the class probabilities for each pixel and each pass. The final class probability vector, $p_i(x, y)$, is obtained using the average of all the prediction views. The sliding window has size equal to the tile dimensions (256 × 256); however, we step through the whole image in strides of 256/4 = 64 pixels, in order to get multiple inference probabilities for each pixel. In order to account for the lack of information outside the tile boundaries, we pad each tile with reflect padding of a size equal to 256/2 = 128 pixels (Ronneberger et al., 2015).

4. Data and preprocessing

We sourced data from the ISPRS 2D Semantic Labelling Challenge, and in particular the Potsdam data set (ISPRS). The data consist of a set of true orthophotos (TOP) extracted from a larger mosaic, and a Digital Surface Model (DSM). The TOP consists of the four spectral bands in the visible (VIS) red (R), green (G), and blue (B) and in the near infrared (NIR), and the ground sampling distance is 5 cm. The normalized DSM layer provides information on the height of each pixel, as the ground elevation was subtracted. The four spectral bands (VISNIR) and the normalized DSM were stacked (VISNIR + DSM) to be used to train the semantic segmentation models. The labels consist of six classes, namely impervious surfaces, buildings, cars, low vegetation, trees, and background.

Unlike conventional pixel-based (e.g. random forests) or GEOBIA approaches, CNNs have the ability to "see" image objects in their contexts, which provides additional information to discriminate between classes. Thus, working with large image patches maximizes the competitive advantage of CNNs; however, limits to the maximum patch size are dictated by memory restrictions of the GPU hardware. We have created two versions of the training data. In the first version, we resampled the image tiles to half their original resolution and extracted image patches of size 256x256 pixels to train the network. The reduction of the original tile size to half was decided with the mindset that we can include more context information per image patch. This resulted in image patches with a four times larger Field of View (hereafter FoV) for the same 256 × 256 patch size. We will refer to this dataset as FoV × 4, as it includes a 4 times larger Field of View (area) in a single 256 × 256 image patch (in comparison with 256 × 256 image patches extracted directly from the original unscaled dataset). In the second version of the training data, we kept the full resolution tiles and again extracted image patches of size 256 × 256 pixels. We will refer to this dataset as FoV × 1. The 256 × 256 image patch size was the maximum size that the memory capacity of our hardware configuration could handle (see Appendix A) so as to process a meaningfully large batch of datums. Each of the 256 × 256 patches used for training was extracted from a sliding window swiped over the whole tile at a stride, i.e., step, of 128 pixels. This approach guarantees that all pixels at the edge of a patch become central pixels in subsequent patches. After slicing the original images, we split⁴ the 256x256 patches into a training set, a validation set, and a test set with the following ratios: 0.8–0.1–0.1.

Footnote 4: Making sure there is no overlap between the image patches of the training, validation and test sets.

The purpose of the two distinct datasets is the following: the FoV × 4 is useful in order to understand how much (if any) the increased context information improves the performance of the algorithm. It also allows us to perform more experiments much faster, due to the decreased volume of data. The FoV × 4 dataset is approximately 50 GB, with 10 k pairs of images and masks. The FoV × 1 has a volume size of 250 GB, and 40 k pairs of images and masks. In addition, the FoV × 4 is a useful benchmark of how the algorithm behaves with a smaller amount of data than the one provided. Finally, the FoV × 1 version is used in order to compare the performance of our architecture with other published results.

5. Architecture and Tanimoto loss experimental analysis

In this section, we perform an experimental analysis of the ResUNet-a architecture as well as of the performance of the Tanimoto with complement loss function.

5.1. Accuracy assessment

For each tile of the test set, we constructed the confusion matrix and extracted several accuracy metrics, such as the overall accuracy (OA), the precision, the recall, and the F1-score ($F_1$):

$$\mathrm{OA} = \frac{TP + TN}{TP + TN + FP + FN} \quad (7)$$

$$\mathrm{precision} = \frac{TP}{TP + FP} \quad (8)$$

$$\mathrm{recall} = \frac{TP}{TP + FN} \quad (9)$$

$$F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \quad (10)$$

where TP, FP, FN, and TN are the true positive, false positive, false negative and true negative classifications, respectively.
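Given the per-class confusion counts, Eqs. (7)–(10), together with the MCC of Eq. (11) introduced next, reduce to a few lines. A minimal sketch, assuming binary (one-vs-rest) counts:

```python
import math

def scores(tp, fp, fn, tn):
    oa = (tp + tn) / (tp + tn + fp + fn)                 # Eq. (7)
    precision = tp / (tp + fp)                           # Eq. (8)
    recall = tp / (tp + fn)                              # Eq. (9)
    f1 = 2 * precision * recall / (precision + recall)   # Eq. (10)
    mcc = (tp * tn - fp * fn) / math.sqrt(               # Eq. (11) below
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return oa, precision, recall, f1, mcc
```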
In addition, for the validation dataset (for which we have ground truth labels), we use the Matthews Correlation Coefficient (hereafter MCC; Matthews, 1975):

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \quad (11)$$

5.2. Architecture ablation study

In this section, we design two experiments in order to evaluate the performance of the various modules that we use in the ResUNet-a architecture. In these experiments we used the FoV × 1 dataset, as this is the dataset that will be the ultimate testbed of ResUNet-a performance against other modeling frameworks. Our metric for understanding the performance gains of the various models tested is model complexity and training convergence: if model A has a greater (or equal) number of parameters than model B, and model A converges faster to optimality than model B, then it is most likely that it will also achieve the highest overall score.

In the first experiment we test the convergence properties of our architecture. In this, we are not interested in the final performance (after learning rate reduction and finetuning), which is a very time consuming operation, but in how ResUNet-a behaves during training for the same fixed set of hyperparameters and epochs. We start by training a baseline model, a modified ResUNet (Zhang et al., 2017) where, in order to keep the number of parameters identical with the atrous case, we use the same ResUNet-a building blocks with dilation rate equal to 1 for all parallel residual blocks (i.e. there are no atrous convolutions). This is similar in philosophy to the wide residual networks (Zagoruyko and Komodakis, 2016); however, there are no dropout layers. Then, we modify this baseline by increasing the dilation rate, thus adding atrous convolutions (model: ResUNet + Atrous). It should be clear that the only difference between the models ResUNet and ResUNet + Atrous is that the latter has different dilation rates than the former, i.e. they have an identical number of parameters. Then we add PSPPooling in both the middle and the end of the framework (model: ResUNet + Atrous + PSP), and finally we apply the conditioned multitasking, i.e. the full ResUNet-a model (model: ResUNet + Atrous + PSP + CMTSK). The difference in convergence rates is incremental with each module addition. This performance difference can be seen in Fig. 7 and is substantial. In Fig. 7 we plot the Matthews correlation coefficient (MCC) for all models. The MCC was calculated using the success rate over all classes. The baseline ResUNet requires approximately 120 epochs to achieve the same performance level that ResUNet-a cmtsk achieves in epoch 40. The mere change from simple (ResUNet) to atrous convolutions (model ResUNet + Atrous) almost doubles the convergence rate. The inclusion of the PSP module (both middle and end) provides additional learning capacity; however, it also comes with training instability. This is fixed by adding the conditioned multitasking module in the final model. Clearly, each module addition: (a) increases the complexity of the model, since it increases the total number of parameters, and (b) improves the convergence performance.

Next, we are interested in evaluating the importance of the PSPPooling layer. In our experiments we found this layer to be more important in the middle of the network than before the last output layers. For this purpose, we train two ResUNet-a d7 models, the d7v1 and d7v2, that are identical in all aspects except that the latter has a PSPPooling layer in the middle. Both models are trained with the same fixed set of hyperparameters (i.e. no learning rate reduction takes place during training). In Fig. 9 we show the convergence evolution of these networks. It is clear that the model with the PSPPooling layer in the middle (i.e. v2) converges much faster to optimality, despite the fact that it has greater complexity (i.e. number of parameters) than the model d7v1.

In addition to the above, we have to note that when the network performs erroneous inference, due to the PSPPooling layer at the end, this may appear in the form of square blocks, indicating the influence of the pooling area in square subregions of the output. The last PSPPooling layer is particularly problematic when dealing with regression output problems. This is the reason why we did not use it in the evaluation of the color and distance transform modules in the multitasking networks. In Fig. 8 we present two examples of erroneous inference of the last PSPPooling layer that appear in the form of square blocks. The first row corresponds to a zoomed-in region of tile 6_14, and the bottom row to a zoomed-in region of tile 6_15. From left to right: RGB bands of input image, error map, and inference map. It can be seen that the boundary of the error map has areas that appear in the form of square blocks. That is, the effect of the pooling operation at various scales can dominate the inference area.

Comparing the ResUNet-a mtsk and ResUNet-a cmtsk models (on the basis of the d7v1 feature extractor), we find that the latter demonstrates smaller variance in the values of the loss function (and, in consequence, the performance metric) during training. In Fig. 10 we present an example of the comparative training evolution of the ResUNet-a d7v1 mtsk versus the ResUNet-a d7v1 cmtsk models. It is clear that the conditioned inference model demonstrates smaller variance, and that, despite the random fluctuations of the MCC coefficient, the median performance of the conditioned multitasking model is higher than the median performance of the simple multitasking model. This helps in stabilizing the gradient updates and results in slightly better performance. We have also found that the inclusion of the identity […]
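The tile-level evaluation relies on the overlapping-window inference of Section 3.4 (reflect padding of 128 px, 256 × 256 windows moved at a stride of 64 px, per-pixel averaging of the predicted class probabilities). A minimal NumPy sketch, where model is any callable returning per-class probabilities and all names are ours:

```python
import numpy as np

def predict_tile(model, tile, win=256, stride=64, pad=128):
    # tile: (channels, H, W); reflect-pad to cover the tile borders.
    c, h, w = tile.shape
    padded = np.pad(tile, ((0, 0), (pad, pad), (pad, pad)), mode='reflect')
    probs, counts = None, np.zeros(padded.shape[1:], dtype=np.float32)
    for y in range(0, padded.shape[1] - win + 1, stride):
        for x in range(0, padded.shape[2] - win + 1, stride):
            out = model(padded[:, y:y + win, x:x + win])  # (classes, win, win)
            if probs is None:
                probs = np.zeros((out.shape[0],) + padded.shape[1:], np.float32)
            probs[:, y:y + win, x:x + win] += out
            counts[y:y + win, x:x + win] += 1.0
    probs /= np.maximum(counts, 1.0)           # average over all views
    return probs[:, pad:pad + h, pad:pad + w]  # crop back to the original tile
```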
Fig. 9. Convergence performance evaluation for the PSPPooling layer in the middle. We compare the ResUNet-a d7v1 architecture without a PSPPooling layer (blue solid line) and with a PSPPooling layer in the middle (i.e. d7v2, dashed green line). It is clear that the insertion of the PSPPooling layer in the middle of the architecture boosts the convergence performance of the network. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 11. Training evolution of the same model using three different loss functions: the Tanimoto with complement loss (solid red line, this work), the Tanimoto (solid green line), and the Dice loss (dashed blue line, Sudre et al., 2017). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

[…] the Dice loss stagnates at lower values, while the Tanimoto loss with complement converges faster to an optimal value. The difference in performance is significant: the Tanimoto loss with complement achieves, for the same number of epochs, an MCC = 85.99, while the Dice loss stagnates at MCC = 80.72. The Tanimoto loss without complement (Eq. (5)) gives a similar performance to the Tanimoto with complement; however, it converges relatively slower and demonstrates greater variance. In all experiments we performed, the Tanimoto with complement gave us the best performance.
Table 4
Potsdam comparison of results (F1 score and overall accuracy – OA) for the various ResUNet-a models trained on the FoV × 4 and FoV × 1 datasets. The highest score is marked in bold. The average F1-score was calculated using all classes except the "Background" class. The overall accuracy was calculated including the "Background" category.
Methods DataSet ImSurface Building LowVeg Tree Car Avg. F1 OA
ResUNet-a d7v1 mtsk (FoV × 4) 92.9 97.2 86.8 87.4 96.0 92.1 90.6
ResUNet-a d7v1 cmtsk (FoV × 4) 92.9 97.2 87.0 87.5 95.8 92.1 90.7
ResUNet-a d6 cmtsk (FoV × 1) 93.0 97.2 87.5 88.4 96.1 92.4 91.0
ResUNet-a d7v2 cmtsk (FoV × 1) 93.5 97.2 88.2 89.2 96.4 92.9 91.5
Fig. 12. ResUNet-a d7v1 cmtsk inference on unseen test patches of size 256 × 256 (FoV × 4 – ground sampling distance 10 cm). From left to right: RGB image, digital elevation map, ground truth, and prediction.
segments the area close to their boundary. This is partially owed to the
fact that we reduced the size of the original image, and fine details
required for the detailed extent of trees cannot be identified by the
algorithm. In fact, even for a human, the annotated boundaries of trees
are not always clear (e.g. see Fig. 12). The ResUNet-a d6 cmtsk model
provides a significant performance boost over the single task ResUNet-
a d6 model, for the classes “Bulding”, “LowVeg” and “Tree”. In these
classes it also outperforms the deeper models d7v1 (which, however, do
not include the PSPPooling layer at the end of the encoder). This is
due to the explicit requirement for the algorithm to reconstruct also the
boundaries and the distance map and use them to further refine the
segmentation mask. As a result, the algorithm gains a “better under-
standing” of the fine details of objects, even if in some cases it is dif-
ficult for humans to clearly identify their boundaries.
The ResUNet-a d7v1 cmtsk model demonstrates slightly increased performance over all of the tested models (Table 4), although the differences are marginal for the FoV × 4 dataset and vary between classes. In addition, there are some annotation errors in the dataset that eventually set an upper bound on the achievable performance. In Fig. 12 we give an example of inference on 256 × 256 image patches of unseen test data. In Fig. 13 we provide an example of the inference performed by ResUNet-a d7v1 cmtsk for all the predictive tasks (boundary, distance transform, segmentation, and identity reconstruction). In all rows, the left column corresponds to the same ground truth image. In the first row, from left to right: input image, ground truth segmentation mask, inferred segmentation mask. Second row, middle and right: ground truth boundary and an inferred heat map of the algorithm's confidence in characterizing pixels as boundaries; the fainter the boundaries appear, the less confident the algorithm is in their characterization as boundaries. Third row, middle and right: ground truth distance map and inferred distance map. Last row, middle: reconstructed image in HSV space; right: average error over all channels between the original RGB image and the reconstructed one. The reconstruction is excellent, suggesting that the Tanimoto loss can be used for identity mappings whenever these are required (e.g. as a means of regularization, or for training Generative Adversarial Networks (Goodfellow et al., 2014), e.g. Zhu et al. (2017)).
Fig. 13. ResUNet-a d7v1 cmtsk all-tasks inference on unseen test patches of size 256 × 256 for the FoV × 4 dataset (ground sampling distance 10 cm). From left to right, top row: input image, ground truth segmentation mask, predicted segmentation mask. Second row: input image, ground truth boundaries, predicted boundaries (confidence). Third row: input image, ground truth distance map, inferred distance map. Bottom row: input image, reconstructed image, difference between input and predicted image.
Finally, in Table 4, we provide a relative comparison between models trained on the FoV × 4 and FoV × 1 versions of the dataset. Clearly, there is a performance boost when using the higher resolution dataset (FoV × 1) for the classes that require finer detail. However, for the class “Building” the score is actually better with the wider field of view (FoV × 4, model d6 cmtsk) dataset.
Table 5
Potsdam comparison of results (based on per-class F1 score) with other authors. Best values are marked in bold, second best values are underlined, third best values are in square brackets. Models trained with FoV × 1 were trained on 256 × 256 patches extracted from the original resolution images.
Methods ImSurface Building LowVeg Tree Car Avg. F1 OA
UZ_1 (Volpi and Tuia, 2017) 89.3 95.4 81.8 80.5 86.5 86.7 85.8
RIT_L7 (Liu et al., 2017b) 91.2 94.6 85.1 85.1 92.8 89.8 88.4
RIT_4 (Piramanayagam et al., 2018) 92.6 97.0 86.9 87.4 95.2 91.8 90.3
DST_5 (Sherrah, 2016) 92.5 96.4 86.7 88.8 94.7 91.7 90.3
CAS_Y3 (ISPRS) 92.2 95.7 87.2 87.6 95.6 91.7 90.1
CASIA2 (Liu et al., 2018) 93.3 97.0 [87.7] [88.4] 96.2 92.5 91.1
DPN_MFFL (Pan et al., 2018b) 92.4 [96.4] 87.8 88.0 95.7 92.1 90.4
HSN + OI + WBP (Liu et al., 2017a) 91.8 95.7 84.4 79.6 88.3 87.9 89.4
ResUNet-a d6 cmtsk (FoV × 1) [93.0] 97.2 87.5 [88.4] [96.1] [92.4] [91.0]
ResUNet-a d7v2 cmtsk (FoV × 1) 93.5 97.2 88.2 89.2 96.4 92.9 91.5
Table 6
Potsdam summary confusion matrix over all test tiles for ground truth masks that do not include the boundary. The results correspond to the best model, ResUNet-a d7v2 cmtsk, trained on the FoV × 1 dataset. The overall accuracy achieved is 91.5%. (Rows: predicted classes; columns: reference classes.)
the average F1 score and overall accuracy for the ResUNet-a d6 cmtsk and ResUNet-a d7v2 cmtsk models, as well as results from other authors. ResUNet-a d6 performs on par with other state-of-the-art modeling frameworks and ranks 3rd overall (average F1). It should be stressed that, for the majority of the results, the performance differences are marginal. Going deeper, the ResUNet-a d7v2 model ranks 1st among the representative sample of competing models in all classes, thus clearly demonstrating an improvement over the state of the art. In Table 6 we provide the confusion matrix, over all test tiles, for this particular model.
It should be noted that some of the contributors (e.g., CASIA2, RIT_4, DST_5) in the ISPRS competition used networks with weights pre-trained on external large data sets (e.g. ImageNet, Deng et al., 2009) followed by fine-tuning, i.e. a methodology called transfer learning (Pan and Yang, 2010; see also Penatti et al. (2015), Xie et al. (2015) for remote sensing applications). In particular, CASIA2, which has the 2nd highest overall score, used as a basis a state-of-the-art pre-trained ResNet101 (He et al., 2016) network. In contrast, ResUNet-a was trained from random weight initialization only on the ISPRS Potsdam data set. Although it has been demonstrated that such a strategy does not influence the final performance, i.e. it is possible to achieve the same performance without pre-trained weights (He et al., 2018), this comes at the expense of a very long training time.
To visualize the performance of ResUNet-a, we generated error maps that indicate incorrect (correct) classification in red (green). All summary statistics and error maps were created using the software provided on the ISPRS competition website. For all of our inference results, we used the ground truth masks with eroded boundaries, as suggested by the curators of the ISPRS Potsdam data set (ISPRS). This allows interested readers to form a clear picture of the strengths and weaknesses of our algorithm in comparison with online published results⁵. In Fig. 15 we provide the input image (left column), the error map between the inferred and ground truth masks (middle column) and the inference (right column) for a sample of four test tiles. In Appendix C we present the evaluation results for the rest of the test TOP tiles, per class. In all of these figures, for each row, from left to right: original image tile, error map and inference using our best model (ResUNet-a d7v2 cmtsk).
⁵ For comparison, competition results can be found online.
Fig. 15. ResUNet-a best model results for tiles 2–13, 2–14, 3–13 and 3–14. From left to right: input image, difference between ground truth and predictions, inference map. Image resolution: 6 k × 6 k, ground sampling distance of 5 cm.
7. Conclusions
In this work, we present a new deep learning modeling framework for semantic segmentation of high resolution aerial images. The framework consists of a novel multitasking deep learning architecture for semantic segmentation and a new variant of the Dice loss that we term the Tanimoto loss.
Our deep learning architecture, ResUNet-a, is based on the encoder/decoder paradigm, where standard convolutions are replaced with ResNet units that contain multiple parallel atrous convolutions. Pyramid scene parsing pooling is included in the middle and at the end of the network. The best-performing variants of our models are conditioned multitasking models which predict, along with the segmentation mask, the boundaries of the various classes, the distance transform (which provides information on the topological connectivity of the objects) and the identity reconstruction of the input image. The additionally inferred tasks are re-used internally by the network before the final segmentation mask is produced. That is, the final segmentation mask is conditioned on the inference result of the boundaries of the objects as well as the distance transform of their segmentation mask.
We show experimentally that the conditioned multitasking improves the performance of the inferred semantic segmentation classes. The ground truth labels that are used during training for the boundaries, as well as the distance transform, can both be calculated easily from the ground truth segmentation mask using standard computer vision software (OPENCV; see Section B for a PYTHON implementation).
We analyze the performance of various flavours of the Dice loss and introduce a novel variant of it as a loss function, the Tanimoto loss. This loss can also be used for regression problems, an appealing property that makes it useful for multitasking problems, since it results in balanced gradients for all tasks during training. We show experimentally that the Tanimoto loss speeds up training convergence and behaves well in the presence of heavily imbalanced data sets.
The performance of our framework is evaluated on the 2D semantic segmentation ISPRS Potsdam data set. Our best model, ResUNet-a d7v2, achieves top-rank performance in comparison with other published results (Table 5) and demonstrates a clear improvement over the state of the art. The combination of ResUNet-a conditioned multitasking with the proposed loss function is a reliable solution for performant semantic segmentation tasks.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
The authors acknowledge the support of the Scientific Computing team of CSIRO, and in particular Peter H. Campbell and Ondrej Hlinka. Their contribution was substantial in overcoming many technical difficulties of distributed GPU computing. The authors are also grateful to John Taylor for his help in understanding and implementing distributed optimization using HOROVOD (Sergeev and Balso, 2018). The authors acknowledge the support of the MXNET community, and in particular Thomas Delteil, Sina Afrooze and Thom Lane. The authors acknowledge the provision of the Potsdam ISPRS dataset by BSF Swissphoto⁶. The authors acknowledge the contribution of the anonymous referees, whose questions helped to improve the quality of the manuscript.
Appendix A. Training details
ResUNet-a was built and trained using the MXNET deep learning library (Chen et al., 2015), under the GLUON API. Each of the models trained on the FoV × 4 dataset was trained with a batch size of 256 on a single node containing 4 NVIDIA Tesla P100 GPUs in CSIRO HPC facilities. Due to the complexity of the network, the batch size for a single GPU iteration cannot be made larger than 10 (per GPU). In order to increase the effective batch size we used manual gradient aggregation⁷ (see the sketch below). For the models trained on the FoV × 1 dataset we used a batch size of 480 in order to speed up the computation. These were trained in a distributed scheme, using the ring-allreduce algorithm, and in particular its implementation in HOROVOD (Sergeev and Balso, 2018) for the MXNET (Chen et al., 2015) deep learning library. The optimal learning rate for all runs was set by the methodology developed in Smith (2018), i.e. by monitoring the training loss for a continuously increasing learning rate, starting from a very low value. An example is shown in Fig. A.16: the optimal learning rate is approximately the point of steepest descent of the loss function. This process completes in approximately one epoch and can be applied in a distributed scheme as well. We found it more useful than the linear learning rate scaling that is used for large batch sizes (Goyal et al., 2017) in distributed optimization.
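As an illustration, here is a minimal sketch of manual gradient aggregation with the GLUON API; the names net, loss_fn and data_iter are hypothetical placeholders, and the aggregation count is illustrative:

from mxnet import autograd, gluon

AGG_STEPS = 8  # number of small batches to aggregate (illustrative value)

# Accumulate gradients across backward passes instead of overwriting them.
net.collect_params().setattr('grad_req', 'add')
trainer = gluon.Trainer(net.collect_params(), 'adam', {'learning_rate': 0.001})

for i, (data, label) in enumerate(data_iter):
    with autograd.record():
        loss = loss_fn(net(data), label)
    loss.backward()  # gradients are summed into the parameters' .grad arrays
    if (i + 1) % AGG_STEPS == 0:
        # Normalize the update by the effective (aggregated) batch size.
        trainer.step(AGG_STEPS * data.shape[0])
        for param in net.collect_params().values():
            param.zero_grad()  # reset the accumulated gradients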
For all models, we used the Adam (Kingma and Ba, 2014) optimizer, with an initial learning rate of 0.001 (the initial learning rate can also be set higher for this dataset, see Fig. A.16) and momentum parameters (β₁, β₂) = (0.9, 0.999). The learning rate was reduced by an order of magnitude whenever the validation loss stopped decreasing. Overall, we reduced the learning rate 3 times. We have also experimented with smaller batch sizes. In particular, with a batch size of 32, the training is unstable. This is mainly because we used 4 GPUs for training, so the batch size per GPU is 8, which is not sufficient for the Batch Normalization layers that use only the per-GPU data to estimate the running means of their parameters. When we experimented with synchronized Batch Normalization layers (Ioffe and Szegedy, 2015; Zhang et al., 2018), the stability of the training increased dramatically even with a batch size as small as 32. However, due to the GPU synchronization, this was a slow operation that proved to be impractical for our purposes.
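A compact sketch of this optimizer configuration and the plateau-based learning rate reduction follows; train_one_epoch, evaluate_validation_loss, max_epochs and the patience threshold are our own hypothetical placeholders:

from mxnet import gluon

# Adam with the stated hyper-parameters: lr = 0.001, (beta1, beta2) = (0.9, 0.999).
trainer = gluon.Trainer(net.collect_params(), 'adam',
                        {'learning_rate': 0.001, 'beta1': 0.9, 'beta2': 0.999})

best_val, patience, num_drops = float('inf'), 0, 0
for epoch in range(max_epochs):
    train_one_epoch(net, trainer)             # hypothetical training helper
    val_loss = evaluate_validation_loss(net)  # hypothetical validation helper
    if val_loss < best_val:
        best_val, patience = val_loss, 0
    elif num_drops < 3:                       # the learning rate is reduced 3 times
        patience += 1
        if patience >= 5:                     # patience value is an assumption
            # Reduce the learning rate by an order of magnitude on a plateau.
            trainer.set_learning_rate(trainer.learning_rate * 0.1)
            patience, num_drops = 0, num_drops + 1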
A software implementation of the ResUNet-a models used in this work can be found on GitHub⁸.
⁶ http://www.bsf-swissphoto.com/unternehmen/ueber_uns_bsf.
⁷ A short tutorial on manual gradient aggregation with the GLUON API in the MXNET framework can be found online.
⁸ https://github.com/feevos/resuneta
Fig. A.16. Learning rate finder process for the FoV × 1 dataset. The model used was ResUNet-a d6 cmtsk. For this particular training profile, our learning rate choice was 0.001, although slightly higher values are possible (see Smith, 2018 for details).
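For completeness, a minimal sketch of the learning rate range test behind Fig. A.16, assuming the same hypothetical net, loss_fn, trainer and data_iter as in the sketches above; the start/end rates and num_batches_per_epoch are illustrative:

import numpy as np
from mxnet import autograd

# Increase the learning rate geometrically over roughly one epoch and record
# the training loss; the optimal rate sits near the steepest descent of the curve.
learning_rates = np.geomspace(1e-6, 1.0, num=num_batches_per_epoch)
losses = []
for lr, (data, label) in zip(learning_rates, data_iter):
    trainer.set_learning_rate(float(lr))
    with autograd.record():
        loss = loss_fn(net(data), label)
    loss.backward()
    trainer.step(data.shape[0])
    losses.append(float(loss.mean().asscalar()))
# Plot losses against learning_rates (log x-axis) to obtain a curve like Fig. A.16.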
Appendix B. Boundary and distance transform calculation
The boundaries and the distance transform can be estimated efficiently from the ground truth segmentation mask by PYTHON software routines such as those listed below. The input, labels, is a binary image in which 1 designates on-class and 0 off-class pixels. The labels array is two dimensional (i.e. a single-channel image of shape (Height, Width) – no channel dimension). In a multiclass context, the segmentation mask must be provided in one-hot encoding and the routines applied iteratively per channel.
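A minimal sketch of such routines with OPENCV and NUMPY; the erosion-based boundary extraction, the kernel size and the normalization of the distance map to [0, 1] are our assumptions, not necessarily the exact implementation used here:

import cv2
import numpy as np

def distance_map(labels):
    # Normalized distance transform of a binary (Height, Width) mask.
    dist = cv2.distanceTransform(labels.astype(np.uint8), cv2.DIST_L2, 3)
    if dist.max() > 0:
        dist = dist / dist.max()  # scale to [0, 1], the domain the Tanimoto loss expects
    return dist

def boundary_map(labels, kernel_size=3):
    # Object boundaries as the difference between a mask and its erosion.
    mask = labels.astype(np.uint8)
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    return mask - cv2.erode(mask, kernel, iterations=1)

In a multiclass setting, both routines are applied independently to each channel of the one-hot encoded segmentation mask.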
Appendix C. Classification results for all test tiles
In this appendix, we present classification results and error maps for all the test TOP tiles of the Potsdam ISPRS dataset (see Figs. C.17–C.19).
Fig. C.17. ResUNet-a best model results for tiles 4–13, 4–14, 4–15, 5–13. From left to right: input image, difference between ground truth and predictions, inference map. Image resolution: 6 k × 6 k, ground sampling distance of 5 cm.
Fig. C.18. As Fig. C.17 for tiles 5–14, 5–15, 6–13, 6–14.
Fig. C.19. As Fig. C.17 for tiles 6–15, 7–13.
Supplementary data associated with this article can be found, in the online version, at https://doi.org/10.1016/j.isprsjprs.2020.01.013.
He, K., Gkioxari, G., Dollár, P., Girshick, R.B., 2017. Mask R-CNN. CoRR abs/1703.06870. http://arxiv.org/abs/1703.06870.
He, K., Zhang, X., Ren, S., Sun, J., 2014. Spatial pyramid pooling in deep convolutional networks for visual recognition. CoRR abs/1406.4729. http://arxiv.org/abs/1406.4729.
He, K., Zhang, X., Ren, S., Sun, J., 2015. Deep residual learning for image recognition. CoRR abs/1512.03385. http://arxiv.org/abs/1512.03385.
He, K., Zhang, X., Ren, S., Sun, J., 2016. Identity mappings in deep residual networks. CoRR abs/1603.05027. http://arxiv.org/abs/1603.05027.
Huang, G., Liu, Z., Weinberger, K.Q., 2016. Densely connected convolutional networks. CoRR abs/1608.06993. http://arxiv.org/abs/1608.06993.
Ioffe, S., Szegedy, C., 2015. Batch normalization: accelerating deep network training by reducing internal covariate shift. CoRR abs/1502.03167. http://arxiv.org/abs/1502.03167.
ISPRS. International Society for Photogrammetry and Remote Sensing (ISPRS) and BSF Swissphoto: WG3 Potsdam overhead data. http://www2.isprs.org/commissions/comm3/wg4/tests.html.
Jaderberg, M., Simonyan, K., Zisserman, A., Kavukcuoglu, K., 2015. Spatial transformer networks. CoRR abs/1506.02025. http://arxiv.org/abs/1506.02025.
Kervadec, H., Bouchtiba, J., Desrosiers, C., Granger, E., Dolz, J., Ayed, I.B., 2018. Boundary loss for highly unbalanced segmentation. arXiv:1812.07032.
Kingma, D.P., Ba, J., 2014. Adam: a method for stochastic optimization. CoRR abs/1412.6980. http://arxiv.org/abs/1412.6980.
Lambert, M.J., Waldner, F., Defourny, P., 2016. Cropland mapping over Sahelian and Sudanian agrosystems: a knowledge-based approach using PROBA-V time series at 100-m. Remote Sens. 8, 232.
Längkvist, M., Kiselev, A., Alirezaie, M., Loutfi, A., 2016. Classification and segmentation of satellite orthoimagery using convolutional neural networks. Remote Sens. 8, 329.
LeCun, Y., Boser, B.E., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W.E., Jackel, L.D., 1989. Backpropagation applied to handwritten zip code recognition. Neural Comput. 1, 541–551. https://doi.org/10.1162/neco.1989.1.4.541.
Li, E., Femiani, J., Xu, S., Zhang, X., Wonka, P., 2015. Robust rooftop extraction from visible band images using higher order CRF. IEEE Trans. Geosci. Remote Sens. 53, 4483–4495.
Li, S., Jiao, J., Han, Y., Weissman, T., 2016. Demystifying ResNet. CoRR abs/1611.01186. http://arxiv.org/abs/1611.01186.
Li, X., Shao, G., 2014. Object-based land-cover mapping with high resolution aerial photography at a county scale in midwestern USA. Remote Sens. 6, 11372–11390.
Lin, T., Goyal, P., Girshick, R.B., He, K., Dollár, P., 2017. Focal loss for dense object detection. CoRR abs/1708.02002. http://arxiv.org/abs/1708.02002.
Liu, Y., Fan, B., Wang, L., Bai, J., Xiang, S., Pan, C., 2018. Semantic labeling in very high resolution images via a self-cascaded convolutional neural network. ISPRS J. Photogramm. Remote Sens. 145, 78–95.
Liu, Y., Minh Nguyen, D., Deligiannis, N., Ding, W., Munteanu, A., 2017a. Hourglass-shape network based semantic segmentation for high resolution aerial imagery. Remote Sens. 9. https://doi.org/10.3390/rs9060522.
Liu, Y., Piramanayagam, S., Monteiro, S.T., Saber, E., 2017b. Dense semantic labeling of very-high-resolution aerial imagery and lidar with fully-convolutional neural networks and higher-order CRFs. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Honolulu, USA.
Long, J., Shelhamer, E., Darrell, T., 2014. Fully convolutional networks for semantic segmentation. CoRR abs/1411.4038. http://arxiv.org/abs/1411.4038.
Lu, X., Yuan, Y., Zheng, X., 2017. Joint dictionary learning for multispectral change detection. IEEE Trans. Cybernetics 47, 884–897.
Ma, L., Liu, Y., Zhang, X., Ye, Y., Yin, G., Johnson, B.A., 2019. Deep learning in remote sensing applications: a meta-analysis and review. ISPRS J. Photogramm. Remote Sens. 152, 166–177. https://doi.org/10.1016/j.isprsjprs.2019.04.015.
Marmanis, D., Schindler, K., Wegner, J.D., Galliani, S., Datcu, M., Stilla, U., 2018. Classification with an edge: improving semantic image segmentation with boundary detection. ISPRS J. Photogramm. Remote Sens. 135, 158–172.
Marmanis, D., Wegner, J.D., Galliani, S., Schindler, K., Datcu, M., Stilla, U., 2016. Semantic segmentation of aerial images with an ensemble of CNNs.
Matikainen, L., Karila, K., 2011. Segment-based land cover mapping of a suburban area – comparison of high-resolution remotely sensed datasets using classification trees and test field points. Remote Sens. 3, 1777–1804.
Matthews, B., 1975. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA) – Protein Structure 405, 442–451. https://doi.org/10.1016/0005-2795(75)90109-9.
Milletari, F., Navab, N., Ahmadi, S., 2016. V-Net: fully convolutional neural networks for volumetric medical image segmentation. CoRR abs/1606.04797. http://arxiv.org/abs/1606.04797.
Myint, S.W., Gober, P., Brazel, A., Grossman-Clarke, S., Weng, Q., 2011. Per-pixel vs. object-based classification of urban land cover extraction using high spatial resolution imagery. Remote Sens. Environ. 115, 1145–1161.
Novikov, A.A., Major, D., Lenis, D., Hladuvka, J., Wimmer, M., Bühler, K., 2017. Fully convolutional architectures for multi-class segmentation in chest radiographs. CoRR abs/1701.08816. http://arxiv.org/abs/1701.08816.
Odena, A., Dumoulin, V., Olah, C., 2016. Deconvolution and checkerboard artifacts. Distill. http://distill.pub/2016/deconv-checkerboard/.
Paisitkriangkrai, S., Sherrah, J., Janney, P., van den Hengel, A., 2016. Semantic labeling of aerial and satellite imagery. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 9, 2868–2881.
Pan, S.J., Yang, Q., 2010. A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22, 1345–1359. https://doi.org/10.1109/TKDE.2009.191.
Pan, X., Gao, L., Marinoni, A., Zhang, B., Yang, F., Gamba, P., 2018a. Semantic labeling of high resolution aerial imagery and lidar data with fine segmentation network. Remote Sens. 10. https://doi.org/10.3390/rs10050743.
Pan, X., Gao, L., Zhang, B., Yang, F., Liao, W., 2018b. High-resolution aerial imagery semantic labeling with dense pyramid network. Sensors 18. https://doi.org/10.3390/s18113774.
Penatti, O.A., Nogueira, K., dos Santos, J.A., 2015. Do deep features generalize from everyday objects to remote sensing and aerial scenes domains? In: 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 44–51. https://doi.org/10.1109/CVPRW.2015.7301382.
Piramanayagam, S., Saber, E., Schwartzkopf, W., Koehler, F.W., 2018. Supervised classification of multisensor remotely sensed images using a deep learning framework. Remote Sens. 10. https://doi.org/10.3390/rs10091429.
Rawat, W., Wang, Z., 2017. Deep convolutional neural networks for image classification: a comprehensive review. Neural Comput. 29, 2352–2449. https://doi.org/10.1162/neco_a_00990. PMID: 28599112.
Ronneberger, O., Fischer, P., Brox, T., 2015. U-Net: convolutional networks for biomedical image segmentation. CoRR abs/1505.04597. http://arxiv.org/abs/1505.04597.
Ruder, S., 2017. An overview of multi-task learning in deep neural networks. CoRR abs/1706.05098. http://arxiv.org/abs/1706.05098.
Sergeev, A., Balso, M.D., 2018. Horovod: fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799.
Sherrah, J., 2016. Fully convolutional networks for dense semantic labelling of high-resolution aerial imagery. CoRR abs/1606.02585. http://arxiv.org/abs/1606.02585.
Smith, L.N., 2018. A disciplined approach to neural network hyper-parameters: part 1 – learning rate, batch size, momentum, and weight decay. CoRR abs/1803.09820. http://arxiv.org/abs/1803.09820.
Sørensen, T., 1948. A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on Danish commons. Biol. Skr. 5, 1–34.
Sudre, C.H., Li, W., Vercauteren, T., Ourselin, S., Cardoso, M.J., 2017. Generalised Dice overlap as a deep learning loss function for highly unbalanced segmentations. CoRR abs/1707.03237. http://arxiv.org/abs/1707.03237.
Taghanaki, S.A., Abhishek, K., Cohen, J.P., Cohen-Adad, J., Hamarneh, G., 2019. Deep semantic segmentation of natural and medical images: a review. arXiv:1910.07655.
Vadivel, A., Sural, S., Majumdar, A.K., 2005. Human color perception in the HSV space and its application in histogram generation for image retrieval. https://doi.org/10.1117/12.586823.
Vincent, L., Soille, P., 1991. Watersheds in digital spaces: an efficient algorithm based on immersion simulations. IEEE Trans. Pattern Anal. Mach. Intell. 583–598.
Volpi, M., Tuia, D., 2017. Dense semantic labeling of subdecimeter resolution images with convolutional neural networks. IEEE Trans. Geosci. Remote Sens. 55, 881–893.
Waldner, F., Hansen, M.C., Potapov, P.V., Löw, F., Newby, T., Ferreira, S., Defourny, P., 2017. National-scale cropland mapping based on spectral-temporal features and outdated land cover information. PLoS One 12, e0181911.
Wen, D., Huang, X., Liu, H., Liao, W., Zhang, L., 2017. Semantic classification of urban trees using very high resolution satellite imagery. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 10, 1413–1424.
Xie, S., Tu, Z., 2015. Holistically-nested edge detection. CoRR abs/1504.06375. http://arxiv.org/abs/1504.06375.
Xie, S.M., Jean, N., Burke, M., Lobell, D.B., Ermon, S., 2015. Transfer learning from deep features for remote sensing and poverty mapping. CoRR abs/1510.00098. http://arxiv.org/abs/1510.00098.
Yang, H., Wu, P., Yao, X., Wu, Y., Wang, B., Xu, Y., 2018. Building extraction in very high resolution imagery by dense-attention networks. Remote Sens. 10. https://doi.org/10.3390/rs10111768.
Zagoruyko, S., Komodakis, N., 2016. Wide residual networks. CoRR abs/1605.07146. http://arxiv.org/abs/1605.07146.
Zhang, H., Dana, K., Shi, J., Zhang, Z., Wang, X., Tyagi, A., Agrawal, A., 2018. Context encoding for semantic segmentation. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Zhang, Q., Seto, K.C., 2011. Mapping urbanization dynamics at regional and global scales using multi-temporal DMSP/OLS nighttime light data. Remote Sens. Environ. 115, 2320–2329.
Zhang, Z., Liu, Q., Wang, Y., 2017. Road extraction by deep residual U-Net. CoRR abs/1711.10684. http://arxiv.org/abs/1711.10684.
Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J., 2017a. Pyramid scene parsing network. In: CVPR.
Zhao, W., Du, S., Wang, Q., Emery, W.J., 2017b. Contextually guided very-high-resolution imagery classification with semantic segments. ISPRS J. Photogramm. Remote Sens. 132, 48–60. https://doi.org/10.1016/j.isprsjprs.2017.08.011.
Zhu, J., Park, T., Isola, P., Efros, A.A., 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. CoRR abs/1703.10593. http://arxiv.org/abs/1703.10593.
Zhu, X.X., Tuia, D., Mou, L., Xia, G., Zhang, L., Xu, F., Fraundorfer, F., 2017. Deep learning in remote sensing: a comprehensive review and list of resources. IEEE Geosci. Remote Sens. Mag. 5, 8–36. https://doi.org/10.1109/MGRS.2017.2762307.