Paper 3

Scene understanding of high resolution aerial images is of great importance for the task of automated monitoring in various remote sensing applications. Due to the large within-class and small between-class variance in pixel values of objects of interest, this remains a challenging task. In recent years, deep convolutional neural networks have started being used in remote sensing applications and demonstrate state of the art performance for pixel level classification of objects. Here we propose a reliable framework for performant results for the task of semantic segmentation of monotemporal very high resolution aerial images. Our framework consists of a novel deep learning architecture, ResUNet-a, and a novel loss function based on the Dice loss. ResUNet-a uses a UNet encoder/decoder backbone, in combination with residual connections, atrous convolutions, pyramid scene parsing pooling and multi-tasking inference. ResUNet-a infers sequentially the boundary of the objects, the distance transform of the segmentation mask, the segmentation mask and a colored reconstruction of the input. Each of the tasks is conditioned on the inference of the previous ones, thus establishing a conditioned relationship between the various tasks, as this is described through the architecture's computation graph. We analyse the performance of several flavours of the Generalized Dice loss for semantic segmentation, and we introduce a novel variant loss function for semantic segmentation of objects that has excellent convergence properties and behaves well even under the presence of highly imbalanced classes. The performance of our modeling framework is evaluated on the ISPRS 2D Potsdam dataset. Results show state-of-the-art performance with an average F1 score of 92.9% over all classes for our best model.

Keywords: Convolutional neural network, Loss function, Architecture, Data augmentation, Very high spatial resolution
https://doi.org/10.1016/j.isprsjprs.2020.01.013
[…] specified parameters (Blaschke et al., 2014). Popular image segmentation algorithms in remote sensing include watershed segmentation (Vincent and Soille, 1991), multi-resolution segmentation (Baatz and Schäpe, 2000) and mean-shift segmentation (Comaniciu and Meer, 2002). In addition, GEOBIA also allows the computation of additional attributes related to the texture, context, and shape of the objects, which can then be added to the classification feature set. However, there is no universally-accepted method to identify the segmentation parameters that provide optimal pixel grouping, which implies that GEOBIA is still highly interactive and includes subjective trial-and-error methods and arbitrary decisions. Furthermore, image segmentation might fail to simultaneously address the wide range of object sizes that one typically encounters in urban landscapes, ranging from finely structured objects such as cars and trees to larger objects such as buildings. Another drawback is that GEOBIA relies on pre-selected features for which the maximum attainable accuracy is a priori unknown. While several methods have been devised to extract and select features, these methods are not themselves learned from the data, and are thus potentially sub-optimal.

In recent years, deep learning methods and Convolutional Neural Networks (CNNs) in particular (LeCun et al., 1989) have surpassed traditional methods in various computer vision tasks, such as object detection, semantic, and instance segmentation (see Rawat and Wang, 2017, for a comprehensive review). Some of the key advantages of CNN-based algorithms are that they provide end-to-end solutions that require minimal feature engineering and offer greater generalization capabilities. They also perform object-based classification, i.e., they take into account features that characterize entire image objects, thereby reducing the salt-and-pepper effect that affects conventional classifiers.

Our approach to annotate image pixels with class labels is object-based, that is, the algorithm extracts characteristic features from whole (or parts of) objects that exist in images, such as cars, trees, or corners of buildings, and assigns a vector of class probabilities to each pixel. In contrast, using standard classifiers such as random forests, the probability of each class per pixel is based on features inherent in the spectral signature only. Features based on spectral signatures contain less information than features based on objects. For example, looking at a car we understand not only its spectral features (color) but also how these vary, as well as the extent these occupy in an image. In addition, we understand that it is more probable for a car to be surrounded by pixels belonging to a road, and less probable to be surrounded by pixels belonging to buildings. In the field of computer vision, there is a vast literature on various modules used in convolutional neural networks that make use of this idea of "per object classification". These modules, such as atrous convolutions (Chen et al., 2016) and pyramid pooling (He et al., 2014; Zhao et al., 2017a), boost the algorithmic performance on semantic segmentation tasks. In addition, after the residual networks era (He et al., 2015) it is now possible to train deeper neural networks, avoiding to a great extent the problem of vanishing (or exploding) gradients.

Here, we introduce a novel Fully Convolutional Network (FCN) for semantic segmentation, termed ResUNet-a. This network combines ideas distilled from computer vision applications of deep learning, and demonstrates competitive performance. In addition, we describe a modeling framework consisting of a new loss function that behaves well for semantic segmentation problems with class imbalance as well as for regression problems. In summary, the main contributions of this paper are the following:

1. A novel architecture for understanding and labeling very high resolution images for the task of semantic segmentation. The architecture uses a UNet (Ronneberger et al., 2015) encoder/decoder backbone, in combination with residual connections (He et al., 2016), atrous convolutions (Chen et al., 2016, 2017), pyramid scene parsing pooling (Zhao et al., 2017a) and multi-tasking inference (Ruder, 2017). We present two variants of the basic architecture, a single-task and a multi-task one.
2. We analyze the performance of various flavours of the Dice coefficient for semantic segmentation. Based on our findings, we introduce a variant of the Dice loss function that speeds up the convergence of semantic segmentation tasks and improves performance. Our results indicate that the new loss function behaves well even when there is a large class imbalance. This loss can also be used for continuous variables when the target domain of values is in the range [0, 1].

In addition, we also present a data augmentation methodology, where the input is viewed at multiple scales during training by the algorithm, that improves performance and avoids overfitting. The performance of ResUNet-a was tested using the Potsdam data set made available through the ISPRS competition (ISPRS). Validation results show that ResUNet-a achieves state-of-the-art results.

This article is organized as follows. In Section 2 we provide a short review of related work on the topic of semantic segmentation, focused on the field of remote sensing. In Section 3, we detail the model architecture and the modeling framework. Section 4 describes the data set we used for training our algorithm. In Section 5 we provide an experimental analysis that justifies the design choices for our modeling framework. Finally, Section 6 presents the performance evaluation of our algorithm and comparison with other published results. Readers are referred to Section A for a description of our software implementation and hardware configurations, and to Section C for the full error maps on unseen test data.

2. Related work

The task of semantic segmentation has attracted significant interest in recent years, not only in the computer vision community but also in other disciplines (e.g. biomedical imaging, remote sensing) where automated annotation of images is an important process. In particular, specialized techniques have been developed across different disciplines, since there are task-specific peculiarities that the community of computer vision does not have to address (and vice versa).

Starting from the computer vision community, when first introduced, Fully Convolutional Networks (hereafter FCN) for semantic segmentation (Long et al., 2014) improved the state of the art by a significant margin (20% relative improvement over the state of the art on the PASCAL VOC (Everingham et al., 2010) 2011 and 2012 test sets). The authors replaced the last fully connected layers with convolutional layers. The original resolution was achieved with a combination of upsampling and skip connections. Additional improvements have been presented with the use of deeplab models (Chen et al., 2016, 2017), which first showcased the importance of atrous convolutions for the task of semantic segmentation. Their model also uses a conditional random field as a post-processing step in order to refine the final segmentation. A significant contribution in the field of computer vision came from the community of biomedical imaging and in particular the U-Net architecture (Ronneberger et al., 2015), which introduced the encoder-decoder paradigm for upsampling gradually from lower size features to the original image size. Currently, the state of the art on the computer vision datasets is considered to be Mask R-CNN (He et al., 2017), which performs various tasks (object localization, semantic segmentation, instance segmentation, pose estimation etc.). A key element of the success of this architecture is its multitasking nature.

One of the major advantages of CNNs over traditional classification methods (e.g. random forests) is their ability to process input data at multiple context levels. This is achieved through the downsampling operations that summarize features. However, this advantage in feature extraction needs to be matched with a proper upsampling method, to retain information from all spatial resolution contexts and produce fine boundary layers. There has been a quick uptake of the approach in
the remote sensing community, and various solutions based on deep learning have been presented recently (e.g. Sherrah, 2016; Audebert et al., 2016, 2017, 2018; Längkvist et al., 2016; Li et al., 2015; Li and Shao, 2014; Volpi and Tuia, 2017; Liu et al., 2018; Liu et al., 2017b, 2018a, 2017a; Pan et al., 2018b; Marmanis et al., 2016, 2018; Wen et al., 2017; Zhao et al., 2017b). A comprehensive review of deep learning applications in the field of remote sensing can be found in Zhu et al. (2017), Ma et al. (2019), Gu et al. (2019).

Discussing in more detail some of the most relevant approaches to our work, Sherrah (2016) utilized the FCN architecture with a novel no-downsampling approach based on atrous convolutions to mitigate this problem. The summary pooling operation was traded for atrous convolutions, for filter processing at different scales. The best performing architectures from their experiments were the ones using pretrained convolution networks. The loss used was categorical cross-entropy.

Liu et al. (2017a) introduced the Hourglass-shape network for semantic segmentation on VHR images, which included an encoder-decoder style network utilizing inception-like modules. Their encoder-decoder style departed from the UNet backbone, in that they did not use features from all spatial contexts of the encoder in the decoder branch. Also, their decoder branch is not symmetric to the encoder. The building blocks of the encoder are inception modules. Feature upsampling takes place with the use of transpose convolutions. The loss used was weighted binary cross entropy.

Emphasizing the importance of using the information from the boundaries of objects, Marmanis et al. (2018) utilized the Holistically-Nested Edge Detection network (Xie and Tu, 2015, HED) for predicting boundaries of objects. The loss used for the boundaries was a Euclidean distance regression loss. The estimated boundaries were then concatenated with image features and provided as input into another CNN segmentation network, for the final classification of pixels. For the CNN segmentation network, they experimented with two architectures, the SegNet (Badrinarayanan et al., 2015) and a Fully Convolutional Network presented in Marmanis et al. (2016) that uses weights from pretrained architectures. One of the key differences of our approach for boundary detection from Marmanis et al. (2018) is that the boundary prediction happens at the end of our architecture; therefore the request for boundary prediction affects all features, since the boundaries are strongly correlated with the extent of the predicted classes. In contrast, in Marmanis et al. (2018), the boundaries are fed as input to the segmentation branch of their network, i.e. the segmentation part of their network uses them as additional input. Another difference is that we do not use weights from pretrained networks.

Pan et al. (2018b) presented the Dense Pyramid Network. The authors incorporated group convolutions to process the Digital Surface Model independently from the true orthophoto, presenting an interesting data fusion approach. The channels created from their initial group convolutions were shuffled, in order to enhance the information flow between channels. The authors utilized a DenseNet (Huang et al., 2016) architecture as their feature extractor. In addition, a Pyramid Pooling layer was used at the end of their encoder branch, before constructing the final segmentation classes. In order to overcome the class imbalance problem, they chose to use the Focal loss function (Lin et al., 2017). In comparison with our work, the authors did not use a symmetric encoder-decoder architecture. The building blocks of their model were DenseNet units, which are known to be more efficient than standard residual units (Huang et al., 2016). The pyramid pooling operator used at the end of their architecture, before the final segmentation map, is at different scales than the one used in ResUNet-a.

Liu et al. (2018) introduced the CASIA network, which consists of a pretrained deep encoder, a set of self-cascaded convolutional units and a decoder part. The encoder part is deeper than the decoder part. The upscaling of the lower level features takes place with a resize operation followed by a convolutional residual correction term. The self-cascaded units consist of a sequential multi-context aggregation layer that aggregates features from higher receptive fields to local receptive fields. In an idea similar to our approach, the CASIA network uses features from multiple contexts; however, these are evaluated at a different depth of the network and fused together in a completely different way. The architecture achieved state of the art performance on the ISPRS Potsdam and Vaihingen data. The loss function they used was the normalized cross entropy.

3. The ResUNet-a framework

In this section, we introduce the architecture of ResUNet-a in full detail (Section 3.1), a novel loss function designed to achieve faster convergence and higher performance (Section 3.2), the data augmentation methodology (Section 3.3), as well as the methodology we followed for performing inference on large images (Section 3.4). The training strategy and software implementation characteristics can be found in Appendix A.

3.1. Architecture

Our architecture combines the following set of modules encoded in our models:

1. A UNet (Ronneberger et al., 2015) backbone architecture, i.e., the encoder-decoder paradigm, is selected for smooth and gradual transitions from the image to the segmentation mask.
2. To achieve consistent training as the depth of the network increases, the building blocks of the UNet architecture were replaced with modified residual blocks of convolutional layers (He et al., 2016). Residual blocks remove to a great extent the problem of vanishing and exploding gradients that is present in deep architectures.
3. For better understanding across scales, multiple parallel atrous convolutions (Chen et al., 2016, 2017) with different dilation rates are employed within each residual building block. Although it is not completely clear why atrous convolutions perform well, the intuition behind their usage is that they increase the receptive field of each layer. The rationale of using these multiple-scale layers is to extract object features at various receptive field scales. The hope is that this will improve performance by identifying correlations between objects at different locations in the image.
4. In order to enhance the performance of the network by including background context information, we use the pyramid scene parsing pooling (Zhao et al., 2017a) layer. In shallow architectures, where the last layer of the encoder has a size no less than 16x16 pixels, we use this layer in two locations within the architecture: after the encoder part (i.e., middle of the network) and at the second last layer before the creation of the segmentation mask. For deeper architectures, we use this layer only close to the last output layer.
5. In addition to the standard architecture that has a single segmentation mask layer as output, we also present two models where we perform multi-task learning. The algorithm learns simultaneously four complementary tasks. The first is the segmentation mask. The second is the common boundary between the segmentation masks, which is known to improve performance for semantic segmentation (Bertasius et al., 2015; Marmanis et al., 2018). The third is the distance transform¹ (Borgefors, 1986) of the segmentation mask. The fourth is the actual colored image, in HSV color space. That is, the identity transform of the content, but in a different color space.

Footnote 1: The result of the distance transform on a binary segmentation mask is a gray level image that takes values in the range [0, 1], where each pixel value corresponds to the distance to the closest boundary. In OpenCV this transform is encoded in cv::distance_transform.
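Since both auxiliary labels are derived from the segmentation mask alone, their construction is short. The following is a minimal sketch using OpenCV and NumPy; the function names and the morphological-gradient boundary extraction are our illustrative choices, not the authors' Appendix B code:

```python
import cv2
import numpy as np

def distance_label(mask):
    # Distance transform of a binary {0,1} mask, rescaled so that the
    # label lives in [0, 1] as described in footnote 1.
    dist = cv2.distanceTransform(mask.astype(np.uint8), cv2.DIST_L2, 3)
    return dist / dist.max() if dist.max() > 0 else dist

def boundary_label(mask, width=3):
    # Boundaries as a thin binary band via a morphological gradient
    # (dilation minus erosion); no separate annotation is needed.
    kernel = np.ones((width, width), np.uint8)
    m = mask.astype(np.uint8)
    return (cv2.dilate(m, kernel) - cv2.erode(m, kernel)).astype(np.float32)
```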
We term our network ResUNet-a because it consists of residual building blocks with multiple atrous convolutions and a UNet backbone architecture. We present two basic architectures, ResUNet-a d6 and ResUNet-a d7, that differ in their depth, i.e. the total number of layers. In ResUNet-a d6 the encoder part consists of six ResBlock-a building blocks followed by a PSPPooling layer. In ResUNet-a d7 the encoder consists of seven ResBlock-a building blocks. For each of the d6 or d7 models, there are also three different output possibilities: a single task semantic segmentation layer, a multi-task layer (mtsk), and a conditioned multi-task output layer (cmtsk). The difference between the mtsk and cmtsk output layers is how the various complementary tasks (i.e. the boundary, the distance map, and the color) are used for the determination of the main target task, which is the semantic segmentation prediction. In the following we present these models in detail, starting from the basic ResUNet-a d6.

3.1.1. ResUNet-a

The ResUNet-a d6 network consists of stacked layers of modified residual building blocks (ResBlock-a), in an encoder-decoder style (UNet). The input is initially subjected to a convolution layer of kernel size (1, 1) to increase the number of features to the desired initial filter size. A (1, 1) convolution layer was used in order to avoid any information loss from the initial image by summarizing features across pixels with a larger kernel. Then follow the residual blocks. In each residual block (Fig. 1b), we used as many as three parallel atrous convolutions in addition to the standard set of two convolutions of the residual network architecture, i.e., there were up to four parallel branches of sets of two stacked convolutional layers. After the convolutions, the output is added to the initial input in the spirit of residual building blocks. We decided to sum the various atrous branches (instead of concatenating them) because it is known that residual blocks of two successive convolutional layers demonstrate a constant condition number of the Hessian of the loss function, irrespective of the depth of the network (Li et al., 2016). Therefore the summation scheme is easier to train (in comparison with the concatenation of features). In the encoder part of the network, the output of each of the residual blocks is downsampled with a convolution of kernel size of one and stride of two. At the end of both the encoder and the decoder part, there exists a PSPPooling operator (Zhao et al., 2017a). In the PSPPooling operator (Fig. 1c), the initial input is split in channel (feature) space into 4 equal partitions. Then we perform a max pooling operation in successive splits of the input layer, in 1, 4, 16 and 64 partitions. Note that in the middle layer (Layer 13 has size [batch size] × 1024 × 8 × 8), the split of 64 corresponds to the actual total size of the input (so we have no additional gain with respect to max pooling from the last split). In Fig. 1a we present the full architecture of ResUNet-a (see also Table 1). In the decoder part, the upsampling is being done with the use […]
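To make the ResBlock-a description concrete, here is a minimal PyTorch sketch (the authors' implementation details are in Appendix A; the pre-activation ordering of normalization and ReLU is our assumption):

```python
import torch
import torch.nn as nn

class ResBlockA(nn.Module):
    # Parallel atrous branches (two stacked 3x3 convolutions each, one
    # branch per dilation rate) whose outputs are summed with the identity.
    def __init__(self, channels, dilations=(1, 3, 15, 31)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.BatchNorm2d(channels), nn.ReLU(),
                nn.Conv2d(channels, channels, 3, padding=d, dilation=d),
                nn.BatchNorm2d(channels), nn.ReLU(),
                nn.Conv2d(channels, channels, 3, padding=d, dilation=d),
            ) for d in dilations
        ])

    def forward(self, x):
        # Summing (rather than concatenating) keeps the block residual and,
        # per the text, easier to train.
        return x + sum(branch(x) for branch in self.branches)
```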
Fig. 1. Overview of the ResUNet-a d6 network. (a) The left (downward) branch is the encoder part of the architecture. The right (upward) branch is the decoder. The last convolutional layer has as many channels as there are distinct classes. (b) Building block of the ResUNet-a network. Each unit within the residual block has the same number of filters as all other units. Here d1, …, dn designate different dilation rates. (c) Pyramid scene parsing pooling layer. Pooling takes place in 1/1, 1/2, 1/4 and 1/8 portions of the original image.
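One plausible reading of the PSPPooling layer of Fig. 1c, as a PyTorch sketch: the channels are split into four groups, each group is max-pooled over a progressively finer grid (1, 4, 16 and 64 cells, i.e. 1/1 down to 1/8 portions of the image), broadcast back to the input resolution, and fused with the input by a 1x1 convolution. Everything beyond the quoted text (nearest-neighbour upsampling, the final fusion convolution) is an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PSPPooling(nn.Module):
    def __init__(self, channels, grids=(1, 2, 4, 8)):
        super().__init__()
        self.grids = grids  # pooling over 1/1, 1/2, 1/4, 1/8 portions
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x):
        # Channels assumed divisible by four for the equal split.
        h, w = x.shape[-2:]
        parts = torch.chunk(x, len(self.grids), dim=1)
        pooled = [
            F.interpolate(F.adaptive_max_pool2d(p, (g, g)),
                          size=(h, w), mode='nearest')
            for p, g in zip(parts, self.grids)
        ]
        # Keep the original features alongside their pooled summaries.
        return self.fuse(torch.cat([x] + pooled, dim=1))
```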
Table 1
Details of the ResUNet-a layers for the d6 model. Here f stands for the number of output channels (or features; the input number of features is deduced from the previous layers), k is the convolution kernel size, d is the dilation rate, and s the stride of the convolution operation. In all convolution operations we used appropriate zero padding to keep the dimensions of the produced feature maps equal to the input feature map (unless downsampling).

Layer #  Layer Type
1   Conv2D(f = 32, k = 1, d = 1, s = 1)
2   ResBlock-a(f = 32, k = 3, d = {1,3,15,31}, s = 1)
3   Conv2D(f = 64, k = 1, d = 1, s = 2)
4   ResBlock-a(f = 64, k = 3, d = {1,3,15,31}, s = 1)
5   Conv2D(f = 128, k = 1, d = 1, s = 2)
6   ResBlock-a(f = 128, k = 3, d = {1,3,15}, s = 1)
7   Conv2D(f = 256, k = 1, d = 1, s = 2)
8   ResBlock-a(f = 256, k = 3, d = {1,3,15}, s = 1)
9   Conv2D(f = 512, k = 1, d = 1, s = 2)
10  ResBlock-a(f = 512, k = 3, d = 1, s = 1)
11  Conv2D(f = 1024, k = 1, d = 1, s = 2)
12  ResBlock-a(f = 1024, k = 3, d = 1, s = 1)
13  PSPPooling
14  UpSample(f = 512)
15  Combine(f = 512, Layers 14 & 10)
16  ResBlock-a(f = 512, k = 3, d = 1, s = 1)
17  UpSample(f = 256)
18  Combine(f = 256, Layers 17 & 8)
19  ResBlock-a(f = 256, k = 3, d = 1, s = 1)
20  UpSample(f = 128)
21  Combine(f = 128, Layers 20 & 6)
22  ResBlock-a(f = 128, k = 3, d = 1, s = 1)
23  UpSample(f = 64)
24  Combine(f = 64, Layers 23 & 4)
25  ResBlock-a(f = 64, k = 3, d = 1, s = 1)
26  UpSample(f = 32)
27  Combine(f = 32, Layers 26 & 2)
28  ResBlock-a(f = 32, k = 3, d = 1, s = 1)
29  Combine(f = 32, Layers 28 & 1)
30  PSPPooling
31  Conv2D(f = NClasses, k = 1, d = 1, s = 1)
32  Softmax(dim = 1)

Table 2
Details of the Combine(Input1, Input2) layer.

Layer #  Layer Type
1   Input1
2   ReLU(Input1)
3   Concat(Layer 2, Input2)
4   Conv2DN(k = 1, d = 1, s = 1)

Table 3
Details of the replacement of the middle PSPPooling layer (Layer 13 from Table 1) for the ResUNet-a d7 model.

Layer #  Layer Type
Input  Layer 12
A   Conv2D(f = 2048, k = 1, d = 1, s = 2)(input)
B   ResBlock-a(f = 2048, k = 3, d = 1, s = 1)(Layer A)
C   MaxPooling(kernel = 2, stride = 2)(Layer B)
D   UpSample(Layer C)
E   Concat(Layer D, Layer B)
F   Conv2D(f = 2048, k = 1)(Layer E)

[…] block at a lower resolution. The output of this layer is subjected to a MaxPooling2D(kernel = 2, stride = 2) operation, the output of which is rescaled to its original size and then concatenated with the original input layer. This operation is followed by a standard convolution that brings the total number of features (i.e. the number of channels) to their original number before the concatenation. In version 2 (hereafter d7v2), again the Layer 12 is replaced with a standard resnet block. However, now the MaxPooling operation following this layer is replaced with a smaller PSPPooling layer that has three parallel branches, performing pooling in 1/1, 1/2, 1/4 scales of the original filter (Fig. 1c). The reason for this is that the filters in the middle of the d7 network cannot sustain 4 parallel pooling operations due to their small size (therefore, we remove the 1/8 scale pooling), for an initial input image of size 256x256.

With regards to the model complexity, ResUNet-a d6 has 52 M trainable parameters for an initial filter size of 32. ResUNet-a d7, which has greater depth, has 160 M parameters for the same initial filter size. The number of parameters remains almost identical for the case of the multi-task models as well.
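Read literally, the Combine unit of Table 2, which fuses upsampled decoder features with the corresponding encoder skip connection, amounts to very little code. A sketch follows; we read Conv2DN as a 1x1 convolution followed by a normalization layer, which is an assumption:

```python
import torch
import torch.nn as nn

class Combine(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        self.norm = nn.BatchNorm2d(out_channels)

    def forward(self, input1, input2):
        # ReLU(Input1), concatenate with Input2, then Conv2DN (Table 2).
        x = torch.cat([torch.relu(input1), input2], dim=1)
        return self.norm(self.conv(x))
```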
3.1.2. Multitasking ResUNet-a

This version of ResUNet-a replaces the last layer (Layer 31) with a multitasking block (Fig. 2). The multiple tasks are complementary. These are (a) the prediction of the semantic segmentation mask, (b) the detection of the common boundaries between classes, (c) the reconstruction of the distance map and (d) the reconstruction of the original image in HSV color space. Our choice of using a different color space than the original input was guided by the principle that: (a) we wanted to avoid the identity transform in order to exclude the algorithm recovering trivial solutions and (b) the HSV (or HSL) colorspace matches closely the human perception of color (Vadivel et al., 2005). It is important to note that these additional labels are derived using standard computer vision libraries from the initial image and segmentation mask, without the need for additional information (e.g. separately annotated boundaries). A software implementation for this is given in Appendix B. The idea here is that all these tasks are complementary and should help the target task that we are after. Indeed, the distance map provides information on the topological connectivity of the segmentation mask as well as the extent of the objects (for example, if we have an image with a "car" (object class) on a "road" (another object class), then the ground truth of the mask of the "road" will have a hole exactly at the location of the pixels corresponding to the "car" object). The boundary helps in better understanding the extent of the segmentation mask. Finally, the colorspace transformation provides additional information on the correlation between color variations and object extent. It also helps to keep "alive" the information of the fine details of the original image to its full extent until the final output layer. The rationale here is similar to the idea behind the concatenation of higher order features (first layers) with lower order features that exists in the UNet backbone architecture: the encoder layers have finer details about the original image the closer they are to the original input. Hence, the reason for concatenating them with the layers of the decoder is to keep the fine details necessary until the final layer of the network that is ultimately responsible for the creation of the segmentation mask. By demanding the network to be able to reconstruct the original image, we are making sure that all fine details are preserved² (an example of input image, ground truth and inference for all the tasks in the conditioned multitasking setting can be seen in Fig. 13).

Footnote 2: However, the color reconstruction on its own does not guarantee that the network learns meaningful correlations between classes and colors.

We present two flavours of the algorithm whose main difference is how the various tasks are used for the target output that we are interested in. In the simple multi-task block (bottom right block of Fig. 2), the four tasks are produced simultaneously and independently. That is, there is no direct usage of the three complementary tasks (boundary, distance, and color) in the construction of the target task, which is the segmentation. The motivation here is that the different tasks will force the algorithm to identify new meaningful features that are correlated with the output we are interested in and can help the performance of the algorithm for semantic segmentation. For the distance map, as well as the color reconstruction, we do not use the PSPPooling layer. This is because it tends to produce large squared areas with the same values (due to the pooling operation) and the depth of the convolution layers in the logits is not sufficient to diminish this (Fig. 14).

The second version of the algorithm uses a conditioned inference methodology. That is, the network graph is constructed in such a way so as to take advantage of the inference of the previous layers (top right block of Fig. 2). We first predict the distance map. The distance map is then concatenated with the output of the PSPPooling layer and is used to calculate the boundary logits. Then both the distance map and the prediction of the boundary are concatenated with the PSPPooling layer and the result is provided as input to the segmentation logits for the final prediction.

3.2. Loss function

In this section, we introduce a new variant of the family of Dice loss functions for semantic segmentation and regression problems. The Dice family of losses is by no means the only option for the task of semantic segmentation. Other interesting loss functions for the task of semantic segmentation are the focal loss (Lin et al., 2017; see also Pan et al., 2018b for an application on VHR images), the boundary loss (Kervadec et al., 2018), and the Focal Tversky loss (Abraham and Khan, 2018). A list of many other available loss functions can be found in Taghanaki et al. (2019).

3.2.1. Introducing the Tanimoto loss with complement

When it comes to semantic segmentation tasks, there are various options for the loss function. The Dice coefficient (Dice, 1945; Sørensen, 1948), generalized for fuzzy binary vectors in a multiclass context (Milletari et al., 2016; see also Crum et al., 2006; Sudre et al., 2017), is a popular choice among practitioners. It has been shown that it can increase performance over the cross entropy loss (Novikov et al., 2017). The Dice coefficient can be generalized to continuous binary vectors in two ways: either by the summation of probabilities in the denominator or by the summation of their squared values. In the literature, there are at least three definitions which are equivalent (Crum et al., 2006; Milletari et al., 2016; Drozdzal et al., 2016; Sudre et al., 2017):

$$D_1(\mathbf{p}, \mathbf{l}) = \frac{2\sum_i p_i l_i}{\sum_i (p_i + l_i)} \quad (1)$$

$$D_2(\mathbf{p}, \mathbf{l}) = \frac{2\sum_i p_i l_i}{\sum_i (p_i^2 + l_i^2)} \quad (2)$$

$$D_3(\mathbf{p}, \mathbf{l}) = \frac{\sum_i p_i l_i}{\sum_i (p_i^2 + l_i^2) - \sum_i p_i l_i} \quad (3)$$

where $\mathbf{p} \equiv \{p_i\}$, $p_i \in [0, 1]$, is a continuous variable representing the vector of probabilities for the i-th pixel, and $\mathbf{l} \equiv \{l_i\}$ are the corresponding ground truth labels. For binary vectors, $l_i \in \{0, 1\}$. In the following we will represent (where appropriate), for simplicity, the set of vector coordinates, $\mathbf{p} \equiv \{p_i\}$, with their corresponding tensor index notation, i.e. $\mathbf{p} \equiv \{p_i\} \equiv p_i$.

These three definitions are numerically equivalent, in the sense that they map the vectors $(p_i, l_i)$ to the continuous domain $[0, 1]$, i.e. $D(p_i, l_i)\colon \mathbb{R}^2 \to [0, 1]$. The gradients, however, of these loss functions behave differently for gradient based optimization, i.e., for deep learning applications, as demonstrated in Milletari et al. (2016). In the remainder of this paper, we call Dice loss the loss function with the functional form with the summation of probabilities and labels in the denominator (Eq. (1)). We also use the name Tanimoto for the $D_3$ loss function (Eq. (3)) and designate it with the letter $T \equiv D_3$.

We found empirically that the loss functions containing squares in the denominator behave better in pointing to the ground truth irrespective of the random initial configuration of weights. In addition, we found that we can achieve faster training convergence by complementing the loss with a dual form that measures the overlap area of the complement of the regions of interest. That is, if $p_i$ measures the probability of the i-th pixel to belong to class $l_i$, the complement loss is defined as $T(1 - p_i, 1 - l_i)$, where the subtraction is performed element-wise, e.g. $1 - p_i = \{1 - p_1, 1 - p_2, \ldots, 1 - p_n\}$ etc. The intuition behind the usage of the complement in the loss function comes from the fact that the numerator of the Dice coefficient, $\sum_i p_i l_i$, can be viewed as an inner product between the probability vector, $\mathbf{p} = \{p_i\}$, and the ground truth label vector, $\mathbf{l} = \{l_i\}$. Then, the part of the probabilities vector, $p_i$, that corresponds to the elements of the label vector, $l_i$, that have zero entries does not alter the value of the inner product³.

Footnote 3: As a simple example, consider four dimensional vectors, say $\mathbf{p} = (p_1, p_2, p_3, p_4)$ and $\mathbf{l} = (1, 1, 0, 0)$. The value of the inner product term is $\mathbf{p} \cdot \mathbf{l} = p_1 + p_2$, and therefore the information contained in the $p_3$ and $p_4$ entries is not apparent to the numerator of the loss. The complement inner product provides information for these terms: $(1 - \mathbf{p}) \cdot (1 - \mathbf{l}) = (1 - \mathbf{p}) \cdot (0, 0, 1, 1) = 2 - (p_3 + p_4)$.
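The footnote example can be verified numerically in a couple of lines (the probability values are arbitrary):

```python
import numpy as np

p = np.array([0.9, 0.8, 0.7, 0.6])  # arbitrary probabilities
l = np.array([1.0, 1.0, 0.0, 0.0])  # labels with two zero entries

print(p @ l)              # 1.7 = p1 + p2: blind to p3 and p4
print((1 - p) @ (1 - l))  # 0.7 = 2 - (p3 + p4): recovered by the complement
```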
Fig. 3. Contour plots of the gradient flow (top row) and Laplacian operator (bottom row) for the various versions of the Dice loss functions (Eqs. (1)–(3)), $D_i$, as well as the functional forms with complements, $\tilde{D}_i$. The black dot corresponds to the ground truth value (1, 0). From left to right, for the top row, we have the gradient flow of the generalized loss functions $D_1, \tilde{D}_1, D_2, \tilde{D}_2$ and $D_3, \tilde{D}_3$. The bottom panels are the corresponding Laplacian operators of these. The numerical values of the isocontours on the images describe numerically the colorscheme, with darker values corresponding to smaller values.

We therefore propose that the best flow of gradients (hence faster training) is achieved using as a loss function the average of $T(p_i, l_i)$ with its complement, $T(1 - p_i, 1 - l_i)$:

$$\tilde{T}(p_i, l_i) = \frac{T(p_i, l_i) + T(1 - p_i, 1 - l_i)}{2}. \quad (4)$$

3.2.2. Experimental comparison with other Dice loss functions

In order to justify these choices, we present an example with a single 2D ground truth vector, $\mathbf{l} = (1, 0)$, and a vector of probabilities $\mathbf{p} = (p_x, p_y) \in [0, 1]^2$. We consider the following six loss functions:

1. The Dice coefficient, $D_1(p_i, l_i)$ (Eq. (1)).
2. The Dice coefficient with its complement: $\tilde{D}_1(p_i, l_i) = (D_1(p_i, l_i) + D_1(1 - p_i, 1 - l_i))/2$.
3. The Dice coefficient $D_2(p_i, l_i)$ (Eq. (2)).
4. The Dice coefficient with its complement, $\tilde{D}_2$.
5. The Tanimoto coefficient, $T(p_i, l_i)$ (Eq. (3)).
6. The Tanimoto coefficient with its complement, $\tilde{T}(p_i, l_i)$ (Eq. (4)).

In Fig. 3 we plot the gradient field of the various flavours of the family of Dice loss functions (top panels), as well as the Laplacians of these (i.e. their 2nd order derivatives, bottom panels). The ground truth is marked with a black dot. What is important in these plots is that for a random initialization of the weights of a neural network, the loss function will take a (random) value in the area within $[0, 1]^2$. The quality of the loss function then, as a suitable criterion for training deep learning models, is whether the gradients, from every point of the area in the plot, direct the solution towards the ground truth point. Intuitively, we also expect that the behavior of the gradients is even better if the local extremum of the loss at the ground truth is also a local extremum of the Laplacian of the loss. As is evident from the bottom panels of Fig. 3, this is not the case for all loss functions.

In more detail, in Fig. 3, we plot the gradient field of the Dice loss functions and the corresponding Laplacian fields. In the top row are shown the gradient fields of the three different functional forms of the Dice loss and the forms with their complements. From left to right we have the Dice coefficient based loss with summation of probabilities in the denominator, $D_1(\mathbf{p}, \mathbf{l})$, its complement, $\tilde{D}_1(\mathbf{p}, \mathbf{l})$, the Dice loss with summation of squares in the denominator, $D_2(\mathbf{p}, \mathbf{l})$, its complement, $\tilde{D}_2(\mathbf{p}, \mathbf{l})$, and the third form of the Dice loss with summation of squares that also includes a subtraction term, $D_3(\mathbf{p}, \mathbf{l})$, and its complement, $\tilde{D}_3(\mathbf{p}, \mathbf{l})$. From the gradient flow of the $D_1$ loss, it is evident that for a random initialization of the network weights (which is the case in deep learning) that corresponds to some random point $(p_x, p_y)$ of the loss landscape, the gradients of the loss with respect to $p_x, p_y$ will not necessarily direct to the ground truth point at (1, 0). In this respect, the generalized Dice loss with complement, $\tilde{D}_1$, behaves better. However, the gradient flow lines do not pass through the ground truth point for all possible pairs of values $(p_x, p_y)$. For the case of the loss functions $D_2, D_3$ and their complements, the gradient flow lines pass through the ground truth point, but these are not straight lines. Their forms with complement, $\tilde{D}_2, \tilde{D}_3$, have gradient lines flowing straight towards the ground truth irrespective of the (random) initialization point. The Laplacians of these loss functions are in the corresponding bottom panels of Fig. 3. It is clear that the extremum of the Laplacian operator is closer to the ground truth values only for the cases where we consider the loss functions with complement. Interestingly, the Laplacian of the Tanimoto functional form ($\tilde{D}_3$) has extremum values closer to the ground truth point in comparison with the $\tilde{D}_2$ functional form.

In summary, the Tanimoto loss with complement has gradient flow lines that are straight lines (geodesics, i.e. they follow the shortest path) pointing to the ground truth from any random initialization point, and the second order derivative has an extremum at the location of the ground truth. This demonstrates, in our opinion, the superiority of the Tanimoto with complement as a loss function, among the family of loss functions based on the Dice coefficient, for training deep learning models.

3.2.3. Tanimoto with complement as a regression loss

It should be stressed that, if we restrict the output of the neural network to the range [0, 1] (with the use of softmax or sigmoid activations), then the Tanimoto loss can also be used to recover continuous variables in the range [0, 1]. In Fig. 4 we present an example of this, for a ground truth vector of $\mathbf{l} = (0.25, 0.85)$. In the top panels, we plot the gradient flow of the Tanimoto (left) and Tanimoto with complement (right) functions. In the bottom panels, we plot the corresponding functions obtained after applying the Laplacian operator to the loss functions. This is an appealing property for the case of multi-task learning, where one of the complementary tasks targets a continuous variable. The reason is that the gradients of these components will have a similar magnitude scale and the training will be equally balanced across all complementary tasks. In contrast, when we use functions of different functional form for different tasks in the optimization, we have to explicitly balance the gradients of the different components, with the […]
Fig. 6. Example of data augmentation on image patches of size 256 × 256 (ground sampling distance 10 cm – FoV × 4 dataset). Top row: original image; subsequent rows: random rotations with respect to a (random) center and at a random scale (zoom in/out). Reflect padding was used to fill the missing values of the image after the transformation.

[…] et al., 2016; Waldner et al., 2017).

Practically, we perform multiple overlapping window passes over the whole tile and store the class probabilities for each pixel and each pass. The final class probability vector, $p_i(x, y)$, is obtained using the average of all the prediction views. The sliding window has size equal to the tile dimensions (256 × 256); however, we step through the whole image in strides of 256/4 = 64 pixels, in order to get multiple inference probabilities for each pixel. In order to account for the lack of information outside the tile boundaries, we pad each tile with reflect padding of a size equal to 256/2 = 128 pixels (Ronneberger et al., 2015).

4. Data and preprocessing

We sourced data from the ISPRS 2D Semantic Labelling Challenge, and in particular the Potsdam data set (ISPRS). The data consist of a set of true orthophotos (TOP) extracted from a larger mosaic, and a Digital Surface Model (DSM). The TOP consists of the four spectral bands in the visible (VIS) red (R), green (G), and blue (B) and in the near infrared (NIR), and the ground sampling distance is 5 cm. The normalized DSM layer provides information on the height of each pixel, as the ground elevation was subtracted. The four spectral bands (VISNIR) and the normalized DSM were stacked (VISNIR + DSM) to be used to train the semantic segmentation models. The labels consist of six classes, namely impervious surfaces, buildings, cars, low vegetation, trees, and background.

Unlike conventional pixel-based (e.g. random forests) or GEOBIA approaches, CNNs have the ability to "see" image objects in their contexts, which provides additional information to discriminate between classes. Thus, working with large image patches maximizes the competitive advantage of CNNs; however, limits to the maximum patch size are dictated by memory restrictions of the GPU hardware. We have created two versions of the training data. In the first version, we resampled the image tiles to half their original resolution and extracted image patches of size 256x256 pixels to train the network. The reduction of the original tile size to half was decided with the mindset that we can include more context information per image patch. This resulted in image patches with a four times larger Field of View (hereafter FoV) for the same 256 × 256 patch size. We will refer to this dataset as FoV × 4, as it includes a 4 times larger Field of View (area) in a single 256 × 256 image patch (in comparison with 256 × 256 image patches extracted directly from the original unscaled dataset). In the second version of the training data, we kept the full resolution tiles and again extracted image patches of size 256 × 256 pixels. We will refer to this dataset as FoV × 1. The 256 × 256 image patch size was the maximum size that the memory capacity of our hardware configuration could handle (see Appendix A) so as to process a meaningfully large batch of datums. Each of the 256 × 256 patches used for training was extracted from a sliding window swiped over the whole tile at a stride, i.e., step, of 128 pixels. This approach guarantees that all pixels at the edge of a patch become central pixels in subsequent patches. After slicing the original images, we split⁴ the 256x256 patches into a training set, a validation set, and a test set with the following ratios: 0.8–0.1–0.1.

Footnote 4: Making sure there is no overlap between the image patches of the training, validation and test sets.

The purpose of the two distinct datasets is the following: the FoV × 4 is useful in order to understand how much (if any) the increased context information improves the performance of the algorithm. It also allows us to perform more experiments much faster, due to the decreased volume of data. The FoV × 4 dataset is approximately 50 GB, with 10 k pairs of images and masks. The FoV × 1 has a volume size of 250 GB, and 40 k pairs of images and masks. In addition, the FoV × 4 is a useful benchmark of how the algorithm behaves with a smaller amount of data than the one provided. Finally, the FoV × 1 version is used in order to compare the performance of our architecture with other published results.

5. Architecture and Tanimoto loss experimental analysis

In this section, we perform an experimental analysis of the ResUNet-a architecture as well as of the performance of the Tanimoto with complement loss function.

5.1. Accuracy assessment

For each tile of the test set, we constructed the confusion matrix and extracted several accuracy metrics, such as the overall accuracy (OA), the precision, the recall, and the F1-score ($F_1$):

$$\mathrm{OA} = \frac{TP + TN}{TP + TN + FP + FN} \quad (7)$$

$$\mathrm{precision} = \frac{TP}{TP + FP} \quad (8)$$

$$\mathrm{recall} = \frac{TP}{TP + FN} \quad (9)$$

$$F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \quad (10)$$

where TP, FP, FN, and TN are the true positive, false positive, false negative and true negative classifications, respectively.
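Given the per-class confusion counts, Eqs. (7)–(10), together with the MCC of Eq. (11) introduced next, reduce to a few lines. A minimal sketch, assuming binary (one-vs-rest) counts:

```python
import math

def scores(tp, fp, fn, tn):
    oa = (tp + tn) / (tp + tn + fp + fn)                 # Eq. (7)
    precision = tp / (tp + fp)                           # Eq. (8)
    recall = tp / (tp + fn)                              # Eq. (9)
    f1 = 2 * precision * recall / (precision + recall)   # Eq. (10)
    mcc = (tp * tn - fp * fn) / math.sqrt(               # Eq. (11) below
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return oa, precision, recall, f1, mcc
```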
In addition, for the validation dataset (for which we have ground truth labels), we use the Matthews Correlation Coefficient (hereafter MCC; Matthews, 1975):

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \quad (11)$$

5.2. Architecture ablation study

In this section, we design two experiments in order to evaluate the performance of the various modules that we use in the ResUNet-a architecture. In these experiments we used the FoV × 1 dataset, as this is the dataset that will be the ultimate testbed of ResUNet-a performance against other modeling frameworks. Our metric for understanding the performance gains of the various models tested is model complexity and training convergence: if model A has a greater (or equal) number of parameters than model B, and model A converges faster to optimality than model B, then it is most likely that it will also achieve the highest overall score.

In the first experiment we test the convergence properties of our architecture. In this, we are not interested in the final performance (after learning rate reduction and finetuning), which is a very time consuming operation, but in how ResUNet-a behaves during training for the same fixed set of hyperparameters and epochs. We start by training a baseline model, a modified ResUNet (Zhang et al., 2017) where, in order to keep the number of parameters identical with the atrous case, we use the same ResUNet-a building blocks with dilation rate equal to 1 for all parallel residual blocks (i.e. there are no atrous convolutions). This is similar in philosophy to the wide residual networks (Zagoruyko and Komodakis, 2016); however, there are no dropout layers. Then, we modify this baseline by increasing the dilation rate, thus adding atrous convolutions (model: ResUNet + Atrous). It should be clear that the only difference between the models ResUNet and ResUNet + Atrous is that the latter has different dilation rates than the former, i.e. they have an identical number of parameters. Then we add PSPPooling in both the middle and the end of the framework (model: ResUNet + Atrous + PSP), and finally we apply the conditioned multitasking, i.e. the full ResUNet-a model (model: ResUNet + Atrous + PSP + CMTSK). The difference in convergence rates is incremental with each module addition. This performance difference can be seen in Fig. 7 and is substantial. In Fig. 7 we plot the Matthews correlation coefficient (MCC) for all models. The MCC was calculated using the success rate over all classes. The baseline ResUNet requires approximately 120 epochs to achieve the same performance level that ResUNet-a cmtsk achieves in epoch 40. The mere change from simple (ResUNet) to atrous convolutions (model ResUNet + Atrous) almost doubles the convergence rate. The inclusion of the PSP module (both middle and end) provides additional learning capacity; however, it also comes with training instability. This is fixed by adding the conditioned multitasking module in the final model. Clearly, each module addition: (a) increases the complexity of the model, since it increases the total number of parameters, and (b) improves the convergence performance.

Next, we are interested in evaluating the importance of the PSPPooling layer. In our experiments we found this layer to be more important in the middle of the network than before the last output layers. For this purpose, we train two ResUNet-a d7 models, the d7v1 and d7v2, that are identical in all aspects except that the latter has a PSPPooling layer in the middle. Both models are trained with the same fixed set of hyperparameters (i.e. no learning rate reduction takes place during training). In Fig. 9 we show the convergence evolution of these networks. It is clear that the model with the PSPPooling layer in the middle (i.e. v2) converges much faster to optimality, despite the fact that it has greater complexity (i.e. number of parameters) than the model d7v1.

In addition to the above, we have to note that when the network performs erroneous inference, due to the PSPPooling layer at the end, this may appear in the form of square blocks, indicating the influence of the pooling area in square subregions of the output. The last PSPPooling layer is particularly problematic when dealing with regression output problems. This is the reason why we did not use it in the evaluation of the color and distance transform modules in the multitasking networks. In Fig. 8 we present two examples of erroneous inference of the last PSPPooling layer that appear in the form of square blocks. The first row corresponds to a zoomed-in region of tile 6_14, and the bottom row to a zoomed-in region of tile 6_15. From left to right: RGB bands of input image, error map, and inference map. It can be seen that the boundary of the error map has areas that appear in the form of square blocks. That is, the effect of the pooling operation at various scales can dominate the inference area.

Comparing the ResUNet-a mtsk and ResUNet-a cmtsk models (on the basis of the d7v1 feature extractor), we find that the latter demonstrates smaller variance in the values of the loss function (and, in consequence, the performance metric) during training. In Fig. 10 we present an example of the comparative training evolution of the ResUNet-a d7v1 mtsk versus the ResUNet-a d7v1 cmtsk models. It is clear that the conditioned inference model demonstrates smaller variance, and that, despite the random fluctuations of the MCC coefficient, the median performance of the conditioned multitasking model is higher than the median performance of the simple multitasking model. This helps in stabilizing the gradient updates and results in slightly better performance. We have also found that the inclusion of the identity […]
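The tile-level evaluation relies on the overlapping-window inference of Section 3.4 (reflect padding of 128 px, 256 × 256 windows moved at a stride of 64 px, per-pixel averaging of the predicted class probabilities). A minimal NumPy sketch, where model is any callable returning per-class probabilities and all names are ours:

```python
import numpy as np

def predict_tile(model, tile, win=256, stride=64, pad=128):
    # tile: (channels, H, W); reflect-pad to cover the tile borders.
    c, h, w = tile.shape
    padded = np.pad(tile, ((0, 0), (pad, pad), (pad, pad)), mode='reflect')
    probs, counts = None, np.zeros(padded.shape[1:], dtype=np.float32)
    for y in range(0, padded.shape[1] - win + 1, stride):
        for x in range(0, padded.shape[2] - win + 1, stride):
            out = model(padded[:, y:y + win, x:x + win])  # (classes, win, win)
            if probs is None:
                probs = np.zeros((out.shape[0],) + padded.shape[1:], np.float32)
            probs[:, y:y + win, x:x + win] += out
            counts[y:y + win, x:x + win] += 1.0
    probs /= np.maximum(counts, 1.0)           # average over all views
    return probs[:, pad:pad + h, pad:pad + w]  # crop back to the original tile
```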
Fig. 9. Convergence performance evaluation for the PSPPooling layer in the middle. We compare the ResUNet-a d7v1 architecture without a PSPPooling layer (blue solid line) and with a PSPPooling layer in the middle (i.e. d7v2, dashed green line). It is clear that the insertion of the PSPPooling layer in the middle of the architecture boosts the convergence performance of the network. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 11. Training evolution of the same model using three different loss functions: the Tanimoto with complement loss (solid red line, this work), the Tanimoto (solid green line), and the Dice loss (dashed blue line, Sudre et al., 2017). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

[…] the Dice loss stagnates at lower values, while the Tanimoto loss with complement converges faster to an optimal value. The difference in performance is significant: the Tanimoto loss with complement achieves, for the same number of epochs, an MCC = 85.99, while the Dice loss stagnates at MCC = 80.72. The Tanimoto loss without complement (Eq. (5)) gives a similar performance to the Tanimoto with complement; however, it converges relatively slower and demonstrates greater variance. In all experiments we performed, the Tanimoto with complement gave us the best performance.
Table 4
Potsdam comparison of results (F1 score and overall accuracy – OA) for the various ResUNet-a models trained on the FoV × 4 and FoV × 1 datasets. The highest score is marked in bold. The average F1-score was calculated using all classes except the "Background" class. The overall accuracy was calculated including the "Background" category.
Methods DataSet ImSurface Building LowVeg Tree Car Avg. F1 OA
ResUNet-a d7v1 mtsk (FoV × 4) 92.9 97.2 86.8 87.4 96.0 92.1 90.6
ResUNet-a d7v1 cmtsk (FoV × 4) 92.9 97.2 87.0 87.5 95.8 92.1 90.7
ResUNet-a d6 cmtsk (FoV × 1) 93.0 97.2 87.5 88.4 96.1 92.4 91.0
ResUNet-a d7v2 cmtsk (FoV × 1) 93.5 97.2 88.2 89.2 96.4 92.9 91.5
Fig. 12. ResUNet-a d7v1 cmtsk inference on unseen test patches of size 256 × 256 (FoV × 4 – ground sampling distance 10 cm). From left to right: RGB image, digital elevation map, ground truth, and prediction.
segments the area close to their boundary. This is partially owed to the
fact that we reduced the size of the original image, and fine details
required for the detailed extent of trees cannot be identified by the
algorithm. In fact, even for a human, the annotated boundaries of trees
are not always clear (e.g. see Fig. 12). The ResUNet-a d6 cmtsk model
provides a significant performance boost over the single task ResUNet-
a d6 model, for the classes “Bulding”, “LowVeg” and “Tree”. In these
classes it also outperforms the deeper models d7v1 (which, however, do
not include the PSPPooling layer at the end of the encoder). This is
due to the explicit requirement for the algorithm to reconstruct also the
boundaries and the distance map and use them to further refine the
segmentation mask. As a result, the algorithm gains a “better under-
standing” of the fine details of objects, even if in some cases it is dif-
ficult for humans to clearly identify their boundaries.
The ResUNet-a d7v1 cmtsk model demonstrates slightly increased performance over all of the tested models (Table 4), although the differences are marginal for the FoV × 4 dataset and vary between classes. In addition, there are some annotation errors in the dataset that eventually set an upper bound on the achievable performance. In Fig. 12 we give an example of inference on 256 × 256 image patches of unseen test data. In Fig. 13 we provide an example of the inference performed by ResUNet-a d7v1 cmtsk for all the predictive tasks (boundary, distance transform, segmentation, and identity reconstruction). In all rows, the left column corresponds to the same ground truth image. In the first row, from left to right: input image, ground truth segmentation mask, inferred segmentation mask. Second row, middle and right: ground truth boundary and an inferred heat map of the algorithm's confidence in characterizing pixels as boundaries; the fainter the boundaries appear, the less confident the algorithm is in their characterization as boundaries. Third row, middle and right: ground truth distance map and inferred distance map. Last row, middle: reconstructed image in HSV space; right: average error over all channels between the original RGB image and the reconstructed one. The reconstruction is excellent, suggesting that the Tanimoto loss can be used for identity mappings whenever these are required (e.g. as a means of regularization, or for training Generative Adversarial Networks (Goodfellow et al., 2014), e.g. Zhu et al. (2017)).
Fig. 13. ResUNet-a d7v1 cmtsk all-tasks inference on unseen test patches of size 256 × 256 for the FoV × 4 dataset (ground sampling distance 10 cm). From left to right, top row: input image, ground truth segmentation mask, predicted segmentation mask. Second row: input image, ground truth boundaries, predicted boundaries (confidence). Third row: input image, ground truth distance map, inferred distance map. Bottom row: input image, reconstructed image, difference between input and predicted image.
Finally, in Table 4, we provide a relative comparison between models trained on the FoV × 4 and FoV × 1 versions of the dataset. Clearly, there is a performance boost when using the higher resolution dataset (FoV × 1) for the classes that require finer detail. However, for the class “Building” the score is actually better with the wider field of view (FoV × 4, model d6 cmtsk) dataset.
Table 5
Potsdam comparison of results (based on per-class F1 score) with other authors. Best values are marked in bold, second best values are underlined, third best values are in square brackets. Models trained with FoV × 1 were trained on 256 × 256 patches extracted from the original resolution images.
Methods ImSurface Building LowVeg Tree Car Avg. F1 OA
UZ_1 (Volpi and Tuia, 2017) 89.3 95.4 81.8 80.5 86.5 86.7 85.8
RIT_L7 (Liu et al., 2017b) 91.2 94.6 85.1 85.1 92.8 89.8 88.4
RIT_4 (Piramanayagam et al., 2018) 92.6 97.0 86.9 87.4 95.2 91.8 90.3
DST_5 (Sherrah, 2016) 92.5 96.4 86.7 88.8 94.7 91.7 90.3
CAS_Y3 (ISPRS) 92.2 95.7 87.2 87.6 95.6 91.7 90.1
CASIA2 (Liu et al., 2018) 93.3 97.0 [87.7] [88.4] 96.2 92.5 91.1
DPN_MFFL (Pan et al., 2018b) 92.4 [96.4] 87.8 88.0 95.7 92.1 90.4
HSN + OI + WBP (Liu et al., 2017a) 91.8 95.7 84.4 79.6 88.3 87.9 89.4
ResUNet-a d6 cmtsk (FoV × 1) [93.0] 97.2 87.5 [88.4] [96.1] [92.4] [91.0]
ResUNet-a d7v2 cmtsk (FoV × 1) 93.5 97.2 88.2 89.2 96.4 92.9 91.5
Table 6
Potsdam summary confusion matrix over all test tiles for ground truth masks that do not include the boundary. The results correspond to the best model, ResUNet-a d7v2 cmtsk, trained on the FoV × 1 dataset. The overall accuracy achieved is 91.5%. (Rows: predicted classes; columns: reference classes.)
the average F1 score and overall accuracy for the ResUNet-a d6 cmtsk and ResUNet-a d7v2 cmtsk models, as well as results from other authors. ResUNet-a d6 performs on par with other state-of-the-art modeling frameworks and ranks 3rd overall (average F1). It should be stressed that, for the majority of the results, the performance differences are marginal. Going deeper, the ResUNet-a d7v2 model ranks 1st among the representative sample of competing models in all classes, thus clearly demonstrating an improvement over the state of the art. In Table 6 we provide the confusion matrix, over all test tiles, for this particular model.
It should be noted that some of the contributors (e.g., CASIA2, RIT_4, DST_5) in the ISPRS competition used networks with weights pre-trained on external large data sets (e.g. ImageNet, Deng et al., 2009) followed by fine-tuning, i.e. a methodology called transfer learning (Pan and Yang, 2010; see also Penatti et al. (2015), Xie et al. (2015) for remote sensing applications). In particular, CASIA2, which has the 2nd highest overall score, used as a basis a state-of-the-art pre-trained ResNet101 (He et al., 2016) network. In contrast, ResUNet-a was trained from random weight initialization only on the ISPRS Potsdam data set. Although it has been demonstrated that such a strategy does not influence the final performance, i.e. it is possible to achieve the same performance without pre-trained weights (He et al., 2018), this comes at the expense of a very long training time.
To visualize the performance of ResUNet-a, we generated error maps that indicate incorrect (correct) classification in red (green). All summary statistics and error maps were created using the software provided on the ISPRS competition website. For all of our inference results, we used the ground truth masks with eroded boundaries, as suggested by the curators of the ISPRS Potsdam data set (ISPRS). This allows interested readers to form a clear picture of the strengths and weaknesses of our algorithm in comparison with online published results⁵. In Fig. 15 we provide the input image (left column), the error map between the inferred and ground truth masks (middle column) and the inference (right column) for a sample of four test tiles. In Appendix C we present the evaluation results for the rest of the test TOP tiles, per class. In all of these figures, for each row, from left to right: original image tile, error map and inference using our best model (ResUNet-a d7v2 cmtsk).
⁵ For comparison, competition results can be found online.
Fig. 15. ResUNet-a best model results for tiles 2–13, 2–14, 3–13 and 3–14. From left to right: input image, difference between ground truth and predictions, inference map. Image resolution: 6 k × 6 k, ground sampling distance of 5 cm.
7. Conclusions
In this work, we present a new deep learning modeling framework for semantic segmentation of high resolution aerial images. The framework consists of a novel multitasking deep learning architecture for semantic segmentation and a new variant of the Dice loss that we term the Tanimoto loss.
Our deep learning architecture, ResUNet-a, is based on the encoder/decoder paradigm, where standard convolutions are replaced with ResNet units that contain multiple parallel atrous convolutions. Pyramid scene parsing pooling is included in the middle and at the end of the network. The best-performing variants of our models are conditioned multitasking models which predict, along with the segmentation mask, the boundaries of the various classes, the distance transform (which provides information on the topological connectivity of the objects) and the identity reconstruction of the input image. The additionally inferred tasks are re-used internally by the network before the final segmentation mask is produced. That is, the final segmentation mask is conditioned on the inference result of the boundaries of the objects as well as the distance transform of their segmentation mask.
We show experimentally that the conditioned multitasking improves the performance of the inferred semantic segmentation classes. The ground truth labels that are used during training for the boundaries, as well as the distance transform, can both be calculated easily from the ground truth segmentation mask using standard computer vision software (OPENCV; see Section B for a PYTHON implementation).
We analyze the performance of various flavours of the Dice loss and introduce a novel variant of it as a loss function, the Tanimoto loss. This loss can also be used for regression problems, an appealing property that makes it useful for multitasking problems, since it results in balanced gradients for all tasks during training. We show experimentally that the Tanimoto loss speeds up training convergence and behaves well in the presence of heavily imbalanced data sets.
The performance of our framework is evaluated on the 2D semantic segmentation ISPRS Potsdam data set. Our best model, ResUNet-a d7v2, achieves top-rank performance in comparison with other published results (Table 5) and demonstrates a clear improvement over the state of the art. The combination of ResUNet-a conditioned multitasking with the proposed loss function is a reliable solution for performant semantic segmentation tasks.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
The authors acknowledge the support of the Scientific Computing team of CSIRO, and in particular Peter H. Campbell and Ondrej Hlinka. Their contribution was substantial in overcoming many technical difficulties of distributed GPU computing. The authors are also grateful to John Taylor for his help in understanding and implementing distributed optimization using HOROVOD (Sergeev and Balso, 2018). The authors acknowledge the support of the MXNET community, and in particular Thomas Delteil, Sina Afrooze and Thom Lane. The authors acknowledge the provision of the Potsdam ISPRS dataset by BSF Swissphoto⁶. The authors acknowledge the contribution of the anonymous referees, whose questions helped to improve the quality of the manuscript.
Appendix A. Training details
ResUNet-a was built and trained using the MXNET deep learning library (Chen et al., 2015), under the GLUON API. Each of the models trained on the FoV × 4 dataset was trained with a batch size of 256 on a single node containing 4 NVIDIA Tesla P100 GPUs in CSIRO HPC facilities. Due to the complexity of the network, the batch size for a single GPU iteration cannot be made larger than 10 (per GPU). In order to increase the effective batch size we used manual gradient aggregation⁷ (see the sketch below). For the models trained on the FoV × 1 dataset we used a batch size of 480 in order to speed up the computation. These were trained in a distributed scheme, using the ring-allreduce algorithm, and in particular its implementation in HOROVOD (Sergeev and Balso, 2018) for the MXNET (Chen et al., 2015) deep learning library. The optimal learning rate for all runs was set by the methodology developed in Smith (2018), i.e. by monitoring the training loss for a continuously increasing learning rate, starting from a very low value. An example is shown in Fig. A.16: the optimal learning rate is approximately the point of steepest descent of the loss function. This process completes in approximately one epoch and can be applied in a distributed scheme as well. We found it more useful than the linear learning rate scaling that is used for large batch sizes (Goyal et al., 2017) in distributed optimization.
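As an illustration, here is a minimal sketch of manual gradient aggregation with the GLUON API; the names net, loss_fn and data_iter are hypothetical placeholders, and the aggregation count is illustrative:

from mxnet import autograd, gluon

AGG_STEPS = 8  # number of small batches to aggregate (illustrative value)

# Accumulate gradients across backward passes instead of overwriting them.
net.collect_params().setattr('grad_req', 'add')
trainer = gluon.Trainer(net.collect_params(), 'adam', {'learning_rate': 0.001})

for i, (data, label) in enumerate(data_iter):
    with autograd.record():
        loss = loss_fn(net(data), label)
    loss.backward()  # gradients are summed into the parameters' .grad arrays
    if (i + 1) % AGG_STEPS == 0:
        # Normalize the update by the effective (aggregated) batch size.
        trainer.step(AGG_STEPS * data.shape[0])
        for param in net.collect_params().values():
            param.zero_grad()  # reset the accumulated gradients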
For all models, we used the Adam (Kingma and Ba, 2014) optimizer, with an initial learning rate of 0.001 (the initial learning rate can also be set higher for this dataset, see Fig. A.16) and momentum parameters (β₁, β₂) = (0.9, 0.999). The learning rate was reduced by an order of magnitude whenever the validation loss stopped decreasing. Overall, we reduced the learning rate 3 times. We have also experimented with smaller batch sizes. In particular, with a batch size of 32, the training is unstable. This is mainly because we used 4 GPUs for training, so the batch size per GPU is 8, which is not sufficient for the Batch Normalization layers that use only the per-GPU data to estimate the running means of their parameters. When we experimented with synchronized Batch Normalization layers (Ioffe and Szegedy, 2015; Zhang et al., 2018), the stability of the training increased dramatically even with a batch size as small as 32. However, due to the GPU synchronization, this was a slow operation that proved to be impractical for our purposes.
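A compact sketch of this optimizer configuration and the plateau-based learning rate reduction follows; train_one_epoch, evaluate_validation_loss, max_epochs and the patience threshold are our own hypothetical placeholders:

from mxnet import gluon

# Adam with the stated hyper-parameters: lr = 0.001, (beta1, beta2) = (0.9, 0.999).
trainer = gluon.Trainer(net.collect_params(), 'adam',
                        {'learning_rate': 0.001, 'beta1': 0.9, 'beta2': 0.999})

best_val, patience, num_drops = float('inf'), 0, 0
for epoch in range(max_epochs):
    train_one_epoch(net, trainer)             # hypothetical training helper
    val_loss = evaluate_validation_loss(net)  # hypothetical validation helper
    if val_loss < best_val:
        best_val, patience = val_loss, 0
    elif num_drops < 3:                       # the learning rate is reduced 3 times
        patience += 1
        if patience >= 5:                     # patience value is an assumption
            # Reduce the learning rate by an order of magnitude on a plateau.
            trainer.set_learning_rate(trainer.learning_rate * 0.1)
            patience, num_drops = 0, num_drops + 1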
A software implementation of the ResUNet-a models used in this work can be found on GitHub⁸.
⁶ http://www.bsf-swissphoto.com/unternehmen/ueber_uns_bsf.
⁷ A short tutorial on manual gradient aggregation with the GLUON API in the MXNET framework can be found online.
⁸ https://github.com/feevos/resuneta
Fig. A.16. Learning rate finder process for the FoV × 1 dataset. The model used was ResUNet-a d6 cmtsk. For this particular training profile, our learning rate choice was 0.001, although slightly higher values are possible (see Smith, 2018 for details).
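For completeness, a minimal sketch of the learning rate range test behind Fig. A.16, assuming the same hypothetical net, loss_fn, trainer and data_iter as in the sketches above; the start/end rates and num_batches_per_epoch are illustrative:

import numpy as np
from mxnet import autograd

# Increase the learning rate geometrically over roughly one epoch and record
# the training loss; the optimal rate sits near the steepest descent of the curve.
learning_rates = np.geomspace(1e-6, 1.0, num=num_batches_per_epoch)
losses = []
for lr, (data, label) in zip(learning_rates, data_iter):
    trainer.set_learning_rate(float(lr))
    with autograd.record():
        loss = loss_fn(net(data), label)
    loss.backward()
    trainer.step(data.shape[0])
    losses.append(float(loss.mean().asscalar()))
# Plot losses against learning_rates (log x-axis) to obtain a curve like Fig. A.16.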
Appendix B. Boundary and distance transform calculation
The boundaries and the distance transform can be estimated efficiently from the ground truth segmentation mask by PYTHON software routines such as those listed below. The input, labels, is a binary image in which 1 designates on-class and 0 off-class pixels. The labels array is two dimensional (i.e. a single-channel image of shape (Height, Width) – no channel dimension). In a multiclass context, the segmentation mask must be provided in one-hot encoding and the routines applied iteratively per channel.
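A minimal sketch of such routines with OPENCV and NUMPY; the erosion-based boundary extraction, the kernel size and the normalization of the distance map to [0, 1] are our assumptions, not necessarily the exact implementation used here:

import cv2
import numpy as np

def distance_map(labels):
    # Normalized distance transform of a binary (Height, Width) mask.
    dist = cv2.distanceTransform(labels.astype(np.uint8), cv2.DIST_L2, 3)
    if dist.max() > 0:
        dist = dist / dist.max()  # scale to [0, 1], the domain the Tanimoto loss expects
    return dist

def boundary_map(labels, kernel_size=3):
    # Object boundaries as the difference between a mask and its erosion.
    mask = labels.astype(np.uint8)
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    return mask - cv2.erode(mask, kernel, iterations=1)

In a multiclass setting, both routines are applied independently to each channel of the one-hot encoded segmentation mask.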
Appendix C. Classification results for all test tiles
In this appendix, we present classification results and error maps for all the test TOP tiles of the Potsdam ISPRS dataset (see Figs. C.17–C.19).
Fig. C.17. ResUNet-a best model results for tiles 4–13, 4–14, 4–15, 5–13. From left to right: input image, difference between ground truth and predictions, inference map. Image resolution: 6 k × 6 k, ground sampling distance of 5 cm.
Fig. C.18. As Fig. C.17 for tiles 5–14, 5–15, 6–13, 6–14.
Fig. C.19. As Fig. C.17 for tiles 6–15, 7–13.
Supplementary data associated with this article can be found, in the online version, at https://doi.org/10.1016/j.isprsjprs.2020.01.013.
He, K., Gkioxari, G., Dollár, P., Girshick, R.B., 2017. Mask R-CNN. CoRR abs/1703.06870. http://arxiv.org/abs/1703.06870.
He, K., Zhang, X., Ren, S., Sun, J., 2014. Spatial pyramid pooling in deep convolutional networks for visual recognition. CoRR abs/1406.4729. http://arxiv.org/abs/1406.4729.
He, K., Zhang, X., Ren, S., Sun, J., 2015. Deep residual learning for image recognition. CoRR abs/1512.03385. http://arxiv.org/abs/1512.03385.
He, K., Zhang, X., Ren, S., Sun, J., 2016. Identity mappings in deep residual networks. CoRR abs/1603.05027. http://arxiv.org/abs/1603.05027.
Huang, G., Liu, Z., Weinberger, K.Q., 2016. Densely connected convolutional networks. CoRR abs/1608.06993. http://arxiv.org/abs/1608.06993.
Ioffe, S., Szegedy, C., 2015. Batch normalization: accelerating deep network training by reducing internal covariate shift. CoRR abs/1502.03167. http://arxiv.org/abs/1502.03167.
ISPRS. International Society for Photogrammetry and Remote Sensing (ISPRS) and BSF Swissphoto: WG3 Potsdam overhead data. http://www2.isprs.org/commissions/comm3/wg4/tests.html.
Jaderberg, M., Simonyan, K., Zisserman, A., Kavukcuoglu, K., 2015. Spatial transformer networks. CoRR abs/1506.02025. http://arxiv.org/abs/1506.02025.
Kervadec, H., Bouchtiba, J., Desrosiers, C., Granger, E., Dolz, J., Ayed, I.B., 2018. Boundary loss for highly unbalanced segmentation. arXiv:1812.07032.
Kingma, D.P., Ba, J., 2014. Adam: a method for stochastic optimization. CoRR abs/1412.6980. http://arxiv.org/abs/1412.6980.
Lambert, M.J., Waldner, F., Defourny, P., 2016. Cropland mapping over Sahelian and Sudanian agrosystems: a knowledge-based approach using PROBA-V time series at 100-m. Remote Sens. 8, 232.
Längkvist, M., Kiselev, A., Alirezaie, M., Loutfi, A., 2016. Classification and segmentation of satellite orthoimagery using convolutional neural networks. Remote Sens. 8, 329.
LeCun, Y., Boser, B.E., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W.E., Jackel, L.D., 1989. Backpropagation applied to handwritten zip code recognition. Neural Comput. 1, 541–551. https://doi.org/10.1162/neco.1989.1.4.541.
Li, E., Femiani, J., Xu, S., Zhang, X., Wonka, P., 2015. Robust rooftop extraction from visible band images using higher order CRF. IEEE Trans. Geosci. Remote Sens. 53, 4483–4495.
Li, S., Jiao, J., Han, Y., Weissman, T., 2016. Demystifying ResNet. CoRR abs/1611.01186. http://arxiv.org/abs/1611.01186.
Li, X., Shao, G., 2014. Object-based land-cover mapping with high resolution aerial photography at a county scale in midwestern USA. Remote Sens. 6, 11372–11390.
Lin, T., Goyal, P., Girshick, R.B., He, K., Dollár, P., 2017. Focal loss for dense object detection. CoRR abs/1708.02002. http://arxiv.org/abs/1708.02002.
Liu, Y., Fan, B., Wang, L., Bai, J., Xiang, S., Pan, C., 2018. Semantic labeling in very high resolution images via a self-cascaded convolutional neural network. ISPRS J. Photogramm. Remote Sens. 145, 78–95.
Liu, Y., Minh Nguyen, D., Deligiannis, N., Ding, W., Munteanu, A., 2017a. Hourglass-shape network based semantic segmentation for high resolution aerial imagery. Remote Sens. 9. https://doi.org/10.3390/rs9060522.
Liu, Y., Piramanayagam, S., Monteiro, S.T., Saber, E., 2017b. Dense semantic labeling of very-high-resolution aerial imagery and lidar with fully-convolutional neural networks and higher-order CRFs. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Honolulu, USA.
Long, J., Shelhamer, E., Darrell, T., 2014. Fully convolutional networks for semantic segmentation. CoRR abs/1411.4038. http://arxiv.org/abs/1411.4038.
Lu, X., Yuan, Y., Zheng, X., 2017. Joint dictionary learning for multispectral change detection. IEEE Trans. Cybernetics 47, 884–897.
Ma, L., Liu, Y., Zhang, X., Ye, Y., Yin, G., Johnson, B.A., 2019. Deep learning in remote sensing applications: a meta-analysis and review. ISPRS J. Photogramm. Remote Sens. 152, 166–177. https://doi.org/10.1016/j.isprsjprs.2019.04.015.
Marmanis, D., Schindler, K., Wegner, J.D., Galliani, S., Datcu, M., Stilla, U., 2018. Classification with an edge: improving semantic image segmentation with boundary detection. ISPRS J. Photogramm. Remote Sens. 135, 158–172.
Marmanis, D., Wegner, J.D., Galliani, S., Schindler, K., Datcu, M., Stilla, U., 2016. Semantic segmentation of aerial images with an ensemble of CNNs.
Matikainen, L., Karila, K., 2011. Segment-based land cover mapping of a suburban area – comparison of high-resolution remotely sensed datasets using classification trees and test field points. Remote Sens. 3, 1777–1804.
Matthews, B., 1975. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA) – Protein Structure 405, 442–451. https://doi.org/10.1016/0005-2795(75)90109-9.
Milletari, F., Navab, N., Ahmadi, S., 2016. V-Net: fully convolutional neural networks for volumetric medical image segmentation. CoRR abs/1606.04797. http://arxiv.org/abs/1606.04797.
Myint, S.W., Gober, P., Brazel, A., Grossman-Clarke, S., Weng, Q., 2011. Per-pixel vs. object-based classification of urban land cover extraction using high spatial resolution imagery. Remote Sens. Environ. 115, 1145–1161.
Novikov, A.A., Major, D., Lenis, D., Hladuvka, J., Wimmer, M., Bühler, K., 2017. Fully convolutional architectures for multi-class segmentation in chest radiographs. CoRR abs/1701.08816. http://arxiv.org/abs/1701.08816.
Odena, A., Dumoulin, V., Olah, C., 2016. Deconvolution and checkerboard artifacts. Distill. http://distill.pub/2016/deconv-checkerboard/.
Paisitkriangkrai, S., Sherrah, J., Janney, P., van den Hengel, A., 2016. Semantic labeling of aerial and satellite imagery. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 9, 2868–2881.
Pan, S.J., Yang, Q., 2010. A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22, 1345–1359. https://doi.org/10.1109/TKDE.2009.191.
Pan, X., Gao, L., Marinoni, A., Zhang, B., Yang, F., Gamba, P., 2018a. Semantic labeling of high resolution aerial imagery and lidar data with fine segmentation network. Remote Sens. 10. https://doi.org/10.3390/rs10050743.
Pan, X., Gao, L., Zhang, B., Yang, F., Liao, W., 2018b. High-resolution aerial imagery semantic labeling with dense pyramid network. Sensors 18. https://doi.org/10.3390/s18113774.
Penatti, O.A., Nogueira, K., dos Santos, J.A., 2015. Do deep features generalize from everyday objects to remote sensing and aerial scenes domains? In: 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 44–51. https://doi.org/10.1109/CVPRW.2015.7301382.
Piramanayagam, S., Saber, E., Schwartzkopf, W., Koehler, F.W., 2018. Supervised classification of multisensor remotely sensed images using a deep learning framework. Remote Sens. 10. https://doi.org/10.3390/rs10091429.
Rawat, W., Wang, Z., 2017. Deep convolutional neural networks for image classification: a comprehensive review. Neural Comput. 29, 2352–2449. https://doi.org/10.1162/neco_a_00990. PMID: 28599112.
Ronneberger, O., Fischer, P., Brox, T., 2015. U-Net: convolutional networks for biomedical image segmentation. CoRR abs/1505.04597. http://arxiv.org/abs/1505.04597.
Ruder, S., 2017. An overview of multi-task learning in deep neural networks. CoRR abs/1706.05098. http://arxiv.org/abs/1706.05098.
Sergeev, A., Balso, M.D., 2018. Horovod: fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799.
Sherrah, J., 2016. Fully convolutional networks for dense semantic labelling of high-resolution aerial imagery. CoRR abs/1606.02585. http://arxiv.org/abs/1606.02585.
Smith, L.N., 2018. A disciplined approach to neural network hyper-parameters: part 1 – learning rate, batch size, momentum, and weight decay. CoRR abs/1803.09820. http://arxiv.org/abs/1803.09820.
Sørensen, T., 1948. A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on Danish commons. Biol. Skr. 5, 1–34.
Sudre, C.H., Li, W., Vercauteren, T., Ourselin, S., Cardoso, M.J., 2017. Generalised Dice overlap as a deep learning loss function for highly unbalanced segmentations. CoRR abs/1707.03237. http://arxiv.org/abs/1707.03237.
Taghanaki, S.A., Abhishek, K., Cohen, J.P., Cohen-Adad, J., Hamarneh, G., 2019. Deep semantic segmentation of natural and medical images: a review. arXiv:1910.07655.
Vadivel, A., Sural, S., Majumdar, A.K., 2005. Human color perception in the HSV space and its application in histogram generation for image retrieval. https://doi.org/10.1117/12.586823.
Vincent, L., Soille, P., 1991. Watersheds in digital spaces: an efficient algorithm based on immersion simulations. IEEE Trans. Pattern Anal. Mach. Intell. 583–598.
Volpi, M., Tuia, D., 2017. Dense semantic labeling of subdecimeter resolution images with convolutional neural networks. IEEE Trans. Geosci. Remote Sens. 55, 881–893.
Waldner, F., Hansen, M.C., Potapov, P.V., Löw, F., Newby, T., Ferreira, S., Defourny, P., 2017. National-scale cropland mapping based on spectral-temporal features and outdated land cover information. PLoS One 12, e0181911.
Wen, D., Huang, X., Liu, H., Liao, W., Zhang, L., 2017. Semantic classification of urban trees using very high resolution satellite imagery. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 10, 1413–1424.
Xie, S., Tu, Z., 2015. Holistically-nested edge detection. CoRR abs/1504.06375. http://arxiv.org/abs/1504.06375.
Xie, S.M., Jean, N., Burke, M., Lobell, D.B., Ermon, S., 2015. Transfer learning from deep features for remote sensing and poverty mapping. CoRR abs/1510.00098. http://arxiv.org/abs/1510.00098.
Yang, H., Wu, P., Yao, X., Wu, Y., Wang, B., Xu, Y., 2018. Building extraction in very high resolution imagery by dense-attention networks. Remote Sens. 10. https://doi.org/10.3390/rs10111768.
Zagoruyko, S., Komodakis, N., 2016. Wide residual networks. CoRR abs/1605.07146. http://arxiv.org/abs/1605.07146.
Zhang, H., Dana, K., Shi, J., Zhang, Z., Wang, X., Tyagi, A., Agrawal, A., 2018. Context encoding for semantic segmentation. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Zhang, Q., Seto, K.C., 2011. Mapping urbanization dynamics at regional and global scales using multi-temporal DMSP/OLS nighttime light data. Remote Sens. Environ. 115, 2320–2329.
Zhang, Z., Liu, Q., Wang, Y., 2017. Road extraction by deep residual U-Net. CoRR abs/1711.10684. http://arxiv.org/abs/1711.10684.
Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J., 2017a. Pyramid scene parsing network. In: CVPR.
Zhao, W., Du, S., Wang, Q., Emery, W.J., 2017b. Contextually guided very-high-resolution imagery classification with semantic segments. ISPRS J. Photogramm. Remote Sens. 132, 48–60. https://doi.org/10.1016/j.isprsjprs.2017.08.011.
Zhu, J., Park, T., Isola, P., Efros, A.A., 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. CoRR abs/1703.10593. http://arxiv.org/abs/1703.10593.
Zhu, X.X., Tuia, D., Mou, L., Xia, G., Zhang, L., Xu, F., Fraundorfer, F., 2017. Deep learning in remote sensing: a comprehensive review and list of resources. IEEE Geosci. Remote Sens. Mag. 5, 8–36. https://doi.org/10.1109/MGRS.2017.2762307.