Satellite Image Segmentation With Convolutional Neural Networks (CNN)
Road Segmentation
Dario Pavllo, Mattia Martinelli and Chanhee Hwang
École polytechnique fédérale de Lausanne
Switzerland
[email protected], [email protected], [email protected]
I. INTRODUCTION

Image segmentation is a technique that is becoming increasingly popular for various tasks in computer vision. Generally speaking, this process consists in labelling every part of an image according to certain criteria. For instance, such a technique could be used for face detection in photos, or for road detection in autonomous vehicles.
Recently, the increase in computing performance, as well as the ability to exploit massively parallel computation with GPUs, has led to the development of new machine learning techniques that are able to process images in reasonable time. However, image processing has always been a challenging task, as the information is organized in a definite geometrical structure, and algorithms should therefore take this morphology into account. In addition, the computational cost of processing an image does not scale linearly with its size, and this has led to the development of new techniques such as convolutional neural networks, which foster sparse connections and weight sharing in order to reduce the complexity of the problem.

The aim of this project is to build a model that is able to perform the segmentation of satellite images. Specifically, the segmentation consists in detecting which parts of the images are roads, and which parts are background (e.g. buildings, fields, water).

This report provides a brief overview of different methods that can be used to solve this problem, and particularly it addresses convolutional neural networks, which represent the state-of-the-art technique for image classification. The rest of the report is organized as follows: Section III proposes several approaches and methods that can be used to perform this task; Section IV provides some implementation details that have proved useful to improve the performance of our model; Section V explains the methodology of the accuracy validation of the proposed techniques; and Section VI gives a comparison of the obtained results.

Fig. 1: Some areas from the training set and their respective ground truth masks (in red).

II. EXPLORATORY DATA ANALYSIS

The dataset consists of 100 satellite images of urban areas and their respective ground truth masks, where white pixels represent roads (foreground) and black pixels represent the rest (background).

The task is to classify blocks of 16×16 pixels: the label associated with each block is 1 if the average value of the ground truth pixels in that block is greater than a threshold (0.25), and 0 otherwise.

By looking at the training set, it can be observed that the classification task is not trivial, as some roads are covered by trees. Furthermore, some asphalt areas are not labelled as road (e.g. parking lots and walkways), and this could potentially confuse the training model. Figure 1 shows these complications.

For these reasons, it is reasonable to think that the classifier should take some context into consideration, i.e. it should look at nearby pixels in order to infer information about the block that is being classified.
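To make the labelling rule concrete, a minimal sketch is shown below; the function names and the use of NumPy are our own illustration, not details taken from the original implementation.

import numpy as np

# Illustrative sketch of the block-labelling rule described above
# (helper names are hypothetical, not from the original code).
PATCH_SIZE = 16
FOREGROUND_THRESHOLD = 0.25

def patch_to_label(gt_patch):
    """Label a 16x16 ground truth patch: 1 (road) if the mean pixel
    value exceeds the threshold, 0 (background) otherwise."""
    return 1 if gt_patch.mean() > FOREGROUND_THRESHOLD else 0

def image_to_labels(gt_image):
    """Split a ground truth mask (values in [0, 1]) into 16x16 blocks
    and return the grid of binary block labels."""
    h, w = gt_image.shape
    labels = np.zeros((h // PATCH_SIZE, w // PATCH_SIZE), dtype=int)
    for i in range(0, h - h % PATCH_SIZE, PATCH_SIZE):
        for j in range(0, w - w % PATCH_SIZE, PATCH_SIZE):
            patch = gt_image[i:i + PATCH_SIZE, j:j + PATCH_SIZE]
            labels[i // PATCH_SIZE, j // PATCH_SIZE] = patch_to_label(patch)
    return labels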
III. MODELS AND METHODS

In order to evaluate the quality of the proposed model, it is important to have a baseline model that can be used as a comparison for the classification accuracy. Based on the observation that there are fewer foreground areas than background areas, the first baseline model that has been used classifies all blocks as background (i.e. 0).
It is possible to improve on this initial result by using a linear classification model, such as logistic regression. However, for the reasons mentioned in the previous section, such a model would not be able to correctly exploit the context of the images, since it is not capable of detecting their morphological structure (unless complex feature extraction techniques are used).

In order to confirm this claim, a logistic regression model has been implemented for comparison purposes. The input features correspond to the mean (over the entire 16×16 patch) and the standard deviation of each RGB channel (for a total of 6 features), and they are transformed according to a polynomial basis of degree 4 (including interactions as well).
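As an illustration, such a baseline could be assembled with scikit-learn as sketched below; the helper name is ours, and the original implementation may differ in its details.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

def patch_features(patch):
    """6 features per 16x16 RGB patch: per-channel mean and std."""
    return np.concatenate([patch.mean(axis=(0, 1)), patch.std(axis=(0, 1))])

# Degree-4 polynomial expansion (interaction terms included by default)
# followed by logistic regression, mirroring the baseline described above.
model = make_pipeline(PolynomialFeatures(degree=4), LogisticRegression())
# X = np.array([patch_features(p) for p in patches]); model.fit(X, y)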
However, the most reasonable choice would be to use a model based on convolutional neural networks (CNNs), since they are well suited to images. Indeed, this model has provided excellent results on our dataset and has been adopted as the final solution.

According to our research, several methods have been proposed to solve the task of per-pixel classification; the one adopted in this project consists of a sliding window approach [1]: the objective is to classify the block at the centre of an image according to a certain context, which in this case corresponds to a square window of size window_size × window_size (a hyperparameter). Figure 2 shows this technique more clearly.

Fig. 2: Sliding window approach. The small square at the centre is the patch (of size 16×16) that is being classified, whereas the big square represents the current context (i.e. window, of size 72×72). In this figure, subsequent windows are spaced apart by 16 pixels (stride).
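A minimal sketch of how such windows could be extracted is given below; the mirror padding at the image border and the helper name are our assumptions rather than details stated in this section.

import numpy as np

PATCH_SIZE = 16    # block being classified
WINDOW_SIZE = 72   # context window (hyperparameter)

def extract_windows(image):
    """Yield one 72x72 context window per 16x16 patch, sliding with a
    stride of 16 pixels. The border is mirror-padded so that patches
    near the edges still receive a full-sized window (a common choice;
    the original padding strategy is not specified here)."""
    pad = (WINDOW_SIZE - PATCH_SIZE) // 2
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode='reflect')
    h, w = image.shape[:2]
    for i in range(0, h, PATCH_SIZE):
        for j in range(0, w, PATCH_SIZE):
            yield padded[i:i + WINDOW_SIZE, j:j + WINDOW_SIZE]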
As far as the neural network structure is concerned, the number of layers and filters has been optimized to perform well on this dataset. Furthermore, the following features have been explored:

A. Activation functions

ReLUs are the standard choice for deep neural networks; however, when a high learning rate is used, some units can get stuck and cause so-called dead filters. This problem can be mitigated by using a lower learning rate, at the cost of a longer training time (which can already be excessive). For this reason, a variant known as Leaky ReLU has been used as the activation function for all intermediate layers, with good results. It is defined as f(x) = max(x, αx), with α < 1, and in our case α has been chosen to be equal to 0.1. Although this might seem a high value, some studies have shown that higher values perform better than lower ones [2], and with this dataset α = 0.1 has proved effective at preventing dead filters, as shown in Figure 3.

Fig. 3: Visualization of the filters in the first layer, with the same model and the same training set, but different activation functions: (a) dead filters (ReLU); (b) good filters (Leaky ReLU).
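The definition above is straightforward to state in code; the snippet below is a plain NumPy illustration, not code from the project.

import numpy as np

def leaky_relu(x, alpha=0.1):
    """Leaky ReLU: f(x) = max(x, alpha * x), here with alpha = 0.1."""
    return np.maximum(x, alpha * x)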
B. Image augmentation

Since the dataset is very small (100 images), an image augmentation strategy has been adopted to virtually increase its size. Specifically, before being supplied to the neural network, each training sample (i.e. window) is randomly rotated in steps of 90 degrees, and it is also randomly flipped horizontally/vertically. This effectively increases the dataset size by a multiplicative factor of 8, and has been shown to greatly improve the accuracy of the model. The implementation details of this technique are given in Section IV.
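A minimal sketch of such an on-the-fly augmentation step, assuming NumPy image arrays (the function name is hypothetical):

import numpy as np

def augment(window, rng=None):
    """Randomly rotate a training window by a multiple of 90 degrees and
    randomly flip it; the 8 distinct outcomes form the symmetries of the
    square, matching the x8 factor mentioned above."""
    if rng is None:
        rng = np.random.default_rng()
    window = np.rot90(window, k=int(rng.integers(4)))  # 0/90/180/270 degrees
    if rng.integers(2):
        window = np.flipud(window)                     # vertical flip
    if rng.integers(2):
        window = np.fliplr(window)                     # horizontal flip
    return window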
C. Regularization

Although the dataset augmentation helps to reduce overfitting, the use of Dropout layers has been very effective in our model. They have been added after each max-pooling layer (with p = 0.25), and also after the fully connected layer (with p = 0.5). Furthermore, L2 regularization has been used for the weights (but not the biases) of the fully connected and output layers, with λ = 10⁻⁶.

The window size has been empirically chosen so as to take into account a context that is large enough, considering that large windows are computationally expensive. A size of 72×72 has proved to be a good compromise. Table I shows the complete structure of the proposed neural network, which is the result of various experiments.
Type                            Notes
Input                           72×72×3
Convolution + Leaky ReLU        64 5×5 filters
Max Pooling                     2×2
Dropout                         p = 0.25
Convolution + Leaky ReLU        128 3×3 filters
Max Pooling                     2×2
Dropout                         p = 0.25
Convolution + Leaky ReLU        256 3×3 filters
Max Pooling                     2×2
Dropout                         p = 0.25
Convolution + Leaky ReLU        256 3×3 filters
Max Pooling                     2×2
Dropout                         p = 0.25
Fully connected + Leaky ReLU    128 neurons
Dropout                         p = 0.5
Output + Softmax                2 neurons

TABLE I: Full list of layers in the neural network.
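Table I maps directly onto a stack of standard layers. The sketch below uses Keras purely as an illustration; the original framework, padding mode, and training settings are not specified in this section, so details such as the unpadded ("valid") convolutions are our assumptions.

from tensorflow import keras
from tensorflow.keras import layers, regularizers

l2 = regularizers.l2(1e-6)  # L2 penalty on weights only, not biases

def conv_block(filters, kernel_size, **kwargs):
    """Convolution + Leaky ReLU + 2x2 max pooling + dropout, as in Table I.
    Note: newer Keras versions call the slope argument negative_slope."""
    return [
        layers.Conv2D(filters, kernel_size, **kwargs),
        layers.LeakyReLU(alpha=0.1),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Dropout(0.25),
    ]

model = keras.Sequential(
    conv_block(64, (5, 5), input_shape=(72, 72, 3))
    + conv_block(128, (3, 3))
    + conv_block(256, (3, 3))
    + conv_block(256, (3, 3))
    + [
        layers.Flatten(),
        layers.Dense(128, kernel_regularizer=l2),
        layers.LeakyReLU(alpha=0.1),
        layers.Dropout(0.5),
        layers.Dense(2, activation='softmax', kernel_regularizer=l2),
    ]
)

With unpadded convolutions, the 72×72 input shrinks to a 2×2×256 volume before the fully connected layer.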
#   Model                  Accuracy
A   All background         74.09% ± 1.2%
B   Logistic regression    78.53% ± 0.2%
C   CNN                    92.05% ± 0.9%
D   CNN + LR               92.22% ± 0.9%
E   CNN + LR + D           92.57% ± 1.0%
F   CNN + LR + D + IA      92.89% ± 0.7%

TABLE II: Tested models along with their respective cross-validation results.
Legend: CNN: Convolutional Neural Network; LR: Leaky ReLU; D: Dropout; IA: Image Augmentation.