Classification of Textures Using Convolutional Neural Networks
Ivar Persson
[email protected]
Supervisors:
Magnus Oskarsson, Kalle Åström
Master’s thesis
Lund University
Abstract
This Master's thesis has dealt with segmentation and classification of background textures in images. To segment the images we used an algorithm called SLIC to create so-called superpixels. A superpixel is a collection of pixels that are spatially close to each other and similar in colour. These superpixels were then classified by a Convolutional Neural Network (CNN), which was trained by showing it superpixels that had already been manually classified. Since this network made roughly 30% errors, a second stage was added to the classification algorithm. In this second stage, the initial classifications of the neighbouring superpixels increased the probability of those texture types in the superpixel in question. As a comparison, a neural network was also trained to handle neighbouring superpixels.
By using the convolutional neural network followed by the neural network for neighbouring superpixels, the errors were lowered to roughly 10%, while the plain handling of neighbouring superpixels only lowered the errors by a few percentage points. Although the errors were this low, the results could not be reproduced on unknown images, probably because the neural networks were overfitted to the relatively small number of superpixels. When a neural network is overfitted, it has a very strong ability to recognise, in this case, superpixels similar to those used during training, but it is unable to correctly classify other superpixels. The method with two neural networks has potential, and with more training images and possibly smaller superpixels, the results from training could perhaps be reproduced on unknown data.
Acknowledgements
I would like to thank my supervisors Magnus Oskarsson and Kalle Åström, who helped me plan
this thesis work and helped me throughout the project with all different kinds of problems. I would
also like to direct many thanks to Anna Broms and Anna Åberg: Anna Broms for your wizardry
with LaTeX, Anna Åberg for your uncanny ability to find all of my spelling errors, and both of you
for acting as sounding boards, willingly or not.
Contents
1 Introduction
  1.1 Texture classification
  1.2 Related work
2 Theory
  2.1 Superpixels
  2.2 Convolutional neural network
  2.3 Training a CNN
    2.3.1 Back propagation
    2.3.2 Measuring the error
  2.4 Introducing spatial biases
3 Implementation
  3.1 Image material
  3.2 Structure of program
  3.3 Texture choices
4 Results
  4.1 Overall performance of the network
  4.2 Visual illustration of program performance
  4.3 Different types of misclassifications
Figure 1: A typical image where the background textures were to be segmented and classified.
1.1 Texture classification
As noted, we will classify background textures. There will be a total of 15 classes, including grass, forest, bushes, sky, water and an object class. These are all outdoor textures, and there is a good reason for that. Imagine any indoor room: it is filled with objects, for example a couch, a bookshelf or a table, and there are very few open areas with, for example, just floor or wall, as these are probably covered by rugs and paintings. Outdoors, we can often find images with large sections of forest or ocean, which works to our advantage as we want images with a lot of background texture and few objects.
When we speak of texture in everyday life, we often speak of the feel of a material, for example a fabric or a piece of wood. This is the notion we want the CNNs to use to differentiate between similar textures. As an example, consider blue skies and water. How do we separate them when both are blue? The water will often have ripples and other small waves, while a blue sky will be a smoother, more homogeneous blue colour. Traits like this will hopefully separate similar textures from each other.
2 Theory
In this theory section we present the relevant theory about superpixels: why we want them and how we compute them. Next we explain how CNNs work and how training a CNN is a large optimisation problem. We will not cover general artificial neural networks, although it can be pointed out that some of the theory is the same. Finally, we present the second stage of our classification algorithm, the spatial biases, and how they can help reduce error rates.
2.1 Superpixels
As we are classifying textures, it is logical that we need to analyse a larger area, not just a single pixel, to determine which texture it belongs to. A quick and easy solution would be to divide the image into squares and analyse each square by itself; however, texture boundaries do not follow straight lines, and we therefore need a robust way to select a collection of pixels that all describe one texture only. This is why we want superpixels. The result of a superpixel segmentation can be seen in Figure 2, where two different numbers of superpixels have been used. As we can see, pixels that are spatially close to each other and similar in colour are bundled together, just as we intended.
There exist many different algorithms for computing superpixels, such as the well-known graph-based algorithm by Shi and Malik [15]. However, Achanta et al. [1] show that their proposed algorithm, Simple Linear Iterative Clustering (SLIC), is quicker, more memory efficient and more robust in terms of obeying boundaries. In MATLAB there is a prewritten implementation (superpixels.m) of the SLIC algorithm which we are able to use. The algorithm by Achanta et al. is described in Algorithm 1.
Figure 2: An image of a building segmented with two different numbers of superpixels: (a) 500 superpixels and (b) 1000 superpixels.
The authors state that the while-statement on row 7 can be replaced by a for-statement with ten iterations, which is usually enough for most images [1]; this is also the default in MATLAB. We have, until now, deliberately referred to the distance measure only as some sort of measure. When we calculate the distance between a cluster centre C_k and a pixel p = [l_p, a_p, b_p, x_p, y_p] we cannot use an ordinary 5D Euclidean distance, as it would give different importance to spatial proximity for different sizes of superpixels. Therefore, we define a modified distance D' as a combination of the colour proximity d_c and the spatial proximity d_s:
d_c = \sqrt{(l_k - l_p)^2 + (a_k - a_p)^2 + (b_k - b_p)^2},
d_s = \sqrt{(x_k - x_p)^2 + (y_k - y_p)^2},    (1)
D' = \sqrt{\left(\frac{d_c}{N_c}\right)^2 + \left(\frac{d_s}{N_s}\right)^2}.
Here we get a 3D Euclidean measure for the colour difference and a 2D Euclidean measure for the spatial difference, which are then weighted against each other. The maximum expected width and height of a superpixel is the grid interval, so we set N_s = S. The maximum colour distance N_c is set to some constant m, as it is harder to define a maximum colour distance; every pixel is different and we therefore leave m as a constant. By inserting the values of N_s and N_c into D' in equation (1) and multiplying by m, we get the distance measure used in the algorithm
D = \sqrt{d_c^2 + m^2 \left(\frac{d_s}{S}\right)^2}.
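As a small worked example, D can be evaluated directly from the CIELAB values and image coordinates of a pixel and a cluster centre. The following MATLAB sketch is only an illustration with made-up numbers (the cluster centre, the pixel and the grid interval S are hypothetical), not code from the thesis.

    Ck = [53, 10, -5, 40, 60];           % cluster centre [l a b x y] (hypothetical values)
    p  = [50, 12, -7, 43, 57];           % pixel [l a b x y] (hypothetical values)
    m  = 10;                             % weighting constant m
    S  = 25;                             % grid interval S (hypothetical)
    dc = norm(Ck(1:3) - p(1:3));         % colour proximity d_c in CIELAB
    ds = norm(Ck(4:5) - p(4:5));         % spatial proximity d_s
    D  = sqrt(dc^2 + m^2 * (ds / S)^2);  % combined distance D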
This leaves m as a tunable parameter: a small m promotes colour proximity and a large m promotes spatial proximity [1]. Achanta et al. also tell us that m ∈ [1, 40] for images in CIELAB space. In the MATLAB implementation m = 10 by default, and we will also use this value. These default values produced the superpixel segmentation in Figure 2.
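In practice we rely on the built-in implementation rather than computing the clusters ourselves. A minimal sketch of how it can be called is shown below; the file name is a placeholder, and boundarymask and imoverlay are Image Processing Toolbox utilities used here only to visualise the result.

    img = imread('example.jpg');                                      % placeholder file name
    [labels, numLabels] = superpixels(img, 500, 'Compactness', 10);   % SLIC, m = 10, as in Figure 2a
    mask = boundarymask(labels);                                      % superpixel boundaries
    imshow(imoverlay(img, mask, 'cyan'))                              % overlay, similar to Figure 2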
2.2 Convolutional neural network

The central operation in a CNN is the two-dimensional convolution

y(m, n) = \sum_{i=-\infty}^{\infty} \sum_{j=-\infty}^{\infty} x(i, j)\, h(m - i, n - j),    (2)

where h is the convolution kernel and x is the input image. The output y is known as the activation map or feature map. An illustration of this convolution can be seen in Figure 3, where the convolution kernel (in the middle) is element-wise multiplied with all elements across the image [13]. The values of the elements in the convolution kernel are known as weights w_ji; the subscripts indicate that w_ji is the i:th weight belonging to neuron j. The weight w_j0 is known as the bias of neuron j and is added to each element of the activation map. The new image created by the convolution, the activation map, is seen to the right in Figure 3.
When constructing a convolution kernel, the network designer has a few choices. First, the size of the kernel has to be considered: a large kernel will analyse the larger context of the image, while a small one will capture structures at a finer level. In Figure 3 we see a fairly small kernel of size 3 × 3. Secondly, one can choose to use zero-padding on some or all sides of the image in order to control the size of the output from the convolution; for example, in Figure 3 we seem to have one line of zero padding on the left, right and top. In an ordinary two-dimensional convolution, such as equation (2), the summations are over the range [−∞, ∞]. However, the input image x is not infinitely large, and for the convolutions here no calculations are made where the kernel does not completely cover the image.
Figure 3: Two-dimensional convolution. The convolution kernel (bottom) is element-wise multiplied with each element the kernel covers, in this case a 3 × 3 area, before it moves on to the next location.
Figure 4: Illustration of how a non-dilated convolution kernel looks compared to one with a dilation of one.
Once we have chosen the filter size and zero-padding, it is also common to consider the stride. The filter response is usually calculated at one position before the filter is moved one step. This may not be the best choice in all cases, as steps larger than one pixel are sometimes needed. The stride can, among other methods still to be described, be used to control the reduction of the dimensions of the activation map [8]. Furthermore, one can choose to dilate the convolution kernel, i.e. insert rows and columns of zeros into it, as seen in Figure 4. This has been shown to lower errors in semantic segmentation [22], where textures are considered.
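To make the choices above concrete, the sketch below computes one activation map in MATLAB with a 'valid' convolution (no computations where the kernel does not completely cover the image), a bias, and a stride applied afterwards. It is an assumed, stand-alone example with hypothetical sizes and values, not the network implementation used in the thesis.

    x = rand(32, 32);                 % input patch (hypothetical size)
    h = randn(3, 3);                  % convolution kernel, the weights w_ji
    b = 0.1;                          % the bias w_j0
    y = conv2(x, h, 'valid') + b;     % activation map, here of size 30 x 30
    stride = 2;                       % keep every second response in each direction
    y_strided = y(1:stride:end, 1:stride:end);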
We have chosen to use convolution kernels of size 5 × 5 and 6 × 6, a little bigger than the one shown in Figure 3. Since we are dealing with textures, we do not want to analyse, for example, individual tree branches; we want to analyse the entire tree. The superpixels generated by the SLIC algorithm can have various shapes and are therefore padded with zeros to become rectangular, which is needed when running them through the network. Since there are already zeros around the pixels, any additional zero padding is deemed unnecessary. We opted for a stride of one, since the images are not very big and we do not need rapid dimension reduction. Although we are doing semantic segmentation of images and should therefore consider dilated kernels as proposed in [22], our network classifies entire superpixels in a manner similar to object detection networks, and therefore we do not dilate the convolution kernels. One can say that the network does not consider the entire image or its neighbouring parts; it only classifies the superpixel.

Figure 5: Illustration of the max pooling operation. 2 × 2 filters with stride 2 are placed over the activation map of size 4 × 4. The maximum value in each area is passed on to the next step.
Pooling is a type of down-sampling of the activation map from a convolution. Some common types of pooling are max, average and sum pooling. Such layers are used to reduce the spatial dimensions of the activation maps passed to the next layer [14]. Pooling is performed by running a filter, usually of size 2 × 2 or 3 × 3, over the activation map in a similar manner as the convolution kernel. However, instead of convolving all values within the range of the kernel, the maximum, average or sum is used as the value of the activation map in the next layer. In Figure 5 we can see an example of the effect of max pooling with 2 × 2 filters and stride 2. Scherer et al. [14] show that max pooling is often superior, and max pooling is now the standard pooling type in many networks [9, 7, 17]. As with the convolution kernel, the network designer has to choose the size and stride of the pooling filter. Krizhevsky et al. [9] show that overlapping pooling layers, i.e. pooling with for example size 3 and stride 2, reduce error rates and decrease the risk of overfitting the network to the training data. Overfitting is the phenomenon where the network learns to recognise the training data very well but is neither stable nor capable of classifying objects not belonging to the training set. In accordance with the article by Krizhevsky et al. [9] we use pooling with size 3 and stride 2, and since Scherer et al. [14] show that max pooling is superior we use that as well.
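The sketch below shows overlapping max pooling with a 3 × 3 window and stride 2, the configuration chosen above, written out with plain loops in MATLAB. It is a minimal assumed example for illustration; the activation map size is hypothetical.

    A = rand(9, 9);                               % activation map (hypothetical size)
    poolSize = 3;  stride = 2;                    % overlapping pooling as in [9]
    nRows = floor((size(A, 1) - poolSize) / stride) + 1;
    nCols = floor((size(A, 2) - poolSize) / stride) + 1;
    P = zeros(nRows, nCols);
    for r = 1:nRows
        for c = 1:nCols
            rows = (r - 1) * stride + (1:poolSize);
            cols = (c - 1) * stride + (1:poolSize);
            P(r, c) = max(max(A(rows, cols)));    % maximum value in each 3 x 3 area
        end
    end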
Within the field of neuroscience it has been noted that the firing intensity of a neuron is not linearly
dependent on the input activation [5]. If we translate this fact about biological neurons into these artificial
ones, it would mean that the actual value of the activation map will not be linearly dependent on the value
calculated from the convolution and pooling. Therefore, a non-linear activation function, applied to each
element in the activation map, has been introduced to better simulate the natural behaviour of biological
neurons. Historically, a sigmoid or hyperbolic tangent function has been used [5]. This is, however, not the best choice, as Glorot et al. [5] show in their article, where the non-linear and non-symmetric function

f(x) = \begin{cases} x, & x \ge 0 \\ 0, & x < 0, \end{cases}    (3)
is used. This is known as the Rectified Linear Unit (ReLU). LeCun et al. [11] describe how the ReLU gives a quicker learning process than the sigmoid function. It has been noted that cortical neurons (brain cells) rarely fire at full intensity, which is why a non-saturating activation function (a function without supremum), such as the ReLU, can be suitable [5]. In 2015, He et al. [7] proposed the Parametric Rectified Linear Unit (PReLU) to replace the ReLU. The PReLU is similar to the ReLU in equation (3); however, for inputs smaller than zero we still get some non-zero output. We have
f(x) = \begin{cases} x, & x \ge 0 \\ a_i x, & x < 0, \end{cases}    (4)
where a_i is a trainable scalar connected to that neuron. The PReLU outperformed the ordinary ReLU on ImageNet (the unofficial object recognition olympiad) [7]. The implementation used here, MatConvNet for MATLAB, has a more static version of the PReLU, called the Leaky Rectified Linear Unit (LReLU), where a_i is set to some constant for all neurons [19]. The LReLU does, however, not increase the performance of the network significantly [7].
In order to limit the implementation work we have chosen the ReLU, and not the PReLU or LReLU, as activation function. This function is applied to the activation map after every convolution, to mimic nature as much as possible. Even though the PReLU outperforms the ReLU, many systems use the ReLU and still reach very low error rates [9, 17], and we can therefore save implementation time without jeopardising the performance of the network. As the LReLU does not significantly increase performance, there is no need to use it.
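Both activation functions are simple element-wise operations. As a minimal sketch (an assumed example, not the MatConvNet implementation), they can be written as:

    relu  = @(V) max(V, 0);                       % ReLU, equation (3)
    lrelu = @(V, a) max(V, 0) + a * min(V, 0);    % LReLU with a fixed slope a
    V = randn(4, 4);                              % hypothetical activation map
    Y = relu(V);                                  % applied element-wise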
2.3.1 Back propagation
The back propagation algorithm is based on several partial differentiations together with the chain rule. This derivation follows the one in [4], with some extra comments. In a neuron we have several weights connecting other neurons to this neuron; high values of the weights indicate strong connections. In general, we have a convolution and get an activation map v_j(k) as
v_j(k) = \sum_{i=0}^{n} w_{ji}(k)\, x_i(k),    (5)
where v_j(k) is the activation map belonging to the j:th neuron in epoch k. For each k and j, v is a matrix. The input to neuron j is x_i(k), i.e. the output y_i(k) from neuron i, which is connected to neuron j by the weights w_{ji}(k). We can compare with Figure 3, where neuron i is to the left, the convolution kernel is composed of the values w_{ji}(k) and the resulting activation map is to the right. In total, n weights connect neuron i to neuron j. A schematic image of the connections between neurons can be seen in Figure 6, where our neuron j is connected to 4 neurons. The final output from neuron j is then given by

y_j(k) = \Phi_j(v_j(k)),    (6)

where \Phi_j is the activation function of neuron j. For a neuron with a known desired output d_j(k) we can form the error

e_j(k) = d_j(k) - y_j(k)    (7)

and the corresponding error energy

E_j(k) = \frac{1}{2} e_j^2(k).    (8)

Differentiating the error energy with respect to a weight and applying the chain rule gives

\frac{\partial E_j(k)}{\partial w_{ji}(k)} = \frac{\partial E_j(k)}{\partial e_j(k)}\, \frac{\partial e_j(k)}{\partial y_j(k)}\, \frac{\partial y_j(k)}{\partial v_j(k)}\, \frac{\partial v_j(k)}{\partial w_{ji}(k)}.
We can now identify the partial derivatives from equations (8), (7), (6) and (5):

\frac{\partial E_j(k)}{\partial w_{ji}(k)} = -e_j(k)\, \Phi_j'(v_j(k))\, y_i(k),

where we define

\delta_j(k) = -\frac{\partial E_j(k)}{\partial v_j(k)} = e_j(k)\, \Phi_j'(v_j(k))
as the local gradient in neuron j. For the hidden neurons in our CNN no ground truth exists, so we cannot form an error or an error energy; how can the local gradients for the hidden layers then be calculated? We begin by considering the error e_l(k) in some non-hidden neuron l and the total error energy E(k) = \sum_j E_j(k) in the network. We obtain

\frac{\partial E(k)}{\partial y_j(k)} = \sum_l e_l(k)\, \frac{\partial e_l(k)}{\partial v_l(k)}\, \frac{\partial v_l(k)}{\partial y_j(k)},    (10)

where

\frac{\partial e_l(k)}{\partial v_l(k)} = -\Phi_l'(v_l(k))    (11)

and

\frac{\partial v_l(k)}{\partial y_j(k)} = w_{lj}(k).    (12)

Figure 6: Schematic image of connections between neurons. Inside neuron j we see the summation from equation (5), the activation function from equation (6) and the output y_j(k).
Substituting (11) and (12) into equation (10) we get

\frac{\partial E(k)}{\partial y_j(k)} = -\sum_l e_l(k)\, \Phi_l'(v_l(k))\, w_{lj}(k) = -\sum_l \delta_l(k)\, w_{lj}(k),

where \delta_l(k) is the local gradient. We can now describe the local gradient in a hidden neuron j as

\delta_j(k) = \Phi_j'(v_j(k)) \sum_l \delta_l(k)\, w_{lj}(k),

i.e. the sum of the previous gradients times the weights, times the derivative of the activation function in neuron j. To update the weights we need two more parameters: the learning rate and the momentum. We define the size of the weight update of the weight from neuron i to neuron j, \Delta w_{ji}(k), as
\Delta w_{ji}(k) = \beta\, \Delta w_{ji}(k-1) + \eta\, \delta_j(k)\, y_i(k),

where \eta is the learning rate. In the first term we take the last update \Delta w_{ji}(k-1) from the previous epoch k-1 and multiply it with our momentum \beta, commonly set to 0.9 [9, 17, 22]. The second term is a correction term that steers the updates in the correct direction. By using this momentum we avoid getting stuck when the local gradient is small.
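As an illustration of the update rule, the sketch below applies it to a single weight in MATLAB. It is an assumed example: the learning rate is a hypothetical value, and the local gradient and neuron output are replaced by random placeholders to keep the sketch self-contained.

    eta  = 0.01;       % learning rate (hypothetical value)
    beta = 0.9;        % momentum, as in [9, 17, 22]
    w = randn;         % the weight w_ji
    dw_prev = 0;       % previous update, Delta w_ji(k-1)
    for k = 1:100
        delta_j = randn;                             % placeholder for delta_j(k)
        y_i     = randn;                             % placeholder for y_i(k)
        dw = beta * dw_prev + eta * delta_j * y_i;   % Delta w_ji(k)
        w  = w + dw;                                 % apply the update
        dw_prev = dw;
    end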
The observant reader might recognise the weight update as a simple gradient search with an added momentum. An obvious problem with a gradient search is of course the risk of getting stuck in a local minimum of the error function. This is, however, usually not a problem, as LeCun et al. [11] mention. Choromanska et al. [2] even prove that recovering the global minimum is harder when the network is bigger, and also irrelevant, as a minimum that is too good will probably cause a lot of overfitting.
2.3.2 Measuring the error

The error is measured with the softmax log loss,

L(X, c) = -\ln\left(\frac{e^{X(c)}}{\sum_q e^{X(q)}}\right),    (14)

where c is the correct classification label and X is the output vector from the second to last layer. Another common choice is to calculate the loss as L(X, c) = −ln(X(c)), called the log loss. Unlike the log loss, the softmax log loss normalises the predictions [19]. This means that the resulting vector contains the probabilities that the superpixel belongs to each class; in other words, these numbers sum to one.
A perfect network, where the output vector is one for the correct texture and zero for everyone else, gives L(X, c) = 0, as both the numerator and the denominator will be one. For truly random weights we get X(c) = 1/15 for 15 classes, and X(q) = 1/15 for all q. Inserting this into equation (14) yields L(X, c) = −ln(e^{1/15} / (15 e^{1/15})) = ln 15 ≈ 2.708. With 7 classes we get L(X, c) ≈ 1.946 with random weights.
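The random-weights value above is easy to verify numerically. The sketch below is an assumed example (not the MatConvNet code) evaluating the softmax log loss from equation (14):

    softmaxLogLoss = @(X, c) -log(exp(X(c)) / sum(exp(X)));
    X = ones(1, 15) / 15;       % 15 equal scores, as for truly random weights
    L = softmaxLogLoss(X, 1)    % ln(15), approximately 2.708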
The output vectors from the classification network for a superpixel and its neighbouring superpixels are collected into an "image" of size 1 × 15 × 10. We trick the CNN into believing that our 15 classes in 10 superpixels form an image with 15 pixels in a row and 10 colour channels. This "image" is first sent into a neural network with a convolution layer of size 1 × 1 × 10, in order to sum up the probabilities for each class by itself. Then follow two fully connected convolution layers that change the size from the input of 1 × 15 × 10 to an output of 1 × 1 × 15. This network is trained on the results produced by the best classification network. In Section 2.3 we decided that the initial weights of the CNN would be drawn from a normal distribution; with this non-conventional spatial network that was not possible, as we did not reach convergence. Instead, the network was initialised so that it would pass on the result vector of the superpixel itself without any alterations, and we then trained and optimised the weights from that starting point. We should therefore expect a smaller initial value of the objective function compared to the initial value for the classification network.
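A minimal sketch of how such an input can be assembled is shown below. It assumes, purely for illustration, that the ten superpixels are the superpixel itself plus nine of its neighbours and that their class probabilities are stored row-wise; this is not the exact construction used in the thesis.

    P = rand(10, 15);                        % ten result vectors, 15 classes each
    P = P ./ sum(P, 2);                      % normalise each row to probabilities
    spatialInput = reshape(P', [1, 15, 10]); % 1 x 15 x 10 "image": 15 pixels, 10 channels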
3 Implementation
Here we present the final details of the classification program, including the overall structure, what textures we considered and how we defined each of them. We also present the image material that the network was trained on.
All images in the data set are of size 360 × 480 pixels (or transposed), and by visual evaluation it was deemed reasonable to have approximately 260 superpixels in each image. Due to the nature of the SLIC algorithm it never produced exactly 260 superpixels, but close to it. If we choose the superpixels too small, creating the ground truth takes a very long time; superpixels that are too large will not segment the image enough, and we run the risk of having several textures in one superpixel.
In order to save RAM during training we use stochastic gradient descent (SGD) when back propagation is performed. This means that a small set of images is evaluated at a time and the weights are updated after each such set [11]. This causes a bit of jitter in the way the weights are optimised; however, SGD has been proven to work for reasonable batch sizes [8]. We will use batch sizes of 100 superpixels.
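A sketch of the batching itself is given below; it is an assumed example showing only how the superpixels are split into mini-batches of 100 for one epoch, with the forward and backward passes left out.

    numTrain  = 7967;                      % training superpixels (see the discussion)
    batchSize = 100;
    order = randperm(numTrain);            % shuffle the training set for this epoch
    for first = 1:batchSize:numTrain
        batch = order(first : min(first + batchSize - 1, numTrain));
        % forward pass, back propagation and weight update on this batch go here
    end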
In accordance with the suggestion Karpathy gives in [8], we construct the network as a flow of convolutional layers followed by ReLU units, sometimes also followed by a max pooling layer. The complete structure is shown in Table 1. The second column describes what type of layer we have, the third and fourth columns describe the general size of the filters, and the last column describes the total size of the activation map once that layer has been calculated.
Table 1: Complete structure of the CNN with all layers. The convolution layers 12 and 14 are known as the fully connected layers, as all neurons have connections to all the data from the previous layer. As explained in Section 2.3, this is why the dropout layers are placed as layers 11 and 13.
• Forest: Forests and stand-alone trees, as long as they are not bushes.
• Bushes: Bushes, not trees.
• Other vegetation: All other vegetation, including cactuses and water lily pads.
• Soil/Dirt: Soil and dirt, but also mud, earth and non-cultivated farmland.
• Wall: Background walls in images, when they are not considered objects (in a house, for example).
• Urban area: Everything man-made that is not considered an object, not including walls.
• Other Object: Cars, people, houses, animals etc.
Many of these textures require subjective analysis from the creator of the ground truth in order to assign the correct class, for instance whether something is an object or not, or whether we have a tree or a bush. These very probable clashes are chosen consciously in order to test the performance of the network: a human can differentiate a tree from a bush, but it is more interesting to know how the network performs. As we are aware of these potential problems, we also train a network to distinguish only a few textures, where similar textures are merged into one:
These new textures are chosen for two reasons: first, textures that will probably be misclassified as each other are merged, and secondly, textures that describe similar things are merged. These two reasons sometimes coincide, as with for example Forest and Bushes, where we can expect some problems. When considering Snow and Mountains we can expect few or no misclassifications, as a mountain is grey and uneven while snow is white and smooth, but they are usually found in similar places. Some textures, for example Water and Sky/Clouds, were kept apart even though there is a risk of misclassification, since the meaning of the classification would be lost with fewer classes. It is very reasonable to assume that this smaller network will make far fewer classification errors; however, the classification will of course be less detailed.
4 Results
The results are divided into three parts. First we present the overall performance of the network, with general error rates. Then we investigate the network's performance on entire images and visually compare the resulting segmentations to each other. Finally, we show in what manner the network makes its errors: at first we only count correct and incorrect classifications, but in the final subsection we investigate what types of errors the network makes, and also which errors it never makes.
Figure 7: Training and validation errors with (a) 15 texture classes and (b) 7 texture classes. Both the Top-1 error (left) and the Top-3 error (right) are shown.
Figure 8: Training and validation errors of the bias network for spatial dependence with (a) 15 texture classes and (b) 7 texture classes. Both the Top-1 error (left) and the Top-3 error (right) are shown.
The network for spatial bias had as its input the output from the classification network. A trivial network would get the same error rate as in Figure 7, so it is obvious that if the spatial network is to be of any use, it needs to drop the error rate below 30%. In Figure 8 we can see that the Top-1 and Top-3 errors are lower for both 15 and 7 classes compared to Figure 7. The Top-1 error has now dropped to approximately 10% instead of 30%, and a similar story holds for the Top-3 error.
Figure 9: Results from an image belonging to the training set, with all 15 classes. (a) The original image, a tropical beach. (b) Results from the classifier network, total error rate 36%. (c) Results with the static bias, total error rate 32%. (d) Results with the spatial network, total error rate 12%. Bright green is Grass, green is Forest, light gray is Sand, dark gray is Stone, blue is Water, light blue is Sky/Clouds and black is Urban area.
Another aspect to investigate is the comparison between Figure 9c, the static bias, and Figure 9d, the spatial network. Both reduce the error rate; however, the spatial network reaches almost a third of the error rate of the static bias.
The results in Section 4.1 showed similar error rates for 7 and 15 classes; however, when using whole images, the segmentation with 7 classes yields higher error rates than all the methods for 15 classes, as can be seen in Figure 10. Precisely as in Figure 9, all methods can distinguish the vegetation in the background, but a large area is classified as mountains (dark green) where it should be sand and water. The resulting error rates are higher for every method when using the smaller network with 7 textures compared to 15 textures.
The networks were then tested on images not belonging to, or similar to, any training data, to see how well they performed on unknown data. As the images are unknown, no ground truth exists and therefore we have no error rates. Even though the error rates are missing, we can visually compare the methods in Figure 11. It is obvious that the landing pool of the waterfall has not been classified correctly: all methods give either sand or stone, and no water. All methods are more or less capable of classifying the vegetation in one way or another, while the waterfall itself is not consistently classified as water. The same problems exist for the sky.
When using 7 classes, in Figure 12, we do not get any errors on the type of vegetation, as there is only one class; however, the water in the landing pool is still considered to be mountains (including stone) or earth (including sand).
Figure 10: Results from an image belonging to the training set, with 7 classes. (a) The original image, a tropical beach. (b) Results from the classifier network, total error rate 59%. (c) Results with the static bias, total error rate 61%. (d) Results with the spatial network, total error rate 42%. Light green is Vegetation, dark green is Mountains, grey is Earth, blue is Water and light blue is Sky/Clouds.
Figure 11: Results from an image unknown to the network, with 15 classes. (a) The original image, a waterfall in a jungle. (b) Results from the classifier network. (c) Results with the static bias. (d) Results with the spatial network. Bright green is Grass, pale green is Forest, green is Other vegetation, light gray is Sand, dark gray is Stone, blue is Water, light blue is Sky/Clouds and black is Urban area.
Table 2: Proportion of classifications for 15 classes. Given that a pixel belongs to class i, it is classified as class j with the probability found in row i and column j, i.e. the probability of a correct classification is found on the diagonal.
Texture Grass Water Forest Bushes Soil Sand Sky Mount. Snow Stone Crowd Wall Objects Oth. veg. Urban
Grass 0.7679 0 0.1071 0 0 0.0179 0 0 0 0.0714 0 0 0.0179 0 0.0179
Water 0 0.7213 0.0984 0 0 0.0164 0.0164 0.1184 0 0 0 0.0164 0 0 0.0164
Forest 0 0.0230 0.6897 0 0 0 0.0230 0.0115 0.0115 0.0690 0.0345 0 0.0115 0.0575 0.0690
Bushes 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
Soil 0 0 0.0476 0 0.4762 0.0476 0 0.0476 0 0.1905 0 0 0 0 0.1905
Sand 0 0 0 0 0 0.7143 0 0 0 0.1964 0.0179 0 0.0179 0 0.0536
Sky 0 0.0522 0 0 0 0 0.8731 0.0224 0.0224 0 0 0 0.0224 0 0.0075
Mount. 0.0625 0.0417 0.0417 0 0 0.0417 0.0625 0.5208 0 0.1250 0 0.0208 0 0 0.0833
Snow 0 0.0323 0 0 0 0 0 0 0.9677 0 0 0 0 0 0
Stone 0.0299 0.0149 0.0448 0 0.0448 0.0746 0 0.0597 0 0.5821 0.0299 0 0.0149 0 0.1045
Crowd 0 0 0.0952 0 0 0 0 0 0 0 0.8571 0 0 0 0.0476
Wall 0 0.0500 0 0 0 0.0250 0.0500 0.0500 0 0.0250 0 0.7250 0.0500 0 0.0250
Objects 0.0519 0.0130 0.0909 0 0 0.0260 0.0130 0 0.0390 0.0130 0 0.0649 0.5064 0 0.1818
Oth. veg. 0 0 0.1250 0 0 0 0 0 0 0 0.0208 0 0.0417 0.7917 0.0208
Urban 0.1136 0.0227 0.0909 0 0.0227 0 0 0.0909 0 0.1136 0.0227 0 0.0455 0 0.4773
Figure 12: Results from an image unknown to the network, with 7 classes. (a) The original image, a waterfall in a jungle. (b) Results from the classifier network. (c) Results with the static bias. (d) Results with the spatial network. Light green is Vegetation, dark green is Mountains, grey is Earth, blue is Water and light blue is Sky/Clouds.
Table 3: Proportion of classifications for 7 classes. Given that a pixel belongs to class i, it is classified as class j with the probability found in row i and column j, i.e. the probability of a correct classification is found on the diagonal.
Texture Vegetation Water Earth Sky Mountain Urban area Other objects
Vegetation 0.8063 0 0 0.0052 0.0995 0.0105 0.0785
Water 0.1667 0.5303 0.0152 0.0758 0.1818 0 0.0303
Earth 0.0108 0.0215 0.5376 0 0.3548 0.0645 0.0108
Sky 0.0238 0.0397 0 0.7460 0.1508 0.0238 0.0159
Mountain 0.0255 0.0127 0.0191 0.0382 0.8471 0.0191 0.0382
Urban area 0.0588 0.0294 0.0294 0.0294 0.2647 0.4853 0.1029
Other objects 0.2282 0 0.0217 0.0109 0.1957 0.1087 0.4348
to never class anything as a bush and everything as trees, and still keep the error rate low. This is a very common problem when training neural networks if the number of superpixels in each class is not balanced. Not everything regarding the network is bad, however; a good behaviour can also be seen in Table 2. There are quite a lot of zeros in the table, which means that there are many types of pixels that are never classified as some other type of texture. For example, if we look at the column with Soil, there are only non-zero elements in the rows with Soil, Stone and Urban. This means that if a pixel is classified as Soil, it is either a correct classification, Stone or Urban, nothing else. We see the same behaviour in the columns with Grass, Sky and Snow, to mention a few. Furthermore, if we compare Grass with Crowd, we notice that no Grass pixels are ever classified as Crowd or vice versa; they apparently have nothing in common. This happens with multiple other texture pairs. Comparing this behaviour, with its many zeros, to the network with 7 classes in Table 3, there are hardly any zeros at all and almost no pairs of textures that are never classified as each other.
In Section 4.1 we compared the results of the classification network and the network for spatial bias, for 15 and 7 classes, in Figures 7 and 8. Worth noting is that the Top-1 and Top-3 errors behave roughly the same, with roughly the same values, when comparing 15 and 7 classes, both for the classification network and for the spatial bias. This leads us to believe that the networks for 7 and 15 classes will have the same error rates when tested on images; however, Figures 9 and 10 tell another story. Here we see a much higher error rate for 7 classes regardless of method, a very surprising result considering the error rates in Figures 7 and 8. When we instead consider what was discussed in the previous paragraph, this result should not come as a surprise: there are many more zeros in Table 2 than in Table 3, and many of the rates for correct classification are surprisingly low. The interesting question that still remains unanswered is why 7 classes behaves so much worse on the images compared to 15 classes, despite similar error rates in Figures 7 and 8.
A very pleasing result is that the network for spatial biases reduces error rates more than the static bias. The spatial network can handle more different cases than the static bias, which shows its flexibility. However, an alarming behaviour can be seen in Figure 11d, where the errors seem to increase compared to the static bias, or at least are not reduced. It would seem as if the spatial network is overfitted to the training data. It does not look especially overfitted in Figure 8; however, other problems have possibly occurred. We will get back to this thought later in this section. Another observation we can make about the spatial network is that it does not smooth out the image as much as the static bias does. We noted in Section 2.4 that smoothness is a desired trait of the segmented images; however, too much smoothing will cause us to smooth over smaller objects or narrow texture areas. A drawback of this lack of smoothing from the spatial network seems to be that some pixels change texture classification in an odd and unpredictable way.
Earlier, we stated that the superpixels conform well to borders, just as the authors of the article explain [1], and that this is a necessity for us not to have to deal with multiple textures in each superpixel. This has not always held, and what we described as a necessity has not always been fulfilled. For example, fuzzy lines at the horizon, or people wearing clothing similar to their surroundings, caused problems when the superpixels were created: some superpixels contained both sky and water, and others contained both grass and trees. This caused problems when the ground truth was created, since its creator needed to subjectively choose which texture the superpixel depicted most. An idea that was never implemented was to simply throw away these superpixels from the training set. This could have been successful, but it would still require a subjective choice of whether to throw away or not. The second reason for not implementing it was the lack of superpixels; we needed every one we could get our hands on. A more reasonable solution to investigate in the future could be to reduce the size of the superpixels, or to change the tunable constant in the SLIC algorithm to promote colour proximity rather than spatial proximity. We would get more jagged superpixels, but they would hopefully only contain one texture. When considering changes to the superpixels, we could also choose not to enlarge them as much as we do now. This would have no effect on the superpixel quality, but by halving the size to 25 × 25 or even smaller we could save a few hundred thousand weights, which would lead to quicker training and possibly less overfitting.
When designing new algorithms it is common to blame the data for many problems, such as lack of data or contradicting data, but in this case the data really is a problem. In this project we trained a network using 7967 superpixels from 31 images, a tiny amount of data. The benchmark tests mentioned in the Introduction, and other tests not included, have training sets with 50,000, 100,000 or even millions of data points. Some of these are of course tests with a lot more classes, but this still puts the size of our training set into perspective. The root of the problem is the low number of images used. As an example, all the superpixels of Sand come from just two or three of these images, and it is very naive to think that all types and appearances of sand can be contained in just a few images; we should probably have had at least 31 images just for the sand. This problem exists for all textures. We can clearly see the issues in Figures 11 and 12, where the water is completely wrong: the network thinks it is either sand or mountains. This is an example of what happens when the data for water is not diverse enough. This lack of data can explain part of the problem of having low validation errors in Figure 7 but still high errors on other images, such as Figure 11. Our network is fairly stable when we only consider the type of data we have given it, but since the validation data shown in Figures 7 and 8 is of the same kind as the training data, we are tricked by Figures 7 and 8 even though the network is overfitted.
5.2 Conclusions
In the introduction we said that an important question to investigate in this thesis was whether our CNN could differentiate similar-looking textures. The short answer is yes and no. The long answer is that our CNN could distinguish sky and water in many cases and only had serious trouble with breaking waves, while trees and bushes were two textures that were not differentiated well. The second part of the question was what types of errors the network makes. Some errors were expected, such as trees and bushes, while others were more unexpected. The issues with breaking waves and snow were unexpected, but on second thought fairly reasonable.
To conclude and summarise this thesis, we should remember three things for the future. First of all, the classification network with the added spatial network did classify many of the textures and did achieve a fairly low error rate in the end. Secondly, we have to mention the superpixels: we should have used smaller superpixels, with more emphasis on colour proximity than we had here. Smaller superpixels would also mean that the creation of the ground truth demands fewer subjective choices from its creator. Thirdly, and to finalise this report, we mention the low error rates we did reach and see them as a good sign to continue investigating CNNs for future classification programs.
References
[1] Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal Fua, and Sabine Süsstrunk. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11):2274–2281, November 2012.
[2] Anna Choromanska, Mikael Henaff, Michaël Mathieu, Gérard Ben Arous, and Yann LeCun. The
loss surface of multilayer networks. CoRR, abs/1412.0233, 2014.
[3] Hao Fang, Saurabh Gupta, Forrest N. Iandola, Rupesh Kumar Srivastava, Li Deng, Piotr Dollár,
Jianfeng Gao, Xiaodong He, Margaret Mitchell, John C. Platt, C. Lawrence Zitnick, and Geoffrey
Zweig. From captions to visual concepts and back. CoRR, abs/1411.4952, 2014.
[4] David B Fogel, Derong Liu, and James M Keller. Multilayer Neural Networks and Backpropagation,
pages 35–60. John Wiley & Sons, Inc., 2016.
[5] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Aistats,
volume 15, page 275, 2011.
[6] Michael Grubinger, Paul Clough, Henning Müller, and Thomas Deselaers. The IAPR TC-12 benchmark: A new evaluation resource for visual information systems. International Conference on Language Resources and Evaluation, May 2006.
[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing
human-level performance on imagenet classification. In The IEEE International Conference on
Computer Vision (ICCV), December 2015.
[8] Andrej Karpathy. CS231n convolutional neural networks for visual recognition - lecture notes. http://cs231n.github.io, 2017.
[9] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convo-
lutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors,
Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc.,
2012.
[10] Yann LeCun. Generalization and network design strategies. Technical Report CRG-TR-89-4, 1989.
[11] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. In Nature, volume 521, pages
436–444, May 2015.
[12] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James
Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO:
common objects in context. CoRR, abs/1405.0312, 2014.
[13] Keiron O’Shea and Ryan Nash. An introduction to convolutional neural networks. CoRR,
abs/1511.08458, 2015.
[14] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of Pooling Operations in Con-
volutional Architectures for Object Recognition, pages 92–101. Springer Berlin Heidelberg, Berlin,
Heidelberg, 2010.
[15] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, August 2000.
[16] Stanford Vision Lab, Stanford University and Princeton University. ImageNet. http://image-net.org/index, 2016.
[17] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov,
Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions.
CoRR, abs/1409.4842, 2014.
[18] F. H. C. Tivive and A. Bouzerdoum. Texture classification using convolutional neural networks. In
TENCON 2006 - 2006 IEEE Region 10 Conference, pages 1–4, Nov 2006.
[19] A. Vedaldi and K. Lenc. MatConvNet – convolutional neural networks for MATLAB. In Proceedings of the ACM Int. Conf. on Multimedia, 2015.
[20] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image
caption generator. CoRR, abs/1411.4555, 2014.
[21] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1:541–551, 1989.
[22] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. CoRR,
abs/1511.07122, 2015.