Semantic Segmentation of Images
Convolutional Networks
Purdue University
Purdue University 1
Preamble
Purdue University 2
Preamble (contd.)
Since the overarching goal is to segment out of an image the objects that belong
to different categories, the classification that is needed for semantic
segmentation is at the pixel level.
For semantic segmentation, a network must still become aware of the abstractions
at a level higher than that of a pixel — otherwise how else would the network be
able to aggregate the pixels together for any object — but, at the same time, the
network must be able to map those abstractions back to the pixel level.
Purdue University 3
Preamble (contd.)
This calls for using some sort of an encoder-decoder architecture for a neural
network. The encoder’s job would be to create high-level abstractions in an image
and the decoder’s job to map those back to the image.
https://arxiv.org/abs/1505.04597
The two key ideas that the Unet network is based on are: (1) the concept of a
“Transpose Convolution”; and (2) the notion of skip connections (which should be
very familiar to you by this time). The transpose convolutions are used to
up-sample the abstractions until we get back to the pixel-level resolution.
And the skip connections are used as pathways between the encoder part and the
decoder part in order to harmonize them.
Purdue University 4
Preamble (contd.)
The main goal of this lecture is to present my implementation of the Unet
under the name mUnet, with the letter ’m’ standing for “multi” on account of
mUnet’s ability to segment out multiple objects of different types simultaneously
from a given input image. The original Unet outputs a single mask for the
segmentation. By contrast, mUnet produces multiple masks, one for each type of
object.
Additionally, mUnet uses skip connections not only between the encoder part and
the decoder part, as in the original Unet, but also within the encoder and within
the decoder.
You will find the code for mUnet in Version 1.0.9 of DLStudio that you can
download from:
https://pypi.org/project/DLStudio/
Regarding the organization of this lecture, since basic to understanding the notion
of “transpose convolution” is the representation of a 2D convolution as a
matrix-vector product, the section that follows is devoted to that.
Purdue University 6
Outline
1 A Brief Survey of Networks for Semantic Segmentation 8
2 Transpose Convolutions 18
3 Understanding the Relationship Between Kernel Size, Padding,
and Output Size for Transpose Convolutions 34
4 Using the stride Parameter of nn.ConvTranspose2d for
Upsampling 43
5 Including Padding in the nn.ConvTranspose2d Constructor 47
6 The Building Blocks for mUnet 51
7 The mUnet Network for Semantic Segmentation 56
8 The PurdueShapes5MultiObject Dataset for Semantic
Segmentation 61
9 Training the mUnet 70
10 Testing mUnet on Unseen Data 72
Purdue University 7
A Brief Survey of Networks for Semantic Segmentation
Outline
Purdue University 9
A Brief Survey of Networks for Semantic Segmentation
Generation 2 Networks
The second generation networks are based on the encoder-decoder
principle, with the job of the encoder being to create a low-resolution
representation that captures the semantic components in the input image and the
job of the decoder being to map those back to the pixel level. The
accuracy of the detail in the back-to-pixel mapping in the decoder is
aided by the skip links from the corresponding levels in the encoder.
Here is the original paper on this encoder-decoder approach:
https://arxiv.org/abs/1505.04597
Purdue University 10
A Brief Survey of Networks for Semantic Segmentation
Generation 3 Networks
The third generation networks for semantic segmentation are based
on the idea of atrous convolution. [See slide 14 for what is meant by
atrous convolution.] The goal here is to detect semantically
meaningful blobs in an input at different scales without increasing the
number of learnable parameters.
Here is the paper that got everybody in the DL community excited
about the possibilities of atrous convolutions:
https://arxiv.org/abs/1606.00915
Purdue University 11
A Brief Survey of Networks for Semantic Segmentation
Generation 4 Networks
These use atrous convolutions in an encoder-decoder framework.
Here is the paper that showed how atrous-convolution based learning
could be carried out in an encoder-decoder framework:
https://arxiv.org/abs/1706.05587
Purdue University 12
A Brief Survey of Networks for Semantic Segmentation
Purdue University 13
A Brief Survey of Networks for Semantic Segmentation
Purdue University 14
A Brief Survey of Networks for Semantic Segmentation
Since no padding is being used, the output is smaller than the input,
the size of the output being 14 × 14.
As you can see, the feature extracted is the vertical edge in the
middle of the image.
[[0. 0. 0. 0. 0. 0. 9. 9. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 9. 9. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 9. 9. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 9. 9. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 9. 9. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 9. 9. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 9. 9. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 9. 9. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 9. 9. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 9. 9. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 9. 9. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 9. 9. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 9. 9. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 9. 9. 0. 0. 0. 0. 0. 0.]]
Purdue University 15
A Brief Survey of Networks for Semantic Segmentation
The dilated kernel leads to the following output. Since we are still not
using any padding, the enlarged kernel results in a smaller output. Its
size is now 11 × 11.
[[0. 0. 0. 9. 9. 9. 9. 0. 0. 0. 0.]
[0. 0. 0. 9. 9. 9. 9. 0. 0. 0. 0.]
[0. 0. 0. 9. 9. 9. 9. 0. 0. 0. 0.]
[0. 0. 0. 9. 9. 9. 9. 0. 0. 0. 0.]
[0. 0. 0. 9. 9. 9. 9. 0. 0. 0. 0.]
[0. 0. 0. 9. 9. 9. 9. 0. 0. 0. 0.]
[0. 0. 0. 9. 9. 9. 9. 0. 0. 0. 0.]
[0. 0. 0. 9. 9. 9. 9. 0. 0. 0. 0.]
[0. 0. 0. 9. 9. 9. 9. 0. 0. 0. 0.]
[0. 0. 0. 9. 9. 9. 9. 0. 0. 0. 0.]
[0. 0. 0. 9. 9. 9. 9. 0. 0. 0. 0.]]
Purdue University 16
A Brief Survey of Networks for Semantic Segmentation
My example on the last couple of slides does not do full justice to the
idea described above since the nature of grayscale variation in the
input pattern I used should be detectable at all scales (albeit with
different ’signatures’).
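For readers who want to experiment with this on their own, here is a minimal sketch
of an atrous (dilated) convolution in PyTorch. The image, the kernel, and the
dilation factor below are purely illustrative and are not the ones used in the
example above.

import torch
import torch.nn.functional as F

## a 16x16 grayscale image that is dark on the left half and bright on the right
image = torch.zeros(1, 1, 16, 16)
image[..., :, 8:] = 3.0

## a simple vertical-edge detector: a column of -1s next to a column of +1s
kernel = torch.tensor([[[[-1.,  1.],
                         [-1.,  1.]]]])

## ordinary convolution (dilation = 1) versus an atrous convolution (dilation = 3);
## the dilated kernel covers a larger receptive field with the same four weights
out_plain   = F.conv2d(image, kernel, dilation=1)
out_dilated = F.conv2d(image, kernel, dilation=3)

print(out_plain.shape)     ## torch.Size([1, 1, 15, 15])
print(out_dilated.shape)   ## torch.Size([1, 1, 13, 13])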
Purdue University 17
Transpose Convolutions
Outline
Purdue University 19
Transpose Convolutions
Purdue University 22
Transpose Convolutions
In the second approach shown on the previous slide, you lay down all
the rows of the kernel in the top row of the matrix and, for each subsequent
row of the matrix, you shift them by one position to the right. The caveat is
that the shift needs to be more than one position when, in the 2D visualization,
the kernel has reached the rightmost position in a horizontal sweep over the
input.
y = Cx
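To make the matrix-vector view concrete, here is a minimal sketch, with illustrative
sizes that are not taken from the slides, that builds the matrix C for a small
stride-1 convolution and checks that y = Cx reproduces the output of PyTorch’s own
convolution operator:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
H = W = 4                                  ## input is H x W
kH = kW = 3                                ## kernel is kH x kW
x = torch.randn(H, W)
k = torch.randn(kH, kW)

oH, oW = H - kH + 1, W - kW + 1            ## output size for stride = 1, no padding

## Build the matrix C so that C @ x.flatten() reproduces the 2D convolution
## (a cross-correlation, as in PyTorch). Each row of C holds the kernel
## entries placed at the input positions covered by one placement of the kernel.
C = torch.zeros(oH * oW, H * W)
for i in range(oH):
    for j in range(oW):
        row = i * oW + j
        for u in range(kH):
            for v in range(kW):
                C[row, (i + u) * W + (j + v)] = k[u, v]

y_matrix = (C @ x.flatten()).reshape(oH, oW)
y_direct = F.conv2d(x.view(1, 1, H, W), k.view(1, 1, kH, kW)).squeeze()

print(torch.allclose(y_matrix, y_direct, atol=1e-5))    ## True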
Purdue University 23
Transpose Convolutions
Figure: Forward data flow through one convo layer in a CNN. The depiction at left is for the case when the stride equals 1
and the one at right for the case when the stride equals 2.
Purdue University 24
Transpose Convolutions
Loss_backprop = Cᵀ Loss
Figure: Backpropagating the loss through one convo layer in a CNN. The depiction at left is for the case when the stride
equals 1 and the one at right for the case when the stride equals 2.
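This identity can also be checked numerically without ever forming C explicitly:
the gradient that autograd propagates back through a convolution layer is exactly
what a transpose convolution applied to the upstream gradient produces. A minimal
sketch for the stride = 1, no-padding case, with illustrative sizes:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 1, 6, 6, requires_grad=True)     ## input to the convo layer
k = torch.randn(1, 1, 3, 3)                         ## the kernel

y = F.conv2d(x, k)              ## forward pass: y = Cx  (output is 4x4)
g = torch.randn_like(y)         ## stand-in for the upstream gradient dLoss/dy
y.backward(g)                   ## autograd computes dLoss/dx, which equals Cᵀ g

## the same gradient obtained directly with a transpose convolution
g_back = F.conv_transpose2d(g, k)

print(torch.allclose(x.grad, g_back, atol=1e-5))    ## True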
Purdue University 26
Transpose Convolutions
Purdue University 27
Transpose Convolutions
In the matrix shown on the previous slide, as before, each row of the
matrix is a shifted version of the row above. While the shift is one
element between consecutive sweep positions of the kernel vis-a-vis the
input image, the shift becomes 2 positions when the kernel has to
move from one row of the input to the next.
For the stride = 1 case, the first three rows of the matrix C on Slide
27 correspond to the first row of the output, the next three rows to
the second row of the output, and the last three rows to the third
row of the output.
For the vertical jumps dictated by the striding action on the sweep
motions of the kernel, we must also zero out entire vertical partitions
of the matrix that was shown earlier for the stride = 1 case. This is
the reason why the 4th, the 5th, and the 6th rows are all zeroed
out in the matrix shown on the next slide for the case of stride = 2.
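The effect of those zeroed-out rows can be verified numerically: a stride-2
convolution produces exactly the stride-1 output sampled at every other kernel
placement. Here is a minimal sketch, assuming a 5 × 5 input and a 3 × 3 kernel,
which is consistent with the nine-row matrix described above:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 1, 5, 5)
k = torch.randn(1, 1, 3, 3)

y_stride1 = F.conv2d(x, k, stride=1)    ## 3x3 output; the matrix C has 9 rows
y_stride2 = F.conv2d(x, k, stride=2)    ## 2x2 output; the skipped kernel placements
                                        ## correspond to the zeroed-out rows of C

print(torch.allclose(y_stride2, y_stride1[..., ::2, ::2]))    ## True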
Purdue University 29
Transpose Convolutions
Purdue University 32
Transpose Convolutions
Purdue University 33
Understanding the Relationship Between Kernel Size, Padding,
and Output Size for Transpose Convolutions
Outline
https://pytorch.org/docs/stable/nn.html#convtranspose2d
Starting with Slide 37, I’ll illustrate the above-mentioned relationship
with an example that uses transpose convolution to dilate a 4-channel
1 × 1 noise image into a 2-channel 4 × 4 image by using a 4 × 4
kernel. The code for this example is on the next slide.
Purdue University 35
Understanding the Relationship Between Kernel Size, Padding,
and Output Size for Transpose Convolutions
Purdue University 36
Understanding the Relationship Between Kernel Size, Padding,
and Output Size for Transpose Convolutions
trans_conop = nn.ConvTranspose2d(4, 2, 4, 1, 0)   ## in_channels=4, out_channels=2, kernel_size=(4,4), stride=(1,1), padding=(0,0)
For the last three arguments, if the two values in a tuple are identical,
we can supply the argument as a scalar.
The constructor call shown above means that a (4, 4) kernel will be
used for the transpose convolution with our 4-channel 1 × 1 input
image.
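To see the effect of this constructor call end-to-end, here is a minimal stand-in
for the example whose code is on Slide 36; the input tensor below is merely an
illustrative 4-channel 1 × 1 noise image:

import torch
import torch.nn as nn

torch.manual_seed(0)
input_image = torch.randn(1, 4, 1, 1)             ## batch of 1, a 4-channel 1x1 noise image

trans_conop = nn.ConvTranspose2d(4, 2, 4, 1, 0)   ## (4,4) kernel, stride 1, no padding
output_image = trans_conop(input_image)

print(output_image.shape)                         ## torch.Size([1, 2, 4, 4])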
Purdue University 38
Understanding the Relationship Between Kernel Size, Padding,
and Output Size for Transpose Convolutions
Shown below are the values for N, which is the number of padding
zeros for “both” sides of the input image:
kernel_size = (4,4)
dilation = 1
_______________________________________
padding | N
---------------------------------------
0 | 3
1 | 2
2 | 1
---------------------------------------
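These values of N are consistent with the relationship N = dilation × (kernel size − 1) − padding.
As a sanity check, here is a small sketch with illustrative tensors that verifies,
for the first two rows of the table, that the transpose convolution equals an
ordinary convolution of the input padded with N zeros on every side, carried out
with the kernel flipped in both spatial directions:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 1, 1, 1)                ## the 1x1 input image
k = torch.randn(1, 1, 4, 4)                ## a 4x4 kernel, dilation = 1

for padding, N in [(0, 3), (1, 2)]:
    ## transpose convolution with the given padding ...
    y_tc = F.conv_transpose2d(x, k, padding=padding)
    ## ... equals an ordinary convolution of the input zero-padded by N on
    ## every side, carried out with the kernel flipped in both directions
    x_padded = F.pad(x, (N, N, N, N))
    y_conv = F.conv2d(x_padded, torch.flip(k, dims=[2, 3]))
    print(padding, N, tuple(y_tc.shape[-2:]), torch.allclose(y_tc, y_conv, atol=1e-5))

## padding = 2 (N = 1) would make the padded input 3x3, which is smaller than
## the 4x4 kernel -- that is the ERROR case shown on the next slide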
Purdue University 39
Understanding the Relationship Between Kernel Size, Padding,
and Output Size for Transpose Convolutions
-----------------------------------------------------------
padding = 2
N = 1
padded version of the 1x1 input_image:
0 0 0
0 1 0
0 0 0
padded input_image size: 3x3
output_image size: ERROR
Purdue University 40
Understanding the Relationship Between Kernel Size, Padding,
and Output Size for Transpose Convolutions
Based on what is shown on the previous slide, the reader should now
be able to make sense of the output shown in the commented out
portion after line (I) on Slide 36. That shows a 2-channel 4 × 4
output image obtained with the transpose convolution operator
invoked with the statement in line (H) on the same slide.
Purdue University 41
Understanding the Relationship Between Kernel Size, Padding,
and Output Size for Transpose Convolutions
Purdue University 42
Using the stride Parameter of nn.ConvTranspose2d for
Upsampling
Outline
The output height and width in this case are given by (H_in − 1) × stride + kernel_size,
where I have also assumed that you are using the default choices for
padding (= 0), for output padding (= 0), and for dilation (= 1).
Purdue University 44
Using the stride Parameter of nn.ConvTranspose2d for
Upsampling
In line (A), we set the batch size to 1. With its first argument set to
1, the call in line (B) generates a random-number vector of size 32
that is then shaped into a 2-channel 4 × 4 image. We want to feed
this image into a nn.ConvTranspose2d operator and check the
output image for its size.
import torch
import torch.nn as nn
torch.manual_seed(0)
batch_size = 1 ## (A)
input = torch.randn((batch_size, 32)).view(-1, 2, 4, 4) ## (B)
print("\n\n\nThe 32-dimensional noise vector shaped as an 2-channel 4x4 image:\n\n")
print(input) ## (C)
# tensor([[[[ 0.7298, -1.8453, -0.0250, 1.3694],
# [ 2.6570, 0.9851, -0.2596, 0.1183],
# [-1.1428, 0.0376, 2.6963, 1.2358],
# [ 0.5428, 0.5255, 0.1922, -0.7722]],
#
# [[ 1.6268, 0.1723, -1.6115, -0.4794],
# [-0.1434, -0.3173, 0.9671, -0.9911],
# [ 0.5436, 0.0788, 0.8629, -0.0195],
# [ 0.9910, -0.7777, 0.3140, 0.2133]]]])
#
print(input.shape) ## (1, 2, 4, 4) ## (D)
print(x) ## (G)
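Here is a self-contained sketch of the kind of upsampling this section is about;
the kernel size and the number of output channels below are illustrative choices.
With stride = 2 and the default padding, output padding, and dilation, the
2-channel 4 × 4 input becomes an 8 × 8 output:

import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size = 1
input = torch.randn((batch_size, 32)).view(-1, 2, 4, 4)

## output height/width = (H_in - 1) * stride + kernel_size = (4 - 1) * 2 + 2 = 8
trans_conop = nn.ConvTranspose2d(2, 1, kernel_size=2, stride=2)
x = trans_conop(input)

print(x.shape)          ## torch.Size([1, 1, 8, 8])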
Purdue University 46
Including Padding in the nn.ConvTranspose2d Constructor
Outline
import torch
import torch.nn as nn
torch.manual_seed(0)
batch_size = 1 ## (A)
input = torch.randn((batch_size, 32)).view(-1, 2, 4, 4) ## (B)
print("\n\n\nThe 32-dimensional noise vector shaped as an 2-channel 4x4 image:\n\n")
print(input) ## (C)
# tensor([[[[ 0.5769, -0.1692, 1.1887, -0.1575],
# [-0.0455, 0.6485, 0.5239, 0.2180],
# [ 1.9029, 0.3904, 0.0331, -1.0234],
# [ 0.7335, 1.1177, 2.1494, -0.9088]],
#
# [[-0.1434, -0.1947, 1.4903, -0.7005],
# [ 0.1806, 1.3615, 2.0372, 0.6430],
# [ 1.6953, 2.0655, 0.2578, -0.5650],
# [ 0.9278, 0.4826, -0.8298, 1.2678]]]])
x = trans_conop(input) ## (F)
print(x) ## (G)
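For the case where padding is also included in the constructor, a self-contained
sketch, again with illustrative parameter choices, looks as follows:

import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size = 1
input = torch.randn((batch_size, 32)).view(-1, 2, 4, 4)

## output height/width = (H_in - 1) * stride - 2 * padding + kernel_size
##                     = (4 - 1) * 2 - 2 + 3 = 7
trans_conop = nn.ConvTranspose2d(2, 1, kernel_size=3, stride=2, padding=1)
x = trans_conop(input)

print(x.shape)          ## torch.Size([1, 1, 7, 7])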
Purdue University 50
The Building Blocks for mUnet
Outline
The building blocks of mUnet are the two network classes SkipBlockDN
and SkipBlockUP, the former for the encoder in which the size of the
image becomes progressively smaller as the data abstraction level
becomes higher and the latter for the decoder whose job is to map
the abstractions produced by the encoder back to the pixel level.
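What follows is only a rough sketch of what such building blocks could look like;
the class names are taken from the text, but the layers inside are illustrative
choices and not the actual DLStudio implementation.

import torch
import torch.nn as nn

class SkipBlockDN(nn.Module):
    ## encoder-side block: two convolutions with an inner skip connection;
    ## an optional strided convolution halves the image size
    def __init__(self, in_ch, out_ch, downsample=False):
        super().__init__()
        self.downsample = downsample
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.bn2 = nn.BatchNorm2d(out_ch)
        if downsample:
            self.down = nn.Conv2d(out_ch, out_ch, 3, stride=2, padding=1)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = out + torch.relu(self.bn2(self.conv2(out)))     ## inner skip connection
        if self.downsample:
            out = self.down(out)
        return out

class SkipBlockUP(nn.Module):
    ## decoder-side block: mirror image of SkipBlockDN; an optional transpose
    ## convolution with stride 2 doubles the image size
    def __init__(self, in_ch, out_ch, upsample=False):
        super().__init__()
        self.upsample = upsample
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.bn2 = nn.BatchNorm2d(out_ch)
        if upsample:
            self.up = nn.ConvTranspose2d(out_ch, out_ch, 3, stride=2,
                                         padding=1, output_padding=1)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = out + torch.relu(self.bn2(self.conv2(out)))     ## inner skip connection
        if self.upsample:
            out = self.up(out)
        return out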
Purdue University 52
The Building Blocks for mUnet
On the other hand, the skip connections that you will see in mUnet are
shortcuts between the corresponding levels of abstractions in the
encoder and the decoder. Those skip connections are critical to the
operation of the encoder-decoder combo if the goal is to see
pixel-level semantic segmentation results.
Purdue University 53
The Building Blocks for mUnet
Outline
Purdue University 57
The mUnet Network for Semantic Segmentation
Purdue University 60
The PurdueShapes5MultiObject Dataset for Semantic
Segmentation
Outline
Figure: Left: A dataset image; Middle: Its multi-valued mask; Right: BBoxes for the objects
Figure: Left: A dataset image; Middle: Its multi-valued mask; Right: BBoxes for the objects
Purdue University 64
The PurdueShapes5MultiObject Dataset for Semantic
Segmentation
Figure: Left: A dataset image; Middle: Its multi-valued mask; Right: BBoxes for the objects
Figure: Left: A dataset image; Middle: Its multi-valued mask; Right: BBoxes for the objects
Purdue University 65
The PurdueShapes5MultiObject Dataset for Semantic
Segmentation
Figure: Left: A dataset image; Middle: Its multi-valued mask; Right: BBoxes for the objects
Figure: Left: A dataset image; Middle: Its multi-valued mask; Right: BBoxes for the objects
Purdue University 66
The PurdueShapes5MultiObject Dataset for Semantic
Segmentation
Figure: Left: A dataset image; Middle: Its multi-valued mask; Right: BBoxes for the objects
Figure: Left: A dataset image; Middle: Its multi-valued mask; Right: BBoxes for the objects
Purdue University 67
The PurdueShapes5MultiObject Dataset for Semantic
Segmentation
You will find the two smaller datasets, with just 20 images each,
useful for debugging your code. You would want your own training
and evaluation scripts to run without problems on these two
datasets before you let them loose on the larger datasets.
Purdue University 68
The PurdueShapes5MultiObject Dataset for Semantic
Segmentation
Purdue University 69
Training the mUnet
Outline
What’s interesting is that if you print out the MSE loss for several
iterations when the training has just started, you will see loss numbers
much smaller than those shown above. However, it is easy to
verify that those numbers represent states of the network in which it has
overfitted to the training data. A classic symptom of overfitting is low
training error but large error on unseen data.
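For reference, here is a minimal sketch of a training loop along these lines;
the model, the data loader, and the hyperparameters below are placeholders and
not the ones used in DLStudio:

import torch
import torch.nn as nn

def train(model, train_loader, epochs=6, lr=1e-4, device="cpu"):
    ## model is assumed to be an mUnet-style network whose output has one
    ## channel per object type; train_loader is assumed to yield
    ## (image, multi_channel_mask) pairs
    model = model.to(device)
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        running_loss = 0.0
        for i, (images, masks) in enumerate(train_loader):
            images, masks = images.to(device), masks.to(device)
            optimizer.zero_grad()
            outputs = model(images)
            loss = criterion(outputs, masks)        ## pixel-wise MSE against the masks
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
            if (i + 1) % 100 == 0:
                print("epoch %d  iter %d  loss %.4f" % (epoch + 1, i + 1, running_loss / 100))
                running_loss = 0.0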
Purdue University 71
Testing mUnet on Unseen Data
Outline
Shown in the next three slides are some typical results on this
unseen test dataset. You can see these results for yourself by
executing the script semantic_segmentation.py in the Examples
directory of the distribution.
The top row in each display is the combined result from all the five
output channels. The bounding boxes from the dataset are simply
overlaid into this composite output. The second row shows the input
to the network. The semantic segmentations are shown in the lower
five rows.
Purdue University 73
Testing mUnet on Unseen Data
Figure: (a) and (b): The lower five rows show the semantic segmentation. Row 3: rect; Row 4: triangle; Row 5: disk; Row 6: oval; Row 7: star
Purdue University 75
Testing mUnet on Unseen Data
Purdue University 76