Mod 5
A “beak” detector
The same pattern appears in different places in an image, so the detectors for it can be compressed: an “upper-left beak” detector and a “middle beak” detector do the same job. Instead of training a lot of such “small” detectors independently, we train one detector and let it “move around” the image.
Deep learning is about feature/representation learning.
Compared with traditional hand-crafted features, the representations learned by neural networks are much more powerful in computer vision tasks.
Because this feature-learning process is accomplished automatically by feeding the network large volumes of training data, the learned features also generalize better.
Deep learning models are formed by multiple layers. In the context of artificial neural networks, a multilayer perceptron (MLP) with more than 2 hidden layers is already a deep model.
As a rule of thumb, deeper models have the potential to perform better than shallow models.
The problem is that the deeper you go, the more data you need to avoid over-fitting.
Convolutional Neural Network
You can imagine how computationally intensive things would get once images reach large dimensions, say 8K (7680×4320). The role of the ConvNet is to reduce the images into a form that is easier to process, without losing the features that are critical for a good prediction. This matters when we want to design an architecture that is not only good at learning features but also scalable to massive datasets.
Beak detector: a filter (kernel)
The output of a convolutional layer is sometimes referred to as a feature map.
Every image can be represented as a matrix of pixel values. An image from a standard digital camera has three channels — red, green and blue. You can imagine those as three 2D matrices stacked on top of each other (one for each color), each with pixel values in the range 0 to 255.
Applying a convolution to an image means taking a filter of a certain dimension and sliding it over the image. At each position, the operation is an element-wise multiplication between the two matrices followed by a sum of the products. The resulting number forms a single element of the output matrix.
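As a minimal sketch of this computation (the patch and filter values are illustrative, matching the example below):

import numpy as np

patch = np.array([[1, 0, 0],
                  [0, 1, 0],
                  [0, 0, 1]])    # a 3x3 patch of the image
filt = np.array([[ 1, -1, -1],
                 [-1,  1, -1],
                 [-1, -1,  1]])  # a 3x3 filter

out_element = np.sum(patch * filt)  # element-wise multiply, then add
print(out_element)  # 3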
Convolution
The filter weights are the network parameters to be learned.

Filter 1:        Filter 2:
 1 -1 -1         -1  1 -1
-1  1 -1         -1  1 -1
-1 -1  1         -1  1 -1

6 x 6 image:
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0

Each filter detects a small pattern (3 x 3).
Convolution with Filter 1, stride = 1: place the 3 x 3 filter on the top-left corner of the 6 x 6 image and take the dot product with the underlying patch, which gives 3. Sliding the filter one pixel to the right gives -1.
Convolution with Filter 1, stride = 2: sliding the filter two pixels at a time, the first two outputs in the top row are 3 and -3.
Convolution with Filter 1, stride = 1, over the whole 6 x 6 image gives a 4 x 4 map:

 3 -1 -3 -1
-3  1  0 -3
-3 -3  0  1
 3 -2 -2 -1
Repeat this for each filter. Filter 2 (stride = 1) gives its own 4 x 4 feature map:

-1 -1 -1 -1
-1 -1 -2  1
-1 -1 -2  1
-1  0 -4  3

The two 4 x 4 feature maps together form a 2 x 4 x 4 output.
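A small sketch to reproduce these feature maps (note that the "convolution" used in CNNs is technically cross-correlation, hence correlate2d):

import numpy as np
from scipy.signal import correlate2d

image = np.array([[1,0,0,0,0,1],
                  [0,1,0,0,1,0],
                  [0,0,1,1,0,0],
                  [1,0,0,0,1,0],
                  [0,1,0,0,1,0],
                  [0,0,1,0,1,0]])
filter1 = np.array([[ 1,-1,-1], [-1, 1,-1], [-1,-1, 1]])
filter2 = np.array([[-1, 1,-1], [-1, 1,-1], [-1, 1,-1]])

map1 = correlate2d(image, filter1, mode='valid')  # the 4 x 4 map above
map2 = correlate2d(image, filter2, mode='valid')
feature_maps = np.stack([map1, map2])             # shape (2, 4, 4)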
Color image: RGB, 3 channels
Image size = 6 x 6 x 3, filter size = 3 x 3 x 3.
A color image is a stack of three 6 x 6 matrices, one per channel. Each filter (Filter 1, Filter 2, ...) is extended to a 3 x 3 x 3 cube with weights for every channel, so one filter still produces one feature map: the products are summed across all three channels.
Convolution v.s. Fully Connected
Convolution can be seen as a restricted fully-connected layer. In the fully-connected view, the 6 x 6 image is flattened into 36 inputs x1, x2, ..., x36, and every output neuron connects to all 36 of them. Convolution keeps the same inputs but removes most of the connections.
Consider the first output of Filter 1, the value 3. Numbering the flattened pixels 1-36 (x1 = 1, x2 = 0, x3 = 0, ..., x8 = 1, x9 = 0, ..., x15 = 1, x16 = 1, ...), this output connects to only 9 inputs — positions 1, 2, 3, 7, 8, 9, 13, 14, 15 — not to all 36. Fewer parameters!
The second output of Filter 1, the value -1, connects to the next 9 inputs (positions 2, 3, 4, 8, 9, 10, 14, 15, 16) and reuses exactly the same 9 weights: shared weights. Fewer parameters, and with weight sharing even fewer parameters.
Output size: for an M x M image and an N x N filter, the output is (M - N + 1) x (M - N + 1); for example, 32 - 5 + 1 = 28.
If images are composed of three channels (R - red, G - green, B - blue), the input is a volume: a stack of three matrices whose depth is the number of channels.
If we apply only one filter, the filter is a cube of 27 parameters (3 x 3 x 3) that slides on top of the cube of the input image, producing a single 2D feature map.
Some common convolution hyperparameters: d - depth, k - number of filters, P - padding, S - stride.
Filter: F x F x d
Output size: (N - F)/S + 1 = (6 - 3)/1 + 1 = 4
Each position then computes z = W*a + b, followed by an activation g(z).
One layer of Convolutional Neural Network
The final step that takes us to a convolutional neural layer is to add the bias (1 parameter per filter, in addition to the 27 filter weights) and apply a non-linear function.
One layer of Convolutional Neural Network
The result of convolving a 6×6×3 input with two 3×3×3 filters is a volume of dimension 4×4×2.
We've gone from a 6×6×3 dimensional a[0] through one layer of a neural network to a 4×4×2 dimensional a[1]. So 6×6×3 has gone to 4×4×2, and that's one layer of a convolutional net. In this example two filters were involved, which is why we end up with a 4×4×2 output. If we had 10 filters instead of 2, we would obtain a 4×4×10 output volume: we'd be stacking up 10 of these maps instead of 2 of them, and that's how a[1] would be obtained.
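A quick sketch of this shape arithmetic in Keras (channels-last convention, so shapes print as height x width x channels):

import numpy as np
from tensorflow.keras import layers

a0 = np.zeros((1, 6, 6, 3), dtype='float32')    # a batch of one 6x6x3 input
conv = layers.Conv2D(filters=2, kernel_size=3)  # two 3x3x3 filters, no padding
a1 = conv(a0)
print(a1.shape)  # (1, 4, 4, 2): a 4x4x2 output volume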
One layer of Convolutional Neural Network
We can use these ten filters to detect features — vertical edges, horizontal edges, maybe other features — anywhere, even in a very large image, with just a very small number of parameters. This is one property of convolutional neural nets that makes them less prone to overfitting: once we've learned ten feature detectors that work, we can apply them even to very large images, and the number of parameters remains fixed and relatively small — 280 in this example.
Number of parameters in one layer
An example of a ConvNet
If layer l is a convolutional layer, we denote the filter size by f[l]: the superscript [l] signifies an f×f filter in layer l. We use p[l] to denote the amount of padding; the padding can also be specified as a valid convolution, which means no padding, or a same convolution, which means we choose the padding so that the output has the same height and width as the input. Finally, we use s[l] to denote the stride.
Output size: (n + 2p − f)/s + 1
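As a sketch, the same formula written as a small helper (with floor division for the general case):

def conv_output_size(n, f, p=0, s=1):
    # floor((n + 2p - f) / s) + 1
    return (n + 2 * p - f) // s + 1

print(conv_output_size(6, 3))        # 4
print(conv_output_size(32, 5))       # 28
print(conv_output_size(37, 5, s=2))  # 17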
An example of a ConvNet
Let's now say we have another convolutional layer that uses 5×5 filters, so in our notation f[2] = 5 at the next layer of the network; let's say we use a stride of 2 (s[2] = 2), no padding (p[2] = 0) and 20 filters.
Then the output will be another volume, this time 17×17×20. Notice that because we're now using a stride of 2 (s[2] = 2), the dimension has shrunk much faster: 37×37 has gone down in size by slightly more than a factor of 2, to 17×17. Because we're using 20 filters, the number of channels is now 20, so the activation a[2] has that dimension.
One last convolutional layer: let's say we use a 5×5 filter again and, again, a stride of 2. We end up with 7×7, and with 40 filters and no padding we finally get 7×7×40.
An example of a ConvNet
Now our 39×39×3 input image has been processed and 7×7×40 features have been computed for it. Finally, we take the 7×7×40 (= 1960) volume and flatten, or unroll, it into 1960 units. This very long vector can be fed into a softmax or a logistic regression to make the final prediction.
As we go deeper into the neural network, we typically start off with larger images (39×39); the height and width stay the same for a while and then gradually trend down, going from 39 to 37 to 17 to 7, whereas the number of channels generally increases (from 3 to 10 to 20 to 40). We can see this general trend in a lot of other convolutional neural networks as well.
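A minimal Keras sketch of this example (conv settings follow the text; the first layer's f=3, s=1 is inferred from 39 → 37, and the sigmoid output head is an assumption):

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(39, 39, 3)),
    layers.Conv2D(10, 3, strides=1, padding='valid'),  # -> 37 x 37 x 10
    layers.Conv2D(20, 5, strides=2, padding='valid'),  # -> 17 x 17 x 20
    layers.Conv2D(40, 5, strides=2, padding='valid'),  # -> 7 x 7 x 40
    layers.Flatten(),                                  # -> 1960 units
    layers.Dense(1, activation='sigmoid'),             # assumed output head
])
model.summary()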
Pooling Layer and Fully Connected layer
In a typical ConvNet there are usually three types of layers: the convolutional layer (often denoted Conv), the pooling layer (Pool), and the fully connected layer (FC).
Pooling layers
Two common types of pooling are:
•Max pooling
•Average pooling
Max Pooling
Suppose we have a 4×4 input and we want to apply max pooling. We take the 4×4 input and break it into four 2×2 regions, as shown in the figure below. Then, in the 2×2 output, each element is the max of the corresponding shaded region.
Maxpooling with a 2×2 filter and a stride of 2
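A minimal NumPy sketch of 2×2 max pooling with stride 2 (the input values are illustrative):

import numpy as np

x = np.array([[1, 3, 2, 1],
              [2, 9, 1, 1],
              [1, 3, 2, 3],
              [5, 6, 1, 2]])

# Split into 2x2 blocks and take the max of each block.
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)  # [[9 2]
               #  [6 3]]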
The intuition behind max pooling
If we think of this 4×4 region as a set of features (the activations in some layer of the neural network), then a large number means that a particular feature was detected there. Say the upper left-hand quadrant contains the feature — maybe a vertical edge, or maybe the eye of an animal (a cat-eye detector, say) — while the upper right-hand quadrant does not. The max operation preserves a feature detected anywhere within a quadrant: if the feature is detected anywhere in the filter region, a high number is kept in the output of max pooling; if it is not detected, the max stays low, and the feature most likely doesn't exist in that quadrant.
We can say that there are two main reasons that people use max pooling:
1. It's been found in a lot of experiments to work well.
2. It has no parameters to learn; there's nothing for gradient descent to adjust. Once we've fixed f and s, it's just a fixed computation that gradient descent doesn't change.
Average pooling
Instead of taking the maximum within each filter region, average pooling takes the average. In this example the average of the numbers in purple is 3.75; then 1.25, then 4, and finally 2. This is average pooling with hyperparameters f=2 and s=2; we can choose other hyperparameters as well. These days max pooling is used much more often than average pooling.
Pooling layer downsamples the volume spatially, independently in each
depth slice of the input volume.
Why Pooling
Subsampling: pooling helps make the representation approximately invariant to small translations of the input.

Applying 2 x 2 max pooling (stride 2) to the two 4 x 4 feature maps from earlier:

Filter 1 map:       Filter 2 map:
 3 -1 -3 -1         -1 -1 -1 -1
-3  1  0 -3         -1 -1 -2  1
-3 -3  0  1         -1 -1 -2  1
 3 -2 -2 -1         -1  0 -4  3

Max pooling gives a new, smaller image — each filter is a channel:

3 0                 -1 1
3 1                  0 3

The original 6 x 6 image has become a 2 x 2 image per filter.
The activations of an example ConvNet architecture: features are extracted automatically using convolutions, then a DNN outputs the class.
Illustration of two convolutional layers: the first, with 4 filters of size 5×5×3, gets as input an RGB image of size 64×64×3 and produces a tensor of feature maps. A second convolutional layer, with 5 filters of size 3×3×4, gets as input the 64×64×4 tensor from the previous layer and produces a new 64×64×5 tensor of feature maps. The circle after each filter denotes an activation function, e.g. ReLU.
VGG16 model for classification and detection
The image is passed through a stack of convolutional (conv.) layers, where the
filters were used with a very small receptive field: 3×3 (which is the smallest
size to capture the notion of left/right, up/down, center).
VGG16 model for classification and detection
It’s common that, as we go deeper into the network, the sizes (nh, nw)
decrease, while the number of channels (nc) increases.
The whole CNN
Input image → Convolution → Max Pooling → Convolution → Max Pooling (this Conv/Pool block can repeat many times) → Flattened → Fully connected feedforward network → output (cat, dog, ...)
The whole CNN
Convolution followed by Max Pooling produces a new image, smaller than the original, whose number of channels equals the number of filters:

Filter 1 channel:   Filter 2 channel:
3 0                 -1 1
3 1                  0 3

This Conv/Pool block can repeat many times. Finally, the resulting volume is flattened into a single vector and fed into a fully connected feedforward network that produces the output (cat, dog, ...).
Revisiting CNN
(Figure: convolution by a 3 x 3 filter.)
The output from each filter is stacked together, forming the depth dimension of the convolved image. Suppose we have an input image of size 32*32*3 and we apply 10 filters of size 5*5*3 with valid padding. The output then has dimensions 28*28*10.
Pooling layer
Sometimes, when the images are too large, we need to reduce the number of trainable parameters, so pooling layers are periodically introduced between subsequent convolution layers. Pooling is done for the sole purpose of reducing the spatial size of the image. It is applied independently on each depth slice, so the depth of the image remains unchanged. The most common form of pooling layer is max pooling.
As you can see, the 4*4 convolved output has become 2*2 after the max
pooling operation.
Revisiting CNN
Output dimensions
Three hyperparameters control the size of the output volume.
1. The number of filters — the depth of the output volume equals the number of filters applied. Remember how we stacked the output from each filter to form an activation map; the depth of the activation map equals the number of filters.
2. Stride — with a stride of one we move across and down a single pixel at a time. With higher stride values we skip several pixels at a time and hence produce smaller output volumes.
3. Zero padding — this helps us preserve the size of the input image: with a single ring of zero padding, a single-stride 3x3 filter retains the size of the original image.
Suppose we have an input image of size 32*32*3 and we apply 10 filters of size 3*3*3 with single stride and no zero padding. The output depth equals the number of filters applied, i.e. 10. With W=32, F=3, P=0 and S=1, the spatial size is (W − F + 2P)/S + 1 = (32 − 3 + 0)/1 + 1 = 30, so the output volume is 30*30*10.
Revisiting CNN
Output layer
After multiple layers of convolution and padding, we need the output in the form of a class. The convolution and pooling layers only extract features and reduce the number of parameters from the original images. To generate the final output, however, we need a fully connected layer that produces an output equal to the number of classes we need. It is tough to reach that number with convolution layers alone: convolution layers generate 3D activation maps, while we just need to know whether or not an image belongs to a particular class. The output layer has a loss function, such as categorical cross-entropy, to compute the error in prediction. Once the forward pass is complete, backpropagation begins to update the weights and biases to reduce the error.
One Layer of a Convolutional Network
Once we get an output after convolving over the entire image with a filter, we add a bias term to those outputs and finally apply an activation function to generate activations. This is one layer of a convolutional network. Recall that the equation for one forward pass is:

z[1] = w[1] a[0] + b[1]
a[1] = g(z[1])

In our case, the input (6 X 6 X 3) is a[0] and the filters (3 X 3 X 3) are the weights w[1]. The activations from layer 1 act as the input for layer 2, and so on. Clearly, the number of parameters in a convolutional neural network is independent of the size of the image; it essentially depends on the filter size. Suppose we have 10 filters, each of shape 3 X 3 X 3. What will be the number of parameters in that layer? Let's try to solve this:
•Number of parameters for each filter = 3*3*3 = 27
•There will be a bias term for each filter, so total parameters per filter = 28
•As there are 10 filters, the total parameters for that layer = 28*10 = 280
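As a quick sketch, the same count as a small helper function (hypothetical, just restating the arithmetic above):

def conv_layer_params(f, in_channels, n_filters):
    # f*f*in_channels weights per filter, plus one bias per filter
    return (f * f * in_channels + 1) * n_filters

print(conv_layer_params(3, 3, 10))  # (27 + 1) * 10 = 280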
Example
More Edge Detection
The type of filter we choose determines which edges we detect. For example, the standard vertical and horizontal edge-detection filters are:

1 0 -1        1  1  1
1 0 -1        0  0  0
1 0 -1       -1 -1 -1
input → Convolution → Max Pooling → ...
There are 25 3 x 3 filters in the first convolution (e.g. the Filter 1 and Filter 2 patterns shown earlier). Max pooling then takes the max of each 2 x 2 block, e.g.

 3 -1
-3  1   →  3

input_shape = (28, 28, 1)
CNN in Keras: only the network structure and the input format (vector -> 3-D array) are modified.

Input: 1 x 28 x 28
Convolution (25 filters, 3 x 3) -> 25 x 26 x 26; parameters per filter: 9
Max Pooling -> 25 x 13 x 13
Convolution (50 filters, 3 x 3) -> 50 x 11 x 11; parameters per filter: 225 = 25 x 9
Max Pooling -> 50 x 5 x 5
CNN in Keras: only the network structure and the input format (vector -> 3-D array) are modified.

Input: 1 x 28 x 28
Convolution -> 25 x 26 x 26
Max Pooling -> 25 x 13 x 13
Convolution -> 50 x 11 x 11
Max Pooling -> 50 x 5 x 5
Flattened -> 1250
Fully connected feedforward network -> Output
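A minimal Keras sketch of this architecture (the hidden Dense size and the 10-class softmax head are assumptions; Keras prints shapes channels-last, e.g. 26 x 26 x 25):

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),
    layers.Conv2D(25, 3, activation='relu'),  # 25 3x3 filters -> 26 x 26 x 25
    layers.MaxPooling2D(2),                   # -> 13 x 13 x 25
    layers.Conv2D(50, 3, activation='relu'),  # 50 3x3 filters -> 11 x 11 x 50
    layers.MaxPooling2D(2),                   # -> 5 x 5 x 50
    layers.Flatten(),                         # -> 1250
    layers.Dense(100, activation='relu'),     # assumed hidden layer
    layers.Dense(10, activation='softmax'),   # assumed 10 classes
])
model.summary()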
A spatial separable convolution simply divides a kernel into two smaller kernels. The most common case is to divide a 3x3 kernel into a 3x1 and a 1x3 kernel, like so:
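As a minimal sketch (the Sobel-style kernel here is illustrative), a 3x3 kernel that factors into a 3x1 times a 1x3 kernel can be applied as two cheaper passes with the same result:

import numpy as np
from scipy.signal import correlate2d

col = np.array([[1], [2], [1]])        # 3x1 kernel
row = np.array([[-1, 0, 1]])           # 1x3 kernel
kernel = col @ row                     # the full 3x3 kernel (outer product)

img = np.random.rand(8, 8)
direct = correlate2d(img, kernel, mode='valid')
two_pass = correlate2d(correlate2d(img, col, mode='valid'), row, mode='valid')
print(np.allclose(direct, two_pass))   # True: 6 multiplies per pixel, not 9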
Normal Convolution
Depthwise Convolution
Each 5x5x1 kernel iterates over one channel of the image (note: one channel, not all channels), taking the scalar product of every 25-pixel group and giving an 8x8x1 image. Stacking these images together creates an 8x8x3 image.
Pointwise Convolution
The pointwise convolution then uses a 1x1 kernel that spans all channels: a 1x1x3 kernel applied at every position of the 8x8x3 image produces an 8x8x1 output, and stacking k such kernels yields an 8x8xk result.
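A minimal Keras sketch of the depthwise + pointwise pair (the 12x12x3 input is an assumption consistent with the 8x8 outputs above; the 16 pointwise filters are illustrative):

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(12, 12, 3)),
    layers.DepthwiseConv2D(5),  # one 5x5x1 kernel per channel -> 8 x 8 x 3
    layers.Conv2D(16, 1),       # pointwise: 16 1x1x3 kernels  -> 8 x 8 x 16
])
model.summary()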