Machine Learning-Lecture 17 (Student)
… the probability of any given output class.
⚫ How does a convolutional neural network build up this hierarchy?
◼ It combines two specialized types of hidden layers, called convolution
layers and pooling layers.
◼ Convolution layers search for instances of small patterns in the
image, whereas pooling layers downsample these to select a prominent
subset.
⚫ Convolution Layers (the following content is mostly from Deep Learning with
Python, by F. Chollet)
◼ The fundamental difference between a densely connected layer and a
convolution layer is this: Dense layers learn global patterns in their
input feature space, whereas convolution layers learn local patterns
(see figure 5.1):
◼ In the case of images, patterns found in small 2D windows of the inputs. In
the previous example, these windows were all 3 × 3.
◼ Convnets can learn spatial hierarchies of patterns: a first convolution
layer learns small local patterns such as edges, a second
convolution layer will learn larger patterns made of the features of the first
layers, and so on. This allows convnets to efficiently learn increasingly
complex and abstract visual concepts (because the visual world is
fundamentally spatially hierarchical).
https://towardsdatascience.com/types-of-convolution-kernels-simplified-f040cb307c37
Kernel vs Filter
Before we dive into it, I just want to make the distinction between the terms ‘kernel’
and ‘filter’ very clear because I have seen a lot of people use them interchangeably.
A kernel is, as described earlier, a matrix of weights which are multiplied with the
input to extract relevant features. The dimensionality of the kernel matrix is how the
convolution gets its name. For example, in 2D convolutions, the kernel matrix is a
2D matrix.
A filter, however, is a concatenation of multiple kernels, each kernel assigned to a
particular channel of the input. Filters are always one dimension more than the
kernels. For example, in 2D convolutions, filters are 3D matrices (which is
essentially a concatenation of 2D matrices, i.e. the kernels). So for a CNN layer
with kernel dimensions h*w and input channels k, the filter dimensions are
k*h*w.
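To make the kernel/filter distinction concrete, here is a minimal sketch (assuming
TensorFlow/Keras, the library used in Chollet's book) that inspects the weight tensor
of a Conv2D layer. Note that Keras stores the axes as h × w × k × (number of filters)
rather than the article's k*h*w ordering, but the idea is the same: each output filter
stacks one h × w kernel per input channel.

    import tensorflow as tf

    # One conv layer: 32 output filters, 3 x 3 kernels, over a 3-channel input.
    layer = tf.keras.layers.Conv2D(filters=32, kernel_size=(3, 3))
    layer.build(input_shape=(None, 28, 28, 3))  # e.g. 28 x 28 RGB images

    kernels, bias = layer.get_weights()
    print(kernels.shape)  # (3, 3, 3, 32): h, w, input channels k, filters
    print(bias.shape)     # (32,): one bias per output filter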
The following four pictures are from “人工智慧 Artificial Intelligence,” by 張志勇 et al.
◼ A convolution works by sliding these windows of size 3 × 3 or 5 × 5 over the
3D input feature map, stopping at every possible location, and extracting the 3D
patch of surrounding features.
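As a toy illustration of this sliding-window process, the following NumPy sketch
(single channel, stride 1, no padding; the helper name is my own) extracts each
3 × 3 patch and computes its weighted sum with a kernel:

    import numpy as np

    def conv2d_valid(image, kernel):
        """Slide `kernel` over `image`, stopping at every valid location."""
        kh, kw = kernel.shape
        out_h = image.shape[0] - kh + 1  # output shrinks: the border effect
        out_w = image.shape[1] - kw + 1
        out = np.zeros((out_h, out_w))
        for i in range(out_h):
            for j in range(out_w):
                patch = image[i:i + kh, j:j + kw]   # extract the window
                out[i, j] = np.sum(patch * kernel)  # weighted sum of features
        return out

    image = np.arange(25, dtype=float).reshape(5, 5)
    kernel = np.ones((3, 3)) / 9.0            # simple averaging kernel
    print(conv2d_valid(image, kernel).shape)  # (3, 3)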
◼ Note that the output width and height may differ from the input width and
height. They may differ for two reasons:
➢ Border effects, which can be countered by padding the input feature map
➢ The use of strides, which I’ll define in a second
◼ Understanding border effects
➢ Consider a 5 × 5 feature map (25 tiles total). There are only 9 tiles around
which you can center a 3 × 3 window, forming a 3 × 3 grid (see figure 5.5).
Hence, the output feature map will be 3 × 3.
➢ It shrinks a little: by exactly 2 tiles alongside each dimension, in this case.
➢ You can see this border effect in action in the earlier example: you start
with 28 × 28 inputs, which become 26 × 26 after the first convolution layer.
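The shrinkage follows the standard output-size arithmetic. A small helper (my own,
not from the text) makes the numbers above easy to check:

    def conv_output_size(n_in, kernel, padding=0, stride=1):
        # Standard formula: out = floor((n_in + 2*padding - kernel) / stride) + 1
        return (n_in + 2 * padding - kernel) // stride + 1

    print(conv_output_size(5, 3))              # 3: the 5 x 5 -> 3 x 3 example
    print(conv_output_size(28, 3))             # 26: the 28 x 28 -> 26 x 26 example
    print(conv_output_size(28, 3, padding=1))  # 28: padding cancels the border effect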
◼ Understanding convolution strides
➢ The other factor that can influence output size is the notion of strides.
➢ The distance between two successive windows is a parameter of the
convolution, called its stride, which defaults to 1.
➢ It’s possible to have strided convolutions: convolutions with a stride higher
than 1. In figure 5.7, you can see the patches extracted by a 3 × 3
convolution with stride 2 over a 5 × 5 input.
➢ Using stride 2 means the width and height of the feature map are
downsampled by a factor of 2 (in addition to any changes induced by border
effects).
➢ Strided convolutions are rarely used in practice.
➢ To downsample feature maps, instead of strides, we tend to use the
max-pooling operation.
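A quick sketch (again assuming TensorFlow/Keras; the output shapes are the point
here) contrasting the default stride-1 convolution with a stride-2 one over the same
5 × 5 input:

    import tensorflow as tf

    x = tf.random.normal((1, 5, 5, 1))  # one 5 x 5 single-channel input

    conv_s1 = tf.keras.layers.Conv2D(1, 3)             # stride 1 (the default)
    conv_s2 = tf.keras.layers.Conv2D(1, 3, strides=2)  # strided convolution

    print(conv_s1(x).shape)  # (1, 3, 3, 1): border effect only
    print(conv_s2(x).shape)  # (1, 2, 2, 1): width and height roughly halved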
◼ The max-pooling operation
➢ In the convnet example, you may have noticed that the size of the feature
maps is halved after every MaxPooling2D layer.
➢ For instance, before the first MaxPooling2D layer, the feature map is 26 ×
26, but the max-pooling operation halves it to 13 × 13.
➢ That’s the role of max pooling: to aggressively downsample feature maps,
much like strided convolutions.
➢ Max pooling consists of extracting windows from the input feature maps
and outputting the max value of each channel.
➢ A big difference from convolution is that max pooling is usually done with
2 × 2 windows and stride 2, in order to downsample the feature maps by
a factor of 2.
➢ On the other hand, convolution is typically done with 3 × 3 windows and no
stride (stride 1).
➢ Note that max pooling isn’t the only way you can achieve such
downsampling. As you already know, you can also use strides in the prior
convolution layer.
➢ But max pooling tends to work better than these alternative solutions.
➢ It’s more informative to look at the maximal presence of different features
than at their average presence.
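A toy NumPy sketch of 2 × 2 max pooling with stride 2 (single channel; the helper
name is my own) matching the description above:

    import numpy as np

    def max_pool2d(feature_map, size=2, stride=2):
        out_h = (feature_map.shape[0] - size) // stride + 1
        out_w = (feature_map.shape[1] - size) // stride + 1
        out = np.zeros((out_h, out_w))
        for i in range(out_h):
            for j in range(out_w):
                window = feature_map[i * stride:i * stride + size,
                                     j * stride:j * stride + size]
                out[i, j] = window.max()  # keep only the maximal response
        return out

    fm = np.random.rand(26, 26)
    print(max_pool2d(fm).shape)  # (13, 13): halved, as in the convnet example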