
Lecture 17: Deep Learning

10.3 Convolutional Neural Networks


⚫ Neural networks rebounded around 2010 with big successes in image
classification. Around that time, massive databases of labeled images were
being accumulated, with ever-increasing numbers of classes.
⚫ Figure 10.5 shows 75 images drawn from the CIFAR100 database. This
database consists of 60,000 images labeled according to 20 superclasses
(e.g. aquatic mammals), with five classes per superclass (beaver, dolphin,
otter, seal, whale). Each image has a resolution of 32 × 32 pixels, with
three eight-bit numbers per pixel representing red, green, and blue. The
numbers for each image are organized in a three-dimensional array called a
feature map. The first two feature map axes are spatial (both are
32-dimensional), and the third is the channel axis, representing the three
colors. There is a designated training set of 50,000 images, and a test set
of 10,000 images.
⚫ A special family of convolutional neural networks (CNNs) has evolved for
classifying images such as these, and has shown spectacular success on a
wide range of problems. CNNs mimic to some degree how humans classify
images, by recognizing specific features or patterns anywhere in the image
that distinguish each particular object class.
⚫ Figure 10.6 illustrates the idea behind a convolutional neural network on a
cartoon image of a tiger.

⚫ The network first identifies low-level features in the input image, such as
small edges, patches of color, and the like. These low-level features are then
combined to form higher-level features, such as parts of ears, eyes, and so on.
Eventually, the presence or absence of these higher-level features contributes to
the probability of any given output class.
⚫ How does a convolutional neural network build up this hierarchy?
◼ It combines two specialized types of hidden layers, called convolution
layers and pooling layers.
◼ Convolution layers search for instances of small patterns in the
image, whereas pooling layers downsample these to select a prominent
subset.
⚫ Convolution Layers (the following content is mostly from Deep Learning with
Python, by F. Chollet)
◼ The fundamental difference between a densely connected layer and a
convolution layer is this: Dense layers learn global patterns in their
input feature space, whereas convolution layers learn local patterns
(see figure 5.1): in the case of images, patterns found in small 2D
windows of the inputs. In the previous example, these windows were all
3 × 3.

◼ This key characteristic gives convnets two interesting properties:


1. The patterns they learn are translation invariant. After learning a certain
pattern in the lower-right corner of a picture, a convnet can recognize it
anywhere: for example, in the upper-left corner. A densely connected
network would have to learn the pattern anew if it appeared at a new
location. This makes convnets data efficient when processing images
(because the visual world is fundamentally translation invariant): they need
fewer training samples to learn representations that have generalization
power.
2. They can learn spatial hierarchies of patterns (see figure 5.2). A first
convolution layer will learn small local patterns such as edges, a second
convolution layer will learn larger patterns made of the features of the first
layers, and so on. This allows convnets to efficiently learn increasingly
complex and abstract visual concepts (because the visual world is
fundamentally spatially hierarchical).

◼ Convolutions operate over 3D tensors, called feature maps, with two
spatial axes (height and width) as well as a depth axis (also called the
channels axis). For an RGB image, the dimension of the depth axis is 3.
◼ In the MNIST example, the first convolution layer takes a feature map of
size (28, 28, 1) and outputs a feature map of size (26, 26, 32): it computes
32 filters over its input.

◼ Convolutions are defined by two key parameters:


➢ Size of the patches extracted from the inputs— These are typically 3 ×
3 or 5 × 5. In the example, they were 3 × 3, which is a common choice.
➢ Depth of the output feature map— The number of filters computed by
the convolution. The example started with a depth of 32 and ended
with a depth of 64.
◼ In Keras Conv2D layers, these parameters are the first arguments passed to
the layer: Conv2D(output_depth, (window_height, window_width)).
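For concreteness, a minimal Keras sketch of this MNIST example might look as
follows (assuming the tensorflow.keras API; the depths and window sizes follow
the example above):

# Minimal sketch of the MNIST convnet discussed above (tensorflow.keras assumed).
from tensorflow.keras import layers, models

model = models.Sequential()
# Conv2D(output_depth, (window_height, window_width)):
# 32 filters of size 3 x 3 map (28, 28, 1) to (26, 26, 32).
model.add(layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)))
model.add(layers.MaxPooling2D((2, 2)))                   # (26, 26, 32) -> (13, 13, 32)
model.add(layers.Conv2D(64, (3, 3), activation="relu"))  # ends with a depth of 64
model.summary()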

https://towardsdatascience.com/types-of-convolution-kernels-simplified-
f040cb307c37

Kernel vs Filter
Before we dive into it, I just want to make the distinction between the terms ‘kernel’
and ‘filter’ very clear because I have seen a lot of people use them interchangeably.
A kernel is, as described earlier, a matrix of weights that is multiplied with the
input to extract relevant features. The dimensionality of the kernel matrix is how
the convolution gets its name: in 2D convolutions, for example, the kernel is a
2D matrix.
A filter, however, is a concatenation of multiple kernels, with each kernel
assigned to a particular channel of the input. Filters always have one dimension
more than the kernels: in 2D convolutions, filters are 3D arrays (essentially
concatenations of 2D matrices, i.e. the kernels). So for a CNN layer with kernel
dimensions h*w and k input channels, the filter dimensions are k*h*w.
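To make the shape bookkeeping concrete, here is a small NumPy sketch (the
specific sizes are illustrative assumptions, not values from the text):

import numpy as np

h, w, k = 3, 3, 3                  # kernel height/width, k input channels
kernel = np.random.randn(h, w)     # one kernel: a 2D matrix of weights
filt = np.random.randn(k, h, w)    # one filter: k stacked kernels, shape k*h*w
# A convolution layer computing 32 output channels holds 32 such filters:
layer_weights = np.random.randn(32, k, h, w)
print(kernel.shape, filt.shape, layer_weights.shape)  # (3, 3) (3, 3, 3) (32, 3, 3, 3)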

The following four pictures are from Artificial Intelligence (人工智慧), by 張志勇 et al.

◼ A convolution works by sliding these windows of size 3 × 3 or 5 × 5 over the
3D input feature map, stopping at every possible location, and extracting the 3D
patch of surrounding features (of shape (window_height, window_width,
input_depth)), as spelled out in the sketch below.
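This sliding-window behaviour can be written out directly. The following
pure-NumPy sketch applies a single filter with stride 1 and no padding; it is a
slow, illustrative implementation, not how real frameworks compute convolutions:

import numpy as np

def conv2d_valid(feature_map, filt):
    # feature_map: (H, W, C); filt: (h, w, C); returns (H-h+1, W-w+1).
    H, W, C = feature_map.shape
    h, w, _ = filt.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):                     # stop at every possible location
        for j in range(out.shape[1]):
            patch = feature_map[i:i + h, j:j + w, :]  # 3D patch of surrounding features
            out[i, j] = np.sum(patch * filt)          # elementwise product, then sum
    return out

x = np.random.randn(5, 5, 3)
print(conv2d_valid(x, np.random.randn(3, 3, 3)).shape)  # (3, 3)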
◼ Note that the output width and height may differ from the input width and
height. They may differ for two reasons:
➢ Border effects, which can be countered by padding the input feature map
➢ The use of strides, which I'll define in a second
◼ Understanding border effects
➢ Consider a 5 × 5 feature map (25 tiles total). There are only 9 tiles around
which you can center a 3 × 3 window, forming a 3 × 3 grid (see figure 5.5).
Hence, the output feature map will be 3 × 3.
➢ It shrinks a little: by exactly two tiles alongside each dimension, in this case.
➢ You can see this border effect in action in the earlier example: you start
with 28 × 28 inputs, which become 26 × 26 after the first convolution layer.
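Both the border effect and the padding fix can be checked directly in Keras (a
quick sketch, assuming Conv2D's standard padding argument):

from tensorflow.keras import layers
import numpy as np

x = np.zeros((1, 28, 28, 1), dtype="float32")
print(layers.Conv2D(32, (3, 3), padding="valid")(x).shape)  # (1, 26, 26, 32): shrinks
print(layers.Conv2D(32, (3, 3), padding="same")(x).shape)   # (1, 28, 28, 32): padded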

◼ Understanding convolution strides
➢ The other factor that can influence output size is the notion of strides.
➢ The distance between two successive windows is a parameter of the
convolution, called its stride, which defaults to 1.
➢ It’s possible to have strided convolutions: convolutions with a stride higher
than 1. In figure 5.7, you can see the patches extracted by a 3 × 3
convolution with stride 2 over a 5 × 5 input (without padding).

➢ Using stride 2 means the width and height of the feature map are
downsampled by a factor of 2 (in addition to any changes induced by border
effects).
➢ Strided convolutions are rarely used in practice.
➢ To downsample feature maps, instead of strides, we tend to use the
max-pooling operation.
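Before moving on to max pooling, here is a quick check of the stride effect
(same tensorflow.keras assumption as the earlier sketches):

from tensorflow.keras import layers
import numpy as np

x = np.zeros((1, 5, 5, 1), dtype="float32")
# A 3 x 3 convolution with stride 2 over a 5 x 5 input yields a 2 x 2 output.
print(layers.Conv2D(8, (3, 3), strides=2, padding="valid")(x).shape)  # (1, 2, 2, 8)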
◼ The max-pooling operation
➢ In the convnet example, you may have noticed that the size of the feature
maps is halved after every MaxPooling2D layer.
➢ For instance, before the first MaxPooling2D layer, the feature map is 26 ×
26, but the max-pooling operation halves it to 13 × 13.
➢ That’s the role of max pooling: to aggressively downsample feature maps,
much like strided convolutions.
➢ Max pooling consists of extracting windows from the input feature maps
and outputting the max value of each channel.
➢ A big difference from convolution is that max pooling is usually done with
2 × 2 windows and stride 2, in order to downsample the feature maps by
a factor of 2.
➢ On the other hand, convolution is typically done with 3 × 3 windows and no
stride (stride 1).
➢ Note that max pooling isn’t the only way you can achieve such
downsampling. As you already know, you can also use strides in the prior
convolution layer.
➢ But max pooling tends to work better than these alternative solutions.
➢ It’s more informative to look at the maximal presence of different features
than at their average presence.
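A final sketch showing the halving described above (again assuming
tensorflow.keras):

from tensorflow.keras import layers
import numpy as np

x = np.zeros((1, 26, 26, 32), dtype="float32")
# 2 x 2 windows with stride 2: each channel keeps the max value of each window.
print(layers.MaxPooling2D(pool_size=(2, 2))(x).shape)  # (1, 13, 13, 32)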
