A convolutional neural network
This filter is also called a kernel, or feature detector, and its dimensions can
be, for example, 3x3. To perform convolution, the kernel slides over the input
image; at each position, it multiplies its weights element-wise with the pixels it
covers and sums the products. The result for each receptive field (the area
where the convolution takes place) is written into the feature map.
We continue sliding the filter until the feature map is complete.
Padding. Padding expands the input matrix by adding fake pixels (usually
zeros) to its borders. This is done because convolution reduces the size of
the matrix: for example, a 5x5 matrix turns into a 3x3 matrix when a 3x3 filter
goes over it.
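A quick illustration with NumPy's `np.pad` (a sketch; the sizes mirror the example above):

```python
import numpy as np

# Zero-padding by 1 pixel on every border: a 5x5 input becomes 7x7.
image = np.ones((5, 5))
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)
print(padded.shape)   # (7, 7)
# A 3x3 'valid' convolution over the padded input gives 7 - 3 + 1 = 5
# per side, so the output matches the original 5x5 input size.
```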
Striding. It often happens that when working with a convolutional layer, you
need an output that is smaller than the input. One way to achieve this is to
use a pooling layer; another is striding. The idea behind stride is to move the
kernel more than one pixel at a time as it slides over the input: for example, 2
or 3 pixels per step. This reduces spatial resolution and makes the network
more computationally efficient.
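How kernel size, padding, and stride interact can be captured in the standard output-size formula, sketched here (`conv_output_size` is an illustrative helper, not a library function):

```python
# Output side length of a convolution over a square input:
# out = (n + 2*p - k) // s + 1
# n: input size, k: kernel size, p: padding, s: stride
def conv_output_size(n, k, p=0, s=1):
    return (n + 2 * p - k) // s + 1

print(conv_output_size(5, 3))             # 3: 5x5 input, 3x3 kernel, no padding
print(conv_output_size(5, 3, p=1))        # 5: padding preserves the input size
print(conv_output_size(32, 3, p=1, s=2))  # 16: stride 2 halves the resolution
```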
A convolutional layer contains multiple filters, and each of them generates its
own feature map. The output of the layer is therefore a set of feature maps,
stacked on top of each other.
For example, padding a 30x30x3 matrix and passing it through 10 filters will
result in a set of 10 30x30x1 matrices. After we stack these maps on top of
each other, we get a 30x30x10 matrix.
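The shape bookkeeping from this example can be sketched in NumPy (the maps here are dummy zero arrays, just to show the stacking):

```python
import numpy as np

# 10 filters over a padded 30x30x3 input: each filter spans all 3 input
# channels and produces one 30x30 feature map.
height, width, num_filters = 30, 30, 10
maps = [np.zeros((height, width)) for _ in range(num_filters)]

# Stacking the maps along a new channel axis yields the 30x30x10 output.
output = np.stack(maps, axis=-1)
print(output.shape)   # (30, 30, 10)
```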
The process can be repeated: CNNs usually have more than one
convolutional layer.
The 3 layers of a CNN
The goal of a CNN is to reduce the images into a form that is easier to
process, without losing the features that are valuable for accurate prediction.
Convolutional layer
We’ve already described above how convolutional layers work. They are at the
core of CNNs, enabling them to autonomously recognize features in the
images.
But going through the convolution process generates a large amount of data,
which makes it hard to train the neural network. To compress the data, we
need to go through pooling.
Pooling layer
A pooling layer receives the result from a convolutional layer and compresses
it. The filter of a pooling layer is always smaller than a feature map. Usually, it
takes a 2x2 square (patch) and compresses it into one value.
A 2x2 filter reduces each feature map to a quarter of its original number of
pixels: a feature map sized 10×10 becomes 5×5.
Several different functions can be used for pooling. These are the most
common:
Maximum pooling. It takes the maximum value of each patch of the feature
map.
Average pooling. It takes the average value of each patch of the feature
map.
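Max pooling over non-overlapping 2x2 patches can be sketched in NumPy as follows (`max_pool_2x2` is a hypothetical helper that assumes even map dimensions):

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Split the map into non-overlapping 2x2 patches and keep the
    maximum of each patch (assumes even height and width)."""
    h, w = feature_map.shape
    return feature_map.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fm = np.array([[1., 3., 2., 1.],
               [4., 6., 5., 0.],
               [7., 2., 9., 8.],
               [0., 1., 3., 4.]])
print(max_pool_2x2(fm))
# [[6. 5.]
#  [7. 9.]]
```

Replacing `.max(axis=(1, 3))` with `.mean(axis=(1, 3))` would give average pooling over the same patches.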
After the pooling layer, you get pooled feature maps: a summarized version of
the features detected in the input. Pooling also improves the stability of a
CNN: where before even the slightest fluctuation in pixel values could cause
the model to misclassify, now a small change in the location of a feature
detected by the convolutional layer still results in a pooled feature map with
the feature in the same location.
Now we need to flatten the input (turn it into a column vector) and pass it
down to a regular neural network for classification.
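Flattening is just a reshape; here is a minimal NumPy sketch (the 5x5x10 shape is an assumed example of stacked pooled maps):

```python
import numpy as np

# Ten 5x5 pooled feature maps, stacked along the channel axis.
pooled = np.zeros((5, 5, 10))

# Flattening turns them into a single vector that a regular
# fully-connected network can consume: 5 * 5 * 10 = 250 values.
flat = pooled.flatten()
print(flat.shape)   # (250,)
```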
Fully-connected layer
The fully-connected layer takes the flattened vector, connects every input to
every neuron, and produces the final class scores. As for the filters used
earlier, there is little technical analysis to be made of them, and it would be of
no importance to this tutorial: they are simply intuitively formulated matrices.
The point is to see how applying them to an image alters its features, in the
same manner that they are used to detect those features.