CNN Unit
These are the steps (our Plan of Attack) we'll follow for you to master
Convolutional Neural Networks and consequently Deep Learning:
Step 1: Convolution
The first part of this step covers the convolution operation itself. The second part will involve the Rectified Linear Unit, or ReLU. We will cover ReLU layers and explore how linearity functions in the context of Convolutional Neural Networks.
Step 2: Pooling
In this part, we'll cover pooling and will get to understand exactly how it
generally works. Our focus here, however, will be a specific type of
pooling: max pooling. We'll cover other approaches too, including
mean (or sum) pooling. This part will end with a demonstration made
using a visual interactive tool that will definitely sort the whole concept
out for you.
Step 3: Flattening
Here we'll see how the pooled feature maps are flattened into a single column of inputs for an artificial neural network.
Step 4: Full Connection
In this part, that column of inputs is passed through a fully connected artificial neural network, which produces the final class predictions.
In the end, we'll wrap everything up and give a quick recap of the concepts covered in the section. If you feel like it will do you any benefit (and it probably will), you should check out the extra tutorial in which Softmax and Cross-Entropy are covered. It's not mandatory for the course, but you will likely come across these concepts when working with Convolutional Neural Networks, and it will do you a lot of good to be familiar with them.
Let's begin.
What happens there is that your brain detects the object for the first
time, but because the look was brief, your brain does not get to process
enough of the object's features so as to categorize it correctly.
The previous images were designed so that what you see in them
depends on the line and the angle that your brain decides to begin its
"feature detection" expedition from.
It's worth noting that the four categories that show up on this guess list
are far from being the only categories that the network gets to choose
from.
As a matter of fact, with that very image in the screenshot above, many
humans are uncertain of what the object actually is. As we can see, the
correct answer was "hand glass," not a pair of scissors as the network
categorized it.
And the answer to both questions would be: through the magic of
convolutional neural networks.
Input image → Convolutional Neural Network → Output label (image class)
Black & white images are two-dimensional, whereas colored images are
three-dimensional. The difference this makes is in the value assigned to
each pixel when presented to the neural network. In the case of two-
dimensional black & white images, each pixel is assigned one number
between 0 and 255 to represent its shade.
A colored image, on the other hand, has three layers (a red, a green, and a blue one), so each pixel is assigned three values. That means that the red layer is represented with a number between 0 and 255, and so are the blue and the green layers. They are then presented in an RGB format. For example, a "hot pink" pixel would be presented to the neural network as (255, 105, 180).
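To make the difference concrete, here is a minimal sketch of how the two formats look as arrays, assuming NumPy (the tiny images below are made up and are not from the tutorial):

```python
import numpy as np

# Black & white image: one value per pixel, from 0 (black) to 255 (white).
gray_image = np.array([[0, 128, 255],
                       [64, 200, 32]], dtype=np.uint8)   # shape (2, 3): rows x columns

# Colored image: three values per pixel (red, green, blue), each from 0 to 255.
color_image = np.zeros((2, 3, 3), dtype=np.uint8)        # shape (2, 3, 3): rows x columns x channels
color_image[0, 0] = (255, 105, 180)                      # the "hot pink" pixel from the text

print(gray_image.ndim)    # 2 -> two-dimensional
print(color_image.ndim)   # 3 -> three-dimensional
```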
Matters obviously get more complex when we're trying to feed the
network pictures of actual human beings.
As you can see, the grid table on the far right shows all of the pixels valued at 0, while only the parts where the smiley face appears are valued at 1. This is a simplification of the 8-bit model we just discussed, made, again, for the sake of breaking down the concept.
In the table above, white cells are represented as 0's, and black cells
are represented as 1's, which means that there are no other possible
shades that can appear in this image.
If you look at the arc of 1's that ends in the second row from the bottom, you will be able to recognize the smile.
The steps that go into this process are broken down as follows:
Step 1: Convolution
Step 1b: ReLU Layer
Step 2: Pooling
Step 3: Flattening
Step 4: Full Connection
You will probably find these terms to be too much to digest at the
moment, which is quite normal at this point. As we go through the next
tutorials, you will get to understand what each of them actually means.
Additional Reading
In the meantime, if you want to geek out on some extra material, you
should check out this paper titled "Gradient-Based Learning Applied to
Document Recognition" by Yann LeCun and others.
Yann LeCun is widely considered one of the founding fathers of convolutional neural networks, so if there is someone who can give you the gist of the subject, it is definitely this guy.
Now it's time for you to do some reading on the topic so you can get
more familiar with the broad concept of convolutional neural networks,
and in the next tutorial we will begin to break it down into its four basic
steps.
In this tutorial, we are going to learn about convolution, which is the first
step in the process that convolutional neural networks undergo. We'll
learn what convolution is, how it works, what elements are used in it,
and what its different uses are.
Get ready!
What is convolution?
If you don't consider yourself to be quite the math buff, there is no need
to worry since this course is based on a more intuitive approach to the
concept of convolutional neural networks, not a mathematical or a
purely technical one.
Those of you who have practiced any field that entails signal processing
are probably familiar with the convolution function.
If you want to do some extra work on your own to scratch beneath the surface with regard to the mathematical aspects of convolution, you can check out the 2017 paper by professor Jianxin Wu titled "Introduction to Convolutional Neural Networks."
Let's get into the actual convolution operation in the context of neural
networks. The following example will provide you with a breakdown of
everything you need to know about this process.
Input image ⊗ Feature detector = Feature map
As you can see, the input image is the same smiley face image that we
had in the previous tutorial. Again, if you look into the pattern of the 1's
and 0's, you will be able to make out the smiley face in there.
Sometimes a 5x5 or a 7x7 matrix is used as a feature detector, but the
more conventional one, and that is the one that we will be working with,
is a 3x3 matrix. The feature detector is often referred to as a "kernel" or
a "filter," which you might come across as you dig into other material on
the topic.
You place it over the input image beginning from the top-left
corner within the borders you see demarcated above, and then
you count the number of cells in which the feature detector
matches the input image.
The number of matching cells is then inserted in the top-left cell
of the feature map.
You then move the feature detector one cell to the right and do the same thing. This movement is called a stride, and since we are moving the feature detector one cell at a time, that would be called a stride of one pixel.
What you will find in this example is that the feature detector's
middle-left cell with the number 1 inside it matches the cell that it
is standing over inside the input image. That's the only matching
cell, and so you write "1" in the next cell in the feature map, and
so on and so forth.
After you have gone through the whole first row, you can then
move it over to the next row and go through the same process.
It's important not to confuse the feature map with the other two
elements. The cells of the feature map can contain any digit, not only
1's and 0's. After going over every pixel in the input image in the
example above, we would end up with these results:
By the way, just like the feature detector can also be referred to as a kernel or a filter, a feature map is also known as an activation map, and those terms are likewise interchangeable.
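To make the mechanics concrete, here is a minimal sketch of the operation in Python, assuming NumPy; the 7x7 binary image and the 3x3 detector below are made-up stand-ins for the ones in the figures:

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide the feature detector (kernel) over the image and build the feature map.

    At each position we multiply the overlapping cells elementwise and sum them;
    for a 0/1 image and a 0/1 detector this is exactly the count of matching 1's.
    """
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    feature_map = np.zeros((out_h, out_w), dtype=int)
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

# A made-up 7x7 binary input image and a made-up 3x3 feature detector
# (stand-ins for the smiley face and the detector shown in the figures).
image = np.array([
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 1],
    [0, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
])
detector = np.array([
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
])

print(convolve2d(image, detector, stride=1))  # a 5x5 feature map
```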
What is the point of the convolution operation?
There are several things we gain from deriving a feature map. The most important of them is reducing the size of the input image, and you should know that the larger your strides (the movements across pixels), the smaller your feature map.
When dealing with proper images, you will find it necessary to widen
your strides. Here we were dealing with a 7x7 input image after all, but
real images tend to be substantially larger and more complex.
These are the most revealing features, and that is all your brain needs
to see in order to make its conclusion. Even these features are seen
broadly and not down to their minutiae.
If your brain actually had to process every bit of data that enters through
your senses at any given moment, you would first be unable to take any
actions, and soon you would have a mental breakdown. Broad
categorization happens to be more practical.
You can actually use a convolution matrix to adjust an image. Here are
a few examples of filters being applied to images using these matrices.
There is really little technical analysis to be made of these filters and it
would be of no importance to our tutorial. These are just intuitively
formulated matrices. The point is to see how applying them to an image
can alter its features in the same manner that they are used to detect
these features.
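As an illustration (not the exact matrices from the figures), here is a minimal sketch that applies a few commonly used convolution matrices to a grayscale image, assuming NumPy and SciPy are available:

```python
import numpy as np
from scipy import ndimage   # assumes SciPy is installed

# Three commonly used convolution matrices (not necessarily the ones in the figures).
sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]])
blur = np.ones((3, 3)) / 9.0
edge_detect = np.array([[-1, -1, -1],
                        [-1,  8, -1],
                        [-1, -1, -1]])

image = np.random.rand(64, 64)   # stand-in for a real grayscale photo

sharpened = ndimage.convolve(image, sharpen, mode="reflect")
blurred   = ndimage.convolve(image, blur, mode="reflect")
edges     = ndimage.convolve(image, edge_detect, mode="reflect")
```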
What's next?
That's all you need to know for now about the convolution operation. In our next tutorial, we will go through the next part of the convolution step: the ReLU layer.
The Rectified Linear Unit, or ReLU, is not a separate component of the convolutional neural networks'
process.
It's a supplementary step to the convolution operation that we covered in the previous tutorial. There
are some instructors and authors who discuss both steps separately, but in our case, we're going to
consider both of them to be components of the first step in our process.
If you're done with the previous section on artificial neural networks, then you should be familiar with
the rectifier function that you see in the image below.
The purpose of applying the rectifier function is to increase the non-linearity in our images.
When you look at any image, you'll find it contains a lot of non-linear features (e.g. the transition
between pixels, the borders, the colors, etc.).
The rectifier serves to break up the linearity even further in order to make up for the linearity that we might impose on an image when we put it through the convolution operation.
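In code, rectification is a one-liner. Here is a minimal sketch, assuming NumPy and a small made-up feature map:

```python
import numpy as np

def relu(feature_map):
    """Rectified Linear Unit: negative values become 0, positive values pass through."""
    return np.maximum(feature_map, 0)

feature_map = np.array([[-3, 1, 0],
                        [ 2, -1, 4]])
print(relu(feature_map))
# [[0 1 0]
#  [2 0 4]]
```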
To see how that actually plays out, we can look at the following picture and see the changes that
happen to it as it undergoes the convolution operation followed by rectification.
Rectification
What the rectifier function does to an image like this is remove all the black elements from it,
keeping only those carrying a positive value (the grey and white colors).
The essential difference between the non-rectified version of the image and the rectified one is the
progression of colors. If you look closely at the first one, you will find parts where a white streak is
followed by a grey one and then a black one. After we rectify the image, you will find the colors
changing more abruptly.
The gradual change is no longer there. That indicates that the linearity has been disposed of.
You have to bear in mind that the way we just examined this example provides only a basic, non-technical understanding of the concept of rectification.
The mathematical concepts behind the process are unnecessary here and would be pretty complex at
this point.
This second example is more advanced. Here we have 6 different images of 6 different cheetahs (or 5,
there is 1 that seems to appear in 2 photos) and they are each posing differently in different settings and
from different angles.
Again, max pooling is concerned with teaching your convolutional neural network to recognize that, despite all of these differences that we mentioned, they are all images of a cheetah. In order to do that, the network needs to acquire a property that is known as "spatial invariance."
This property makes the network capable of detecting the object in the image without being confused
by the differences in the image's textures, the distances from where they are shot, their angles, or
otherwise.
In order to reach the pooling step, we need to have finished the convolution step, which means that we
would have a feature map ready.
Types of Pooling
Before getting into the details, you should know that there are several types of pooling. These include
among others the following:
Mean pooling
Max pooling
Sum pooling
We place a 2x2 box over the top-left corner of the feature map. For every 4 cells the box stands on, you find the maximum numerical value and insert it into the pooled feature map. In the figure below, for instance, the box currently contains a group of cells where the maximum value is 4.
If you remember the convolution operation example from the previous tutorial, we were using strides of one pixel. In this example, we are using 2-pixel strides, which is why we end up with a 3x3 pooled feature map. Generally, strides of two are the most commonly used.
Note that in the third movement along the same row, you will find yourself stuck with one lonely
column.
You would still proceed despite the fact that half of your box will be empty. You still find your maximum value and put it in the pooled feature map. In the last step, you will face a situation where the box contains a single cell; you simply take that value to be the maximum.
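Here is a minimal sketch of that procedure in Python, assuming NumPy; the 5x5 feature map below is made up, but it reproduces the lonely-column and single-cell situations described above:

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """Max pooling with a size x size box; partial boxes at the edges are kept,
    matching the lonely-column and single-cell cases described in the text."""
    h, w = feature_map.shape
    out_h = int(np.ceil(h / stride))
    out_w = int(np.ceil(w / stride))
    pooled = np.zeros((out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            pooled[i, j] = window.max()
    return pooled

# A made-up 5x5 feature map (the size a 7x7 image convolved with a 3x3 detector produces).
feature_map = np.array([
    [0, 1, 0, 0, 0],
    [0, 4, 2, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 0, 4, 0],
    [0, 0, 1, 2, 1],
])
print(max_pool(feature_map))   # a 3x3 pooled feature map
```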
Just like in the convolution step, the creation of the pooled feature map also makes us dispose of
unnecessary information or features. In this case, we have lost roughly 75% of the original information
found in the feature map since for each 4 pixels in the feature map we ended up with only the
maximum value and got rid of the other 3. These are the details that are unnecessary and without which
the network can do its job more efficiently.
The reason we extract the maximum value, which is actually the point of the whole pooling step, is
to account for distortions. Let's say we have three cheetah images, and in each image the cheetah's tear
lines are taking a different angle.
The feature after it has been pooled will be detected by the network despite these differences in its
appearance between the three images. Consider the tear line feature to be represented by the 4 in the
feature map above.
Imagine that instead of the four appearing in cell 4x2, it appeared in 3x1. When pooling the feature, we
would still end up with 4 as the maximum value from that group, and thus we would get the same result
in the pooled version.
This process is what provides the convolutional neural network with its "spatial invariance" capability.
In addition to that, pooling serves to minimize the size of the images as well as the number of
parameters which, in turn, prevents an issue of "overfitting" from coming up.
Overfitting, in a nutshell, is when you create an excessively complex model in order to account for the idiosyncrasies we just mentioned.
Again, this is an abstract explanation of the pooling concept without digging into the mathematical and
technical aspects of it.
We can draw an analogy here from the human brain. Our brains, too, conduct a pooling step, since the
input image is received through your eyes, but then it is distilled multiple times until, as much as
possible, only the most relevant information is preserved for you to be able to recognize what you are
looking at.
Additional Reading
If you want to do your own reading on the subject, check out this 2010 paper titled "Evaluation of
Pooling Operations in Convolutional Architectures for Object Recognition" by Dominik Scherer and
others from the University of Bonn.
It's a pretty simple read, only 10 pages long, and will boil down the concept of pooling for you just
perfectly. You can even skip the second part titled "Related Work" if you find it irrelevant to what you
want to understand.
In the line-up of images in the middle, the box standing alone in the bottom row represents the input image, the row after that represents the convolution operation, and that is followed by the pooling phase.
You'll see the term "downsampling" used in the "layer visibility" section on the left. Downsampling is
simply another word for pooling.
If you look at the various versions of the original image that appear in the convolution row, you'll be
able to recognize the filters being used for the convolution operation and the features that the
application is focusing on.
You'll notice that in the pooling row, the images have more or less the same features as their convolved
versions minus some information. You can still recognize it as the same image.
On a side note:
You can ignore the other rows for now since we haven't covered these processes yet. Just
keep in mind that, like pooling was similar in its steps to the convolution operation, these,
too, are just further layers of the same process.
If you hover your mouse over any image, it will send out a ray pointing at the source of the particular pixels you're standing on in the version that came before it (the pooling version will point to the convolution version, and that one will point to the input image).
Step 3: Flattening
After finishing the previous two steps, we're supposed to have a pooled
feature map by now. As the name of this step implies, we are literally
going to flatten our pooled feature map into a column like in the image
below.
The reason we do this is that we're going to need to insert this data into
an artificial neural network later on.
As you see in the image above, we have multiple pooled feature maps
from the previous step.
What happens after the flattening step is that you end up with a long
vector of input data that you then pass through the artificial neural
network to have it processed further.
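As a minimal sketch (assuming NumPy and three made-up 3x3 pooled feature maps, which are not the actual maps from the figures), flattening amounts to the following:

```python
import numpy as np

# Suppose the previous step left us with three pooled feature maps (values made up).
pooled_maps = [np.random.randint(0, 5, size=(3, 3)) for _ in range(3)]

# Flattening: each map is unrolled row by row, and the results are chained
# into one long input vector for the artificial neural network.
input_vector = np.concatenate([m.ravel() for m in pooled_maps])
print(input_vector.shape)   # (27,) -> 3 maps x 9 cells each
```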
To sum up, here is what we have after we're done with each of the
steps that we have covered up until now:
In the next tutorial, we will discuss how this data will be used.
As you see from the image below, we have three layers in the full
connection step:
Input layer
Fully-connected layer
Output layer
At this point, the features extracted in the previous steps are already sufficient for a fair degree of accuracy in recognizing classes. We now want to take it to the next level in terms of
complexity and precision.
We can now look at a more complex example than the one at the
beginning of the tutorial.
Say, for instance, the network predicts the figure in the image to be a dog with a probability of 80%, yet the image actually turns out to be of a cat. An error has to be calculated in this case. In the context of artificial neural networks, we called this calculation a "cost function" (the mean squared error), but when dealing with convolutional neural networks, it is more commonly referred to as a "loss function."
Class Recognition
Up until now, we've been discussing examples where the output consists of a single neuron. Since this example contains two output neurons (one for the dog class and one for the cat class), there are some differences that show up.
In order to understand how it will play out, we need to check out the
weights placed on each synapse linking to this class so that we can tell
which attributes/features are most relevant to it.
The Application
The next step becomes putting our network's efficacy to the test. Say,
we give it an image of a dog.
The dog and cat classes at the end of the artificial neural network have absolutely no clue about the image. All they have is what is given to them
by the previous layer through the synapses.
By now, each of these classes has three attributes that it's focusing its
attention on.
As you see in the step below, the dog image was predicted to fall into the dog class with a probability of 0.95, and the other 0.05 was placed on the cat class.
Think of it this way: This process is a vote among the neurons on which
of the classes the image will be attributed to. The class that gets the
majority of the votes wins. Of course, these votes are as good as their
weights.
Summary
As you see, and should probably remember from the previous tutorials, the process goes as follows: convolution, followed by ReLU, then pooling, then flattening, and finally the full connection step that produces the output classes.
Throughout this entire process, the network's building blocks, like the
weights and the feature maps, are trained and repeatedly altered in
order for the network to reach the optimal performance that will make it
able to classify images and objects as accurately as possible.
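To tie these steps together in code, here is a minimal sketch of how such a pipeline might be assembled, assuming TensorFlow/Keras is available; the tutorial itself never names a library, and the input shape, filter count, and layer sizes below are placeholders rather than values from the course:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu",   # Step 1 + 1b: convolution + ReLU
                           input_shape=(64, 64, 3)),
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),          # Step 2: max pooling
    tf.keras.layers.Flatten(),                                # Step 3: flattening
    tf.keras.layers.Dense(128, activation="relu"),            # Step 4: fully-connected layer
    tf.keras.layers.Dense(2, activation="softmax"),           # Output layer: dog vs cat
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",   # the loss function discussed below
              metrics=["accuracy"])
```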
By this point, you have acquired all the knowledge you need for you to
proceed to the more practical applications of the concept of
convolutional neural networks.
Additional Reading
If you're still craving more on the subject, though, you can always do
some extra reading of your own.
This blog post by Adit Deshpande from 2016 titled The 9 Deep Learning
Papers You Need To Know About (Understanding CNN's Part 3) will
brief you up on 9 real-life applications of what you learned in this
section, and you can then go on to study these examples in more depth.
Granted, you'll stumble on quite a few concepts that will be brand new to you, but with the knowledge you have now, you'll be able to manage a general overview of these papers.
Also, as you train more, you can always revisit these case studies every
once in a while with your newly-acquired experience.
That being said, learning about the softmax and cross-entropy functions
can give you a tighter grasp of this section's topic.
When looking at the predictions generated by the artificial neural network in the image below, we should ask ourselves a question: how do the dog and cat probabilities manage to add up to one? Do the two output neurons coordinate with each other?
The answer is that they don't actually coordinate these results. The reason the results have this coherence is that we introduce the first element of this tutorial, the softmax function, into the network.
Softmax
The softmax function goes like this:
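Written out in its standard form, softmax takes the vector $z$ of raw scores coming out of the final layer and turns the score for class $j$ into a probability:

$$f_j(z) = \frac{e^{z_j}}{\sum_{k} e^{z_k}}$$

Each output lies between 0 and 1, and together the outputs add up to 1, which is exactly the behavior we want from the dog and cat neurons.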
Without this function, the dog and cat classes would each have some real-valued score, but the two results would not add up to any particular figure. Squashing them with softmax so that they behave like probabilities is really the only sensible thing to do if you want your convolutional neural network's output to be of any use.
Cross-Entropy
You shouldn't let the complexity of its name and the formulas
overwhelm you, though. The cross-entropy function is actually quite
simple as you will see.
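In its standard form, the cross-entropy between a target distribution $p$ (the label) and a predicted distribution $q$ (the network's probabilities) is:

$$H(p, q) = -\sum_{x} p(x)\,\log q(x)$$

We will come back to which values go into $p$ and which go into $q$ in the demonstration below.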
If you remember the material from the artificial neural networks section,
we had a function called the "mean squared error function."
We used this function in order to assess the performance of our
network, and by working to minimize this mean squared error we would
practically be optimizing the network.
The mean squared error function can be used with convolutional neural networks, but an even better option is to apply the cross-entropy function after the softmax function. In that setting it is no longer called a cost function but rather a loss function. The differences between the two are not that significant anyway; it's mainly a terminological distinction, and you might as well consider them the exact same thing for now.
We will use the loss function in the same manner with which we use the
cost function; we'll try to minimize it as much as possible by optimizing
our network.
Demonstration
That provides a "dog" label, hence the 1 and the 0 on the right. When applying the cross-entropy function, make sure that the predicted probabilities on the left go into the "q" slot and the label values on the right go into the "p" slot; getting the order right matters.
As you can see here, for the first two images, both neural networks
came up with similar end predictions, yet the probabilities were
calculated with much less accuracy by the second neural network.
For the third image which shows that bizarre lion-looking creature, both
neural networks were wrong in their predictions since this animal
actually happens to be a dog, but the first neural network was leaning more towards the right prediction than the second one.
So, even when they were both wrong, the second neural network was
"more wrong" than the first. What we can do with this data is to try and
make an assessment of both network's performance. To do that, we will
have to use one of the functions we mentioned.
Classification Error
This one is very basic. It just tells you how many wrong predictions each
network made. In our example, each network made one wrong
prediction out of three. This function is not all that useful when it comes
to backward propagation.
Mean Squared Error
The way you derive the mean squared error is by calculating the squared values of the network's errors and then taking the average of these across the table. In our case, NN2 obviously has a much higher
error rate than NN1.
Cross-Entropy
As you can see, the structure of this function's results differs from that of the others: NN2 ends up with a cross-entropy value of 1.06, which can be somewhat confusing at first glance.
There are several advantages that you get from using the cross-entropy
function that are not exactly intuitive or obvious.
We'll examine here one of the core advantages, and if you want to learn
about the remaining reasons for using cross-entropy, you can do so
from the material you'll find mentioned at the end of this tutorial.
Suppose a network improves its predicted probability for the correct class from something tiny, say one in a million, to one in a thousand. When calculating the mean squared error, you effectively subtract one value from the other, and the change will be too trivial to even consider. When using the cross-entropy function, you take a logarithm, which amounts to comparing the two values by dividing one by the other, so an improvement by a factor of a thousand shows up clearly.
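To make the comparison tangible, here is a minimal sketch in Python that scores two networks on three labeled images, assuming NumPy; the probabilities are made up for illustration and are not the values from the lecture's table:

```python
import numpy as np

# Made-up predictions (dog probability, cat probability) for three images
# from two networks; both get the third image wrong, NN2 more so.
labels = np.array([[1, 0], [0, 1], [1, 0]])              # one-hot labels: dog, cat, dog
nn1    = np.array([[0.9, 0.1], [0.1, 0.9], [0.4, 0.6]])
nn2    = np.array([[0.6, 0.4], [0.3, 0.7], [0.1, 0.9]])

def classification_error(p, q):
    # Fraction of images where the highest-probability class is wrong.
    return np.mean(np.argmax(q, axis=1) != np.argmax(p, axis=1))

def mean_squared_error(p, q):
    return np.mean((p - q) ** 2)

def cross_entropy(p, q):
    # -sum p(x) log q(x), averaged over the images; p = labels, q = predictions.
    return np.mean(-np.sum(p * np.log(q), axis=1))

for name, q in [("NN1", nn1), ("NN2", nn2)]:
    print(name,
          classification_error(labels, q),
          mean_squared_error(labels, q),
          cross_entropy(labels, q))
```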