Convolutional Neural Network
A convolutional neural network (CNN) is a class of deep neural networks, most commonly applied to analyzing visual imagery. CNNs use relatively little pre-processing compared to other image classification algorithms. To see how they work, consider a toy problem: deciding whether a small image contains an X or an O.
A naïve approach to solving this problem is to save an image of an X and an O and compare
every new image to our exemplars to see which is the better match. What makes this task
tricky is that computers are extremely literal. To a computer, an image looks like a two-dimensional array of pixels (think giant checkerboard) with a number in each position. In our
example a pixel value of 1 is white, and -1 is black. When comparing two images, if any pixel
values don’t match, then the images don’t match, at least to the computer. Ideally, we would
like to be able to see X’s and O’s even if they’re shifted, shrunken, rotated or deformed. This
is where CNNs come in.
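To see why the naive scheme fails, here is a minimal sketch in NumPy, using the text's convention that 1 is white and -1 is black. The tiny 3x3 exemplar is made up for illustration.

```python
import numpy as np

# A 3x3 stand-in for an X: black (-1) diagonals on a white (1) background.
exemplar_x = np.array([[-1,  1, -1],
                       [ 1, -1,  1],
                       [-1,  1, -1]])

def exact_match(image, exemplar):
    # The computer's literal view: a single differing pixel means "no match".
    return np.array_equal(image, exemplar)

shifted = np.roll(exemplar_x, 1, axis=1)    # the same X, shifted one pixel right
print(exact_match(exemplar_x, exemplar_x))  # True
print(exact_match(shifted, exemplar_x))     # False: literal comparison fails
```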
CNNs compare images piece by piece. The pieces they look for are called features. By finding rough feature matches in roughly the same positions in two images, CNNs get a lot better at seeing similarity than whole-image matching schemes do.
Each feature is like a mini-image—a small two-dimensional array of values. Features match
common aspects of the images. In the case of X images, features consisting of diagonal lines
and a crossing capture all the important characteristics of most X’s. These features will
probably match up to the arms and center of any image of an X.
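As a sketch, such features are just small arrays. The exact values below are assumptions for illustration; in a trained CNN the features are learned from data.

```python
import numpy as np

# Illustrative 3x3 features for an X (1 = white, -1 = black).
diag_down = np.array([[-1,  1,  1],
                      [ 1, -1,  1],
                      [ 1,  1, -1]])   # a "\" diagonal line
diag_up   = np.fliplr(diag_down)       # a "/" diagonal line
crossing  = np.array([[-1,  1, -1],
                      [ 1, -1,  1],
                      [-1,  1, -1]])   # the center where the arms cross
```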
When presented with a new image, the CNN doesn't know exactly where these features will match, so it tries them everywhere, in every possible position. Matching a feature against every patch of the image in this way turns the feature into a filter. The math we use to do this is called convolution, from which Convolutional Neural Networks take their name.
The math behind convolution is nothing that would make a sixth-grader uncomfortable. To
calculate the match of a feature to a patch of the image, simply multiply each pixel in the
feature by the value of the corresponding pixel in the image. Then add up the answers and
divide by the total number of pixels in the feature. If both pixels are white (a value of 1) then
1 * 1 = 1. If both are black, then (-1) * (-1) = 1. Either way, every matching pixel results in a
1. Similarly, any mismatch gives a -1. If all the pixels in a feature match, then adding up the products and dividing by the total number of pixels gives a 1. If none of the pixels in the feature match the image patch, the answer is a -1.
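The whole calculation fits in a couple of lines. Here is a sketch of the match score for one feature against one equally sized image patch:

```python
import numpy as np

def match_score(feature, patch):
    # Multiply corresponding pixels, add up the products, and divide
    # by the number of pixels in the feature (same as np.mean).
    return np.sum(feature * patch) / feature.size

feature = np.array([[-1,  1,  1],
                    [ 1, -1,  1],
                    [ 1,  1, -1]])

print(match_score(feature, feature))    # 1.0  -> perfect match
print(match_score(feature, -feature))   # -1.0 -> perfect mismatch
```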
To complete our convolution, we repeat this process, lining up the feature with every possible
image patch. We can take the answer from each convolution and make a new two-dimensional array from it, based on where in the image each patch is located. This map of
matches is also a filtered version of our original image. It’s a map of where in the image the
feature is found. Values close to 1 show strong matches, values close to -1 show strong
matches for the photographic negative of our feature, and values near zero show no match of
any sort.
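Here is a rough sketch of that sweep for a single feature over a single-channel image. (Strictly speaking, this computes a cross-correlation, since the feature is not flipped; that is also what most CNN libraries compute under the name convolution.)

```python
import numpy as np

def convolve(image, feature):
    # Slide the feature over every valid position and record the match
    # score (mean of pixelwise products) at each location.
    fh, fw = feature.shape
    out_h = image.shape[0] - fh + 1
    out_w = image.shape[1] - fw + 1
    feature_map = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            patch = image[r:r + fh, c:c + fw]
            feature_map[r, c] = np.mean(feature * patch)
    return feature_map
```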
The next step is to repeat the convolution process in its entirety for each of the other features.
The result is a set of filtered images, one for each of our filters. It’s convenient to think of this
whole collection of convolution operations as a single processing step. In CNNs this is
referred to as a convolution layer, hinting at the fact that it will soon have other layers added
to it.
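In code, a convolution layer is then just the sweep repeated once per feature. This short sketch reuses the convolve() function from the previous example:

```python
def conv_layer(image, features):
    # One filtered image per feature; together they form the layer's output.
    return [convolve(image, f) for f in features]
```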
It’s easy to see how CNNs get their reputation as computation hogs. Although we can sketch
our CNN on the back of a napkin, the number of additions, multiplications and divisions can add up fast. In math speak, it scales linearly with the number of pixels in the image, with
the number of pixels in each feature and with the number of features. With so many factors,
it’s easy to make this problem many millions of times larger without breaking a sweat. Small
wonder that microchip manufacturers are now making specialized chips in an effort to keep up with the demands of CNNs.
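A quick back-of-napkin count under that scaling, with image and feature sizes chosen purely for illustration:

```python
# multiply-adds ~ (pixels in image) x (pixels per feature) x (features)
image_pixels   = 256 * 256   # a modest 256x256 image
feature_pixels = 5 * 5       # 5x5 features
num_features   = 10
print(image_pixels * feature_pixels * num_features)   # 16,384,000
```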
Another power tool that CNNs use is called pooling. Pooling is a way to take large images
and shrink them down while preserving the most important information in them. The math
behind pooling is second-grade level at most. It consists of stepping a small window across
an image and taking the maximum value from the window at each step. In practice, a window
2 or 3 pixels on a side and steps of 2 pixels work well.
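A minimal max-pooling sketch with a 2x2 window and a stride of 2:

```python
import numpy as np

def max_pool(image, size=2, stride=2):
    # Step a size-by-size window across the image, keeping only the
    # maximum value seen in each window.
    rows = range(0, image.shape[0] - size + 1, stride)
    cols = range(0, image.shape[1] - size + 1, stride)
    return np.array([[image[r:r + size, c:c + size].max() for c in cols]
                     for r in rows])

x = np.array([[0.1, 0.9, 0.3, 0.2],
              [0.4, 0.2, 0.8, 0.1],
              [0.7, 0.1, 0.2, 0.6],
              [0.2, 0.3, 0.1, 0.5]])
print(max_pool(x))   # [[0.9 0.8]
                     #  [0.7 0.6]] -- a 4x4 image shrinks to 2x2
```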
After pooling, an image has about a quarter as many pixels as it started with. Because it keeps
the maximum value from each window, it preserves the best fits of each feature within the
window. This means that it doesn’t care so much exactly where the feature fit as long as it fit
somewhere within the window. The result of this is that CNNs can find whether a feature is
in an image without worrying about where it is. This helps solve the problem of computers
being hyper-literal.
A pooling layer is just the operation of performing pooling on an image or a collection of
images. The output will have the same number of images, but they will each have fewer
pixels. This is also helpful in managing the computational load. Taking an 8 megapixel image
down to a 2 megapixel image makes life a lot easier for everything downstream.
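As a sketch, a pooling layer is just the max_pool() function from the previous example applied to every image in the collection:

```python
def pool_layer(images, size=2, stride=2):
    # Same number of images out as in; each one shrunk.
    return [max_pool(img, size, stride) for img in images]
```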
A small but important player in this process is the Rectified Linear Unit, or ReLU. Its math is also very simple: wherever a negative number occurs, swap it out for a 0. This helps the CNN stay mathematically healthy by keeping learned values from getting stuck near 0 or blowing up toward infinity. It's the axle grease of CNNs: not particularly glamorous, but without it they don't get very far.
The output of a ReLU layer is the same size as whatever is put into it, just with all the negative values removed.
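The entire operation is one line in NumPy:

```python
import numpy as np

def relu(x):
    # Wherever a negative number occurs, swap it out for a 0.
    return np.maximum(x, 0)

print(relu(np.array([[0.5, -0.3],
                     [-1.0, 0.8]])))
# [[0.5 0. ]
#  [0.  0.8]]
```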
You’ve probably noticed that the input to each layer (two-dimensional arrays) looks a lot like
the output (two-dimensional arrays). Because of this, we can stack them like Lego bricks.
Raw images get filtered, rectified and pooled to create a set of shrunken, feature-filtered
images. These can be filtered and shrunken again and again. Each time, the features become
larger and more complex, and the images become more compact. This lets lower layers
represent simple aspects of the image, such as edges and bright spots. Higher layers can
represent increasingly sophisticated aspects of the image, such as shapes and patterns. These
tend to be readily recognizable. For instance, in a CNN trained on human faces, the highest
layers represent patterns that are clearly face-like.
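A minimal sketch of such a stack, with the operations from the earlier examples redefined compactly so the snippet stands alone, applied to a made-up 16x16 image and a single made-up feature:

```python
import numpy as np

def convolve(image, feature):
    fh, fw = feature.shape
    out = np.zeros((image.shape[0] - fh + 1, image.shape[1] - fw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.mean(image[r:r + fh, c:c + fw] * feature)
    return out

def relu(x):
    return np.maximum(x, 0)

def max_pool(x, size=2, stride=2):
    rows = range(0, x.shape[0] - size + 1, stride)
    cols = range(0, x.shape[1] - size + 1, stride)
    return np.array([[x[r:r + size, c:c + size].max() for c in cols]
                     for r in rows])

# Stack the layers like Lego bricks: filter, rectify, pool -- then repeat.
rng = np.random.default_rng(0)
image = rng.choice([-1.0, 1.0], size=(16, 16))   # a made-up binary image
feature = np.eye(3) * 2 - 1                      # a diagonal-line feature

x = max_pool(relu(convolve(image, feature)))     # 16x16 -> 14x14 -> 7x7
x = max_pool(relu(convolve(x, feature)))         # 7x7 -> 5x5 -> 2x2
print(x.shape)                                   # the image grows more compact
```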
CNNs have one more arrow in their quiver. Fully connected layers take the high-level filtered
images and translate them into votes. In our case, we only have to decide between two
categories, X and O. Fully connected layers are the primary building block of traditional
neural networks. Instead of treating inputs as a two-dimensional array, they treat them as a single list, with every value handled identically. Every value gets its own vote on whether the current image is an X or an O. However, the process isn't entirely democratic. Some values are
much better than others at knowing when the image is an X, and some are particularly good
at knowing when the image is an O. These get larger votes than the others. These votes are
expressed as weights, or connection strengths, between each value and each category.
When a new image is presented to the CNN, it percolates through the lower layers until it
reaches the fully connected layer at the end. Then an election is held. The answer with the
most votes wins and is declared the category of the input.
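A minimal sketch of the voting and the election together. The values and weights below are invented for illustration; in a real CNN the weights are learned during training.

```python
import numpy as np

values  = np.array([0.9, 0.65, 0.45, 0.87])   # flattened high-level outputs
weights = np.array([[ 1.0, -0.2],             # column 0 votes for "X",
                    [ 0.9,  0.1],             # column 1 votes for "O"
                    [-0.3,  1.0],
                    [ 0.8, -0.1]])

votes = values @ weights                  # each value casts weighted votes
categories = ["X", "O"]
print(votes)                              # [2.046 0.248]
print(categories[int(np.argmax(votes))])  # the election: "X" wins
```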
Fully connected layers, like the rest, can be stacked because their outputs (a list of
votes) look a whole lot like their inputs (a list of values). In practice, several fully
connected layers are often stacked together, with each intermediate layer voting on
phantom “hidden” categories. In effect, each additional layer lets the network learn
ever more sophisticated combinations of features that help it make better decisions.
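A sketch of two stacked fully connected layers, with a rectified hidden layer in between. All sizes and weights here are assumptions for illustration:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

# The hidden layer's 4 "categories" are the phantom intermediate votes
# described above; the output layer turns them into the final X-vs-O votes.
rng = np.random.default_rng(0)
values   = rng.random(8)             # flattened outputs of the conv/pool stack
hidden_w = rng.normal(size=(8, 4))   # 8 values -> 4 hidden categories
output_w = rng.normal(size=(4, 2))   # 4 hidden values -> 2 final categories

hidden = relu(values @ hidden_w)     # intermediate votes, rectified
votes  = hidden @ output_w           # final votes: one score for X, one for O
print(votes)
```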