CNNs, Part 1: An Introduction to Convolutional Neural Networks
There’s been a lot of buzz about Convolutional Neural Networks (CNNs) in the past few
years, especially because of how they’ve revolutionized the field of Computer Vision. In
this post, we’ll build on a basic background knowledge of neural networks and explore
what CNNs are, understand how they work, and build a real one from scratch
(using only numpy) in Python.
1. Motivation
A classic use case of CNNs is to perform image classification, e.g. looking at an image
of a pet and deciding whether it’s a cat or a dog. It’s a seemingly simple task - why not
just use a normal Neural Network?
Good question.

The problem is that real-world images are big. Flattening even a modest 224x224 color image into a vector gives 224 × 224 × 3 = 150,528 input values, so every node in a fully-connected first hidden layer would need 150,528 weights of its own. A network like that quickly becomes enormous and hard to train. A flattened representation is also fragile: shift the object in the image by a few pixels and the input vector changes completely, even though nothing meaningful about the image has changed.
It’s not like we need that many weights, either. The nice thing about images is that we
know pixels are most useful in the context of their neighbors. Objects in images are
made up of small, localized features, like the circular iris of an eye or the square corner
of a piece of paper. Doesn’t it seem wasteful for every node in the first hidden layer to
look at every pixel?
We’ll see soon how a CNN can help us mitigate these problems.
2. Dataset
In this post, we’ll tackle the “Hello, World!” of Computer Vision: the MNIST handwritten
digit classification problem. It’s simple: given an image, classify it as a digit.
Each image in the MNIST dataset is 28x28 and contains a centered, grayscale digit.
Truth be told, a normal neural network would actually work just fine for this problem. You
could treat each image as a 28 x 28 = 784-dimensional vector, feed that to a 784-dim
input layer, stack a few hidden layers, and finish with an output layer of 10 nodes, 1 for
each digit.
This would only work because the MNIST dataset contains small images that are
centered, so we wouldn’t run into the aforementioned issues of size or shifting. Keep in
mind throughout the course of this post, however, that most real-world image
classification problems aren’t this easy.
3. Convolutions
What are Convolutional Neural Networks?
They’re basically just neural networks that use Convolutional layers, a.k.a. Conv
layers, which are based on the mathematical operation of convolution. Conv layers
consist of a set of filters, which you can think of as just 2d matrices of numbers. Here’s
an example 3x3 filter:
-1 0 1
-2 0 2
-1 0 1
A 3x3 filter
We can use an input image and a filter to produce an output image by convolving the
filter with the input image. This consists of:

1. Overlaying the filter on top of the image at some location.
2. Performing element-wise multiplication between the values in the filter and their corresponding values in the image.
3. Summing up all the element-wise products. This sum is the output value for the destination pixel in the output image.
Side Note: We (along with many CNN implementations) are technically using cross-correlation instead of convolution here, but they do almost the same thing. I won’t go into the difference in this post because it’s not that important, but feel free to look it up if you’re curious.
That 3-step description was a little abstract, so let’s do an example. Consider this tiny
4x4 grayscale image and this 3x3 filter:
 0  50   0  29
 0  80  31   2
33  90   0  75
 0   9   0  95

The 4x4 grayscale image

-1  0  1
-2  0  2
-1  0  1

The 3x3 filter
The numbers in the image represent pixel intensities, where 0 is black and 255 is white.
We’ll convolve the input image and the filter to produce a 2x2 output image:
? ?
? ?
A 2x2 output image
To start, let’s overlay our filter in the top left corner of the image:
 0  50   0  29
 0  80  31   2
33  90   0  75
 0   9   0  95

The 3x3 filter overlayed on the top left 3x3 region of the image

Next, we perform element-wise multiplication between the overlapping image values and filter values:

Image Value | Filter Value | Result
 0          | -1           |   0
50          |  0           |   0
 0          |  1           |   0
 0          | -2           |   0
80          |  0           |   0
31          |  2           |  62
33          | -1           | -33
90          |  0           |   0
 0          |  1           |   0

Then, we sum up all the results: 62 − 33 = 29.
Finally, we place our result in the destination pixel of our output image. Since our filter is
overlayed in the top left corner of the input image, our destination pixel is the top left
pixel of the output image:
29  ?
 ?  ?

The 2x2 output image with its top left pixel filled in

We’d then repeat the same overlay-multiply-sum process for each of the remaining locations to fill in the rest of the output image.
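As a sanity check, here’s a minimal numpy sketch of this whole convolution. It’s separate from the CNN classes we’ll build later; the image and kernel arrays just reproduce the example above:

import numpy as np

# The 4x4 image and 3x3 filter from the example above.
image = np.array([
  [ 0, 50,  0, 29],
  [ 0, 80, 31,  2],
  [33, 90,  0, 75],
  [ 0,  9,  0, 95],
])
kernel = np.array([
  [-1, 0, 1],
  [-2, 0, 2],
  [-1, 0, 1],
])

# Slide the filter over every valid 3x3 region: multiply element-wise,
# sum, and store the result in the corresponding output pixel.
output = np.zeros((2, 2))
for i in range(2):
  for j in range(2):
    region = image[i:i + 3, j:j + 3]
    output[i, j] = np.sum(region * kernel)

print(output) # [[29. -192.] [-35. -22.]] - the top left pixel is 29, as computed by hand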
3.1 How is this useful?

Convolving with the right filters lets us find specific features in an image. To see what I mean, consider these two famous 3x3 filters:

-1  0  1
-2  0  2
-1  0  1

The vertical Sobel filter
1 2 1
0 0 0
-1 -2 -1
The horizontal Sobel filter
See what’s happening? Sobel filters are edge-detectors. The vertical Sobel filter detects vertical edges, and the horizontal Sobel filter detects horizontal edges. Convolving an image with a Sobel filter produces an easily-interpreted output: a bright pixel (one that has a high value) in the output image indicates that there’s a strong edge around there in the original image.
Can you see why an edge-detected image might be more useful than the raw image?
Think back to our MNIST handwritten digit classification problem for a second. A CNN
trained on MNIST might look for the digit 1, for example, by using an edge-detection
filter and checking for two prominent vertical edges near the center of the image. In
general, convolution helps us look for specific localized image features (like
edges) that we can use later in the network.
3.2 Padding
Remember convolving a 4x4 input image with a 3x3 filter earlier to produce a 2x2 output
image? Oftentimes, we’d prefer the output image to be the same size as the input
image. To do this, we add zeros around the image so we can overlay the filter in more
places. A 3x3 filter requires 1 pixel of padding:
0 0 0 0 0 0
0 0 50 0 29 0
0 0 80 31 2 0
0 33 90 0 75 0
0 0 9 0 95 0
0 0 0 0 0 0
A 4x4 input convolved with a 3x3 filter to produce a 4x4 output using same padding
This is called “same” padding, since the input and output have the same dimensions.
Not using any padding, which is what we’ve been doing and will continue to do for this
post, is sometimes referred to as “valid” padding.
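Here’s a quick illustrative snippet, using numpy’s np.pad(), that shows what 1 pixel of zero padding does to our 4x4 example image (the image array is the same one from before):

import numpy as np

image = np.array([
  [ 0, 50,  0, 29],
  [ 0, 80, 31,  2],
  [33, 90,  0, 75],
  [ 0,  9,  0, 95],
])

# "Same" padding for a 3x3 filter: 1 pixel of zeros on every side, so the
# 4x4 input becomes 6x6 and convolving it yields a 4x4 output.
padded = np.pad(image, pad_width=1, mode='constant')
print(padded.shape) # (6, 6)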
For our MNIST CNN, we’ll use a small conv layer with 8 filters as the initial layer in our
network. This means it’ll turn the 28x28 input image into a 26x26x8 output volume:
28x28 image → conv → 26x26x8 volume
Reminder: The output is 26x26x8 and not 28x28x8 because we’re using valid padding,
which decreases the input’s width and height by 2.
Each of the 8 filters in the conv layer produces a 26x26 output, so stacked together they
make up a 26x26x8 volume. All of this happens because of 3 × 3 (filter size) × 8
(number of filters) = only 72 weights!
conv.py
import numpy as np

class Conv3x3:
  # A Convolution layer using 3x3 filters.

  def __init__(self, num_filters):
    self.num_filters = num_filters

    # filters is a 3d array with dimensions (num_filters, 3, 3).
    # We divide by 9 to reduce the variance of our initial values.
    self.filters = np.random.randn(num_filters, 3, 3) / 9
The Conv3x3 class takes only one argument: the number of filters. In the constructor,
we store the number of filters and initialize a random filters array using NumPy’s randn()
method.
Note: Dividing by 9 during the initialization is more important than you may think. If the
initial values are too large or too small, training the network will be ineffective. To learn
more, read about Xavier Initialization.
conv.py
class Conv3x3:
  # ...

  def iterate_regions(self, image):
    # Generates all valid 3x3 image regions using valid padding.
    # image is a 2d numpy array.
    h, w = image.shape

    for i in range(h - 2):
      for j in range(w - 2):
        im_region = image[i:(i + 3), j:(j + 3)]
        yield im_region, i, j

  def forward(self, input):
    # Performs a forward pass of the conv layer using the given input,
    # a 2d numpy array. Returns a 3d numpy array with dimensions
    # (h - 2, w - 2, num_filters).
    h, w = input.shape
    output = np.zeros((h - 2, w - 2, self.num_filters))

    for im_region, i, j in self.iterate_regions(input):
      output[i, j] = np.sum(im_region * self.filters, axis=(1, 2))

    return output
iterate_regions() is a helper generator method that yields all valid 3x3 image regions for
us. This will be useful for implementing the backwards portion of this class later on.
The line of code that actually performs the convolutions is output[i, j] = np.sum(im_region * self.filters, axis=(1, 2)). Let’s break it down:

- im_region * self.filters uses numpy broadcasting to element-wise multiply the 3x3 image region with each of the 3x3 filters, producing an array with dimensions (num_filters, 3, 3).
- We np.sum() the result of the previous step using axis=(1, 2), which produces a 1d array of length num_filters where each element contains the convolution result for the corresponding filter.
- We assign the result to output[i, j], which contains convolution results for pixel (i, j) in the output.
The sequence above is performed for each pixel in the output until we obtain our final
output volume! Let’s give our code a test run:
cnn.py
import mnist
from conv import Conv3x3

# The mnist package handles loading the MNIST dataset for us.
train_images = mnist.train_images()

conv = Conv3x3(8)
output = conv.forward(train_images[0])
print(output.shape) # (26, 26, 8)
Note: in our Conv3x3 implementation, we assume the input is a 2d numpy array for
simplicity, because that’s how our MNIST images are stored. This works for us because
we use it as the first layer in our network, but most CNNs have many more Conv layers.
If we were building a bigger network that needed to use Conv3x3 multiple times, we’d have to make the input a 3d numpy array.
4. Pooling
Neighboring pixels in images tend to have similar values, so conv layers will typically
also produce similar values for neighboring pixels in outputs. As a result, much of the
information contained in a conv layer’s output is redundant. For example, if we use
an edge-detecting filter and find a strong edge at a certain location, chances are that
we’ll also find relatively strong edges at locations 1 pixel shifted from the original one.
However, these are all the same edge! We’re not finding anything new.
Pooling layers solve this problem. All they do is reduce the size of the input they’re given by (you guessed it) pooling values together in the input. The pooling is usually done by a simple operation like max, min, or average. Here’s an example of a Max Pooling layer with a pool size of 2, applied to our earlier 4x4 image:
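 0  50   0  29
 0  80  31   2          80  31
33  90   0  75    →     90  95
 0   9   0  95

Each output pixel is the max of its non-overlapping 2x2 block, e.g. the top left output is max(0, 50, 0, 80) = 80.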
Pooling divides the input’s width and height by the pool size. For our MNIST CNN,
we’ll place a Max Pooling layer with a pool size of 2 right after our initial conv layer. The
pooling layer will transform a 26x26x8 input into a 13x13x8 output:
28x28 image → conv → 26x26x8 → maxpool → 13x13x8
maxpool.py
import numpy as np

class MaxPool2:
  # A Max Pooling layer using a pool size of 2.

  def iterate_regions(self, image):
    # Generates non-overlapping 2x2 image regions to pool over.
    # image is a 3d numpy array with dimensions (h, w, num_filters).
    h, w, _ = image.shape
    new_h = h // 2
    new_w = w // 2

    for i in range(new_h):
      for j in range(new_w):
        im_region = image[(i * 2):(i * 2 + 2), (j * 2):(j * 2 + 2)]
        yield im_region, i, j

  def forward(self, input):
    # Performs a forward pass of the maxpool layer using the given input.
    # Returns a 3d numpy array with dimensions (h / 2, w / 2, num_filters).
    h, w, num_filters = input.shape
    output = np.zeros((h // 2, w // 2, num_filters))

    for im_region, i, j in self.iterate_regions(input):
      output[i, j] = np.amax(im_region, axis=(0, 1))

    return output
This class works similarly to the Conv3x3 class we implemented previously. The critical line is again the one that fills in output[i, j]: to find the max from a given image region, we use np.amax(), numpy’s array max method. We set axis=(0, 1) because we only want to maximize over the first two dimensions, height and width, and not the third, num_filters.
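A quick illustrative check of what that axis argument does (the region array here is made up):

import numpy as np

region = np.random.randn(2, 2, 8)          # a 2x2 region with 8 filter channels
print(np.amax(region, axis=(0, 1)).shape)  # (8,) - one max per filter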
cnn.py
import mnist
from conv import Conv3x3
from maxpool import MaxPool2

train_images = mnist.train_images()

conv = Conv3x3(8)
pool = MaxPool2()

output = conv.forward(train_images[0])
output = pool.forward(output)
print(output.shape) # (13, 13, 8)
5. Softmax
To complete our CNN, we need to give it the ability to actually make predictions. We’ll
do that by using the standard final layer for a multiclass classification problem: the
Softmax layer, a standard fully-connected (dense) layer that uses the softmax activation
function.
Reminder: fully-connected layers have every node connected to every output from the
previous layer. We used fully-connected layers in my intro to Neural Networks; revisit that post if you need a refresher.
Softmax turns arbitrary real values into probabilities. The math behind it is pretty
simple: given some numbers,

1. Raise e to the power of each of those numbers.
2. Sum up all the exponentials (powers of e). This result is the denominator.
3. Use each number’s exponential as its numerator.
4. Probability = Numerator / Denominator.

Written more formally, softmax performs the following transform on n numbers x_1, …, x_n:
s(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}
The outputs of the Softmax transform are always in the range [0, 1] and add up to 1.
Hence, they’re probabilities.
For example, say our network’s final outputs are −1, 0, 3, and 5. Then:

Denominator = e^{-1} + e^0 + e^3 + e^5 = 169.87
x  | e^x    | Probability (e^x / 169.87)
-1 | 0.368  | 0.002
 0 | 1      | 0.006
 3 | 20.09  | 0.118
 5 | 148.41 | 0.874
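You can verify this table with a few lines of numpy (the array x just holds the example outputs above):

import numpy as np

x = np.array([-1, 0, 3, 5])
exp = np.exp(x)           # raise e to the power of each number
print(np.sum(exp))        # 169.87... - the denominator
print(exp / np.sum(exp))  # [0.002... 0.005... 0.118... 0.873...]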
5.1 Usage
We’ll use a softmax layer with 10 nodes, one representing each digit, as the final
layer in our CNN. Each node in the layer will be connected to every input. After the
softmax transformation is applied, the digit represented by the node with the highest
probability will be the output of the CNN!
You might have just thought to yourself, why bother transforming the outputs into
probabilities? Won’t the highest output value always have the highest probability? If you
did, you’re absolutely right. We don’t actually need to use softmax to predict a digit
- we could just pick the digit with the highest output from the network!
What softmax really does is help us quantify how sure we are of our prediction,
which is useful when training and evaluating our CNN. More specifically, using softmax
lets us use cross-entropy loss, which takes into account how sure we are of each
prediction. Here’s how we calculate cross-entropy loss:
L = −ln(p_c)

where c is the correct class (in our case, the correct digit), p_c is the predicted probability for class c, and ln is the natural log. As always, a lower loss is better. For example, in the best case, we’d have

p_c = 1, L = −ln(1) = 0
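For a sense of scale: a less confident prediction of p_c = 0.5 gives L = −ln(0.5) ≈ 0.69, and p_c = 0.1 gives L = −ln(0.1) ≈ 2.3. The loss blows up as the network becomes less sure of the correct digit.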
We’ll be seeing cross-entropy loss again later on in this post, so keep it in mind!
You know the drill by now - let’s implement a Softmax layer class:
softmax.py
import numpy as np

class Softmax:
  # A standard fully-connected layer with softmax activation.

  def __init__(self, input_len, nodes):
    # We divide by input_len to reduce the variance of our initial values.
    self.weights = np.random.randn(input_len, nodes) / input_len
    self.biases = np.zeros(nodes)

  def forward(self, input):
    # Performs a forward pass of the softmax layer using the given input.
    # Returns a 1d numpy array containing the respective probability values.
    input = input.flatten()

    totals = np.dot(input, self.weights) + self.biases
    exp = np.exp(totals)
    return exp / np.sum(exp, axis=0)
We flatten() the input to make it easier to work with, since we no longer need its
shape.
np.dot() multiplies input and self.weights element-wise and then sums the results.
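To make the shapes concrete, here’s an illustrative snippet using random stand-in arrays with our MNIST CNN’s dimensions (a flattened 13x13x8 input and 10 output nodes):

import numpy as np

input = np.random.randn(13 * 13 * 8)        # stand-in for the flattened pool output
weights = np.random.randn(13 * 13 * 8, 10)  # one weight per (input, node) pair
print(np.dot(input, weights).shape)         # (10,) - one total per output node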
We’ve now completed the entire forward pass of our CNN! Putting it together:
cnn.py
import mnist
import numpy as np
from conv import Conv3x3
from maxpool import MaxPool2
from softmax import Softmax

test_images = mnist.test_images()
test_labels = mnist.test_labels()

conv = Conv3x3(8)                  # 28x28x1 -> 26x26x8
pool = MaxPool2()                  # 26x26x8 -> 13x13x8
softmax = Softmax(13 * 13 * 8, 10) # 13x13x8 -> 10

def forward(image, label):
  # Transform the image from [0, 255] to [-0.5, 0.5] to make it easier to
  # work with, then complete a forward pass of the CNN and calculate the
  # cross-entropy loss and accuracy.
  out = conv.forward((image / 255) - 0.5)
  out = pool.forward(out)
  out = softmax.forward(out)

  loss = -np.log(out[label])
  acc = 1 if np.argmax(out) == label else 0
  return out, loss, acc

loss = 0
num_correct = 0
for i, (im, label) in enumerate(zip(test_images, test_labels)):
  # Do a forward pass.
  _, l, acc = forward(im, label)
  loss += l
  num_correct += acc
Want to try or tinker with this code yourself? Run this CNN in your browser. It’s
also available on Github.
6. Conclusion
That’s the end of this introduction to CNNs! In this post, we

- Motivated why CNNs might be more useful for certain problems, like image classification.
- Introduced the MNIST handwritten digit classification problem as our running example.
- Learned about Conv layers, which convolve filters with images to produce more useful outputs.
- Talked about Pooling layers, which can help prune everything but the most useful features.
- Implemented a Softmax layer, which lets our CNN output probabilities and use cross-entropy loss.
There’s still much more that we haven’t covered yet, such as how to actually train a
CNN. Part 2 of this CNN series does a deep-dive on training a CNN, including
deriving gradients and implementing backprop.
If you’re eager to see a trained CNN in action: this example Keras CNN trained on
MNIST achieves 99.25% accuracy. CNNs are powerful!
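For reference, here’s a rough sketch of what such a Keras model could look like. It mirrors our conv → pool → softmax pipeline, but the architecture and hyperparameters are assumptions, not necessarily those of the linked example:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
  Conv2D(8, 3, input_shape=(28, 28, 1)),  # mirrors our Conv3x3(8)
  MaxPooling2D(pool_size=2),              # mirrors our MaxPool2
  Flatten(),
  Dense(10, activation='softmax'),        # mirrors our Softmax layer
])
model.compile('adam', loss='categorical_crossentropy', metrics=['accuracy'])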