Convolutional Neural Networks in Python Master Data Science and Machine Learning With Modern Deep Learning in Python, Theano,... (The LazyProgrammer)
Introduction
This is the 3rd part in my Data Science and Machine Learning series
on Deep Learning in Python. At this point, you already know a lot
about neural networks and deep learning, including not just the
basics like backpropagation, but how to improve it using modern
techniques like momentum and adaptive learning rates. You've
already written deep neural networks in Theano and TensorFlow,
and you know how to run code using the GPU.
This book is all about how to use deep learning for computer vision
using convolutional neural networks. These are the state of the art
when it comes to image classification and they beat vanilla deep
networks at tasks like MNIST.
In this book we are going to up the ante and look at the StreetView
House Number (SVHN) dataset - which uses larger color images at
various angles - so things are going to get tougher both
computationally and in terms of the difficulty of the classification task.
But we will show that convolutional neural networks, or CNNs, are
capable of handling the challenge!
All the materials used in this book are FREE. You can download and
install Python, Numpy, Scipy, Theano, and TensorFlow with pip or
easy_install.
y = softmax(relu(X.dot(W1)).dot(W2))
Predict
We know that for neural networks the predict function is also called the feedforward operation, and it is simply a dot product followed by a nonlinear function at each layer of the neural network.
e.g. z1 = s(w0x), z2 = s(w1z1), z3 = s(w2z2), y = s(w3z3)
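As a refresher, here is a minimal NumPy sketch of that feedforward calculation for a single hidden layer (the sizes and random weights are just for illustration):

import numpy as np

def relu(a):
    return a * (a > 0)

def softmax(a):
    expA = np.exp(a - a.max(axis=1, keepdims=True))  # subtract the max for numerical stability
    return expA / expA.sum(axis=1, keepdims=True)

X = np.random.randn(10, 784)    # 10 fake samples of a flattened 28 x 28 image
W1 = np.random.randn(784, 300)  # input-to-hidden weights
W2 = np.random.randn(300, 10)   # hidden-to-output weights

Z = relu(X.dot(W1))             # hidden layer activations
y = softmax(Z.dot(W2))          # output probabilities, one row per sample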
Train
We know that training a neural network is simply the application of
gradient descent, which is the same thing we use for logistic
regression and linear regression when we don’t have a closed-form
solution. We know that linear regression has a closed form solution
but we don’t necessarily have to use it, and that gradient descent is
a more general numerical optimization method.
W ← W - learning_rate * dJ/dW
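As a sketch, the training loop is just that update repeated (grad_J here is a hypothetical function returning dJ/dW; in practice Theano or TensorFlow computes it for us):

learning_rate = 0.0001
for i in xrange(100):
    W = W - learning_rate * grad_J(W, X, Y)  # one gradient descent step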
We know that libraries like Theano and TensorFlow will calculate the
gradient for us, which can get very complicated the more layers
there are. You'll be thankful for this feature when you see that the output function becomes even more complex once we incorporate convolution (although the derivation is still doable, and I would recommend trying it for practice).
At this point you should be familiar with how the cost function J is
derived from the likelihood and how we might not calculate J over
the entire training data set but rather in batches to improve training
time.
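In code, that batching is just a slicing loop, something like this sketch (batch_sz and the train function stand in for the real ones that appear later in the book):

N = len(Xtrain)
n_batches = N / batch_sz
for j in xrange(n_batches):
    Xbatch = Xtrain[j*batch_sz:(j+1)*batch_sz]
    Ybatch = Ytrain[j*batch_sz:(j+1)*batch_sz]
    train(Xbatch, Ybatch)  # one gradient descent step using only this batch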
Data Preprocessing
When we work with images you know that an image is really a 2-D
array of data, and that if we have a color image we have a 3-D array
of data where one extra dimension is for the red, green, and blue
channels.
In the past, we've flattened this array into a vector, which is the usual input into a neural network, so for example a 28 x 28 image becomes a 784-dimensional vector, and a 3 x 32 x 32 image becomes a 3072-dimensional vector.
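In NumPy that flattening is just a reshape, for example:

import numpy as np

img = np.zeros((28, 28))        # a grayscale image
x = img.reshape(784)            # 28 * 28 = 784
cimg = np.zeros((3, 32, 32))    # a color image: (channels, height, width)
xc = cimg.reshape(3 * 32 * 32)  # a 3072-dimensional vector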
This book will use the MNIST dataset (handwritten digits) and the
streetview house number (SVHN) dataset.
https://github.com/lazyprogrammer/machine_learning_examples
If you’ve already checked out this repo then simply do a “git pull”
since this code will be on the master branch.
I would highly recommend NOT just running this code but using it as
a backup if yours doesn’t work, and try to follow along with the code
examples by typing them out yourself to build muscle memory.
Note that these are MATLAB binary data files, so we’ll need to use
the Scipy library to load them, which I’m sure you have heard of if
you’re familiar with the Numpy stack.
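A minimal sketch of loading one of the SVHN files (the path is just wherever you saved the file):

from scipy.io import loadmat

train = loadmat('../large_files/train_32x32.mat')  # returns a dict of the MATLAB variables
print train.keys()  # the keys we care about are 'X' (the images) and 'y' (the labels)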
Chapter 2: Convolution
So what is convolution?
Think of your favorite audio effect (suppose that’s the “echo”). An
echo is simply the same sound bouncing back at you in the future,
but with less volume. We’ll see how we can do that mathematically
later.
All effects can be thought of as filters, like the one I’ve shown here,
and they are often drawn in block diagrams. In machine learning and
statistics these are sometimes called kernels.
[Figure: a filter drawn as a block diagram]
I’m representing our audio signal by this triangle. Remember that we
want to do 2 things, we want to hear this audio signal in the future,
which is basically a shift in to the right, and this audio signal should
be lower in amplitude than the original.
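As a sketch, here is what an echo looks like as a filter in NumPy (the delay and volume values are arbitrary, just for illustration):

import numpy as np

x = np.random.randn(100)  # pretend this is the audio signal
delay = 30                # the echo arrives 30 samples later...
volume = 0.5              # ...at half the amplitude
w = np.zeros(delay + 1)
w[0] = 1.0                # the original sound
w[delay] = volume         # the delayed, quieter copy
y = np.convolve(x, w)     # y = x plus its echo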
You can think of it as we are “sliding” the filter across the signal, by
changing the value of m.
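For reference, the formula being referred to here is the 2-D convolution, written out in plain text:

y(n, m) = sum over i and j of w(i, j) * x(n - i, m - j)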
You can see from this formula that the 2-D case just does a convolution independently in each direction. I've got some pseudocode here to demonstrate how you might write this in code, but notice there's a problem: whenever i > n or j > m, the index n - i or m - j goes negative, so we run off the edge of the signal (and in NumPy a negative index silently wraps around to the other end of the array).
y = np.zeros(x.shape)
for n in xrange(x.shape[0]):
    for m in xrange(x.shape[1]):
        for i in xrange(w.shape[0]):
            for j in xrange(w.shape[1]):
                y[n,m] += w[i,j]*x[n-i,m-j]  # problem: n-i or m-j can go negative here
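One simple fix (a sketch, not necessarily how a library implements it) is to skip the terms that would reach outside the signal:

y = np.zeros(x.shape)
for n in xrange(x.shape[0]):
    for m in xrange(x.shape[1]):
        for i in xrange(w.shape[0]):
            for j in xrange(w.shape[1]):
                if n - i >= 0 and m - j >= 0:  # only use samples that actually exist
                    y[n,m] += w[i,j]*x[n-i,m-j]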
If you just want to see the code that's already been written, check out the file https://github.com/lazyprogrammer/machine_learning_examples/blob/master/cnn_class/blur.py from Github.
The idea is the same as we did with the sound echo. We’re going to
take a signal and spread it out.
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from scipy.signal import convolve2d

img = mpimg.imread('lena.png')
plt.imshow(img)
plt.show()

# make it B&W
bw = img.mean(axis=2)
plt.imshow(bw, cmap='gray')
plt.show()

# create a Gaussian-shaped bump centered in the filter (one reasonable choice of blur filter)
W = np.zeros((20, 20))
for i in xrange(20):
    for j in xrange(20):
        dist = (i - 9.5)**2 + (j - 9.5)**2
        W[i, j] = np.exp(-dist / 50.)
plt.imshow(W, cmap='gray')
plt.show()

# now the convolution
out = convolve2d(bw, W)
plt.imshow(out, cmap='gray')
plt.show()

# what's that weird black stuff on the edges? let's check the size of the output
print out.shape
# after convolution, the output length in each dimension is N1 + N2 - 1

# we can also just make the output the same size as the input
out = convolve2d(bw, W, mode='same')
plt.imshow(out, cmap='gray')
plt.show()
print out.shape
Edge Detection
Hx = np.array([
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
], dtype=np.float32)
Hy = np.array([
    [-1, -2, -1],
    [0, 0, 0],
    [1, 2, 1],
], dtype=np.float32)
Now let’s do convolutions on these. So Gx is the convolution
between the image and Hx. Gy is the convolution between the image
and Hy.
You can think of Gx and Gy as the two components of a vector (the image gradient at each pixel), so we can calculate its magnitude and direction: the magnitude is G = sqrt(Gx^2 + Gy^2), and the direction is arctan(Gy / Gx).
We can see that after applying both operators what we get out is all
the edges detected.
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from scipy.signal import convolve2d

img = mpimg.imread('lena.png')

# make it B&W
bw = img.mean(axis=2)

# Sobel operator - approximate gradient in X dir
Hx = np.array([
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
], dtype=np.float32)

# Sobel operator - approximate gradient in Y dir
Hy = np.array([
    [-1, -2, -1],
    [0, 0, 0],
    [1, 2, 1],
], dtype=np.float32)
Gx = convolve2d(bw, Hx)
plt.imshow(Gx, cmap='gray')
plt.show()
Gy = convolve2d(bw, Hy)
plt.imshow(Gy, cmap='gray')
plt.show()
# Gradient magnitude
G = np.sqrt(Gx*Gx + Gy*Gy)
plt.imshow(G, cmap='gray')
plt.show()
The Takeaway
So what is the takeaway from all these examples of convolution?
Now you know that there are SOME filters that help us detect features - so perhaps it would be possible to just do the convolution inside the neural network and use gradient descent to find the best filters.
Chapter 3: The Convolutional Neural Network
All of the networks we’ve seen so far have one thing in common: all
the nodes in one layer are connected to all the nodes in the next
layer. This is the “standard” feedforward neural network. With
convolutional neural networks you will see how that changes.
Question to think about: How can we ensure the neural network has
“rotational invariance?” What other kinds of invariances can you
think of?
Downsampling
For images, we just want to know whether, after the convolution, a feature was present in a certain area of the image. We can do that by downsampling the image - in other words, reducing its resolution.
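For example, a 2 x 2 max-pooling keeps only the largest value in each non-overlapping 2 x 2 block, halving the resolution. A minimal NumPy sketch (assuming the image dimensions are even):

import numpy as np

def maxpool_2x2(img):
    H, W = img.shape
    # group the pixels into non-overlapping 2x2 blocks, then take the max of each block
    return img.reshape(H/2, 2, W/2, 2).max(axis=(1, 3))

img = np.random.randn(28, 28)
small = maxpool_2x2(img)  # shape is now (14, 14)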
The simplest convolutional net is just the kind I showed you in the
introduction to this book. It does not even need to incorporate
downsampling.
Z = conv(X, W1)
Y = softmax(Z.dot(W2))
So in the first layer, you take the image, and keep all the colors and
the original shape, meaning you don’t flatten it. (i.e. it remains (3 x W
x H))
Then you perform convolution on it.
Finally, you flatten these features into a vector and you put it into a
regular, fully connected neural network like the ones we’ve been
talking about.
Schematically, the basic pattern is:
convolution → pooling → convolution → pooling → ... → flatten → fully connected layer(s) → softmax output
Note that you can have arbitrarily many convolution + pool layers,
and more fully connected layers.
Some networks have only convolution. The design is up to you.
Technicalities
4-D tensor inputs: The dimension of the inputs is a 4-D tensor, and
it’s pretty easy to see why. The image already takes up 3
dimensions, since we have height, width, and color. The 4th
dimension is just the number of samples (i.e. for batch training).
4-D tensor filters / kernels: You might be surprised to learn that the kernels are ALSO 4-D tensors. Now why is this? Well, in the LeNet model you have multiple kernels per image, and a separate set of kernel weights for each color channel (or, in later layers, for each input feature map). The output of the convolution layer is called a feature map, and the number of feature maps equals the number of kernels. So basically you can think of it as: each kernel extracts a different feature and places it onto its own feature map.
Example:
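A sketch using the Theano ordering that appears later in the code (the exact output height and width depend on how the edges are handled):

X.shape  = (N, 3, 32, 32)  # (num_samples, num_color_channels, img_height, img_width)
W1.shape = (20, 3, 5, 5)   # (num_feature_maps, num_color_channels, filter_height, filter_width)
# convolving X with W1 gives a 4-D output of shape (N, 20, new_height, new_width)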
Another thing to note is that the shapes of our filters are usually
MUCH smaller than the image itself. What this means is that the
same tiny filter gets applied across the entire image. This is the idea
of weight sharing.
Training a CNN
You too, can be a deep learning researcher. Just try different things.
Be creative. Use backprop. Easy, right?
Remember, in Theano, it’s just:
pooled_out = downsample.max_pool_2d(
    input=conv_out,
    ds=poolsize,
    ignore_border=True
)
The last step, where we call dimshuffle() on the bias, sets up broadcasting: b is a 1-D vector, but the output of the conv-pool operation is a 4-D tensor, so b has to be reshaped so that it gets added along the feature-map dimension. You'll see that TensorFlow has a function that encapsulates this for us.
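Concretely, the broadcasting step looks like this (pooled_out is the 4-D output of the conv-pool above):

# b has shape (num_feature_maps,); dimshuffle('x', 0, 'x', 'x') makes it (1, num_feature_maps, 1, 1)
# so the addition broadcasts over the batch, height, and width dimensions
out = pooled_out + b.dimshuffle('x', 0, 'x', 'x')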
The next component is the rearranging of the input. Remember that
MATLAB does things a bit weirdly and puts the index to each sample
in the LAST dimension, but Theano expects it to be in the FIRST
dimension. It also happens to expect the color dimension to come
next. So that is what this code here is doing.
def rearrange(X):
    # input is (32, 32, 3, N) from the MATLAB file; Theano wants (N, 3, 32, 32)
    N = X.shape[-1]
    out = np.zeros((N, 3, 32, 32), dtype=np.float32)
    for i in xrange(N):
        for j in xrange(3):
            out[i, j, :, :] = X[:, :, j, i]
    return out / 255
Also, as you know with neural networks we like our data to stay in a
small range, so we divide by the maximum value at the end which is
255.
It's also good to keep track of the size of each matrix as each operation is done. You'll see that Theano and TensorFlow, by default, treat the edges of the convolution output a little differently, and the order of the dimensions is also different.
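As a sketch of that bookkeeping for the Theano version of the SVHN network (with 'valid' convolutions, each convolution shrinks the image by filter_size - 1):

# input image:                32 x 32
# after the first 5x5 conv:   32 - 5 + 1 = 28
# after the first 2x2 pool:   28 / 2 = 14
# after the second 5x5 conv:  14 - 5 + 1 = 10
# after the second 2x2 pool:  10 / 2 = 5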
W1_shape = (20, 3, 5, 5)   # (num_feature_maps, num_color_channels, filter_height, filter_width)
W1_init = np.random.randn(*W1_shape)
b1_init = np.zeros(W1_shape[0])

W2_shape = (50, 20, 5, 5)  # (num_feature_maps, old_num_feature_maps, filter_height, filter_width)
W2_init = np.random.randn(*W2_shape)
b2_init = np.zeros(W2_shape[0])

W3_init = np.random.randn(W2_shape[0]*5*5, M)
b3_init = np.zeros(M)
W4_init = np.random.randn(M, K)
b4_init = np.zeros(K)
Note that the bias is the same size as the number of feature maps.
Also note that this filter is a 4-D tensor, which is different from the
filters we were working with previously, which were 1-D and 2-D
filters.
Since the image at that point is 5x5 and there are 50 feature maps, the new flattened dimension will be 50x5x5.
Now that we have all the initial weights and operations we need, we
can compute the output of the neural network. So we do the
convpool twice, and then notice this flatten() operation before I do
the dot product. That’s because Z2, after convpooling, will still be an
image.
# forward pass
Z1 = convpool(X, W1, b1)
Z2 = convpool(Z1, W2, b2)
Z3 = relu(Z2.flatten(ndim=2).dot(W3) + b3)
pY = T.nnet.softmax(Z3.dot(W4) + b4)
But if you call flatten() by itself it'll turn the tensor into a 1-D array, which we don't want. Luckily Theano provides a parameter that controls how much to flatten: ndim=2 means the result keeps its first dimension and collapses all the remaining dimensions into the second, so each sample stays a separate row.
import numpy as np
import theano
import theano.tensor as T
import matplotlib.pyplot as plt

from datetime import datetime
from scipy.io import loadmat
from theano.tensor.nnet import conv2d
from theano.tensor.signal import downsample
def error_rate(p, t):
    return np.mean(p != t)

def relu(a):
    return a * (a > 0)
def y2indicator(y):
    N = len(y)
    ind = np.zeros((N, 10))
    for i in xrange(N):
        ind[i, y[i]] = 1
    return ind
def convpool(X, W, b, poolsize=(2, 2)):
    conv_out = conv2d(input=X, filters=W)
    pooled_out = downsample.max_pool_2d(
        input=conv_out,
        ds=poolsize,
        ignore_border=True
    )
    # add the bias, broadcasting it across the batch, height, and width dimensions
    return relu(pooled_out + b.dimshuffle('x', 0, 'x', 'x'))
def init_filter(shape, poolsz):
    w = np.random.randn(*shape) / np.sqrt(np.prod(shape[1:]) + shape[0]*np.prod(shape[2:]) / np.prod(poolsz))
    return w.astype(np.float32)
def rearrange(X):
    N = X.shape[-1]
    out = np.zeros((N, 3, 32, 32), dtype=np.float32)
    for i in xrange(N):
        for j in xrange(3):
            out[i, j, :, :] = X[:, :, j, i]
    return out / 255
def main():
train = loadmat('../large_files/train_32x32.mat')
test = loadmat('../large_files/test_32x32.mat')
Xtrain = rearrange(train['X'])
Ytrain = train['y'].flatten() - 1
del train
Ytrain_ind = y2indicator(Ytrain)
Xtest = rearrange(test['X'])
Ytest = test['y'].flatten() - 1
del test
Ytest_ind = y2indicator(Ytest)
max_iter = 8
print_period = 10
lr = np.float32(0.00001)
reg = np.float32(0.01)
mu = np.float32(0.99)
N = Xtrain.shape[0]
batch_sz = 500
n_batches = N / batch_sz
M = 500
K = 10
poolsz = (2, 2)
# after the first conv the image is 32 - 5 + 1 = 28
# after downsample 28 / 2 = 14
# after the second conv the image is 14 - 5 + 1 = 10
# after downsample 10 / 2 = 5
W3_init = np.random.randn(W2_shape[0]*5*5, M) / np.sqrt(W2_shape[0]*5*5 + M)
X = T.tensor4('X', dtype='float32')
Y = T.matrix('T')
W1 = theano.shared(W1_init, 'W1')
b1 = theano.shared(b1_init, 'b1')
W2 = theano.shared(W2_init, 'W2')
b2 = theano.shared(b2_init, 'b2')
W3 = theano.shared(W3_init.astype(np.float32), 'W3')
b3 = theano.shared(b3_init, 'b3')
W4 = theano.shared(W4_init.astype(np.float32), 'W4')
b4 = theano.shared(b4_init, 'b4')
# momentum changes
dW1 = theano.shared(np.zeros(W1_init.shape, dtype=np.float32), 'dW1')
# forward pass
Z1 = convpool(X, W1, b1)
Z2 = convpool(Z1, W2, b2)
Z3 = relu(Z2.flatten(ndim=2).dot(W3) + b3)
train = theano.function(
inputs=[X, Y],
updates=[
(W1, update_W1),
(b1, update_b1),
(W2, update_W2),
(b2, update_b2),
(W3, update_W3),
(b3, update_b3),
(W4, update_W4),
(b4, update_b4),
(dW1, update_dW1),
(db1, update_db1),
(dW2, update_dW2),
(db2, update_db2),
(dW3, update_dW3),
(db3, update_db3),
(dW4, update_dW4),
(db4, update_db4),
],
)
# create another function for this because we want it over the whole dataset
get_prediction = theano.function(
inputs=[X, Y],
outputs=[cost, prediction],
)
t0 = datetime.now()
LL = []
for i in xrange(max_iter):
    for j in xrange(n_batches):
        Xbatch = Xtrain[j*batch_sz:(j*batch_sz + batch_sz),]
        Ybatch = Ytrain_ind[j*batch_sz:(j*batch_sz + batch_sz),]
        train(Xbatch, Ybatch)
        if j % print_period == 0:
            cost_val, prediction_val = get_prediction(Xtest, Ytest_ind)
            LL.append(cost_val)
plt.plot(LL)
plt.show()
if __name__ == '__main__':
    main()
Chapter 5: Sample Code in TensorFlow
https://github.com/lazyprogrammer/machine_learning_examples/blob/master/cnn_class/cnn_tf.py
conv_out = tf.nn.conv2d(X, W, strides=[1, 1, 1, 1], padding='SAME')
conv_out = tf.nn.bias_add(conv_out, b)
pool_out = tf.nn.max_pool(conv_out, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
return pool_out
In the past we just assumed that we had to drag the filter along every
point of the signal, but in fact we can move with any size step we
want, and that’s what stride is. We’re also going to use the padding
parameter to control the size of the output.
Remember that the bias is a 1-D vector, and we used the dimshuffle() function in Theano to add it to the convolution output. Here we can just use a built-in TensorFlow function called bias_add().
Next we call the max_pool() function. Notice that the ksize parameter is kind of like the poolsize parameter we had with Theano, but it's now 4-D instead of 2-D - we just add ones at the ends. Notice that this function ALSO takes a strides parameter, meaning the pooling windows could overlap, but we'll use non-overlapping sub-images like we did previously.
def rearrange(X):
    # input is (32, 32, 3, N); TensorFlow wants (N, 32, 32, 3)
    N = X.shape[-1]
    out = np.zeros((N, 32, 32, 3), dtype=np.float32)
    for i in xrange(N):
        for j in xrange(3):
            out[i, :, :, j] = X[:, :, j, i]
    return out / 255
This is great and allows for a lot of flexibility, but I hit a snag during
development, which is my RAM started swapping when I did this. If
you haven’t noticed yet the size of the SVHN data is pretty big, about
73k samples.
So one way around this is to make the shapes constant, which you’ll
see later. That means we’ll always have to pass in batch_sz number
of samples each time, which means the total number of samples we
use has to be a multiple of it. In the code I used exact numbers but
you can also calculate it using the data.
W3_init = np.random.randn(W2_shape[-1]*8*8, M) / np.sqrt(W2_shape[-1]*8*8 + M)
b3_init = np.zeros(M, dtype=np.float32)
For the vanilla ANN portion, also notice that the outputs of the conv-pool layers are now a different size: 8 instead of 5. With 'SAME' padding the convolutions don't shrink the image, so only the two 2x2 poolings do: 32 / 2 = 16, then 16 / 2 = 8 (whereas with Theano's 'valid' convolutions we ended up at 5).
For the forward pass, the first 2 parts are the same as Theano.
One thing that’s different is TensorFlow objects don’t have a flatten
method, so we have to use reshape.
Z2_shape = Z2.get_shape().as_list()
Z2r = tf.reshape(Z2, [Z2_shape[0], np.prod(Z2_shape[1:])])
The last step is to calculate the output just before the softmax.
Remember that with TensorFlow the cost function requires the logits
without softmaxing, so we won’t do the softmax at this point.
The full code is as follows:
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

from datetime import datetime
from scipy.io import loadmat
def y2indicator(y):
    N = len(y)
    ind = np.zeros((N, 10))
    for i in xrange(N):
        ind[i, y[i]] = 1
    return ind

def error_rate(p, t):
    return np.mean(p != t)
def convpool(X, W, b):
    conv_out = tf.nn.conv2d(X, W, strides=[1, 1, 1, 1], padding='SAME')
    conv_out = tf.nn.bias_add(conv_out, b)
    pool_out = tf.nn.max_pool(conv_out, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
    return pool_out

def init_filter(shape, poolsz):
    w = np.random.randn(*shape) / np.sqrt(np.prod(shape[:-1]) + shape[-1]*np.prod(shape[:-2]) / np.prod(poolsz))
    return w.astype(np.float32)
def rearrange(X):
    # input is (32, 32, 3, N); TensorFlow wants (N, 32, 32, 3)
    N = X.shape[-1]
    out = np.zeros((N, 32, 32, 3), dtype=np.float32)
    for i in xrange(N):
        for j in xrange(3):
            out[i, :, :, j] = X[:, :, j, i]
    return out / 255
def main():
train = loadmat('../large_files/train_32x32.mat')
test = loadmat('../large_files/test_32x32.mat')
Xtrain = rearrange(train['X'])
Ytrain = train['y'].flatten() - 1
print len(Ytrain)
del train
Ytrain_ind = y2indicator(Ytrain)
Xtest = rearrange(test['X'])
Ytest = test['y'].flatten() - 1
del test
Ytest_ind = y2indicator(Ytest)
max_iter = 20
print_period = 10
N = Xtrain.shape[0]
batch_sz = 500
n_batches = N / batch_sz
Xtrain = Xtrain[:73000,]
Ytrain = Ytrain[:73000]
Xtest = Xtest[:26000,]
Ytest = Ytest[:26000]
Ytest_ind = Ytest_ind[:26000,]
# initialize weights
M = 500
K = 10
poolsz = (2, 2)
W1_shape = (5, 5, 3, 20) # (filter_width, filter_height, num_color_channels, num_feature_maps)
W2_shape = (5, 5, 20, 50) # (filter_width, filter_height, old_num_feature_maps, num_feature_maps)
W3_init = np.random.randn(W2_shape[-1]*8*8, M) / np.sqrt(W2_shape[-1]*8*8 + M)
# using None as the first shape element takes up too much RAM unfortunately
W1 = tf.Variable(W1_init.astype(np.float32))
b1 = tf.Variable(b1_init.astype(np.float32))
W2 = tf.Variable(W2_init.astype(np.float32))
b2 = tf.Variable(b2_init.astype(np.float32))
W3 = tf.Variable(W3_init.astype(np.float32))
b3 = tf.Variable(b3_init.astype(np.float32))
W4 = tf.Variable(W4_init.astype(np.float32))
b4 = tf.Variable(b4_init.astype(np.float32))
Z2_shape = Z2.get_shape().as_list()
cost = tf.reduce_sum(tf.nn.softmax_cross_entropy_with_logits(Yish, T))
train_op = tf.train.RMSPropOptimizer(0.0001, decay=0.99, momentum=0.9).minimize(cost)
predict_op = tf.argmax(Yish, 1)
t0 = datetime.now()
LL = []
init = tf.initialize_all_variables()
session.run(init)
for i in xrange(max_iter):
for j in xrange(n_batches):
if len(Xbatch) == batch_sz:
if j % print_period == 0:
prediction = np.zeros(len(Xtest))
print "Cost / err at iteration i=%d, j=%d: %.3f / %.3f" % (i, j, test_cost,
err)
LL.append(test_cost)
plt.plot(LL)
plt.show()
if __name__ == '__main__':
    main()
Conclusion
I really hope you had as much fun reading this book as I did making
it.
A lot of the material in this book is covered in this course, but you get
to see me derive the formulas and write the code live:
https://kdp.amazon.com/amazon-dp-action/us/bookshelf.marketplacelink/B01CVJ19E8
Are you comfortable with this material, and do you want to take your deep learning skill set to the next level? Then my follow-up Udemy course on deep learning is for you. Similar to the previous book, I take
you through the basics of Theano and TensorFlow - creating
functions, variables, and expressions, and build up neural networks
from scratch. I teach you about ways to accelerate the learning
process, including batch gradient descent, momentum, and adaptive
learning rates. I also show you live how to create a GPU instance on
Amazon AWS EC2, and prove to you that training a neural network
with GPU optimization can be orders of magnitude faster than on
your CPU.
https://www.udemy.com/data-science-deep-learning-in-theano-tensorflow
In part 4 of my deep learning series, I take you through unsupervised
deep learning methods. We study principal components analysis
(PCA), t-SNE (jointly developed by the godfather of deep learning,
Geoffrey Hinton), deep autoencoders, and restricted Boltzmann
machines (RBMs). I demonstrate how unsupervised pretraining on a
deep network with autoencoders and RBMs can improve supervised
learning performance.
https://www.udemy.com/unsupervised-deep-learning-in-python
Would you like an introduction to the basic building block of neural
networks - logistic regression? In this course I teach the theory of
logistic regression (our computational model of the neuron), and give
you an in-depth look at binary classification, manually creating
features, and gradient descent. You might want to check this course
out if you found the material in this book too challenging.
https://udemy.com/data-science-logistic-regression-in-python
The corresponding book for Deep Learning Prerequisites is:
https://kdp.amazon.com/amazon-dp-action/us/bookshelf.marketplacelink/B01D7GDRQ2
SQL for Marketers: Dominate data analytics, data science, and big
data
https://www.udemy.com/sql-for-marketers-data-analytics-data-science-big-data
Finally, I am always giving out coupons and letting you know when
you can get my stuff for free. But you can only do this if you are a
current student of mine! Here are some ways I notify my students
about coupons and free giveaways: