Computer Vision
The goal of our convolutional neural networks will be to classify and detect images or specific objects within an image. We will use image data as our features and the class of each image as our label, or output.
We already know how neural networks work, so we can skip the basics and move right into explaining the following concepts.
Image Data
Convolutional Layer
Pooling Layer
CNN Architectures
The major differences we are about to see in these types of neural networks are the layers that
make them up.
Image Data
So far, we have dealt with pretty straightforward data that has 1 or 2 dimensions. Now we are about to deal with image data, which is usually made up of 3 dimensions. These 3 dimensions are as follows:
image height
image width
color channels
The only item in the list above you may not understand is color channels. The number of color channels represents the depth of an image and correlates to the colors used in it. For example, an image with three channels is likely made up of RGB (red, green, blue) pixels. So, for each pixel we have three numeric values in the range 0-255 that define its color. For an image of color depth 1 we would likely have a greyscale image with one value defining each pixel, again in the range 0-255.
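As a quick illustration of this layout, here is a sketch using NumPy; the array below is just a synthetic placeholder image, not real data.

import numpy as np

# a synthetic 64x64 RGB image: (height, width, color channels),
# where each value is an intensity in the range 0-255
image = np.zeros((64, 64, 3), dtype=np.uint8)
image[0, 0] = [255, 0, 0]  # make the top-left pixel pure red

print(image.shape)  # (64, 64, 3)
print(image[0, 0])  # [255   0   0]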
Keep this in mind as we discuss how our network works and the input/output of each layer.
Convolutional Layer
Each convolutional neural network is made up of one or many convolutional layers. These layers are different from the dense layers we have seen previously. Their goal is to find patterns within images that can be used to classify the image or parts of it. This may sound similar to what our densely connected neural network in the previous section was doing; well, that's because it is.
The fundamental difference between a dense layer and a convolutional layer is that dense layers detect patterns globally while convolutional layers detect patterns locally. When we have a densely connected layer, each node in that layer sees all the data from the previous layer. This means that this layer is looking at all the information and is only capable of analyzing the data in a global capacity. Our convolutional layer, however, will not be densely connected; this means it can detect local patterns using only part of the input data to that layer.
Let's have a look at how a densely connected layer would look at an image versus how a convolutional layer would.
This is our image; the goal of our network will be to determine whether this image is a cat or not.
Dense Layer: A dense layer will consider the ENTIRE image. It will look at all the pixels and use
that information to generate some output.
Convolutional Layer: The convolutional layer will look at specific parts of the image. In this example let's say it analyzes the highlighted parts below and detects patterns there.
Can you see why this might make these networks more useful?
Consider a dense neural network that has learned what an eye looks like from a sample of dog images.
Let's say it's determined that an image is likely to be a dog if an eye is present in the boxed-off locations of the image above. The problem is that the dense network has learned this pattern only at those fixed locations: if the eye appeared anywhere else in the image, the network would likely miss it. A convolutional layer, by contrast, slides its pattern detectors across the whole image, so it can find an eye wherever it appears.
Layer Parameters
A convolutional layer is defined by two key parameters.
Filters
A filter is an m x n pattern of pixels that we are looking for in an image. The number of filters in a convolutional layer represents how many patterns each layer is looking for and determines what the depth of our response map will be. If we are looking for 32 different patterns/filters, then our output feature map (aka the response map) will have a depth of 32. Each one of the 32 layers of depth will be a matrix of some size containing values indicating if the filter was present at that location or not.
Here's a great illustration from the book "Deep Learning with Python" by Francois Chollet (pg
124).
Sample Size
This isn't really the best term to describe this, but each convolutional layer is going to examine n x m blocks of pixels in each image. Typically, we'll consider 3x3 or 5x5 blocks. In the example above we use a 3x3 "sample size". This size will be the same as the size of our filter.
Our layers work by sliding these filters of n x m pixels over every possible position in our image and populating a new feature map/response map indicating whether the filter is present at each location.
Image from "Deep Learning with Python" by Francois Chollet (pg 126).
This means our response map will have a slightly smaller width and height than our original image. This is fine, but sometimes we want our response map to have the same dimensions. We can accomplish this by using something called padding.
Padding is simply the addition of the appropriate number of rows and/or columns to your input data so that the filter can be centered on every pixel.
Strides
In the previous sections we assumed that the filters would be slid continuously through the image such that they covered every possible position. This is common, but sometimes we introduce the idea of a stride to our convolutional layer. The stride size represents how many rows/columns the filter moves each time. Large strides are not used very frequently, so we'll move on. The quick shape check below shows how filters, padding and strides each affect the output.
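Here is a small sketch using Keras Conv2D layers; the random tensor simply stands in for a real 32x32 RGB image so we can inspect the output shapes.

import tensorflow as tf

x = tf.random.uniform((1, 32, 32, 3))  # a batch of one 32x32 RGB image

# 32 filters of size 3x3 with no padding: each spatial dimension shrinks by 2
print(tf.keras.layers.Conv2D(32, (3, 3))(x).shape)  # (1, 30, 30, 32)

# 'same' padding keeps the spatial dimensions intact
print(tf.keras.layers.Conv2D(32, (3, 3), padding='same')(x).shape)  # (1, 32, 32, 32)

# a stride of 2 moves the filter two pixels at a time, halving the output
print(tf.keras.layers.Conv2D(32, (3, 3), strides=2, padding='same')(x).shape)  # (1, 16, 16, 32)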
Pooling
You may recall that our convnets are made up of a stack of convolution and pooling layers.
The idea behind a pooling layer is to downsample our feature maps and reduce their dimensions. They work in a similar way to convolutional layers in that they extract windows from the feature map, but instead return a response map of the max, min or average values of each window, per channel.
Pooling is usually done using windows of size 2x2 and a stride of 2. This halves the width and height of the feature map, returning a response map that is 2x smaller.
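A quick sketch of that shape change, again using a random tensor as a stand-in for a depth-32 feature map:

import tensorflow as tf

features = tf.random.uniform((1, 32, 32, 32))  # a feature map with depth 32

# 2x2 max pooling with a stride of 2 halves the width and height
pooled = tf.keras.layers.MaxPooling2D((2, 2), strides=2)(features)
print(pooled.shape)  # (1, 16, 16, 32)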
Please refer to the video to learn how all of this happens at the lower level!
Creating a Convnet
Now it is time to create our first convnet! This example is for the purpose of getting familiar with CNN architectures; we will talk about how to improve its performance later.
This tutorial is based on the following guide from the TensorFlow documentation:
https://www.tensorflow.org/tutorials/images/cnn
Dataset
The problem we will consider here is classifying 10 different everyday objects. The dataset we will use is built into TensorFlow and called the CIFAR Image Dataset. It contains 60,000 32x32 color images, with 6,000 images of each class:
Airplane
Automobile
Bird
Cat
Deer
Dog
Frog
Horse
Ship
Truck
We'll load the dataset and have a look at some of the images below.
%tensorflow_version 2.x  # this line is not required unless you are in a notebook
import tensorflow as tf
from tensorflow.keras import datasets, layers, models
import matplotlib.pyplot as plt

# load and split the dataset, then scale pixel values to the 0-1 range
(train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()
train_images, test_images = train_images / 255.0, test_images / 255.0

class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']

IMG_INDEX = 7  # change this to look at other images
plt.imshow(train_images[IMG_INDEX], cmap=plt.cm.binary)
plt.xlabel(class_names[train_labels[IMG_INDEX][0]])
plt.show()
CNN Architecture
A common architecture for a CNN is a stack of Conv2D and MaxPooling2D layers followed by a few densely connected layers. The idea is that the stack of convolutional and max pooling layers extracts the features from the image. These features are then flattened and fed to densely connected layers that determine the class of an image based on the presence of features.
model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.summary()  # let's have a look at our model so far
Layer 1
The input shape of our data will be (32, 32, 3), and we will process 32 filters of size 3x3 over our input data. We will also apply the activation function relu to the output of each convolution operation.
Layer 2
This layer will perform the max pooling operation using 2x2 samples and a stride of 2.
Other Layers
The next set of layers do very similar things but take as input the feature map from the previous layer. They also increase the number of filters from 32 to 64. We can do this because our data shrinks in spatial dimensions as it passes through the layers, meaning we can afford (computationally) to add more depth.
After looking at the summary you should notice that the depth of our image increases but the spatial dimensions shrink drastically.
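For reference, the Output Shape column of the summary should progress roughly like this (computed from the layer definitions above; the exact formatting of model.summary() may differ):

# Conv2D, 3x3, valid padding   -> (None, 30, 30, 32)
# MaxPooling2D, 2x2            -> (None, 15, 15, 32)
# Conv2D, 3x3, valid padding   -> (None, 13, 13, 64)
# MaxPooling2D, 2x2            -> (None, 6, 6, 64)
# Conv2D, 3x3, valid padding   -> (None, 4, 4, 64)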
So far, we have just completed the convolutional base. Now we need to take these extracted
features and add a way to classify them. This is why we add the following layers to our model.
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10))
model.summary()
We can see that the flatten layer changes the shape of our data so that we can feed it to the 64-node dense layer, followed by the final output layer of 10 neurons (one for each class).
Training
Now we will compile and train the model using the recommended hyperparameters from TensorFlow.
model.compile(optimizer='adam',
loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
metrics=['accuracy'])
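The compile call is then followed by training and evaluation. Here is a sketch of those steps, following the linked TensorFlow tutorial; the epoch count of 10 is the tutorial's choice, not a fixed requirement.

# train on the training images, validating against the test set each epoch
history = model.fit(train_images, train_labels, epochs=10,
                    validation_data=(test_images, test_labels))

# evaluate the trained model on the test set
test_loss, test_acc = model.evaluate(test_images, test_labels, verbose=2)
print(test_acc)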
You should be getting an accuracy of about 70%. This isn't bad for a simple model like this, but
we'll dive into some better approaches for computer vision below.
Data Augmentation
To avoid overfitting and to create a larger dataset from a smaller one, we can use a technique called data augmentation. This simply means performing random transformations on our images so that our model can generalize better. These transformations can be things like compressions, rotations, stretches and even color changes.
Fortunately, Keras can help us do this. Have a look at the code below for an example of data augmentation.
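This is a minimal sketch using Keras' ImageDataGenerator, augmenting one of the CIFAR training images loaded earlier; the transformation parameters are illustrative, not prescriptive.

from tensorflow.keras.preprocessing import image
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# a generator that applies random transformations to each image it yields
datagen = ImageDataGenerator(rotation_range=40, width_shift_range=0.15,
                             height_shift_range=0.15, zoom_range=0.2,
                             horizontal_flip=True, fill_mode='nearest')

test_img = train_images[20]  # pick an arbitrary image to augment
img = image.img_to_array(test_img)
img = img.reshape((1,) + img.shape)  # the generator expects a batch dimension

i = 0
for batch in datagen.flow(img, save_prefix='test', save_format='jpeg'):
    plt.figure(i)
    plt.imshow(image.img_to_array(batch[0]))
    i += 1
    if i > 4:  # show a handful of augmented versions, then stop
        break
plt.show()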
Pretrained Models
You will have noticed that the model above takes a few minutes to train in the notebook and only gives an accuracy of ~70%. This is okay, but surely there is a way to improve on this.
In this section we will talk about using a pretrained CNN as part of our own custom network to improve the accuracy of our model. We know that CNNs alone (with no dense layers) don't do anything other than map the presence of features from our input. This means we can use a pretrained CNN, one trained on millions of images, as the start of our model. This will allow us to have a very good convolutional base before adding our own densely connected classifier at the end. In fact, by using this technique we can train a very good classifier for a relatively small dataset (< 10,000 images). This is because the convnet already has a very good idea of what features to look for in an image and can find them very effectively. So, if we can determine the presence of features, all the rest of the model needs to do is determine which combination of features makes up a specific image.
Fine Tuning
When we employ the technique described above, we will often want to tweak the final layers in our convolutional base to work better for our specific problem. This involves not touching or retraining the earlier layers in our convolutional base but only adjusting the final few. We do this because the first layers in our base are very good at extracting low-level features like lines and edges, things that are similar for any kind of image, whereas the later layers are better at picking up very specific features like shapes or even eyes. If we adjust only the final layers, then we can look for only the features relevant to our very specific problem. A sketch of this is shown below.
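Here is what fine tuning looks like in Keras, assuming a pretrained base_model like the one we load below; the cutoff index of 100 follows the linked TensorFlow tutorial and should be tuned per problem.

base_model.trainable = True

# freeze every layer before `fine_tune_at`; only the top of the base is retrained
fine_tune_at = 100
for layer in base_model.layers[:fine_tune_at]:
    layer.trainable = False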
This tutorial is based on the following guide from the TensorFlow documentation:
https://www.tensorflow.org/tutorials/images/transfer_learning
#Imports
import os
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import tensorflow_datasets as tfds
keras = tf.keras
Dataset
We will load the cats_vs_dogs dataset from the module tensorflow_datasets.
This dataset contains (image, label) pairs where images have different dimensions and 3 color
channels.
# split the data manually into 80% training, 10% testing, 10% validation
(raw_train, raw_validation, raw_test), metadata = tfds.load(
    'cats_vs_dogs',
    split=['train[:80%]', 'train[80%:90%]', 'train[90%:]'],
    with_info=True,
    as_supervised=True,
)
Data Preprocessing
Since the sizes of our images are all different, we need to convert them all to the same size. We
can create a function that will do that for us below.
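Here is a sketch of that function, following the linked transfer-learning tutorial, which resizes every image to 160x160 and rescales pixel values to the range -1 to 1:

IMG_SIZE = 160  # all images will be resized to 160x160

def format_example(image, label):
    """Returns a resized image, rescaled to the range -1 to 1."""
    image = tf.cast(image, tf.float32)
    image = (image / 127.5) - 1
    image = tf.image.resize(image, (IMG_SIZE, IMG_SIZE))
    return image, label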
Now we can apply this function to all our images using .map() .
train = raw_train.map(format_example)
validation = raw_validation.map(format_example)
test = raw_test.map(format_example)
BATCH_SIZE = 32
SHUFFLE_BUFFER_SIZE = 1000

train_batches = train.shuffle(SHUFFLE_BUFFER_SIZE).batch(BATCH_SIZE)
validation_batches = validation.batch(BATCH_SIZE)
test_batches = test.batch(BATCH_SIZE)
Now if we look at the shape of an original image vs the new image we will see it has been
changed.
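A quick sketch of that check, comparing one example from raw_train with its resized counterpart in train:

for img, label in raw_train.take(1):
    print("Original shape:", img.shape)

for img, label in train.take(1):
    print("New shape:", img.shape)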
We want to use a pretrained model, but only its convolutional base. So, when we load in the model, we'll specify that we don't want to load the top (classification) layer. We'll tell the model what input shape to expect and to use the predetermined weights from ImageNet, a large dataset of labeled images.
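The linked transfer-learning tutorial uses MobileNetV2 as its base model; loading it looks like this (IMG_SIZE as defined earlier):

IMG_SHAPE = (IMG_SIZE, IMG_SIZE, 3)

# create the base model from the pretrained MobileNetV2, without its classifier
base_model = tf.keras.applications.MobileNetV2(input_shape=IMG_SHAPE,
                                               include_top=False,
                                               weights='imagenet')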
base_model.summary()
At this point base_model will simply output a tensor of shape (32, 5, 5, 1280): a feature extraction from a batch of 32 of our (160, 160, 3) images. The 32 is the batch size, and the 1280 is the number of different filters/feature maps produced at this stage.

for image, _ in train_batches.take(1):
    pass  # grab one batch of images to pass through the base

feature_batch = base_model(image)
print(feature_batch.shape)  # (32, 5, 5, 1280)
Before we attach our own classifier, we freeze the base so that its pretrained weights are not modified during training.

base_model.trainable = False
base_model.summary()
Next we add a global average pooling layer, which condenses each of the 5x5 feature maps into a single value per filter.

global_average_layer = tf.keras.layers.GlobalAveragePooling2D()
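Applied to the feature batch from before, it collapses the 5x5 spatial dimensions:

feature_batch_average = global_average_layer(feature_batch)
print(feature_batch_average.shape)  # (32, 1280)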
Finally, we will add the prediction layer, which will be a single dense neuron. We can do this because we only have two classes to predict between.
prediction_layer = keras.layers.Dense(1)
model = tf.keras.Sequential([
base_model,
global_average_layer,
prediction_layer
])
model.summary()
base_learning_rate = 0.0001
model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=base_learning_rate),
loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
metrics=['accuracy'])
# We can evaluate the model right now to see how it does before training it on our new images
initial_epochs = 3
validation_steps = 20

loss0, accuracy0 = model.evaluate(validation_batches, steps=validation_steps)

# now we can train the model on our cats-vs-dogs images
history = model.fit(train_batches,
                    epochs=initial_epochs,
                    validation_data=validation_batches)

acc = history.history['accuracy']
print(acc)
model.save("dogs_vs_cats.h5") # we can save the model and reload it at anytime in the fut
new_model = tf.keras.models.load_model('dogs_vs_cats.h5')
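As a quick sanity check, we could evaluate the reloaded model on the test batches, for example:

# confirm the reloaded model performs the same as the trained one
loss, acc = new_model.evaluate(test_batches)
print('Restored model accuracy: {:.2f}%'.format(100 * acc))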
Object Detection
If you'd like to learn how you can perform object detection and recognition with TensorFlow, check out the guide below.
https://github.com/tensorflow/models/tree/master/research/object_detection
Sources
1. “Convolutional Neural Network (CNN): TensorFlow Core.” TensorFlow, www.tensorflow.org/tutorials/images/cnn.
2. “Transfer Learning with a Pretrained ConvNet: TensorFlow Core.” TensorFlow, www.tensorflow.org/tutorials/images/transfer_learning.
3. Chollet, François. Deep Learning with Python. Manning Publications Co., 2018.