
Deep Computer Vision


In this guide we will learn how to perform image classification and object detection/recognition
using deep computer vision with something called a convolutional neural network.

The goal of our convolutional neural networks will be to classify images or detect specific
objects within them. We will be using the image data as our features and the class of each
image as our label or output.

We already know how neural networks work so we can skip through the basics and move right
into explaining the following concepts.

Image Data
Convolutional Layer
Pooling Layer
CNN Architectures

The major differences we are about to see in these types of neural networks are the layers that
make them up.

Image Data
So far, we have dealt with pretty straightforward data that has 1 or 2 dimensions. Now we are
about to deal with image data, which is usually made up of 3 dimensions. These 3 dimensions are
as follows:

image height
image width
color channels

The only item in the list above you may not understand is color channels. The number of color
channels represents the depth of an image and correlates to the colors used in it. For example,
an image with three channels is likely made up of RGB (red, green, blue) pixels. So, for each pixel
we have three numeric values in the range 0-255 that define its color. For an image of color
depth 1 we would likely have a greyscale image with one value defining each pixel, again in the
range of 0-255.
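
As a quick sketch (using NumPy and an arbitrary 32x32 image, purely for illustration), here is what these dimensions look like in code:

import numpy as np

# A 32x32 RGB image: (height, width, color channels)
rgb_image = np.zeros((32, 32, 3), dtype=np.uint8)
rgb_image[0, 0] = [255, 0, 0]  # the top-left pixel is now pure red

# A 32x32 greyscale image has a color depth of 1
grey_image = np.zeros((32, 32, 1), dtype=np.uint8)

print(rgb_image.shape)   # (32, 32, 3)
print(grey_image.shape)  # (32, 32, 1)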

https://colab.research.google.com/drive/1ZZXnCjFEOkp_KdNcNabd14yok0BAIuwS#forceEdit=true&sandboxMode=true&scrollTo=aPqeddhcPw… 1/17
8/4/2020 Computer Vision.ipynb - Colaboratory

Keep this in mind as we discuss how our network works and the input/output of each layer.

Convolutional Neural Network


Note: I will use the terms convnet and convolutional neural network interchangeably.

Each convolutional neural network is made up of one or more convolutional layers. These layers
are different from the dense layers we have seen previously. Their goal is to find patterns within
images that can be used to classify the image or parts of it. This may sound similar to what our
densely connected neural network in the previous section was doing; that's because it is.

The fundamental difference between a dense layer and a convolutional layer is that dense layers
detect patterns globally while convolutional layers detect patterns locally. When we have a
densely connected layer, each node in that layer sees all the data from the previous layer. This
means that the layer is looking at all the information and is only capable of analyzing the data in
a global capacity. Our convolutional layer, however, will not be densely connected; this means it
can detect local patterns using part of the input data to that layer.
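
This difference shows up directly in the number of weights each layer needs. As a rough sketch (shapes chosen to match the 32x32x3 images we use later), compare the parameter counts of a dense layer, which connects to every pixel, against a convolutional layer, which only connects to local 3x3 regions:

from tensorflow.keras import layers, models

# A dense layer connected to every pixel of a flattened 32x32x3 image
dense_net = models.Sequential([
    layers.Flatten(input_shape=(32, 32, 3)),  # 3072 values per image
    layers.Dense(64)                          # every node sees all 3072 values
])

# A convolutional layer whose 64 filters each see only 3x3 patches
conv_net = models.Sequential([
    layers.Conv2D(64, (3, 3), input_shape=(32, 32, 3))
])

print(dense_net.count_params())  # 3072*64 + 64 = 196,672
print(conv_net.count_params())   # 3*3*3*64 + 64 = 1,792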

Let's have a look at how a densely connected layer would look at an image vs. how a convolutional
layer would.


This is our image; the goal of our network will be to determine whether this image is a cat or not.

Dense Layer: A dense layer will consider the ENTIRE image. It will look at all the pixels and use
that information to generate some output.

Convolutional Layer: The convolutional layer will look at specific parts of the image. In this
example let's say it analyzes the highlighted parts below and detects patterns there.


Can you see why this might make these networks more useful?

How They Work


A dense neural network learns patterns that are present in one specific area of an image. This
means that if a pattern the network knows appears in a different area of the image, the network
has to learn the pattern again in that new area to be able to detect it.

Let's use an example to better illustrate this.

Suppose we have a dense neural network that has learned what an eye looks like from a sample
of dog images.


Let's say it's determined that an image is likely to be a dog if an eye is present in the boxed off
locations of the image above.

Now let's flip the image.

Since our densely connected network has only recognized patterns globally, it will look where it
thinks the eyes should be present. Clearly it does not find them there, and therefore it would
likely determine this image is not a dog. Even though the pattern of the eyes is present, it's just
in a different location.

Since convolutional layers learn and detect patterns from different areas of the image, they don't
have problems with the example we just illustrated. They know what an eye looks like, and by
analyzing different parts of the image they can find where it is present.

Multiple Convolutional Layers


In our models it is quite common to have more than one convolutional layer. Even the basic
example we will use in this guide will be made up of 3 convolutional layers. These layers work
together by increasing complexity and abstraction at each subsequent layer. The first layer
might be responsible for picking up edges and short lines, while the second layer will take these
lines as input and start forming shapes or polygons. Finally, the last layer might take these
shapes and determine which combinations make up a specific image.

Feature Maps
You may see me use the term feature map throughout this tutorial. This term simply refers to a
3D tensor with two spatial axes (width and height) and one depth axis. Our convolutional layers
take feature maps as their input and return a new feature map that represents the presence of
specific filters in the previous feature map. These are what we call response maps.


Layer Parameters
A convolutional layer is defined by two key parameters.

Filters
A filter is an m x n pattern of pixels that we are looking for in an image. The number of filters in a
convolutional layer represents how many patterns each layer is looking for and determines what
the depth of our response map will be. If we are looking for 32 different patterns/filters, then our
output feature map (aka the response map) will have a depth of 32. Each one of the 32 layers of
depth will be a matrix of some size containing values indicating whether the filter was present at
that location or not.

Here's a great illustration from the book "Deep Learning with Python" by Francois Chollet (pg
124).

Sample Size
This isn't really the best term to describe this, but each convolutional layer is going to examine n
x m blocks of pixels in each image. Typically, we'll consider 3x3 or 5x5 blocks. In the example
above we use a 3x3 "sample size". This size will be the same as the size of our filter.

Our layers work by sliding these filters of n x m pixels over every possible position in our image
and populating a new feature map/response map indicating whether the filter is present at each
location.
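
As a small sketch (with a random input tensor, just to show the shapes), we can see how the number of filters and the sample size determine the shape of the response map:

import tensorflow as tf

# One random 28x28 greyscale image: (batch, height, width, channels)
feature_map = tf.random.normal((1, 28, 28, 1))

# 32 filters, each examining 3x3 blocks of pixels
conv = tf.keras.layers.Conv2D(32, (3, 3))
response_map = conv(feature_map)

# Depth equals the number of filters; width/height shrink slightly (see below)
print(response_map.shape)  # (1, 26, 26, 32)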

Borders and Padding


The more mathematical of you may have realized that if we slide a filter of, let's say, size 3x3
over our image, we'll consider fewer positions for our filter than there are pixels in our input.
Look at the example below.


Image from "Deep Learning with Python" by Francois Chollet (pg 126).

This means our response map will have a slightly smaller width and height than our original
image. This is fine, but sometimes we want our response map to have the same dimensions. We
can accomplish this by using something called padding.

Padding is simply the addition of the appropriate number of rows and/or columns to your input
data such that every pixel can be centered under the filter.

Strides
In the previous sections we assumed that the filters would be slid continuously through the image
such that they covered every possible position. This is common, but sometimes we introduce the
idea of a stride to our convolutional layer. The stride size represents how many rows/columns we
will move the filter each time. Large strides are not used very frequently, so we'll move on; the
sketch below shows the effect of padding and strides on the output shape.
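
A minimal sketch (again with a random input) showing how padding and strides change the output shape:

import tensorflow as tf

image = tf.random.normal((1, 28, 28, 1))

valid = tf.keras.layers.Conv2D(32, (3, 3), padding='valid')(image)
same = tf.keras.layers.Conv2D(32, (3, 3), padding='same')(image)
strided = tf.keras.layers.Conv2D(32, (3, 3), strides=2, padding='same')(image)

print(valid.shape)    # (1, 26, 26, 32) - slightly smaller than the input
print(same.shape)     # (1, 28, 28, 32) - padding preserves width and height
print(strided.shape)  # (1, 14, 14, 32) - a stride of 2 halves each dimension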

Pooling
You may recall that our convnets are made up of a stack of convolution and pooling layers.

The idea behind a pooling layer is to downsample our feature maps and reduce their
dimensions. Pooling layers work in a similar way to convolutional layers: they extract windows
from the feature map and return a response map of the max, min or average value of each
channel within the window. Pooling is usually done using windows of size 2x2 and a stride of 2,
which halves the width and height of the feature map.
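
For example, a sketch of max pooling with the usual 2x2 window and stride of 2:

import tensorflow as tf

feature_map = tf.random.normal((1, 28, 28, 32))

# 2x2 windows, stride 2: keep the max of each window, per channel
pooled = tf.keras.layers.MaxPooling2D((2, 2), strides=2)(feature_map)

print(pooled.shape)  # (1, 14, 14, 32) - width and height halved, depth unchanged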

A More Detailed Look



Please refer to the video to learn how all of this happens at the lower level!

Creating a Convnet
Now it is time to create our first convnet! This example is for the purpose of getting familiar with
CNN architectures; we will talk about how to improve its performance later.

This tutorial is based on the following guide from the TensorFlow documentation:
https://www.tensorflow.org/tutorials/images/cnn

Dataset
The problem we will consider here is classifying 10 different everyday objects. The dataset we
will use is built into TensorFlow and called the CIFAR Image Dataset. It contains 60,000 32x32
color images, with 6,000 images of each class.

The labels in this dataset are the following:

Airplane
Automobile
Bird
Cat
Deer
Dog
Frog
Horse
Ship
Truck

We'll load the dataset and have a look at some of the images below.

%tensorflow_version 2.x  # this line is only needed if you are in a Colab notebook
import tensorflow as tf
from tensorflow.keras import datasets, layers, models
import matplotlib.pyplot as plt

# LOAD AND SPLIT DATASET
(train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()

# Normalize pixel values to be between 0 and 1
train_images, test_images = train_images / 255.0, test_images / 255.0

class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']

# Let's look at one image


IMG_INDEX = 7  # change this to look at other images

plt.imshow(train_images[IMG_INDEX], cmap=plt.cm.binary)
plt.xlabel(class_names[train_labels[IMG_INDEX][0]])
plt.show()

CNN Architecture
A common architecture for a CNN is a stack of Conv2D and MaxPooling2D layers followed by a
few densely connected layers. The idea is that the stack of convolutional and max pooling layers
extracts the features from the image. Then these features are flattened and fed to densely
connected layers that determine the class of an image based on the presence of features.

We will start by building the Convolutional Base.

model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))

Layer 1

The input shape of our data will be (32, 32, 3) and we will apply 32 filters of size 3x3 over our
input data. We will also apply the activation function relu to the output of each convolution
operation.

Layer 2

This layer will perform the max pooling operation using 2x2 samples and a stride of 2.

Other Layers

The next set of layers do very similar things, but take as input the feature map from the previous
layer. They also increase the number of filters from 32 to 64. We can afford to do this
(computationally) because our data shrinks in its spatial dimensions as it passes through the
layers, meaning we can add more depth.

model.summary() # let's have a look at our model so far

After looking at the summary you should notice that the depth of our feature maps increases
while the spatial dimensions shrink drastically.

Adding Dense Layers


So far, we have just completed the convolutional base. Now we need to take these extracted
features and add a way to classify them. This is why we add the following layers to our model.
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10))

model.summary()

We can see that the flatten layer changes the shape of our data so that we can feed it to the 64-
node dense layer, followed by the final output layer of 10 neurons (one for each class).

Training
Now we will compile and train the model using the recommended hyperparameters from
TensorFlow.

Note: This will take much longer than previous models!

model.compile(optimizer='adam',
loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
metrics=['accuracy'])

history = model.fit(train_images, train_labels, epochs=4,
                    validation_data=(test_images, test_labels))

Evaluating the Model


We can determine how well the model performed by looking at its performance on the test data
set.

test_loss, test_acc = model.evaluate(test_images, test_labels, verbose=2)
print(test_acc)

You should be getting an accuracy of about 70%. This isn't bad for a simple model like this, but
we'll dive into some better approaches for computer vision below.

Working with Small Datasets


In situations where you don't have millions of images, it is difficult to train a CNN from scratch
that performs very well. This is why we will learn about a few techniques we can use to train
CNNs on small datasets of just a few thousand images.

Data Augmentation

To avoid overfitting and create a larger dataset from a smaller one, we can use a technique
called data augmentation. This simply means performing random transformations on our images
so that our model can generalize better. These transformations can be things like compressions,
rotations, stretches and even color changes.

Fortunately, Keras can help us do this. Look at the code below for an example of data
augmentation.

from tensorflow.keras.preprocessing import image
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# creates a data generator object that transforms images
datagen = ImageDataGenerator(
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest')

# pick an image to transform
test_img = train_images[20]
img = image.img_to_array(test_img)  # convert image to numpy array
img = img.reshape((1,) + img.shape)  # reshape to (1, height, width, channels) for the generator

i = 0

for batch in datagen.flow(img, save_prefix='test', save_format='jpeg'):  # this loop runs forever until we break
    plt.figure(i)
    plot = plt.imshow(image.img_to_array(batch[0]))
    i += 1
    if i > 4:  # show 5 augmented images, then stop
        break

plt.show()

Pretrained Models
You may have noticed that the model above takes a few minutes to train in the notebook and
only gives an accuracy of ~70%. This is okay, but surely there is a way to improve on this.

In this section we will talk about using a pretrained CNN as part of our own custom network to
improve the accuracy of our model. We know that CNNs alone (with no dense layers) don't do
anything other than map the presence of features from our input. This means we can use a
pretrained CNN, one trained on millions of images, as the start of our model. This will allow us to
have a very good convolutional base before adding our own dense layered classifier at the end.
In fact, by using this technique we can train a very good classifier on a relatively small dataset (<
10,000 images). This is because the convnet already has a very good idea of what features to
look for in an image and can find them very effectively. So, if we can determine the presence of
features, all the rest of the model needs to do is determine which combination of features makes
up a specific image.

Fine Tuning
When we employ the technique described above, we will often want to tweak the final layers in our
convolutional base to work better for our specific problem. This involves not touching or
retraining the earlier layers in our convolutional base but only adjusting the final few. We do this
because the first layers in our base are very good at extracting low-level features like lines and
edges, things that are similar for any kind of image, whereas the later layers are better at picking
up very specific features like shapes or even eyes. If we adjust only the final layers, we can look
for features relevant to our very specific problem.
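
A minimal sketch of what fine tuning looks like in Keras, assuming a base_model and model like the ones we build in the next section (the cutoff of 100 layers is an arbitrary choice for illustration):

# Unfreeze the base model, then re-freeze everything except the final layers
base_model.trainable = True

fine_tune_at = 100  # arbitrary cutoff; all layers before this stay frozen
for layer in base_model.layers[:fine_tune_at]:
    layer.trainable = False

# Recompile with a low learning rate so the unfrozen layers change slowly
model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=1e-5),
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])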

Using a Pretrained Model


In this section we will combine the techniques we learned above, using a pretrained model and
fine tuning to classify images of dogs and cats with a small dataset.

This tutorial is based on the following guide from the TensorFlow documentation:
https://www.tensorflow.org/tutorials/images/transfer_learning

# Imports
import os
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
keras = tf.keras

Dataset
We will load the cats_vs_dogs dataset from the module tensorflow_datasets.

This dataset contains (image, label) pairs where images have different dimensions and 3 color
channels.

import tensorflow_datasets as tfds

tfds.disable_progress_bar()

# split the data manually into 80% training, 10% validation, 10% testing
(raw_train, raw_validation, raw_test), metadata = tfds.load(
    'cats_vs_dogs',
    split=['train[:80%]', 'train[80%:90%]', 'train[90%:]'],
    with_info=True,
    as_supervised=True,
)

get_label_name = metadata.features['label'].int2str  # creates a function object that we can use to get the label name

# display a few images from the dataset
for image, label in raw_train.take(5):
    plt.figure()
    plt.imshow(image)
    plt.title(get_label_name(label))

Data Preprocessing
Since the sizes of our images are all different, we need to convert them all to the same size. We
can create a function that will do that for us below.

IMG_SIZE = 160  # All images will be resized to 160x160

def format_example(image, label):
    """
    returns an image that is resized to IMG_SIZE and rescaled to [-1, 1]
    """
    image = tf.cast(image, tf.float32)
    image = (image / 127.5) - 1  # scale pixel values from [0, 255] to [-1, 1]
    image = tf.image.resize(image, (IMG_SIZE, IMG_SIZE))
    return image, label

Now we can apply this function to all our images using .map().

train = raw_train.map(format_example)
validation = raw_validation.map(format_example)
test = raw_test.map(format_example)

Let's have a look at our images now.

for image, label in train.take(2):
    plt.figure()
    plt.imshow(image)
    plt.title(get_label_name(label))

Finally, we will shuffle and batch the images.

BATCH_SIZE = 32
SHUFFLE_BUFFER_SIZE = 1000

train_batches = train.shuffle(SHUFFLE_BUFFER_SIZE).batch(BATCH_SIZE)
validation_batches = validation.batch(BATCH_SIZE)
test_batches = test.batch(BATCH_SIZE)

Now if we look at the shape of an original image vs the new image we will see it has been
changed.

for img, label in raw_train.take(2):
    print("Original shape:", img.shape)

for img, label in train.take(2):
    print("New shape:", img.shape)

Picking a Pretrained Model


The model we are going to use as the convolutional base for our model is MobileNet V2,
developed at Google. This model was trained on 1.4 million images and can classify 1000
different classes.

We want to use this model but only its convolutional base. So, when we load in the model, we'll
specify that we don't want to load the top (classification) layer. We'll tell the model what input
shape to expect and to use the predetermined weights from ImageNet (a large public image dataset).

IMG_SHAPE = (IMG_SIZE, IMG_SIZE, 3)

# Create the base model from the pre-trained model MobileNet V2
base_model = tf.keras.applications.MobileNetV2(input_shape=IMG_SHAPE,
                                               include_top=False,
                                               weights='imagenet')

base_model.summary()

At this point base_model will simply output a tensor of shape (32, 5, 5, 1280) that is a feature
extraction from a batch of our (160, 160, 3) images. The 32 is the batch size, and the 1280 is the
number of different filters/features, each detected over a 5x5 spatial grid.

for image, _ in train_batches.take(1):
    pass

feature_batch = base_model(image)
print(feature_batch.shape)

Freezing the Base


The term freezing refers to disabling the training property of a layer. It simply means we won’t
make any changes to the weights of any layers that are frozen during training. This is important
as we don't want to change the convolutional base that already has learned weights.


base_model.trainable = False

base_model.summary()

Adding our Classifier


Now that we have our base layer set up, we can add the classifier. Instead of flattening the
feature map of the base layer, we will use a global average pooling layer that averages the
entire 5x5 area of each 2D feature map and returns a single 1280-element vector per image
(one average per filter).

global_average_layer = tf.keras.layers.GlobalAveragePooling2D()

Finally, we will add the prediction layer, which will be a single dense neuron. We can do this
because we only have two classes to predict for.

prediction_layer = keras.layers.Dense(1)

Now we will combine these layers together in a model.

model = tf.keras.Sequential([
base_model,
global_average_layer,
prediction_layer
])

model.summary()

Training the Model


Now we will compile and train the model. We will use a very small learning rate to ensure that
the model does not make any major changes to the pretrained weights.

base_learning_rate = 0.0001
model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=base_learning_rate),
loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
metrics=['accuracy'])

# We can evaluate the model right now to see how it does before training it on our new images
initial_epochs = 3
validation_steps = 20

loss0, accuracy0 = model.evaluate(validation_batches, steps=validation_steps)

# Now we can train it on our images
history = model.fit(train_batches,
                    epochs=initial_epochs,
                    validation_data=validation_batches)

acc = history.history['accuracy']
print(acc)

model.save("dogs_vs_cats.h5")  # we can save the model and reload it at any time in the future
new_model = tf.keras.models.load_model('dogs_vs_cats.h5')
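
As a quick sanity check, here is a sketch of using the reloaded model on a batch of test images (the sigmoid turns the single logit into a probability, since we compiled with from_logits=True; we assume label 1 corresponds to "dog" per the dataset's label ordering):

# Use the reloaded model to predict on one batch of test images
for image_batch, label_batch in test_batches.take(1):
    logits = new_model.predict(image_batch)
    probs = tf.nn.sigmoid(logits)           # probability of the positive class
    preds = tf.cast(probs > 0.5, tf.int32)  # threshold at 0.5
    print(preds.numpy().flatten()[:10])     # predicted labels
    print(label_batch.numpy()[:10])         # true labels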

And that's it for this section on computer vision!

Object Detection
If you'd like to learn how you can perform object detection and recognition with TensorFlow,
check out the guide below.

https://github.com/tensorflow/models/tree/master/research/object_detection

Sources
1. "Convolutional Neural Network (CNN): TensorFlow Core." TensorFlow, www.tensorflow.org/tutorials/images/cnn.
2. "Transfer Learning with a Pretrained ConvNet: TensorFlow Core." TensorFlow, www.tensorflow.org/tutorials/images/transfer_learning.
3. Chollet, François. Deep Learning with Python. Manning Publications Co., 2018.
