CNN
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
flatten (Flatten) (None, 784) 0
=================================================================
Total params: 669,706
Trainable params: 669,706
Non-trainable params: 0
_________________________________________________________________
Params after the flatten layer = 0, because this layer only flattens the image into a 784-element vector that serves as the input layer; no weights have been added yet.
Params after layer 1 = (784 nodes in input layer) × (512 in hidden layer 1) + (512 connections to biases)
= 401,920.
Params after layer 2 = (512 nodes in hidden layer 1) × (512 in hidden layer 2) + (512 connections to
biases) = 262,656.
Params after layer 3 = (512 nodes in hidden layer 2) × (10 in output layer) + (10 connections to biases)
= 5,130.
Total params in the network = 401,920 + 262,656 + 5,130 = 669,706.
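The same arithmetic can be checked in a few lines of Python (a minimal sketch; the layer sizes 784, 512, 512 and 10 are taken from the calculation above):

# verify the parameter counts of the fully connected MNIST network
layer_sizes = [784, 512, 512, 10]   # input, hidden 1, hidden 2, output
total = 0
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    params = n_in * n_out + n_out   # weights + biases
    print(n_in, '->', n_out, ':', params)
    total += params
print('Total:', total)              # 669706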
Why Convolutions?
Collecting opencv-python
Downloading opencv_python-4.8.0.74-cp37-abi3-win_amd64.whl (38.1 MB)
---------------------------------------- 38.1/38.1 MB 1.3 MB/s eta 0:00:00
Requirement already satisfied: numpy>=1.17.0 in d:\anaconda setup\lib\site-packages (from opencv-python) (1.23.5)
Installing collected packages: opencv-python
Successfully installed opencv-python-4.8.0.74
In [9]: import os
# Check the current working directory
print(os.getcwd())
# Check if the file exists
file_path = 'input.jfif'
print(os.path.exists(file_path))
D:\JupyterNotebook\PW_Skills_Data_Science\Deep Learning\CV-Content-20230803T131230Z-001
\CV-Content\01. CNN Foundation
True
In [12]: img
...,
Sample Image
Increase in Parameter Issue
While the growth in the number of parameters is not a big problem for the MNIST dataset, because the images are really small (28 × 28), what happens when we try to process larger images?
For example, if we have an image with dimensions 1,000 × 1,000, it will yield 1 million parameters for each
node in the first hidden layer.
So if the first hidden layer has 1,000 neurons, this will yield 1 billion parameters even in such a small
network. You can imagine the computational complexity of optimizing 1 billion parameters after only the
first layer.
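The arithmetic behind that claim, as a quick sketch:

# parameters of a single fully connected layer on a 1,000 x 1,000 grayscale image
pixels = 1_000 * 1_000             # 1,000,000 input values
hidden_neurons = 1_000
weights = pixels * hidden_neurons  # 1,000,000,000 weights, before counting biases
print(weights)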
Source (https://www.cs.toronto.edu)
Source (https://en.wikipedia.org/wiki/Color_vision)
A Convolutional Neural Network (CNN) is comprised of one or more convolutional layers (often with a
subsampling step) and then followed by one or more fully connected layers as in a standard multilayer
neural network. The architecture of a CNN is designed to take advantage of the 2D structure of an input
image (or other 2D input such as a speech signal). This is achieved with local connections and tied weights
followed by some form of pooling which results in translation invariant features. Another benefit of CNNs is
that they are easier to train and have many fewer parameters than fully connected networks with the same
number of hidden units. In this article we will discuss the architecture of a CNN and the backpropagation algorithm used to train it.
Simple Convolution
Matrix Calculation
Padding Concept
Stride Concept
Feature Accumulation
Feature Aggregation
Convolution Operation
Focusing on Filters
In [17]: import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import cv2
import numpy as np
%matplotlib inline
# Read in the image
image = mpimg.imread('input.jfif')
plt.imshow(image)
If you're wondering where I got the matrices (kernels) for sharpening or outline detection, I used the link below; a short sketch applying such kernels follows it.
Source (https://cs231n.github.io/convolutional-networks/)
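As an illustration, kernels like those described at the link above can be applied with cv2.filter2D. This is only a sketch: the 3 × 3 sharpening and outline kernels below are commonly cited examples, not values taken from this notebook.

import cv2
import numpy as np

image = cv2.imread('input.jfif')

sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]], dtype=np.float32)   # common sharpening kernel
outline = np.array([[-1, -1, -1],
                    [-1,  8, -1],
                    [-1, -1, -1]], dtype=np.float32)   # common outline/edge kernel

sharpened = cv2.filter2D(image, -1, sharpen)   # ddepth=-1 keeps the input depth
edges = cv2.filter2D(image, -1, outline)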
Intuition
Let's develop better intuition for how Convolutional Neural Networks (CNN) work. We'll examine how
humans classify images, and then see how CNNs use similar approaches.
Let’s say we wanted to classify the following image of a dog as a Golden Retriever:
As humans, how do we do this?
One thing we do is that we identify certain parts of the dog, such as the nose, the eyes, and the fur. We
essentially break up the image into smaller pieces, recognize the smaller pieces, and then combine those
pieces to get an idea of the overall dog.
In this case, we might break down the image into a combination of the following:
A nose
Two eyes
Golden fur
But let’s take this one step further. How do we determine what exactly a nose is? A Golden Retriever nose
can be seen as an oval with two black holes inside it. Thus, one way of classifying a Retriever’s nose is to
break it up into smaller pieces and look for black holes (nostrils) and curves that define an oval as shown
below:
Broadly speaking, this is what a CNN learns to do. It learns to recognize basic lines and curves, then
shapes and blobs, and then increasingly complex objects within the image. Finally, the CNN classifies the
image by combining the larger, more complex objects.
With deep learning, we don't actually program the CNN to recognize these specific features. Rather, the
CNN learns on its own to recognize such objects through forward propagation and backpropagation!
It's amazing how well a CNN can learn to classify images, even though we never program the CNN with information about specific features to look for!
An example of what each layer in a CNN might recognize when classifying a picture of a dog
A CNN might have several layers, and each layer might capture a different level in the hierarchy of objects.
The first layer is the lowest level in the hierarchy, where the CNN generally classifies small parts of the
image into simple shapes like horizontal and vertical lines and simple blobs of colors. The subsequent
layers tend to be higher levels in the hierarchy and generally classify more complex ideas like shapes
(combinations of lines), and eventually full objects like dogs.
Once again, the CNN learns all of this on its own. We don't ever have to tell the CNN to go looking for
lines or curves or noses or fur. The CNN just learns from the training set and discovers which characteristics
of a Golden Retriever are worth looking for.
Filters
Breaking up an Image
The first step for a CNN is to break up the image into smaller pieces. We do this by selecting a width and
height that defines a filter.
The filter looks at small pieces, or patches, of the image. These patches are the same size as the filter.
A CNN uses filters to split an image into smaller patches. The size of these patches matches the filter size.
We then simply slide this filter horizontally or vertically to focus on a different piece of the image.
The amount by which the filter slides is referred to as the 'stride'. The stride is a hyperparameter which the
engineer can tune. Increasing the stride reduces the size of your model by reducing the number of total
patches each layer observes. However, this usually comes with a reduction in accuracy.
Let’s look at an example. In this zoomed in image of the dog, we first start with the patch outlined in red. The
width and height of our filter define the size of this square.
We then move the square over to the right by a given stride (2 in this case) to get another patch.
We move our square to the right by two pixels to create another patch.
What's important here is that we are grouping together adjacent pixels and treating them as a collective.
In a normal, non-convolutional neural network, we would have ignored this adjacency. In a normal network,
we would have connected every pixel in the input image to a neuron in the next layer. In doing so, we would
not have taken advantage of the fact that pixels in an image are close together for a reason and have
special meaning.
By taking advantage of this local structure, our CNN learns to classify local patterns, like shapes and
objects, in an image.
Filter Depth
It's common to have more than one filter. Different filters pick up different qualities of a patch. For example,
one filter might look for a particular color, while another might look for a kind of object of a specific shape.
The amount of filters in a convolutional layer is called the filter depth.
In the above example, a patch is connected to a neuron in the next layer. Source: Michael Nielsen.
How many neurons does each patch connect to? That’s dependent on our filter depth. If we have a depth of k , we connect each patch of pixels to k
neurons in the next layer. This gives us the height of k in the next layer, as shown below. In practice, k is
a hyperparameter we tune, and most CNNs tend to pick the same starting values.
Choosing a filter depth of k connects each patch to k neurons in the next layer
But why connect a single patch to multiple neurons in the next layer? Isn’t one neuron good enough?
Multiple neurons can be useful because a patch can have multiple interesting characteristics that we want to
capture.
For example, one patch might include some white teeth, some blonde whiskers, and part of a red tongue. In
that case, we might want a filter depth of at least three - one for each of teeth, whiskers, and tongue.
This patch of the dog has many interesting features we may want to capture. These include the presence of
teeth, the presence of whiskers, and the pink color of the tongue.
Having multiple neurons for a given patch ensures that our CNN can learn to capture whatever
characteristics the CNN learns are important.
Remember that the CNN isn't "programmed" to look for certain characteristics. Rather, it learns on its own
which characteristics to notice.
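In Keras terms, the filter depth k is simply the filters argument of Conv2D, so each patch connects to k neurons and the output has depth k. A minimal sketch (the 28 × 28 × 1 input shape is chosen only for illustration):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D

k = 3   # one output channel per characteristic we want to capture
model = Sequential([Conv2D(k, kernel_size=(3, 3), input_shape=(28, 28, 1))])
model.summary()   # output shape (None, 26, 26, 3): the depth equals k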
Parameters
Parameter Sharing
The weights, w, are shared across patches for a given layer in a CNN to detect the cat above regardless of
where in the image it is located.
When we are trying to classify a picture of a cat, we don’t care where in the image a cat is. If it’s in the top
left or the bottom right, it’s still a cat in our eyes. We would like our CNNs to also possess this ability known
as translation invariance. How can we achieve this?
As we saw earlier, the classification of a given patch in an image is determined by the weights and biases
corresponding to that patch.
If we want a cat that’s in the top left patch to be classified in the same way as a cat in the bottom right patch,
we need the weights and biases corresponding to those patches to be the same, so that they are classified
the same way.
This is exactly what we do in CNNs. The weights and biases we learn for a given output layer are shared
across all patches in a given input layer. Note that as we increase the depth of our filter, the number of
weights and biases we have to learn still increases, as the weights aren't shared across the output
channels.
There’s an additional benefit to sharing our parameters. If we did not reuse the same weights across all
patches, we would have to learn new parameters for every single patch and hidden layer neuron pair. This
does not scale well, especially for higher fidelity images. Thus, sharing parameters not only helps us with
translation invariance, but also gives us a smaller, more scalable model.
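Because the weights are shared across patches, a convolutional layer's parameter count depends only on the filter size, the input depth, and the filter depth k, never on the image size. A small sketch of that arithmetic (it reproduces the 320 and 448 parameter counts that appear in the model summaries later in this notebook):

def conv_params(filter_h, filter_w, in_depth, k):
    # each of the k filters has filter_h * filter_w * in_depth weights plus one bias
    return (filter_h * filter_w * in_depth + 1) * k

print(conv_params(3, 3, 1, 32))   # 320 -- first Conv2D of the MNIST model below
print(conv_params(3, 3, 3, 16))   # 448 -- first Conv2D of the CIFAR-10 model below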
Padding
Let's say we have a 5x5 grid (as shown above) and a filter of size 3x3 with a stride of 1 . What's the
width and height of the next layer? We see that we can fit at most three patches in each direction, giving us
a dimension of 3x3 in our next layer. As we can see, the width and height of each subsequent layer
decreases in such a scheme.
In an ideal world, we'd be able to maintain the same width and height across layers so that we can continue
to add layers without worrying about the dimensionality shrinking and so that we have consistency. How
might we achieve this? One way is to simply add a border of 0s to our original 5x5 image. You can see
what this looks like in the below image:
The same grid with 0 padding. Source: Andrej Karpathy.
This would expand our original image to a 7x7 . With this, we now see how our next layer's size is again a
5x5 , keeping our dimensionality consistent.
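More generally, the output size is (W − F + 2P) / S + 1, where W is the input width, F the filter size, P the padding, and S the stride. A quick check of this formula against the 5x5 example above:

def output_size(W, F, P, S):
    return (W - F + 2 * P) // S + 1

print(output_size(5, 3, 0, 1))   # 3 -- no padding shrinks the 5x5 grid to 3x3
print(output_size(5, 3, 1, 1))   # 5 -- one ring of zero padding keeps it 5x5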
Visualizing CNNs
Layer 1
Example patterns that cause activations in the first layer of the network. These range from simple diagonal
lines (top left) to green blobs (bottom middle).
The images above are from Matthew Zeiler and Rob Fergus' deep visualization toolbox
(https://www.youtube.com/watch?v=ghEmQSxT6tw), which lets us visualize what each layer in a CNN
focuses on.
Each image in the above grid represents a pattern that causes the neurons in the first layer to activate - in
other words, they are patterns that the first layer recognizes. The top left image shows a -45 degree line,
while the middle top square shows a +45 degree line. These squares are shown below again for reference:
As visualized here, the first layer of the CNN can recognize -45 degree lines.
The first layer of the CNN is also able to recognize +45 degree lines, like the one above.
Let's now see some example images that cause such activations. The below grid of images all activated the
-45 degree line. Notice how they are all selected despite the fact that they have different colors, gradients,
and patterns.
Example patches that activate the -45 degree line detector in the first layer.
So, the first layer of our CNN clearly picks out very simple shapes and patterns like lines and blobs.
Layer 2
A visualization of the second layer in the CNN. Notice how we are picking up more complex ideas like
circles and stripes. The gray grid on the left represents how this layer of the CNN activates (or "what it
sees") based on the corresponding images from the grid on the right.
As you see in the image above, the second layer of the CNN recognizes circles (second row, second
column), stripes (first row, second column), and rectangles (bottom right).
The CNN learns to do this on its own. There is no special instruction for the CNN to focus on more
complex objects in deeper layers. That's just how it normally works out when you feed training data into a
CNN.
Layer 3
A visualization of the third layer in the CNN. The gray grid on the left represents how this layer of the CNN
activates (or "what it sees") based on the corresponding images from the grid on the right.
The third layer picks out complex combinations of features from the second layer. These include things like
grids, and honeycombs (top left), wheels (second row, second column), and even faces (third row, third
column).
Layer 5
A visualization of the fifth and final layer of the CNN. The gray grid on the left represents how this layer of
the CNN activates (or "what it sees") based on the corresponding images from the grid on the right.
We'll skip layer 4, which continues this progression, and jump right to the fifth and final layer of this CNN.
The last layer picks out the highest order ideas that we care about for classification, like dog faces, bird
faces, and bicycles.
Max Pooling
Convolutions
Stride is an array of 4 elements; the first element in the stride array indicates the stride for batch and last
element indicates stride for features. It's good practice to remove the batches or features you want to skip
from the data set rather than use stride to skip them. You can always set the first and last element to 1 in
stride in order to use all batches and features.
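In the low-level TensorFlow API this is the strides argument of tf.nn.conv2d. A minimal sketch (the input and filter shapes here are made up for illustration):

import tensorflow as tf

x = tf.random.normal([1, 28, 28, 3])    # [batch, height, width, channels]
w = tf.random.normal([3, 3, 3, 16])     # [filter_h, filter_w, in_depth, out_depth]

# strides = [batch, height, width, features]; keep the first and last at 1
y = tf.nn.conv2d(x, w, strides=[1, 2, 2, 1], padding='SAME')
print(y.shape)                          # (1, 14, 14, 16)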
Max Pooling
The above is an example of max pooling with a 2x2 filter and stride of 2. The left square is the input and the
right square is the output. The four 2x2 colored regions in the input represent each time the filter was applied to create
the max on the right side. For example, [[1, 1], [5, 6]] becomes 6 and [[3, 2], [1, 2]]
becomes 3 .
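Only two of the four input patches are spelled out above, so the 4x4 input in this sketch is a hypothetical completion chosen to reproduce the two stated results with a 2x2 filter and a stride of 2:

import numpy as np

x = np.array([[1, 1, 0, 2],
              [5, 6, 1, 1],
              [3, 2, 0, 0],
              [1, 2, 1, 1]])

# group the grid into non-overlapping 2x2 blocks and take the max of each block
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)
# [[6 2]
#  [3 1]]   -- [[1, 1], [5, 6]] -> 6 and [[3, 2], [1, 2]] -> 3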
Integer-valued labels:
[5 0 4 1 9 2 1 3 1 4]
One-hot labels:
[[0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
[0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]]
6. Reshape data to fit our CNN (and input_shape)
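The reshaping cell itself is not reproduced in this extract; a minimal sketch of what it typically looks like for MNIST (the variable names x_train, y_train, etc. are assumptions):

from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical

(x_train, y_train), (x_test, y_test) = mnist.load_data()

# add the channel dimension expected by Conv2D: (N, 28, 28) -> (N, 28, 28, 1)
x_train = x_train.reshape(-1, 28, 28, 1).astype('float32') / 255
x_test = x_test.reshape(-1, 28, 28, 1).astype('float32') / 255

# one-hot encode the integer labels, as in the output shown above
y_train = to_categorical(y_train, 10)
y_test = to_categorical(y_test, 10)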
There are some additional, optional arguments that you might like to tune:
strides - The stride of the convolution. If you don't specify anything, strides is set to 1.
padding - One of 'valid' or 'same'. If you don't specify anything, padding is set to 'valid'.
activation - Typically 'relu'. If you don't specify anything, no activation is applied. You are strongly
encouraged to add a ReLU activation function to every convolutional layer in your networks.
Things to remember:
Always add a ReLU activation function to the Conv2D layers in your CNN. With the exception of the
final layer in the network, Dense layers should also have a ReLU activation function.
When constructing a network for classification, the final layer in the network should be a Dense layer
with a softmax activation function. The number of nodes in the final layer should equal the total number
of classes in the dataset.
In [28]: from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout, GlobalAveragePooling2D
# build the model object
model = Sequential()
# CONV_1: add CONV layer with RELU activation and depth = 32 kernels
model.add(Conv2D(32, kernel_size=(3, 3), padding='same', activation='relu', input_shape=(28, 28, 1)))
# POOL_1: downsample the image to choose the best features
model.add(MaxPooling2D(pool_size=(2, 2)))
# CONV_2: here we increase the depth to 64
model.add(Conv2D(64, (3, 3),padding='same', activation='relu'))
# POOL_2: more downsampling
model.add(MaxPooling2D(pool_size=(2, 2)))
# flatten since too many dimensions, we only want a classification output
model.add(Flatten())
# FC_1: fully connected to get all relevant data
model.add(Dense(64, activation='relu'))
# FC_2: output a softmax to squash the matrix into output probabilities for the 10 classes
model.add(Dense(10, activation='softmax'))
model.summary()
Model: "sequential_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d (Conv2D) (None, 28, 28, 32) 320
=================================================================
Total params: 220,234
Trainable params: 220,234
Non-trainable params: 0
_________________________________________________________________
Things to notice:
The network begins with a sequence of two convolutional layers, followed by max pooling layers.
The final layer has one entry for each object class in the dataset, and has a softmax activation function,
so that it returns probabilities.
The Conv2D depth increases from the input layer of 1 to 32 to 64.
We also want to decrease the height and width - This is where maxpooling comes in. Notice that the
image dimensions decrease from 28 to 14 after the pooling layer.
You can see that every output shape has None in place of the batch size. This makes it possible to change the batch size at runtime.
Finally, we add one or more fully connected layers to determine what object is contained in the image.
For instance, if wheels were found in the last max pooling layer, this FC layer will transform that
information to predict that a car is present in the image with higher probability. If there were eyes, legs, and a tail, it might instead predict a dog.
In [31]: # load the weights that yielded the best validation accuracy
model.load_weights('model.weights.best.hdf5')
Here we will train a CNN to classify images from the CIFAR-10 dataset.
Tip: When using Gradient Descent, you should ensure that all features have a similar scale to speed up training; otherwise it will take much longer to converge.
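For CIFAR-10 this usually just means scaling the 0-255 pixel values to [0, 1] (a minimal sketch; the variable names are assumptions):

from tensorflow.keras.datasets import cifar10

(x_train, y_train), (x_test, y_test) = cifar10.load_data()

# rescale pixel intensities from [0, 255] to [0, 1] so gradient descent converges faster
x_train = x_train.astype('float32') / 255
x_test = x_test.astype('float32') / 255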
Integer-valued labels and their one-hot encodings (output truncated).
5. Define the Model Architecture
Model: "sequential_2"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d_2 (Conv2D) (None, 32, 32, 16) 448
=================================================================
Total params: 541,094
Trainable params: 541,094
Non-trainable params: 0
_________________________________________________________________
6. Compile the Model
Epoch 1/5
In [50]: model.load_weights('model.weights.best.hdf5')
In [51]: model
In [46]: # plot a random sample of test images, their predicted labels, and ground truth
fig = plt.figure(figsize=(20, 8))
for i, idx in enumerate(np.random.choice(x_test.shape[0], size=32, replace=False)):
    ax = fig.add_subplot(4, 8, i + 1, xticks=[], yticks=[])
    ax.imshow(np.squeeze(x_test[idx]))
    pred_idx = np.argmax(y_hat[idx])
    true_idx = np.argmax(y_test[idx])
    ax.set_title("{} ({})".format(cifar10_labels[pred_idx], cifar10_labels[true_idx]),
                 color=("blue" if pred_idx == true_idx else "red"))
In [53]: # plot a random sample of test images, their predicted labels, and ground truth
fig = plt.figure(figsize=(20, 8))
correct_predictions = 0  # Initialize a counter for correct predictions
for i, idx in enumerate(np.random.choice(x_test.shape[0], size=32, replace=False)):
    ax = fig.add_subplot(4, 8, i + 1, xticks=[], yticks=[])
    ax.imshow(np.squeeze(x_test[idx]))
    pred_idx = np.argmax(y_hat[idx])
    true_idx = np.argmax(y_test[idx])
    ax.set_title("{} ({})".format(cifar10_labels[pred_idx], cifar10_labels[true_idx]),
                 color=("blue" if pred_idx == true_idx else "red"))
    if pred_idx == true_idx:
        correct_predictions += 1
accuracy_ratio = correct_predictions / 32  # Calculate the accuracy ratio
plt.suptitle("Correctly identified: {:.2f}%".format(accuracy_ratio * 100))
plt.show()