Mod 5

Convolutional Neural Networks (ConvNets) are designed to efficiently process large images by reducing their dimensionality while preserving critical features. They consist of three main types of layers: convolutional, pooling, and fully connected layers, with convolutional layers utilizing filters to extract features from images. Deep learning models, particularly those with multiple layers, can learn powerful representations automatically from large datasets, improving generalization capabilities in computer vision tasks.


Convolutional Neural Network

Researchers in the computer vision area have been experimenting with many
neural-network architectures and algorithms, which have influenced other
fields as well.
In computer vision, images are the training data of a network, and the input
features are the pixels of an image. These features can get really big. For
example, a 1-megapixel color image has 3 million features
(1,000 x 1,000 x 3 color channels).
Passing this through a neural network with just 1,000 hidden
units would require a weight matrix with 3 billion parameters!
These numbers are too big to manage, but luckily we have the perfect
solution: convolutional neural networks (ConvNets).
Smaller Network: CNN
• We know it is good to learn a small model.
• From this fully connected model, do we really need all the edges?
• Can some of these be shared?

Consider learning from an image:

• Some patterns are much smaller than the whole image, so a small region
can be represented with fewer parameters (e.g. a “beak” detector).
• The same pattern appears in different places, so the detectors can be
compressed: instead of training many separate “small” detectors (an
“upper-left beak” detector, a “middle beak” detector, and so on), they can
be compressed to the same parameters, with one detector “moving around”
the image.
Deep learning is about feature/representation learning.

Compared with traditional hand-crafted features, representations learned by
neural networks are much more powerful in computer vision tasks.
Because this feature-learning process is accomplished automatically by feeding
the network large volumes of training data, the learned features generalize better.
Deep learning models are formed by multiple layers. In the context of artificial neural
networks, a multilayer perceptron (MLP) with more than 2 hidden layers is already a deep
model.
As a rule of thumb, deeper models have the potential to perform better than shallow models.
The problem is that the deeper you go, the more data you will need to avoid overfitting.
Convolutional Neural Network
You can imagine how computationally intensive things would get once the
images reach dimensions of, say, 8K (7680×4320). The role of the ConvNet is
to reduce the images into a form that is easier to process, without losing
features that are critical for getting a good prediction. This is important
when we want to design an architecture that is not only good at learning
features but is also scalable to massive datasets.

There are 3 types of layers in a convolutional network:

• Convolution (CONV)
• Pooling (POOL)
• Fully connected (FC)
A convolutional layer
A “convolution” is one of the building blocks of the convolutional network.
The primary purpose of a “convolution” in the case of a ConvNet is to
extract features from the input image. A convolutional layer has a number
of filters, each of which performs the convolution operation.

A filter (kernel), e.g. a “beak detector”, produces an output sometimes
referred to as a feature map.
A convolutional layer
Every image can be represented as a matrix of
pixel values. An image from a standard digital
camera will have three channels: red, green
and blue. You can imagine those as three 2D
matrices stacked on top of each other (one for each
color), each having pixel values in the range 0 to
255.

Applying a convolution to an image means taking a filter of a certain dimension and sliding it
on top of the image. That operation translates into an element-wise multiplication between
the two matrices, followed by a sum of the products. The resulting value of this
computation forms a single element of the output matrix.
Convolution
The filter weights are the network parameters to be learned.

6 x 6 image:
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0

Filter 1:          Filter 2:
 1 -1 -1           -1  1 -1
-1  1 -1           -1  1 -1
-1 -1  1           -1  1 -1

Each filter detects a small pattern (3 x 3).
Convolution, stride = 1
Placing Filter 1 at the top-left of the 6 x 6 image and taking the dot
product gives 3; sliding one pixel to the right gives -1.
Convolution, stride = 2
With stride = 2 the filter jumps two pixels at a time, so the first row of
outputs for Filter 1 is 3 and -3.
Convolution, stride = 1 (full output)
Sliding Filter 1 over the whole 6 x 6 image with stride 1 produces a
4 x 4 feature map:
 3 -1 -3 -1
-3  1  0 -3
-3 -3  0  1
 3 -2 -2 -1
Repeat this for each filter. Filter 2 with stride = 1 produces its own
4 x 4 feature map:
-1 -1 -1 -1
-1 -1 -2  1
-1 -1 -2  1
-1  0 -4  3

The two 4 x 4 feature maps together form a 2 x 4 x 4 output.
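The sliding computation above can be written out directly. Below is a minimal NumPy sketch of a valid (no padding) convolution as used in ConvNets; running it on the 6 x 6 image with Filter 1 reproduces the 4 x 4 feature map above.

import numpy as np

def convolve2d(image, kernel, stride=1):
    # Slide the kernel over the image; at each position, multiply
    # element-wise and sum to produce one output value.
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.array([[1, 0, 0, 0, 0, 1],
                  [0, 1, 0, 0, 1, 0],
                  [0, 0, 1, 1, 0, 0],
                  [1, 0, 0, 0, 1, 0],
                  [0, 1, 0, 0, 1, 0],
                  [0, 0, 1, 0, 1, 0]])
filter1 = np.array([[ 1, -1, -1],
                    [-1,  1, -1],
                    [-1, -1,  1]])
print(convolve2d(image, filter1))            # 4 x 4 map; top-left value is 3
print(convolve2d(image, filter1, stride=2))  # 2 x 2 map with stride 2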
Color image: RGB 3 channels
Image size = 6 x 6 x 3, filter size = 3 x 3 x 3. Each filter has the same
depth as the input, so Filter 1 and Filter 2 each consist of three stacked
3 x 3 kernels (one per channel), and the color image is three stacked
6 x 6 channel matrices.
Convolution vs. Fully Connected

A fully connected layer would flatten the 6 x 6 image into 36 inputs
(x1, x2, ..., x36) and connect every input to every output unit.
Convolution, in contrast, connects each output only to a small patch of
the image.
Flattening the 6 x 6 image into a 36-dimensional vector makes the comparison
concrete: the first output of Filter 1 (value 3) is connected only to the
9 inputs covered by its 3 x 3 window (inputs 1, 2, 3, 7, 8, 9, 13, 14, 15),
not to all 36 inputs. Fewer parameters!

The next output of Filter 1 (value -1) connects to a shifted set of 9 inputs
(2, 3, 4, 8, 9, 10, 14, 15, 16) and reuses the same 9 weights. Shared
weights mean even fewer parameters.

For an M x M image and an N x N filter (stride 1, no padding), the output
size is M - N + 1; e.g. 32 - 5 + 1 = 28.
If images are composed of three channels (R: red, G: green, B: blue), the
input is a volume: a stack of three matrices, whose depth is given by the
number of channels. If we apply only one filter, the cube filter of
27 parameters (3 x 3 x 3) slides on top of the cube of the input image and
produces a single 2D feature map.
Some common convolution filters
(d = depth, k = number of filters, P = padding, S = stride)

Input: N x N x d
Filter: F x F x d
Output: [(N + 2P - F)/S + 1] x [(N + 2P - F)/S + 1] x k

Example: (N - F)/S + 1 = (6 - 3)/1 + 1 = 4
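As a quick sketch, the formula translates into one line of Python:

def conv_output_size(n, f, p=0, s=1):
    # (N + 2P - F) / S + 1, assuming the division is exact
    return (n + 2 * p - f) // s + 1

print(conv_output_size(6, 3))  # (6 - 3)/1 + 1 = 4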
One layer of Convolutional Neural Network

The convolution computes z = W * a + b (for a 3 x 3 x 3 filter, 27 weights
plus 1 bias) and then applies a non-linearity g(z). The final step that
takes us to a convolutional neural layer is to add the bias and apply a
non-linear function.
One layer of Convolutional Neural Network

The result of convolving a 6×6×3 input with two 3×3×3 filters is a volume of dimension 4×4×2.

We’ve gone from a 6×6×3 dimensional a[0] through one layer of a neural network to
a 4×4×2 dimensional a[1]. So, 6×6×3 has gone to 4×4×2, and that’s one layer of a convolutional
net. In this example we had two filters involved, which is why we end up with a 4×4×2 output. If we
had 10 filters instead of 2, we would obtain a 4×4×10 dimensional output volume.
That is, we’d be taking 10 of these maps instead of 2 of them and stacking them up to form
a 4×4×10 output volume, and that’s how a[1] would be obtained.
One layer of Convolutional Neural Network

In neural networks, one step of forward propagation is computed
as z[1] = W[1] a[0] + b[1], where a[0] = x.
Then we apply the non-linearity to get a[1] = g(z[1]). We apply the same idea in a layer of
the Convolutional Neural Network.
This is how we go from a[0] to a[1]. So, the convolution really consists of three steps:
1. apply the linear operation,
2. add the biases, and
3. apply ReLU.
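A minimal NumPy sketch of these three steps for one layer, using the 6×6×3 input and two 3×3×3 filters of the running example (random numbers stand in for learned weights):

import numpy as np

def conv_layer_forward(a_prev, W, b):
    # a_prev: (H, W, C) input; W: (k, f, f, C) filters; b: (k,) biases.
    # Stride 1, no padding.
    k, f = W.shape[0], W.shape[1]
    oh, ow = a_prev.shape[0] - f + 1, a_prev.shape[1] - f + 1
    z = np.zeros((oh, ow, k))
    for m in range(k):                      # one feature map per filter
        for i in range(oh):
            for j in range(ow):
                patch = a_prev[i:i + f, j:j + f, :]
                z[i, j, m] = np.sum(patch * W[m]) + b[m]  # 1. linear op, 2. bias
    return np.maximum(z, 0)                 # 3. ReLU

a0 = np.random.rand(6, 6, 3)                # a[0]: the 6x6x3 input
W1 = np.random.randn(2, 3, 3, 3)            # two 3x3x3 filters
b1 = np.zeros(2)
print(conv_layer_forward(a0, W1, b1).shape) # (4, 4, 2): the volume a[1]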
Number of parameters in one layer
Let’s suppose we have 10 filters that are 3×3×3 in one layer of a neural
network. So, how many parameters does this layer have?

Each filter is a 3×3×3 volume, so each filter has 27 parameters to be
learned. Then we add the bias, parameter b, which gives us 28 parameters per filter.
Previously we had two filters, but if we now imagine that we actually have ten of
these filters, then we have 28×10 = 280 parameters. A nice point
about this is that no matter how big the input images are, the number of
parameters remains fixed. The input image could
be 1000×1000 or 5000×5000, but the number of parameters we have
remains 280.
Number of parameters in one layer

We can use these ten filters to detect features (vertical edges, horizontal
edges, maybe other features) anywhere, even in a very large image, with just
a very small number of parameters. This is one property of
convolutional neural nets that makes them less prone to overfitting. So, once
we learn ten feature detectors that work, we can apply them even to very
large images, and the number of parameters remains fixed and relatively
small, 280 in this example.
An example of a ConvNet
If layer l is a convolutional layer, we denote the filter size by f[l]: the
superscript [l] signifies an f×f filter in layer l.
We use p[l] to denote the amount of padding; the padding can also be
specified just by saying that we want a valid convolution, which means no padding, or
a same convolution, which means we choose the padding so that the output image has
the same height and width as the input image. We use s[l] to
denote the stride.

Output size:
(n + 2p − f)/s + 1
An example of a ConvNet
Let’s now say we have another convolutional layer that uses 5×5 filters. So, in our
notation, f[2] at the next layer of the network is equal to 5 (f[2]=5); let’s say we use a stride
of 2 (s[2]=2), no padding (p[2]=0), and 20 filters.

Then the output will be another volume, this time 17×17×20. Notice that because we’re
now using a stride of 2 (s[2]=2), the dimension has shrunk much faster: 37×37 has gone down
by slightly more than a factor of 2, to 17×17. Because we’re using 20 filters, the number of channels is
now 20, so this activation a[2] has that dimension.
One last convolutional layer: let’s say we use a 5×5 filter again and, again, a stride of 2. Eventually
we end up with 7×7. Finally, if we use 40 filters and no padding, we end up with 7×7×40.
An example of a ConvNet

Now our 39×39×3 input image is processed and 7×7×40 features are computed for
this image. Finally, we take the 7×7×40 (= 1960) volume and flatten or unroll
it into 1960 units. By unrolling them into a very long vector we can feed
a softmax or a logistic regression in order to make the prediction for the final output.
As we go deeper into the neural network, we typically start off with larger images (39×39);
the height and width then stay the same for a while and gradually trend down as
we go deeper. That is, the size has gone from 39 to 37 to 17 to 7,
whereas the number of channels generally increases (from 3 to 10 to 20 to 40). We can
see this general trend in a lot of other convolutional neural networks as well.
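As an illustration, here is a sketch of this example network in Keras. The final single-unit sigmoid classifier is an assumption, since the slides leave the output layer unspecified.

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(39, 39, 3)),                                      # a[0]
    layers.Conv2D(10, 3, strides=1, padding='valid', activation='relu'),  # 37x37x10
    layers.Conv2D(20, 5, strides=2, padding='valid', activation='relu'),  # 17x17x20
    layers.Conv2D(40, 5, strides=2, padding='valid', activation='relu'),  # 7x7x40
    layers.Flatten(),                                                     # 1960 units
    layers.Dense(1, activation='sigmoid'),                                # prediction
])
model.summary()  # confirms the 39 -> 37 -> 17 -> 7 progression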
Pooling Layer and Fully Connected layer

In a typical ConvNet there are usually three types of layers: one is the
convolutional layer, often denoted CONV. One is called
a Pooling layer, from now on called POOL, and the last is
a Fully connected layer, called FC.

Although it’s possible to design a pretty good neural network using just
convolutional layers, most neural network architectures also have a few
pooling layers and a few fully connected layers.


Pooling layers

Apart from convolutional layers, ConvNets often use pooling layers to
reduce the image size. Hence, this layer speeds up the computation and
also makes some of the detected features a bit more robust. Let’s
go through an example of pooling, and then we’ll talk about why we might
want to apply it.
There are two types of pooling:

• Max pooling
• Average pooling
Pooling layers
Max Pooling
Suppose we have a 4×4 input and we want to apply a type of pooling
called max pooling. The output of this particular implementation
of max pooling will be a 2×2 output. The procedure is quite simple: we take
our 4×4 input and break it into four regions (max pooling with a 2×2 filter
and a stride of 2). Then, in the 2×2 output, each element is the max from
the corresponding region.
The intuition behind max pooling

If we think of this 4×4 region as some set of features (the activations in some layer of the
neural network), then a large number means that a particular feature has perhaps been detected.
So if the upper left-hand quadrant has this particular feature, maybe a vertical edge or maybe an eye
of an animal (say a cat-eye detector), that feature clearly exists in the upper left-hand quadrant,
whereas it doesn’t really exist in the upper right-hand quadrant. What the max operation does is
preserve a feature detected anywhere within each region: if the feature is detected anywhere
in the filter’s window, the max stays high; if it is not detected, the max stays low, and the
feature most likely doesn’t exist in the corresponding quadrant.
We can say that there are two main reasons that people use max pooling:
1. It’s been found in a lot of experiments to work well.
2. It has no parameters to learn. There’s actually nothing for gradient descent to learn:
once we’ve fixed f and s, it’s just a fixed computation, and gradient descent doesn’t change
anything.
Average pooling
Instead of taking the maximum within each filter region, average pooling takes the average.
In this example the average of the numbers in purple is 3.75, then 1.25, then 4, and finally 2.
This is average pooling with hyperparameters f=2 and s=2; we can choose other
hyperparameters as well. These days max pooling is used much more often than average
pooling.
The pooling layer downsamples the volume spatially, independently in each
depth slice of the input volume.
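A minimal NumPy sketch of both pooling types with f=2 and s=2; applied to Filter 1's 4 x 4 feature map from the earlier convolution example, max pooling reproduces the 2 x 2 result shown in the slides below.

import numpy as np

def pool2d(x, f=2, s=2, mode='max'):
    # x is one depth slice; pooling is applied to each slice independently.
    oh, ow = (x.shape[0] - f) // s + 1, (x.shape[1] - f) // s + 1
    op = np.max if mode == 'max' else np.mean
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = op(x[i * s:i * s + f, j * s:j * s + f])
    return out

fmap = np.array([[ 3, -1, -3, -1],
                 [-3,  1,  0, -3],
                 [-3, -3,  0,  1],
                 [ 3, -2, -2, -1]])
print(pool2d(fmap))                  # [[3. 0.] [3. 1.]]
print(pool2d(fmap, mode='average'))  # average pooling of the same map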
Why Pooling

• Subsampling pixels will not change the object: a subsampled bird is
still a bird.
• We can subsample the pixels to make the image smaller, so fewer
parameters are needed to characterize the image.
Max Pooling
Applying Filter 1 and Filter 2 from the earlier example to the 6 x 6 image
gives two 4 x 4 feature maps:

Filter 1:            Filter 2:
 3 -1 -3 -1          -1 -1 -1 -1
-3  1  0 -3          -1 -1 -2  1
-3 -3  0  1          -1 -1 -2  1
 3 -2 -2 -1          -1  0 -4  3
Max Pooling
Max pooling helps make representations approximately invariant.
Convolving the 6 x 6 image and then applying 2 x 2 max pooling yields a
new, smaller 2 x 2 image per filter:

Filter 1:   Filter 2:
3 0         -1 1
3 1          0 3

Each filter produces one channel of the new image.
The activations of an example ConvNet architecture: features are extracted
automatically using convolutions, and a DNN then predicts the class.
Illustration of two convolutional layers: the first, with 4 filters of size 5×5×3, gets as
input an RGB image of size 64×64×3 and produces a tensor of feature maps. A
second convolutional layer with 5 filters of size 3×3×4 gets as input the tensor from the
previous layer of size 64×64×4 and produces a new 64×64×5 tensor of
feature maps. The circle after each filter denotes an activation function, e.g. ReLU.
VGG16 model for classification and detection

The image is passed through a stack of convolutional (conv.) layers, where the
filters have a very small receptive field: 3×3 (the smallest
size that captures the notion of left/right, up/down, and center).
VGG16 model for classification and detection

Visualization of feature map from block 1 & 2


Visualization of feature map from block 3, 4 & 5
A CNN compresses a fully connected network in two ways:
• Reducing the number of connections
• Sharing weights on the edges
• Max pooling further reduces the complexity

Adding a fully-connected layer is a (usually) cheap way of learning non-linear
combinations of the high-level features represented by the output of the convolutional
layers. The fully-connected layer learns a possibly non-linear function in that space.
Fully Connected Layer
A fully connected layer acts like a “standard” single neural network layer, where
you have a weight matrix W and bias b.
We can see its application in the following example of a convolutional neural
network, inspired by the LeNet-5 network.

It is common that, as we go deeper into the network, the spatial sizes (n_h, n_w)
decrease, while the number of channels (n_c) increases.
The whole CNN

Input image → Convolution → Max Pooling → Convolution → Max Pooling
(the convolution + max pooling pair can repeat many times)
→ Flattened → Fully Connected feedforward network → output (cat, dog, ...)
The whole CNN

After convolution and max pooling we obtain a new image, smaller than the
original, whose number of channels equals the number of filters.
Convolution and max pooling can then be repeated many times on this new
image.
Flattening

The pooled feature maps (here the 2 x 2 maps [3 0; 3 1] and [-1 1; 0 3])
are flattened into a single long vector, which is then fed into a fully
connected feedforward network.
Revisiting CNN: Convolution by a 3 x 3 filter

Convolving a 6*6 image with a 3*3 filter converts it into a 4*4 image;
with a stride of 2 the output is smaller still.

The size of the image keeps reducing as we increase the stride value.

Padding the input image with zeros around its border solves this problem for
us. We can also add more than one layer of zeros around the image in the
case of higher stride values.
Revisiting CNN

Multiple filters and the activation map

The output from each filter is stacked together, forming the depth
dimension of the convolved image. Suppose we have an input image of
size 32*32*3 and we apply 10 filters of size 5*5*3 with valid padding. The
output will have the dimensions 28*28*10.

The activation map is the output of the convolution layer.


Revisiting CNN

Pooling layer
Sometimes, when the images are too large, we need to reduce the
number of trainable parameters. It is then desirable to periodically
introduce pooling layers between subsequent convolution layers. Pooling
is done for the sole purpose of reducing the spatial size of the image.
Pooling is done independently on each depth slice, therefore the
depth of the image remains unchanged. The most commonly applied form of
pooling layer is max pooling.

In the example, the 4*4 convolved output becomes 2*2 after the max
pooling operation.
Revisiting CNN
Output dimensions
Three hyperparameters control the size of the output volume:
1. The number of filters – the depth of the output volume will be equal to the number of filters
applied. Remember how we stacked the output from each filter to form an activation map:
the depth of the activation map will be equal to the number of filters.
2. Stride – when we have a stride of one, we move across and down a single pixel. With
higher stride values, we move a larger number of pixels at a time and hence produce smaller
output volumes.
3. Zero padding – this helps us preserve the size of the input image. If a single layer of zero
padding is added, a single-stride filter movement retains the size of the original image.

The spatial size of the output image can be calculated as

([W - F + 2P]/S) + 1,

where W is the input volume size,
F is the size of the filter,
P is the amount of padding applied, and S is the stride.

Suppose we have an input image of size 32*32*3 and we apply 10 filters of size 3*3*3, with a single
stride and no zero padding. The output depth will be equal to the number of filters applied, i.e.
10.
With W=32, F=3, P=0 and S=1,
the size of the output volume is ([32-3+0]/1)+1 = 30.
Therefore the output volume will be 30*30*10.
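The worked example can be checked in a couple of lines:

W, F, P, S, K = 32, 3, 0, 1, 10
size = (W - F + 2 * P) // S + 1
print(size, size, K)  # 30 30 10 -> a 30*30*10 output volume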
Revisiting CNN
Output layer

After multiple layers of convolution and pooling, we need the output in the
form of a class. The convolution and pooling layers only extract
features and reduce the number of parameters relative to the original images.
However, to generate the final output we need to apply a fully connected layer
that produces an output equal to the number of classes we need. It is tough to
reach that number with just the convolution layers: convolution layers generate 3D
activation maps, while we just need to know whether or not an image belongs
to a particular class. The output layer has a loss function, such as categorical cross-
entropy, to compute the error in prediction. Once the forward pass is complete,
backpropagation begins to update the weights and biases to reduce the error and
loss.
One Layer of a Convolutional Network
Once we get an output after convolving over the entire image using a filter, we add
a bias term to those outputs and finally apply an activation function to generate
activations. This is one layer of a convolutional network. Recall that the equation
for one forward pass is given by z[1] = w[1] a[0] + b[1], a[1] = g(z[1]).

In our case, the input (6 X 6 X 3) is a[0] and the filters (3 X 3 X 3) are the weights w[1]. These
activations from layer 1 act as the input for layer 2, and so on. Clearly, the number of
parameters in the case of convolutional neural networks is independent of the size of the
image. It essentially depends on the filter size. Suppose we have 10 filters, each of
shape 3 X 3 X 3. What will be the number of parameters in that layer? Let’s try to solve
this:
•Number of parameters for each filter = 3*3*3 = 27
•There will be a bias term for each filter, so total parameters per filter = 28
•As there are 10 filters, the total parameters for that layer = 28*10 = 280
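The same count as a tiny script:

n_filters, f, channels = 10, 3, 3
per_filter = f * f * channels + 1  # 27 weights + 1 bias = 28
print(n_filters * per_filter)      # 280, regardless of image size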
Example
More Edge Detection
The type of filter that we choose helps detect vertical or horizontal
edges; we can use different filters to detect different edges.

Among the commonly used filters, the Sobel filter puts a little more weight on the
central pixels. Instead of using these predefined filters, we can also create our own
and treat the filter values as parameters that the model learns using
backpropagation.
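As a sketch, here is the classic vertical-edge Sobel kernel applied to a toy image (the image values are made up for illustration):

import numpy as np
from scipy.signal import correlate2d

sobel_vertical = np.array([[1, 0, -1],
                           [2, 0, -2],
                           [1, 0, -1]])   # central row weighted more heavily

# Toy image: bright on the left half, dark on the right half.
image = np.hstack([np.full((6, 3), 10), np.zeros((6, 3))])
print(correlate2d(image, sobel_vertical, mode='valid'))
# Large responses appear in the columns where the brightness changes.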
CNN in Keras
Only the network structure and the input format change (vector → 3-D tensor).

Input_shape = (28, 28, 1): 28 x 28 pixels, 1 channel for black/white
images (3 for RGB).

The network: Convolution (25 filters of size 3 x 3) → Max Pooling →
Convolution → Max Pooling.
CNN in Keras (vector → 3-D array)

Input: 1 x 28 x 28
Convolution (25 filters, 3 x 3): 9 parameters per filter → 25 x 26 x 26
Max Pooling → 25 x 13 x 13
Convolution (50 filters, 3 x 3): 225 = 25 x 9 parameters per filter → 50 x 11 x 11
Max Pooling → 50 x 5 x 5
Flattened → 1250 units → Fully connected feedforward network → Output
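Putting the slide into code, here is a Keras sketch of this architecture; the 10-class softmax output is an assumption (e.g. MNIST digits), since the slides do not state it.

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),          # 28 x 28 pixels, 1 channel
    layers.Conv2D(25, 3, activation='relu'),  # 25 3x3 filters -> 26x26x25
    layers.MaxPooling2D(2),                   # -> 13x13x25
    layers.Conv2D(50, 3, activation='relu'),  # 50 3x3x25 filters -> 11x11x50
    layers.MaxPooling2D(2),                   # -> 5x5x50
    layers.Flatten(),                         # -> 1250 units
    layers.Dense(10, activation='softmax'),   # fully connected output
])
model.summary()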
A spatial separable convolution simply divides a kernel into two smaller
kernels. The most common case is to divide a 3x3 kernel into a 3x1
and a 1x3 kernel.

Now, instead of doing one convolution with 9 multiplications, we do two
convolutions with 3 multiplications each (6 in total) to achieve the same effect.
With fewer multiplications, computational complexity goes down, and the
network is able to run faster.
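The vertical-edge Sobel kernel from the earlier sketch factors exactly this way, as a small NumPy check confirms:

import numpy as np

col = np.array([[1], [2], [1]])   # 3x1 kernel
row = np.array([[1, 0, -1]])      # 1x3 kernel
print(col @ row)                  # outer product rebuilds the 3x3 Sobel kernel
# Convolving with col and then row costs 3 + 3 = 6 multiplications per
# position, versus 9 for the full 3x3 kernel.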
Unlike spatial separable convolutions, depthwise separable convolutions work
with kernels that cannot be “factored” into two smaller kernels, which is why they are more
commonly used. Similar to the spatial separable convolution, a depthwise
separable convolution splits the work into two separate kernels that perform two
convolutions: the depthwise convolution and the pointwise convolution.

Normal Convolution
Depthwise Convolution

Each 5x5x1 kernel iterates over 1 channel of the image (note: 1 channel, not all
channels), taking the scalar product of every 25-pixel group and giving an 8x8x1
image. Stacking these images together creates an 8x8x3 image.
Pointwise Convolution

The pointwise convolution is so named because it uses a 1x1 kernel, a kernel
that iterates through every single point. This kernel has a depth equal to the number of
channels of the input image, in our case 3. Therefore, we iterate a 1x1x3
kernel through our 8x8x3 image to get an 8x8x1 image.

Let’s calculate the number of multiplications the computer has to do in the
original convolution. There are 256 5x5x3 kernels that move 8x8 times. That’s
256x3x5x5x8x8 = 1,228,800 multiplications.
In the depthwise convolution, we have 3 5x5x1 kernels that move 8x8 times.
That’s 3x5x5x8x8 = 4,800 multiplications. In the pointwise convolution, we have
256 1x1x3 kernels that move 8x8 times. That’s 256x1x1x3x8x8 = 49,152
multiplications. Adding them up together, that’s 53,952 multiplications.
53,952 is a lot less than 1,228,800. With fewer computations, the network is able
to process more in a shorter amount of time.
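The counts above in a few lines (the 8 x 8 output positions are implied by the source's example):

k, c, positions, n_out = 5, 3, 8 * 8, 256

normal = n_out * c * k * k * positions     # 1,228,800
depthwise = c * k * k * positions          # 4,800
pointwise = n_out * 1 * 1 * c * positions  # 49,152
print(normal, depthwise + pointwise)       # 1228800 53952

In Keras, this pattern is available directly as layers.SeparableConv2D.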
ResNet Building block

In terms of architecture, if any layer ends up damaging the performance of the
model in a plain network, it can effectively be skipped thanks to the presence
of the skip-connections.
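A minimal Keras sketch of such a building block, assuming identity shortcuts (equal input and output shapes, so no projection is needed); the slides show only the block diagram.

from tensorflow.keras import layers

def residual_block(x, filters):
    shortcut = x                                     # the skip-connection
    y = layers.Conv2D(filters, 3, padding='same')(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation('relu')(y)
    y = layers.Conv2D(filters, 3, padding='same')(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([y, shortcut])                  # add the input back in
    return layers.Activation('relu')(y)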
MobileNet architecture
