
Unit 2

CNN Architecture
• Convolutional neural networks are biologically inspired
networks that are used in computer vision for image
classification and object detection.
• In the convolutional neural network architecture, each
layer of the network is 3-dimensional, which has a spatial
extent and a depth corresponding to the number of
features.
• The notion of depth of a single layer in a convolutional
neural network is distinct from the notion of depth in
terms of the number of layers.
• In the input layer, these features correspond to the color channels such as RGB (i.e., red, green, blue), and in the hidden layers these features represent hidden feature maps that encode various types of shapes in the image.
• If the input is in grayscale (like LeNet-5), then the input
layer will have a depth of 1, but later layers will still be 3-
dimensional.
• The architecture contains two types of layers, referred to
as the convolution and subsampling layers, respectively.
• For the convolution layers, a convolution operation is
defined, in which a filter is used to map the activations
from one layer to the next.
• A convolution operation uses a 3-dimensional filter of
weights with the same depth as the current layer but with
a smaller spatial extent.
• The dot product between all the weights in the filter and any choice of spatial region (of the same size as the filter) in a layer defines the value of a hidden state in the next layer.
• The operation between the filter and the spatial regions in a layer is performed at every possible position in order to define the next layer.
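The dot-product operation described above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a single filter, stride 1, and no padding; the function name `conv_single_filter` is chosen here for clarity and is not from any library:

```python
import numpy as np

def conv_single_filter(x, w):
    """Convolve an H x W x D input with a single f x f x D filter.

    At every spatial position, the dot product between the filter and the
    matching region of the input gives one hidden value, so the output has
    spatial extent (H - f + 1) x (W - f + 1).
    """
    H, W, D = x.shape
    f = w.shape[0]               # filter depth w.shape[2] must equal D
    out = np.zeros((H - f + 1, W - f + 1))
    for i in range(H - f + 1):
        for j in range(W - f + 1):
            out[i, j] = np.sum(x[i:i + f, j:j + f, :] * w)
    return out

x = np.random.rand(32, 32, 3)    # e.g. a 32 x 32 RGB input
w = np.random.rand(5, 5, 3)      # filter depth matches input depth
print(conv_single_filter(x, w).shape)   # (28, 28)
```

In a real layer there are many such filters, and stacking their outputs along the depth axis produces the 3-dimensional next layer described above.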
AlexNet
• AlexNet was the winner of the 2012 ILSVRC competition.
• It has 8 layers with learnable parameters.
• The input to the Model is RGB images.
• It has 5 convolution layers interspersed with max-pooling layers.
• Then it has 3 fully connected layers.
• The activation function used in all hidden layers is ReLU.
• It used two Dropout layers.
• The activation function used in the output layer is
Softmax.
• The total number of parameters in this architecture is 62.3
million.
• AlexNet starts with 224 × 224 × 3 images and uses 96 filters of size 11 × 11 × 3 in the first layer.
• A stride of 4 is used, which results in a first layer of size 55 × 55 × 96. (Note that (224 − 11)/4 + 1 is not an integer; the input is commonly taken as 227 × 227, or padding is applied, to obtain 55.)
• After the first layer has been computed, a max-pooling
layer is used.
• The ReLU activation function was applied after each
convolutional layer, which was followed by response
normalization and max-pooling.
• The second convolutional layer uses the response-
normalized and pooled output of the first convolutional
layer and filters it with 256 filters of size 5 × 5 × 96.
• The sizes of the filters in the third, fourth, and fifth convolutional layers are 3 × 3 × 256 (with 384 filters), 3 × 3 × 384 (with 384 filters), and 3 × 3 × 384 (with 256 filters), respectively.
• All max-pooling layers used 3 × 3 filters at stride 2.
• The fully connected layers have 4096 neurons each. The final set of 4096 activations can be treated as a 4096-dimensional representation of the image.
• The final layer of AlexNet uses a 1000-way softmax in
order to perform the classification.
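The layer sizes and the 62.3-million parameter figure above can be checked with a short script. This is a sketch using the commonly cited single-stream AlexNet dimensions: it assumes a 227 × 227 input (so that (227 − 11)/4 + 1 = 55 reproduces the 55 × 55 × 96 first layer) and the usual padding of 2 in the second convolutional layer and 1 in the third through fifth, which the text does not state explicitly:

```python
def conv_out(n, f, stride, pad=0):
    """Spatial output size of a convolution or pooling layer."""
    return (n + 2 * pad - f) // stride + 1

# (filter size, stride, padding, number of filters); the filter depth is
# inferred from the previous layer's depth, as described above.
convs = [(11, 4, 0, 96), (5, 1, 2, 256), (3, 1, 1, 384),
         (3, 1, 1, 384), (3, 1, 1, 256)]
pool_after = {0, 1, 4}   # 3 x 3 max-pooling at stride 2 follows these convs

n, depth, params = 227, 3, 0
for i, (f, s, p, c) in enumerate(convs):
    params += (f * f * depth + 1) * c    # weights plus one bias per filter
    n, depth = conv_out(n, f, s, p), c
    if i in pool_after:
        n = conv_out(n, 3, 2)            # max-pooling

fc_in = n * n * depth                    # 6 * 6 * 256 = 9216
for out_dim in (4096, 4096, 1000):       # the three fully connected layers
    params += (fc_in + 1) * out_dim
    fc_in = out_dim

print(params)    # 62378344, i.e. the 62.3 million quoted above
```

Note how heavily the totals are dominated by the first fully connected layer (about 37.8 million of the 62.3 million parameters), which is one motivation for the smaller fully connected heads used in later architectures.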
VGG
• VGG stands for Visual Geometry Group; it is a
standard deep Convolutional Neural Network
(CNN) architecture with multiple layers.
• The “deep” refers to the number of layers, with VGG-16 and VGG-19 consisting of 16 and 19 weight layers (convolutional plus fully connected), respectively.
• The VGG16 model achieves approximately 92.7% top-5 test accuracy on ImageNet. ImageNet is a dataset consisting of more than 14 million images, of which the ILSVRC benchmark subset spans 1000 classes.
• Moreover, it was one of the most popular models submitted to
ILSVRC-2014.
• It replaces the large kernel-sized filters with several 3×3
kernel-sized filters one after the other, thereby making
significant improvements over AlexNet.
• The VGG16 model was trained using Nvidia Titan Black
GPUs for multiple weeks.
• The VGG-16 consists of 13 convolutional layers and
three fully connected layers.

• Input: The VGGNet takes in an image input size of 224×224. For the ImageNet competition, the creators of the model cropped out the center 224×224 patch in each image to keep the input size of the image consistent.
• Convolutional Layers: VGG’s convolutional layers leverage a minimal receptive field, i.e., 3×3, the smallest possible size that still captures up/down and left/right. Moreover, there are also 1×1 convolution filters acting as a linear transformation of the input. Each convolution is followed by a ReLU unit, an innovation popularized by AlexNet that reduces training time. ReLU stands for rectified linear unit activation function; it is a piecewise linear function that outputs the input if it is positive and zero otherwise. The convolution stride is fixed at 1 pixel to preserve the spatial resolution after convolution (stride is the number of pixel shifts over the input matrix).
• Hidden Layers: All the hidden layers in the
VGG network use ReLU.
• Fully-Connected Layers: The VGGNet has three fully connected layers. Of the three, the first two have 4096 channels each, and the third has 1000 channels, one for each class.
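The benefit of replacing large kernels with stacks of 3×3 filters, mentioned above, can be made concrete: two stacked 3×3 convolutions cover the same 5×5 receptive field as one 5×5 convolution, and three cover a 7×7 field, but with fewer parameters. A sketch, ignoring biases and assuming the same number C of input and output channels throughout (the helper name `stacked_params` is illustrative only):

```python
def stacked_params(f, n_layers, channels):
    """Weight count (ignoring biases) for n stacked f x f conv layers,
    each with `channels` input channels and `channels` output channels."""
    return n_layers * f * f * channels * channels

C = 256
# Two stacked 3x3 layers see a 5x5 receptive field; three see 7x7.
print(stacked_params(3, 2, C))   # 1179648  (two 3x3 layers)
print(stacked_params(5, 1, C))   # 1638400  (one 5x5 layer)
print(stacked_params(3, 3, C))   # 1769472  (three 3x3 layers)
print(stacked_params(7, 1, C))   # 3211264  (one 7x7 layer)
```

The stacked version is cheaper in both cases, and the extra ReLU between the stacked layers adds non-linearity, which is the improvement over AlexNet that the text refers to.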
