Unit 2 CNN
CNN Architecture
• Convolutional neural networks are biologically inspired
networks that are used in computer vision for image
classification and object detection.
• In the convolutional neural network architecture, each
layer of the network is 3-dimensional, with a spatial
extent and a depth corresponding to the number of
features (channels).
• The notion of depth of a single layer in a convolutional
neural network is distinct from the notion of depth in
terms of the number of layers.
• In the input layer, these features correspond to the color
channels like RGB (i.e., red, green, blue), and in the
hidden layers these features represent hidden feature
maps that encode various types of shapes in the image.
• If the input is in grayscale (like LeNet-5), then the input
layer will have a depth of 1, but later layers will still be 3-
dimensional.
• The architecture contains two types of layers, referred to
as convolution layers and subsampling layers.
• For the convolution layers, a convolution operation is
defined, in which a filter is used to map the activations
from one layer to the next.
• A convolution operation uses a 3-dimensional filter of
weights with the same depth as the current layer but with
a smaller spatial extent.
• The dot product between all the weights in the filter and
any choice of spatial region (of the same size as the filter)
in a layer defines the value of the hidden state in the next
layer.
• The operation between the filter and the spatial regions in
a layer is performed at every possible position in order to
define the next layer, as in the sketch below.
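To make the layer-to-layer mapping concrete, the following is a minimal NumPy sketch of the convolution operation described above. The variable names and the example sizes (a 32 × 32 × 3 input with six 5 × 5 × 3 filters) are illustrative assumptions, not values taken from the text.

import numpy as np

def conv_layer(activations, filters, stride=1):
    """activations: (H, W, D) layer; filters: (K, F, F, D) bank of K filters of depth D."""
    H, W, D = activations.shape
    K, F, _, _ = filters.shape
    out_h = (H - F) // stride + 1
    out_w = (W - F) // stride + 1
    out = np.zeros((out_h, out_w, K))
    for k in range(K):                      # one feature map per filter
        for i in range(out_h):
            for j in range(out_w):
                # spatial region of the same size (and depth) as the filter
                region = activations[i*stride:i*stride+F, j*stride:j*stride+F, :]
                # dot product of all the filter weights with the region
                out[i, j, k] = np.sum(region * filters[k])
    return out

x = np.random.rand(32, 32, 3)     # e.g., a small RGB input layer
w = np.random.rand(6, 5, 5, 3)    # six filters of size 5 x 5 x 3
print(conv_layer(x, w).shape)     # (28, 28, 6): spatial size (32 - 5)/1 + 1 = 28, depth 6

Each output value is the dot product of one filter with one spatial region, so the depth of the next layer equals the number of filters used.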
AlexNet
• AlexNet was the winner of the 2012 ILSVRC competition.
• It has 8 layers with learnable parameters.
• The input to the Model is RGB images.
• It has 5 convolution layers with a combination of max-
pooling layers.
• Then it has 3 fully connected layers.
• The activation function used in all the hidden layers is ReLU.
• It used two Dropout layers.
• The activation function used in the output layer is
Softmax.
• The total number of parameters in this architecture is 62.3
million.
• AlexNet starts with 224 × 224 × 3 images and uses 96
filters of size 11 × 11 × 3 in the first layer.
• A stride of 4 is used. This results in a first layer of size 55
× 55 × 96.
• After the first layer has been computed, a max-pooling
layer is used.
• The ReLU activation function was applied after each
convolutional layer, which was followed by response
normalization and max-pooling.
• The second convolutional layer uses the response-
normalized and pooled output of the first convolutional
layer and filters it with 256 filters of size 5 × 5 × 96.
• The sizes of the filters of the third, fourth, and fifth
convolutional layers are 3 × 3 × 256 (with 384 filters), 3
× 3 × 384 (with 384 filters), and 3 × 3 × 384 (with 256
filters), respectively.
• All max-pooling layers used 3 × 3 filters at stride 2.
• The fully connected layers have 4096 neurons each. The final
set of 4096 activations can be treated as a 4096-
dimensional representation of the image.
• The final layer of AlexNet uses a 1000-way softmax in
order to perform the classification (see the sketch below).
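The layer sizes listed above can be summarized in a hedged PyTorch sketch. This is a single-network view (the original AlexNet split some layers across two GPUs), and the padding values below are assumptions chosen so that the spatial sizes come out to 55, 27, 13, and 6; the final softmax is normally applied by the loss function rather than inside the model.

import torch
import torch.nn as nn

alexnet = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2),   # 224 x 224 x 3 -> 55 x 55 x 96
    nn.ReLU(),
    nn.LocalResponseNorm(5),                                  # response normalization
    nn.MaxPool2d(kernel_size=3, stride=2),                    # -> 27 x 27 x 96
    nn.Conv2d(96, 256, kernel_size=5, padding=2),             # -> 27 x 27 x 256
    nn.ReLU(),
    nn.LocalResponseNorm(5),
    nn.MaxPool2d(kernel_size=3, stride=2),                    # -> 13 x 13 x 256
    nn.Conv2d(256, 384, kernel_size=3, padding=1),            # -> 13 x 13 x 384
    nn.ReLU(),
    nn.Conv2d(384, 384, kernel_size=3, padding=1),            # -> 13 x 13 x 384
    nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1),            # -> 13 x 13 x 256
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),                    # -> 6 x 6 x 256
    nn.Flatten(),
    nn.Dropout(0.5),
    nn.Linear(256 * 6 * 6, 4096),                             # first fully connected layer
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(4096, 4096),                                    # 4096-dimensional representation
    nn.ReLU(),
    nn.Linear(4096, 1000),                                    # 1000-way classification scores
)

x = torch.randn(1, 3, 224, 224)
print(alexnet(x).shape)                                       # torch.Size([1, 1000])
print(sum(p.numel() for p in alexnet.parameters()))           # roughly 62.3 million parameters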
VGG
• VGG stands for Visual Geometry Group; it is a
standard deep Convolutional Neural Network
(CNN) architecture with multiple layers.
• The “deep” refers to the number of layers, with
VGG-16 and VGG-19 consisting of 16 and 19 layers
with weights (convolutional plus fully connected), respectively.
• The VGG16 model achieves about 92.7% top-5 test accuracy
on ImageNet. ImageNet is a dataset of more than 14
million images; the ILSVRC benchmark used here covers 1000 classes.
• Moreover, it was one of the most popular models submitted to
ILSVRC-2014.
• It replaces large kernel-sized filters (such as the 11 × 11
and 5 × 5 filters in AlexNet) with several 3 × 3 filters
one after the other, thereby making significant
improvements over AlexNet (see the sketch after this list).
• The VGG16 model was trained using Nvidia Titan Black
GPUs for multiple weeks.
• The VGG-16 consists of 13 convolutional layers and
three fully connected layers.
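The effect of stacking 3 × 3 filters can be seen in a small, hedged PyTorch comparison; the channel count below (256) is an assumption chosen only for illustration. Two stacked 3 × 3 convolutions cover the same 5 × 5 receptive field as a single 5 × 5 convolution, but with fewer parameters and an extra non-linearity in between, which is the trade-off VGG exploits throughout its 13 convolutional layers.

import torch
import torch.nn as nn

c = 256  # assumed number of input and output channels

one_5x5 = nn.Conv2d(c, c, kernel_size=5, padding=2)
two_3x3 = nn.Sequential(
    nn.Conv2d(c, c, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(c, c, kernel_size=3, padding=1), nn.ReLU(),
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(one_5x5))   # 25*c*c + c    = 1,638,656 weights
print(count(two_3x3))   # 2*(9*c*c + c) = 1,180,160 weights

x = torch.randn(1, c, 56, 56)
print(one_5x5(x).shape, two_3x3(x).shape)   # both preserve the 56 x 56 spatial size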