Lecture-CNN
$$K = \begin{pmatrix} 1 & 0 & -1 \\ 1 & 0 & -1 \\ 1 & 0 & -1 \end{pmatrix}$$
• In a CNN, the kernel weights are reused at every position of the input, which
greatly reduces the number of parameters that need to be estimated
• If a filter is applied many times without padding, the output shrinks at every step and can eventually vanish.
• To avoid this, we can apply a very simple technique called zero padding (padding for
short).
• The general formula for the output width $O$ is
$$O = \frac{W - K + 2P}{S} + 1$$
• W is the width of the input;
• K is the kernel size;
• P is the padding;
• S is the stride
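A minimal sketch of the output-size formula O = (W - K + 2P)/S + 1. The function name `conv_output_size` is illustrative, and rounding down when the stride does not divide evenly is an assumption that matches the usual behaviour of 'valid' convolutions.

```python
def conv_output_size(w, k, p=0, s=1):
    """Output width for input width w, kernel size k, padding p, stride s.

    Assumption: fractional results are rounded down.
    """
    return (w - k + 2 * p) // s + 1


# Without padding a 3x3 kernel shrinks the image by 2 pixels per application.
width = 7
sizes = [width]
while width >= 3:
    width = conv_output_size(width, k=3)
    sizes.append(width)
print(sizes)

# With padding P = 1 the same kernel keeps the width unchanged.
print(conv_output_size(7, k=3, p=1))
```

The first AlexNet layer in this lecture (224-pixel input, 11 x 11 kernel, stride 4, no padding) can be checked the same way with `conv_output_size(224, 11, 0, 4)`.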
Impact of Padding
With padding
Standard CNN Architectures
• CNNs typically run many
convolutions in parallel
• Images are often multi-channel
• Examples include RGB images
2D convolution Illustration
Convolution in RGB images
• RGB images contain 3 channels representing the intensities of the red,
green and blue components
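A rough sketch of how one filter is applied to a multi-channel image: the kernel has one slice per channel, and the products are summed over all channels to give a single output map. The function name `conv2d_multichannel` is hypothetical, and the sketch assumes stride 1, no padding, and the cross-correlation convention used by most deep learning libraries.

```python
import numpy as np


def conv2d_multichannel(image, kernel):
    """Apply one (kh, kw, C) kernel to an (H, W, C) image.

    Assumptions: stride 1, no padding, no kernel flipping.
    Returns a single-channel map of shape (H - kh + 1, W - kw + 1).
    """
    H, W, C = image.shape
    kh, kw, kc = kernel.shape
    assert C == kc, "kernel must have one slice per image channel"
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Sum over the spatial window AND over all channels.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw, :] * kernel)
    return out


rgb = np.ones((5, 5, 3))      # toy "RGB image" of ones
kernel = np.ones((3, 3, 3))   # one 3x3 filter spanning the 3 channels
feature_map = conv2d_multichannel(rgb, kernel)
print(feature_map.shape)
```

Running several such kernels in parallel and stacking their outputs gives the multi-channel feature maps that the next layer consumes.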
ImageNet Competition
• 14 million images in over 20,000 categories
• Cat, dog, balloon, strawberry
• over 1 million images have boxes around the object of interest
• Annotated by crowdsourcing
• Led by Fei-Fei Li
ImageNet Competition
• Classification Error of winners over the years
AlexNet Architecture
Keras implementation

from keras.models import Sequential
from keras.layers import (Conv2D, MaxPooling2D, Activation,
                          Flatten, Dense, Dropout)

model = Sequential()

# 1st Convolutional Layer
model.add(Conv2D(filters=96, input_shape=(224,224,3), kernel_size=(11,11),
                 strides=(4,4), padding='valid'))
model.add(Activation('relu'))
# Max Pooling
model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2), padding='valid'))

# 2nd Convolutional Layer
model.add(Conv2D(filters=256, kernel_size=(11,11),
                 strides=(1,1), padding='valid'))
model.add(Activation('relu'))
# Max Pooling
model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2), padding='valid'))

# 3rd Convolutional Layer
model.add(Conv2D(filters=384, kernel_size=(3,3),
                 strides=(1,1), padding='valid'))
model.add(Activation('relu'))

# (The 4th convolutional layer does not appear on the slide.)

# 5th Convolutional Layer
model.add(Conv2D(filters=256, kernel_size=(3,3),
                 strides=(1,1), padding='valid'))
model.add(Activation('relu'))
# Max Pooling
model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2), padding='valid'))

# Passing it to a Fully Connected layer
model.add(Flatten())
# 1st Fully Connected Layer (input size is inferred from Flatten())
model.add(Dense(4096))
model.add(Activation('relu'))
# Add Dropout to prevent overfitting
model.add(Dropout(0.4))

# 2nd Fully Connected Layer
model.add(Dense(4096))
model.add(Activation('relu'))
# Add Dropout
model.add(Dropout(0.4))

# 3rd Fully Connected Layer
model.add(Dense(1000))
model.add(Activation('relu'))
# Add Dropout
model.add(Dropout(0.4))

# Output Layer (17 classes)
model.add(Dense(17))
model.add(Activation('softmax'))

model.summary()
• LRN example parameters: $k = 0$, $\alpha = 1$, $\beta = 1$, $n = 2$
Intra-Channel LRN Example
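• The intra-channel normalization works as follows:
$$b^{k}_{x,y} = \frac{a^{k}_{x,y}}{\left(k + \alpha \sum_{i=\max(0,\,x-n/2)}^{\min(W,\,x+n/2)} \; \sum_{j=\max(0,\,y-n/2)}^{\min(H,\,y+n/2)} \left(a^{k}_{i,j}\right)^{2}\right)^{\beta}}$$
• Example: $k = 0$, $\alpha = 1$, $\beta = 1$, $W = H = 8$, $n = 2$
• $a^{k}_{x,y}$ is the value at position $(x, y)$ in channel $k$; the superscript $k$ is the channel index, not the constant $k$ in the denominator
• $a$ should be distinguished from $\alpha$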
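A minimal NumPy sketch of intra-channel local response normalization: each value is divided by (k + α · sum of squares in its n-neighbourhood within the same channel) raised to the power β. The function name `intra_channel_lrn` is illustrative. As an assumption for 0-based array indexing, the window is clipped to the last valid index (W - 1 and H - 1).

```python
import numpy as np


def intra_channel_lrn(a, k=0.0, alpha=1.0, beta=1.0, n=2):
    """Normalize one channel `a` (2D array) by its spatial neighbourhood.

    Assumption: the window around each position extends n/2 cells in each
    direction and is clipped at the borders of the array.
    """
    rows, cols = a.shape
    half = n // 2
    b = np.empty_like(a, dtype=float)
    for x in range(rows):
        for y in range(cols):
            x0, x1 = max(0, x - half), min(rows - 1, x + half)
            y0, y1 = max(0, y - half), min(cols - 1, y + half)
            window = a[x0:x1 + 1, y0:y1 + 1]
            b[x, y] = a[x, y] / (k + alpha * np.sum(window ** 2)) ** beta
    return b


# Parameters from the example: k = 0, alpha = 1, beta = 1, W = H = 8, n = 2.
channel = np.ones((8, 8))
normalized = intra_channel_lrn(channel, k=0.0, alpha=1.0, beta=1.0, n=2)
print(normalized[0, 0], normalized[0, 3], normalized[3, 3])
```

With k = 0 the denominator is zero wherever the whole window is zero, so real inputs usually need k > 0.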
Batch Normalization
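• Batch normalization aims to reduce internal covariate shift, i.e. the change in the distribution of a layer's inputs as the earlier layers are updated during training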
Batch Normalization (BN) Algorithm
• For the values $x$ in a mini-batch, we scale and shift them so that they
have the same mean and standard deviation
• The inputs in a batch are $x_1, \dots, x_m$
• We calculate the mean and variance of the batch, which we denote
as $\mu_B$ and $\sigma_B^2$
• We calculate the z-score for each value in the batch:
$$z_i = \sigma_B^{-1}\,(x_i - \mu_B)$$
• The normalized output is given by
$$y_i = \gamma z_i + \beta$$
where $\gamma$ and $\beta$ are learned scale and shift parameters
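A minimal NumPy sketch of the batch normalization steps: compute the batch mean and variance, form the z-scores, then scale by γ and shift by β. The function name `batch_norm` and the small constant `eps` are assumptions; `eps` is added only to avoid division by zero and does not appear in the formulas on the slide. In a real network γ and β are learned, here they are fixed numbers.

```python
import numpy as np


def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Training-time batch normalization over axis 0 (the batch axis)."""
    mu = x.mean(axis=0)                  # batch mean, one per feature
    var = x.var(axis=0)                  # batch variance, one per feature
    z = (x - mu) / np.sqrt(var + eps)    # z-score
    return gamma * z + beta              # scale and shift


batch = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])
y = batch_norm(batch, gamma=2.0, beta=1.0)
print(y.mean(axis=0), y.std(axis=0))
```

After normalization every feature has mean close to β and standard deviation close to γ, regardless of its original scale.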
ZFNet
• ZFNet was the winner of the 2013 ImageNet competition
• Heavily based upon AlexNet
VGGNet
• M denotes max pooling
• LRN denotes local response
normalization
• Different columns represent
different versions of VGGNet
• A and C represent smaller networks
• VGG16 and VGG19 are columns D and E
VGGNet Properties
• Although VGGNet did not win the contest, it introduced design ideas
that were widely used later on
• Its distinguishing feature is the use of small filters at great depth
• A stack of small filters needs fewer parameters than one large filter with the same receptive field, and can potentially extract more interesting features.
• Greater depth means more ReLU activations and more non-linearity
• VGGNet is very slow to train
• A workaround is to train a smaller network first, and then use its trained weights
to initialize the deeper network for further training
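A short arithmetic sketch of why small filters at great depth save parameters. Three stacked 3 x 3 convolutions (stride 1) cover the same 7 x 7 receptive field as a single 7 x 7 convolution. The helper `conv_params` and the channel count C = 64 are illustrative assumptions, and biases are ignored.

```python
def conv_params(kernel_size, in_channels, out_channels):
    """Number of weights in one convolutional layer (biases ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels


C = 64  # illustrative channel count, kept constant across layers

stacked_3x3 = 3 * conv_params(3, C, C)   # 3 layers: 3 * 9 * C^2 = 27 C^2
single_7x7 = conv_params(7, C, C)        # 1 layer:  49 * C^2

print(stacked_3x3, single_7x7, stacked_3x3 / single_7x7)
```

The stacked version also applies three non-linearities instead of one.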
GoogLeNet
• GoogLeNet introduced a novel concept, the Inception module
• Image features can occur at different scales
• Using filters of different sizes in parallel helps extract features of different
granularity
GoogLeNet Implementation - Bottleneck
• Before applying the filters of different sizes, we first apply a 1 × 1 convolution
(the bottleneck) to collapse the channels and reduce the dimension of the
data
• Recall: inception V1
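A rough count of how a 1 x 1 bottleneck reduces the weights of a larger filter. The channel numbers below (192 input channels, 16 bottleneck channels, 32 output filters of size 5 x 5) are illustrative assumptions, and `conv_params` is a hypothetical helper that ignores biases.

```python
def conv_params(kernel_size, in_channels, out_channels):
    """Number of weights in one convolutional layer (biases ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels


in_ch, bottleneck_ch, out_ch = 192, 16, 32  # illustrative sizes

# 5x5 convolution applied directly to all input channels.
direct = conv_params(5, in_ch, out_ch)

# 1x1 convolution collapses the channels first, then the 5x5 runs on fewer.
with_bottleneck = (conv_params(1, in_ch, bottleneck_ch)
                   + conv_params(5, bottleneck_ch, out_ch))

print(direct, with_bottleneck)
```

The same reduction applies to the number of multiplications per output position, which is why the bottleneck makes the wide Inception module affordable.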
GoogLeNet – Output Layer
• Another interesting feature of GoogLeNet lies in the output layer
• Instead of using a fully connected layer at the end, GoogLeNet uses
average pooling, which reduces the number of parameters and
improves the performance (by 0.7%).
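A small sketch of the parameter saving from average pooling at the output. It assumes a final feature map of size 7 x 7 x 1024 and 1000 classes (illustrative numbers), and compares a dense classifier on the flattened map with one on the globally averaged map.

```python
import numpy as np

H, W, C, n_classes = 7, 7, 1024, 1000  # illustrative sizes

rng = np.random.default_rng(0)
feature_map = rng.normal(size=(H, W, C))

# Global average pooling: one number per channel.
pooled = feature_map.mean(axis=(0, 1))

# Weights needed by a dense classifier in each case (biases ignored).
weights_on_flattened = H * W * C * n_classes
weights_on_pooled = C * n_classes

print(pooled.shape, weights_on_flattened, weights_on_pooled)
```

Averaging removes the spatial dimensions before classification, so the classifier needs 49 times fewer weights in this example.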
GoogLeNet Dimensions
GoogLeNet
• Overall architecture
ResNet
• ResNet is often cited as the first deep learning model to attain human-level
accuracy on ImageNet classification
• Its gains come mostly from fitting much deeper networks
• It is generally believed that deeper networks should always
help with complex problems
• A shallow network is a special case of a deep network whose extra layers are identity
mappings
• Yet fitting a deep network is often not easy
• Convergence can take very long;
• Gradients may explode or vanish
ResNet Key Algorithm
• One key idea behind ResNet is the skip connection: by copying a layer's input
forward and adding it to the output of later layers, we allow some portions of the
data to be fitted by effectively shallower networks than other parts of the data
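A minimal NumPy sketch of a residual block, y = ReLU(F(x) + x), where the input is copied past the layers F and added back. For brevity F is built from two dense layers rather than convolutions, which is an assumption of this sketch; `residual_block`, `w1` and `w2` are illustrative names.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def residual_block(x, w1, w2):
    """y = relu(F(x) + x), with F(x) = relu(x @ w1) @ w2."""
    fx = relu(x @ w1) @ w2   # the residual branch F(x)
    return relu(fx + x)      # skip connection adds the input back


x = np.array([[1.0, -2.0, 3.0]])

# If the residual branch learns nothing (all-zero weights), the block
# simply passes its input through (up to the final ReLU).
zeros = np.zeros((3, 3))
out = residual_block(x, zeros, zeros)
print(out)
```

Because a block can fall back to passing its input through, adding more blocks need not hurt, and the gradient has a direct path through the additions.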
ResNet Implementation
Other Image Processing Applications
• In addition to classification, there are often more complex tasks that
are based upon similar models
• Object localization
• Object detection
• Segmentation
• Video processing
Object Localization
• Question: can we draw a box around the object that we believe is present in
the image?
• Object localization can naturally be built on top of a classification network,
e.g. AlexNet
• Instead of using the last few fully connected layers for
classification, we train them as a regression model
• The box surrounding the object is defined by 4 numbers:
$(x, y, w, h)$
• $(x, y)$ are the coordinates of the top-left pixel of the box; $w$ and $h$ are the width and
height of the box
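A small sketch of the box representation (x, y, w, h) and of a squared-error loss that a regression head could minimize. The squared-error loss is an assumption chosen for illustration; the lecture only says that a regression model is trained. Function names are hypothetical.

```python
def box_to_corners(box):
    """Convert (x, y, w, h), with (x, y) the top-left corner, to
    (x_min, y_min, x_max, y_max)."""
    x, y, w, h = box
    return (x, y, x + w, y + h)


def squared_error(pred_box, true_box):
    """Sum of squared differences over the 4 box numbers."""
    return sum((p - t) ** 2 for p, t in zip(pred_box, true_box))


true_box = (10, 20, 30, 40)   # x, y, w, h
pred_box = (12, 20, 30, 37)   # a slightly wrong prediction

print(box_to_corners(true_box))
print(squared_error(pred_box, true_box))
```

The network outputs the 4 numbers directly, and training pushes this error toward zero.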
Object Detection
• Often there is an unknown number of objects in the image
• If we somehow know a region that contains some object, we can run
a standard CNN model to classify it
• So the problem boils down to finding region proposals:
Region-based CNN
• One intuitive idea is to scan different regions of the image at different
resolutions (but limit the number of regions to under 2,000)
• The basic algorithm is the following:
• Run Selective Search to generate regions that probably contain objects.
• Feed these patches to a CNN, followed by an SVM, to predict the class of each
patch.
• Refine the patches by training a bounding-box regression separately.
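A rough sketch of the idea of scanning regions at several resolutions while capping their number. This is a plain multi-scale sliding window, used here as a simple stand-in; it is not Selective Search, which merges similar image segments instead. The function name, window sizes and stride are all illustrative assumptions.

```python
def sliding_window_regions(img_w, img_h, sizes=(64, 128, 256),
                           stride_frac=0.5, max_regions=2000):
    """Return up to max_regions square boxes (x, y, w, h) at several scales."""
    regions = []
    for size in sizes:
        if size > img_w or size > img_h:
            continue  # window does not fit in the image
        step = max(1, int(size * stride_frac))
        for y in range(0, img_h - size + 1, step):
            for x in range(0, img_w - size + 1, step):
                regions.append((x, y, size, size))
                if len(regions) >= max_regions:
                    return regions
    return regions


regions = sliding_window_regions(512, 512)
capped = sliding_window_regions(512, 512, max_regions=100)
print(len(regions), len(capped))
```

Each returned box would then be cropped, resized and passed to the CNN classifier.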
Spatial Pyramid Pooling
• A region of the original image corresponds to a region of the CNN output
(the feature map)
• Pooling can therefore be designed on the feature map instead of on image patches
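A minimal NumPy sketch of spatial pyramid pooling on a feature map: the map is max-pooled over 1 x 1, 2 x 2 and 4 x 4 grids and the results are concatenated, so the output length is fixed whatever the spatial size of the input region. The function name and the pyramid levels are assumptions, and the sketch requires the map to be at least as large as the finest grid.

```python
import numpy as np


def spatial_pyramid_pool(fmap, levels=(1, 2, 4)):
    """Max-pool an (H, W, C) feature map over n x n grids for n in levels.

    Assumption: H and W are both >= max(levels), so no grid cell is empty.
    Returns a vector of length C * sum(n * n for n in levels).
    """
    H, W, C = fmap.shape
    pooled = []
    for n in levels:
        # Integer bin edges that cover the whole map.
        h_edges = np.linspace(0, H, n + 1).astype(int)
        w_edges = np.linspace(0, W, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                cell = fmap[h_edges[i]:h_edges[i + 1],
                            w_edges[j]:w_edges[j + 1], :]
                pooled.append(cell.max(axis=(0, 1)))
    return np.concatenate(pooled)


rng = np.random.default_rng(0)
region_a = rng.normal(size=(13, 13, 8))   # two regions of different size
region_b = rng.normal(size=(10, 7, 8))
vec_a = spatial_pyramid_pool(region_a)
vec_b = spatial_pyramid_pool(region_b)
print(vec_a.shape, vec_b.shape)
```

Because both vectors have the same length, regions of any size can be fed to the same fully connected layers.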
YOLO (You Only Look Once)