Convolutional Networks
By
Prof K.Khadar Nawas
11×14 pixels
Features
• Features
– Image features provide rich information about the
image content
– e.g., edges and interest points
– They correspond to local regions in the image
– Used for recognition, matching, and reconstruction
Traditional features vs. CNN
A few traditional features used in computer vision:
– Harris Corner Detection
– Scale-Invariant Feature Transform (SIFT)
– Speeded-Up Robust Features (SURF)
• CNN
– Extracts strong, complex features
– Learns task-specific features and is much
more efficient
Filters
• A weight matrix applied to extract local-region
features from an image
• Many filters can be used to extract more
features (a minimal sketch follows below)
• Typical image filter
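A minimal sketch (illustrative, not from the slides) of applying one hand-crafted 3×3 edge filter to an image with TensorFlow; the image here is random and stands in for real data:

import numpy as np
import tensorflow as tf

image = tf.random.uniform((1, 28, 28, 1))             # [batch, height, width, channels]
sobel_x = np.array([[-1., 0., 1.],
                    [-2., 0., 2.],
                    [-1., 0., 1.]], dtype=np.float32)  # horizontal edge filter
kernel = sobel_x.reshape(3, 3, 1, 1)                   # [kh, kw, in_channels, out_channels]
edges = tf.nn.conv2d(image, kernel, strides=1, padding='SAME')
print(edges.shape)                                     # (1, 28, 28, 1): one feature map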
Image checking
Alex Krizhevsky et al., 2012, "ImageNet Classification with Deep Convolutional Neural Networks"
Full (simplified) AlexNet architecture:
[227x227x3] INPUT
[55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0
[27x27x96] MAX POOL1: 3x3 filters at stride 2
[27x27x96] NORM1: Normalization layer
[27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2
[13x13x256] MAX POOL2: 3x3 filters at stride 2
[13x13x256] NORM2: Normalization layer
[13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1
[13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1
[13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1
[6x6x256] MAX POOL3: 3x3 filters at stride 2
[4096] FC6: 4096 neurons
[4096] FC7: 4096 neurons
[1000] FC8: 1000 neurons (class scores)
Details/Retrospectives:
- first use of ReLU
- used Norm layers (not common anymore)
- data augmentation
- dropout 0.5
- batch size 128
- SGD Momentum 0.9
- Learning rate 1e-2,
- L2 weight decay 5e-4
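A minimal Keras sketch of the simplified architecture listed above (the NORM layers are omitted here for brevity; layer sizes follow the listing):

import tensorflow as tf

alexnet = tf.keras.Sequential([
    tf.keras.layers.Conv2D(96, 11, strides=4, activation='relu',
                           input_shape=(227, 227, 3)),                   # CONV1 -> 55x55x96
    tf.keras.layers.MaxPooling2D(3, strides=2),                          # POOL1 -> 27x27x96
    tf.keras.layers.Conv2D(256, 5, padding='same', activation='relu'),   # CONV2 -> 27x27x256
    tf.keras.layers.MaxPooling2D(3, strides=2),                          # POOL2 -> 13x13x256
    tf.keras.layers.Conv2D(384, 3, padding='same', activation='relu'),   # CONV3
    tf.keras.layers.Conv2D(384, 3, padding='same', activation='relu'),   # CONV4
    tf.keras.layers.Conv2D(256, 3, padding='same', activation='relu'),   # CONV5
    tf.keras.layers.MaxPooling2D(3, strides=2),                          # POOL3 -> 6x6x256
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(4096, activation='relu'),                      # FC6
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(4096, activation='relu'),                      # FC7
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1000),                                         # FC8 (class scores)
])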
ImageNet Dataset
• 15 million labeled high-resolution images,
belonging to roughly 22,000 categories
• The images were collected from the web
• The ImageNet Large-Scale Visual Recognition
Challenge (ILSVRC) started in 2010
• ILSVRC uses a subset of ImageNet with roughly
1000 images in each of 1000 categories
• ImageNet consists of variable-resolution images
• roughly 1.2 million training images
• 50,000 validation images
• 150,000 testing images
Transfer Learning
ResNet – Residual Network
(34, 50, 101, 152 layers)
Deep Residual Learning for Image Recognition
Kaiming He Xiangyu Zhang Shaoqing Ren Jian Sun
Microsoft Research
• Very deep network
• Up to 152 layers
• Won 1st place in the ILSVRC 2015 classification task
Stacking CNN deep
F(x) := H(x) − x
Fitting the residual:
H(x) = F(x) + x
• H(x) is the underlying mapping to be learned
• F(x) + x can be realized by feedforward neural
networks with "shortcut connections" acting as
identity mappings
– The shortcut allows the gradient to be directly
backpropagated to earlier layers
• It adds no extra parameters and no extra
computation
• Training is done by backpropagation with SGD
Full ResNet architecture
ResNet Architecture
• Stack residual blocks
• Each residual block uses 3×3 conv layers
• Periodically double the number of filters and
downsample spatially using stride 2
• An additional conv layer at the beginning
• No FC layers at the end (only the final FC layer for class scores)
34-layer Plain vs. Residual network
Training Parameters
• Batch Normalization after every CONV layer
• Xavier/2 initialization (instead of random weights)
• SGD + Momentum (0.9)
• Learning rate 0.1, divided by 10 when the validation error plateaus
• Mini-batch size 256
• No dropout used
Batch Normalization
• A layer that normalizes the output of the
previous layer
• Acts as a form of regularization and helps to avoid overfitting
Building Resnet34- Identity block
def identity_block(x, filter):
    x_skip = x  # save the block input for the shortcut connection
    x = tf.keras.layers.Conv2D(filter, (3, 3), padding='same')(x)
    x = tf.keras.layers.BatchNormalization(axis=3)(x)
    x = tf.keras.layers.Activation('relu')(x)
    x = tf.keras.layers.Conv2D(filter, (3, 3), padding='same')(x)
    x = tf.keras.layers.BatchNormalization(axis=3)(x)
    x = tf.keras.layers.Add()([x, x_skip])  # identity shortcut: add the input back
    x = tf.keras.layers.Activation('relu')(x)
    return x
Putting together
https://www.analyticsvidhya.com/blog/2021/08/how-to-code-your-resnet-from-scratch-in-tensorflow/
# Define size of sub-blocks and initial filter size
block_layers = [3, 4, 6, 3]
filter_size = 64
# Step 3: Add the ResNet blocks
# (x_input, the initial conv/pooling layers and convolutional_block are
#  defined earlier in the linked article)
for i in range(4):
    if i == 0:
        for j in range(block_layers[i]):
            x = identity_block(x, filter_size)
    else:
        # One residual/convolutional block followed by identity blocks
        # The filter size increases by a factor of 2 at each stage
        filter_size = filter_size * 2
        x = convolutional_block(x, filter_size)
        for j in range(block_layers[i] - 1):
            x = identity_block(x, filter_size)
# Step 4: End with the dense network
x = tf.keras.layers.AveragePooling2D((2, 2), padding='same')(x)
x = tf.keras.layers.Flatten()(x)
x = tf.keras.layers.Dense(512, activation='relu')(x)
x = tf.keras.layers.Dense(classes, activation='softmax')(x)
model = tf.keras.models.Model(inputs=x_input, outputs=x, name="ResNet34")
return model
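The loop above also calls convolutional_block, which is defined earlier in the linked article; a sketch following the same pattern as identity_block, with stride-2 downsampling and a 1×1 conv on the shortcut so the shapes match:

def convolutional_block(x, filter):
    x_skip = x
    x = tf.keras.layers.Conv2D(filter, (3, 3), padding='same', strides=(2, 2))(x)
    x = tf.keras.layers.BatchNormalization(axis=3)(x)
    x = tf.keras.layers.Activation('relu')(x)
    x = tf.keras.layers.Conv2D(filter, (3, 3), padding='same')(x)
    x = tf.keras.layers.BatchNormalization(axis=3)(x)
    # project the shortcut with a 1x1, stride-2 conv so its shape matches the main path
    x_skip = tf.keras.layers.Conv2D(filter, (1, 1), strides=(2, 2))(x_skip)
    x = tf.keras.layers.Add()([x, x_skip])
    x = tf.keras.layers.Activation('relu')(x)
    return x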
VGGNet
[Simonyan and Zisserman, 2014]
• Uses small 3×3 conv filters
• Deeper than AlexNet (16–19 layers vs. AlexNet's 8)
• Similar training procedure to AlexNet
• Simple, homogeneous design
• Top-5 error rate of 7.3% on ImageNet
• VGG16: 16-layer CNN
• 138M parameters
• Trained on 4 Nvidia Titan Black GPUs
for two to three weeks
Use of 3×3 filters: why 3×3?
• Stacked multiple times, small filters give a larger receptive field
• A stack of three 3×3 conv (stride 1) layers has the
same effective receptive field as one 7×7
conv layer, with fewer parameters and more non-linearities
(a quick check follows below)
• Effective receptive field (ERF): not all pixels in the
receptive field contribute equally to the
output unit's response
A key concept in deep CNNs is the receptive field, or field of view:
a unit in a convolutional network depends only on a region of the input.
This region in the input is the receptive field for that unit.
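A quick check (illustrative, not from the slides) that three stacked 3×3, stride-1 conv layers cover the same 7×7 input region as a single 7×7 conv:

def receptive_field(kernel_sizes, strides=None):
    strides = strides or [1] * len(kernel_sizes)
    rf, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump   # each layer widens the field by (k - 1) * current jump
        jump *= s
    return rf

print(receptive_field([3, 3, 3]))  # 7
print(receptive_field([7]))        # 7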
VGG16
VGG16 Keras Code
# Convolutional part of VGG16 (the classifier head is added at the end)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPool2D, Flatten, Dense

model = Sequential()
model.add(Conv2D(input_shape=(224,224,3), filters=64, kernel_size=(3,3), padding="same", activation="relu"))
model.add(Conv2D(filters=64, kernel_size=(3,3), padding="same", activation="relu"))
model.add(MaxPool2D(pool_size=(2,2), strides=(2,2)))
model.add(Conv2D(filters=128, kernel_size=(3,3), padding="same", activation="relu"))
model.add(Conv2D(filters=128, kernel_size=(3,3), padding="same", activation="relu"))
model.add(MaxPool2D(pool_size=(2,2), strides=(2,2)))
model.add(Conv2D(filters=256, kernel_size=(3,3), padding="same", activation="relu"))
model.add(Conv2D(filters=256, kernel_size=(3,3), padding="same", activation="relu"))
model.add(Conv2D(filters=256, kernel_size=(3,3), padding="same", activation="relu"))
model.add(MaxPool2D(pool_size=(2,2), strides=(2,2)))
model.add(Conv2D(filters=512, kernel_size=(3,3), padding="same", activation="relu"))
model.add(Conv2D(filters=512, kernel_size=(3,3), padding="same", activation="relu"))
model.add(Conv2D(filters=512, kernel_size=(3,3), padding="same", activation="relu"))
model.add(MaxPool2D(pool_size=(2,2), strides=(2,2)))
model.add(Conv2D(filters=512, kernel_size=(3,3), padding="same", activation="relu"))
model.add(Conv2D(filters=512, kernel_size=(3,3), padding="same", activation="relu"))
model.add(Conv2D(filters=512, kernel_size=(3,3), padding="same", activation="relu"))
model.add(MaxPool2D(pool_size=(2,2), strides=(2,2)))
# Classifier head (completes VGG16)
model.add(Flatten())
model.add(Dense(4096, activation="relu"))
model.add(Dense(4096, activation="relu"))
model.add(Dense(1000, activation="softmax"))
Transfer Learning
from tensorflow.keras.applications.vgg16 import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input
from tensorflow.keras import layers, models

# Pretrained convolutional base (assumed setup: frozen VGG16 without its top layers)
base_model = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
base_model.trainable = False

flatten_layer = layers.Flatten()
dense_layer_1 = layers.Dense(50, activation='relu')
dense_layer_2 = layers.Dense(20, activation='relu')
prediction_layer = layers.Dense(5, activation='softmax')

model = models.Sequential([
    base_model,
    flatten_layer,
    dense_layer_1,
    dense_layer_2,
    prediction_layer,
])

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# Then call model.fit() on the new dataset
GoogLeNet
• 22 layers
• Inception module
• 5 million parameters (12× fewer than AlexNet)
Inception module
• design a good local network topology (network within a network) and
then stack these modules on top of each other
• Used 9 Inception modules in the whole architecture
Naïve Inception module
Naïve Inception
Conv Ops:
[1x1 conv, 128] 28x28x128x1x1x256
[3x3 conv, 192] 28x28x192x3x3x256
[5x5 conv, 96] 28x28x96x5x5x256
Total: 854M ops (a quick arithmetic check follows below)
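A quick arithmetic check of the operation counts listed above (output H × W × #filters × kernel H × kernel W × input channels):

ops = [
    28 * 28 * 128 * 1 * 1 * 256,   # 1x1 conv, 128 filters
    28 * 28 * 192 * 3 * 3 * 256,   # 3x3 conv, 192 filters
    28 * 28 * 96 * 5 * 5 * 256,    # 5x5 conv, 96 filters
]
print([round(o / 1e6) for o in ops], round(sum(ops) / 1e6))   # [26, 347, 482] -> ~854M total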
1×1 convolutions
• A 1×1 convolution projects the full channel depth at each spatial position to a smaller number of channels, preserving the spatial size while reducing depth (a "bottleneck" before the expensive 3×3 and 5×5 convs)
Inception module with dimension reduction
Inception module - Feature Map
Concatenation
Inception module with dimension reduction
Conv Ops:
[1x1 conv, 64] 28x28x64x1x1x256
[1x1 conv, 64] 28x28x64x1x1x256
[1x1 conv, 128] 28x28x128x1x1x256
[3x3 conv, 192] 28x28x192x3x3x64
[5x5 conv, 96] 28x28x96x5x5x64
[1x1 conv, 64] 28x28x64x1x1x256
Total: 358M ops
Full GoogLeNet
Adapted from Fei-Fei Li, Justin Johnson & Serena Yeung lecture slides
Variants of CNN
• Segmentation
– Semantic
– Instance
• Localization
• Object Detection
• Image Captioning
Semantic Segmentation
• Dividing an image into regions that map to different semantic
classes
• Semantic segmentation understands the regions of an image at the pixel
level
• Each pixel is labeled with a class label
• Approaches
– Sliding window to extract regions + CNN per patch
– Whole image + CNN
Pixel wise labeling
https://www.jeremyjordan.me/
Patch extraction (naïve idea)
• Computation is costly
• Features in overlapping patches are not reused
• Only the centre pixel of each patch is labeled
Convolution on High Res Image
• More convolutional layers
• Stack convolutions at the full image resolution
• Pixel-level prediction
• Huge computation on the original (high-resolution) image
Reducing the network
(down-sampling and up-sampling inside the network; a minimal sketch follows below)
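A minimal Keras sketch (assumed layer sizes, not from the slides) of the downsample-then-upsample idea for per-pixel prediction:

import tensorflow as tf

num_classes = 21  # hypothetical number of semantic classes

inputs = tf.keras.Input(shape=(128, 128, 3))
x = tf.keras.layers.Conv2D(64, 3, strides=2, padding='same', activation='relu')(inputs)      # downsample
x = tf.keras.layers.Conv2D(128, 3, strides=2, padding='same', activation='relu')(x)          # downsample
x = tf.keras.layers.Conv2DTranspose(64, 3, strides=2, padding='same', activation='relu')(x)  # upsample
x = tf.keras.layers.Conv2DTranspose(32, 3, strides=2, padding='same', activation='relu')(x)  # upsample
outputs = tf.keras.layers.Conv2D(num_classes, 1, activation='softmax')(x)                    # per-pixel class scores
model = tf.keras.Model(inputs, outputs)   # output shape: 128 x 128 x num_classes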
R-CNN
https://arxiv.org/pdf/1311.2524.pdf
• A CNN is used to perform forward propagation
on each region proposal to extract its features
• The features of each region proposal are used for
predicting its class and bounding box
• 4-step process
• Step 1: Perform selective search to extract
multiple high-quality region proposals from the
input image (~2000 regions)
• Proposed regions are usually selected at
multiple scales, with different shapes and sizes
• Each region proposal is labeled with a
class and a ground-truth bounding box
• Step 2: Choose a pretrained CNN and truncate
it before the output layer
• Resize each region proposal to the input size
required by the network, and
• output the extracted features for the region
proposal through forward propagation
• Step 3: Take the extracted features and labeled
class of each region proposal as a training example.
• Train multiple support vector machines to
classify objects, where each support vector
machine individually determines whether the
example contains a specific class
• Step 4: Take the extracted features and labeled
bounding box of each region proposal as a training
example. Train a linear regression model to
predict the ground-truth bounding box
R-CNN
Drawbacks
• Inference (detection) is slow: thousands of region
proposals are selected from a single input image,
and storing their extracted features takes a lot of disk space
• This requires thousands of CNN forward
propagations to perform object detection
• The massive computing load makes it infeasible to
use R-CNN widely in real-world applications
Fast R-CNN [Girshick, 2015].
• The main speed bottleneck of the R-CNN is the independent CNN
forward propagation for each region proposal,
with no shared computation
• The major improvement of the fast R-CNN over
the R-CNN is that CNN forward
propagation is performed only once, on the entire
image
• Step 1: Feature extraction is performed on the entire image,
rather than on individual region proposals
• The region proposals (of different shapes)
mark regions of interest (of different shapes)
on the CNN output
• Step 2: The fast R-CNN introduces the region of interest (RoI)
pooling layer, which extracts features of the same shape from each region of interest
• Step 3: Using a fully-connected layer,
transform the concatenated features
• Step 4: Predict the class and bounding box for
each of the region proposals
• The class prediction uses softmax regression
• In an ordinary pooling layer, the output shape is only
controlled indirectly via the sizes of the pooling
window, padding, and stride; the RoI pooling layer
instead lets us specify the output shape directly (see the sketch below)
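RoI pooling proper max-pools each region into a fixed grid; a commonly used stand-in in TensorFlow (bilinear sampling, as in RoI Align) is tf.image.crop_and_resize. Sizes and values below are illustrative only:

import tensorflow as tf

feature_map = tf.random.uniform((1, 32, 32, 256))          # CNN output for the whole image
rois = tf.constant([[0.1, 0.1, 0.6, 0.5],                   # normalized [y1, x1, y2, x2]
                    [0.3, 0.4, 0.9, 0.8]])
box_indices = tf.constant([0, 0])                           # both RoIs come from image 0
roi_features = tf.image.crop_and_resize(feature_map, rois, box_indices, crop_size=(7, 7))
print(roi_features.shape)                                   # (2, 7, 7, 256): fixed shape per RoI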
Faster R-CNN
• Reduces the number of region proposals without loss of
accuracy
• Compared with the fast R-CNN, the faster R-CNN
only changes the region proposal method
from selective search to a region proposal
network (RPN)
Region proposal network
• The RPN works in the following steps
• Step 1: Use a 3×3 convolutional layer with
padding of 1 to transform the CNN output into a
new output with c channels
• Each unit along the spatial dimensions of the
CNN-extracted feature maps thereby gets a new
feature vector of length c
• Step 2: Centered on each pixel of the feature
maps, generate multiple anchor boxes of
different scales and aspect ratios and label
them (a small sketch follows after these steps)
• Step 3: Using the length-c feature vector at
the center of each anchor box, predict the
binary class (background or objects) and
bounding box for this anchor box
• Step 4: Consider those predicted bounding
boxes whose predicted classes are objects
• Remove overlapping results using non-maximum
suppression (NMS). The remaining
predicted bounding boxes for objects are the
region proposals required by the region of
interest pooling layer
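An illustrative sketch of Step 2 (values are examples, not from the slides): anchor boxes of several scales and aspect ratios centred on one feature-map location:

def anchors_at(cx, cy, scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0)):
    boxes = []
    for s in scales:
        for r in ratios:
            w, h = s * r ** 0.5, s / r ** 0.5   # aspect ratio w/h = r, area stays ~ s*s
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes

print(len(anchors_at(16, 16)))   # 9 anchors per location (3 scales x 3 ratios)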
Mask R-CNN: for each Region of Interest (RoI), it additionally predicts a segmentation mask using a small
FCN
Non-maximum suppression (NMS)
• Input:
– A list P of predicted BBoxes of the
form (x1, y1, x2, y2, c), where (x1, y1) and (x2, y2) are
opposite corners of the BBox and c is the predicted
confidence score of the model
– An overlap threshold IoU thresh_iou
• Output:
– A list keep of filtered predicted BBoxes
Algorithm
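A plain-Python sketch of the NMS algorithm described above (helper names are illustrative): P is a list of (x1, y1, x2, y2, c) boxes and thresh_iou the overlap threshold.

def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(P, thresh_iou):
    P = sorted(P, key=lambda box: box[4], reverse=True)   # highest confidence first
    keep = []
    while P:
        best = P.pop(0)
        keep.append(best)
        # drop every remaining box that overlaps the kept box too much
        P = [box for box in P if iou(best, box) < thresh_iou]
    return keep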
https://www.omnicalculator.com/math/bilinear-interpolation
• Bilinear interpolation (see the linked page): the missing
pixel Mp is interpolated from its neighbouring source pixels
SpL, SpR, SpT, SpB (left, right, top, and bottom), using the
corresponding distances DL, DR, DT, and DB from
the missing pixel.
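For reference, a small sketch of generic bilinear interpolation as on the linked calculator page (the slide's own figure and exact notation are not reproduced here); the value at (x, y) is interpolated from the four surrounding grid points:

def bilinear(x, y, x1, y1, x2, y2, q11, q21, q12, q22):
    # qij is the value at (xi, yj); requires x1 < x2 and y1 < y2
    dx, dy = x2 - x1, y2 - y1
    return (q11 * (x2 - x) * (y2 - y) +
            q21 * (x - x1) * (y2 - y) +
            q12 * (x2 - x) * (y - y1) +
            q22 * (x - x1) * (y - y1)) / (dx * dy)

print(bilinear(0.5, 0.5, 0, 0, 1, 1, q11=0, q21=1, q12=1, q22=2))   # 1.0 at the centre of the cell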
YOLO
You Only Look Once
[Joseph Redmon et al.]
https://arxiv.org/pdf/1506.02640.pdf
• The CNN consists of 24 conv layers followed by
2 fully connected layers
• The first 20 conv layers, followed by an average
pooling layer and a fully connected layer, are pre-
trained on the ImageNet dataset
• The pretraining for classification is performed on
images of resolution 224 × 224 × 3
• The layers comprise 3×3 conv layers and 1×1
reduction layers
• For object detection, the last 4 conv layers followed by 2
fully connected layers are added and the network is trained further
• The input resolution is increased to 448 × 448
• The final layer predicts the class probabilities and
bounding boxes
• All other convolutional layers use leaky ReLU
activation
• The final layer uses a linear activation
• It outputs a 7 × 7 × 30 tensor containing the predictions
the model makes on the input image
The output tensor has shape S × S × (B·5 + C), where
S = grid size (7)
B = number of bounding boxes predicted per grid cell (2)
C = number of class labels (20), so 7 × 7 × (2·5 + 20) = 7 × 7 × 30
Confidence of a predicted box = object probability × IoU of the predicted box with the ground truth
Loss Function
• sum-squared loss
• The sum-squared error weights localization and
classification errors equally
• Many grid cells in an image contain no
objects, and the sum-squared error pushes the
confidence scores of these cells towards zero
– Problem: this part of the loss dominates the gradients and
prevents the model from converging
• The parameters λcoord and λnoobj are introduced to
overcome this problem (the full loss is reproduced below)
https://hackerstreak.com/yolo-made-simple-interpreting-the-you-only-look-once-paper/
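For reference, the full sum-squared loss from the YOLOv1 paper (with λcoord = 5 and λnoobj = 0.5; 1_ij^obj indicates that box j of cell i is responsible for an object):

L = λcoord Σ_{i=0..S²} Σ_{j=0..B} 1_ij^obj [ (x_i − x̂_i)² + (y_i − ŷ_i)² ]
  + λcoord Σ_{i=0..S²} Σ_{j=0..B} 1_ij^obj [ (√w_i − √ŵ_i)² + (√h_i − √ĥ_i)² ]
  + Σ_{i=0..S²} Σ_{j=0..B} 1_ij^obj (C_i − Ĉ_i)²
  + λnoobj Σ_{i=0..S²} Σ_{j=0..B} 1_ij^noobj (C_i − Ĉ_i)²
  + Σ_{i=0..S²} 1_i^obj Σ_{c∈classes} (p_i(c) − p̂_i(c))²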
Loss
Localization Error
(Figure: example detections for classes such as bicycle, car, and dog)
• Then the box and class predictions are combined
• The resulting bounding box prediction consists of the x and y
coordinates of the box's centre, sqrt(width), sqrt(height), and an
object probability score
• Finally, threshold the detections and apply NMS
DenseNet
• The Dense Convolutional Network (DenseNet)
connects each layer to every other layer in a
feed-forward fashion
• Whereas traditional convolutional networks with L
layers have L connections (one between each layer
and its subsequent layer), a DenseNet has L(L+1)/2 direct connections
• Problem addressed: as the gradient passes through many layers, it can
vanish and "wash out" by the time it reaches the
end
• For each layer, the feature maps of all preceding layers are
used as inputs, and its own feature maps are used as
inputs to all subsequent layers
• In contrast to ResNets, features are never combined through
summation before they are passed into a layer
• Features are combined by concatenation
• Evaluated on CIFAR-10, CIFAR-100, SVHN, and ImageNet
• Benefits:
– Alleviates the vanishing-gradient problem
– Strengthens feature propagation
– Encourages feature reuse
https://github.com/liuzhuang13/DenseNet
A dense block with 5 layers
Implementation Details
• Composite function (composition layer)
– Batch normalization (BN), followed by a rectified linear unit
(ReLU) and a 3×3 convolution (Conv)
• Pooling layers
– To facilitate down-sampling, the network is divided into multiple
densely connected dense blocks, with transition (pooling) layers between them
• Growth rate
– If each H_l produces k feature maps, the l-th layer has
k0 + k × (l − 1) input feature maps, where k0 is the number of
channels in the input layer
– The hyperparameter k is known as the growth rate of the network (a small sketch of a dense block follows below)
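An illustrative Keras sketch (not from the slides) of one dense block: each composite layer H_l is BN → ReLU → 3×3 Conv producing k feature maps, and its input is the concatenation of all previous feature maps:

import tensorflow as tf

def dense_block(x, num_layers=5, growth_rate=12):
    for _ in range(num_layers):
        y = tf.keras.layers.BatchNormalization()(x)
        y = tf.keras.layers.Activation('relu')(y)
        y = tf.keras.layers.Conv2D(growth_rate, 3, padding='same')(y)   # produces k feature maps
        x = tf.keras.layers.Concatenate()([x, y])                       # channels grow by k per layer
    return x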
DenseNet-BC
• Bottleneck layers
– To reduce the number of input feature maps, a
1×1 convolution is introduced as a bottleneck layer
before each 3×3 convolution
• Compression factor θ
– 0 < θ ≤ 1
– If a dense block produces m feature maps, the
following transition layer generates ⌊θm⌋ output feature maps
– θ = 0.5 in the experiments
A deep DenseNet with three dense blocks
DenseNet architectures for ImageNet. The growth rate for all the networks is
k = 32; each "conv" layer shown in the table corresponds to the
sequence BN-ReLU-Conv.
The top-1 and top-5 error rates on the
ImageNet validation set.
Error rates (%) on CIFAR and SVHN datasets. k denotes the network's
growth rate; "+" indicates standard data augmentation.
Results that surpass all competing methods are in bold and the overall best results are
in blue.
Generative adversarial networks (GANs)
Ian Goodfellow
• GANs have been used for
– image generation
– image processing
– image synthesis from captions
– image editing
– visual domain adaptation
– data generation for visual recognition
Sample Generation
The GAN framework
• Generator : creates samples that are intended
to come from the same distribution as the
training data
• Discriminator: examines samples to determine
whether they are real or fake
• The discriminator learns using traditional
supervised learning techniques, dividing inputs
into two classes (real or fake).
• The generator is trained to fool the discriminator
• Nash equilibrium: a game-theory concept that determines the optimal solution in a non-cooperative game,
in which no player gains by unilaterally changing their strategy; GAN training seeks such an equilibrium between the generator and the discriminator
Cost function
• Two-player minimax game:
min_G max_D V(D, G) = E_{x~p_data(x)}[log D(x)] + E_{z~p_z(z)}[log(1 − D(G(z)))]
Discriminator Perspective
• Maximize V(D, G): push D(x) towards 1 on real samples and D(G(z)) towards 0 on generated samples
Generator Perspective
• Minimize log(1 − D(G(z))) (in practice, maximize log D(G(z)) for stronger early gradients)
• z is random noise (Gaussian/uniform)
• z can be thought of as the latent representation of the
image
Training Discriminator
• Freeze the generator weights, and compare real samples with generated samples
• Train the discriminator
• Minimize the loss of the discriminator
Training Generator
• Freeze the discriminator weights and train the generator to fool the discriminator (maximize the discriminator's error on generated samples)
https://realpython.com/generative-adversarial-networks/
Training
https://realpython.com/generative-adversarial-networks/
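A minimal TensorFlow sketch of one adversarial training step implementing the two phases described above (the generator/discriminator models, optimizers and latent size are assumed, not the linked tutorial's exact code):

import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def train_step(generator, discriminator, g_opt, d_opt, real_images, latent_dim=100):
    batch = tf.shape(real_images)[0]
    z = tf.random.normal((batch, latent_dim))                 # random noise input
    # 1) Update the discriminator: real -> 1, fake -> 0 (generator is not updated here)
    with tf.GradientTape() as tape:
        fake_images = generator(z, training=True)
        d_real = discriminator(real_images, training=True)
        d_fake = discriminator(fake_images, training=True)
        d_loss = bce(tf.ones_like(d_real), d_real) + bce(tf.zeros_like(d_fake), d_fake)
    d_opt.apply_gradients(zip(tape.gradient(d_loss, discriminator.trainable_variables),
                              discriminator.trainable_variables))
    # 2) Update the generator: try to make the discriminator output 1 on fakes
    with tf.GradientTape() as tape:
        d_fake = discriminator(generator(z, training=True), training=True)
        g_loss = bce(tf.ones_like(d_fake), d_fake)
    g_opt.apply_gradients(zip(tape.gradient(g_loss, generator.trainable_variables),
                              generator.trainable_variables))
    return d_loss, g_loss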
output
https://realpython.com/generative-adversarial-networks/
• h(t): hidden target state
• c(t): source-side context vector
• y(t): current target word
• h_bar(t): attentional hidden state
• a(t): alignment vector