
CONVOLUTIONAL NETWORKS

By
Prof. K. Khadar Nawas

Inspired by MIT: Introduction to Deep Learning by Alexander Amini


2D Image (Grayscale)

11x14 pixels
Features
• Features
– Image features provide rich information about the image content
– Examples: edges and interest points
– They correspond to local regions in the image
– Used for recognition, matching, and reconstruction
Traditional features vs CNN
• A few classical feature detectors used in computer vision:
– Harris Corner Detection
– Scale-Invariant Feature Transform (SIFT)
– Speeded-Up Robust Features (SURF)
• CNN
– Extracts strong, complex features
– Learns task-specific features and is much more efficient
Filters
• A weight matrix applied to extract local-region features from an image
• Many filters can be used to extract more features
• Typical image filter
Image checking

Given an image of 'X', check whether it is an 'X'.

How?
By extracting features
Features of 'X' in the image
Using filters
filter1
filter2
filter3
Convolution operation
Convolution operation
How it convolves

• The convolution of a 5x5 image with a 3x3 filter
• Slide the 3x3 filter over the input image, element-wise multiply, and add the outputs
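A minimal NumPy sketch of this sliding-window operation (the image and filter values here are made up, purely for illustration):

import numpy as np

def conv2d_valid(image, kernel):
    # Slide the kernel over the image, multiply element-wise, and sum
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh, ow = ih - kh + 1, iw - kw + 1   # 5x5 image, 3x3 filter -> 3x3 output
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.arange(25).reshape(5, 5)                    # toy 5x5 "image"
kernel = np.array([[1, 0, 1], [0, 1, 0], [1, 0, 1]])   # toy 3x3 filter
print(conv2d_valid(image, kernel))                     # 3x3 activation map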
Convolution Operation
Sliding filter to extract local feature

Image courtesy: https://towardsdatascience.com/convolutional-neural-network-in-natural-language-processing-96d67f91275c
Image filtering
CNN
Convolutional Neural Networks
• Classification
• Segmentation
• Object Detection
• Image Captioning
CNN
• Building Blocks
1. Convolutional Layers.
2. Pooling Layers.
3. Fully-Connected Layers.
Convolutional Neural Networks

LeNet, LeCun et al., 1998


Interest Drift
• AlexNet, 2012 ImageNet
Convolution Layer
• 32x32x3 image
Activation Map after convolution
Zero Padding
• Valid padding: no padding is used; convolution normally reduces the spatial output size
• Full padding: padding increases the spatial output size
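The spatial output size of a convolution follows a standard formula, where W is the input width/height, F the filter size, P the padding and S the stride:

O = (W − F + 2P) / S + 1   (rounded down)

For example, a 32x32 input with a 5x5 filter, stride 1 and no padding (valid) gives (32 − 5)/1 + 1 = 28, so the activation map shrinks to 28x28; with full padding (P = F − 1 = 4) the output grows to (32 − 5 + 8)/1 + 1 = 36.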
Increasing the number of filters, say to 6
Activation Function
Stacked conv layers
Example: VGG16

Visualization of VGG-16 by Lane McIntosh. VGG-16 architecture from [Simonyan and Zisserman 2014].
Pooling Layers
• Down sampling
• Smaller outputs
Max pooling
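As a quick illustration (the values below are made up), 2x2 max pooling with stride 2 halves each spatial dimension by keeping only the maximum of every 2x2 block:

import numpy as np

x = np.array([[1, 3, 2, 4],
              [5, 6, 1, 2],
              [7, 2, 9, 0],
              [1, 8, 3, 4]])

# group the 4x4 map into non-overlapping 2x2 blocks and take each block's maximum
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)   # [[6 4]
                #  [8 9]]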
Fully Connected Layer (FC layer)
• Use of Bias
– The bias simply increases the number of parameters in each filter by 1
– Like all other parameters, the bias is learned during backpropagation
– The bias can be seen as the weight of a connection whose input is always set to +1
• The ReLU Layer
– ReLU does not change the dimensions of a layer
– The ReLU activation function has largely replaced other activation functions in convolutional neural networks
– In earlier years, saturating activation functions like sigmoid and tanh were used
Overfitting

• The model performs well on the training data
• It performs poorly when exposed to new (test) data
• Noise in the data can cause overfitting

(left) under-fitting, (middle) correct fit, and (right) over-fitting

https://www.researchgate.net/publication/332412613
Regularization
• A larger number of parameters can cause overfitting
• Regularization constrains the model to use fewer non-zero parameters
• It lowers the complexity of the neural network model during training
• A common technique for reducing overfitting in neural networks is weight decay, also known as weight regularization
Familiar techniques
• L1, L2, early stopping, and dropout
• Cost function = Loss (say, binary cross entropy) + Regularization term
L1 and L2
• The added cost is scaled by the hyperparameter lambda, λ, which is usually small
• Lambda will be tested with the values 0.01 and 0.005
• Loss function: f(y, x, w)
• L2 penalizes the squared value of the weights, pushing them toward (but not exactly to) zero
• L1 tends to drive some weights to exactly zero
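In Keras, this regularization term is typically attached per layer through kernel_regularizer; a small sketch (the λ values 0.01 and 0.005 follow the slide, the layer sizes are arbitrary):

import tensorflow as tf
from tensorflow.keras import layers, regularizers

model = tf.keras.Sequential([
    # L2 (weight decay): adds lambda * sum(w^2) to the loss
    layers.Dense(64, activation='relu', kernel_regularizer=regularizers.l2(0.01)),
    # L1: adds lambda * sum(|w|) to the loss, driving some weights to exactly zero
    layers.Dense(64, activation='relu', kernel_regularizer=regularizers.l1(0.005)),
    layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')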
CNN Architectures
• Case Studies
– AlexNet
– ResNet
AlexNet - Krizhevsky et al., 2012

Alex Krizhevsky et al., 2012, ImageNet Classification with Deep Convolutional Neural Networks
Full (simplified) AlexNet architecture:
[227x227x3] INPUT
[55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0
[27x27x96] MAX POOL1: 3x3 filters at stride 2
[27x27x96] NORM1: Normalization layer
[27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2
[13x13x256] MAX POOL2: 3x3 filters at stride 2
[13x13x256] NORM2: Normalization layer
[13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1
[13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1
[13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1
[6x6x256] MAX POOL3: 3x3 filters at stride 2
[4096] FC6: 4096 neurons
[4096] FC7: 4096 neurons
[1000] FC8: 1000 neurons (class scores)
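A quick sanity check of these shapes using the standard conv output-size formula O = (W − F + 2P)/S + 1:
CONV1: (227 − 11)/4 + 1 = 55, giving 55x55x96
MAX POOL1: (55 − 3)/2 + 1 = 27, giving 27x27x96
CONV2 (pad 2): (27 − 5 + 2·2)/1 + 1 = 27, giving 27x27x256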
Details/Retrospectives:
- first use of ReLU
- used Norm layers (not common anymore)
- data augmentation
- dropout 0.5
- batch size 128
- SGD Momentum 0.9
- Learning rate 1e-2,
- L2 weight decay 5e-4
ImageNet Dataset
• 15 million labeled high-resolution images,
belonging to roughly 22,000 categories
• The images were collected from the web
• Starting in 2010, an annual competition, the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC), has been held
• ILSVRC uses a subset of ImageNet with roughly
1000 images in each of 1000 categories
• ImageNet consists of variable-resolution images
• roughly 1.2 million training images
• 50,000 validation images
• 150,000 testing images
Transfer Learning
ResNet –Residual Network
(34,50,101,152)
Deep Residual Learning for Image Recognition
Kaiming He Xiangyu Zhang Shaoqing Ren Jian Sun
Microsoft Research
• A very deep network
• 152 layers
• Won 1st place in the ILSVRC 2015 classification task
Stacking CNN deep

• A deeper network should perform better, but in practice it did not perform as well as a shallower network
• The problem is a difficulty in optimizing the learning
• The authors introduced a deep residual learning framework
• They hypothesize that it is easier to optimize the residual mapping than to optimize the original (plain) network
• The proposed approach performed better
• Layers are used to fit a residual mapping rather than the underlying mapping directly
Residual Block

F(x) := H(x) – x
Fitting Residual:
H(x) =F(x)+x
• H(x) is the underlying mapping
• F(x) + x can be realized by feedforward neural networks with "shortcut connections", known as identity mappings
– The shortcut allows the gradient to be backpropagated directly to earlier layers
• It does not add any extra parameters or computation
• Training is done by backpropagation with SGD
Full ResNet architecture
Resnet Architecture
• Stacking of residual blocks
• Each residual block has 3x3 conv layers
• Periodically double the number of filters and downsample spatially using stride 2
• An additional conv layer at the beginning
• No hidden FC layers at the end (only the final FC layer that outputs class scores)
34-layer: plain vs. residual
Training Parameters
• Batch Normalization after every CONV layer
• Xavier/2 initialization (instead of plain random weights)
• SGD
• Learning rate: 0.1, 0.01
• Mini-batch size 256
• No dropout used
Batch Normalization
• A layer that normalizes the output of the previous layer
• Also acts as a form of regularization, helping to avoid overfitting
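For reference, batch normalization standardizes each activation over the mini-batch and then applies a learned scale and shift:

x_hat = (x − μ_batch) / sqrt(σ²_batch + ε),   y = γ · x_hat + β

where γ and β are learned parameters and ε is a small constant for numerical stability.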
Building ResNet34 - Identity block

import tensorflow as tf

def identity_block(x, filter):
    # Save the input tensor for the shortcut (skip) connection
    x_skip = x
    # Main path: two 3x3 conv layers, each followed by batch normalization
    x = tf.keras.layers.Conv2D(filter, (3,3), padding='same')(x)
    x = tf.keras.layers.BatchNormalization(axis=3)(x)
    x = tf.keras.layers.Activation('relu')(x)
    x = tf.keras.layers.Conv2D(filter, (3,3), padding='same')(x)
    x = tf.keras.layers.BatchNormalization(axis=3)(x)
    # Add the shortcut to the main path, then apply the final activation
    x = tf.keras.layers.Add()([x, x_skip])
    x = tf.keras.layers.Activation('relu')(x)
    return x
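The ResNet34 code on the next slide also calls convolutional_block, which is not defined in these slides. A minimal sketch under the usual assumption (a stride-2 downsampling block with a 1x1 projection on the shortcut so the two paths have matching shapes):

def convolutional_block(x, filter):
    # Save the input for the shortcut path
    x_skip = x
    # Main path: the first conv downsamples spatially with stride 2
    x = tf.keras.layers.Conv2D(filter, (3,3), padding='same', strides=(2,2))(x)
    x = tf.keras.layers.BatchNormalization(axis=3)(x)
    x = tf.keras.layers.Activation('relu')(x)
    x = tf.keras.layers.Conv2D(filter, (3,3), padding='same')(x)
    x = tf.keras.layers.BatchNormalization(axis=3)(x)
    # Shortcut path: 1x1 conv with stride 2 so shapes match the main path
    x_skip = tf.keras.layers.Conv2D(filter, (1,1), strides=(2,2))(x_skip)
    # Add shortcut and main path, then apply the final activation
    x = tf.keras.layers.Add()([x, x_skip])
    x = tf.keras.layers.Activation('relu')(x)
    return x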
Putting together

Source: https://www.analyticsvidhya.com/blog/2021/08/how-to-code-your-resnet-from-scratch-in-tensorflow/

def ResNet34(shape=(32, 32, 3), classes=10):
    # Step 1 (Setup Input Layer)
    x_input = tf.keras.layers.Input(shape)
    x = tf.keras.layers.ZeroPadding2D((3, 3))(x_input)
    # Step 2 (Initial Conv layer along with maxPool)
    x = tf.keras.layers.Conv2D(64, kernel_size=7, strides=2, padding='same')(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.Activation('relu')(x)
    x = tf.keras.layers.MaxPool2D(pool_size=3, strides=2, padding='same')(x)
    # Define size of sub-blocks and initial filter size
    block_layers = [3, 4, 6, 3]
    filter_size = 64
    # Step 3 Add the Resnet Blocks
    for i in range(4):
        if i == 0:
            for j in range(block_layers[i]):
                x = identity_block(x, filter_size)
        else:
            # One Residual/Convolutional Block followed by Identity blocks
            # The filter size will go on increasing by a factor of 2
            filter_size = filter_size * 2
            x = convolutional_block(x, filter_size)
            for j in range(block_layers[i] - 1):
                x = identity_block(x, filter_size)
    # Step 4 End Dense Network
    x = tf.keras.layers.AveragePooling2D((2, 2), padding='same')(x)
    x = tf.keras.layers.Flatten()(x)
    x = tf.keras.layers.Dense(512, activation='relu')(x)
    x = tf.keras.layers.Dense(classes, activation='softmax')(x)
    model = tf.keras.models.Model(inputs=x_input, outputs=x, name="ResNet34")
    return model
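A quick, hypothetical usage check of the function above:

model = ResNet34(shape=(32, 32, 3), classes=10)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.summary()   # prints the stacked residual blocks and parameter counts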
VGGNet
[Simonyan and Zisserman, 2014]
• Used small filters: 3x3 conv
• Deeper than AlexNet (8 layers)
• Similar training procedure to AlexNet
• Simple
• Top-5 error rate of 7.3% on ImageNet
• 16 layer CNN
• 138 M parameters
• Trained on 4 Nvidia Titan Black GPUs
for two to three weeks
Use of 3x3 filters: why 3x3?
• Used multiple times (stacked), they give a greater receptive field
• A stack of three 3x3 conv (stride 1) layers has the same effective receptive field (ERF) as one 7x7 conv layer
• ERF: the idea that not all pixels in the receptive field contribute equally to the output unit's response
A key concept in deep CNNs is the receptive field, or field of view: a unit in a convolutional network depends only on a region of the input. This region of the input is the receptive field of that unit.
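A quick check of that claim: with stride 1, each additional 3x3 layer grows the receptive field by 2 pixels, so one 3x3 layer sees 3x3, two stacked see 5x5, and three stacked see 7x7, the same as a single 7x7 conv layer, but with fewer parameters (3 x (3·3·C·C) = 27C^2 vs. 7·7·C·C = 49C^2 for C input and output channels) and more non-linearities.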
VGG16
VGG16 Keras Code
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPool2D, Flatten, Dense

model = Sequential()
model.add(Conv2D(input_shape=(224,224,3), filters=64, kernel_size=(3,3), padding="same", activation="relu"))
model.add(Conv2D(filters=64,kernel_size=(3,3),padding="same", activation="relu"))
model.add(MaxPool2D(pool_size=(2,2),strides=(2,2)))
model.add(Conv2D(filters=128, kernel_size=(3,3), padding="same", activation="relu"))
model.add(Conv2D(filters=128, kernel_size=(3,3), padding="same", activation="relu"))
model.add(MaxPool2D(pool_size=(2,2),strides=(2,2)))
model.add(Conv2D(filters=256, kernel_size=(3,3), padding="same", activation="relu"))
model.add(Conv2D(filters=256, kernel_size=(3,3), padding="same", activation="relu"))
model.add(Conv2D(filters=256, kernel_size=(3,3), padding="same", activation="relu"))
model.add(MaxPool2D(pool_size=(2,2),strides=(2,2)))
model.add(Conv2D(filters=512, kernel_size=(3,3), padding="same", activation="relu"))
model.add(Conv2D(filters=512, kernel_size=(3,3), padding="same", activation="relu"))
model.add(Conv2D(filters=512, kernel_size=(3,3), padding="same", activation="relu"))
model.add(MaxPool2D(pool_size=(2,2),strides=(2,2)))
model.add(Conv2D(filters=512, kernel_size=(3,3), padding="same", activation="relu"))
model.add(Conv2D(filters=512, kernel_size=(3,3), padding="same", activation="relu"))
model.add(Conv2D(filters=512, kernel_size=(3,3), padding="same", activation="relu"))
model.add(MaxPool2D(pool_size=(2,2),strides=(2,2)))
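The original VGG16 ends with a classifier head on top of these conv blocks. A sketch of the remaining layers, continuing the model above (the 4096/4096/1000 sizes follow the original paper; the 1000-way output assumes ImageNet classes):

model.add(Flatten())
model.add(Dense(units=4096, activation="relu"))
model.add(Dense(units=4096, activation="relu"))
model.add(Dense(units=1000, activation="softmax"))   # 1000 ImageNet classes

model.summary()   # roughly 138M parameters in total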
Transfer Learning
from tensorflow.keras.applications.vgg16 import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input

## Loading VGG16 model
base_model = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base_model.trainable = False

from tensorflow.keras import layers, models

flatten_layer = layers.Flatten()
dense_layer_1 = layers.Dense(50, activation='relu')
dense_layer_2 = layers.Dense(20, activation='relu')
prediction_layer = layers.Dense(5, activation='softmax')

model = models.Sequential([
    base_model,
    flatten_layer,
    dense_layer_1,
    dense_layer_2,
    prediction_layer,
])

model.compile(
    optimizer='adam',
    loss='categorical_crossentropy',
    metrics=['accuracy'],
)

# Then call model.fit() on the training data
GoogleNet
• 22 layers
• Inception module
• 5 million parameters (12x fewer than AlexNet)
Inception module
• design a good local network topology (network within a network) and
then stack these modules on top of each other
• Used 9 Inception modules in the whole architecture
Naïve Inception module
Naïve Inception

Conv Ops:
[1x1 conv, 128] 28x28x128x1x1x256
[3x3 conv, 192] 28x28x192x3x3x256
[5x5 conv, 96] 28x28x96x5x5x256
Total: 854M
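Each line above is (output height x width) x (number of filters) x (kernel area) x (input depth); summing them reproduces the total:
[1x1, 128]: 28·28·128·1·1·256 ≈ 25.7M
[3x3, 192]: 28·28·192·3·3·256 ≈ 346.8M
[5x5, 96]: 28·28·96·5·5·256 ≈ 481.7M
Total ≈ 854M multiply-accumulate operations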
1x1 convolutions
Inception module with dimension reduction
Inception module - Feature Map
Concatenation
Inception module with dimension reduction

Conv Ops:
[1x1 conv, 64] 28x28x64x1x1x256
[1x1 conv, 64] 28x28x64x1x1x256
[1x1 conv, 128] 28x28x128x1x1x256
[3x3 conv, 192] 28x28x192x3x3x64
[5x5 conv, 96] 28x28x96x5x5x64
[1x1 conv, 64] 28x28x64x1x1x256
Total: 358M ops
Full GoogLeNet
Adapted from Fei-Fei Li & Justin Johnson & Serena Yeung lecture slides
Variants of CNN
• Segmentation
– Semantic
– Instance
• Localization
• Object Detection
• Image Captioning
Semantic Segmentation
• Dividing an image into regions that map to different semantic classes
• Semantic segmentation understands regions in images at the pixel level
• Labels each pixel with a class label
• Approaches
– Sliding windows to extract regions & CNN
– Whole image & CNN
Pixel wise labeling

https://www.jeremyjordan.me/
Patch extraction (naive idea)

• Computation is costly
• Features in overlapping patches are not reused
• Each patch is used to label only its center pixel
Convolution on High Res Image
• More convolutional layers
• Stacking of each convolution
• Pixel level prediction
Convolution on High Res Image
• More convolutional layers
• Stacking of each convolution
• Pixel level prediction
• Huge computation on the original (high-res) image
Reducing network
(Down Sampling and Up Sampling)

• Down sampling by Pooling and Striding


• Max pooling layers with strides
Reducing network
(Down Sampling and Up Sampling)

• Down sampling by Pooling and Striding


• Max pooling layers with strides
• Upsampling (unpooling)
Sample Upsampling
Max-Unpooling
Learnable Upsampling
Transpose Convolution
• Consider 3 x 3 convolution, stride 1 pad 1
Learnable Upsampling
Transpose Convolution
• Consider 3 x 3 convolution, stride 1 pad 1
Transpose Convolution
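In Keras, this kind of learnable upsampling is available as a transposed convolution layer; a small sketch (the filter count and input shape here are arbitrary):

import tensorflow as tf

x = tf.random.normal((1, 14, 14, 64))        # a low-resolution feature map
up = tf.keras.layers.Conv2DTranspose(filters=32, kernel_size=3,
                                     strides=2, padding='same')(x)
print(up.shape)                              # (1, 28, 28, 32): spatially doubled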
Dataset
• An important semantic segmentation dataset is Pascal VOC 2012
• Visual object classes in realistic scenes, with twenty object classes
• A supervised learning problem in which a training set of labelled images is provided
Person: person
Animal: bird, cat, cow, dog, horse, sheep
Vehicle: aeroplane, bicycle, boat, bus, car, motorbike, train
Indoor: bottle, chair, dining table, potted plant, sofa, tv/monitor
Localization & Object Detection
• There may be multiple objects of interest in the image
• We want to find object categories and also their specific positions in the image
• Such a task is called object detection (or object recognition)
• A bounding box is used to describe the spatial location of an object
• The bounding box is rectangular, determined by the x and y coordinates of the upper-left corner of the rectangle and the corresponding coordinates of the lower-right corner
Region-based CNNs (R-CNNs)
[Girshick et al., 2014]
• The process of finding and classifying objects in an image is called object detection
• Uses selective search to come up with region proposals
• The first object detection method using a CNN

https://arxiv.org/pdf/1311.2524.pdf
• a CNN is used to perform forward propagation
on each region proposal to extract its features
• features of each region proposal are used for
predicting the class and bounding box
• A 4-step process
• Step 1: Perform selective search to extract multiple high-quality region proposals from the input image (about 2000 regions)
• proposed regions are usually selected at
multiple scales with different shapes and sizes
• Each region proposal will be labeled with a
class and a ground-truth bounding box
• Step 2:Choose a pretrained CNN and truncate
it before the output layer
• Resize each region proposal to the input size
required by the network, and
• output the extracted features for the region
proposal through forward propagation.
• Step 3:Take the extracted features and labeled
class of each region proposal as an example.
• Train multiple support vector machines to
classify objects, where each support vector
machine individually determines whether the
example contains a specific class.
• Step 4:Take the extracted features and labeled
bounding box of each region proposal as an
example. Train a linear regression model to
predict the ground-truth bounding box
R-CNN
Drawbacks
• Inference (detection) is slow: selecting thousands of region proposals from a single input image takes a lot of time and disk space
• This requires thousands of CNN forward propagations to perform object detection
• The massive computing load makes it infeasible to widely use R-CNNs in real-world applications
Fast R-CNN [Girshick, 2015].
• The main speed bottleneck of the R-CNN lies in the independent CNN forward propagation for each region proposal, with no sharing of computation
• A major improvement of the fast R-CNN over the R-CNN is that the CNN forward propagation is performed only once, on the entire image
• Step 1: feature extraction in the entire image,
rather than individual region proposals
• finding region proposals (of different shapes)
mark regions of interest (of different shapes)
on the CNN output
• Step 2: The fast R-CNN introduces a region of interest (RoI) pooling layer
• Step 3: Using a fully-connected layer,
transform the concatenated features
• Step 4: Predict the class and bounding box for
each of the region proposals
• The class prediction uses softmax regression.
• In the pooling layer, we indirectly control the
output shape by specifying sizes of the pooling
window, padding, and stride.
Faster R-CNN
• reduces region proposals without loss of
accuracy
• Compared with the fast R-CNN, the faster R-
CNN only changes the region proposal method
from selective search to a region proposal
network
Region proposal network
• works in the following steps
• Step 1:Use a 3×3 convolutional layer with
padding of 1 to transform the CNN output to a
new output with c channels
• each unit along the spatial dimensions of the
CNN-extracted feature maps gets a new
feature vector of length c.
• Step 2: Centered on each pixel of the feature
maps, generate multiple anchor boxes of
different scales and aspect ratios and label
them
• Step 3: Using the length-c feature vector at
the center of each anchor box, predict the
binary class (background or objects) and
bounding box for this anchor box
• Step 4: Consider those predicted bounding
boxes whose predicted classes are objects.
• Remove overlapped results using non-
maximum suppression. The remaining
predicted bounding boxes for objects are the
region proposals required by the region of
interest pooling layer

Non-maximum suppression (NMS) is often used along with edge detection algorithms.
Faster R-CNN
Faster R-CNN
NMS
Bounding Box Regression

• Given a predicted bounding box coordinate p = (px, py, pw, ph) (center coordinate, width, height)
• and ground-truth box coordinates g = (gx, gy, gw, gh),
• the regressor is configured to learn a scale-invariant transformation between the two centers and a log-scale transformation between the widths and heights.
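Written out (this is the standard R-CNN box parameterization; the slide describes it in words), the regression targets are:

t_x = (g_x − p_x) / p_w,   t_y = (g_y − p_y) / p_h
t_w = log(g_w / p_w),      t_h = log(g_h / p_h)

so the centre offsets are normalized by the box size (scale-invariant) and the width/height ratios are learned in log space.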
Semantic vs Instance Segmentation
Mask R-CNN
• Instance segmentation
• Builds on top of Faster R-CNN by adding a parallel branch
• For each Region of Interest (RoI), predicts a segmentation mask using a small FCN
• Changes RoI pooling in Faster R-CNN to a quantization-free layer called RoIAlign
• Generates a binary mask for each class independently: this decouples segmentation and classification
• If pixel-level positions of objects are also labeled on the images, Mask R-CNN can effectively leverage such detailed labels to further improve the accuracy of object detection
Faster R-CNN
FCN
R-CNN Model
Basic Architecture
Faster vs Mask RCNN
RoIAlign Motivation
• Quantization causes misalignment
RoI Align
RoI Align
• Removes the quantization that causes this misalignment
Result
NMS Algorithm
Non Maximum Suppression

• Input
– A list P of predicted BBoxes of the form (x1, y1, x2, y2, c), where (x1, y1) and (x2, y2) are opposite corners of the BBox and c is the model's predicted confidence score. Use an overlap threshold IoU thresh_iou.
• Output:
– Return a list keep of filtered predicted BBoxes
Algorithm

• Step 1: Select the prediction S with the highest confidence score, remove it from P, and add it to the final prediction list keep.

• Step 2: Compare this prediction S with all the predictions remaining in P. Calculate the IoU of S with every other prediction in P. If the IoU is greater than the threshold thresh_iou for any prediction T in P, remove T from P.

• Step 3: If there are still predictions left in P, go to Step 1 again; otherwise return the list keep containing the filtered predictions.
IoU in mathematical terms can be represented by the following expression:
Intersection over Union (IoU) = (Target ∩ Prediction) / (Target ∪ Prediction)
i.e., IoU(Box1, Box2) = Intersection_Size(Box1, Box2) / Union_Size(Box1, Box2)
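A minimal Python sketch of these steps (not an optimized implementation; each box is an (x1, y1, x2, y2, c) tuple as described above):

def iou(a, b):
    # Intersection over Union of two boxes given as (x1, y1, x2, y2, ...)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(P, thresh_iou=0.5):
    keep = []
    P = sorted(P, key=lambda box: box[4], reverse=True)   # sort by confidence c
    while P:
        S = P.pop(0)                                      # Step 1: highest-confidence box
        keep.append(S)
        # Step 2: drop every remaining box whose IoU with S exceeds the threshold
        P = [T for T in P if iou(S, T) <= thresh_iou]
    return keep                                           # Step 3: repeat until P is empty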


Bilinear interpolation
• Bilinear interpolation replaces each missing pixel with a weighted average of the nearest source pixels among its four neighbours (left, right, top, bottom)
Example Bilinear Interploation

https://www.omnicalculator.com/math/bilinear-interpolation
• where Mp is the missing pixel, SpL, SpR, SpT, SpB are the left, right, top, and bottom source pixels, and DL, DR, DT, DB are the corresponding distances from the missing pixel.
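In the simplest setting (a missing pixel inside a unit square with known corner values), the weighted average has the standard bilinear form:

f(x, y) = f(0,0)·(1−x)(1−y) + f(1,0)·x(1−y) + f(0,1)·(1−x)·y + f(1,1)·x·y,   0 ≤ x, y ≤ 1

i.e. each known pixel is weighted by how close the missing pixel lies to it.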
YOLO
you only look once
[Joseph Redmon et al]

• An object detection algorithm that uses a convolutional neural network (CNN) to detect and identify objects
• Makes predictions of bounding boxes and class probabilities all at once
• YOLO offers both speed and accuracy
• Predictions are made by fully connected layers at the end of the network
• YOLO frames detection as a regression problem
Architecture

https://arxiv.org/pdf/1506.02640.pdf
• The CNN consists of a total of 24 conv layers followed by 2 fully connected layers
• The first 20 conv layers, followed by an average pooling layer and a fully connected layer, are pre-trained on the ImageNet dataset
• The pretraining for classification is performed on the dataset at an image resolution of 224 x 224 x 3
• The layers comprise 3x3 conv layers and 1x1 reduction layers
• For object detection, four more conv layers followed by 2 fully connected layers are added to train the network
• The resolution of the input is increased to 448 x 448
• The final layer then predicts the class probabilities and bounding boxes
• All the other convolutional layers use leaky ReLU activation
• The final layer uses a linear activation
• It outputs a 7 x 7 x 30 tensor containing the predictions the model makes for the input image
where
S = grid size
B = number of bounding boxes predicted per cell
C = number of class labels

Confidence = object probability × IoU of the predicted box with the ground truth
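With the standard YOLOv1 settings (S = 7, B = 2 boxes per cell, C = 20 Pascal VOC classes), the output size in the previous slide works out as:

S x S x (B·5 + C) = 7 x 7 x (2·5 + 20) = 7 x 7 x 30

since each box contributes 5 numbers (x, y, w, h, confidence) and each cell one set of C class probabilities.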
Loss Function
• Sum-squared loss
• The sum-squared error weights localization and classification errors equally
• Many grid cells in an image don't contain any objects, and the sum-squared error pushes the confidence scores of these cells toward zero
– Problem: this part of the loss can dominate the gradients and prevent the model from converging
• The parameters λcoord and λnoobj are introduced to overcome this problem
https://hackerstreak.com/yolo-made-simple-interpreting-the-you-only-look-once-paper/
Loss
Localization Error

• The authors used a value of 5 for λcoord

Classification Error

• λnoobj was assigned a value of 0.5 by the authors
• Ci is the predicted confidence score (compared against the ground-truth confidence) for each bounding box in each cell
• Pi is the class probability
Versions
How
The image is divided into an S x S grid of cells; here the (square) image is divided into a 7x7 grid.
Each cell produces class and bounding box predictions for objects whose centre falls inside that particular cell.
Each cell also predicts a class probability (e.g. bicycle, car, dog).
Then we combine the box and class predictions.
The resulting bounding box prediction consists of the x and y coordinates of the box's centre, sqrt(width), sqrt(height), and an object probability score.
Finally, threshold the detections and apply NMS.
DenseNet
• The Dense Convolutional Network (DenseNet) connects each layer to every other layer in a feed-forward fashion
• Whereas traditional convolutional networks with L layers have L connections (one between each layer and its subsequent layer),
• a DenseNet has L(L+1)/2 direct connections
• Problem addressed: as the gradient passes through many layers, it can vanish and "wash out" by the time it reaches the end
• For each layer, the feature maps of all preceding layers are used as inputs, and its own feature maps are used as inputs to all subsequent layers
• In contrast to ResNets, features are never combined through summation before they are passed into a layer
• Features are combined by concatenation
• Evaluated on CIFAR-10, CIFAR-100, SVHN, and ImageNet
• Benefits:
– alleviate the vanishing-gradient problem
– strengthen feature propagation
– encourage feature reuse
https://github.com/liuzhuang13/DenseNet
A dense block with 5 layers
Implementation Details
• Composite function (composition layer)
– Batch normalization (BN), followed by a rectified linear unit (ReLU) and a 3 x 3 convolution (Conv)
• Pooling layers
– To facilitate down-sampling, the network is divided into multiple densely connected dense blocks
• Growth rate
– If each HL produces k feature maps, it follows that the L-th layer has k0 + k x (L−1) input feature maps, where k0 is the number of channels in the input layer
– The hyperparameter k is known as the growth rate of the network
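A quick worked example, assuming an initial k0 = 64 channels and the growth rate k = 32 used later in these slides: layer 1 sees 64 input feature maps, layer 2 sees 64 + 32 = 96, layer 3 sees 64 + 2·32 = 128, and in general layer L sees k0 + k·(L−1) = 64 + 32·(L−1).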
DenseNet-BC
• Bottleneck layers
– To reduce the number of input feature maps, a 1x1 convolution is introduced as a bottleneck layer before each 3x3 convolution
• Compression factor θ
– 0 < θ ≤ 1
– If a dense block contains m feature maps, the following transition layer generates ⌊θm⌋ output feature maps
– θ = 0.5 in their experiments
A deep DenseNet with three dense blocks
DenseNet architectures for ImageNet. The growth rate for all the networks is
k = 32. Note that each “conv” layer shown in the table corresponds the
sequence BN-ReLU-Conv
The top-1 and top-5 error rates on the ImageNet validation set.
Error rates (%) on CIFAR and SVHN datasets. k denotes the network's growth rate; "+" indicates standard data augmentation.
Results that surpass all competing methods are in bold and the overall best results are in blue.
Generative adversarial networks (GANs)
Ian Goodfellow
• GANs have been used for
– image generation
– image processing
– image synthesis from captions
– image editing
– visual domain adaptation
– data generation for visual recognition
Sample Generation
The GAN framework
• Generator : creates samples that are intended
to come from the same distribution as the
training data
• Discriminator: examines samples to determine
whether they are real or fake
• The discriminator learns using traditional
supervised learning techniques, dividing inputs
into two classes (real or fake).
• The generator is trained to fool the discriminator

• The discriminator is a function D that takes x as input and uses θ(D) as parameters
• The generator is defined by a function G that takes z as input and uses θ(G) as parameters
Architecture
The generator
• Is simply a differentiable function
• A deep neural network is used to represent it
• G(z) yields a sample x drawn from p_model
The Descriminator
• The discriminator is represented by the function D
1. The goal of the discriminator is to output the probability that its input is real rather than fake, under the assumption that half of the inputs it is ever shown are real and half are fake
– The goal of the discriminator is for D(x) to be near 1
2. The discriminator also receives input G(z), a fake sample created by the generator
– The discriminator strives to make D(G(z)) approach 0
– while the generator strives to make the same quantity approach 1
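These two goals are usually written as a single minimax objective (the original GAN value function from Goodfellow et al.):

min_G max_D V(D, G) = E_{x ~ p_data}[log D(x)] + E_{z ~ p_z}[log(1 − D(G(z)))]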
Optimization
Nash equilibrium

is a game-theory concept describing the optimal solution of a non-cooperative game, in which no player can gain by unilaterally changing their own strategy.
Cost function
Cost function
Discriminator Perspective
Generator Perspective
• z is some random noise (Gaussian/uniform)
• z can be thought of as the latent representation of the image
Training Discriminator

• Freeze the generator weights; compare the real samples with generated samples
• Train the discriminator
• Minimize the loss of the discriminator
Training Generator

• Freeze the discriminator weights
• Backpropagate the error through the discriminator to update the generator weights
• Train the generator to generate data that "fools" the discriminator
• Maximize the loss of the discriminator when given generated data
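A minimal TensorFlow sketch of one such alternating training step (the generator and discriminator are assumed to be ordinary Keras models producing logits; the non-saturating generator loss is used here, a common variant of "fooling the discriminator"):

import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
g_opt = tf.keras.optimizers.Adam(1e-4)
d_opt = tf.keras.optimizers.Adam(1e-4)

def train_step(generator, discriminator, real_images, latent_dim=100):
    batch = tf.shape(real_images)[0]
    z = tf.random.normal((batch, latent_dim))

    # Discriminator step: only discriminator variables are updated here,
    # so the generator is effectively frozen
    with tf.GradientTape() as tape:
        fake_images = generator(z, training=True)
        real_logits = discriminator(real_images, training=True)
        fake_logits = discriminator(fake_images, training=True)
        d_loss = bce(tf.ones_like(real_logits), real_logits) + \
                 bce(tf.zeros_like(fake_logits), fake_logits)
    d_grads = tape.gradient(d_loss, discriminator.trainable_variables)
    d_opt.apply_gradients(zip(d_grads, discriminator.trainable_variables))

    # Generator step: the error is backpropagated through the (frozen) discriminator
    with tf.GradientTape() as tape:
        fake_images = generator(z, training=True)
        fake_logits = discriminator(fake_images, training=True)
        g_loss = bce(tf.ones_like(fake_logits), fake_logits)   # try to make D say "real"
    g_grads = tape.gradient(g_loss, generator.trainable_variables)
    g_opt.apply_gradients(zip(g_grads, generator.trainable_variables))
    return d_loss, g_loss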
Training Problems
• Non-Convergence
• Mode collapse (the generator does not learn to cover all modes of the data distribution)
GAN-Toy example
• To understand how GAN training works,
consider a toy example with a dataset
composed of two-dimensional samples (x₁, x₂),
with x₁ in the interval from 0 to 2π and x₂ =
sin(x₁), as illustrated in the following figure:
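Generating that toy training set takes only a couple of lines (a sketch; the sample count is arbitrary):

import numpy as np

n = 1024
x1 = 2 * np.pi * np.random.rand(n)                  # x1 uniform in [0, 2*pi)
train_data = np.stack([x1, np.sin(x1)], axis=1)     # samples (x1, x2) with x2 = sin(x1)
print(train_data.shape)                             # (1024, 2)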

https://realpython.com/generative-adversarial-networks/

Training

https://realpython.com/generative-adversarial-networks/

output

https://realpython.com/generative-adversarial-networks/