6-DeepVisualLearning L6
Jayanta Mukhopadhyay
Dept. of Computer Science and Engg.
Deep learning
◼ Learning using a “deep” neural network
◼ Classical ANN: only a few hidden layers between input and output.
Deep architecture: Why so
late in application?
◼ Concepts introduced in the 1980s.
◼ Basic principles remain the same.
◼ Two major reasons.
◼ Availability of large scale annotated data.
◼ Penetration of the internet and smartphones.
◼ Widespread social networking.
◼ Online shopping, etc.
◼ Advancement of computing power.
◼ High throughput GPU computing.
Classical Image Classification
◼ Pipeline: input image → hand-crafted feature extractor → classifier algorithm → output (Tiger? Cat? Lion?).
◼ Hand-crafted features: edges, SIFT/SURF key-point and regional features, HOG, motion features, etc.
◼ Classifiers: Bayesian, LDA, SVM, KNN.
Classification Challenges
◼ Very tedious and costly to develop hand-crafted
features to handle various challenges.
*Feature visualization of convolutional net trained on ImageNet (Zeiler and Fergus, 2013)
Supervised Learning
◼ Model (structure + weights) maps the input data (image pixels) to a predicted output (image label).
◼ Learned from pairs of training inputs and true outputs.
Supervised Learning
◼ A loss function tells how good the current classifier is.
◼ Data loss: Model predictions should match
training data
◼ Softmax Loss (Multinomial Logistic Regression):
    L_i = −log( e^{s_{y_i}} / Σ_j e^{s_j} )
    L_i = −log P(Y = y_i | X = x_i)
Cross-entropy Loss
◼ More general form:
    L = −Σ_c y_{o,c} log(p_{o,c})
    y_{o,c}: binary indicator (1 if observation o belongs to class c, else 0)
    p_{o,c}: probability of o belonging to class c
◼ Total objective = data loss + regularization loss:
    L = (1/N) Σ_i L_i + λ R(W)
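A minimal NumPy sketch of the softmax / cross-entropy data loss above, averaged over a batch (the function name and toy scores are illustrative, not from the slides):

```python
import numpy as np

def softmax_cross_entropy(scores, y):
    """Softmax / cross-entropy loss L_i = -log(e^{s_{y_i}} / sum_j e^{s_j})."""
    # Shift scores for numerical stability (does not change the softmax).
    shifted = scores - scores.max(axis=1, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    # Pick the probability assigned to the true class of each sample.
    N = scores.shape[0]
    return -np.log(probs[np.arange(N), y]).mean()

# Example: 2 samples, 3 classes; true classes are 0 and 2.
scores = np.array([[2.0, 1.0, 0.1],
                   [0.5, 0.2, 3.0]])
y = np.array([0, 2])
print(softmax_cross_entropy(scores, y))   # average data loss over the batch
```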
• Gradient descent
• Back propagation algorithm
Gradient Descent
How to update weights?
Initialize w randomly
While true:
    Compute gradient of the loss at the current w
    Update w in the negative gradient direction: w ← w − η ∇_w L
◼ Forward pass:
◼ Run graph “forward” to compute loss
◼ Backward pass:
◼ Run graph “backward” to compute gradients
with respect to loss
◼ Efficient to compute gradients for big,
complex models.
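A minimal PyTorch sketch of this loop: a forward pass to compute the loss, a backward pass for gradients, and a step in the negative gradient direction (the toy linear model, data, and learning rate are made-up examples):

```python
import torch

w = torch.randn(3, requires_grad=True)          # initialize w randomly
x = torch.tensor([[1.0, 2.0, 3.0],
                  [0.5, 1.0, 1.5]])
y_true = torch.tensor([1.0, 0.0])
lr = 0.1

for step in range(100):
    y_pred = x @ w                               # forward pass: run the graph "forward"
    loss = ((y_pred - y_true) ** 2).mean()       # compute the loss
    loss.backward()                              # backward pass: gradients w.r.t. the loss
    with torch.no_grad():
        w -= lr * w.grad                         # update in the negative gradient direction
        w.grad.zero_()                           # clear gradients for the next iteration
```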
Learning filters for feature
extraction
◼ Correlation with a mask or kernel
3×3 kernel:
    w1 w2 w3
    w4 wc w5
    w6 w7 w8

g(x, y) = w1 f(x−1, y+1) + w2 f(x, y+1) + w3 f(x+1, y+1)
        + w4 f(x−1, y)   + wc f(x, y)   + w5 f(x+1, y)
        + w6 f(x−1, y−1) + w7 f(x, y−1) + w8 f(x+1, y−1)
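A plain-loop NumPy sketch of correlating an image with a 3×3 mask, mirroring the weighted-sum formula above (the toy image and averaging mask are illustrative; array row indices increase downwards, so the sign convention on y differs from the formula):

```python
import numpy as np

def correlate3x3(f, w):
    """Correlate image f with a 3x3 mask w (valid region only); unoptimized sketch."""
    H, W = f.shape
    g = np.zeros((H - 2, W - 2))
    for y in range(1, H - 1):
        for x in range(1, W - 1):
            # Weighted sum of the 3x3 neighbourhood centred at (x, y).
            g[y - 1, x - 1] = np.sum(w * f[y - 1:y + 2, x - 1:x + 2])
    return g

f = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 "image"
w = np.full((3, 3), 1.0 / 9.0)                 # 3x3 averaging mask
print(correlate3x3(f, w))                      # 3x3 output (valid correlation)
```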
Convolution in neural
architecture
◼ Output of a neuron: weighted sum of inputs.
◼ Weights defined by a kernel.
◼ Sparse connectivity.
◼ Shared weights for every node.
◼ Sufficient to describe the model by a kernel.
Convolution Layer
◼ 32x32x3 image, 5x5x3 filter (kernel).
◼ Filters always extend the full depth of the input volume.
◼ Convolve the filter with the image, i.e. “slide over the image spatially, computing dot products”.
Convolution Layer
◼ 32x32x3 image, 5x5x3 filter.
◼ Each position gives 1 number: the result of taking a dot product between the filter and a small 5x5x3 chunk of the image (i.e. 5*5*3 = 75-dimensional dot product + bias).
◼ Locality!
◼ Translation invariance! Object appearance is independent of location.
◼ Weight sharing!
Convolution Layer
◼ Sliding the 5x5x3 filter over the 32x32x3 image produces a 28x28x1 activation map.
Convolution Layer
◼ For example, if we had 6 5x5x3 filters, we’ll get 6 separate activation maps: a 28x28x6 output volume.
◼ Shared weights: equivalently, each unit is applied to all locations.
◼ ConvNet: a sequence of convolution layers interspersed with activation functions (ReLU), e.g.:
    32x32x3 input → CONV + ReLU (e.g. 6 5x5x3 filters) → 28x28x6 → CONV + ReLU (e.g. 10 5x5x6 filters) → 24x24x10 → CONV + ReLU → ….
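A sketch of this stack in PyTorch using the example filter sizes from the slide (layer choices beyond those numbers are assumptions):

```python
import torch
import torch.nn as nn

# Conv stack sized as in the example: 6 5x5x3 filters, then 10 5x5x6 filters.
stack = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=6, kernel_size=5),   # 32x32x3 -> 28x28x6
    nn.ReLU(),
    nn.Conv2d(in_channels=6, out_channels=10, kernel_size=5),  # 28x28x6 -> 24x24x10
    nn.ReLU(),
)

x = torch.randn(1, 3, 32, 32)      # one 32x32 RGB image (NCHW layout)
print(stack(x).shape)              # torch.Size([1, 10, 24, 24])
```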
◼ Maxpool with 2x2 filter and stride 2:

    Input (4x4):        Output (2x2):
    1 1 2 4
    5 6 7 8             6 8
    3 2 1 0             3 4
    1 2 3 4
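The same max-pooling example checked in PyTorch:

```python
import torch
import torch.nn.functional as F

# The 4x4 input from the max-pooling example above, reshaped to NCHW.
x = torch.tensor([[1., 1., 2., 4.],
                  [5., 6., 7., 8.],
                  [3., 2., 1., 0.],
                  [1., 2., 3., 4.]]).reshape(1, 1, 4, 4)

# 2x2 max pooling with stride 2 keeps the maximum of each 2x2 block.
print(F.max_pool2d(x, kernel_size=2, stride=2))
# tensor([[[[6., 8.],
#           [3., 4.]]]])
```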
◼ Example architecture: two stages, each with six 5*5 filters (stride 1) followed by 2*2 average pooling (stride 2).
◼ Till it overfits!
◼ Fewer parameters with stacked small filters:
◼ 3*(3²C²) vs. (7²C²) for C channels (three stacked 3x3 conv layers cover the same receptive field as one 7x7 layer).
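A quick PyTorch check of the parameter-count comparison above, using C = 256 channels as an arbitrary example:

```python
import torch.nn as nn

C = 256
one_7x7   = nn.Conv2d(C, C, kernel_size=7, bias=False)
three_3x3 = nn.Sequential(*[nn.Conv2d(C, C, kernel_size=3, bias=False) for _ in range(3)])

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(one_7x7))    # 7*7*C*C     = 3,211,264
print(count(three_3x3))  # 3*(3*3*C*C) = 1,769,472  -> fewer parameters
```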
Vanishing gradient problem
◼ The gradient becomes (nearly) zero, i.e. vanishes, when back-propagated through many layers!
◼ Learn residual mapping! Used in ResNet
◼ Residual block: the input X passes through a few layers to give F(X); an identity skip connection adds X back, so the block outputs H(X) = F(X) + X.
◼ No. of layers in ResNet variants: 34/50/101/152.
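A minimal sketch of such a residual block in PyTorch (two 3x3 conv layers for F(X) are an assumption; ResNet's actual blocks also handle downsampling and channel projection):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: H(x) = F(x) + x."""
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Sequential(                      # F(x): two conv layers
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.f(x) + x)              # identity skip connection

x = torch.randn(1, 64, 28, 28)
print(ResidualBlock(64)(x).shape)                    # torch.Size([1, 64, 28, 28])
```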
Batch Normalization
◼ Normalizes the input activation map to a layer by considering its distribution over a batch of training samples.
◼ Each dimension of the input feature map is individually normalized.
◼ Makes the activation maps (approximately) Gaussian.
◼ Advantages:
◼ Improves gradient flow through the network.
◼ Allows higher learning rates.
◼ Reduces the strong dependence on initialization.
◼ Acts as a form of regularization.
◼ Usually inserted after FC / CONV layers, and before the non-linearity: CONV → BN → non-linearity.
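A sketch of that placement in PyTorch (the channel sizes are arbitrary examples):

```python
import torch.nn as nn

# Usual placement described above: CONV -> BN -> non-linearity.
block = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(16),   # normalizes each of the 16 channels over the batch
    nn.ReLU(),
)
```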
Dropout
◼ Randomly dropping out nodes of network (at
hidden / visible layers) during training.
◼ Temporarily removing it from the network, along with
all its incoming and outgoing connections.
◼ Reduces overfitting; more effective for smaller datasets.
◼ Simulates learning sparse representation in hidden
layers.
◼ Implementation
◼ Retain output of a node with a probability p.
◼ Typically within [0.5,1] at hidden layers and [0.8,1] in visible
layers.
Learning weights with dropout
◼ Weights become larger due to dropout.
◼ They need to be scaled at the end of training.
◼ A simple heuristic.
◼ Outgoing weights of a unit retained with
probability p during training, multiplied by p at test
time.
◼ Scaling may be carried out during training
time at each weight update.
◼ No need to rescale weight for the test network.
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov, Dropout: A Simple Way to Prevent Neural Networks from Overfitting, JMLR, 15(Jun):1929−1958, 2014.
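A minimal NumPy sketch of the training-time scaling variant described above ("inverted dropout"), where surviving activations are divided by p so the test network needs no rescaling; the function name and toy values are illustrative:

```python
import numpy as np

def dropout_train(h, p=0.5, rng=np.random.default_rng(0)):
    """Retain each unit with probability p during training, rescale by 1/p."""
    mask = (rng.random(h.shape) < p) / p   # randomly drop units, scale survivors
    return h * mask

h = np.ones((2, 4))            # toy hidden activations
print(dropout_train(h))        # surviving units scaled to 1/p = 2.0, dropped units are 0
```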
Depthwise Separable
Convolutions
❑ Suppose we have a DF × DF × M input feature map, a DF × DF × N output feature map, and conventional convolution filters of spatial size DK × DK.
❑ What is the computational cost for such a convolution operation?
— DK · DK · M · DF · DF · N
❑ What is the number of parameters?
— DK · DK · M · N
Courtesy: Ankita Chatterjee
Depthwise Separable
Convolutions
❑ Now think of M depthwise filters of size DK × DK (not DK × DK × M), where each of these M filters operates separately on one of the M channels of the DF × DF input.
❑ Number of parameters? DK · DK · M
❑ Used in MobileNet-V1: alternate layers of Conv and depthwise-separable Conv.
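A sketch of a depthwise-separable layer in PyTorch following the slide's notation (M input channels, DK × DK depthwise filters via groups=M); the usual formulation, as in MobileNet, follows the depthwise step with a 1×1 pointwise convolution to produce N output channels, included here for completeness:

```python
import torch
import torch.nn as nn

M, N, DK = 32, 64, 3                      # example channel counts and kernel size

# Depthwise: M separate DK x DK filters, one per input channel (groups=M).
depthwise = nn.Conv2d(M, M, kernel_size=DK, padding=1, groups=M, bias=False)
# Pointwise: 1x1 convolution mixing the M channels into N output channels.
pointwise = nn.Conv2d(M, N, kernel_size=1, bias=False)

x = torch.randn(1, M, 56, 56)
y = pointwise(depthwise(x))
print(y.shape)                                         # torch.Size([1, 64, 56, 56])
print(sum(p.numel() for p in depthwise.parameters()))  # DK*DK*M = 288
```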
The architecture for baseline network EfficientNet-B0 is simple and clean, making it easier to scale and generalize.
Image taken from: EfficientNet Paper
EfficientNet-B0:
the baseline network
developed by AutoML MNAS,
EfficientNet-B1 to B7:
obtained by scaling up the
baseline network.
In particular, EfficientNet-B7
achieves new state-of-the-art
84.4% top-1 / 97.1% top-5
accuracy, while being 8.4x
smaller than the best existing
CNN.
Courtesy: Ankita Chatterjee
Image taken from: EfficientNet Paper
EfficientNet performance on
other baseline architectures
*Donahue et al, “DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition”, ICML 2014
Razavian et al, “CNN Features Off-the-Shelf: An Astounding Baseline for Recognition”, CVPR Workshops 2014
Transfer Learning
◼ 1. Train on Data Set #1.
◼ 2. Train on a smaller data set: reinitialize the last layer (FC-3) and train only that layer.
◼ 3. With a bigger dataset, train more layers (e.g. FC-1, FC-2, FC-3).
◼ Use a lower learning rate when fine-tuning; 1/10 of the original LR is a good starting point.
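A sketch of step 2 above in PyTorch/torchvision (the ResNet-18 backbone, 10-class head, and learning rate are illustrative assumptions, not prescribed by the slides; the string weights argument needs a recent torchvision):

```python
import torch.nn as nn
import torch.optim as optim
from torchvision import models

# Reuse a pretrained backbone, reinitialize the last layer, train only that layer.
model = models.resnet18(weights="IMAGENET1K_V1")
for p in model.parameters():
    p.requires_grad = False                      # freeze the pretrained layers

model.fc = nn.Linear(model.fc.in_features, 10)   # reinitialized final layer for 10 classes

# Only the new layer's parameters are optimized; keep the LR low (about 1/10 of a
# from-scratch LR) when later unfreezing and fine-tuning more layers.
optimizer = optim.SGD(model.fc.parameters(), lr=0.001, momentum=0.9)
```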
◼ Predicts: (i) coordinates of bounding boxes of objects; (ii) class probabilities.
Two stage processing: Localization and recognition
*Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014.
Two stage processing: Object
Recognition and Localization
◼ Down-sampling
Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015
Upsampled convolution
◼ Zero-stuffed upsampling of a 2x2 input followed by convolution with a smoothing filter:

    Input (2x2):    Upsampled (4x4):    Filter (3x3):      Output (4x4):
    5 6             5 0 6 0             .05 .1 .05         2    1.1  2.4  .6
    7 8             0 0 0 0             .1  .4  .1         1.2  1.3  1.4  .7
                    7 0 8 0             .05 .1 .05         2.8  1.5  3.2  .8
                    0 0 0 0                                .7   .75  .8   .4
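The worked example reproduced in PyTorch: zero-stuff the 2x2 input, then convolve with the (symmetric) 3x3 filter using zero padding:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[5., 6.], [7., 8.]])
up = torch.zeros(4, 4)
up[::2, ::2] = x                                   # zero-stuffed upsampling to 4x4

w = torch.tensor([[.05, .1, .05],
                  [.10, .4, .10],
                  [.05, .1, .05]])
y = F.conv2d(up.view(1, 1, 4, 4), w.view(1, 1, 3, 3), padding=1)
print(y.squeeze())    # first row ~ [2.0, 1.1, 2.4, 0.6], as in the worked example
```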
Fully Convolutional Network
(FCN)
◼ No FC layer, only CNNs.
◼ Dense prediction
◼ The downsampled output is upsampled, followed by a 1x1 convolution providing a softmax score at every pixel.
◼ No. of channels at the output layer = no. of classes.
◼ Cross-entropy loss function.
Semantic Segmentation
◼ The upsampling of learned low-resolution semantic feature maps is done using upconvolutions, which are initialized with bilinear interpolation filters.
Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015
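A sketch (not the paper's code) of a 2x upconvolution whose weights are initialized as a bilinear interpolation filter, as described above; the class count and kernel size are illustrative:

```python
import torch
import torch.nn as nn

def bilinear_kernel(channels, kernel_size):
    """Build a bilinear-interpolation kernel for initializing an upconvolution."""
    factor = (kernel_size + 1) // 2
    center = factor - 1 if kernel_size % 2 == 1 else factor - 0.5
    og = torch.arange(kernel_size).float()
    filt = 1 - torch.abs(og - center) / factor
    kernel2d = filt[:, None] * filt[None, :]
    weight = torch.zeros(channels, channels, kernel_size, kernel_size)
    for c in range(channels):
        weight[c, c] = kernel2d          # each channel upsampled independently
    return weight

# 2x upsampling of a C-channel score map with a transposed convolution,
# initialized as bilinear interpolation (C = number of classes, e.g. 21).
C = 21
upconv = nn.ConvTranspose2d(C, C, kernel_size=4, stride=2, padding=1, bias=False)
upconv.weight.data.copy_(bilinear_kernel(C, 4))
print(upconv(torch.randn(1, C, 16, 16)).shape)   # torch.Size([1, 21, 32, 32])
```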
Fusion of prediction
◼ Combines the prediction of a higher layer with that of a lower layer (by summing them).
◼ Performed at 3 stages.
Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015
Segnet: Encoder-Decoder
Network
Encoder
◼ Takes an input image and generates a high-dimensional feature vector.
◼ Aggregates features at multiple levels.
Decoder
◼ Takes a high-dimensional feature vector and generates a semantic segmentation map at the input resolution.
SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation, Vijay Badrinarayanan, et al. PAMI, 2017
U-Net
◼ Inspiration from wavelet analysis and synthesis of signals and images.
◼ Two-level filter bank: analysis filters h(n) (low-pass) and g(n) (high-pass) decompose x(n) into approximation coefficients xa1(n), xa2(n) and detail coefficients xd1(n), xd2(n); synthesis filters h’(n) and g’(n) recombine them to reconstruct x’(n).
U-Net architecture
◼ At the final layer, a 1x1 convolution is used to map each feature vector to the desired number of classes.
◼ Encoder: Conv + Pool (downsampling). Decoder: upsampling + Conv, with encoder feature maps concatenated (skip connections) into the decoder.
U-Net: Convolutional Networks for Biomedical Image Segmentation, Olaf Ronneberger, Philipp Fischer, Thomas Brox, https://arxiv.org/abs/1505.04597
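A toy sketch of one encoder/decoder level with a concatenation skip connection and the final 1x1 classification convolution (channel sizes and layer choices are illustrative assumptions, far smaller than the actual U-Net):

```python
import torch
import torch.nn as nn

enc  = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU())      # encoder conv
pool = nn.MaxPool2d(2)                                               # D/S: downsampling
up   = nn.ConvTranspose2d(16, 16, kernel_size=2, stride=2)           # U/S: upsampling
dec  = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU())     # decoder conv
head = nn.Conv2d(16, 2, kernel_size=1)                               # final 1x1 conv -> 2 classes

x = torch.randn(1, 1, 64, 64)
e = enc(x)                                 # encoder features (kept for the skip)
d = up(pool(e))                            # down then back up
d = dec(torch.cat([e, d], dim=1))          # concatenation skip connection
print(head(d).shape)                       # torch.Size([1, 2, 64, 64])
```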
Results
◼ Ground truth (manual): yellow border. Colored parts: segmented results.
Self-supervised Learning
◼ Typical architecture: convolutional encoder-decoder.