
Lecture 3: Convolutional Neural Networks (CNN)

CS460: Deep Learning


Deep Learning (DL)

"Think of them as deep neural networks"

 Multilayer Perceptron (MLP) is considered a relatively shallow model, with one or two hidden layers.
 DL models have many more layers than MLPs.
 The bottleneck for stacking more layers has been the "vanishing gradient problem".
The vanishing gradient problem

 A difficulty found in MLPs that use the sigmoid activation function and are trained with gradient-descent methods and backpropagation.
 In such methods, each of the neural network's weights receives an update proportional to the partial derivative of the error function with respect to that weight in each iteration of training.
 The problem is that in some cases the gradient becomes very small due to the "chain rule", effectively preventing the weights in the "front" (early) layers from changing their values.
 In the worst case, this may completely stop the neural network from further learning.
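To see the chain-rule effect numerically: the derivative of the sigmoid never exceeds 0.25, so the product of derivatives accumulated across many layers shrinks toward zero. A minimal NumPy sketch (the 20-layer depth and the evaluation point are illustrative assumptions, not from the slides):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)            # never exceeds 0.25

# Backprop multiplies one sigmoid derivative per layer (chain rule).
# Even ignoring the weights, the product shrinks geometrically.
grad = 1.0
for _ in range(20):                 # 20 hidden layers (illustrative)
    grad *= sigmoid_prime(0.0)      # 0.25 at the point of steepest slope
print(grad)                         # ~9.1e-13: front layers barely update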
DL and Data Science

 Scalability of neural networks: results get better with more data and larger models, which in turn require more computation to train.
DL and Feature Engineering

 Automated Feature Learning: the ability to perform automatic feature extraction from raw data.
 Hierarchical Feature Learning: the ability to provide different levels of abstraction of the data.
CNN for vision
NN and Vision

 MNIST example
NN and Vision

 For computer vision, why can't we just flatten the image and feed it through a traditional NN such as an MLP?
 Images are high-dimensional vectors. It would take a huge number of parameters to characterize the network.
 (# of parameters = 784*15 + 15 = 11,775 for the MNIST example with 15 hidden units)
 CNNs were proposed to reduce the number of parameters and to adapt the network architecture specifically to vision tasks.
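To make the parameter argument concrete, here is a small sketch comparing the flattened-MLP count above with a single shared convolution filter (the 3x3 single-filter layer is an illustrative assumption):

# Fully connected: every hidden unit sees all 784 pixels.
n_inputs, n_hidden = 28 * 28, 15
mlp_params = n_inputs * n_hidden + n_hidden     # weights + biases
print(mlp_params)                               # 11775

# Convolutional: one 3x3 filter is reused across the whole image.
filter_h, filter_w = 3, 3
conv_params = filter_h * filter_w + 1           # weights + 1 bias
print(conv_params)                              # 10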
Traditional Recognition Approach
CNN Layers
Convolutional Neural Networks (CNN)

 A CNN is a class of deep, feed-forward artificial neural networks that is applied to analyzing visual imagery.
Convolution Layer

 The # of output feature maps is usually larger than the # of input feature maps.
Convolution Layer: Related Terms
 Filter: a mask/window holding the learned weights that are convolved with the image. Its size specifies the patch, or receptive field, of the image.
 Feature Map: the output of one filter applied to the previous layer.
 Stride: the distance (number of rows and columns) that the filter is moved across the input from its previous location.
 Padding: inventing mock inputs for the receptive field for the filter to read, in case the filter attempts to read off the edge of the input feature map.
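The terms above can be tied together in a minimal single-channel NumPy sketch (the function name and values are illustrative, not from the lecture):

import numpy as np

def conv2d(image, kernel, stride=1, pad=0):
    """Slide a square filter over a 2-D image with the given stride and zero padding."""
    if pad:
        image = np.pad(image, pad)              # zero padding around the border
    H, W = image.shape
    F = kernel.shape[0]
    out_h = (H - F) // stride + 1
    out_w = (W - F) // stride + 1
    fmap = np.zeros((out_h, out_w))             # the resulting feature map
    for i in range(out_h):                      # move the receptive field
        for j in range(out_w):
            patch = image[i*stride:i*stride+F, j*stride:j*stride+F]
            fmap[i, j] = np.sum(patch * kernel)
    return fmap

img = np.random.rand(7, 7)
k = np.ones((3, 3))
print(conv2d(img, k, stride=2).shape)           # (3, 3)
print(conv2d(img, k, stride=1, pad=1).shape)    # (7, 7): size preserved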
Spatial Dimensions

 7x7 input (spatially), 3x3 filter, stride 1 => 5x5 output
 7x7 input (spatially), 3x3 filter applied with stride 2 => 3x3 output!
 7x7 input (spatially), 3x3 filter applied with stride 3? Doesn't fit! A 3x3 filter cannot be applied to a 7x7 input with stride 3.
Spatial Dimensions

 Output size: (N - F) / stride + 1
 e.g. N = 7, F = 3:
   stride 1 => (7 - 3)/1 + 1 = 5
   stride 2 => (7 - 3)/2 + 1 = 3
   stride 3 => (7 - 3)/3 + 1 = 2.33
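The rule can be packaged as a small helper; adding the padding P used on the next slide gives the general form (N - F + 2P)/stride + 1. A quick check against the examples above (the helper itself is an illustrative sketch):

def conv_output_size(N, F, stride=1, pad=0):
    """(N - F + 2*pad)/stride + 1; a non-integer result means the filter doesn't fit."""
    return (N - F + 2 * pad) / stride + 1

print(conv_output_size(7, 3, stride=1))   # 5.0
print(conv_output_size(7, 3, stride=2))   # 3.0
print(conv_output_size(7, 3, stride=3))   # 2.333... -> stride 3 doesn't fit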
Padding

 Input 7x7, 3x3 filter applied with stride 1, padded with a 1-pixel border => what is the output?
 7x7 output!
 In general, it is common to see CONV layers with stride 1, filters of size FxF, and zero-padding of (F-1)/2 (this preserves the size spatially).
   F = 3 => zero pad with 1
   F = 5 => zero pad with 2
   F = 7 => zero pad with 3
Weight Sharing

 The concept by which the CNN achieves translation invariance.
 Based on the assumption that if one feature is useful to compute at some spatial position (x, y), then it should also be useful to compute at a different position (x2, y2).
 It constrains the neurons in each depth slice to use the same weights and bias across the whole image.
 However, it is possible to relax the parameter-sharing scheme and instead simply call the layer a Locally-Connected Layer.
Weight Sharing

 In practice, the weight update is performed concurrently through parallelization algorithms and special hardware called the Graphics Processing Unit (GPU).
 GPUs have hundreds of simpler cores and thousands of hardware threads that are applied to image regions at the same time.
Number of parameters

 Input volume: 32x32x3; 10 filters of size 5x5 with stride 1, pad 2.
 Number of parameters in this layer?
   Each filter has 5*5*3 + 1 = 76 params (+1 for the bias) => 76*10 = 760
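The same count can be verified with a framework layer; a minimal PyTorch sketch (PyTorch is an assumed dependency, not part of the lecture):

import torch.nn as nn

# 10 filters of size 5x5 over a 3-channel input, stride 1, pad 2
conv = nn.Conv2d(in_channels=3, out_channels=10, kernel_size=5,
                 stride=1, padding=2)

n_params = sum(p.numel() for p in conv.parameters())
print(n_params)   # 10 * (5*5*3 + 1) = 760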
Hierarchy of Convolution Layers
Activation Layer

 After each conv layer, it is conventional to apply a nonlinear function.
 In the past, nonlinear functions like tanh and sigmoid were used, but researchers found that ReLU layers work far better because the network trains a lot faster (due to their computational efficiency) without a significant difference in accuracy. ReLU also helps to alleviate the vanishing gradient problem.
 With a purely linear mapping, a high level of abstraction/generalization would not be possible. Hence, to map a class of images into a manifold of feature vectors, we need a nonlinear activation; without it, it would be very difficult to generalize, since pictures in a class can have too much intra-class variation.
Activation Layer

 ReLU (Rectified Linear Unit): f(x) = max(0, x)
Pooling Layer

 It down-samples the previous layer's feature map.
 Pooling layers follow a sequence of one or more convolutional layers.
 It may be considered a technique to compress or generalize feature representations, and it generally reduces the model's overfitting of the training data.
 They too have a receptive field, often much smaller than that of the convolutional layer. Also, the stride (the number of inputs the receptive field is moved for each activation) is often equal to the size of the receptive field, to avoid any overlap.
 Pooling layers are often very simple, taking the average or the maximum of the input values in order to produce the new feature map.
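A minimal PyTorch sketch of max pooling with a 2x2 receptive field and a matching stride, so the windows do not overlap (tensor shapes are illustrative):

import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)   # stride == window size: no overlap

fmap = torch.randn(1, 10, 28, 28)              # (batch, channels, H, W)
print(pool(fmap).shape)                        # torch.Size([1, 10, 14, 14])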
Pooling Layer
Dropout Layer

 Probabilistically dropping out (ignoring) nodes in the network is a simple and effective regularization method.
 It offers a very computationally cheap and remarkably effective way to reduce overfitting and improve generalization in deep neural networks of all kinds.
 Dropout has the effect of making the training process noisy, forcing nodes within a layer to probabilistically take on more or less responsibility for the inputs.
 It encourages the network to actually learn a sparse representation.
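A minimal PyTorch sketch of a dropout layer (the drop probability of 0.5 is an illustrative choice); note that the layer is active only in training mode:

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)   # each activation is zeroed with probability 0.5

x = torch.ones(1, 8)
drop.train()
print(drop(x))             # roughly half the entries zeroed, survivors scaled by 1/(1-p) = 2
drop.eval()
print(drop(x))             # at inference dropout is a no-op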
Dropout Layer
Fully Connected Layer
 is the normal flat feed-forward neural network layer.
 is preceded by a flatten procedure.
 Contains neurons that connect to the entire input
volume, as in ordinary Neural Networks
 Spatial information is lost at this phase
 These layers may have a non-linear activation
function or a softmax activation in order to output
probabilities of class predictions.
 Fully connected layers are used at the end of the
network after feature extraction and consolidation has
been performed by the convolutional and pooling
layers.
 They are used to create final non-linear combinations
of features and for making predictions by the network.
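A minimal PyTorch sketch of the flatten-then-fully-connected stage described above (the feature-map size, hidden width, and 10-class output are illustrative assumptions):

import torch
import torch.nn as nn

head = nn.Sequential(
    nn.Flatten(),                 # spatial structure is discarded here
    nn.Linear(10 * 14 * 14, 64),  # fully connected layer
    nn.ReLU(),
    nn.Linear(64, 10),            # one output per class (logits for the softmax)
)

fmap = torch.randn(1, 10, 14, 14)   # output of the conv/pool stack (illustrative)
print(head(fmap).shape)             # torch.Size([1, 10])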
Soft-max Layer
 A softmax function is a type of squashing function that limits the output of the function to the range 0 to 1.
 This allows the output to be interpreted directly as a probability. Softmax functions can be seen as multi-class sigmoids, meaning they are used to determine the probability of multiple classes at once.
 Since the outputs of a softmax function can be interpreted as probabilities, a softmax layer is typically the final layer of the network.
 It is important to note that a softmax layer must have the same number of nodes as the output layer.
 It allows for the calculation of the error.
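A minimal NumPy sketch of the softmax squashing (the max subtraction is a standard numerical-stability trick; the example scores are illustrative):

import numpy as np

def softmax(z):
    z = z - np.max(z)                  # numerical stability; doesn't change the result
    e = np.exp(z)
    return e / np.sum(e)

scores = np.array([2.0, 1.0, 0.1])     # raw class scores (logits)
p = softmax(scores)
print(p)                               # ~[0.659 0.242 0.099]
print(p.sum())                         # 1.0 -> interpretable as class probabilities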
Transfer Learning

 A technique that reuses a finished (trained) Deep Learning model in another, more specific task.
 A pretrained CNN is used to process data from a different dataset than the one it was trained on.
 The learned parameters are used as they are.
 Sometimes, some further training is applied to fine-tune the CNN.
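A minimal transfer-learning sketch using torchvision (assumptions: torchvision >= 0.13 is available, ResNet-18 is the reused backbone, and the new task has 10 classes; none of this is specified in the lecture):

import torch.nn as nn
from torchvision import models

# Reuse a CNN pretrained on ImageNet.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Use the learned parameters as they are: freeze them.
for p in model.parameters():
    p.requires_grad = False

# Replace the final classifier for the new, more specific task (10 classes, illustrative).
model.fc = nn.Linear(model.fc.in_features, 10)

# Fine-tuning would then train only model.fc (or selectively unfreeze deeper layers).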
Data Augmentation

 Artificially making the dataset larger.
 Done by applying a collection of simple image transformations to the already included images, yielding new ones, such as: grayscaling, horizontal flips, vertical flips, random crops, color jitter, translations, and rotations.
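A minimal torchvision sketch covering the transformations listed above (all parameter values are illustrative assumptions):

from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomGrayscale(p=0.1),                           # grayscaling
    transforms.RandomHorizontalFlip(),                           # horizontal flips
    transforms.RandomVerticalFlip(),                             # vertical flips
    transforms.RandomCrop(28, padding=4),                        # random crops
    transforms.ColorJitter(brightness=0.2, contrast=0.2),        # color jitter
    transforms.RandomAffine(degrees=15, translate=(0.1, 0.1)),   # rotations + translations
    transforms.ToTensor(),
])
# Applied on the fly during training, so each epoch sees slightly different versions of every image.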
Challenges to CNNs

 A black box: operates in the paradigm of non-explainable AI, with the exception of visualizing output structures at intermediate levels.
 The application of CNNs in unsupervised settings is still lagging behind.
 Limitations to context reasoning.
 Not invariant to some non-affine transformations.
Famous CNNs Listing

 LeNet. The first successful application of Convolutional Networks, developed by Yann LeCun in the 1990s.
 AlexNet. The first work that popularized Convolutional Networks in Computer Vision. AlexNet was submitted to the ImageNet ILSVRC challenge in 2012 and significantly outperformed the second runner-up (top-5 error of 16% compared to the runner-up's 26%). The network had a very similar architecture to LeNet, but was deeper, bigger, and featured Convolutional Layers stacked on top of each other.
 ZF Net. The ILSVRC 2013 winner. It was an improvement on AlexNet by tweaking the architecture hyperparameters, in particular by expanding the size of the middle convolutional layers and making the stride and filter size on the first layer smaller.
 GoogLeNet. The ILSVRC 2014 winner was a Convolutional Network from Szegedy et al. from Google. Its main contribution was the development of an Inception Module that dramatically reduced the number of parameters in the network (4M, compared to AlexNet with 60M). Additionally, this paper uses Average Pooling instead of Fully Connected layers at the top of the ConvNet, eliminating a large number of parameters that do not seem to matter much.
 VGGNet. The runner-up in ILSVRC 2014. Its main contribution was in showing that the depth of the network is a critical component for good performance. Their final best network contains 16 CONV/FC layers and, appealingly, features an extremely homogeneous architecture that only performs 3x3 convolutions and 2x2 pooling from the beginning to the end.
 ResNet. Residual Network was the winner of ILSVRC 2015. It features special skip connections and heavy use of batch normalization. The architecture is also missing fully connected layers at the end of the network.
CNN for Semantic Segmentation
What is semantic segmentation?

 A technique to provide fine-grained, pixel-wise labelling of the image at hand.
 Used in scene understanding.
 Traditionally approached via:
   Image segmentation
   Region-level classification
 Recent approaches try to directly adopt deep architectures designed for category prediction to pixel-level labelling.
CNN for semantic segmentation

A general semantic segmentation architecture can be broadly thought of as an encoder network followed by a decoder network:
 The encoder is usually a pre-trained classification network like VGG/ResNet, followed by a decoder network.
 The task of the decoder is to semantically project the discriminative features (lower resolution) learnt by the encoder onto the pixel space (higher resolution) to get a dense classification.
R-CNN
DeConvNet
DeConvNet - Unpooling

 Pooling in a convolution network is designed to filter noisy activations in a lower layer by abstracting the activations in a receptive field with a single representative value.
 Unpooling layers in the deconvolution network perform the reverse operation of pooling and reconstruct the original size of the activations.
 To implement the unpooling operation, the algorithm records the locations of the maximum activations selected during the pooling operation in variables, which are employed to place each activation back in its original pooled location.
 This unpooling strategy is particularly useful for reconstructing the structure of the input object.
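PyTorch exposes exactly this mechanism: MaxPool2d can return the indices of the maxima, and MaxUnpool2d places activations back at those recorded locations. A minimal sketch (shapes are illustrative):

import torch
import torch.nn as nn

pool = nn.MaxPool2d(2, stride=2, return_indices=True)   # remember where the maxima were
unpool = nn.MaxUnpool2d(2, stride=2)

x = torch.randn(1, 1, 4, 4)
y, idx = pool(x)              # y: 1x1x2x2, idx: locations of the maxima
x_rec = unpool(y, idx)        # 1x1x4x4, sparse: maxima restored, everything else zero
print(x_rec.shape)            # torch.Size([1, 1, 4, 4])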
DeConvNet - Deconvolution

 The output of an unpooling layer is an enlarged, yet sparse, activation map.
 The deconvolution layers densify the sparse activations obtained by unpooling through convolution-like operations with multiple learned filters.
 However, contrary to convolutional layers, which connect multiple input activations within a filter window to a single activation, deconvolutional layers associate a single input activation with multiple outputs.
 The output of the deconvolutional layer is an enlarged and dense activation map.
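In current frameworks this operation is commonly implemented as a transposed convolution; a minimal PyTorch sketch that fans each input activation out to a 2x2 output window (channel counts are illustrative):

import torch
import torch.nn as nn

# One input activation contributes to a 2x2 output window (stride-2 upsampling).
deconv = nn.ConvTranspose2d(in_channels=64, out_channels=32,
                            kernel_size=2, stride=2)

sparse = torch.randn(1, 64, 16, 16)    # e.g. an unpooled, enlarged activation map
dense = deconv(sparse)
print(dense.shape)                     # torch.Size([1, 32, 32, 32])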
Fully Convolutional Network-Based Semantic Segmentation (FCN)

 Learns a mapping from pixels to pixels, without extracting region proposals.
 The FCN pipeline is an extension of the classical CNN.
 Contrary to the classical CNN, FCNs do not have fully-connected layers; they only have convolutional and pooling layers, which gives them the ability to make predictions on arbitrary-sized inputs.
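Because every layer is convolutional, the same network accepts inputs of any spatial size; a minimal PyTorch sketch of a tiny fully convolutional stack (channel counts and the 5-class output are illustrative assumptions):

import torch
import torch.nn as nn

fcn = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 5, 1),    # 1x1 conv acts as a per-pixel classifier (5 classes)
)

print(fcn(torch.randn(1, 3, 64, 64)).shape)    # torch.Size([1, 5, 32, 32])
print(fcn(torch.randn(1, 3, 96, 128)).shape)   # torch.Size([1, 5, 48, 64]) -- arbitrary sizes work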
FCN
 One issue with this specific FCN is that, by propagating through several alternating convolutional and pooling layers, the resolution of the output feature maps is down-sampled. Therefore, the direct predictions of the FCN are typically in low resolution, resulting in relatively fuzzy object boundaries.
 A variety of more advanced FCN-based approaches have been proposed to address this issue, including SegNet, DeepLab-CRF, and Dilated Convolutions.
