Intro DL 02
1 Convolution Operation
2 CNNs: Overview
3 Architecture
Convolutional Layers
Pooling Layers
Activation Layers
Case Study: VGG Network
Visualizing what CNNs Learn
Section 1
Convolution Operation
Convolution operation
Discrete version:
(f ⊛ g)[n] = ∑_{m=−∞}^{+∞} f[m] g[n − m]
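As a quick sanity check, the discrete convolution above can be computed directly with NumPy; a minimal sketch (the array contents are arbitrary example values):

import numpy as np

f = np.array([1, 2, 3])       # example signal
g = np.array([0, 1, 0.5])     # example kernel

# 'full' mode computes (f ⊛ g)[n] = sum_m f[m] g[n − m] for every n
# at which the two signals overlap.
out = np.convolve(f, g, mode='full')
print(out)                    # [0.  1.  2.5 4.  1.5]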
Example
Properties
Commutativity: f ⊛ g = g ⊛ f
Associativity: (f ⊛ g) ⊛ h = f ⊛ (g ⊛ h)
Distributivity: f ⊛ (g + h) = f ⊛ g + f ⊛ h
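These properties are easy to verify numerically; a small check with arbitrarily chosen example arrays:

import numpy as np

f = np.array([1., 2., 3.])
g = np.array([0., 1., 0.5])
h = np.array([2., -1., 0.])

# Commutativity, associativity and distributivity of discrete convolution
assert np.allclose(np.convolve(f, g), np.convolve(g, f))
assert np.allclose(np.convolve(np.convolve(f, g), h),
                   np.convolve(f, np.convolve(g, h)))
assert np.allclose(np.convolve(f, g + h),
                   np.convolve(f, g) + np.convolve(f, h))
print("all properties hold")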
Section 2
CNNs: overview
CNNs are made up of neurons with learnable weights and biases, just like the networks we have already seen. However, CNNs make the explicit assumption that their inputs are images, which allows encoding this structure directly into the architecture.
Section 3
Architecture
CNN Architecture
Remark: in the following we’ll use the word depth to indicate the number of channels of an activation volume. This has nothing to do with the depth of the whole network, which usually refers to the total number of layers in the network.
CNN Architecture
A ”real-world” CNN is made up of many layers stacked one on top of the other.
Each convolutional layer is characterized by the following hyperparameters:
Number of filters N
Kernel size K, the spatial size of the filters being convolved
Filter stride S, the factor by which to downscale
The presence and amount of spatial padding P on the input volume may be considered an additional hyperparameter. In practice, padding is usually performed to avoid the headaches caused by convolutions ”eating the borders”.
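To make the hyperparameters concrete, here is how they map onto a standard convolutional layer in PyTorch; the specific values N=32, K=3, S=1, P=1 are just an illustration:

import torch
import torch.nn as nn

N, K, S, P = 32, 3, 1, 1          # number of filters, kernel size, stride, padding
conv = nn.Conv2d(in_channels=3, out_channels=N, kernel_size=K, stride=S, padding=P)

x = torch.randn(1, 3, 224, 224)   # a batch with one 224x224 RGB image
y = conv(x)
print(y.shape)                    # torch.Size([1, 32, 224, 224]) with this padding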
Visualizing Convolution 2D
Convolution
So far, our networks compute a nonlinear function of the inputs. If we work with images, such a network has to learn the spatial structure of the data by itself, which takes a long time.
Convolution
Example of hand-crafted 3x3 kernels (the classic Sobel edge-detection filters):
[ −1 −2 −1 ]   [ −1 0 1 ]
[  0  0  0 ]   [ −2 0 2 ]
[  1  2  1 ]   [ −1 0 1 ]
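As an illustration, such a hand-crafted kernel can be convolved with an image using SciPy; a minimal sketch (the image here is just random noise so the example runs standalone):

import numpy as np
from scipy.signal import convolve2d

sobel_y = np.array([[-1, -2, -1],
                    [ 0,  0,  0],
                    [ 1,  2,  1]])

image = np.random.rand(64, 64)             # placeholder grayscale image
edges = convolve2d(image, sobel_y, mode='same')
print(edges.shape)                         # (64, 64)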
Looking closer, neurons in a CNN perform the very same operation as the neurons we already know from DNNs:
∑_i wi xi + b
Parameter Compatibility
If I is the length of the input volume along one dimension, F the length of the filter, Pstart and Pend the amounts of zero padding, and S the stride, then the output size O of the feature map along that dimension is given by:
O = (I − F + Pstart + Pend)/S + 1
source: https://stanford.edu/~shervine/teaching/cs-230
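The formula is easy to wrap in a small helper; a sketch (integer division assumes the hyperparameters are compatible with the input size):

def output_size(I, F, P_start, P_end, S):
    """Output length along one dimension of a convolution."""
    return (I - F + P_start + P_end) // S + 1

# Example: 7-pixel input, 3-pixel filter, no padding, stride 2 -> 3 outputs
print(output_size(7, 3, 0, 0, 2))   # 3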
Convolutional Layers: Parameter Sharing
Assumption: if a feature is useful to compute at some spatial location (x, y), then it should also be useful to compute it at other locations (xi , yi ). Thus, we constrain the neurons in each depth slice to use the same weights and bias.
If all neurons in a single depth slice are using the same weight vector, then the
forward pass of the convolutional layer can in each depth slice be computed as a
convolution of the neuron’s weights with the input volume (hence the name).
This is why it is common to refer to each set of weights as a filter (or a kernel),
that is convolved with the input.
Convolutional Layers: Parameter Sharing
Example of weights learned by [?]. Each of the 96 filters shown here is of size [11x11x3],
and each one is shared by the 55*55 neurons in one depth slice. Notice that the
parameter sharing assumption is relatively reasonable: If detecting a horizontal edge is
important at some location in the image, it should intuitively be useful at some other
location as well due to the translationally-invariant structure of images.
Convolution
Image from http://www.matthewzeiler.com/pubs/arxive2013/eccv2014.pdf
Convolutional Layers: Number of Learnable Parameters
tot_learnable = N ∗ K ∗ K ∗ C1 + N
Explanation: there are N filters convolving over the input volume. The connectivity is local in width and height, but extends through the full depth of the input volume, so there are K ∗ K ∗ C1 parameters for each filter. Furthermore, each filter has an additive learnable bias.
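The count can be cross-checked against an actual layer, e.g. in PyTorch; the values N=96, K=11, C1=3 mirror the AlexNet example above:

import torch.nn as nn

N, K, C1 = 96, 11, 3
conv = nn.Conv2d(in_channels=C1, out_channels=N, kernel_size=K)

n_params = sum(p.numel() for p in conv.parameters())
print(n_params)                    # 34944
print(N * K * K * C1 + N)          # 34944, matching the formula above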
Pooling Layers: overview
Pooling layers spatially subsample the input volume.
Each depth slice of the input is processed independently.
Two hyperparameters: the pool size K and the pool stride S.
Pooling Layers: types
Pooling Layers: why
Most common configuration: pool size K = 2x2, stride S = 2. In this setting 75% of the input volume activations are discarded (each 2x2 window keeps only one activation out of four).
Pooling Layers: why not
There’s a lot of research on getting rid of pooling layers while maintaining their benefits (e.g. [?, ?]). We’ll see whether future architectures will still feature pooling layers.
Activation Layers
Activation Layers
ReLU wins
ReLU was found to greatly accelerate the convergence of SGD compared to sigmoid/tanh activations [?]. Furthermore, ReLU can be implemented by a simple threshold, whereas other activations require more expensive operations.
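The ”simple threshold” remark can be seen directly in code; a minimal NumPy version of ReLU:

import numpy as np

def relu(x):
    # ReLU is just an element-wise threshold at zero
    return np.maximum(0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5])))   # [0.  0.  0.  1.5]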
Why use non-linear activations at all?
The composition of linear functions is itself a linear function. Without nonlinearities, neural networks would reduce to one-layer logistic regression.
Computing Output Volume Size
Convolutional layer: given an input volume of size H1 x W1 x C1 , the output of a convolutional layer with N filters of spatial size K, stride S and padding P is a volume of shape H2 x W2 x C2 , where:
H2 = (H1 − K + 2P)/S + 1
W2 = (W1 − K + 2P)/S + 1
C2 = N
Computing Output Volume Size
Pooling layer: given an input volume of size H1 x W1 x C1 , the output of a pooling layer with pool size K and pool stride S is a volume of shape H2 x W2 x C2 , where:
H2 = (H1 − K)/S + 1
W2 = (W1 − K)/S + 1
C2 = C1
Activation layers operate element-wise, so they leave the volume shape unchanged:
H2 = H1
W2 = W1
C2 = C1
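Both rules are easy to encode as small helpers; a sketch of the shape bookkeeping (square windows assumed, as in the formulas above):

def conv_output_shape(H1, W1, C1, N, K, S, P):
    H2 = (H1 - K + 2 * P) // S + 1
    W2 = (W1 - K + 2 * P) // S + 1
    return H2, W2, N                 # depth becomes the number of filters

def pool_output_shape(H1, W1, C1, K, S):
    H2 = (H1 - K) // S + 1
    W2 = (W1 - K) // S + 1
    return H2, W2, C1                # depth is unchanged

# Example: 224x224x3 input, 64 filters of size 3, stride 1, padding 1,
# followed by 2x2 max pooling with stride 2
print(conv_output_shape(224, 224, 3, 64, 3, 1, 1))   # (224, 224, 64)
print(pool_output_shape(224, 224, 64, 2, 2))         # (112, 112, 64)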
ADVANCED CNN ARCHITECTURES
More complex CNN architectures have recently been demonstrated to perform
better than the traditional conv -> relu -> pool stack architecture.
These architectures usually feature different graph topologies and much more intricate connectivity structures.
Convolutional neural network
VGG
VGG [?] indicates a deep convolutional network for image recognition developed and trained in 2014 by the Oxford Visual Geometry Group.
Performance of the network is (was) great. In 2014 the VGG team secured the first and the second places in the localization and classification challenges on ImageNet;
Pre-trained weights were released in Caffe [?] and converted by the deep learning community to a variety of other frameworks;
The authors’ architectural choices led to a very neat network model, subsequently taken as a guideline for a number of later works.
VGG16 Architecture
Input: fixed-size 224x224 RGB images. For training, images are pre-processed by subtracting the mean RGB value computed on the training set.
Spatial pooling is carried out by five max pooling layers, each performed over a 2x2 pixel window with stride 2.
The fully connected layers feature 4096 neurons each, followed by ReLU. The very last fully connected layer is composed of 1000 neurons (as many as the ImageNet classes) and is followed by a softmax activation.
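As a rough sketch of how these choices translate into code, here is the beginning of a VGG16-style stack in PyTorch. This is an illustrative reconstruction, not the authors’ original code; the 3x3 convolutions with stride 1 and padding 1 are assumed from the standard VGG16 configuration:

import torch.nn as nn

vgg16_head = nn.Sequential(
    # Block 1: two 3x3 convs with 64 filters, then 2x2 max pooling
    nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2, stride=2),
    # Block 2: two 3x3 convs with 128 filters, then 2x2 max pooling
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(128, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2, stride=2),
    # ... three more blocks with 256, 512 and 512 filters ...
)

classifier = nn.Sequential(
    nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True),
    nn.Linear(4096, 4096), nn.ReLU(inplace=True),
    nn.Linear(4096, 1000),            # one output per ImageNet class
    nn.Softmax(dim=1),
)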
VGG16 Computational Footprint
Each image takes approximately 93 MB of memory for the forward pass (about 24 million float32 activations). As a rule of thumb, the backward pass consumes roughly double the resources.
Most of the memory usage is due to the first layers of the network.
The Myth of Interpretability
Convolutional neural networks have often been criticized for their lack of interpretability [?]. The main objection is that we are dealing with big, complex black boxes that give correct results even though we have no clue of what is happening inside.
The Myth of Interpretability
On the other side, linear models and decision trees are often presented as ”champions” of interpretability. The debate on whether a logistic regression is more or less interpretable than a deep network is complex and out of the scope of this lecture.
Visualizing Activations
Visualizing activations of the network during the forward pass is straightforward
and can be useful to detect dead filters (i.e. activations that are zero whatever
the input).
Activations on the 1st conv layer (left), and the 5th conv layer (right) of a trained
AlexNet looking at a picture of a cat. Every box shows an activation map corresponding
to some filter. Notice that the activations are sparse and mostly local.
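In modern frameworks this can be done with forward hooks; a minimal sketch in PyTorch (the torchvision AlexNet is used here just as a stand-in for the trained network from the figure):

import torch
import torchvision

model = torchvision.models.alexnet()          # untrained weights are enough for a demo
activations = {}

def save_activation(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

# Register a hook on the first convolutional layer
model.features[0].register_forward_hook(save_activation("conv1"))

x = torch.randn(1, 3, 224, 224)               # placeholder input image
model(x)
print(activations["conv1"].shape)             # torch.Size([1, 64, 55, 55])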
Inspecting Weights
Visualizing the learned weights is another common strategy to get insight into what the network looks for in the images. The most interpretable weights are the ones learned by the first convolutional layer, which operates directly on the image pixels.
Partially Occluding the Images
t-SNE Embedding
CNNs can be interpreted as gradually transforming the images into a representation in which the classes are separable by a linear classifier. We can get a rough idea of the topology of this space by embedding the images into two dimensions, so that pairwise distances in the low-dimensional representation approximately match those in the high-dimensional one. Shown here: a t-SNE embedding of a set of images.
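A typical recipe is to take the activations of a late fully connected layer as image features and feed them to an off-the-shelf t-SNE implementation; a sketch with scikit-learn (random features stand in for real CNN codes):

import numpy as np
from sklearn.manifold import TSNE

# Placeholder for CNN features, e.g. 500 images x 4096-d fc activations
features = np.random.rand(500, 4096)

embedding = TSNE(n_components=2, perplexity=30).fit_transform(features)
print(embedding.shape)            # (500, 2)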