(W − F + 2P)/S + 1: Use of Zero-Padding

The correct formula for calculating how many neurons “fit” is given by (W − F + 2P)/S + 1.

For example, for a 7x7 input and a 3x3 filter with stride 1 and pad 0, we would get a 5x5 output. With stride 2 we would get a 3x3 output.
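To make the formula concrete, here is a minimal sketch (the helper name conv_output_size is ours, not from any library) that evaluates (W − F + 2P)/S + 1 for these two cases:

def conv_output_size(W, F, S=1, P=0):
    # Number of neurons that "fit": (W - F + 2P)/S + 1
    return (W - F + 2 * P) // S + 1

print(conv_output_size(7, 3, S=1))   # 5 -> a 5x5 output
print(conv_output_size(7, 3, S=2))   # 3 -> a 3x3 output

Let's also look at one more graphical example: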

Illustration of spatial arrangement. In this example there is only one spatial dimension (x-axis), one neuron with a receptive field size of F = 3, the input size is W = 5, and there is zero padding of P = 1. Left: The neuron strides across the input with a stride of S = 1, giving an output of size (5 - 3 + 2)/1 + 1 = 5. Right: The neuron uses a stride of S = 2, giving an output of size (5 - 3 + 2)/2 + 1 = 3. Notice that stride S = 3 could not be used since it wouldn't fit neatly across the volume. In terms of the equation, this can be determined since (5 - 3 + 2) = 4 is not divisible by 3.
The neuron weights in this example are [1, 0, -1] (shown on the far right), and its bias is zero. These weights are shared across all yellow neurons (see parameter sharing below).
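The left panel can be reproduced numerically in a few lines (the input values here are made up for illustration; the weights [1, 0, -1] and zero bias are from the figure):

import numpy as np

x = np.array([2, 3, 1, 0, 4])                 # made-up input of size W = 5
x_pad = np.pad(x, 1)                          # zero padding of P = 1
w, b = np.array([1, 0, -1]), 0                # shared weights and bias
out = [int(np.sum(x_pad[i:i+3] * w) + b) for i in range(5)]   # S = 1 -> 5 outputs
print(out)                                    # [-3, 1, 3, -3, 0] for this input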

Use of zero-padding. In the example above on the left, note that the input dimension was 5 and the output dimension was equal: also 5. This worked out so because our receptive field was 3 and we used zero padding of 1. If no zero-padding were used, then the output volume would have had a spatial dimension of only 3, because that is how many neurons would have “fit” across the original input. In general, setting zero padding to P = (F − 1)/2 when the stride is S = 1 ensures that the input volume and output volume will have the same size spatially. It is very common to use zero-padding in this way, and we will discuss the full reasons when we talk more about ConvNet architectures.
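A quick numeric check of this rule, using the sizes from the example above:

W, F, S = 5, 3, 1
P = (F - 1) // 2                  # = 1 for a receptive field of 3
print((W - F + 2 * P) // S + 1)   # 5: the output size equals the input size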

Constraints on strides. Note again that the spatial arrangement hyperparameters have mutual constraints. For example, when the input has size W = 10, no zero-padding is used (P = 0), and the filter size is F = 3, then it would be impossible to use stride S = 2, since (W − F + 2P)/S + 1 = (10 − 3 + 0)/2 + 1 = 4.5, i.e. not an integer, indicating that the neurons don’t “fit” neatly and symmetrically across the input. Therefore, this setting of the hyperparameters is considered to be invalid, and a ConvNet library could throw an exception or zero pad the rest to make it fit, or crop the input to make it fit, or something. As we will see in the ConvNet architectures section, sizing the ConvNets appropriately so that all the dimensions “work out” can be a real headache, which the use of zero-padding and some design guidelines will significantly alleviate.
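A library might enforce this constraint with a check along these lines (a sketch; the function name check_conv_fit is ours):

def check_conv_fit(W, F, S, P=0):
    # Return the output size, or raise if the neurons don't tile the input exactly.
    if (W - F + 2 * P) % S != 0:
        raise ValueError(f"stride {S} does not fit: ({W} - {F} + 2*{P}) is not divisible by {S}")
    return (W - F + 2 * P) // S + 1

check_conv_fit(10, 3, 2)   # raises ValueError, matching the example above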
Real-world example. The Krizhevsky et al. architecture that won the ImageNet challenge in 2012 accepted images of size [227x227x3]. On the first Convolutional Layer, it used neurons with receptive field size F = 11, stride S = 4, and no zero padding (P = 0). Since (227 - 11)/4 + 1 = 55, and since the Conv layer had a depth of K = 96, the Conv layer output volume had size [55x55x96]. Each of the 55*55*96 neurons in this volume was connected to a region of size [11x11x3] in the input volume. Moreover, all 96 neurons in each depth column are connected to the same [11x11x3] region of the input, but of course with different weights. As a fun aside, if you read the actual paper it claims that the input images were 224x224, which is surely incorrect because (224 - 11)/4 + 1 is quite clearly not an integer. This has confused many people in the history of ConvNets and little is known about what happened. My own best guess is that Alex used zero-padding of 3 extra pixels that he does not mention in the paper.
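A quick check of the two input sizes makes the inconsistency obvious (plain arithmetic, not code from the paper):

print((227 - 11) / 4 + 1)   # 55.0  -> consistent with the [55x55x96] output
print((224 - 11) / 4 + 1)   # 54.25 -> not an integer, so 224x224 cannot be right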

Parameter Sharing. A parameter sharing scheme is used in Convolutional Layers to control the number of parameters. Using the real-world example above, we see that there are 55*55*96 = 290,400 neurons in the first Conv Layer, and each has 11*11*3 = 363 weights and 1 bias. Together, this adds up to 290400 * 364 = 105,705,600 parameters on the first layer of the ConvNet alone. Clearly, this number is very high.

It turns out that we can dramatically reduce the number of parameters by making one reasonable
assumption: That if one feature is useful to compute at some spatial position (x,y), then it should
also be useful to compute at a different position (x2,y2). In other words, denoting a single 2-
dimensional slice of depth as a depth slice (e.g. a volume of size [55x55x96] has 96 depth slices,
each of size [55x55]), we are going to constrain the neurons in each depth slice to use the same
weights and bias. With this parameter sharing scheme, the first Conv Layer in our example would
now have only 96 unique sets of weights (one for each depth slice), for a total of 96*11*11*3 =
34,848 unique weights, or 34,944 parameters (+96 biases). Alternatively, all 55*55 neurons in
each depth slice will now be using the same parameters. In practice during backpropagation,
every neuron in the volume will compute the gradient for its weights, but these gradients will be
added up across each depth slice and only update a single set of weights per slice.
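The arithmetic can be double-checked in a couple of lines (a quick sketch using the numbers above):

neurons = 55 * 55 * 96            # 290,400 neurons in the first Conv layer
per_neuron = 11 * 11 * 3 + 1      # 363 weights + 1 bias = 364
print(neurons * per_neuron)       # 105,705,600 parameters without sharing
print(96 * per_neuron)            # 34,944 parameters with parameter sharing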

Notice that if all neurons in a single depth slice are using the same weight vector, then the forward pass of the CONV layer can in each depth slice be computed as a convolution of the neuron’s weights with the input volume (hence the name: Convolutional Layer). This is why it is common to refer to the sets of weights as a filter (or a kernel) that is convolved with the input.
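In code, the forward pass for one depth slice is a sliding dot product of the shared filter with the input volume. Here is a minimal sketch (the function name conv_slice is ours; no padding is assumed):

import numpy as np

def conv_slice(X, W, b, S=1):
    # One depth slice of a Conv layer: slide the shared weights W over input X.
    # (Technically a cross-correlation, since the filter is not flipped.)
    F = W.shape[0]                              # filter extent; W has shape (F, F, depth)
    H_out = (X.shape[0] - F) // S + 1
    W_out = (X.shape[1] - F) // S + 1
    out = np.zeros((H_out, W_out))
    for i in range(H_out):
        for j in range(W_out):
            patch = X[i*S:i*S+F, j*S:j*S+F, :]  # local receptive field
            out[i, j] = np.sum(patch * W) + b   # same W, b at every position
    return out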
Example filters learned by Krizhevsky et al. Each of the 96 filters shown here is of size [11x11x3], and each
one is shared by the 55*55 neurons in one depth slice. Notice that the parameter sharing assumption is
relatively reasonable: If detecting a horizontal edge is important at some location in the image, it should
intuitively be useful at some other location as well due to the translationally-invariant structure of images.
There is therefore no need to relearn to detect a horizontal edge at every one of the 55*55 distinct locations
in the Conv layer output volume.

Note that sometimes the parameter sharing assumption may not make sense. This is especially the case when the input images to a ConvNet have some specific centered structure, where we should expect, for example, that completely different features should be learned on one side of the image than another. One practical example is when the inputs are faces that have been centered in the image. You might expect that different eye-specific or hair-specific features could (and should) be learned in different spatial locations. In that case it is common to relax the parameter sharing scheme, and instead simply call the layer a Locally-Connected Layer.

Numpy examples. To make the discussion above more concrete, let's express the same ideas in code and with a specific example. Suppose that the input volume is a numpy array X. Then:

A depth column (or a fibre) at position (x,y) would be the activations X[x,y,:].
A depth slice, or equivalently an activation map at depth d, would be the activations X[:,:,d].

Conv Layer Example. Suppose that the input volume X has shape X.shape: (11,11,4). Suppose further that we use no zero padding (P = 0), that the filter size is F = 5, and that the stride is S = 2. The output volume would therefore have spatial size (11 - 5)/2 + 1 = 4, giving a volume with width and height of 4. The activation map in the output volume (call it V) would then look as follows (only some of the elements are computed in this example):

# The 5x5x4 window slides along the first axis in steps of the stride S = 2:
V[0,0,0] = np.sum(X[:5,:5,:] * W0) + b0
V[1,0,0] = np.sum(X[2:7,:5,:] * W0) + b0
V[2,0,0] = np.sum(X[4:9,:5,:] * W0) + b0
V[3,0,0] = np.sum(X[6:11,:5,:] * W0) + b0
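For the indexing above to actually run, it suffices to define arrays with the stated shapes; a minimal setup sketch (the random values are placeholders, and only one depth slice of V is shown):

import numpy as np

X = np.random.randn(11, 11, 4)   # input volume of shape (11,11,4)
W0 = np.random.randn(5, 5, 4)    # shared weights of the first filter, F = 5
b0 = 0.0                         # shared bias of the first depth slice
V = np.zeros((4, 4, 1))          # output volume with spatial size (11-5)/2+1 = 4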
