Convolutional Layer: Web-Based Demo
The activations of an example ConvNet architecture. The initial volume stores the raw image pixels (left) and
the last volume stores the class scores (right). Each volume of activations along the processing path is
shown as a column. Since it's difficult to visualize 3D volumes, we lay out each volume's slices in rows. The
last layer volume holds the scores for each class, but here we only visualize the sorted top 5 scores, and
print the labels of each one. The full web-based demo is shown in the header of our website. The
architecture shown here is a tiny VGG Net, which we will discuss later.
We now describe the individual layers and the details of their hyperparameters and their
connectivities.
Convolutional Layer
The Conv layer is the core building block of a Convolutional Network that does most of the
computational heavy lifting.
Overview and intuition without brain stuff. Let's first discuss what the CONV layer computes
without brain/neuron analogies. The CONV layer’s parameters consist of a set of learnable filters.
Every filter is small spatially (along width and height), but extends through the full depth of the
input volume. For example, a typical filter on a first layer of a ConvNet might have size 5x5x3 (i.e.
5 pixels width and height, and 3 because images have depth 3, the color channels). During the
forward pass, we slide (more precisely, convolve) each filter across the width and height of the
input volume and compute dot products between the entries of the filter and the input at any
position. As we slide the filter over the width and height of the input volume we will produce a 2-
dimensional activation map that gives the responses of that filter at every spatial position.
Intuitively, the network will learn filters that activate when they see some type of visual feature
such as an edge of some orientation or a blotch of some color on the first layer, or eventually
entire honeycomb or wheel-like patterns on higher layers of the network. Now, we will have an
entire set of filters in each CONV layer (e.g. 12 filters), and each of them will produce a separate 2-
dimensional activation map. We will stack these activation maps along the depth dimension and
produce the output volume.
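To make the sliding-dot-product picture concrete, here is a minimal numpy sketch of the forward pass for a single filter with stride 1 and no padding. The array names and sizes are illustrative (a hypothetical 32x32x3 input and one 5x5x3 filter), not part of any particular library:

```python
import numpy as np

X = np.random.randn(32, 32, 3)   # input volume (height, width, depth)
w = np.random.randn(5, 5, 3)     # one filter: small spatially, full depth
b = 0.0                          # bias for this filter

H, W, _ = X.shape
F = w.shape[0]
activation_map = np.zeros((H - F + 1, W - F + 1))  # 28x28 for these sizes

for i in range(H - F + 1):
    for j in range(W - F + 1):
        # dot product between the filter and the local input region
        activation_map[i, j] = np.sum(X[i:i+F, j:j+F, :] * w) + b

# With a set of 12 such filters, stacking the 12 activation maps along
# the depth dimension would yield an output volume of shape (28, 28, 12).
```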
Local Connectivity. When dealing with high-dimensional inputs such as images, as we saw above,
it is impractical to connect neurons to all neurons in the previous volume. Instead, we will connect
each neuron to only a local region of the input volume. The spatial extent of this connectivity is a
hyperparameter called the receptive field of the neuron (equivalently this is the filter size). The
extent of the connectivity along the depth axis is always equal to the depth of the input volume. It
is important to emphasize again this asymmetry in how we treat the spatial dimensions (width
and height) and the depth dimension: The connections are local in space (along width and
height), but always full along the entire depth of the input volume.
Example 1. For example, suppose that the input volume has size [32x32x3] (e.g. an RGB CIFAR-10
image). If the receptive field (or the filter size) is 5x5, then each neuron in the Conv Layer will have
weights to a [5x5x3] region in the input volume, for a total of 5*5*3 = 75 weights (and +1 bias
parameter). Notice that the extent of the connectivity along the depth axis must be 3, since this is
the depth of the input volume.
Example 2. Suppose an input volume had size [16x16x20]. Then using an example receptive field
size of 3x3, every neuron in the Conv Layer would now have a total of 3*3*20 = 180 connections
to the input volume. Notice that, again, the connectivity is local in space (e.g. 3x3), but full along
the input depth (20).
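The following short sketch checks the counts from the two examples and shows what a single neuron actually computes: a dot product over its local 3D region. All names here are hypothetical, chosen only to mirror Example 2:

```python
import numpy as np

# Example 1: a [32x32x3] input with a 5x5 receptive field.
F1, depth1 = 5, 3
print(F1 * F1 * depth1)        # 75 weights per neuron (plus 1 bias)

# Example 2: a [16x16x20] input with a 3x3 receptive field.
F2, depth2 = 3, 20
print(F2 * F2 * depth2)        # 180 connections per neuron

# One neuron's computation on the Example 2 volume.
X = np.random.randn(16, 16, 20)
w = np.random.randn(3, 3, 20)  # weights span the full input depth
b = 0.0
region = X[0:3, 0:3, :]        # local in space (3x3), full along depth (20)
activation = np.sum(region * w) + b
```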
Left: An example input volume in red (e.g. a 32x32x3 CIFAR-10 image), and an example volume of neurons in
the first Convolutional layer. Each neuron in the convolutional layer is connected only to a local region in the
input volume spatially, but to the full depth (i.e. all color channels). Note, there are multiple neurons (5 in this
example) along the depth, all looking at the same region in the input - see discussion of depth columns in
text below. Right: The neurons from the Neural Network chapter remain unchanged: They still compute a dot
product of their weights with the input followed by a non-linearity, but their connectivity is now restricted to
be local spatially.
Spatial arrangement. We have explained the connectivity of each neuron in the Conv Layer to the
input volume, but we haven't yet discussed how many neurons there are in the output volume or
how they are arranged. Three hyperparameters control the size of the output volume: the depth,
stride and zero-padding. We discuss these next:
1. First, the depth of the output volume is a hyperparameter: it corresponds to the number of
filters we would like to use, each learning to look for something different in the input. For
example, if the first Convolutional Layer takes as input the raw image, then different
neurons along the depth dimension may activate in presence of various oriented edges, or
blobs of color. We will refer to a set of neurons that are all looking at the same region of the
input as a depth column (some people also prefer the term fibre).
2. Second, we must specify the stride with which we slide the filter. When the stride is 1 then
we move the filters one pixel at a time. When the stride is 2 (or, uncommonly, 3 or more) then
the filters jump 2 pixels at a time as we slide them
around. This will produce smaller output volumes spatially.
3. As we will soon see, sometimes it will be convenient to pad the input volume with zeros
around the border. The size of this zero-padding is a hyperparameter. The nice feature of
zero padding is that it will allow us to control the spatial size of the output volumes (most
commonly as we’ll see soon we will use it to exactly preserve the spatial size of the input
volume so the input and output width and height are the same).
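As a quick illustration of the third point, here is a sketch of zero-padding with numpy, assuming a hypothetical 32x32x3 input: with a 5x5 filter and stride 1, padding with 2 zeros on each border preserves the spatial size of the output.

```python
import numpy as np

X = np.random.randn(32, 32, 3)
P = 2  # amount of zero-padding on each border

# Pad the two spatial dimensions with zeros; leave the depth untouched.
X_padded = np.pad(X, ((P, P), (P, P), (0, 0)), mode='constant')
print(X_padded.shape)             # (36, 36, 3)

# Sliding a 5x5 filter over the padded input at stride 1:
print(X_padded.shape[0] - 5 + 1)  # 32: output width/height matches the input
```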
We can compute the spatial size of the output volume as a function of the input volume size (W),
the receptive field size of the Conv Layer neurons (F), the stride with which they are applied (S),
and the amount of zero padding used (P) on the border. You can convince yourself that the