Convolutional Layer: Web-Based Demo

This document discusses convolutional neural network (CNN) architectures and their components. It describes how CNNs are composed of layers that transform the input image volume into an output volume holding class scores. The main layer types are convolutional (CONV), fully-connected (FC), rectified linear unit (RELU), and pooling (POOL) layers. Each layer may have parameters and hyperparameters. The core building block is the convolutional layer, which applies learnable filters to local regions of the input volume to produce feature maps in the output volume. Key hyperparameters that determine the output size are the filter size, stride, depth, and zero padding.

- A ConvNet architecture is in the simplest case a list of Layers that transform the image volume into an output volume (e.g. holding the class scores)
- There are a few distinct types of Layers (e.g. CONV/FC/RELU/POOL are by far the most popular)
- Each Layer accepts an input 3D volume and transforms it to an output 3D volume through a differentiable function
- Each Layer may or may not have parameters (e.g. CONV/FC do, RELU/POOL don't)
- Each Layer may or may not have additional hyperparameters (e.g. CONV/FC/POOL do, RELU doesn't)
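To make this composition concrete, here is a minimal Python/NumPy sketch (an illustration under stated assumptions, not code from these notes): the network is literally a list of functions, each mapping a 3D volume to a 3D volume. Only the parameter-free RELU and POOL transforms are implemented; CONV and FC layers would slot into the same list.

```python
import numpy as np

def relu(v):
    # RELU: elementwise max(0, x); no parameters and no hyperparameters.
    return np.maximum(v, 0.0)

def pool2x2(v):
    # POOL: 2x2 max pooling with stride 2 (the size is a hyperparameter,
    # but there are no learnable parameters). Assumes even height/width.
    h, w, d = v.shape
    return v.reshape(h // 2, 2, w // 2, 2, d).max(axis=(1, 3))

volume = np.random.rand(32, 32, 3)   # raw image pixels, H x W x D layout (an assumption)
for layer in [relu, pool2x2]:        # CONV/FC layers would chain the same way
    volume = layer(volume)
print(volume.shape)                  # (16, 16, 3)
```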

The activations of an example ConvNet architecture. The initial volume stores the raw image pixels (left) and
the last volume stores the class scores (right). Each volume of activations along the processing path is
shown as a column. Since it's difficult to visualize 3D volumes, we lay out each volume's slices in rows. The
last layer volume holds the scores for each class, but here we only visualize the sorted top 5 scores, and
print the labels of each one. The full web-based demo is shown in the header of our website. The
architecture shown here is a tiny VGG Net, which we will discuss later.

We now describe the individual layers and the details of their hyperparameters and their
connectivities.

Convolutional Layer

The Conv layer is the core building block of a Convolutional Network that does most of the
computational heavy lifting.
Overview and intuition without brain stuff. Let's first discuss what the CONV layer computes
without brain/neuron analogies. The CONV layer’s parameters consist of a set of learnable filters.
Every filter is small spatially (along width and height), but extends through the full depth of the
input volume. For example, a typical filter on a first layer of a ConvNet might have size 5x5x3 (i.e.
5 pixels width and height, and 3 because images have depth 3, the color channels). During the
forward pass, we slide (more precisely, convolve) each filter across the width and height of the
input volume and compute dot products between the entries of the filter and the input at any
position. As we slide the filter over the width and height of the input volume we will produce a 2-
dimensional activation map that gives the responses of that filter at every spatial position.
Intuitively, the network will learn filters that activate when they see some type of visual feature
such as an edge of some orientation or a blotch of some color on the first layer, or eventually
entire honeycomb or wheel-like patterns on higher layers of the network. Now, we will have an
entire set of filters in each CONV layer (e.g. 12 filters), and each of them will produce a separate 2-
dimensional activation map. We will stack these activation maps along the depth dimension and
produce the output volume.
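As a hedged NumPy sketch of this computation (stride 1, no zero padding, and a K x F x F x D filter layout are assumptions of the illustration, not specifics from the text): slide each filter over width and height, take a dot product against the full input depth at every position, and stack the resulting 2D activation maps along the depth dimension.

```python
import numpy as np

def conv_forward(volume, filters, biases):
    # Naive convolution: stride 1 and no padding are assumptions of this sketch.
    H, W, D = volume.shape                 # input volume, e.g. (32, 32, 3)
    K, F, _, _ = filters.shape             # K filters, each of size F x F x D
    out = np.zeros((H - F + 1, W - F + 1, K))
    for k in range(K):                     # one 2D activation map per filter...
        for y in range(H - F + 1):
            for x in range(W - F + 1):
                patch = volume[y:y + F, x:x + F, :]   # local region, full depth
                out[y, x, k] = np.sum(patch * filters[k]) + biases[k]
    return out                             # ...stacked along the depth dimension

volume = np.random.rand(32, 32, 3)
filters = np.random.randn(12, 5, 5, 3)     # e.g. 12 learnable 5x5x3 filters
print(conv_forward(volume, filters, np.zeros(12)).shape)   # (28, 28, 12)
```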

The brain view. If you're a fan of the brain/neuron analogies, every entry in the 3D output volume
can also be interpreted as an output of a neuron that looks at only a small region in the input and
shares parameters with all neurons to the left and right spatially (since these numbers all result
from applying the same filter). We now discuss the details of the neuron connectivities, their
arrangement in space, and their parameter sharing scheme.

Local Connectivity. When dealing with high-dimensional inputs such as images, as we saw above
it is impractical to connect neurons to all neurons in the previous volume. Instead, we will connect
each neuron to only a local region of the input volume. The spatial extent of this connectivity is a
hyperparameter called the receptive field of the neuron (equivalently this is the filter size). The
extent of the connectivity along the depth axis is always equal to the depth of the input volume. It
is important to emphasize again this asymmetry in how we treat the spatial dimensions (width
and height) and the depth dimension: The connections are local in space (along width and
height), but always full along the entire depth of the input volume.

Example 1. For example, suppose that the input volume has size [32x32x3] (e.g. an RGB CIFAR-10
image). If the receptive field (or the filter size) is 5x5, then each neuron in the Conv Layer will have
weights to a [5x5x3] region in the input volume, for a total of 5*5*3 = 75 weights (and +1 bias
parameter). Notice that the extent of the connectivity along the depth axis must be 3, since this is
the depth of the input volume.

Example 2. Suppose an input volume had size [16x16x20]. Then using an example receptive field
size of 3x3, every neuron in the Conv Layer would now have a total of 3*3*20 = 180 connections
to the input volume. Notice that, again, the connectivity is local in space (e.g. 3x3), but full along
the input depth (20).
Left: An example input volume in red (e.g. a 32x32x3 CIFAR-10 image), and an example volume of neurons in
the first Convolutional layer. Each neuron in the convolutional layer is connected only to a local region in the
input volume spatially, but to the full depth (i.e. all color channels). Note, there are multiple neurons (5 in this
example) along the depth, all looking at the same region in the input - see discussion of depth columns in
text below. Right: The neurons from the Neural Network chapter remain unchanged: They still compute a dot
product of their weights with the input followed by a non-linearity, but their connectivity is now restricted to
be local spatially.
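A quick arithmetic check of the two examples (a throwaway helper for illustration, not from the notes):

```python
def weights_per_neuron(F, depth):
    # Weights span F x F spatially but always the full input depth.
    return F * F * depth

print(weights_per_neuron(5, 3))    # Example 1: 75 weights (plus 1 bias parameter)
print(weights_per_neuron(3, 20))   # Example 2: 180 connections to the input volume
```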

Spatial arrangement. We have explained the connectivity of each neuron in the Conv Layer to the
input volume, but we haven't yet discussed how many neurons there are in the output volume or
how they are arranged. Three hyperparameters control the size of the output volume: the depth,
stride and zero-padding. We discuss these next:

1. First, the depth of the output volume is a hyperparameter: it corresponds to the number of
filters we would like to use, each learning to look for something different in the input. For
example, if the first Convolutional Layer takes as input the raw image, then different
neurons along the depth dimension may activate in presence of various oriented edges, or
blobs of color. We will refer to a set of neurons that are all looking at the same region of the
input as a depth column (some people also prefer the term fibre).
2. Second, we must specify the stride with which we slide the filter. When the stride is 1 then
we move the filters one pixel at a time. When the stride is 2 (or, rarely, 3 or more) the filters
jump 2 pixels at a time as we slide them around. This will produce smaller output volumes
spatially.
3. As we will soon see, sometimes it will be convenient to pad the input volume with zeros
around the border. The size of this zero-padding is a hyperparameter. The nice feature of
zero padding is that it will allow us to control the spatial size of the output volumes (most
commonly, as we'll see soon, we will use it to exactly preserve the spatial size of the input
volume so the input and output width and height are the same); see the sketch after this list.
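As a small illustration of point 3, a NumPy sketch of zero padding (the value P = 2 is chosen here so that a 5x5 filter at stride 1 preserves a 32x32 input, per the formula below): only the spatial dimensions are padded, never the depth.

```python
import numpy as np

volume = np.random.rand(32, 32, 3)
P = 2                                               # zero-padding size (a hyperparameter)
padded = np.pad(volume, ((P, P), (P, P), (0, 0)), mode="constant")
print(padded.shape)                                 # (36, 36, 3); a 5x5 filter at stride 1
                                                    # then yields a 32x32 output
```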

We can compute the spatial size of the output volume as a function of the input volume size (W),
the receptive field size of the Conv Layer neurons (F), the stride with which they are applied (S),
and the amount of zero padding used (P) on the border. You can convince yourself that the correct
formula for how many neurons "fit" along a spatial dimension is (W - F + 2P)/S + 1. For example,
for a 7x7 input and a 3x3 filter with stride 1 and pad 0 we would get a 5x5 output.
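A small sketch evaluating this formula for a few illustrative settings (the values are examples chosen here, not from the text):

```python
def conv_output_size(W, F, S, P):
    # Neurons that "fit" along one spatial dimension: (W - F + 2P)/S + 1.
    assert (W - F + 2 * P) % S == 0, "filter does not tile the input cleanly"
    return (W - F + 2 * P) // S + 1

print(conv_output_size(7, 3, 1, 0))    # 5: 3x3 filter on a 7x7 input, stride 1
print(conv_output_size(32, 5, 1, 2))   # 32: padding P=2 preserves the spatial size
print(conv_output_size(7, 3, 2, 0))    # 3: stride 2 produces a smaller output
```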
