
UNIT II CONVOLUTIONAL NEURAL NETWORKS

Convolution Operation - Sparse Interactions - Parameter Sharing - Equivariance - Pooling -
Convolution Variants: Strided - Tiled - Transposed and dilated convolutions;
CNN Learning: Nonlinearity Functions - Loss Functions - Regularization - Optimizers - Gradient Computation.
CONVOLUTIONAL NEURAL NETWORKS
• Convolutional networks (LeCun, 1989), also known as convolutional neural
networks or CNNs, are a specialized kind of neural network for processing data
that has a known, grid-like topology.
• The name “convolutional neural network” indicates that the network employs a
mathematical operation called convolution.
• Convolution is a specialized kind of linear operation.
• Convolutional networks are simply neural networks that use convolution in place
of general matrix multiplication in at least one of their layers.
The Convolution Operation
• Convolution is an operation on two functions of a real-valued argument.
• Suppose we are tracking the location of a spaceship with a laser sensor. Our laser sensor provides a single
output x(t), the position of the spaceship at time t. Both x and t are real-valued, i.e., we can get a different
reading from the laser sensor at any instant in time.
• Now suppose that our laser sensor is somewhat noisy.
• To obtain a less noisy estimate of the spaceship’s position, we would like to average together several
measurements.
• Of course, more recent measurements are more relevant, so we will want this to be a weighted average that
gives more weight to recent measurements.
• We can do this with a weighting function w(a), where a is the age of a measurement.
• If we apply such a weighted average operation at every moment, we obtain a new function s providing a
smoothed estimate of the position of the spaceship:

s(t) = ∫ x(a) w(t − a) da
• This operation is called convolution. The convolution operation is typically denoted with an asterisk:

s(t) = (x * w)(t)
• w needs to be a valid probability density function, or the output is not a weighted average. Also, w needs to
be 0 for all negative arguments.
• In general, convolution is defined for any functions for which the above integral is defined, and may
be used for other purposes besides taking weighted averages.
• In convolutional network terminology, the first argument (in this example, the function x) to the
convolution is often referred to as the input and the second argument (in this example, the function w)
as the kernel. The output is sometimes referred to as the feature map.
• In our example, the idea of a laser sensor that can provide measurements at every instant in time is not
realistic. Usually, when we work with data on a computer, time will be discretized, and our sensor will
provide data at regular intervals.
• In our example, it might be more realistic to assume that our laser provides a measurement once per
second. The time index t can then take on only integer values. If we now assume that x and w are
defined only on integer t, we can define the discrete convolution:

s(t) = (x * w)(t) = Σa x(a) w(t − a),   where the sum runs over all integer values of a
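• As an illustration, a minimal NumPy sketch of this discrete convolution (the array names and the finite measurement window are assumptions for the example; np.convolve computes the same sum):

    import numpy as np

    def discrete_convolution(x, w):
        # s[t] = sum over a of x[a] * w[t - a], for the finite arrays x and w
        s = np.zeros(len(x) + len(w) - 1)
        for t in range(len(s)):
            for a in range(len(x)):
                if 0 <= t - a < len(w):
                    s[t] += x[a] * w[t - a]
        return s

    x = np.array([0.0, 1.0, 2.0, 3.0])   # position readings
    w = np.array([0.5, 0.3, 0.2])        # weights favouring recent measurements
    print(discrete_convolution(x, w))
    print(np.convolve(x, w))             # identical result from NumPy's built-in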

• Finally, we often use convolutions over more than one axis at a time. For example, if we use a two-
dimensional image I as our input, we probably also want to use a two-dimensional kernel K:

S(i, j) = (I * K)(i, j) = Σm Σn I(m, n) K(i − m, j − n)

• Convolution is commutative, so we can equivalently write:

S(i, j) = (K * I)(i, j) = Σm Σn I(i − m, j − n) K(m, n)
• The commutative property of convolution arises because we have flipped the kernel relative to the
input, in the sense that as m increases, the index into the input increases, but the index into the
kernel decreases. The only reason to flip the kernel is to obtain the commutative property.
• Instead, many neural network libraries implement a related function called the cross-correlation,
which is the same as convolution but without flipping the kernel:

S(i, j) = Σm Σn I(i + m, j + n) K(m, n)
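• A small NumPy sketch (the array names I and K and the restriction to 'valid' positions are assumptions) showing that the "convolution" most libraries implement is this cross-correlation, and that true convolution is recovered by flipping the kernel:

    import numpy as np

    def cross_correlate2d(I, K):
        # S[i, j] = sum over m, n of I[i + m, j + n] * K[m, n], over valid positions only
        kh, kw = K.shape
        out_h, out_w = I.shape[0] - kh + 1, I.shape[1] - kw + 1
        S = np.zeros((out_h, out_w))
        for i in range(out_h):
            for j in range(out_w):
                S[i, j] = np.sum(I[i:i + kh, j:j + kw] * K)
        return S

    I = np.arange(25.0).reshape(5, 5)
    K = np.array([[1.0, 0.0], [0.0, -1.0]])

    corr = cross_correlate2d(I, K)               # what most deep learning libraries call "convolution"
    conv = cross_correlate2d(I, K[::-1, ::-1])   # true convolution: cross-correlation with the flipped kernel
    print(corr[0, 0], conv[0, 0])                # -6.0 and 6.0: the two operations differ unless the kernel is symmetric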

• Discrete convolution can be viewed as multiplication by a matrix. However, the matrix has several
entries constrained to be equal to other entries. For example, for univariate discrete convolution,
each row of the matrix is constrained to be equal to the row above shifted by one element. This is
known as a Toeplitz matrix. In two dimensions, a doubly block circulant matrix corresponds to
convolution.
• In addition to these constraints that several elements be equal to each other, convolution usually
corresponds to a very sparse matrix (a matrix whose entries are mostly equal to zero).
• This is because the kernel is usually much smaller than the input image.
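• A sketch of this view in NumPy (the vectors below are made-up): a 1-D 'valid' convolution written as multiplication by a sparse matrix whose rows are shifted copies of the flipped kernel, i.e. a Toeplitz matrix.

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    w = np.array([0.5, 0.3, 0.2])

    out_len = len(x) - len(w) + 1
    T = np.zeros((out_len, len(x)))          # mostly zeros: the kernel is much smaller than the input
    for r in range(out_len):
        T[r, r:r + len(w)] = w[::-1]         # each row equals the row above shifted by one element

    print(T @ x)
    print(np.convolve(x, w, mode='valid'))   # the matrix product matches the convolution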
Motivation
• Convolution leverages three important ideas that can help improve a machine learning system: sparse interactions,
parameter sharing and equivariant representations. Moreover, convolution provides a means for working with
inputs of variable size.
Sparse interactions
• Neural network layers use matrix multiplication by a matrix of parameters with a separate parameter describing the
interaction between each input unit and each output unit. This means every output unit interacts with every input unit.
Convolutional networks, however, typically have sparse interactions (also referred to as sparse connectivity or
sparse weights). This is accomplished by making the kernel smaller than the input.
• For example, when processing an image, the input image might have thousands or millions of pixels, but we can
detect small, meaningful features such as edges with kernels that occupy only tens or hundreds of pixels. This means
that we need to store fewer parameters, which both reduces the memory requirements of the model and improves its
statistical efficiency. It also means that computing the output requires fewer operations. These improvements in
efficiency are usually quite large.
• If there are m inputs and n outputs, then matrix multiplication requires m×n parameters and the algorithms used in
practice have O(m × n) runtime (per example). If we limit the number of connections each output may have to k, then
the sparsely connected approach requires only k × n parameters and O(k × n) runtime. For many practical
applications, it is possible to obtain good performance on the machine learning task while keeping k several orders of
magnitude smaller than m. For graphical demonstrations of sparse connectivity, see figure 9.2 and figure 9.3. In a
deep convolutional network, units in the deeper layers may indirectly interact with a larger portion of the input, as
shown in figure 9.4.
• This allows the network to efficiently describe complicated interactions between many variables by constructing such
interactions from simple building blocks that each describe only sparse interactions.
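• As a quick numeric illustration of these counts (the sizes below are made-up examples, not from the text):

    m, n, k = 1_000_000, 1_000_000, 9    # inputs, outputs, connections per output (e.g. a 3x3 kernel)

    dense_params = m * n                 # fully connected layer: O(m x n) parameters and runtime
    sparse_params = k * n                # sparsely connected layer: O(k x n)

    print(dense_params, sparse_params)   # 1000000000000 vs 9000000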
Parameter sharing
• Parameter sharing refers to using the same parameter for more than one function in a model.
• In a traditional neural net, each element of the weight matrix is used exactly once when computing
the output of a layer. It is multiplied by one element of the input and then never revisited.
• As a synonym for parameter sharing, one can say that a network has tied weights, because the value
of the weight applied to one input is tied to the value of a weight applied elsewhere.
• In a convolutional neural net, each member of the kernel is used at every position of the input
(except perhaps some of the boundary pixels, depending on the design decisions regarding the
boundary).
• The parameter sharing used by the convolution operation means that rather than learning a separate
set of parameters for every location, we learn only one set. This does not affect the runtime of
forward propagation—it is still O(k × n)—but it does further reduce the storage requirements of the
model to k parameters.
• Recall that k is usually several orders of magnitude less than m. Since m and n are usually roughly
the same size, k is practically insignificant compared to m×n.
• Convolution is thus dramatically more efficient than dense matrix multiplication in terms of the
memory requirements and statistical efficiency. For a graphical depiction of how parameter sharing
works, see figure 9.5.
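• A minimal NumPy sketch of parameter sharing (the signal and kernel are made-up): one 3-weight kernel is reused at every output position, whereas a dense layer with the same input and output sizes would store roughly a million weights.

    import numpy as np

    x = np.random.randn(1_000)                 # 1,000 input samples
    kernel = np.array([0.25, 0.5, 0.25])       # k = 3 shared parameters

    y = np.convolve(x, kernel, mode='valid')   # the same 3 weights are applied at all 998 output positions
    dense_weights = x.size * y.size            # a dense layer from x to y would store 998,000 weights

    print(kernel.size, dense_weights)          # 3 vs 998000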
Equivariance
• In the case of convolution, the particular form of parameter sharing causes the layer to have a property called
equivariance to translation. To say a function is equivariant means that if the input changes, the output changes in
the same way.
• Specifically, a function f(x) is equivariant to a function g if f(g(x)) = g(f(x)).
• In the case of convolution, if we let g be any function that translates the input, i.e., shifts it, then the convolution
function is equivariant to g.
• For example, let I be a function giving image brightness at integer coordinates. Let g be a function mapping one
image function to another image function, such that I' = g(I) is the image function with I'(x, y) = I(x − 1, y). This
shifts every pixel of I one unit to the right.
• If we apply this transformation to I, then apply convolution, the result will be the same as if we applied convolution
to I, then applied the transformation g to the output. When processing time series data, this means that convolution
produces a sort of timeline that shows when different features appear in the input.
• If we move an event later in time in the input, the exact same representation of it will appear in the output, just later
in time.
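• A small NumPy check of this property for a 1-D signal (the names, and the use of circular convolution so the shift is exact at the borders, are assumptions for the sketch):

    import numpy as np

    def circular_conv(x, w):
        # circular convolution via the FFT, so shifting wraps cleanly at the borders
        return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(w, len(x))))

    x = np.random.randn(20)                  # input signal
    w = np.array([0.2, 0.5, 0.3])            # kernel
    shift = lambda v: np.roll(v, 3)          # g: move the signal 3 steps later in time

    a = circular_conv(shift(x), w)           # shift first, then convolve
    b = shift(circular_conv(x, w))           # convolve first, then shift the feature map
    print(np.allclose(a, b))                 # True: the feature moves by the same amount in the output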
• Similarly with images, convolution creates a 2-D map of where certain features appear in the input. If we move the
object in the input, its representation will move the same amount in the output. This is useful for when we know that
some function of a small number of neighboring pixels is useful when applied to multiple input locations.
• For example, when processing images, it is useful to detect edges in the first layer of a convolutional network. The
same edges appear more or less everywhere in the image, so it is practical to share parameters across the entire
image.
• In some cases, we may not wish to share parameters across the entire image.
• For example, if we are processing images that are cropped to be centered on an individual’s face, we probably want
to extract different features at different locations—the part of the network processing the top of the face needs to
look for eyebrows, while the part of the network processing the bottom of the face needs to look for a chin.
Pooling
• A typical layer of a convolutional network consists of three stages (see figure 9.7).
• In the first stage, the layer performs several convolutions in parallel to produce a set of linear activations. In the second
stage, each linear activation is run through a nonlinear activation function, such as the rectified linear activation function.
• This stage is sometimes called the detector stage. In the third stage, we use a pooling function to modify the output of the
layer further.
• A pooling function replaces the output of the net at a certain location with a summary statistic of the nearby outputs.
• For example, the max pooling operation reports the maximum output within a rectangular neighborhood. Other popular
pooling functions include the average of a rectangular neighborhood, the L2 norm of a rectangular neighborhood, or a
weighted average based on the distance from the central pixel.
• In all cases, pooling helps make the representation approximately invariant to small translations of the input.
• Invariance to local translation can be a very useful property if we care more about whether some feature is present than
exactly where it is.
• For example, when determining whether an image contains a face, we need not know the location of the eyes with pixel-
perfect accuracy; we just need to know that there is an eye on the left side of the face and an eye on the right side of the
face.
• In other contexts, it is more important to preserve the location of a feature. For example, if we want to find a corner defined
by two edges meeting at a specific orientation, we need to preserve the location of the edges well enough to test whether
they meet.
• The use of pooling can be viewed as adding an infinitely strong prior that the function the layer learns must be invariant to
small translations. When this assumption is correct, it can greatly improve the statistical efficiency of the network.
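• A minimal 1-D max-pooling sketch (the detector responses below are made-up) illustrating the approximate invariance: after shifting every input by one step, several of the pooled values stay the same.

    import numpy as np

    def max_pool1d(x, width=3, stride=1):
        # summarise each neighbourhood by its maximum
        return np.array([x[i:i + width].max() for i in range(0, len(x) - width + 1, stride)])

    x = np.array([0., 1., 0., 0., 2., 0., 0., 3., 0.])   # detector-stage outputs
    x_shifted = np.roll(x, 1)                            # every response moved one step to the right

    print(max_pool1d(x))          # [1. 1. 2. 2. 2. 3. 3.]
    print(max_pool1d(x_shifted))  # [1. 1. 1. 2. 2. 2. 3.]  -- many entries unchanged despite the shift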
Convolution Variants: Strided - Tiled
• Assume we have a 4-D kernel tensor K with element Ki,j,k,l giving the connection strength between a unit in channel i of the
output and a unit in channel j of the input, with an offset of k rows and l columns between the output unit and the input unit.
• Assume our input consists of observed data V with element Vi,j,k giving the value of the input unit within channel i at row j and
column k .
• Assume our output consists of Z with the same format as V. If Z is produced by convolving K across V without flipping K, then

Zi,j,k = Σl,m,n Vl,j+m−1,k+n−1 Ki,l,m,n

• where the summation over l, m and n is over all values for which the tensor indexing operations inside the summation are valid. In
linear algebra notation, we index into arrays using a 1 for the first entry. This necessitates the −1 in the above formula.
• Programming languages such as C and Python index starting from 0, rendering the above expression even simpler.
• We may want to skip over some positions of the kernel in order to reduce the computational cost (at the expense of not extracting
our features as finely). We can think of this as downsampling the output of the full convolution function.
• If we want to sample only every s pixels in each direction in the output, then we can define a downsampled convolution function
c such that

Zi,j,k = c(K, V, s)i,j,k = Σl,m,n [Vl,(j−1)×s+m,(k−1)×s+n Ki,l,m,n]
• We refer to s as the stride of this downsampled convolution. It is also possible to define a separate stride for each direction of
motion. See figure 9.12 for an illustration.
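• A sketch (array names and sizes assumed) verifying that a stride-2 convolution equals computing the full convolution and then keeping every 2nd output position:

    import numpy as np

    def conv2d_valid(V, K, stride=1):
        # 'valid' cross-correlation, as implemented in most libraries
        kh, kw = K.shape
        rows = range(0, V.shape[0] - kh + 1, stride)
        cols = range(0, V.shape[1] - kw + 1, stride)
        return np.array([[np.sum(V[i:i + kh, j:j + kw] * K) for j in cols] for i in rows])

    V = np.random.randn(7, 7)
    K = np.random.randn(3, 3)

    strided = conv2d_valid(V, K, stride=2)                  # downsampled convolution c(K, V, 2)
    downsampled = conv2d_valid(V, K, stride=1)[::2, ::2]    # full output, then sample every 2 pixels
    print(np.allclose(strided, downsampled))                # True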
• One essential feature of any convolutional network implementation is the ability to implicitly zero-pad the input V in
order to make it wider.
• Without this feature, the width of the representation shrinks by one pixel less than the kernel width at each layer.
• Zero padding the input allows us to control the kernel width and the size of the output independently.
• Without zero padding, we are forced to choose between shrinking the spatial extent of the network rapidly and using small
kernels—both scenarios that significantly limit the expressive power of the network. See figure 9.13 for an example.
• Three special cases of the zero-padding setting are worth mentioning. One is the extreme case in which no zero-padding is
used whatsoever, and the convolution kernel is only allowed to visit positions where the kernel is contained entirely
within the image. In MATLAB terminology, this is called valid convolution.
• Another special case of the zero-padding setting is when just enough zero-padding is added to keep the size of the output
equal to the size of the input. MATLAB calls this same convolution.
• In this case, the network can contain as many convolutional layers as the available hardware can support, since the
operation of convolution does not modify the architectural possibilities available to the next layer.
• However, the input pixels near the border influence fewer output pixels than the input pixels near the center. This can
make the border pixels somewhat underrepresented in the model. This motivates the other extreme case, which MATLAB
refers to as full convolution, in which enough zeroes are added for every pixel to be visited k times in each direction,
resulting in an output image of width m + k − 1. In this case, the output pixels near the border are a function of fewer
pixels than the output pixels near the center.
• This can make it difficult to learn a single kernel that performs well at all positions in the convolutional feature map.
Usually the optimal amount of zero padding (in terms of test set classification accuracy) lies somewhere between “valid”
and “same” convolution.
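• NumPy's np.convolve exposes the same three cases as its 'valid', 'same' and 'full' modes; a quick sketch of the resulting output widths for an input of width m and a kernel of width k:

    import numpy as np

    m, k = 10, 3
    x = np.random.randn(m)
    w = np.random.randn(k)

    print(len(np.convolve(x, w, mode='valid')))  # m - k + 1 = 8   (no zero padding at all)
    print(len(np.convolve(x, w, mode='same')))   # m         = 10  (output kept the same size as the input)
    print(len(np.convolve(x, w, mode='full')))   # m + k - 1 = 12  (every input visited k times)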
• In some cases, we do not actually want to use convolution, but rather locally connected layers (LeCun, 1986, 1989). In
this case, the adjacency matrix in the graph of our MLP is the same, but every connection has its own weight, specified
by a 6-D tensor W.
• The indices into W are respectively: i, the output channel, j, the output row, k, the output column, l, the input channel,
m, the row offset within the input, and n, the column offset within the input. The linear part of a locally connected layer
is then given by

Zi,j,k = Σl,m,n [Vl,j+m−1,k+n−1 wi,j,k,l,m,n]
• This is sometimes also called unshared convolution, because it is a similar operation to discrete convolution with a
small kernel, but without sharing parameters across locations. Figure 9.14 compares local connections, convolution, and
full connections.
• Tiled convolution (Gregor and LeCun, 2010a; Le et al., 2010) offers a compromise between a convolutional layer and
a locally connected layer. Rather than learning a separate set of weights at every spatial location, we learn a set of
kernels that we rotate through as we move through space. This means that immediately neighboring locations will have
different filters, like in a locally connected layer, but the memory requirements for storing the parameters will increase
only by a factor of the size of this set of kernels, rather than the size of the entire output feature map. See figure 9.16 for
a comparison of locally connected layers, tiled convolution, and standard convolution.
• To define tiled convolution algebraically, let K be a 6-D tensor, where two of the dimensions correspond to different
locations in the output map. Rather than having a separate index for each location in the output map, output locations
cycle through a set of t different choices of kernel stack in each direction. If t is equal to the output width, this is the
same as a locally connected layer.

Zi,j,k = Σl,m,n Vl,j+m−1,k+n−1 Ki,l,m,n,j%t+1,k%t+1
• where % is the modulo operation, with t%t = 0, (t + 1)%t = 1, etc. It is straightforward to generalize this equation to use
a different tiling range for each dimension.
• Both locally connected layers and tiled convolutional layers have an interesting interaction with max-pooling: the
detector units of these layers are driven by different filters. If these filters learn to detect different transformed versions
of the same underlying features, then the max-pooled units become invariant to the learned transformation.
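• A 1-D sketch (all names and sizes assumed) contrasting the three weight-sharing schemes: standard convolution reuses one kernel everywhere, tiled convolution cycles through t kernels, and a locally connected (unshared) layer has its own weights at every output position.

    import numpy as np

    x = np.random.randn(8)
    k, t = 3, 2                                  # kernel width and tiling range
    out_len = len(x) - k + 1

    K_conv = np.random.randn(k)                  # standard convolution: one kernel, 3 parameters
    K_tiled = np.random.randn(t, k)              # tiled convolution: t kernels cycled through, 6 parameters
    W_local = np.random.randn(out_len, k)        # locally connected: one kernel per position, 18 parameters

    z_conv = np.array([x[j:j + k] @ K_conv for j in range(out_len)])
    z_tiled = np.array([x[j:j + k] @ K_tiled[j % t] for j in range(out_len)])
    z_local = np.array([x[j:j + k] @ W_local[j] for j in range(out_len)])

    print(K_conv.size, K_tiled.size, W_local.size)   # 3, 6, 18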
Transposed convolutions
• Some sources use the name deconvolution, which is inappropriate because it’s not a deconvolution. To make things
worse, deconvolutions do exist, but they’re not common in the field of deep learning. An actual deconvolution reverts
the process of a convolution. Imagine inputting an image into a single convolutional layer. Now take the output, throw
it into a black box and out comes your original image again. This black box does a deconvolution. It is the mathematical
inverse of what a convolutional layer does.
• A transposed convolution is somewhat similar because it produces the same spatial resolution a hypothetical
deconvolutional layer would. However, the actual mathematical operation that’s being performed on the values is
different. A transposed convolutional layer carries out a regular convolution but reverts its spatial transformation.

• An image of 5x5 is fed into a convolutional layer. The stride is set to 2, the padding is deactivated and the kernel is 3x3.
This results in a 2x2 image.
• If we wanted to reverse this process, we’d need the inverse mathematical operation so that 9 values are generated from
each pixel we input. Afterward, we traverse the output image with a stride of 2. This would be a deconvolution.
• A transposed convolution does not do that. The only thing it has in common with a deconvolution is that it guarantees
that the output will be a 5x5 image as well, while still performing a normal convolution operation. To achieve this, we
need to perform some fancy padding on the input.
• As you can imagine now, this step will not reverse the process from above. At least not concerning the numeric values.
• It merely reconstructs the spatial resolution from before and performs a convolution. This may not be the mathematical
inverse, but for Encoder-Decoder architectures, it’s still very helpful. This way we can combine the upscaling of an
image with a convolution, instead of doing two separate processes.
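• A NumPy sketch of the example above (all function and variable names are assumptions): a 5x5 input is reduced to 2x2 by a stride-2, 3x3 convolution without padding; the transposed convolution is then implemented as zero-insertion between the entries, zero-padding by the kernel size minus one, and an ordinary stride-1 convolution with the flipped kernel. It restores only the 5x5 spatial size, not the original values.

    import numpy as np

    def conv2d(V, K, stride=1):
        kh, kw = K.shape
        out_h = (V.shape[0] - kh) // stride + 1
        out_w = (V.shape[1] - kw) // stride + 1
        Z = np.zeros((out_h, out_w))
        for i in range(out_h):
            for j in range(out_w):
                Z[i, j] = np.sum(V[i * stride:i * stride + kh, j * stride:j * stride + kw] * K)
        return Z

    def transposed_conv2d(Y, K, stride=2):
        # insert (stride - 1) zeros between entries, pad by (kernel size - 1), then convolve with stride 1
        kh, kw = K.shape
        dil = np.zeros(((Y.shape[0] - 1) * stride + 1, (Y.shape[1] - 1) * stride + 1))
        dil[::stride, ::stride] = Y
        padded = np.pad(dil, ((kh - 1, kh - 1), (kw - 1, kw - 1)))
        return conv2d(padded, K[::-1, ::-1])    # flipped kernel: the gradient/transpose of the forward conv

    x = np.random.randn(5, 5)
    K = np.random.randn(3, 3)
    y = conv2d(x, K, stride=2)                  # 5x5 -> 2x2
    x_up = transposed_conv2d(y, K, stride=2)    # 2x2 -> 5x5: same shape as x, not the same values
    print(y.shape, x_up.shape)                  # (2, 2) (5, 5)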
Dilated convolutions
• Dilated convolutions introduce another parameter to convolutional layers called the dilation rate.
This defines a spacing between the values in a kernel. A 3x3 kernel with a dilation rate of 2 will
have the same field of view as a 5x5 kernel, while only using 9 parameters. Imagine taking a 5x5
kernel and deleting every second column and row.
• This delivers a wider field of view at the same computational cost. Dilated convolutions are
particularly popular in the field of real-time segmentation. Use them if you need a wide field of
view and cannot afford multiple convolutions or larger kernels.
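• A small sketch (names assumed) of the kernel-dilation view described above: spreading a 3x3 kernel’s entries apart with a dilation rate of 2 yields a 5x5 footprint while keeping only 9 non-zero weights.

    import numpy as np

    def dilate_kernel(K, rate):
        # spread the kernel entries apart, filling the gaps with zeros
        kh, kw = K.shape
        D = np.zeros(((kh - 1) * rate + 1, (kw - 1) * rate + 1))
        D[::rate, ::rate] = K
        return D

    K = np.arange(1.0, 10.0).reshape(3, 3)   # a 3x3 kernel: 9 parameters
    D = dilate_kernel(K, rate=2)

    print(D.shape)                           # (5, 5): the field of view of a 5x5 kernel
    print(int(np.count_nonzero(D)))          # 9: still only 9 weights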
