Module 4 Merged
DEEP LEARNING
Module 3, Part I
The Convolution Operation
Suppose we are tracking the location of a spaceship with a laser
sensor. Our laser sensor provides a single output x(t), the
position of the spaceship at time t. Both x and t are real-valued,
i.e., we can get a different reading from the laser sensor at any
instant in time.
To obtain a less noisy estimate of the spaceship’s position, we
would like to average together several measurements.
More recent measurements are more relevant, so we want this to be a weighted average that gives more weight to recent measurements. We do this with a weighting function w(a), where a is the age of a measurement.
If we apply such a weighted average operation at every moment, we obtain a new function s providing a smoothed estimate of the position of the spaceship.
This operation is called convolution. The convolution operation is typically denoted with an asterisk:

s(t) = (x ∗ w)(t) = ∫ x(a) w(t − a) da
w needs to be a valid probability density function, or the
output is not a weighted average.
Convolution is defined for any functions for which the above integral is defined, and may be used for other purposes besides taking weighted averages.
In convolutional network terminology, the first argument (in
this example, the function x) to the convolution is often
referred to as the input and the second argument (in this
example, the function w) as the kernel.
The output is sometimes referred to as the feature map.
When we work with data on a computer, time will be discretized, and our sensor will provide data at regular intervals. It might be more realistic to assume that our laser provides a measurement once per second. The time index t can then take on only integer values.
If we now assume that x and w are defined only on integer t, we can define the discrete convolution:

s(t) = (x ∗ w)(t) = Σ_{a = −∞}^{∞} x(a) w(t − a)
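As a minimal sketch of this definition (assuming NumPy; the helper name discrete_convolution is illustrative), the infinite sum reduces to a finite loop over the stored samples:

```python
import numpy as np

def discrete_convolution(x, w):
    """Naive discrete convolution s(t) = sum_a x(a) * w(t - a).

    x and w are 1-D arrays assumed to be zero outside their stored range,
    so the infinite sum reduces to a finite one."""
    n, k = len(x), len(w)
    s = np.zeros(n + k - 1)
    for t in range(len(s)):
        for a in range(n):
            if 0 <= t - a < k:          # only terms where w(t - a) is stored
                s[t] += x[a] * w[t - a]
    return s

x = np.array([1.0, 2.0, 3.0, 4.0])      # noisy sensor readings
w = np.array([0.5, 0.3, 0.2])           # weights favouring recent samples
print(discrete_convolution(x, w))
print(np.convolve(x, w))                # NumPy's built-in gives the same result
```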
In machine learning applications, the input is usually a multidimensional array of data and the kernel is usually a multidimensional array of parameters that are adapted by the learning algorithm. We refer to these multidimensional arrays as tensors.
Because each element of the input and kernel must be explicitly stored separately, we usually assume that these functions are zero everywhere but the finite set of points for which we store the values. In practice, this means we can implement the infinite summation as a summation over a finite number of array elements.
We often use convolutions over more than one axis at a time. For example, if we use a two-dimensional image I as our input, we probably also want to use a two-dimensional kernel K:

S(i, j) = (I ∗ K)(i, j) = Σ_m Σ_n I(m, n) K(i − m, j − n)
The commutative property of convolution arises because
we have flipped the kernel relative to the input, in the
sense that as m increases, the index into the input
increases, but the index into the kernel decreases.
The only reason to flip the kernel is to obtain the commutative
property.
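The following sketch (NumPy; the helper names are hypothetical) computes a small "valid" 2-D convolution by explicitly flipping the kernel, and shows that applying the same sliding window without flipping gives cross-correlation instead:

```python
import numpy as np

def conv2d_valid(I, K):
    """2-D convolution (with kernel flipping), 'valid' positions only."""
    K_flipped = K[::-1, ::-1]                 # flipping gives true convolution
    H, W = I.shape
    kH, kW = K.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(I[i:i + kH, j:j + kW] * K_flipped)
    return out

def cross_correlation_valid(I, K):
    """The same sliding-window operation without flipping the kernel."""
    return conv2d_valid(I, K[::-1, ::-1])

I = np.arange(25, dtype=float).reshape(5, 5)
K = np.array([[1.0, 0.0], [0.0, -1.0]])
print(conv2d_valid(I, K))
print(cross_correlation_valid(I, K))
```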
Motivation
Convolution leverages three important ideas that can help
improve a machine learning system: sparse interactions,
parameter sharing and equivariant representations.
Convolution also provides a means for working with inputs of variable size.
Traditional neural network layers use matrix multiplication by a
matrix of parameters with a separate parameter describing
the interaction between each input unit and each output unit. This means every output unit interacts with every input unit.
Convolutional networks, however, typically have sparse
interactions (also referred to as sparse connectivity or sparse
weights).
This is accomplished by making the kernel smaller than the input.
Figure: When s is formed by convolution with a kernel of width 3, only three outputs are affected by each input x.
Figure: The receptive field of the units in the deeper layers of a
convolutional network is larger than the receptive field of the units in the
shallow layers. This effect increases if the network includes architectural
features like strided convolution or pooling. This means that even
though direct connections in a convolutional net are very sparse, units in
the deeper layers can be indirectly connected to all or most of the input
image.
Parameter sharing refers to using the same parameter for more than one function in a model.
In a traditional neural net, each element of the weight matrix is used exactly once when computing the output of a layer. It is multiplied by one element of the input and then never revisited.
As a synonym for parameter sharing, one can say that a network has tied weights, because the value of the weight applied to one input is tied to the value of a weight applied elsewhere.
In a convolutional neural net, each member of the kernel is
used at every position of the input.
The parameter sharing used by the convolution operation means that rather than learning a separate set of parameters for every location, we learn only one set.
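As an illustration of both sparse interactions and tied weights (a NumPy sketch; conv_as_matrix is a hypothetical helper), a "valid" 1-D cross-correlation can be written as multiplication by a banded matrix in which every row reuses the same k weights:

```python
import numpy as np

def conv_as_matrix(w, n):
    """Build the (n - k + 1) x n matrix implementing a 'valid' 1-D
    cross-correlation with kernel w.  Every row reuses the same k weights
    (parameter sharing) and has only k nonzeros (sparse interactions)."""
    k = len(w)
    M = np.zeros((n - k + 1, n))
    for i in range(n - k + 1):
        M[i, i:i + k] = w
    return M

x = np.array([0.0, 1.0, 4.0, 9.0, 16.0, 25.0])
w = np.array([1.0, -2.0, 1.0])

M = conv_as_matrix(w, len(x))
print(M)                                   # banded matrix built from 3 shared weights
print(M @ x)                               # same as sliding the kernel over x
print(np.correlate(x, w, mode='valid'))    # matches the matrix product
```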
Figure: The black arrows indicate uses of the central element of a 3-element kernel in a convolutional model. Due to parameter sharing, this single parameter is used at all input locations.
Convolution is also equivariant to translation: if g is a function that shifts every pixel of I one unit to the right, then convolving the shifted image gives the same result as shifting the output of convolving the original image.
Finally, some kinds of data cannot be processed by
neural networks defined by matrix multiplication with
a fixed-shape matrix.
Convolution enables processing of some of these kinds
of data.
POOLING
A typical layer of a convolutional network consists of three stages. There are two common sets of terminology for describing such layers.
In the first terminology, the convolutional net is viewed as a small number of relatively complex layers, with each layer having many "stages." In this terminology, there is a one-to-one mapping between kernel tensors and network layers.
In the second terminology, the convolutional net is viewed as a larger number of simple layers; every step of processing is regarded as a layer in its own right. This means that not every "layer" has parameters.
In the first stage, the layer performs several convolutions in parallel to produce a set of linear activations.
In the second stage, each linear activation is run through a
nonlinear activation function, such as the rectified linear
activation function.
This stage is sometimes called the detector stage. In the third stage, we use a pooling function to modify the output of the layer further.
A pooling function replaces the output of the net at a certain location with a summary statistic of the nearby outputs. For example, the max pooling operation reports the maximum output within a rectangular neighborhood. Other popular pooling functions include the average of a rectangular neighborhood.
In all cases, pooling helps to make the representation become approximately invariant to small translations of the input.
Invariance to translation means that
if we translate the input by a small amount, the values of most
of the pooled outputs do not change.
Invariance to local translation can be a very useful property if we care more about whether some feature is present than exactly where it is.
For example, when determining whether an image contains a face, we need not know the location of the eyes with pixel-perfect accuracy; we just need to know that there is an eye on the left side of the face and an eye on the right side of the face.
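A small numerical sketch of this invariance (NumPy; the helper and the example values are illustrative), using a pooling width of three and a stride of one as in the figure below:

```python
import numpy as np

def max_pool_1d(x, width=3):
    """Max pooling with a pooling region of `width` and a stride of one."""
    return np.array([x[i:i + width].max() for i in range(len(x) - width + 1)])

detector = np.array([0.1, 1.0, 0.2, 0.1, 0.0, 0.1])   # nonlinearity outputs
shifted  = np.roll(detector, 1)                       # input translated by one pixel

print(max_pool_1d(detector))   # the exact position of the maximum moved,
print(max_pool_1d(shifted))    # but several pooled outputs stay the same
```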
Figure: A view of the middle of the output of a convolutional layer. The bottom row shows outputs of the nonlinearity. The top row shows the outputs of max pooling, with a stride of one pixel between pooling regions and a pooling region width of three pixels.
Figure: Example of learned invariances. A pooling unit that pools over multiple features that are learned with separate parameters can learn to be invariant to transformations of the input. Here we show how a set of three learned filters and a max pooling unit can learn to become invariant to rotation. All three filters are intended to detect a hand-written 5. Each filter attempts to match a slightly different orientation of the 5. When a 5 appears in the input, the corresponding filter will match it and cause a large activation in a detector unit. The max pooling unit then has a large activation regardless of which detector unit was activated. Max pooling over spatial positions is naturally invariant to translation.
Because pooling summarizes the responses over a whole neighborhood, it is possible to use fewer pooling units than detector units, which improves computational efficiency and reduces the memory requirements for storing the parameters. For many tasks, pooling is also essential for handling inputs of varying size.
CONVOLUTION AND POOLING AS AN INFINITELY STRONG PRIOR
A prior probability distribution is a probability distribution over the parameters of a model that encodes our beliefs about what models are reasonable, before we have seen any data.
Priors can be considered weak or strong depending on how
concentrated the probability density in the prior is.
A weak prior is a prior distribution with high entropy, such as a Gaussian distribution with high variance. Such a prior allows the data to move the parameters more or less freely.
A strong prior has very low entropy, such as a Gaussian
distribution with low variance. Such a prior plays a
more active role in determining where the parameters end up.
An infinitely strong prior places zero probability on some parameters and says that these parameter values are completely forbidden, regardless of how much support the data gives to those values.
We can imagine a convolutional net as being like a fully connected net, but with an infinitely strong prior over its weights. This infinitely strong prior says that the weights for one hidden unit must be identical to the weights of its neighbor, but shifted in space.
The prior also says that the weights must be zero, except for in the small, spatially contiguous receptive field assigned to that hidden unit.
We can think of the use of convolution as introducing an infinitely strong prior probability distribution over the parameters of a layer. This prior says that the function the layer should learn contains only local interactions and is equivariant to translation.
Likewise, the use of pooling is an infinitely strong prior that each unit should be invariant to small translations.
Viewing a convolutional net as a fully connected net with an infinitely strong prior can give us some insights into how convolutional nets work.
Convolution and pooling can cause underfitting
Convolution and pooling are only useful when the
assumptions made by the prior are reasonably
accurate.
Convolutional network architectures are designed to use pooling on some channels but not on other channels, in order to get both highly invariant features and features that will not underfit when the translation invariance prior is incorrect.
Another key insight from this view is that we should
only compare convolutional models to other
convolutional models in benchmarks of statistical
learning performance.
Assume we have a 4-D kernel tensor K with element K_{i,l,m,n} giving the connection strength between a unit in channel i of the output and a unit in channel l of the input, with an offset of m rows and n columns between the output and input units. If the input is a grid of observed data V with element V_{l,j,k} giving the value of channel l at row j and column k, then the output Z produced by convolution without kernel flipping is

Z_{i,j,k} = Σ_{l,m,n} V_{l, j+m−1, k+n−1} K_{i,l,m,n},

where the summation over l, m and n is over all values for which the tensor indexing operations inside the summation are valid.
We may want to skip over some positions of the kernel in order to reduce the computational cost. We can think of this as downsampling the output of the full convolution function. If we want to sample only every s pixels in each direction in the output, then we can define a downsampled convolution function c such that

Z_{i,j,k} = c(K, V, s)_{i,j,k} = Σ_{l,m,n} V_{l, (j−1)×s+m, (k−1)×s+n} K_{i,l,m,n}.

We refer to s as the stride of this downsampled convolution.
Figure: Convolution with a stride. In this example, we use a stride of two. (Top) Convolution with a stride length of two implemented in a single operation. (Bottom) Convolution with a stride greater than one pixel is mathematically equivalent to convolution with unit stride followed by downsampling. Obviously, the two-step approach involving downsampling is computationally wasteful, because it computes many values that are then discarded.
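The equivalence in the figure is easy to check in one dimension (NumPy sketch; the signal, kernel, and stride values are arbitrary):

```python
import numpy as np

x = np.random.default_rng(0).normal(size=16)   # 1-D input signal
w = np.array([0.25, 0.5, 0.25])                # small kernel
s = 2                                          # stride

# Unit-stride ('valid') convolution followed by downsampling ...
full = np.convolve(x, w, mode='valid')
downsampled = full[::s]

# ... is equivalent to computing only every s-th output position directly.
strided = np.array([np.dot(x[i:i + len(w)], w[::-1])
                    for i in range(0, len(x) - len(w) + 1, s)])

print(np.allclose(downsampled, strided))       # True
```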
One essential feature of any convolutional network implementation is the ability to implicitly zero-pad the input V in order to make it wider. Zero padding the input allows us to control the kernel width and the size of the output independently.
There are three special cases of the zero-padding setting. One is the extreme case in which no zero-padding is used whatsoever, and the convolution kernel is only allowed to visit positions where the entire kernel is contained entirely within the image. In this case, all pixels in the output are a function of the same number of pixels in the input, so the behavior of an output pixel is somewhat more regular. (In MATLAB terminology, this is called valid convolution.)
However, the size of the output shrinks at each layer. If the input image has width m and the kernel has width k, the output will be of width m − k + 1. Since the shrinkage is greater than 0, it limits the number of convolutional layers that can be included in the network. As layers are added, the spatial dimension of the network will eventually drop to 1 × 1, at which point additional layers cannot meaningfully be considered convolutional.
Another special case of the zero-padding setting is when just enough zero-padding is added to keep the size of the output equal to the size of the input. (MATLAB calls this same convolution.) In this case, the network can contain as many convolutional layers as needed, since the operation of convolution does not modify the architectural possibilities available to the next layer.
However, the input pixels near the border influence fewer output pixels than the input pixels near the center. This can make the border pixels somewhat underrepresented in the model.
38
Other extreme case, which MATLAB refers to as full
convolution, zeroes are added for every pixel to be
visited k times in each direction, resulting in an
output image of width M + k − 1
The output pixels near the border are a function of
fewer pixels than the output pixels near the
center.
This can make it difficult to learn a single kernel that
performs well at all positions in the convolutional
feature map.
The optimal amount of zero padding (in terms of test
set classification accuracy) lies somewhere between
“valid” and “same” convolution.
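A quick way to verify the output-width formulas for these three regimes is NumPy's one-dimensional np.convolve, whose mode argument corresponds directly to valid, same, and full convolution (a small sketch with arbitrary sizes):

```python
import numpy as np

x = np.ones(10)            # input of width m = 10
w = np.ones(4)             # kernel of width k = 4

print(len(np.convolve(x, w, mode='valid')))   # 7  = m - k + 1 (no padding)
print(len(np.convolve(x, w, mode='same')))    # 10 = m         (output size preserved)
print(len(np.convolve(x, w, mode='full')))    # 13 = m + k - 1 (maximum padding)
```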
In some cases, we do not want to use convolution, but rather a locally connected layer, specified by a 6-D tensor W. The indices into W are respectively: i, the output channel; j, the output row; k, the output column; l, the input channel; m, the row offset within the input; and n, the column offset within the input. The linear part of a locally connected layer is then given by

Z_{i,j,k} = Σ_{l,m,n} V_{l, j+m−1, k+n−1} W_{i,j,k,l,m,n}.
This is sometimes also called unshared convolution, because it
is a similar operation to discrete convolution with a small kernel,
but without sharing parameters across locations
Locally connected layers are useful when we know that each
feature should be a function of a small part of space, but there is
no reason to think that the same feature should occur across all
of space.
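A minimal 1-D sketch of the difference (NumPy; the helper names are hypothetical): the unshared version keeps a separate weight vector for every output position, so the parameter count grows with the number of output locations:

```python
import numpy as np

def conv1d_shared(x, w):
    """'Valid' 1-D cross-correlation: one shared kernel for every position."""
    k = len(w)
    return np.array([np.dot(x[i:i + k], w) for i in range(len(x) - k + 1)])

def conv1d_unshared(x, W):
    """Locally connected ('unshared convolution') layer: W[i] holds a
    separate weight vector for output position i."""
    k = W.shape[1]
    return np.array([np.dot(x[i:i + k], W[i]) for i in range(W.shape[0])])

x = np.arange(8, dtype=float)
k = 3
shared = np.array([1.0, 0.0, -1.0])                 # 3 parameters in total
unshared = np.tile(shared, (len(x) - k + 1, 1))     # 3 * 6 = 18 parameters

print(conv1d_shared(x, shared))
print(conv1d_unshared(x, unshared))   # identical here, but each row of
                                      # `unshared` could be learned independently
```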
Fig: Comparison of local connections, convolution, and full connections. (Top) A locally connected layer with a patch size of two pixels. Each edge is labeled with a unique letter to show that each edge is associated with its own weight parameter. (Center) A convolutional layer with a kernel width of two pixels. This model has exactly the same connectivity as the locally connected layer. (Bottom) A fully connected layer resembles a locally connected layer in the sense that each edge has its own parameter.
Tiled convolution offers a compromise between a convolutional layer and a locally connected layer. Rather than learning a separate set of weights at every spatial location, we learn a set of kernels that we rotate through as we move through space. This means that immediately neighboring locations will have different filters, like in a locally connected layer, but the memory requirements for storing the parameters will increase only by a factor of the size of this set of kernels, rather than the size of the entire output feature map.
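A rough 1-D sketch of this cycling scheme (NumPy; tiled_conv1d is an illustrative helper, not a library function):

```python
import numpy as np

def tiled_conv1d(x, kernels):
    """1-D tiled 'convolution' (cross-correlation): output position i uses
    kernel i mod t, cycling through a small set of t kernels."""
    t, k = kernels.shape
    out_len = len(x) - k + 1
    return np.array([np.dot(x[i:i + k], kernels[i % t]) for i in range(out_len)])

x = np.arange(10, dtype=float)
kernels = np.array([[1.0, 0.0, -1.0],     # t = 2 kernels of width 3:
                    [0.5, 0.5, 0.5]])     # neighboring positions use different
                                          # filters, but only t * k parameters
print(tiled_conv1d(x, kernels))           # are stored
```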
Fig: A locally connected layer has no sharing at all.
STRUCTURED OUTPUTS
Convolutional networks can output a high-dimensional structured object, typically a tensor emitted by a standard convolutional layer. For example, the model might emit a tensor S, where S_{i,j,k} is the probability that pixel (j, k) of the input to the network belongs to class i. This allows the model to label every pixel in an image and draw precise masks that follow the outlines of individual objects.
When the network is used for classification of a single object in an image, the greatest reduction in the spatial dimensions of the network comes from using pooling layers with large stride.
One strategy for pixel-wise labeling of images is to produce an initial guess of the image labels, then refine this initial guess using the interactions between neighboring pixels.
Once a prediction for each pixel is made, various methods can be used to further process these predictions in order to obtain a segmentation of the image into regions. The general idea is to assume that large groups of contiguous pixels tend to be associated with the same label.
DATA TYPES
The data used with a convolutional network usually consists
of several channels, each channel being the observation of a
different quantity at some point in space or time.
One advantage of convolutional networks is that they can also process inputs with varying spatial extents. These kinds of input simply cannot be represented by traditional, matrix-multiplication-based neural networks.
EFFICIENT CONVOLUTION ALGORITHMS
Convolution is equivalent to converting both the input and the kernel to the frequency domain using a Fourier transform, performing point-wise multiplication of the two signals, and converting back to the time domain using an inverse Fourier transform.
When a d-dimensional kernel of width w is separable, naive multidimensional convolution requires O(w^d) runtime and parameter storage space, whereas composing d one-dimensional convolutions requires only O(w × d) runtime and parameter storage space.
Four common approaches to computing convolution are: (i) the naïve approach, (ii) convolution with a separable kernel, (iii) recursive filtering, and (iv) convolution in the frequency domain.
Naïve approach
• The basic (or naïve) approach visits the individual time samples n in the input signal f.
• In each position, it computes the inner product of the current sample neighbourhood and the flipped kernel g, where the size of the neighbourhood is practically equal to the size of the convolution kernel.
Basically, convolution is a memory-bound problem, i.e. the ratio between the arithmetic operations and memory accesses is low. Adjacent threads process adjacent signal samples, including the common neighbourhood.
Separable convolution
The convolution with these kernels can be simply decomposed into several lower-dimensional convolutions. A separable convolution kernel must fulfil the condition that its matrix has rank equal to one.
Let us construct such a kernel. Given one row vector u = (u1, u2, u3, ..., um) and one column vector v^T = (v1, v2, v3, ..., vn), let us convolve them together: the result is the n × m separable kernel K with entries K_{ij} = v_i u_j, i.e. the outer product of v and u.
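A small numerical check of this decomposition (assuming NumPy and SciPy are available): convolving with the rank-1 kernel K directly gives the same result as one 1-D pass per axis, at a lower cost per output pixel:

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
I = rng.normal(size=(6, 7))            # 2-D input "image"
u = np.array([1.0, 2.0, 1.0])          # row vector (horizontal 1-D kernel)
v = np.array([1.0, 0.0, -1.0])         # column vector (vertical 1-D kernel)

K = np.outer(v, u)                     # rank-1 (separable) 2-D kernel

# Direct 2-D convolution with the full kernel ...
direct = convolve2d(I, K, mode='full')

# ... versus one 1-D pass along each axis.
rows_done = np.apply_along_axis(lambda r: np.convolve(r, u, mode='full'), 1, I)
separable = np.apply_along_axis(lambda c: np.convolve(c, v, mode='full'), 0, rows_done)

print(np.allclose(direct, separable))  # True
```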
Recursive filters on various architectures
Streaming architectures
The data can be processed in a stream, keeping the memory requirements at a minimum level. This allows moving the computation to relatively small and cheap embedded systems. Recursive filters are thus used in various real-time applications such as edge detection, video filtering, and optical flow.
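For illustration (a sketch, not from the source), the simplest recursive filter is a first-order exponential smoother: because each output reuses the previous output, a long effective smoothing window costs only constant work and memory per sample, which is what makes streaming implementations cheap:

```python
import numpy as np

def exponential_smoother(x, alpha=0.2):
    """First-order recursive (IIR) filter: y[n] = alpha * x[n] + (1 - alpha) * y[n-1].
    Only the previous output needs to be kept in memory, so the signal can be
    processed as a stream."""
    y = np.empty(len(x), dtype=float)
    prev = 0.0
    for n, sample in enumerate(x):
        prev = alpha * sample + (1.0 - alpha) * prev
        y[n] = prev
    return y

noisy = np.sin(np.linspace(0, 6, 50)) + 0.3 * np.random.default_rng(1).normal(size=50)
print(exponential_smoother(noisy)[:5])
```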
Parallel architectures.
As for the parallel architectures, Robelley et al. presented a
mathematical formulation for computing time-invariant
recursive filters on general SIMD DSP architectures.
CONVOLUTION IN THE FREQUENCY DOMAIN
(Referred to as the fast convolution.)
The fast convolution can be even more efficient than the
separable version if the number of kernel samples is large
enough.
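A minimal sketch of fast convolution (NumPy FFT; the array sizes are arbitrary): zero-pad both signals to the full output length, multiply their spectra point-wise, and invert the transform:

```python
import numpy as np

def fft_convolve(x, w):
    """Linear convolution via the FFT: pad to the full output length,
    multiply the spectra point-wise, and transform back."""
    n = len(x) + len(w) - 1
    X = np.fft.rfft(x, n)
    W = np.fft.rfft(w, n)
    return np.fft.irfft(X * W, n)

x = np.random.default_rng(0).normal(size=1000)
w = np.random.default_rng(1).normal(size=128)

print(np.allclose(fft_convolve(x, w), np.convolve(x, w)))   # True
```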