
CST414

DEEP LEARNING
Module-3 PART-I

SYLLABUS

 Module-3 (Convolutional Neural Network)


 Convolutional Neural Networks – convolution operation,
motivation, pooling, Convolution and Pooling as an infinitely
strong prior, variants of convolution functions, structured
outputs, data types, efficient convolution algorithms
CONVOLUTIONAL NEURAL NETWORK
 The name “convolutional neural network” indicates that the
network employs a mathematical operation called convolution.
Convolution is a specialized kind of linear operation.
 Convolutional networks are simply neural networks that use
convolution in place of general matrix multiplication in at least one
of their layers.

 The Convolution Operation
 Suppose we are tracking the location of a spaceship with a laser
sensor. Our laser sensor provides a single output x(t), the
position of the spaceship at time t. Both x and t are real-valued,
i.e., we can get a different reading from the laser sensor at any
instant in time.
 To obtain a less noisy estimate of the spaceship’s position, we
would like to average together several measurements.
 More recent measurements are more relevant, so we want this to be
a weighted average that gives more weight to recent measurements,
using a weighting function w(a), where a is the age of a measurement.
 If we apply such a weighted average operation at every moment, we
obtain a new function s providing a smoothed estimate of the
position of the spaceship.
 This operation is called convolution. The convolution operation is
typically denoted with an asterisk:
 s(t) = (x ∗ w)(t), where s(t) = ∫ x(a) w(t − a) da
 w needs to be a valid probability density function, or the output is
not a weighted average.
 Convolution is defined for any functions for which the above
integral is defined, and may be used for other purposes besides
taking weighted averages.
 In convolutional network terminology, the first argument (in
this example, the function x) to the convolution is often
referred to as the input and the second argument (in this
example, the function w) as the kernel.
 The output is sometimes referred to as the feature map.
 When we work with data on a computer, time will be discretized,
and our sensor will provide data at regular intervals.
 It might be more realistic to assume that our laser provides a
measurement once per second. The time index t can then take on
only integer values.
 If we now assume that x and w are defined only on integer t, we
can define the discrete convolution:
 s(t) = (x ∗ w)(t) = Σ_a x(a) w(t − a)
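As a concrete illustration of the discrete convolution above, here is a minimal NumPy sketch of smoothing a noisy 1-D measurement sequence with a weighting function; the toy signal, the exponential weights, and the use of np.convolve are illustrative assumptions, not material from the slides.

```python
import numpy as np

# Noisy 1-D measurements x(t), sampled once per time step (toy data).
rng = np.random.default_rng(0)
true_position = np.linspace(0.0, 10.0, 50)
x = true_position + 0.5 * rng.standard_normal(50)

# Weighting function w(a): more weight on recent measurements (small age a),
# normalized so the weights form a valid weighted average.
w = np.exp(-np.arange(5) / 2.0)
w /= w.sum()

# Discrete convolution s(t) = sum_a x(t - a) w(a): a smoothed estimate.
s = np.convolve(x, w, mode="valid")
print(s[:5])
```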
 In machine learning applications, the input is usually a
multidimensional array of data and the kernel is usually a
multidimensional array of parameters that are adapted by the
learning algorithm.
 We refer to these multidimensional arrays as tensors.
 Because each element of the input and kernel must be explicitly
stored separately, we usually assume that these functions are zero
everywhere but the finite set of points for which we store the values.
 We can implement the infinite summation as a summation over a
finite number of array elements.
 We often use convolutions over more than one axis at a time.
 For example, if we use a two-dimensional image I as our input, we
probably also want to use a two-dimensional kernel K:
 S(i, j) = (I ∗ K)(i, j) = Σ_m Σ_n I(m, n) K(i − m, j − n)

 The commutative property of convolution arises because
we have flipped the kernel relative to the input, in the
sense that as m increases, the index into the input
increases, but the index into the kernel decreases.
 The only reason to flip the kernel is to obtain the commutative
property.
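The flip-based commutativity can be checked numerically. The SciPy sketch below, using random arrays, is an illustrative assumption rather than material from the slides: convolving with K gives the same result as cross-correlating with the flipped kernel, and convolution itself is commutative.

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

rng = np.random.default_rng(0)
I = rng.standard_normal((6, 6))   # 2-D input "image"
K = rng.standard_normal((3, 3))   # 2-D kernel

# True convolution flips the kernel; cross-correlation does not.
conv = convolve2d(I, K)
corr_with_flipped = correlate2d(I, K[::-1, ::-1])
print(np.allclose(conv, corr_with_flipped))             # True

# Commutativity of convolution itself: I * K == K * I.
print(np.allclose(convolve2d(I, K), convolve2d(K, I)))  # True
```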
Motivation
 Convolution leverages three important ideas that can help improve
a machine learning system: sparse interactions, parameter sharing,
and equivariant representations.
 Moreover, convolution provides a means for working with inputs of
variable size.

 Traditional neural network layers use matrix multiplication by a
matrix of parameters with a separate parameter describing
the interaction between each input unit and each output unit
 This means every output unit interacts with every input unit.
Convolutional networks, however, typically have sparse
interactions (also referred to as sparse connectivity or sparse
weights).
 This is accomplished by making the kernel smaller than the input
Fig: Sparse connectivity. When s is formed by convolution with a
kernel of width 3, only three outputs are affected by x3. When s is
formed by matrix multiplication, connectivity is no longer sparse, so
all of the outputs are affected by x3.

Fig: Sparse connectivity. We highlight one output unit, s3, and also
highlight the input units in x that affect this unit. These units are
known as the receptive field of s3. When s is formed by convolution
with a kernel of width 3, only three inputs affect s3. When s is formed
by matrix multiplication, connectivity is no longer sparse, so all of the
inputs affect s3.
Figure: The receptive field of the units in the deeper layers of a
convolutional network is larger than the receptive field of the units in the
shallow layers. This effect increases if the network includes architectural
features like strided convolution or pooling. This means that even
though direct connections in a convolutional net are very sparse, units in
the deeper layers can be indirectly connected to all or most of the input
image.
 Parameter sharing refers to using the same parameter for more
than one function in a model.
 In a traditional neural net, each element of the weight matrix is
used exactly once when computing the output of a layer.
 It is multiplied by one element of the input and then never
revisited.
 As a synonym for parameter sharing, one can say that a network
has tied weights, because the value of the weight applied to one
input is tied to the value of a weight applied elsewhere.
 In a convolutional neural net, each member of the kernel is used
at every position of the input.
 The parameter sharing used by the convolution operation means
that rather than learning a separate set of parameters for every
location, we learn only one set.
Figure: Parameter sharing. The black arrows indicate uses of the
central element of a 3-element kernel in a convolutional model. Due to
parameter sharing, this single parameter is used at all input locations.
The single black arrow indicates the use of the central element of the
weight matrix in a fully connected model. This model has no parameter
sharing, so the parameter is used only once.
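To make the saving from parameter sharing concrete, here is a small back-of-the-envelope sketch; the layer sizes (a 100×100 single-channel input mapped to a 100×100 output with a 3×3 kernel) are illustrative assumptions, not figures from the slides.

```python
# Parameter count: fully connected layer vs. convolutional layer
# for a 100x100 single-channel input mapped to a 100x100 output.
in_h, in_w = 100, 100
kernel_h, kernel_w = 3, 3

fully_connected_params = (in_h * in_w) ** 2   # separate weight per input-output pair
conv_params = kernel_h * kernel_w             # one kernel shared across all locations

print(fully_connected_params)  # 100000000
print(conv_params)             # 9
```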
 In the case of convolution, the particular form of parameter
sharing causes the layer to have a property called equivariance to
translation.
 Equivariant means that if the input changes, the output changes in
the same way.
 A function f(x) is equivariant to a function g if f(g(x)) = g(f(x)).
 In the case of convolution, if we let g be any function that
translates the input, i.e., shifts it, then the convolution function is
equivariant to g.
 For example, let I be a function giving image brightness at integer
coordinates.
 Let g be a function mapping one image function to another image
function, such that I' = g(I) is the image function with
I'(x, y) = I(x − 1, y).
 This shifts every pixel of I one unit to the right. If we apply this
transformation to I and then apply convolution, the result will be
the same as if we applied convolution to I and then applied the
transformation to the output.
 Finally, some kinds of data cannot be processed by neural networks
defined by matrix multiplication with a fixed-shape matrix.
 Convolution enables processing of some of these kinds of data.
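A quick numerical check of the equivariance-to-translation property above, sketched in NumPy (the 1-D signal and the zero-prepend used as the translation g are illustrative assumptions): shifting the input and then convolving gives the same result as convolving and then shifting the output.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(20)   # 1-D input signal
w = rng.standard_normal(3)    # small kernel

def g(signal):
    """Translate the signal one step later in time by prepending a zero."""
    return np.concatenate(([0.0], signal))

shift_then_convolve = np.convolve(g(x), w)
convolve_then_shift = g(np.convolve(x, w))
print(np.allclose(shift_then_convolve, convolve_then_shift))  # True
```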
POOLING
 A typical layer of a convolutional network consists of three stages.
 There are two common terminologies for describing these stages.
In the first, the convolutional net is viewed as a small number of
relatively complex layers, with each layer having many "stages";
here there is a one-to-one mapping between kernel tensors and
network layers.
 In the second, the convolutional net is viewed as a larger number
of simple layers; every step of processing is regarded as a layer in
its own right. This means that not every "layer" has parameters.
 In the first stage, the layer performs several convolutions in
parallel to produce a set of linear activations.
 In the second stage, each linear activation is run through a
nonlinear activation function, such as the rectified linear activation
function. This stage is sometimes called the detector stage.
 In the third stage, we use a pooling function to modify the output of
the layer further.
 A pooling function replaces the output of the net at a certain
location with a summary statistic of the nearby outputs.
 For example, the max pooling operation reports the maximum
output within a rectangular neighborhood. Other popular pooling
functions include the average of a rectangular neighborhood.
 In all cases, pooling helps to make the representation become
approximately invariant to small translations of the input.
 Invariance to translation means that if we translate the input by a
small amount, the values of most of the pooled outputs do not
change.
 Invariance to local translation can be a very useful property if we
care more about whether some feature is present than exactly
where it is.
 For example, when determining whether an image contains a face,
we need not know the location of the eyes with pixel-perfect
accuracy; we just need to know that there is an eye on the left side
of the face and an eye on the right side of the face.
Figure: Max pooling introduces invariance. A view of the middle of the
output of a convolutional layer: the bottom row shows outputs of the
nonlinearity, and the top row shows the outputs of max pooling, with a
stride of one pixel between pooling regions and a pooling region width of
three pixels. A view of the same network, after the input has been shifted
to the right by one pixel: every value in the bottom row has changed, but
only half of the values in the top row have changed, because the max
pooling units are only sensitive to the maximum value in the
neighborhood, not its exact location.
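The effect in the figure can be reproduced with a tiny NumPy sketch (the detector values and the circular shift used below are illustrative assumptions): after shifting the detector outputs, every detector value moves, but half of the max-pooled values stay the same.

```python
import numpy as np

def max_pool_1d(x, width=3, stride=1):
    """Max pooling over a 1-D detector output (region width 3, stride 1)."""
    return np.array([x[i:i + width].max()
                     for i in range(0, len(x) - width + 1, stride)])

detector = np.array([0.1, 1.0, 0.2, 0.1, 0.0, 0.3])
shifted = np.roll(detector, 1)   # circular shift standing in for a one-pixel shift

print(max_pool_1d(detector))     # [1.  1.  0.2 0.3]
print(max_pool_1d(shifted))      # [1.  1.  1.  0.2]  -> half the pooled values unchanged
```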
 The use of pooling can be viewed as adding an infinitely strong
prior that the function the layer learns must be invariant to small
translations.
 Pooling over spatial regions produces invariance to translation.
 Because pooling summarizes the responses over a whole
neighborhood, it is possible to use fewer pooling units than
detector units.

Figure: Example of learned invariances. A pooling unit that pools over multiple features that
are learned with separate parameters can learn to be invariant to transformations of the
input. Here we show how a set of three learned filters and a max pooling unit can learn to
become invariant to rotation. All three filters are intended to detect a hand-written 5. Each
filter attempts to match a slightly different orientation of the 5. When a 5 appears in the
input, the corresponding filter will match it and cause a large activation in a detector unit.
The max pooling unit then has a large activation regardless of which detector unit was
activated. Max pooling over spatial positions is naturally invariant to translation.
 When the number of parameters in the next layer is a function of
its input size, this reduction in the input size can also result in
improved statistical efficiency and reduced memory requirements
for storing the parameters.
 For many tasks, pooling is essential for handling inputs of varying
size.
CONVOLUTION AND POOLING AS AN
INFINITELY STRONG PRIOR
 Prior probability distribution: this is a probability distribution over
the parameters of a model that encodes our beliefs about what
models are reasonable, before we have seen any data.
 Priors can be considered weak or strong depending on how
concentrated the probability density in the prior is.
 A weak prior is a prior distribution with high entropy, such as a
Gaussian distribution with high variance.
 Such a prior allows the data to move the parameters more or less
freely.
 A strong prior has very low entropy, such as a Gaussian
distribution with low variance. Such a prior plays a more active
role in determining where the parameters end up.
 An infinitely strong prior places zero probability on some
parameters and says that these parameter values are completely
forbidden, regardless of how much support the data gives to those
values.
 This infinitely strong prior says that the weights for one hidden
unit must be identical to the weights of its neighbor, but shifted in
space.
 The prior also says that the weights must be zero, except for in the
small, spatially contiguous receptive field assigned to that hidden
unit.
 We can think of the use of convolution as introducing an infinitely
strong prior probability distribution over the parameters of a layer.
 This prior says that the function the layer should learn contains
only local interactions and is equivariant to translation.
 Likewise, the use of pooling is an infinitely strong prior that each
unit should be invariant to small translations.
 Thinking of a convolutional net as a fully connected net with an
infinitely strong prior can give us some insights into how
convolutional nets work.
 Convolution and pooling can cause underfitting.
 Convolution and pooling are only useful when the assumptions
made by the prior are reasonably accurate.
 Some convolutional network architectures are designed to use
pooling on some channels but not on other channels, in order to get
both highly invariant features and features that will not underfit
when the translation invariance prior is incorrect.
 Another key insight from this view is that we should only compare
convolutional models to other convolutional models in benchmarks
of statistical learning performance.

Figure: Examples of architectures for classification with convolutional networks.


Variants of the Basic Convolution Function
 Convolution with a single kernel can only extract one kind of
feature
 We want each layer of our network to extract many kinds of
features, at many locations.
 In a multilayer convolutional network, the input to the second
layer is the output of the first layer, which usually has the output
of many different convolutions at each position.
 When working with images, we usually think of the input and
output of the convolution as being 3-D tensors, with one index into
the different channels and two indices into the spatial coordinates
of each channel.
 Because convolutional networks usually use multi-channel
convolution, the linear operations they are based on are not
guaranteed to be commutative, even if kernel flipping is used.
These multi-channel operations are only commutative if each
operation has the same number of output channels as input
channels.
 Assume we have a 4-D kernel tensor K with element K_{i,j,k,l}
giving the connection strength between a unit in channel i of the
output and a unit in channel j of the input, with an offset of k rows
and l columns between the output unit and the input unit.
 Assume our input consists of observed data V with element V_{i,j,k}
giving the value of the input unit within channel i at row j and
column k.
 Assume our output consists of Z with the same format as V. If Z is
produced by convolving K across V without flipping K, then
 Z_{i,j,k} = Σ_{l,m,n} V_{l, j+m−1, k+n−1} K_{i,l,m,n}
 where the summation over l, m and n is over all values for which
the tensor indexing operations inside the summation are valid.
 We may want to skip over some positions of the kernel in order to
reduce the computational cost. We can think of this as
downsampling the output of the full convolution function.
 If we want to sample only every s pixels in each direction in the
output, then we can define a downsampled convolution function c
such that
 Z_{i,j,k} = c(K, V, s)_{i,j,k} = Σ_{l,m,n} V_{l, (j−1)×s+m, (k−1)×s+n} K_{i,l,m,n}
 We refer to s as the stride of this downsampled convolution.
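The two formulas above can be implemented directly with a few NumPy loops. The sketch below is an illustrative assumption rather than a reference implementation: the function name, 0-based indexing, and "valid" boundary handling are choices made here. It applies a 4-D kernel stack K to a multi-channel input V with stride s.

```python
import numpy as np

def multichannel_conv(K, V, s=1):
    """Strided multi-channel 'convolution' (without kernel flipping),
    following Z[i, j, k] = sum_{l,m,n} V[l, j*s + m, k*s + n] * K[i, l, m, n]
    with 0-based indexing and 'valid' boundaries.

    K: kernel tensor, shape (out_channels, in_channels, kh, kw)
    V: input tensor,  shape (in_channels, H, W)
    s: stride
    """
    out_c, in_c, kh, kw = K.shape
    _, H, W = V.shape
    out_h = (H - kh) // s + 1
    out_w = (W - kw) // s + 1
    Z = np.zeros((out_c, out_h, out_w))
    for i in range(out_c):
        for j in range(out_h):
            for k in range(out_w):
                patch = V[:, j*s:j*s + kh, k*s:k*s + kw]   # (in_c, kh, kw)
                Z[i, j, k] = np.sum(patch * K[i])
    return Z

# Example: 3-channel 8x8 input, 4 output channels, 3x3 kernels, stride 2.
rng = np.random.default_rng(0)
V = rng.standard_normal((3, 8, 8))
K = rng.standard_normal((4, 3, 3, 3))
print(multichannel_conv(K, V, s=2).shape)   # (4, 3, 3)
```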
Figure: Convolution with a stride. In this example, we use a stride of two.
(Top) Convolution with a stride length of two implemented in a single
operation. (Bottom) Convolution with a stride greater than one pixel is
mathematically equivalent to convolution with unit stride followed by
downsampling. Obviously, the two-step approach involving downsampling is
computationally wasteful, because it computes many values that are then
discarded.
 One essential feature of any convolutional network implementation
is the ability to implicitly zero-pad the input V in order to make it
wider.
 Zero padding the input allows us to control the kernel width and
the size of the output independently.
 Three special cases of the zero-padding setting:
 The extreme case in which no zero-padding is used whatsoever,
and the convolution kernel is only allowed to visit positions where
the entire kernel is contained entirely within the image.
 In this case, all pixels in the output are a function of the same
number of pixels in the input, so the behavior of an output pixel is
somewhat more regular. (In MATLAB terminology, this is called
valid convolution.)
 The size of the output shrinks at each layer.
 If the input image has width m and the kernel has width k, the
output will be of width m − k + 1.
 Since the shrinkage is greater than 0, it limits the number of
convolutional layers that can be included in the network.
 As layers are added, the spatial dimension of the network will
eventually drop to 1 × 1, at which point additional layers cannot
meaningfully be considered convolutional.
 Another special case of the zero-padding setting is when just
enough zero-padding is added to keep the size of the output equal
to the size of the input. (MATLAB calls this same convolution.)
 In this case, the network can contain as many convolutional layers
as the available hardware can support, since the operation of
convolution does not modify the architectural possibilities available
to the next layer.
 The input pixels near the border influence fewer output pixels than
the input pixels near the center.
 This can make the border pixels somewhat underrepresented in
the model.
 The other extreme case, which MATLAB refers to as full
convolution, in which enough zeroes are added for every pixel to be
visited k times in each direction, resulting in an output image of
width m + k − 1.
 In this case, the output pixels near the border are a function of
fewer pixels than the output pixels near the center.
 This can make it difficult to learn a single kernel that performs
well at all positions in the convolutional feature map.
 Usually the optimal amount of zero padding (in terms of test set
classification accuracy) lies somewhere between “valid” and “same”
convolution.
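A small helper summarizing the three output widths discussed above; the function name and the example sizes are illustrative assumptions.

```python
def conv_output_width(m, k, padding):
    """Output width of a 1-D convolution for the three zero-padding
    schemes above (m = input width, k = kernel width)."""
    if padding == "valid":   # no padding: kernel stays fully inside the input
        return m - k + 1
    if padding == "same":    # pad just enough to keep the output equal to the input
        return m
    if padding == "full":    # pad so every input pixel is visited k times
        return m + k - 1
    raise ValueError("padding must be 'valid', 'same', or 'full'")

for mode in ("valid", "same", "full"):
    print(mode, conv_output_width(m=32, k=5, padding=mode))
# valid 28, same 32, full 36
```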
 In some cases, we do not actually want to use convolution, but
rather locally connected layers, in which every connection has its
own weight, specified by a 6-D tensor W. The indices into W are
respectively: i, the output channel; j, the output row; k, the output
column; l, the input channel; m, the row offset within the input; and
n, the column offset within the input.
 The linear part of a locally connected layer is then given by
 Z_{i,j,k} = Σ_{l,m,n} V_{l, j+m−1, k+n−1} w_{i,j,k,l,m,n}
 This is sometimes also called unshared convolution, because it is a
similar operation to discrete convolution with a small kernel, but
without sharing parameters across locations.
 Locally connected layers are useful when we know that each feature
should be a function of a small part of space, but there is no reason
to think that the same feature should occur across all of space.
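A minimal NumPy sketch of the locally connected (unshared) formula above, under the same 0-based-indexing and "valid"-boundary assumptions as the earlier convolution sketch; the shapes are illustrative. Unlike the shared-kernel version, W stores a separate block of weights for every output location.

```python
import numpy as np

def locally_connected(W, V):
    """Linear part of a locally connected (unshared-convolution) layer,
    following Z[i, j, k] = sum_{l,m,n} V[l, j + m, k + n] * W[i, j, k, l, m, n].

    W: weights, shape (out_c, out_h, out_w, in_c, kh, kw)
    V: input,   shape (in_c, H, W)
    """
    out_c, out_h, out_w, in_c, kh, kw = W.shape
    Z = np.zeros((out_c, out_h, out_w))
    for i in range(out_c):
        for j in range(out_h):
            for k in range(out_w):
                patch = V[:, j:j + kh, k:k + kw]      # (in_c, kh, kw)
                Z[i, j, k] = np.sum(patch * W[i, j, k])
    return Z

rng = np.random.default_rng(0)
V = rng.standard_normal((3, 8, 8))
W = rng.standard_normal((4, 6, 6, 3, 3, 3))   # separate weights per output location
print(locally_connected(W, V).shape)          # (4, 6, 6)
```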
Fig: Comparison of local connections, convolution, and full connections.
(Top) A locally connected layer with a patch size of two pixels. Each edge
is labeled with a unique letter to show that each edge is associated with
its own weight parameter.
(Center) A convolutional layer with a kernel width of two pixels. This
model has exactly the same connectivity as the locally connected layer.
(Bottom) A fully connected layer resembles a locally connected layer in
the sense that each edge has its own parameter.
 Tiled convolution offers a compromise between a convolutional
layer and a locally connected layer.
 Rather than learning a separate set of weights at every spatial
location, we learn a set of kernels that we rotate through as we
move through space.
 This means that immediately neighboring locations will have
different filters, like in a locally connected layer, but the memory
requirements for storing the parameters will increase only by a
factor of the size of this set of kernels, rather than the size of the
entire output feature map.
Fig: A comparison of locally connected layers, tiled convolution, and
standard convolution.
A locally connected layer has no sharing at all.
Tiled convolution has a set of t different kernels. Here we illustrate the
case of t = 2. One of these kernels has edges labeled “a” and “b,” while
the other has edges labeled “c” and “d.”
Traditional convolution is equivalent to tiled convolution with t = 1.
There is only one kernel and it is applied everywhere, as indicated in the
diagram by using the kernel with weights labeled “a” and “b” everywhere.
 To define tiled convolution algebraically, let K be a 6-D tensor,
where two of the dimensions correspond to different locations in
the output map.
 Rather than having a separate index for each location in the output
map, output locations cycle through a set of t different choices of
kernel stack in each direction:
 Z_{i,j,k} = Σ_{l,m,n} V_{l, j+m−1, k+n−1} K_{i,l,m,n, j%t+1, k%t+1}
 where % is the modulo operation, with t%t = 0, (t + 1)%t = 1, etc.
 If t is equal to the output width, this is the same as a locally
connected layer.
 If these filters learn to detect different transformed versions of the
same underlying features, then the max-pooled units become
invariant to the learned transformation.
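A NumPy sketch of tiled convolution under the same assumptions as the earlier sketches (0-based indexing, "valid" boundaries, illustrative shapes), cycling through t kernel stacks via the modulo operation:

```python
import numpy as np

def tiled_conv(K, V, t):
    """Tiled 'convolution' (no kernel flipping), following
    Z[i, j, k] = sum_{l,m,n} V[l, j + m, k + n] * K[i, l, m, n, j % t, k % t].

    K: kernels, shape (out_c, in_c, kh, kw, t, t)
    V: input,   shape (in_c, H, W)
    """
    out_c, in_c, kh, kw, _, _ = K.shape
    _, H, W = V.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    Z = np.zeros((out_c, out_h, out_w))
    for i in range(out_c):
        for j in range(out_h):
            for k in range(out_w):
                patch = V[:, j:j + kh, k:k + kw]
                Z[i, j, k] = np.sum(patch * K[i, :, :, :, j % t, k % t])
    return Z

rng = np.random.default_rng(0)
V = rng.standard_normal((3, 8, 8))
K = rng.standard_normal((4, 3, 3, 3, 2, 2))
print(tiled_conv(K, V, t=2).shape)   # (4, 6, 6)
```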
 Suppose we want to train a convolutional network that incorporates
strided convolution of kernel stack K applied to multi-channel image
V with stride s as defined by c(K, V, s).
 Suppose we want to minimize some loss function J(V, K). During
forward propagation, we will need to use c itself to output Z, which
is then propagated through the rest of the network and used to
compute the cost function J.
 During back-propagation, we will receive a tensor G such that
G_{i,j,k} = ∂J(V, K)/∂Z_{i,j,k}.
 To train the network, we need to compute the derivatives with
respect to the weights in the kernel. To do so, we can use a function
g(G, V, s)_{i,j,k,l} = ∂J(V, K)/∂K_{i,j,k,l} = Σ_{m,n} G_{i,m,n} V_{j, (m−1)×s+k, (n−1)×s+l}.
CST414
DEEP LEARNING
Module-3 PART-II

SYLLABUS

 Module-3 (Convolutional Neural Network)


 Convolutional Neural Networks – convolution operation, motivation,
pooling, Convolution and Pooling as an infinitely strong prior, variants
of convolution functions, structured outputs, data types, efficient
convolution algorithms
STRUCTURED OUTPUTS

 Convolutional networks can be used to output a high-dimensional,
structured object.
 Typically this object is just a tensor, emitted by a standard
convolutional layer. For example, the model might emit a tensor S,
where S_{i,j,k} is the probability that pixel (j, k) of the input to the
network belongs to class i.
 This allows the model to label every pixel in an image and draw
precise masks that follow the outlines of individual objects.
 When used for classification of a single object in an image, the
greatest reduction in the spatial dimensions of the network comes
from using pooling layers with large stride.
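A small NumPy sketch of such a structured output; the function name, toy sizes, and the use of a softmax over the class axis are illustrative assumptions. Per-pixel class scores are turned into the probability tensor S and then into a label map.

```python
import numpy as np

def pixelwise_class_probabilities(logits):
    """Turn a (num_classes, H, W) tensor of per-pixel scores into a tensor S
    where S[i, j, k] is the probability that pixel (j, k) belongs to class i
    (softmax over the class axis)."""
    shifted = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=0, keepdims=True)

logits = np.random.randn(5, 4, 6)          # 5 classes, 4x6 image (toy sizes)
S = pixelwise_class_probabilities(logits)
segmentation = S.argmax(axis=0)            # per-pixel label map, shape (4, 6)
print(segmentation)
```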
 One strategy for pixel-wise labeling of images is to produce an
initial guess of the image labels, then refine this initial guess using
the interactions between neighboring pixels.
 Once a prediction for each pixel is made, various methods can be
used to further process these predictions in order to obtain a
segmentation of the image into regions.
 The general idea is to assume that large groups of contiguous
pixels tend to be associated with the same label.
DATA TYPES
 The data used with a convolutional network usually consists
of several channels, each channel being the observation of a
different quantity at some point in space or time.
 One advantage to convolutional networks is that they can also
process inputs with varying spatial extents.

 These kinds of input simply cannot be represented by
traditional, matrix multiplication-based neural networks
EFFICIENT CONVOLUTION ALGORITHMS

 Modern convolutional network applications often involve networks
containing more than one million units. Powerful implementations
exploiting parallel computation resources are essential.
 Convolution is equivalent to converting both the input and the
kernel to the frequency domain using a Fourier transform,
performing point-wise multiplication of the two signals, and
converting back to the time domain using an inverse Fourier
transform.

 When a d-dimensional kernel can be expressed as the outer product
of d vectors, one vector per dimension, the kernel is called separable.
 When the kernel is separable, naive convolution is inefficient. It is
equivalent to composing d one-dimensional convolutions with each
of these vectors.
 The kernel also takes fewer parameters to represent as vectors.
 If the kernel is w elements wide in each dimension, then naive
multidimensional convolution requires O(w^d) runtime and
parameter storage space, while separable convolution requires
O(w × d) runtime and parameter storage space.
 Four implementation approaches: (i) the naïve approach, (ii)
convolution with a separable kernel, (iii) recursive filtering, and
(iv) convolution in the frequency domain.
Naïve approach
• The basic (or naïve) approach visits the individual time samples n
in the input signal f.
• In each position, it computes the inner product of the current
sample neighbourhood and the flipped kernel g, where the size of
the neighbourhood is practically equal to the size of the convolution
kernel.
 Basically, convolution is a memory-bound problem, i.e. the ratio
between the arithmetic operations and memory accesses is low. The
adjacent threads process the adjacent signal samples, including the
common neighbourhood.
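A direct sketch of this naïve approach in NumPy (the function name and "valid" boundary handling are illustrative assumptions): at each position, take the inner product of the current neighbourhood with the flipped kernel.

```python
import numpy as np

def naive_convolve(f, g):
    """Direct ('valid') 1-D convolution: at every position n, take the inner
    product of the current neighbourhood of f with the flipped kernel g."""
    g_flipped = g[::-1]
    out_len = len(f) - len(g) + 1
    s = np.empty(out_len)
    for n in range(out_len):
        s[n] = np.dot(f[n:n + len(g)], g_flipped)
    return s

f = np.random.randn(100)
g = np.random.randn(7)
print(np.allclose(naive_convolve(f, g), np.convolve(f, g, mode="valid")))  # True
```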
Separable convolution
 The convolution with these kernels can be simply decomposed into
several lower-dimensional convolutions.
 A separable convolution kernel must fulfil the condition that its
matrix has rank equal to one.
 Let us construct such a kernel. Given one row vector
u = (u1, u2, u3, . . . , um) and one column vector v^T = (v1, v2,
v3, . . . , vn), let us convolve them together, which yields their outer
product A with entries A_{ij} = v_i · u_j.
 It is clear that rank(A) = 1. Here, A is a matrix representing some
separable convolution kernel, while u and v are the previously
referred lower-dimensional (cheaper) convolution kernels.
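The rank-1 construction and the resulting speed trick can be checked numerically. In the SciPy sketch below (the kernel values and image size are illustrative assumptions), a 2-D convolution with the separable kernel A equals two cheaper 1-D convolutions, one per dimension.

```python
import numpy as np
from scipy.signal import convolve2d

# Build a rank-1 (separable) kernel as an outer product of two vectors.
u = np.array([1.0, 2.0, 1.0])          # row kernel (illustrative values)
v = np.array([1.0, 0.0, -1.0])         # column kernel
A = np.outer(v, u)                     # 3x3 separable 2-D kernel
print(np.linalg.matrix_rank(A))        # 1

img = np.random.randn(64, 64)

# Naive 2-D convolution with the full kernel ...
full_2d = convolve2d(img, A, mode="valid")

# ... versus two cheaper 1-D convolutions, one per dimension.
rows_only = convolve2d(img, v[:, None], mode="valid")        # along one axis
separable = convolve2d(rows_only, u[None, :], mode="valid")  # then the other

print(np.allclose(full_2d, separable))  # True
```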
Recursive filtering
 Convolution is a process where the inner product, whose size
corresponds to the kernel size, is computed again and again for
each individual sample. One of the vectors (the kernel) that enters
this operation is always the same.
 A recursive (IIR) filter instead computes each output sample from
the current input and previously computed output samples, so the
per-sample cost does not grow with the effective kernel length.
 Recursive filters on various architectures:
 Streaming architectures
 The data can be processed in a stream, keeping the memory
requirements at a minimum level.
 This allows moving the computation to relatively small and cheap
embedded systems.
 Recursive filters are thus used in various real-time applications
such as edge detection, video filtering, and optical flow.
 Parallel architectures
 As for parallel architectures, Robelley et al. presented a
mathematical formulation for computing time-invariant recursive
filters on general SIMD DSP architectures.
CONVOLUTION IN THE FREQUENCY DOMAIN
(Referred to as fast convolution.)
 The fast convolution can be even more efficient than the separable
version if the number of kernel samples is large enough.