
Convolutional Neural Networks

Dajiang Liu and Sen Yang


PHS 597
References
• Most of the materials in this lecture are taken from Chapter 9 of the Goodfellow et al. book
• Some introductions to convolution also come from
• https://towardsdatascience.com/a-comprehensive-introduction-to-different-types-of-convolutions-in-deep-learning-669281e58215
What is Convolution?
• Convolution is a commonly used operation for obtaining the distribution of the sum of two random variables
  • It is used for other purposes as well
• Consider two independently distributed random variables X and Y, with cumulative distribution functions $F(x)$ and $G(y)$ and density functions $f(x)$ and $g(y)$
• If we want to calculate the distribution of $Z = X + Y$,
  $$\Pr(Z \le z) = \Pr(X + Y \le z) = \int \Pr(Y \le z - x)\, f(x)\, dx$$
• If the density of Z exists, the density function equals
  $$p(z) = \int g(z - x)\, f(x)\, dx$$
• The operation above that involves $f$ and $g$ is called convolution, which is often denoted by $*$:
  $$p(z) = (f * g)(z)$$
Convolution

• Convolution can be thought of as an "average" filter
• Consider the convolution
  $$s(t) = \int f(t - a)\, g(a)\, da$$
• We can think of $s(t)$ as an average of $f(t - a)$, where $a$ is the distance from $t$
  • The values of $f(t - a)$ are weighted by the weights $g(a)$
  • The farther away a point is from $t$ (i.e., the bigger $|a|$ is), the less weight it typically carries
• "Average" here is a general term. For it to be a true average, we need $g(a)$ to be a probability density function
  • I.e., non-negative and integrating to 1
Convolution
• Convolution is commutative: $f * g$ is the same as $g * f$
• Yet, we often give different names to $f$ and $g$ in a convolution $s = f * g$
  • We call $f$ the input and $g$ the kernel
• In practice we usually deal with discrete functions, so the integral is replaced by a summation
  • In math, a summation can be viewed as an integral over the discrete measure
  $$s(t) = \sum_{x} f(x)\, g(t - x)$$
• Also, in practice, all functions have finite support, i.e., they are non-zero only over a finite set of points. So the summation above is over a finite number of different $x$'s (see the small sketch below)
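A minimal NumPy sketch of the discrete convolution above (the input and kernel values are illustrative; the kernel sums to 1, so this is a true "average" filter):

import numpy as np

f = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # input with finite support
g = np.array([0.25, 0.5, 0.25])           # kernel (weights sum to 1)

# Direct implementation of s(t) = sum_x f(x) g(t - x), keeping "valid" positions only
def conv1d_valid(f, g):
    k = len(g)
    return np.array([np.sum(f[t:t + k] * g[::-1]) for t in range(len(f) - k + 1)])

print(conv1d_valid(f, g))                 # [2. 3. 4.]
print(np.convolve(f, g, mode="valid"))    # same result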
Multi-dimensional Convolution
• In image applications, the convolution is done using multi-dimensional arrays (which we call tensors)
• A two-dimensional convolution has the form:
  $$S(i, j) = \sum_{m} \sum_{n} I(m, n)\, K(i - m, j - n)$$
• The convolution is commutative, i.e.
  $$S(i, j) = \sum_{m} \sum_{n} I(i - m, j - n)\, K(m, n)$$
• If we flip the kernel, $K^*(-m, -n) = K(m, n)$, the convolution becomes
  $$S(i, j) = \sum_{m} \sum_{n} I(i + m, j + n)\, K^*(m, n)$$
  In this case, we call it the cross-correlation between $I$ and $K^*$
• Given this equivalence, we use convolution and cross-correlation interchangeably in the following discussions.
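A minimal NumPy sketch of the 2D cross-correlation formula above (input and kernel are illustrative), which is the operation deep learning libraries typically implement as "convolution":

import numpy as np

def cross_correlate2d(I, K):
    # S(i, j) = sum_m sum_n I(i + m, j + n) K*(m, n), valid positions only
    H, W = I.shape
    kh, kw = K.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(I[i:i + kh, j:j + kw] * K)
    return out

I = np.arange(16, dtype=float).reshape(4, 4)
K = np.array([[1.0, 0.0], [0.0, -1.0]])
print(cross_correlate2d(I, K))   # 3x3 output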
Convolution as Matrix Multiplication
• Convolution can be expressed as matrix multiplication.
• Take a simple kernel $K(0) = 0$, $K(1) = 1$, $K(-1) = -1$
• For a given function $I(x)$, the convolution takes the form
  $$S(t) = \sum_{x} I(t - x)\, K(x)$$
• We can write the input as a vector and the kernel as a matrix, and the resulting convolution as a matrix multiplication
  $$\vec{S} = \tilde{K}\, \vec{I}, \qquad \tilde{K} = \begin{pmatrix} 1 & 0 & -1 & & \\ & 1 & 0 & -1 & \\ & & 1 & 0 & -1 \end{pmatrix}$$
• In algebra, $\tilde{K}$ is called a Toeplitz matrix (each row is a shift by 1 of the row above)
• **: we need to be careful about the boundary values
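A minimal NumPy sketch of the Toeplitz-matrix view (the input vector is illustrative; only "valid" positions are kept, so boundary values are not an issue):

import numpy as np

# Kernel K(1) = 1, K(0) = 0, K(-1) = -1, so S(t) = I(t - 1) - I(t + 1)
I = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0])
n_out = len(I) - 2

K_mat = np.zeros((n_out, len(I)))
for r in range(n_out):
    K_mat[r, r:r + 3] = [1.0, 0.0, -1.0]   # each row is the row above shifted by one

print(K_mat @ I)                                        # convolution as matrix multiplication
# Same values; np.convolve takes the kernel listed as (K(-1), K(0), K(1))
print(np.convolve(I, [-1.0, 0.0, 1.0], mode="valid"))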
Example: Convolution as edge detection
• Input: a black-and-white image of size 320x280
• Output: an image of size 319x280
• Using the kernel $K(0) = 1$, $K(1) = -1$ to process the image requires 319*280*3 operations
• Using a full matrix multiplication instead requires on the order of 320*280*319*280 floating point operations
2D Convolution Example
Why convolution?
• Parameter sharing
• Sparse interaction
• Equivariance to translation
• Flexible handling of inputs of different sizes
Sparse Connectivity
• Previously, in all of our examples, different nodes were linked by the multiplication of weights and inputs
  • In this case, each node depends on all nodes from the previous layer (called fully connected)
• In many real problems, data is organized into structures with sparse connectivity
  • For example, pixels in a picture are only correlated with nearby pixels; the farther apart two pixels are, the less dependent they become
• Sparse connectivity also reduces the computational burden
  • For fully connected networks, connecting a layer with m nodes to a layer with n nodes requires $O(mn)$ parameters
  • For sparsely connected neural networks, each node may only be connected to k nodes in the next layer (with k < n). In this case, only $O(mk)$ parameters are needed
Example:
[Figure: sparse connectivity vs. full connectivity]

Example: Sparse connectivity in multi-layer neural networks
[Figure: the receptive field of a unit $g$ in a deeper layer]
Parameter Sharing
• In a fully connected NN, the parameters that link a node with its receptive field are used only once.
• For a CNN, the entries of the kernel are reused across positions, which greatly reduces the number of parameters that need to be estimated
[Figure: CNN with a kernel with 3 parameters vs. fully connected NN]
Equivariant to Translation Property
• Translating the input before taking the convolution gives the same result as taking the convolution first and then translating.
• Mathematically,
  $$s(t) = (f * g)(t) = \int f(t - x)\, g(x)\, dx$$
• Translating $s$ by $\Delta t$ we get
  $$s(t + \Delta t) = \int f(t + \Delta t - x)\, g(x)\, dx = \int \tilde{f}(t - x)\, g(x)\, dx$$
  where $\tilde{f}(t) := f(t + \Delta t)$ (which can be considered a shifted image)
• Convolution is not necessarily equivariant to other transformations (e.g., scaling)
Standard CNN Architecture
• A CNN layer usually follows the pattern:
  • Convolution
  • Detector (applying a non-linear activation function to the output of the convolution)
  • Pooling
    • E.g., max pooling: selecting the max value of a rectangular region.
Pooling
• Pooling helps to make the convolution output stable
  • E.g., more invariant (stable) to translations
• This is very useful in image analysis
  • E.g., a small shift in the image should not affect the determination of whether the image is a cat or a dog.
Pooling
• Pooling is a down-sampling procedure, intended to make the detector output invariant to small shifts in the data
• Pooling is almost always done using a 2x2 filter
• Within each filter window, we retain either
  • The average (average pooling), or
  • The max (max pooling)
• For image analysis, many tasks involve edge detection
  • Max pooling works much better than average pooling here
  • Edges are where pixel values change rapidly; averaging tends to make these changes go away.
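A minimal NumPy sketch of 2x2 max pooling with stride 2 on a single-channel feature map (the input values are illustrative):

import numpy as np

def max_pool_2x2(x):
    # Assumes H and W are even; groups the map into 2x2 blocks and keeps the max of each
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

x = np.array([[1, 3, 2, 0],
              [4, 2, 1, 5],
              [6, 0, 7, 1],
              [2, 8, 3, 4]], dtype=float)
print(max_pool_2x2(x))
# [[4. 5.]
#  [8. 7.]]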
Stride
• Stride controls how the filter convolves around the input volume.
• The amount by which the filter shifts is the stride.
• Stride is normally set so that the output size is an integer and not a fraction.
• Stride, similar to pooling, is a downsampling technique.
Stride is a down-sampling procedure

[Figure: a stride-2 convolution is equivalent to a stride-1 convolution followed by down-sampling]
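A small numerical check of this equivalence (the input and kernel are illustrative):

import numpy as np

I = np.arange(8, dtype=float)
K = np.array([1.0, -1.0])

stride1 = np.array([np.dot(I[t:t + 2], K) for t in range(len(I) - 1)])
stride2 = np.array([np.dot(I[t:t + 2], K) for t in range(0, len(I) - 1, 2)])

# Stride 2 gives the same values as stride 1 followed by keeping every other output
print(np.allclose(stride2, stride1[::2]))   # True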
Padding
• Applying convolutional filters often leads to a reduction of the image size
• If the filter is applied multiple times, the image could shrink to nothing.
• To avoid this, we can apply a very simple technique called zero padding (padding for short).
• The general formula for the output size is
  $$O = \frac{W - K + 2P}{S} + 1$$
  • W is the width of the input
  • K is the kernel size
  • P is the amount of padding
  • S is the size of the stride
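A small helper evaluating this formula (the function name is hypothetical; it assumes the stride divides evenly, otherwise the result is floored):

def conv_output_size(W, K, P=0, S=1):
    # O = (W - K + 2P) / S + 1
    return (W - K + 2 * P) // S + 1

# AlexNet-style first layer: width 224, 11x11 kernel, stride 4, no padding
print(conv_output_size(224, 11, P=0, S=4))   # 54
# The same kernel with padding 5 and stride 1 keeps the width at 224
print(conv_output_size(224, 11, P=5, S=1))   # 224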
Impact of Padding

[Figure: without padding, each convolution reduces the dimension by 6 and only 3 layers are possible; with padding, the size is preserved]
Standard CNN Architectures
• CNNs typically run many convolutions in parallel
• Images are often multi-channel
  • Examples include RGB images
2D convolution Illustration
Convolution in RGB images
• RGB images contain 3 channels representing the intensities of the red, green and blue components
• More generally, data may be presented with multiple channels
How 2D-Convolution Works for Multiple
Channels

• For multi-channel data, when we say filters of size k x k, unless otherwise stated, we mean filters of size k x k x D_in
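A minimal NumPy sketch showing that one k x k filter on multi-channel input is really k x k x D_in: the products are summed over all input channels (shapes are illustrative):

import numpy as np

def conv2d_multichannel(I, K):
    # I: (H, W, D_in), K: (k, k, D_in) -> output: (H - k + 1, W - k + 1)
    H, W, D = I.shape
    k = K.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(I[i:i + k, j:j + k, :] * K)
    return out

I = np.random.rand(5, 5, 3)     # e.g. a tiny RGB patch
K = np.random.rand(3, 3, 3)     # one 3 x 3 x 3 filter -> one output channel
print(conv2d_multichannel(I, K).shape)   # (3, 3)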
How to change dimensions between different
layers with 2D Convolution
• Let us say that the input layer has height, width and depth $H_{in}, W_{in}, D_{in}$
• Let us say that the output layer has height, width and depth $H_{out}, W_{out}, D_{out}$
• The idea is to apply $D_{out}$ filters and stack their outputs
3D convolution
• In addition to 2D filters, 3D filters can be used.
1 x 1 Convolution
• While called a 1 x 1 convolution, it is really a 3D convolution with a 1 x 1 x D filter, where D is the depth of the input layer
• Applying one such 1 x 1 x D filter yields an output of size W x H x 1
• Applying N such filters gives an output of size W x H x N
Why 1 x 1 Convolution
• Google's Inception network used this structure
• Several notable benefits
  • Dimension reduction
    • Applying a 1 x 1 x D filter collapses the input to dimension W x H x 1
  • Information embedding
    • Even if the entire depth is collapsed to 1, a considerable amount of information can still be retained
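A minimal NumPy sketch of the 1 x 1 convolution (sizes are illustrative): at every spatial position it is just a weighted sum across the D input channels, so N filters map W x H x D to W x H x N:

import numpy as np

H, W, D, N = 4, 4, 64, 8
x = np.random.rand(H, W, D)
filters = np.random.rand(N, D)          # each row is one 1 x 1 x D filter

out = np.einsum("hwd,nd->hwn", x, filters)
print(out.shape)                        # (4, 4, 8): depth reduced from 64 to 8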
Transposed
convolution
• Recall that convolution is a linear operation that can be written as a matrix multiplication
[Figure: a length-16 vector (Z1, ..., Z16) and the corresponding 4x4 grid (Z1 ... Z16), illustrating the reshaping between the flattened and spatial representations]
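A minimal NumPy sketch of this view (sizes are illustrative): if the convolution is $\vec{y} = \tilde{K}\vec{x}$, the transposed convolution multiplies by $\tilde{K}^{T}$, mapping the smaller output back to the input's shape (it restores the shape, not the original values):

import numpy as np

x = np.arange(6, dtype=float)                    # input of length 6
K_mat = np.zeros((4, 6))                         # valid convolution with a length-3 kernel
for r in range(4):
    K_mat[r, r:r + 3] = [1.0, 0.0, -1.0]

y = K_mat @ x                                    # forward convolution: length 4
x_up = K_mat.T @ y                               # transposed convolution: back to length 6
print(y.shape, x_up.shape)                       # (4,) (6,)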
Computation of Convolution – Separable
Kernels
• Separable kernel: a kernel that can be written as the outer product of two vectors
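A minimal sketch (using SciPy, with a Sobel kernel as the example): a separable kernel is the outer product of two vectors, so the 2D convolution can be replaced by two cheaper 1D convolutions:

import numpy as np
from scipy.signal import convolve2d

col = np.array([[1.0], [2.0], [1.0]])    # smoothing along one axis
row = np.array([[1.0, 0.0, -1.0]])       # differencing along the other
K = col @ row                            # 3x3 Sobel kernel as an outer product

I = np.random.rand(6, 6)
full_2d = convolve2d(I, K, mode="valid")
two_1d = convolve2d(convolve2d(I, col, mode="valid"), row, mode="valid")
print(np.allclose(full_2d, two_1d))      # True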


ImageNet Competition
• 14 million images in over 20,000 categories
  • E.g., cat, dog, balloon, strawberry
• Over 1 million images have boxes around the object of interest
• Annotated by crowdsourcing
• Led by Fei-Fei Li
ImageNet Competition
ImageNet Competition
• Classification Error of winners over the years
AlexNet Architecture
Keras implementation:

import keras
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Activation, Flatten, Dense, Dropout

model = Sequential()

# 1st Convolutional Layer
model.add(Conv2D(filters=96, input_shape=(224,224,3), kernel_size=(11,11),
                 strides=(4,4), padding='valid'))
model.add(Activation('relu'))
# Max Pooling
model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2), padding='valid'))

# 2nd Convolutional Layer
model.add(Conv2D(filters=256, kernel_size=(11,11), strides=(1,1), padding='valid'))
model.add(Activation('relu'))
# Max Pooling
model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2), padding='valid'))

# 3rd Convolutional Layer
model.add(Conv2D(filters=384, kernel_size=(3,3), strides=(1,1), padding='valid'))
model.add(Activation('relu'))

# 4th Convolutional Layer
model.add(Conv2D(filters=384, kernel_size=(3,3), strides=(1,1), padding='valid'))
model.add(Activation('relu'))

# 5th Convolutional Layer
model.add(Conv2D(filters=256, kernel_size=(3,3), strides=(1,1), padding='valid'))
model.add(Activation('relu'))
# Max Pooling
model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2), padding='valid'))

# Passing it to a Fully Connected layer
model.add(Flatten())
# 1st Fully Connected Layer
model.add(Dense(4096, input_shape=(224*224*3,)))
model.add(Activation('relu'))
# Add Dropout to prevent overfitting
model.add(Dropout(0.4))

# 2nd Fully Connected Layer
model.add(Dense(4096))
model.add(Activation('relu'))
# Add Dropout
model.add(Dropout(0.4))

# 3rd Fully Connected Layer
model.add(Dense(1000))
model.add(Activation('relu'))
# Add Dropout
model.add(Dropout(0.4))

# Output Layer
model.add(Dense(17))
model.add(Activation('softmax'))

model.summary()

# Compile the model
model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer='adam', metrics=["accuracy"])
My GPU Machine
Local Response Normalization (LRN)
• LRN was a technique introduced in AlexNet
• The original motivation was to make sure that the output from each layer is bounded, and that its magnitude does not change with the depth of the network
• Two types of LRN
  • Inter-channel normalization:
  $$b^{i}_{x,y} = \frac{a^{i}_{x,y}}{\left(k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \left(a^{j}_{x,y}\right)^{2}\right)^{\beta}}$$
Inter-channel vs. Intra-channel LRN
Inter Channel LRN Example

• $k = 0$, $\alpha = 1$, $\beta = 1$, $n = 2$
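A minimal NumPy sketch of inter-channel LRN using the toy settings above (k = 0, alpha = 1, beta = 1, n = 2); the input values are illustrative:

import numpy as np

def inter_channel_lrn(a, k=0.0, alpha=1.0, beta=1.0, n=2):
    # a has shape (channels, H, W); each channel is normalized by its n/2 neighbors
    C = a.shape[0]
    b = np.zeros_like(a)
    for i in range(C):
        lo, hi = max(0, i - n // 2), min(C - 1, i + n // 2)
        denom = (k + alpha * np.sum(a[lo:hi + 1] ** 2, axis=0)) ** beta
        b[i] = a[i] / denom
    return b

a = np.random.rand(4, 3, 3) + 0.1   # 4 channels; offset avoids dividing by ~0 since k = 0
print(inter_channel_lrn(a).shape)   # (4, 3, 3)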
Intra-Channel LRN Example
• The intra-channel normalization works as follows:
  $$b^{k}_{x,y} = \frac{a^{k}_{x,y}}{\left(k + \alpha \sum_{i=\max(0,\, x-n/2)}^{\min(W,\, x+n/2)} \sum_{j=\max(0,\, y-n/2)}^{\min(H,\, y+n/2)} \left(a^{k}_{i,j}\right)^{2}\right)^{\beta}}$$
• In the example, $k = 0$, $\alpha = 1$, $\beta = 1$, $W = H = 8$, $n = 2$
• $a^{k}_{x,y}$ is the value at position $(x, y)$ in channel $k$
  • The activation $a$ should be distinguished from $\alpha$
Batch Normalization
• Batch normalization helps reduce internal covariate shift
Batch Normalization (BN) Algorithm
• For the values $x$ in a mini-batch, we scale and shift them so that they have the same mean and standard deviation
• The inputs in a batch are $x_1, \ldots, x_m$
• We calculate the mean and variance of the batch, which we denote $\mu_B$ and $\sigma_B^2$
• We calculate the Z-score for the batch:
  $$z_i = \sigma_B^{-1}\left(x_i - \mu_B\right)$$
• The normalized output is given by (a sketch follows below)
  $$y_i = \gamma z_i + \beta$$
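A minimal NumPy sketch of the batch-normalization forward pass described above (a small epsilon is added for numerical stability; the batch is illustrative):

import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    mu = x.mean(axis=0)                      # mini-batch mean
    var = x.var(axis=0)                      # mini-batch variance
    z = (x - mu) / np.sqrt(var + eps)        # z-score within the batch
    return gamma * z + beta                  # learned scale and shift

x = np.random.randn(32, 10) * 5 + 3          # a batch of 32 examples, 10 features
y = batch_norm(x)
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # ~0 and ~1 per feature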
ZFNet
• ZFNet was the winner of the 2013 competition
• It is heavily based upon AlexNet
VGGNet
VGGNet
• M: max pooling
• LRN: local response normalization
• Different columns represent different versions of VGGNet
  • A and C represent smaller networks
  • VGG16 and VGG19 are columns D and E
VGGNet Properties
• VGGNet, while not winning the contest, proposed useful ideas that were widely used later on
• One unique feature is that VGGNet uses small filters at great depth
  • This could save parameters and potentially retrieve interesting features.
  • Greater depth means more ReLU activations and more non-linearity
• VGGNet is very slow to train
  • A workaround is to train a smaller network first, and then use the output from the smaller network as input for further training
GoogLeNet
• GoogLeNet introduced a novel concept, the inception module
• Image features can occur at different resolutions
• Using filters of different sizes can help extract features of different granularity
GoogLeNet Implementation - Bottleneck
• For filters of different sizes, we first apply a 1 x 1 convolution bottleneck to collapse the channels and reduce the dimension of the data (a sketch follows below)
• Recall: Inception V1
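A minimal Keras sketch of a GoogLeNet-style inception module (the filter counts are illustrative, not the exact Inception V1 numbers): 1 x 1 bottlenecks collapse the channels before the more expensive 3 x 3 and 5 x 5 convolutions, and the branches are concatenated along the channel axis:

from tensorflow.keras import layers

def inception_block(x):
    b1 = layers.Conv2D(64, (1, 1), padding="same", activation="relu")(x)

    b2 = layers.Conv2D(32, (1, 1), padding="same", activation="relu")(x)   # bottleneck
    b2 = layers.Conv2D(64, (3, 3), padding="same", activation="relu")(b2)

    b3 = layers.Conv2D(16, (1, 1), padding="same", activation="relu")(x)   # bottleneck
    b3 = layers.Conv2D(32, (5, 5), padding="same", activation="relu")(b3)

    b4 = layers.MaxPooling2D((3, 3), strides=(1, 1), padding="same")(x)
    b4 = layers.Conv2D(32, (1, 1), padding="same", activation="relu")(b4)

    return layers.Concatenate(axis=-1)([b1, b2, b3, b4])

inputs = layers.Input(shape=(28, 28, 192))   # illustrative input size
outputs = inception_block(inputs)
print(outputs.shape)                         # (None, 28, 28, 192)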
GoogLeNet – Output Layer
• Another interesting feature of GoogLeNet lies in the output layer
• Instead of using a fully connected layer at the end, GoogLeNet uses average pooling, which reduces the number of parameters and improves performance (by 0.7%).
GoogLeNet Dimensions
GoogLeNet
• Overall architecture
ResNet
• ResNet is the first deep learning model that attains human-level accuracy
• It mostly benefits from fitting a much deeper network
• It is generally believed that training deeper networks should ALWAYS be helpful for solving complex problems
  • Shallow networks are special cases of deep networks in which the extra layers are identity mappings
• Yet fitting deep networks is often not easy
  • Convergence can take a very long time;
  • There may be exploding and vanishing gradient problems
ResNet Key Algorithm
• One key idea that motivates ResNet is that, by copying the input between layers (skip connections), we allow portions of the data to be fitted with shallower networks than other parts of the data (a sketch follows below)
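A minimal Keras sketch of a residual block (filter counts are illustrative, not the exact ResNet configuration): the input is carried around two convolutional layers through an identity skip connection, so the layers only need to learn the residual:

from tensorflow.keras import layers

def residual_block(x, filters=64):
    # Assumes x already has `filters` channels so the addition is well-defined
    shortcut = x
    y = layers.Conv2D(filters, (3, 3), padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(filters, (3, 3), padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([y, shortcut])          # the identity skip connection
    return layers.Activation("relu")(y)

inputs = layers.Input(shape=(32, 32, 64))    # illustrative input size
outputs = residual_block(inputs, filters=64)
print(outputs.shape)                         # (None, 32, 32, 64)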
ResNet Implementation
Other Image Processing Applications
• In addition to classification, there are often more complex tasks that
are based upon similar models
• Object localization
• Object detection
• Segmentation
• Video processing
Object Localization
• Question: can we draw a box over the object that we deem present in the image?
• It seems natural that we can answer object localization based upon a classification model, e.g., AlexNet
• For the last couple of fully connected layers that are used for classification, we instead train a regression-type model
• The box surrounding the object is defined by 4 numbers: $(x, y, w, h)$
  • $(x, y)$ are the coordinates of the top-left corner of the box; $w$ and $h$ are its width and height
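A minimal Keras sketch of swapping the classification head for a box-regression head (layer sizes and the feature-map shape are illustrative assumptions, not from a specific localization paper):

from tensorflow.keras import layers

def localization_head(features):
    h = layers.Flatten()(features)
    h = layers.Dense(4096, activation="relu")(h)
    return layers.Dense(4, activation="linear")(h)   # predicts (x, y, w, h)

features = layers.Input(shape=(6, 6, 256))           # e.g. the last conv feature map
box = localization_head(features)
print(box.shape)                                     # (None, 4)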
Object Detection
• Often there is an unknown number of objects in the image
• If we somehow know a region that contains some object, we can run a standard CNN model to classify it
• So the problem boils down to finding region proposals
Region-based CNN
• One intuitive idea is to scan different regions of the image at different resolutions (but limit the number of regions to about 2,000)
• The basic algorithm is the following:
  • Run Selective Search to generate probable object regions.
  • Feed these patches to a CNN, followed by an SVM to predict the class of each patch.
  • Optimize the patches by training a bounding-box regression separately.
Spatial Pyramid Pooling
• Regions similar to those in the original image can be identified in the CNN output (feature map)
• Pooling is then designed based upon the feature map
Spatial Pyramid Pooling
YOLO (You Only
Look Once)
