CNN 2
Filter / Kernel
Producing Feature Maps
[Figure: a filter/kernel slides over the input to produce the output feature map (Goodfellow 2016)]
Simple Kernels / Filters
X or X?
• Same convolution: zero pad the input so the output is the same size as the input dimensions
• Valid-only convolution: output produced only where the entire kernel is contained in the input (shrinks output)
• Full convolution: zero pad the input so an output is produced wherever an output value contains at least one input value (expands output)
x = tf.nn.conv2d(x, W, strides=[1,strides,strides,1],padding='SAME')
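A minimal sketch of how the padding mode changes the output's spatial size (assuming a 32×32 single-channel input and a 5×5 filter; tf.nn.conv2d supports only 'SAME' and 'VALID'):

import tensorflow as tf

x = tf.random.normal([1, 32, 32, 1])    # batch of one 32x32 single-channel image (assumed shape)
W = tf.random.normal([5, 5, 1, 8])      # 5x5 filter, 1 input channel, 8 output channels

same = tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')    # zero-padded
valid = tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='VALID')  # kernel fully inside input

print(same.shape)   # (1, 32, 32, 8)  -- output spatial size matches the input
print(valid.shape)  # (1, 28, 28, 8)  -- shrinks by F-1 = 4 in each spatial dimension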
[Figure: LeNet-5 architecture, ending in 5×5×16 feature maps followed by fully connected layers of 120, 84, and 10 units producing ŷ]
Reminder:
Output size = (N+2P-F)/stride + 1
This slide is taken from Andrew Ng [LeCun et al., 1998]
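For example, plugging LeNet-5's first convolution into this formula (32×32 input, 5×5 filter, no padding, stride 1): (32 + 2·0 - 5)/1 + 1 = 28, so each output feature map is 28×28.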
LeNet-5
• Only 60K parameters
• As we go deeper in the network: N_H ↓, N_W ↓, N_C ↑
• General structure (see the code sketch below):
conv->pool->conv->pool->FC->FC->output
Slide taken from "Forward and Backpropagation in Convolutional Neural Network" (Medium)
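A minimal Keras sketch of this conv->pool->conv->pool->FC->FC->output structure (the 6 and 16 filter counts, tanh activations, and average pooling follow the classic LeNet-5 description; treat the details as an illustrative assumption rather than the exact original):

import tensorflow as tf

# LeNet-5-style stack: conv -> pool -> conv -> pool -> FC -> FC -> output
lenet5 = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 1)),
    tf.keras.layers.Conv2D(6, kernel_size=5, activation='tanh'),    # 28x28x6
    tf.keras.layers.AveragePooling2D(pool_size=2, strides=2),       # 14x14x6
    tf.keras.layers.Conv2D(16, kernel_size=5, activation='tanh'),   # 10x10x16
    tf.keras.layers.AveragePooling2D(pool_size=2, strides=2),       # 5x5x16, as in the figure
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(120, activation='tanh'),
    tf.keras.layers.Dense(84, activation='tanh'),
    tf.keras.layers.Dense(10, activation='softmax'),                # 10 output classes
])
lenet5.summary()   # roughly 60K trainable parameters, matching the slide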
3b. Convolutional Neural Networks (CNNs)
An image classification CNN
Representation Learning in Deep CNNs
tf.keras.layers.Conv2D
For a neuron in the hidden layer:
- Take inputs from a patch
- Compute weighted sum
- Apply bias
Convolutional Layers: Local Connectivity
tf.keras.layers.Conv2D
4×4 filter: matrix of weights w_ij
For neuron (p, q) in the hidden layer:
1) apply a window of weights
2) compute linear combinations
3) activate with a non-linear function
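A small NumPy sketch of that per-neuron computation (the patch location p, q, the bias, and the ReLU activation are illustrative assumptions):

import numpy as np

def conv_neuron_output(image, weights, p, q, bias=0.0):
    # Output of hidden-layer neuron (p, q): weighted sum over a local patch, plus bias, then activation
    kh, kw = weights.shape                  # e.g. a 4x4 window of weights w_ij
    patch = image[p:p + kh, q:q + kw]       # 1) take inputs from the patch at (p, q)
    z = np.sum(weights * patch) + bias      # 2) compute the linear combination and add the bias
    return max(0.0, z)                      # 3) apply a non-linear function (ReLU)

image = np.random.rand(8, 8)
w = np.random.rand(4, 4)
print(conv_neuron_output(image, w, p=2, q=3))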
CNNs: Spatial Arrangement of Output Volume
Layer Dimensions: h × w × d
where h and w are the spatial dimensions and d (depth) = number of filters
Stride: filter step size
Receptive Field: locations in the input image that a node is path-connected to
tf.keras.layers.Conv2D(filters=d, kernel_size=(h,w), strides=s)
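A quick shape check (the 28×28×3 input and the specific layer settings are assumptions for illustration):

import tensorflow as tf

x = tf.random.normal([1, 28, 28, 3])   # one 28x28 RGB input (assumed)
layer = tf.keras.layers.Conv2D(filters=16, kernel_size=(3, 3), strides=2, padding='valid')
print(layer(x).shape)   # (1, 13, 13, 16): spatial dims follow (N-F)/stride + 1, depth = number of filters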
Introducing Non-Linearity
- Apply after every convolution operation (i.e., after convolutional layers)
- ReLU: pixel-by-pixel operation that replaces all negative values with zero
- Non-linear operation
Rectified Linear Unit (ReLU): f(x) = max(0, x)
tf.keras.layers.ReLU
When will we backpropagate through this? Once it “dies”, what happens to it?
tf.keras.layers.MaxPool2D(pool_size=(2,2), strides=2)
1) Reduced dimensionality
2) Spatial invariance
Pooling reduces dimensionality by giving up
spatial location
• max pooling reports the maximum output
within a defined neighborhood
• Padding can be SAME or VALID
x = tf.nn.max_pool(x, ksize=[1, k, k, 1], strides=[1, k, k, 1], padding='SAME')
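A tiny worked example of 2×2 max pooling halving each spatial dimension (the input values are arbitrary):

import tensorflow as tf

x = tf.constant([[1., 2., 5., 6.],
                 [3., 4., 7., 8.],
                 [9., 0., 1., 2.],
                 [5., 6., 3., 4.]])
x = tf.reshape(x, [1, 4, 4, 1])   # NHWC: batch, height, width, channels
pooled = tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
print(tf.reshape(pooled, [2, 2]).numpy())
# [[4. 8.]
#  [9. 4.]]  -- the maximum of each 2x2 neighborhood; the 4x4 input shrinks to 2x2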
1. Learn features in input image through convolution
2. Introduce non-linearity through activation function (real-world data is
non-linear!)
3. Reduce dimensionality and preserve spatial invariance with pooling
CNNs for Classification: Class Probabilities
def generate_model():
    model = tf.keras.Sequential([
        # first convolutional layer
        tf.keras.layers.Conv2D(32, kernel_size=3, activation='relu'),
        tf.keras.layers.MaxPool2D(pool_size=2, strides=2),
        # flatten and map to class probabilities (10 classes assumed here)
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])
    return model
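A quick usage sketch of the model above (the batch of 28×28 grayscale images and the 10-class output are assumptions for illustration):

import tensorflow as tf

model = generate_model()
images = tf.random.normal([8, 28, 28, 1])   # a batch of 8 grayscale images
probs = model(images)                        # forward pass
print(probs.shape)                           # (8, 10): one probability vector per image
print(tf.reduce_sum(probs, axis=1))          # each row sums to ~1 because of the softmax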
Today: Convolutional Neural Networks (CNNs)
1. Scene understanding and object recognition for machines (and humans)
– Scene/object recognition challenge. Illusions reveal primitives, conflicting info
– Human neurons/circuits. Visual cortex layers==abstraction. General cognition
2. Classical machine vision foundations: features, scenes, filters, convolution
– Spatial structure primitives: edge detectors & other filters, feature recognition
– Convolution: basics, padding, stride, object recognition, architectures
3. CNN foundations: LeNet, de novo feature learning, parameter sharing
– Key ideas: learn features, hierarchy, re-use parameters, back-prop filter learning
– CNN formalization: representations = (Conv+ReLU+Pool)×N layers + Fully-connected
4. Modern CNN architectures: millions of parameters, dozens of layers
– Feature invariance is hard: apply perturbations, learn for each variation
– ImageNet progression of best performers
– AlexNet: First top performer CNN, 60M parameters (from 60k in LeNet-5), ReLU
– VGGNet: simpler but deeper (16-19 layers), 140M parameters, ensembles
– GoogLeNet: new primitive=inception module, 5M params, no FC, efficiency
– ResNet: 152 layers, fits residuals to combat vanishing gradients and enable learning
5. Countless applications: General architecture, enormous power
– Semantic segmentation, facial detection/recognition, self-driving, image
colorization, optimizing pictures/scenes, up-scaling, medicine, biology, genomics
4a. Real-world feature invariance is hard
How can computers recognize objects?
Challenge:
• Objects can be anywhere in the scene, in any orientation, rotation, color hue, etc.
• How can we overcome this challenge?
Answer:
• Learn a ton of features (millions) from the bottom up
• Learn the convolutional filters, rather than pre-computing them
Feature invariance to perturbation is hard
Detect features to classify
Li/Johnson/Yeung CS231n