CNN 2

Convolutional neural networks (CNNs) apply learned filters via convolution to images to extract visual features at different levels of abstraction, from low-level edges to mid-level object parts to high-level objects and scenes. CNNs share parameters across their convolutional filters to learn features directly from data in a hierarchical fashion. Modern CNN architectures have millions of parameters and dozens of layers, applying techniques like residual connections to enable very deep networks for complex tasks like image classification.


Today: Convolutional Neural Networks (CNNs)

1. Scene understanding and object recognition for machines (and humans)


– Scene/object recognition challenge. Illusions reveal primitives, conflicting info
– Human neurons/circuits. Visual cortex layers==abstraction. General cognition
2. Classical machine vision foundations: features, scenes, filters, convolution
– Spatial structure primitives: edge detectors & other filters, feature recognition
– Convolution: basics, padding, stride, object recognition, architectures
3. CNN foundations: LeNet, de novo feature learning, parameter sharing
– Key ideas: learn features, hierarchy, re-use parameters, back-prop filter learning
– CNN formalization: representations(Conv+ReLU+Pool)*N layers + Fully-connected
4. Modern CNN architectures: millions of parameters, dozens of layers
– Feature invariance is hard: apply perturbations, learn for each variation
– ImageNet progression of best performers
– AlexNet: first top-performing CNN, 60M parameters (up from 60K in LeNet-5), ReLU
– VGGNet: simpler but deeper (8→19 layers), 140M parameters, ensembles
– GoogLeNet: new primitive = inception module, 5M params, no FC layers, efficiency
– ResNet: 152 layers; vanishing gradients → fit residuals to enable learning
5. Countless applications: General architecture, enormous power
– Semantic segmentation, facial detection/recognition, self-driving, image
colorization, optimizing pictures/scenes, up-scaling, medicine, biology, genomics
2a. Spatial structure for image recognition
Using Spatial Structure

Input: a 2D image, an array of pixel values.

Idea: connect patches of the input to neurons in a hidden layer. Each neuron is connected to a region of the input and only "sees" these values.
Using Spatial Structure

Connect patch in input layer to a single neuron in subsequent layer.


Use a sliding window to define connections.
How can we weight the patch to detect particular features?
Feature Extraction with Convolution
- Filter of size 4x4 : 16 different weights
- Apply this same filter to 4x4 patches in input
- Shift by 2 pixels for next patch

This “patchy” operation is convolution

1) Apply a set of weights – a filter – to extract local features

2) Use multiple filters to extract different features

3) Spatially share parameters of each filter
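
As a minimal sketch of this operation (not from the original slides; the 8×8 image and random values are illustrative assumptions), the 4×4 filter from above applied with a stride of 2 in NumPy:

import numpy as np

# One 4x4 filter (16 shared weights) slid over an 8x8 image with stride 2.
rng = np.random.default_rng(0)
image = rng.random((8, 8))
filt = rng.random((4, 4))

out = np.zeros((3, 3))                     # output size = (8 - 4)/2 + 1 = 3
for i in range(3):
    for j in range(3):
        patch = image[2*i:2*i + 4, 2*j:2*j + 4]
        out[i, j] = np.sum(patch * filt)   # element-wise multiply, then add

The same 16 weights are reused at every patch location; that reuse is the parameter sharing of step 3.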


Fully Connected Neural Network

Input: a 2D image, flattened into a vector of pixel values.

Fully connected: each neuron in the hidden layer is connected to all neurons in the input layer.

• No spatial information
• Many, many parameters

Key idea: use the spatial structure in the input to inform the architecture of the network.
High Level Feature Detection

Let’s identify key features in each image category

Faces: nose, eyes, mouth. Cars: wheels, license plate, headlights. Houses: door, windows, steps.
2b. Convolutions and filters
The convolution operation is an element-wise multiply and add.

Filter / Kernel
Producing Feature Maps

[Figure: the same input image filtered four ways: Original, Sharpen, Edge Detect, "Strong" Edge Detect]
A simple pattern: Edges
How can we detect edges with a kernel?

[Figure: an input image, a small edge-detection filter, and the resulting output feature map (Goodfellow 2016)]
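
As a concrete sketch of edge detection with a kernel (the toy image and the (−1, 1) differencing kernel are assumptions for illustration, not the slide's exact figure):

import tensorflow as tf

# Toy image: left half dark (0), right half bright (1)
img = tf.concat([tf.zeros([1, 4, 2, 1]), tf.ones([1, 4, 2, 1])], axis=2)

# Horizontal-differencing kernel (-1, 1), shaped [h, w, in_ch, out_ch]
kern = tf.reshape(tf.constant([-1.0, 1.0]), [1, 2, 1, 1])

# Note: tf.nn.conv2d computes cross-correlation (no kernel flip)
edges = tf.nn.conv2d(img, kern, strides=1, padding='VALID')
# edges is nonzero only in the column where dark meets bright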
Simple Kernels / Filters
X or X?

Image is represented as matrix of pixel values… and computers are literal!


We want to be able to classify an X as an X even if it’s shifted, shrunk, rotated, deformed.

(Rohrer, "How do CNNs work?")


There are three approaches to edge cases in convolution; zero padding controls the output size (Goodfellow 2016):

• Same convolution: zero-pad the input so the output has the same dimensions as the input
• Valid-only convolution: produce output only where the entire kernel is contained in the input (shrinks the output)
• Full convolution: zero-pad the input so an output is produced wherever an output value contains at least one input value (expands the output)

x = tf.nn.conv2d(x, W, strides=[1,strides,strides,1],padding='SAME')

• TF convolution operator takes stride and zero fill option as parameters


• Stride is distance between kernel applications in each dimension
• Padding can be SAME or VALID
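
A short sketch of how these options change the output shape (the 28×28 input and 5×5 filters are assumptions chosen to echo LeNet):

import tensorflow as tf

x = tf.random.normal([1, 28, 28, 1])   # [batch, height, width, channels]
W = tf.random.normal([5, 5, 1, 6])     # [filter_h, filter_w, in_ch, out_ch]

same = tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')    # (1, 28, 28, 6)
valid = tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='VALID')  # (1, 24, 24, 6)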
3a. Learning Visual Features de novo

Key idea: learn a hierarchy of features directly from the data (rather than hand-engineering them)

Low-level features (edges, dark spots) → mid-level features (eyes, ears, nose) → high-level features (facial structure). (Lee et al., ICML 2009)


Key idea: re-use parameters
Convolution shares parameters
Example: a 3×3 convolution on a 5×5 image
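
To make the re-use concrete, a hedged sketch using Keras to count parameters, comparing that 3×3 convolution on a 5×5 image against a fully connected layer producing the same 3×3 output:

import tensorflow as tf

conv = tf.keras.layers.Conv2D(filters=1, kernel_size=3)
conv.build((None, 5, 5, 1))
print(conv.count_params())    # 10: nine shared weights + one bias

dense = tf.keras.layers.Dense(9)   # 3x3 output from the flattened 5x5 input
dense.build((None, 25))
print(dense.count_params())   # 234: 25*9 weights + 9 biases

Sharing the nine filter weights across all patch positions cuts the parameter count by more than 20x even at this toy scale.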
LeNet-5
• "Gradient-Based Learning Applied to Document Recognition", Y. LeCun, L. Bottou, Y. Bengio, P. Haffner; 1998
• Helped establish how we use CNNs today
• Replaced manual feature extraction

[LeCun et al., 1998]


LeNet-5

32×32×1 → conv 5×5, s=1 → 28×28×6 → avg pool f=2, s=2 → 14×14×6 → conv 5×5, s=1 → 10×10×16 → avg pool f=2, s=2 → 5×5×16 → FC 120 → FC 84 → ŷ (10 outputs)

Reminder: output size = (N + 2P − F)/stride + 1

(This slide is taken from Andrew Ng) [LeCun et al., 1998]
LeNet-5
• Only 60K parameters
• As we go deeper in the network: n_H ↓, n_W ↓, n_C ↑ (spatial dimensions shrink, the number of channels grows)
• General structure: conv → pool → conv → pool → FC → FC → output
• Different filters look at different channels
• Sigmoid and tanh nonlinearities

[LeCun et al., 1998]
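
A minimal Keras sketch of this architecture (an approximation: the original network's sparse conv2 connectivity and RBF output layer are replaced with their modern dense/softmax equivalents):

import tensorflow as tf

lenet5 = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 1)),
    tf.keras.layers.Conv2D(6, 5, strides=1, activation='tanh'),
    tf.keras.layers.AveragePooling2D(pool_size=2, strides=2),
    tf.keras.layers.Conv2D(16, 5, strides=1, activation='tanh'),
    tf.keras.layers.AveragePooling2D(pool_size=2, strides=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(120, activation='tanh'),
    tf.keras.layers.Dense(84, activation='tanh'),
    tf.keras.layers.Dense(10, activation='softmax'),
])
lenet5.summary()   # about 62K parameters, matching the ~60K on the slide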


Backpropagation of convolution

(Slide from "Forward and Backpropagation in Convolutional Neural Network", Medium)
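
In practice these gradients are computed automatically; a small sketch with tf.GradientTape (shapes are illustrative assumptions):

import tensorflow as tf

x = tf.random.normal([1, 5, 5, 1])
W = tf.Variable(tf.random.normal([3, 3, 1, 1]))

with tf.GradientTape() as tape:
    y = tf.nn.conv2d(x, W, strides=1, padding='VALID')
    loss = tf.reduce_sum(y ** 2)

# dLoss/dW has the filter's 3x3 shape; conceptually it is itself a
# convolution of the input with the upstream gradient, and the gradient
# with respect to the input is a full convolution with the flipped kernel.
grads = tape.gradient(loss, W)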
3b. Convolutional Neural Networks (CNNs)
An image classification CNN
Representation Learning in Deep CNNs

Conv Layer 1 learns low-level features (edges, dark spots); Conv Layer 2 learns mid-level features (eyes, ears, nose); Conv Layer 3 learns high-level features (facial structure). (Lee et al., ICML 2009)


CNNs for Classification

1. Convolution: apply filters to generate feature maps. (tf.keras.layers.Conv2D)
2. Non-linearity: often ReLU. (tf.keras.activations.*)
3. Pooling: downsampling operation on each feature map. (tf.keras.layers.MaxPool2D)

Train the model with image data; learn the weights of the filters in the convolutional layers.
Example – Six convolutional layers
Convolutional Layers: Local Connectivity

tf.keras.layers.Conv2D

For a neuron in the hidden layer:
• Take inputs from a patch
• Compute a weighted sum
• Apply a bias

For a 4×4 filter (a matrix of weights w_ij) and neuron (p, q) in the hidden layer:
1) apply a window of weights
2) compute the linear combination Σ_i Σ_j w_ij x_(p+i, q+j) + b
3) activate with a non-linear function
CNNs: Spatial Arrangement of Output Volume

Layer dimensions: h × w × d, where h and w are the spatial dimensions (height and width) and d (depth) is the number of filters.

Stride: the filter step size.

Receptive field: the locations in the input image that a node is path-connected to.

tf.keras.layers.Conv2D( filters=d, kernel_size=(h,w), strides=s )
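
A quick sketch of these dimensions (the filter count, input size, and stride are assumptions):

import tensorflow as tf

layer = tf.keras.layers.Conv2D(filters=16, kernel_size=(3, 3), strides=2, padding='same')
out = layer(tf.random.normal([1, 32, 32, 3]))
print(out.shape)   # (1, 16, 16, 16): stride 2 halves h and w; depth d = 16 filters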
Introducing Non-Linearity: the Rectified Linear Unit (ReLU)

- Apply after every convolution operation (i.e., after convolutional layers)
- ReLU: a pixel-by-pixel operation that replaces all negative values with zero
- A non-linear operation

tf.keras.layers.ReLU

(Karn, Intuitive CNNs)
Pooling

tf.keras.layers.MaxPool2D(
    pool_size=(2,2),
    strides=2,
)

1) Reduced dimensionality
2) Spatial invariance

Max pooling, average pooling
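
A quick sketch of both variants (the input size is an assumption):

import tensorflow as tf

x = tf.random.normal([1, 4, 4, 1])
print(tf.keras.layers.MaxPool2D(pool_size=(2, 2), strides=2)(x).shape)         # (1, 2, 2, 1)
print(tf.keras.layers.AveragePooling2D(pool_size=(2, 2), strides=2)(x).shape)  # (1, 2, 2, 1)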


The Rectified Linear Unit (ReLU) is a common non-linear detector stage after convolution.

x = tf.nn.conv2d(x, W, strides=[1, strides, strides, 1], padding='SAME')


x = tf.nn.bias_add(x, b)
x = tf.nn.relu(x)

f(x) = max(0, x)
When will we backpropagate through this? (Gradient flows only where x > 0.)
Once it "dies", what happens to it? (A unit whose input is always negative receives zero gradient and may never recover.)
Pooling reduces dimensionality by giving up spatial location
• max pooling reports the maximum output
within a defined neighborhood
• Padding can be SAME or VALID
x = tf.nn.max_pool(x, ksize=[1, k, k, 1], strides=[1, k, k, 1], padding='SAME')

Tensors are laid out as [batch, height, width, channels]; ksize sets the pooling neighborhood and strides the step size in each of those dimensions.
Dilated Convolution: the kernel is applied with gaps (a dilation rate), so the receptive field grows without adding parameters.
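
A brief sketch (layer sizes are assumptions):

import tensorflow as tf

# A 3x3 kernel with dilation_rate=2 covers a 5x5 region using only 9 weights
layer = tf.keras.layers.Conv2D(16, kernel_size=3, dilation_rate=2, padding='same')
print(layer(tf.random.normal([1, 32, 32, 3])).shape)   # (1, 32, 32, 16)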
CNNs for Classification: Feature Learning

1. Learn features in input image through convolution
2. Introduce non-linearity through activation function (real-world data is
non-linear!)
3. Reduce dimensionality and preserve spatial invariance with pooling
CNNs for Classification: Class Probabilities

- CONV and POOL layers output high-level features of input


- Fully connected layer uses these features for classifying input image
- Express output as probability of image belonging to a particular class
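
That probability is typically produced by a softmax over the final layer's outputs; a tiny sketch (the logit values are arbitrary):

import tensorflow as tf

logits = tf.constant([[2.0, 1.0, 0.1]])
probs = tf.nn.softmax(logits)   # ~[[0.66, 0.24, 0.10]], sums to 1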
Putting it all together
import tensorflow as tf

def generate_model():
    model = tf.keras.Sequential([
        # first convolutional layer
        tf.keras.layers.Conv2D(32, kernel_size=3, activation='relu'),
        tf.keras.layers.MaxPool2D(pool_size=2, strides=2),

        # second convolutional layer
        tf.keras.layers.Conv2D(64, kernel_size=3, activation='relu'),
        tf.keras.layers.MaxPool2D(pool_size=2, strides=2),

        # fully connected classifier
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(1024, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax')   # 10 output classes
    ])
    return model
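
A possible usage sketch (the 28×28×1 input shape and training configuration are assumptions, not part of the slide):

model = generate_model()
model.build(input_shape=(None, 28, 28, 1))   # e.g., grayscale digit images
model.summary()
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])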
4a. Real-world feature invariance is hard

How can computers recognize objects?

Challenge:
• Objects can be anywhere in the scene, in any orientation, rotation, color hue, etc.
• How can we overcome this challenge?
Answer:
• Learn a ton of features (millions) from the bottom up
• Learn the convolutional filters, rather than pre-computing them
Feature invariance to perturbation is hard

Detect features to classify.

(Li/Johnson/Yeung, CS231n)
