
(Fall 2024) Images and Convolutions

By: ML@B Edu Team


Outline
● What is Computer Vision?
● Representing Images
● Problems with MLPs
● Convolution Mechanics
● More Convolutions!
What is Computer Vision?
A field of computer science focused on processing, analyzing, and understanding visual data
Brief History of CV
● 1959
○ David Hubel and Torsten Wiesel started experimenting on the visual cortex of cats
○ Discovered that our visual cortex processes images by analyzing simple structures such as edges first
Evolution of CV
● Object Detection prior to 2012:

● Then came deep learning…


○ No more feature extraction by hand!
○ Use a large neural network to learn the important features
○ Deep Learning paved the way for a massive acceleration in the progress of computer vision
The Deep Learning Approach
● Convolutional Neural Networks (CNNs)
○ Deep Learning algorithm used for analyzing images
○ Invented by Yann LeCun (LeNet-5) in the 1990s
● Why the sudden explosion post-2012?
○ AlexNet (16.4% Error Rate!!)
■ Nvidia GPUs
■ More powerful computers
■ More access to data
Images as Data
How do we represent
images digitally?
Images as matrices?
Grayscale
● Pixel values range in gray levels from 0 (black) to 255 (white)
● Each pixel can take one of 256 possible values, so it takes up 8 bits
Color Images/Channels

Terminology: Each "layer" is commonly referred to as a channel. An RGB image has 3 channels.
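As a quick illustration (not from the slides), here is how an image becomes an array of numbers in Python; "cat.png" is a placeholder path, and this sketch assumes Pillow and NumPy are installed:

import numpy as np
from PIL import Image  # Pillow

# "cat.png" is a hypothetical file, not an asset from these slides.
img = Image.open("cat.png").convert("RGB")

x = np.asarray(img)                  # shape (H, W, 3): one 2D matrix per channel
gray = np.asarray(img.convert("L"))  # shape (H, W): a single grayscale channel

print(x.shape, x.dtype)  # e.g. (200, 200, 3) uint8 -- values in [0, 255]
print(gray.shape)        # e.g. (200, 200)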
Moving Past MLPs
Why are standard dense NNs (MLPs) not ideal for image classification?
Classification with NN
● Consider a 200 x 200 x 3 image
○ This is an RGB image with height and width 200 pixels
○ We can represent this by a 200 x 200 x 3 = 120,000 element vector
● How many parameters do we need for an MLP with one fully connected hidden layer of 10 units?

Fully Connected Layers: y = Wx + b

200 * 200 * 3 * 10 + 10 = 1,200,010

WAY TOO MANY!!
Each pixel is an individual input "feature" of the network. Why does this not make sense?
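We can sanity-check this count directly in PyTorch (a minimal sketch; the layer sizes come from the slide):

import torch.nn as nn

# Fully connected layer from a flattened 200 x 200 x 3 image to 10 hidden units:
# W has 10 x 120,000 entries, b has 10.
layer = nn.Linear(200 * 200 * 3, 10)

print(sum(p.numel() for p in layer.parameters()))  # 1200010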
How does image classification work intuitively? How can you convince someone that an image contains object x?
Features of an Image
Local Regions of an Image
● The basic idea is to operate on local regions of an image rather than on individual pixels or on far-away pixels
● The dense neural network model template doesn't lend itself immediately to patch recognition like this!
Same ideas prevalent in classical CV too!

Canny Edge Detector: a very famous feature extractor developed by Berkeley Prof. John Canny!

Takeaway: you can't really extract any useful information by looking at individual pixels in isolation
Extracting representations from images
Recall that deep learning is the process of extracting
hierarchical representations from an input. What
does this look like for an image?
1. Learn to detect edges, textures and colors from
raw pixels in the first layer
2. Use edges to detect simple shapes and patterns
in intermediate layers
3. Combine shapes and patterns to detect
abstract higher-level features, such as facial
shapes, in higher layers
Other desiderata
● Equivariance to translation: the same set of pixels, when translated, should have
their representations translated too
● Invariance to translation: semantic meaning does not change due to a translation
Solution: CNNs

Downsample the image and only extract what is relevant


Convolutions
High level concept
*don’t worry about implementation just yet :)
Initial Ideas
● Instead of processing an entire image at once, process patches of it instead!
○ Allows the network to pay attention to local regions of an image
● Process each patch with a layer parameterized by the same weights and bias (also called weight sharing):
○ Ensures that identical patches produce the same "representations"
○ Preserves translational equivariance and invariance
Filters + Convolutions

1 0 1
0 1 0
1 0 1
Weight Filter

Terminology!: Also
referred to as a “kernel”
Filters + Convolutions

[Figure: a 5 x 5 binary input convolved with the 3 x 3 weight filter from the previous slide, producing a 3 x 3 output]

Terminology!: Also referred to as a "kernel"
Filters (2D)
How to perform convolutions:
1. Slide the filter along the width and height by a certain amount (the stride).
2. Compute the dot product between the entries of the filter and the input at each position.

Note: There is one bias term per filter, applied after the convolution; it is just not shown in our examples.
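As a sketch of these mechanics in NumPy (the function name and signature are our own; note that deep learning "convolution" layers actually compute this unflipped cross-correlation):

import numpy as np

def conv2d(x, w, stride=1, bias=0.0):
    """Slide the F x F filter w over the 2D input x and take a dot
    product at each position; one bias per filter, added at the end."""
    (H, W), F = x.shape, w.shape[0]
    out_h = (H - F) // stride + 1
    out_w = (W - F) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i*stride:i*stride+F, j*stride:j*stride+F]
            out[i, j] = np.sum(patch * w) + bias
    return out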
Filters (2D) Practice
Convolutions
Terminology!: This resulting matrix is called an activation map

1 0 1
0 1 0
1 0 1
Weight Filter
Example
What does this do? Any ideas?

Input (6 x 6):
10 10 10 0 0 0
10 10 10 0 0 0
10 10 10 0 0 0
10 10 10 0 0 0
10 10 10 0 0 0
10 10 10 0 0 0

Filter (3 x 3):
1 0 -1
1 0 -1
1 0 -1

Sliding the filter one position at a time (stride 1, no padding) fills in a 4 x 4 output, one entry per position. The first row comes out to:
0 30 30 0
What does this filter do?

10 10 10 0 0 0
10 10 10 0 0 0                     0 30 30 0
10 10 10 0 0 0       1 0 -1       0 30 30 0
10 10 10 0 0 0   *   1 0 -1   =   0 30 30 0
10 10 10 0 0 0       1 0 -1       0 30 30 0
10 10 10 0 0 0

Vertical Edge Detection: the output is high exactly along the boundary where the bright half of the image meets the dark half.
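You can reproduce this exact example with PyTorch, whose conv2d also computes the unflipped cross-correlation shown on the slide; a minimal sketch:

import torch
import torch.nn.functional as F

x = torch.tensor([[10., 10., 10., 0., 0., 0.]] * 6)  # bright left, dark right
k = torch.tensor([[1., 0., -1.]] * 3)                # vertical edge detector

# F.conv2d expects (batch, channels, height, width) for input and weight.
out = F.conv2d(x.view(1, 1, 6, 6), k.view(1, 1, 3, 3))
print(out.view(4, 4))  # every row is [0., 30., 30., 0.]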
Convolutions Concept Check
1) What does a horizontal edge detector look like?

2) What is the output of the same input with a horizontal edge detector?

3) What does this ^ tell us about the output of some convolution-based "<insert shape here> detector"?
Convolutions Concept Check
1) What does a horizontal edge detector look like?
Similar, but rotated 90 degrees:
 1  1  1
 0  0  0
-1 -1 -1
2) What is the output of the same input with a horizontal edge detector?
The output is all zeros
3) What does this ^ tell us about the output of some convolution-based "<insert shape here> detector"?
The output of convolving the kernel at any location is high when the feature the kernel was designed to detect (or something similar) is present, and low when it isn't. In this case, there were no horizontal lines, so our horizontal line kernel output zero everywhere
Purpose of Convolutions
● Different filters can be used to extract various features of an image (e.g. edges) or to apply effects such as blurring
Some Classical Ideas: Edge Detection
● Edges and shapes are important!
● John Canny (Berkeley prof) developed a very good edge detector
● Based on discrete image gradients
Some Classical Ideas: HOG
● Histogram of Oriented Gradients
● Uses multiple gradient orientations
● Feeds the resulting histograms to a classifier (e.g. an SVM)
Where does the "Deep Learning" part come in?
Just like the weights of dense fully-connected layers, these filters can simply be learned!
Filters (3D)
● Steps:
○ Compute the dot product for each channel (same as 2D)
○ Sum the results over the channels
● Note: The depth of the filter is always the same as the depth of the input image

⚠ W1 and W2 are distinct 4 x 4 x 3 filters
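Extending the 2D sketch from earlier, a hedged illustration of the 3D case (per-channel dot products summed into one 2D activation map; the names and sizes are our own):

import numpy as np

def conv_multichannel(x, w, bias=0.0):
    """x: (H, W, C) input, w: (F, F, C) filter whose depth matches C.
    Dot products are computed per channel and summed over channels."""
    (H, W, C), F = x.shape, w.shape[0]
    out = np.zeros((H - F + 1, W - F + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i+F, j:j+F, :] * w) + bias
    return out

# Two distinct 4 x 4 x 3 filters (like W1 and W2) give two activation maps.
x = np.random.rand(6, 6, 3)
w1, w2 = np.random.rand(4, 4, 3), np.random.rand(4, 4, 3)
print(conv_multichannel(x, w1).shape)  # (3, 3), one map per filter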
Convolutional Layers
Terminology!: The activation map is also referred to as a "feature map"
Padding
Convolving an image with a filter results in a block with a smaller height and width —
what if we want the height and width as before?
Same vs Valid Padding
● Same padding: padding with 0s (or
possibly some other constant value)
to preserve the spatial dimensions of
the output
● Valid padding: no padding
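Recent PyTorch versions accept both modes by name for stride-1 convolutions; a quick shape check (the sizes here are arbitrary):

import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)

same = nn.Conv2d(3, 8, kernel_size=3, padding="same")    # zero-pads the border
valid = nn.Conv2d(3, 8, kernel_size=3, padding="valid")  # no padding at all

print(same(x).shape)   # torch.Size([1, 8, 32, 32]) -- spatial size preserved
print(valid(x).shape)  # torch.Size([1, 8, 30, 30]) -- shrinks by F - 1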
Stride
The number of pixels to slide the filter by (both horizontally and vertically):
1. A stride of 1 shifts the filter by 1 pixel at a time
2. A stride of 2 shifts the filter by 2 pixels at a time
Output Dimensions
● floor[(W − F + 2P) / S] + 1
○ W: Input Dimension
○ F: Kernel / Filter Size
○ P: Padding size
○ S: Stride
○ floor is the floor function (round down), which is the convention PyTorch uses
■ floor[3.5] = 3, floor[4] = 4
● For the figure on the right:
○ Assume no padding and a stride of 1
○ W' = floor[(6 - 3 + 2 * 0) / 1] + 1 = 4
○ H' = floor[(6 - 3 + 2 * 0) / 1] + 1 = 4
Backprop with Convolutions
● Like the regular backprop algorithm (see previous lectures)
● Derivative of the error w.r.t. a particular weight = sum of the derivatives at each output position that used that weight

[Figure: input values, derivatives of the output feature map, and the calculations for the derivative of each weight]
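In index notation, for a stride-1, no-padding convolution with output y_{i,j} = Σ_{a,b} w_{a,b} · x_{i+a, j+b}, the chain rule gives (our sketch, matching the bullet above):

∂L/∂w_{a,b} = Σ_{i,j} (∂L/∂y_{i,j}) · x_{i+a, j+b}

i.e. each weight's gradient is the sum, over all output positions, of the upstream derivative times the input value that weight multiplied there.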
Defining a Convolutional Layer in PyTorch

https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html
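A minimal usage sketch based on that documentation page (the channel counts and sizes are arbitrary examples):

import torch
import torch.nn as nn

# 16 learnable 3 x 3 filters over a 3-channel input; padding of 1
# preserves the spatial dimensions at stride 1.
conv = nn.Conv2d(in_channels=3, out_channels=16,
                 kernel_size=3, stride=1, padding=1)

x = torch.randn(8, 3, 200, 200)  # a batch of 8 RGB images
print(conv(x).shape)             # torch.Size([8, 16, 200, 200])
print(conv.weight.shape)         # torch.Size([16, 3, 3, 3])
print(conv.bias.shape)           # torch.Size([16]) -- one bias per filter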
Other Operations
Pooling Layers
● Reduces output size
● Applied to each channel independently
● Neighboring features may be similar
○ Doesn’t remove too much information
● Max pooling takes the max
● Average pooling takes the average
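A tiny numeric sketch of both kinds of pooling on a single 2 x 2 window (the values are chosen arbitrarily):

import torch
import torch.nn as nn

x = torch.tensor([[1., 2.],
                  [3., 7.]]).view(1, 1, 2, 2)  # (batch, channel, H, W)

print(nn.MaxPool2d(kernel_size=2)(x))  # tensor([[[[7.]]]])
print(nn.AvgPool2d(kernel_size=2)(x))  # tensor([[[[3.2500]]]])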
Pooling Layers Concept Check
1) In max pooling, what are the partial derivatives of the top right output with respect to the 2x2 sub-grid of inputs in the top right corner?

0 0
0 1

The only information that flows into the next layer is from that 7, so only it will receive any gradient flow during backprop
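Autograd confirms this; a sketch where the window's max is the 7 from the slide and the other three values are made up:

import torch
import torch.nn as nn

x = torch.tensor([[3., 5.],
                  [1., 7.]], requires_grad=True)  # 7 is the max

out = nn.MaxPool2d(kernel_size=2)(x.view(1, 1, 2, 2))
out.backward()  # out has a single element, so no grad argument is needed

print(x.grad)
# tensor([[0., 0.],
#         [0., 1.]]) -- only the max entry receives gradient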
Transposed Convolution
● Note: This is just for culture, just look at the high level
● With most convolutions, the output ends up smaller than the input if we don't pad
● What if we want to increase the output size?
○ Say you want to do a task like super-resolution where the output of the
model is larger than the input size
● Boils down to dilating the input feature map and running a
convolution on it
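For a high-level feel, a shape-only sketch with PyTorch's ConvTranspose2d (channel counts are arbitrary):

import torch
import torch.nn as nn

x = torch.randn(1, 8, 16, 16)

# A stride-2 transposed convolution roughly doubles height and width,
# which is why it shows up in tasks like super-resolution.
up = nn.ConvTranspose2d(in_channels=8, out_channels=4,
                        kernel_size=2, stride=2)
print(up(x).shape)  # torch.Size([1, 4, 32, 32])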
Recap
Intuition
● In an image, local regions provide more information than individual pixels
● A single convolution filter can extract features such as edges on its own, and you can apply multiple filters to create multiple feature maps
○ Fewer parameters than a linear layer, allowing us to extract features efficiently
○ Hyperparameters include stride, padding, and kernel size
● The convolution operation amounts to a "sliding window", and is spatially invariant (the same weights are applied at every location)
Tools for your toolkit:
● Conv2D layer
● Pooling layers
● Transposed Convolution
Lecture Attendance

http://tinyurl.com/fa24-dl4cv
Contributors
● Jake Austin
● Aryan Jain
● Val Rotan
● Past ML@B Edu members
