03 PL, Activation, BackProp, CNN
[Figure: perceptron with inputs x1 … x4, weighted sum, and activation]
More: https://en.wikipedia.org/wiki/Perceptron
Activation Functions
[Plots: Step(x) and Sigmoid(x) activation functions]
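The activations named above (plus the softmax used on the next slides) are one-liners in NumPy; the following is an illustrative sketch, not code from the slides:

```python
import numpy as np

# Illustrative definitions of the activation functions above (not code from the slides).
def step(x):
    return (x >= 0).astype(float)            # hard threshold at 0

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))          # squashes any input into (0, 1)

def softmax(g):
    e = np.exp(g - np.max(g))                # subtract the max for numerical stability
    return e / e.sum()                       # positive entries that sum to 1

print(step(np.array([-2.0, 0.5])))           # [0. 1.]
print(sigmoid(np.array([0.0])))              # [0.5]
print(softmax(np.array([1.0, 2.0, 3.0])))    # ~[0.09 0.24 0.67]
```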
Linear Softmax
$x_i = [x_{i1}\; x_{i2}\; x_{i3}\; x_{i4}]$   $y_i = [1\; 0\; 0]$   $\hat{y}_i = [f_c\; f_d\; f_b]$

Scores: $g = w x^T + b^T$, with one bias per class, $b = [b_c\; b_d\; b_b]$.

$f_c = e^{g_c} / (e^{g_c} + e^{g_d} + e^{g_b})$
$f_d = e^{g_d} / (e^{g_c} + e^{g_d} + e^{g_b})$
$f_b = e^{g_b} / (e^{g_c} + e^{g_d} + e^{g_b})$

Compactly: $f = \mathrm{softmax}(g) = \mathrm{softmax}(w x^T + b^T)$
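In code, this linear softmax classifier is only a few lines; a minimal NumPy sketch with made-up weights and a single 4-feature input, as in $x_i$ above:

```python
import numpy as np

# Minimal linear-softmax sketch: 4 input features, 3 classes (cat, dog, bear); weights are made up.
rng = np.random.default_rng(0)
w = rng.normal(size=(3, 4))           # one weight row per class
b = rng.normal(size=(3,))             # one bias per class

def softmax(g):
    e = np.exp(g - g.max())
    return e / e.sum()

x = np.array([0.2, 1.0, -0.5, 0.3])   # a single example x_i
f = softmax(w @ x + b)                # predicted probabilities [f_c, f_d, f_b]; they sum to 1
print(f)
```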
Two-layer MLP + Softmax
$x_i = [x_{i1}\; x_{i2}\; x_{i3}\; x_{i4}]$   $y_i = [1\; 0\; 0]$   $\hat{y}_i = [f_c\; f_d\; f_b]$

$a_1 = \mathrm{sigmoid}(w_{[1]} x^T + b_{[1]}^T)$
$f = \mathrm{softmax}(w_{[2]} a_1 + b_{[2]}^T)$
N-layer MLP + Softmax
$x_i = [x_{i1}\; x_{i2}\; x_{i3}\; x_{i4}]$   $y_i = [1\; 0\; 0]$   $\hat{y}_i = [f_c\; f_d\; f_b]$

$a_1 = \mathrm{sigmoid}(w_{[1]} x^T + b_{[1]}^T)$
$a_2 = \mathrm{sigmoid}(w_{[2]} a_1 + b_{[2]}^T)$
…
$a_k = \mathrm{sigmoid}(w_{[k]} a_{k-1} + b_{[k]}^T)$
…
$f = \mathrm{softmax}(w_{[n]} a_{n-1} + b_{[n]}^T)$
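The n-layer forward pass above can be written as a short loop; a sketch with made-up layer sizes, not code from the lecture:

```python
import numpy as np

# Forward pass for an n-layer MLP with sigmoid hidden layers and a softmax output.
# Layer sizes are made up: 4 inputs -> 8 -> 8 -> 3 classes.
rng = np.random.default_rng(1)
sizes = [4, 8, 8, 3]
ws = [rng.normal(size=(m, n)) * 0.1 for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
softmax = lambda g: np.exp(g - g.max()) / np.exp(g - g.max()).sum()

x = np.array([0.2, 1.0, -0.5, 0.3])
a = x
for w, b in zip(ws[:-1], bs[:-1]):
    a = sigmoid(w @ a + b)             # hidden activations a_1 ... a_{n-1}
f = softmax(ws[-1] @ a + bs[-1])       # output probabilities
print(f)
```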
How to train the parameters?
$x_i = [x_{i1}\; x_{i2}\; x_{i3}\; x_{i4}]$   $y_i = [1\; 0\; 0]$   $\hat{y}_i = [f_c\; f_d\; f_b]$

$a_1 = \mathrm{sigmoid}(w_{[1]} x^T + b_{[1]}^T)$
$a_2 = \mathrm{sigmoid}(w_{[2]} a_1 + b_{[2]}^T)$
…
$a_k = \mathrm{sigmoid}(w_{[k]} a_{k-1} + b_{[k]}^T)$
…
$f = \mathrm{softmax}(w_{[n]} a_{n-1} + b_{[n]}^T)$,   $l = \mathrm{loss}(f, y)$

We can still use SGD. For that we need the gradients $\partial l / \partial w_{[k]ij}$ and $\partial l / \partial b_{[k]i}$ for every layer $k$.
Backpropagation – repeated application of the chain rule
Two-layer Neural Network – Forward Pass
Two-layer Neural Network – Backward Pass
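A minimal NumPy sketch of the forward and backward pass for such a two-layer network (assumed sizes: 4 inputs, 5 hidden units, 3 classes; cross-entropy loss on a one-hot target; not code from the slides):

```python
import numpy as np

# Two-layer network: a1 = sigmoid(W1 x + b1), f = softmax(W2 a1 + b2); loss = -log f[label].
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 4)) * 0.1, np.zeros(5)
W2, b2 = rng.normal(size=(3, 5)) * 0.1, np.zeros(3)
x = np.array([0.2, 1.0, -0.5, 0.3])
y = np.array([1.0, 0.0, 0.0])                 # one-hot target, e.g. "cat"

# Forward pass
a1 = 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))     # hidden activations
g2 = W2 @ a1 + b2
f = np.exp(g2 - g2.max()); f /= f.sum()       # softmax output
loss = -np.log(f[y.argmax()])                 # cross-entropy loss

# Backward pass: apply the chain rule layer by layer, reusing forward-pass values
dg2 = f - y                                   # dL/dg2 for softmax + cross-entropy
dW2, db2 = np.outer(dg2, a1), dg2
da1 = W2.T @ dg2                              # gradient flowing back into the hidden layer
dz1 = da1 * a1 * (1.0 - a1)                   # multiply by the sigmoid derivative
dW1, db1 = np.outer(dz1, x), dz1
print(round(loss, 3), dW1.shape, dW2.shape)
```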
Basic building blocks of the CNN architecture
• Input layer
• Convolutional layer
• Fully connected layer
• Loss layer

A convolutional layer is itself built from:
• Convolutional kernel
• Pooling layer
• Non-linearity
Convolution operation
The same pattern appears in different places in the image, so the detectors for it can be compressed (shared): we train small detectors and let each one "move around" the image instead of learning a separate copy for every location.
[Figure: an "upper-left beak" detector and a "middle beak" detector can share the same weights]
Convolution vs. Fully Connected

6×6 image:
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0

Filter 1 (3×3):        Filter 2 (3×3):
 1 -1 -1               -1  1 -1
-1  1 -1               -1  1 -1
-1 -1  1               -1  1 -1

Convolution: each filter connects to only a 3×3 patch of the image at a time and reuses the same 9 weights at every position. Fully connected: the image is flattened into x1 … x36 and every output unit gets its own weight for all 36 inputs.
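A sketch of sliding Filter 1 from the slide over the 6×6 image (stride 1, no padding):

```python
import numpy as np

# Convolve (correlate) the 6x6 image with the 3x3 Filter 1 from the slide; stride 1, no padding.
image = np.array([[1, 0, 0, 0, 0, 1],
                  [0, 1, 0, 0, 1, 0],
                  [0, 0, 1, 1, 0, 0],
                  [1, 0, 0, 0, 1, 0],
                  [0, 1, 0, 0, 1, 0],
                  [0, 0, 1, 0, 1, 0]], dtype=float)
filt = np.array([[ 1, -1, -1],
                 [-1,  1, -1],
                 [-1, -1,  1]], dtype=float)

out = np.zeros((4, 4))                     # output size: (6 - 3) / 1 + 1 = 4
for i in range(4):
    for j in range(4):
        patch = image[i:i + 3, j:j + 3]
        out[i, j] = np.sum(patch * filt)   # the same 9 weights are reused at every position
print(out)                                 # large values where the diagonal pattern appears
```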
Convolutional Layer (with 4 filters)
Weights: 4×1×9×9
Input: 1×224×224 → Output: 4×224×224 (with zero padding, stride = 1)
Convolutional Layer (with 4 filters)
Weights: 4×1×9×9
Input: 1×224×224 → Output: 4×112×112 (with zero padding, but stride = 2)
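These output sizes follow the usual convolution size formula (a standard identity, not stated on the slides); with kernel size k = 9 and an assumed zero padding of p = 4 on each side:

$$\text{out} = \left\lfloor \frac{\text{in} + 2p - k}{s} \right\rfloor + 1, \qquad \left\lfloor \frac{224 + 8 - 9}{1} \right\rfloor + 1 = 224, \qquad \left\lfloor \frac{224 + 8 - 9}{2} \right\rfloor + 1 = 112.$$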
Color image: RGB 3 channels – convolution over depth
[Figure: for an RGB image each filter is 3×3×3, extending through all three color channels; Filter 1, Filter 2, … slide over the stacked channels of the 6×6 color image]
Different types of convolution
Parameters: kernel size, stride, padding, and how the kernel extends along the depth (channel) dimension.

Depthwise separable convolution – step 2
[Figure: step 2 of a depthwise separable convolution – a one-to-many mapping from each position to the output channels]
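A rough parameter count shows why the depthwise separable factorization is attractive (standard arithmetic; the channel numbers below are made up, not from the slide):

```python
# Standard 3x3 convolution from 64 to 128 channels vs. its depthwise separable version.
k, c_in, c_out = 3, 64, 128
standard  = k * k * c_in * c_out        # every output channel looks at every input channel: 73,728 weights
depthwise = k * k * c_in                # step 1: one 3x3 filter per input channel
pointwise = c_in * c_out                # step 2: 1x1 convolution mixing the channels
print(standard, depthwise + pointwise)  # 73728 vs 8768, roughly 8x fewer parameters
```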
The whole CNN
Input image → Convolution → Max Pooling → Convolution → Max Pooling → … (convolution and pooling can repeat many times) → Flatten → Fully connected feedforward network → output (cat, dog, …)
Pooling
• Down-sample the image – keeps the number of parameters of the CNN model under control
Why Pooling
• Subsampling pixels will not change the object
[Figure: a subsampled bird image is still recognizably a bird]
Max Pooling
[Figure: each 2×2 block of the filtered feature map is replaced by its maximum; the pooled maps are then flattened and passed to a fully connected feedforward network]
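A minimal 2×2 max-pooling sketch in NumPy (the feature-map values are placeholders, not the ones from the figure):

```python
import numpy as np

# Replace each non-overlapping 2x2 block of a feature map by its maximum.
def max_pool_2x2(fm):
    h, w = fm.shape
    return fm.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fm = np.arange(16, dtype=float).reshape(4, 4)   # stand-in 4x4 feature map
print(max_pool_2x2(fm))                         # [[ 5.  7.] [13. 15.]]
```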
Conv Net Topology
• 5 convolutional layers
• 3 fully connected layers + softmax
• 650K neurons, 60 million weights
Why do we need a deep CNN?
Courtesy: ICRI
Suggested reading
Summary: Image Features
• The idea of low, mid, and high level features
• Largely replaced by neural networks
• But there is a direct connection between that feature hierarchy and the features learned by the layers of a deep network
Supervised Learning vs Unsupervised Learning
Supervised (classification): every image $x$ comes with a label $y$ (cat, dog, bear), and we learn the mapping $x \rightarrow y$.
Unsupervised (clustering): we only observe the images $x$ and group similar ones together without labels.
[Figure: the same cat/dog/bear images shown with labels (classification) and without labels (clustering)]
Supervised Learning – k-Nearest Neighbors
With k = 3, a new image is classified by finding its 3 nearest labeled neighbors in feature space and taking a majority vote.
[Figure: a query whose 3 nearest neighbors are cat, cat, dog is classified as cat]
[Figure: a query whose 3 nearest neighbors are bear, dog, dog is classified as dog]
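A minimal k-NN sketch (k = 3) with made-up 2-D feature vectors standing in for image features:

```python
import numpy as np
from collections import Counter

# Toy training set: 2-D feature vectors with animal labels (all values are made up).
train_x = np.array([[0.9, 0.8], [1.0, 0.9], [0.2, 0.1], [0.1, 0.3], [0.5, 0.9], [0.4, 0.8]])
train_y = ["cat", "cat", "bear", "bear", "dog", "dog"]

def knn_predict(query, k=3):
    dists = np.linalg.norm(train_x - query, axis=1)   # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]                   # indices of the k closest points
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]                 # majority vote among the k neighbors

print(knn_predict(np.array([0.8, 0.85])))             # "cat"
```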
Supervised Learning – k-Nearest Neighbors
• How do we choose the right k?
• How do we choose the right features?
• How do we choose the right distance metric?
These choices are tuned on held-out data:

Training, Validation (Dev), Test Sets
Training Set | Validation Set | Testing Set
The test set is only to be used for evaluating the model at the very end of development. Any change made to the model after running it on the test set could be influenced by what you saw on the test set, which would invalidate any further evaluation.
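A sketch of such a three-way split (the 70/15/15 proportions are an assumption, not from the slides):

```python
import numpy as np

# Shuffle example indices and carve out training, validation, and test portions.
rng = np.random.default_rng(0)
n = 1000
idx = rng.permutation(n)
n_train, n_val = int(0.70 * n), int(0.15 * n)
train_idx = idx[:n_train]                            # fit the model
val_idx   = idx[n_train:n_train + n_val]             # pick k, features, distance metric, ...
test_idx  = idx[n_train + n_val:]                    # touched only once, at the very end
print(len(train_idx), len(val_idx), len(test_idx))   # 700 150 150
```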
Unsupervised Learning – k-means clustering (k = 3)
1. Initially assign all images to a random cluster.
2. Compute the mean image (in feature space) for each cluster.
3. Reassign images to clusters based on similarity to the cluster means.
4. Keep repeating steps 2–3 until convergence.
Unsupervised Learning – k-means clustering
• How do we choose the right k?
• How do we choose the right features?
• How do we choose the right distance metric?
• How sensitive is this method to the random initial assignment of clusters?
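A compact sketch of the loop described in steps 1–4, using made-up 2-D feature vectors in place of images (it assumes no cluster ever becomes empty):

```python
import numpy as np

rng = np.random.default_rng(0)
# 90 points drawn around three different centers stand in for image features.
X = rng.normal(size=(90, 2)) + np.repeat(np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]]), 30, axis=0)
k = 3
assign = rng.integers(k, size=len(X))                                    # step 1: random initial assignment

for _ in range(100):
    means = np.array([X[assign == c].mean(axis=0) for c in range(k)])    # step 2: cluster means
    dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
    new_assign = dists.argmin(axis=1)                                    # step 3: reassign to the nearest mean
    if np.array_equal(new_assign, assign):                               # step 4: repeat until nothing changes
        break
    assign = new_assign
print(means.round(2))                                                    # roughly recovers the three centers
```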
Supervised Learning - Classification
Training Data | Test Data
[Figure: labeled training images (cat, dog, bear, …) and unlabeled test images to be classified]
Supervised Learning - Classification
Training Data
x1 = [ ]   y1 = [cat]
x2 = [ ]   y2 = [dog]
x3 = [ ]   y3 = [cat]
…
xn = [ ]   yn = [bear]
Supervised Learning - Classification
Training Data: inputs, targets / labels / ground truth, predictions.
We need to find a function that maps x to y for every example.
$x_1 = [x_{11}\; x_{12}\; x_{13}\; x_{14}]$   $y_1 = 1$   $\hat{y}_1 = 1$
…
Supervised Learning – Linear Softmax
Training Data: inputs, targets / labels / ground truth, predictions.
$x_1 = [x_{11}\; x_{12}\; x_{13}\; x_{14}]$   $y_1 = [1\; 0\; 0]$   $\hat{y}_1 = [0.85\; 0.10\; 0.05]$
…
Supervised Learning – Linear Softmax
$f_c = e^{g_c} / (e^{g_c} + e^{g_d} + e^{g_b})$
$f_d = e^{g_d} / (e^{g_c} + e^{g_d} + e^{g_b})$
$f_b = e^{g_b} / (e^{g_c} + e^{g_d} + e^{g_b})$
How do we find a good w and b?
$x_i = [x_{i1}\; x_{i2}\; x_{i3}\; x_{i4}]$   $y_i = [1\; 0\; 0]$   $\hat{y}_i = [f_c(w,b)\; f_d(w,b)\; f_b(w,b)]$

$$L(w, b) = \sum_{i=1}^{n} \sum_{j=1}^{3} -y_{i,j} \log(\hat{y}_{i,j}) = \sum_{i=1}^{n} -\log(\hat{y}_{i,\mathrm{label}}) = \sum_{i=1}^{n} -\log f_{i,\mathrm{label}}(w, b)$$
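Numerically, the double sum and the per-label form give the same value; a small check with made-up predictions:

```python
import numpy as np

# Two training examples with one-hot targets and softmax outputs (values are made up).
y_true = np.array([[1, 0, 0],
                   [0, 1, 0]])
y_pred = np.array([[0.85, 0.10, 0.05],
                   [0.20, 0.70, 0.10]])

loss_full  = -(y_true * np.log(y_pred)).sum()                              # double sum over i and j
loss_label = -np.log(y_pred[np.arange(2), y_true.argmax(axis=1)]).sum()    # -log of the correct-class probability
print(round(loss_full, 3), round(loss_label, 3))                           # both ~0.519
```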
Idea of entropy and cross-entropy
Gradient Descent (GD) (idea)
1. Start with a random value of w (e.g. w = 12).
2. Compute the gradient (derivative) of L(w) at that point (e.g. dL/dw = 6).
3. Recompute w as w = w − λ · dL/dw and go back to step 2 (in the example, w moves from 12 to 10).
[Figure: the loss curve L(w), with w stepping downhill from w = 12 toward the minimum]
(mini-batch) Stochastic Gradient Descent (SGD)

$l(w, b) = \sum_{i \in B} -\log f_{i,\mathrm{label}}(w, b)$,   learning rate $\lambda = 0.01$

Initialize w and b randomly
for e = 0, num_epochs do
    for B = 0, num_batches do      // B is the current mini-batch
        Compute: dl(w, b)/dw and dl(w, b)/db
        Update w: w = w − λ dl(w, b)/dw
        Update b: b = b − λ dl(w, b)/db
        Print: l(w, b)             // useful to check that the loss is getting smaller
    end
end
Source: Andrew Ng
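A runnable NumPy version of this loop for the linear softmax model, on synthetic data (all sizes, the data, and the number of epochs are made up for illustration):

```python
import numpy as np

# Mini-batch SGD for f = softmax(w x^T + b) with cross-entropy loss, following the pseudocode above.
rng = np.random.default_rng(0)
labels = rng.integers(3, size=300)
X = rng.normal(size=(300, 4)) + labels[:, None]   # class-dependent shift so the problem is learnable
Y = np.eye(3)[labels]                             # one-hot targets

w, b = rng.normal(size=(3, 4)) * 0.01, np.zeros(3)
lam, batch_size = 0.01, 32                        # learning rate λ = 0.01 as on the slide

def softmax(g):
    e = np.exp(g - g.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

for epoch in range(20):
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]     # the current mini-batch B
        xb, yb = X[idx], Y[idx]
        grad_g = softmax(xb @ w.T + b) - yb       # dl/dg for softmax + cross-entropy
        w -= lam * (grad_g.T @ xb)                # dl/dw
        b -= lam * grad_g.sum(axis=0)             # dl/db
    loss = -np.log(softmax(X @ w.T + b)[np.arange(len(X)), labels]).sum()
    print(epoch, round(loss, 2))                  # the loss should trend downward
```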