03 PL, Activation, BackProp, CNN

Perceptron Model

Frank Rosenblatt (1957) - Cornell University


[Figure: perceptron with inputs x_1 ... x_4, weights w_1 ... w_4, a summation node, and an activation function.]

f(x) = 1, if Σ_i w_i x_i + b > 0
       0, otherwise

More: https://en.wikipedia.org/wiki/Perceptron
Activation Functions
• Step(x)
• Sigmoid(x)
• Tanh(x)
• ReLU(x) = max(0, x)
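A minimal NumPy sketch of these four activation functions (the helpers are my own, not from the slides):

```python
import numpy as np

def step(x):
    # Heaviside step: 1 where x > 0, else 0
    return (x > 0).astype(float)

def sigmoid(x):
    # Logistic function, squashes inputs to (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Hyperbolic tangent, squashes inputs to (-1, 1)
    return np.tanh(x)

def relu(x):
    # Rectified linear unit: max(0, x) elementwise
    return np.maximum(0.0, x)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(step(z), sigmoid(z), tanh(z), relu(z), sep="\n")
```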


Two-layer Multi-layer Perceptron (MLP)
[Figure: inputs x_1 ... x_4 feed a "hidden" layer of activations a_1 ... a_4, whose weighted sum produces an output ŷ_1 that is compared with the target y_1 by a loss / criterion.]
Linear Softmax
x_i = [x_i1  x_i2  x_i3  x_i4]    y_i = [1 0 0]    ŷ_i = [f_c  f_d  f_b]

g_c = w_c1 x_i1 + w_c2 x_i2 + w_c3 x_i3 + w_c4 x_i4 + b_c
g_d = w_d1 x_i1 + w_d2 x_i2 + w_d3 x_i3 + w_d4 x_i4 + b_d
g_b = w_b1 x_i1 + w_b2 x_i2 + w_b3 x_i3 + w_b4 x_i4 + b_b

f_c = e^{g_c} / (e^{g_c} + e^{g_d} + e^{g_b})
f_d = e^{g_d} / (e^{g_c} + e^{g_d} + e^{g_b})
f_b = e^{g_b} / (e^{g_c} + e^{g_d} + e^{g_b})
Linear Softmax
x_i = [x_i1  x_i2  x_i3  x_i4]    y_i = [1 0 0]    ŷ_i = [f_c  f_d  f_b]

    | w_c1  w_c2  w_c3  w_c4 |
w = | w_d1  w_d2  w_d3  w_d4 |        b = [b_c  b_d  b_b]
    | w_b1  w_b2  w_b3  w_b4 |

g_c = w_c1 x_i1 + w_c2 x_i2 + w_c3 x_i3 + w_c4 x_i4 + b_c
g_d = w_d1 x_i1 + w_d2 x_i2 + w_d3 x_i3 + w_d4 x_i4 + b_d
g_b = w_b1 x_i1 + w_b2 x_i2 + w_b3 x_i3 + w_b4 x_i4 + b_b

f_c = e^{g_c} / (e^{g_c} + e^{g_d} + e^{g_b})
f_d = e^{g_d} / (e^{g_c} + e^{g_d} + e^{g_b})
f_b = e^{g_b} / (e^{g_c} + e^{g_d} + e^{g_b})
Linear Softmax
x_i = [x_i1  x_i2  x_i3  x_i4]    y_i = [1 0 0]    ŷ_i = [f_c  f_d  f_b]

    | w_c1  w_c2  w_c3  w_c4 |
w = | w_d1  w_d2  w_d3  w_d4 |        b = [b_c  b_d  b_b]
    | w_b1  w_b2  w_b3  w_b4 |

g = w x^T + b^T

f_c = e^{g_c} / (e^{g_c} + e^{g_d} + e^{g_b})
f_d = e^{g_d} / (e^{g_c} + e^{g_d} + e^{g_b})
f_b = e^{g_b} / (e^{g_c} + e^{g_d} + e^{g_b})
Linear Softmax
x_i = [x_i1  x_i2  x_i3  x_i4]    y_i = [1 0 0]    ŷ_i = [f_c  f_d  f_b]

    | w_c1  w_c2  w_c3  w_c4 |
w = | w_d1  w_d2  w_d3  w_d4 |        b = [b_c  b_d  b_b]
    | w_b1  w_b2  w_b3  w_b4 |

g = w x^T + b^T
f = softmax(g)
Linear Softmax
x_i = [x_i1  x_i2  x_i3  x_i4]    y_i = [1 0 0]    ŷ_i = [f_c  f_d  f_b]

f = softmax(w x^T + b^T)
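As a concrete illustration, a minimal NumPy sketch of this linear + softmax model for the 4-feature, 3-class example (all variable names and values are illustrative):

```python
import numpy as np

def softmax(g):
    # Subtract the max for numerical stability before exponentiating
    e = np.exp(g - np.max(g))
    return e / e.sum()

# Toy sizes from the slides: 4 input features, 3 classes (cat, dog, bear)
rng = np.random.default_rng(0)
w = rng.normal(size=(3, 4))   # one row of weights per class
b = rng.normal(size=3)        # one bias per class

x = np.array([0.2, -1.0, 0.5, 3.0])   # a single input x_i
g = w @ x + b                          # scores g = w x^T + b^T
f = softmax(g)                         # predicted probabilities ŷ_i
print(f, f.sum())                      # probabilities sum to 1
```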
Two-layer MLP + Softmax
x_i = [x_i1  x_i2  x_i3  x_i4]    y_i = [1 0 0]    ŷ_i = [f_c  f_d  f_b]

a_1 = sigmoid(w_[1] x^T + b_[1]^T)
f   = softmax(w_[2] a_1^T + b_[2]^T)
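A minimal NumPy sketch of this two-layer forward pass, assuming 4 inputs, 4 sigmoid hidden units, and 3 softmax outputs (sizes and names are my own choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(g):
    e = np.exp(g - np.max(g))
    return e / e.sum()

rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=(4, 4)), np.zeros(4)   # hidden layer: 4 units over 4 inputs
w2, b2 = rng.normal(size=(3, 4)), np.zeros(3)   # output layer: 3 classes over 4 hidden units

x = np.array([0.2, -1.0, 0.5, 3.0])
a1 = sigmoid(w1 @ x + b1)    # hidden activations a_1
f  = softmax(w2 @ a1 + b2)   # class probabilities ŷ_i
print(f)
```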
N-layer MLP + Softmax
x_i = [x_i1  x_i2  x_i3  x_i4]    y_i = [1 0 0]    ŷ_i = [f_c  f_d  f_b]

a_1 = sigmoid(w_[1] x^T + b_[1]^T)
a_2 = sigmoid(w_[2] a_1^T + b_[2]^T)
…
a_k = sigmoid(w_[k] a_{k-1}^T + b_[k]^T)
…
f   = softmax(w_[n] a_{n-1}^T + b_[n]^T)
How to train the parameters?
x_i = [x_i1  x_i2  x_i3  x_i4]    y_i = [1 0 0]    ŷ_i = [f_c  f_d  f_b]

a_1 = sigmoid(w_[1] x^T + b_[1]^T)
a_2 = sigmoid(w_[2] a_1^T + b_[2]^T)
…
a_k = sigmoid(w_[k] a_{k-1}^T + b_[k]^T)
…
f   = softmax(w_[n] a_{n-1}^T + b_[n]^T)
How to train the parameters?
x_i = [x_i1  x_i2  x_i3  x_i4]    y_i = [1 0 0]    ŷ_i = [f_c  f_d  f_b]

a_1 = sigmoid(w_[1] x^T + b_[1]^T)
a_2 = sigmoid(w_[2] a_1^T + b_[2]^T)
…
a_k = sigmoid(w_[k] a_{k-1}^T + b_[k]^T)
…
f   = softmax(w_[n] a_{n-1}^T + b_[n]^T)

l = loss(f, y)

We can still use SGD, but we need the gradients:
∂l/∂w_[k]ij  and  ∂l/∂b_[k]i
Backpropagation – repeated application of the chain rule
Two-layer Neural Network – Forward Pass
Two-layer Neural Network – Backward Pass
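Since the forward/backward-pass figures are not reproduced here, the following is a hedged NumPy sketch of one forward and one backward pass through a two-layer network with a sigmoid hidden layer, softmax output, and cross-entropy loss (all sizes, names, and the loss choice are assumptions, not taken verbatim from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(g):
    e = np.exp(g - np.max(g))
    return e / e.sum()

rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=(4, 4)) * 0.1, np.zeros(4)
w2, b2 = rng.normal(size=(3, 4)) * 0.1, np.zeros(3)

x = np.array([0.2, -1.0, 0.5, 3.0])
y = np.array([1.0, 0.0, 0.0])          # one-hot target

# Forward pass
a1 = sigmoid(w1 @ x + b1)
f  = softmax(w2 @ a1 + b2)
loss = -np.sum(y * np.log(f))

# Backward pass (chain rule, layer by layer)
dg2 = f - y                            # dLoss/dg for softmax + cross-entropy
dw2 = np.outer(dg2, a1)                # dLoss/dw2
db2 = dg2
da1 = w2.T @ dg2                       # propagate the gradient into the hidden layer
dg1 = da1 * a1 * (1 - a1)              # sigmoid derivative is a * (1 - a)
dw1 = np.outer(dg1, x)
db1 = dg1
print(loss, dw1.shape, dw2.shape)
```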
Basic building blocks of the CNN architecture
• Input layer
• Convolutional layer
• Fully connected layer
• Loss layer

• Convolutional layer
• Convolutional kernel
• Pooling layer
• Non-linearity
Convolution operation
The same pattern appears in different places, so the detectors can be compressed: instead of training many separate "small" detectors that each have to "move around" the image, they can share the same parameters.

[Figure: an "upper-left beak" detector and a "middle beak" detector compressed to the same parameters.]
Convolution vs. Fully Connected

[Figure: a 6x6 binary image and a 3x3 filter. The convolution connects each output to only a 3x3 patch of the image and reuses the same filter weights everywhere; the fully-connected alternative connects each output to all 36 inputs x_1 ... x_36.]
Convolutional Layer (with 4 filters)
weights: 4x1x9x9
Input: 1x224x224    Output: 4x224x224
(with zero padding and stride = 1)
Convolutional Layer (with 4 filters)
weights: 4x1x9x9
Input: 1x224x224    Output: 4x112x112
(with zero padding but stride = 2)
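The output sizes above follow the usual convolution arithmetic; here is a small sketch (the helper name is mine), assuming "zero padding" means a padding of (kernel − 1)/2 so that stride 1 preserves the spatial size:

```python
def conv_output_size(in_size, kernel, stride, padding):
    # Standard convolution arithmetic: floor((in + 2*pad - kernel) / stride) + 1
    return (in_size + 2 * padding - kernel) // stride + 1

# 224x224 input, 9x9 kernel, padding 4 ("same" padding for a 9x9 kernel)
print(conv_output_size(224, kernel=9, stride=1, padding=4))  # 224
print(conv_output_size(224, kernel=9, stride=2, padding=4))  # 112
```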
Color image: RGB 3 channels – convolution over depth

[Figure: for a 3-channel RGB image, each filter also spans all 3 channels (e.g. 3x3x3), so Filter 1, Filter 2, ... convolve over the full depth of the input volume.]
Different types of convolution

Parameters:
• Kernel size
• Stride
• Padding

Normal vs. dilated convolution

[Figure: a normal convolution next to a dilated convolution with dilation width = 2.]
Spatially Separable convolution
Depthwise separable convolution

Standard convolution: convolving the input volume with 256 5x5 kernels.

Depthwise separable convolution – step 1: depthwise convolution (split along the depth: one spatial filter per input channel).

Depthwise separable convolution – step 2: pointwise 1x1 convolution (combining the channels).
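A small sketch comparing parameter counts, assuming a 5x5 kernel, 3 input channels, and 256 output channels (the input-channel count is my assumption; the slide only mentions the 256 5x5 kernels):

```python
def standard_conv_params(k, c_in, c_out):
    # c_out kernels, each of size k x k x c_in
    return c_out * k * k * c_in

def depthwise_separable_params(k, c_in, c_out):
    # step 1: one k x k filter per input channel
    # step 2: c_out pointwise 1x1 filters, each spanning c_in channels
    return c_in * k * k + c_out * 1 * 1 * c_in

print(standard_conv_params(5, 3, 256))        # 19200
print(depthwise_separable_params(5, 3, 256))  # 75 + 768 = 843
```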


Transpose convolution
Convolution as a matrix multiplication
Many to one mapping – 9 values to 1 value

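To illustrate the "many to one: 9 values to 1 value" view, here is a tiny sketch of a single 3x3 convolution output written as a dot product between the flattened kernel and a flattened image patch (all values are made up):

```python
import numpy as np

kernel = np.array([[ 1, -1, -1],
                   [-1,  1, -1],
                   [-1, -1,  1]], dtype=float)

image = np.arange(36, dtype=float).reshape(6, 6)   # a toy 6x6 "image"

# Convolution as matrix multiplication: each output value is the dot product
# of the flattened 3x3 kernel (9 values) with one flattened 3x3 patch (9 values -> 1 value).
patch = image[0:3, 0:3]
out_value = kernel.flatten() @ patch.flatten()
print(out_value)
```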
One to many mapping
The whole CNN

[Figure: input image → Convolution → Max Pooling → Convolution → Max Pooling (the convolution + pooling stages can be repeated many times) → Flattened → Fully Connected Feedforward network → class scores (cat, dog, ...).]
Pooling
• Down-samples the image – this controls the number of parameters of the CNN model.

Why Pooling
• Subsampling pixels will not change the object.

[Figure: a bird image and its subsampled version – both still show a bird.]

Subsampling
• We can subsample the pixels to make the image smaller
• Fewer parameters are then needed to characterize the image

A small max-pooling sketch follows.
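A minimal sketch of 2x2 max pooling with stride 2 in NumPy (the helper is my own, not from the slides):

```python
import numpy as np

def max_pool_2x2(x):
    # Down-sample a (H, W) array by taking the max over non-overlapping 2x2 blocks
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[3, 0, 1, 5],
              [1, 2, 0, 4],
              [7, 0, 2, 2],
              [1, 6, 3, 0]], dtype=float)
print(max_pool_2x2(x))   # [[3. 5.] [7. 3.]]
```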
Pooling or strided convolution?
Unpool
The whole CNN

[Figure: the same pipeline, annotated – each Convolution + Max Pooling stage produces "a new image" (a smaller feature map), which is eventually flattened and fed to the Fully Connected Feedforward network that outputs the class scores (cat, dog, ...).]
Flattening

[Figure: a small 2D feature map (values 3, -1, 3, 1, 0, 3, ...) is flattened into a 1D vector and fed to the Fully Connected Feedforward network.]
Conv Net Topology
• 5 convolutional layers
• 3 fully connected layers + soft-max
• 650K neurons, 60 million weights
Why do we need a deep CNN?

Courtesy: ICRI
Suggested reading
Summary: Image Features
• The idea of low-, mid-, and high-level features
• Largely replaced by neural networks
• But there is a direct connection between this feature hierarchy and the hierarchy of features learned by a deep network

• Many other features have been proposed
• LBP: Local Binary Patterns – useful for recognizing faces.
• Dense SIFT: SIFT features computed on a grid, similar to the HOG features.
• etc.
Supervised Learning vs Unsupervised Learning

[Figure: supervised learning maps inputs x to labels y (cat, dog, bear, ...); unsupervised learning sees only the inputs x. The labeled setting corresponds to classification, the unlabeled setting to clustering.]
Supervised Learning – k-Nearest Neighbors

[Figure: with k = 3, one query image's 3 nearest neighbors are (cat, cat, dog), so it is labeled cat; another query's 3 nearest neighbors are (bear, dog, dog), so it is labeled dog.]
Supervised Learning – k-Nearest Neighbors
• How do we choose the right k?
• How do we choose the right features?
• How do we choose the right distance metric?

Answer: just choose the combination that works best – BUT not on the test data.

Instead, split the training data into a "Training set" and a "Validation set" (also called a "Development set"). A minimal k-NN sketch follows.
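A minimal NumPy sketch of k-nearest-neighbor classification with Euclidean distance (function names, data, and the distance choice are my own):

```python
import numpy as np
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    # Euclidean distance from the query to every training point
    dists = np.linalg.norm(train_x - query, axis=1)
    # Labels of the k closest training points
    nearest_labels = [train_y[i] for i in np.argsort(dists)[:k]]
    # Majority vote among the k neighbors
    return Counter(nearest_labels).most_common(1)[0][0]

train_x = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1], [0.0, 1.0]])
train_y = ["cat", "cat", "dog", "dog", "bear"]
print(knn_predict(train_x, train_y, np.array([0.2, 0.1]), k=3))   # "cat"
```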
Training, Validation (Dev), Test Sets

[ Training Set | Validation Set | Testing Set ]

The training and validation sets are used during development.

The test set is only to be used for evaluating the model at the very end of development: any change to the model after running it on the test set could be influenced by what you saw happen on the test set, which would invalidate any future evaluation.
Unsupervised Learning – k-means clustering (k = 3)

1. Initially assign all images to a random cluster.
2. Compute the mean image (in feature space) for each cluster.
3. Reassign images to clusters based on similarity to the cluster means.
4. Keep repeating this process until convergence.
Unsupervised Learning – k-means clustering
• How do we choose the right k?
• How do we choose the right features?
• How do we choose the right distance metric?
• How sensitive is this method with respect to the random initial assignment of clusters?

A minimal k-means sketch follows.
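A minimal NumPy sketch of the k-means loop described above (the convergence check, empty-cluster handling, and all names are my own choices):

```python
import numpy as np

def kmeans(x, k=3, num_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initially assign every point to a random cluster
    assignments = rng.integers(0, k, size=len(x))
    for _ in range(num_iters):
        # 2. Compute the mean of each cluster (in feature space);
        #    if a cluster is empty, re-seed it with a random point
        means = np.array([x[assignments == c].mean(axis=0)
                          if np.any(assignments == c)
                          else x[rng.integers(len(x))]
                          for c in range(k)])
        # 3. Reassign points to the nearest cluster mean
        dists = np.linalg.norm(x[:, None, :] - means[None, :, :], axis=2)
        new_assignments = dists.argmin(axis=1)
        # 4. Keep repeating until convergence (no assignment changes)
        if np.array_equal(new_assignments, assignments):
            break
        assignments = new_assignments
    return assignments, means

rng = np.random.default_rng(1)
x = np.vstack([rng.normal(c, 0.1, size=(20, 2)) for c in (0.0, 1.0, 2.0)])
labels, centers = kmeans(x, k=3)
print(centers)
```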
Supervised Learning - Classification

[Figure: training data – images labeled cat, dog, and bear; test data – unlabeled images to be classified.]
Supervised Learning - Classification

Training Data:
x_1 = [ ]    y_1 = [ cat ]
x_2 = [ ]    y_2 = [ dog ]
x_3 = [ ]    y_3 = [ cat ]
...
x_n = [ ]    y_n = [ bear ]
Supervised Learning - Classification

Training Data (inputs, targets / labels / ground truth, predictions):

x_1 = [x_11  x_12  x_13  x_14]    y_1 = 1    ŷ_1 = 1
x_2 = [x_21  x_22  x_23  x_24]    y_2 = 2    ŷ_2 = 2
x_3 = [x_31  x_32  x_33  x_34]    y_3 = 1    ŷ_3 = 2
...
x_n = [x_n1  x_n2  x_n3  x_n4]    y_n = 3    ŷ_n = 1

We need to find a function that maps x to y for any of them:

ŷ_i = f(x_i; θ)

How do we "learn" the parameters of this function? We choose the ones that make the following quantity small:

Σ_{i=1}^{n} Cost(ŷ_i, y_i)
Supervised Learning – Linear Softmax

Training Data (inputs, targets / labels / ground truth):

x_1 = [x_11  x_12  x_13  x_14]    y_1 = 1
x_2 = [x_21  x_22  x_23  x_24]    y_2 = 2
x_3 = [x_31  x_32  x_33  x_34]    y_3 = 1
...
x_n = [x_n1  x_n2  x_n3  x_n4]    y_n = 3
Supervised Learning – Linear Softmax

Training Data (inputs, one-hot targets / labels / ground truth, predictions):

x_1 = [x_11  x_12  x_13  x_14]    y_1 = [1 0 0]    ŷ_1 = [0.85 0.10 0.05]
x_2 = [x_21  x_22  x_23  x_24]    y_2 = [0 1 0]    ŷ_2 = [0.20 0.70 0.10]
x_3 = [x_31  x_32  x_33  x_34]    y_3 = [1 0 0]    ŷ_3 = [0.40 0.45 0.15]
...
x_n = [x_n1  x_n2  x_n3  x_n4]    y_n = [0 0 1]    ŷ_n = [0.40 0.25 0.35]
Supervised Learning – Linear Softmax

x_i = [x_i1  x_i2  x_i3  x_i4]    y_i = [1 0 0]    ŷ_i = [f_c  f_d  f_b]

g_c = w_c1 x_i1 + w_c2 x_i2 + w_c3 x_i3 + w_c4 x_i4 + b_c
g_d = w_d1 x_i1 + w_d2 x_i2 + w_d3 x_i3 + w_d4 x_i4 + b_d
g_b = w_b1 x_i1 + w_b2 x_i2 + w_b3 x_i3 + w_b4 x_i4 + b_b

f_c = e^{g_c} / (e^{g_c} + e^{g_d} + e^{g_b})
f_d = e^{g_d} / (e^{g_c} + e^{g_d} + e^{g_b})
f_b = e^{g_b} / (e^{g_c} + e^{g_d} + e^{g_b})
How do we find a good w and b?

x_i = [x_i1  x_i2  x_i3  x_i4]    y_i = [1 0 0]    ŷ_i = [f_c(w, b)  f_d(w, b)  f_b(w, b)]

We need to find w and b that minimize the following:

L(w, b) = Σ_{i=1}^{n} Σ_{j=1}^{3} −y_{i,j} log(ŷ_{i,j}) = Σ_{i=1}^{n} −log(ŷ_{i,label}) = Σ_{i=1}^{n} −log f_{i,label}(w, b)
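A tiny NumPy sketch of this cross-entropy loss over a batch of one-hot targets (variable names and values are illustrative):

```python
import numpy as np

def cross_entropy(y_true, y_pred):
    # Sum over classes and examples of -y_ij * log(ŷ_ij);
    # with one-hot targets this is just -log of the probability at the true label.
    return -np.sum(y_true * np.log(y_pred))

y_true = np.array([[1, 0, 0],
                   [0, 1, 0]])
y_pred = np.array([[0.85, 0.10, 0.05],
                   [0.20, 0.70, 0.10]])
print(cross_entropy(y_true, y_pred))   # -log(0.85) - log(0.70) ≈ 0.519
```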
Idea of entropy and cross-entropy

How many questions need to be asked, on average, to correctly guess the color of a randomly picked ball?

Another case
Gradient Descent (GD)

λ = 0.01

L(w, b) = Σ_{i=1}^{n} −log f_{i,label}(w, b)      // computed over the whole training set: expensive

Initialize w and b randomly
for e = 0, num_epochs do
    Compute: dL(w, b)/dw and dL(w, b)/db
    Update w: w = w − λ dL(w, b)/dw
    Update b: b = b − λ dL(w, b)/db
    Print: L(w, b)    // useful to see if this is becoming smaller or not.
end
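Putting the pieces together, here is a hedged NumPy sketch of full-batch gradient descent for the linear softmax classifier with this cross-entropy loss (the gradient formula dL/dg = f − y is the standard one for softmax + cross-entropy; data and names are illustrative):

```python
import numpy as np

def softmax(g):
    e = np.exp(g - g.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n, d, c = 100, 4, 3
x = rng.normal(size=(n, d))              # n examples with 4 features each
labels = x[:, :3].argmax(axis=1)         # synthetic integer labels with some structure
y = np.eye(c)[labels]                    # one-hot targets

w = rng.normal(size=(c, d)) * 0.01       # initialize w and b randomly
b = np.zeros(c)
lam = 0.01                               # learning rate λ

for epoch in range(100):
    f = softmax(x @ w.T + b)             # predictions ŷ over the full training set
    loss = -np.sum(y * np.log(f))        # L(w, b)
    dg = f - y                           # dL/dg for softmax + cross-entropy
    dw = dg.T @ x                        # dL/dw
    db = dg.sum(axis=0)                  # dL/db
    w -= lam * dw                        # w = w - λ dL/dw
    b -= lam * db                        # b = b - λ dL/db
    if epoch % 20 == 0:
        print(epoch, loss)               # the loss should get smaller
```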
Gradient Descent (GD) (idea)

1. Start with a random value of w (e.g. w = 12).
2. Compute the gradient (derivative) of L(w) at the point w = 12 (e.g. dL/dw = 6).
3. Recompute w as: w = w − lambda * (dL/dw)

[Figure: the loss curve L(w) with the current point at w = 12.]
Gradient Descent (GD) (idea)

2. Compute the gradient (derivative) of L(w) at the updated point.
3. Recompute w as: w = w − lambda * (dL/dw)

[Figure: the loss curve L(w) with the updated point at w = 10.]
(mini-batch) Stochastic Gradient Descent (SGD)

λ = 0.01

l(w, b) = Σ_{i∈B} −log f_{i,label}(w, b)      // computed over a mini-batch B only

Initialize w and b randomly
for e = 0, num_epochs do
    for b = 0, num_batches do
        Compute: dl(w, b)/dw and dl(w, b)/db
        Update w: w = w − λ dl(w, b)/dw
        Update b: b = b − λ dl(w, b)/db
        Print: l(w, b)    // useful to see if this is becoming smaller or not.
    end
end
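The same sketch as the gradient-descent example above, changed to loop over mini-batches (batch size, shuffling, and the helper name are my own choices):

```python
import numpy as np

def sgd_epoch(x, y, w, b, lam=0.01, batch_size=16, rng=None):
    """Run one epoch of mini-batch SGD on the linear softmax model; returns updated w, b."""
    rng = rng or np.random.default_rng()
    order = rng.permutation(len(x))                   # shuffle examples each epoch
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]         # indices of mini-batch B
        xb, yb = x[idx], y[idx]
        g = xb @ w.T + b
        e = np.exp(g - g.max(axis=1, keepdims=True))
        f = e / e.sum(axis=1, keepdims=True)          # softmax over the batch
        dg = f - yb                                   # gradient of l(w, b) w.r.t. the scores
        w -= lam * (dg.T @ xb)                        # dl/dw computed over the batch only
        b -= lam * dg.sum(axis=0)                     # dl/db computed over the batch only
    return w, b
```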
Source: Andrew Ng
