009 Neural Networks - Complete
Introduction
• [25 4 3] is a 1x3 vector (a row vector)
• [4; −12; 18] is a 3x1 vector (a column vector)
• All vectors are also matrices (a vector is simply a matrix with a single row or a single column)
• Therefore, the rules that we develop for matrices also work for vectors
Addition/subtraction
• The two matrices to be added/subtracted must be the same size
https://www.mathsisfun.com/algebra/matrix-introduction.html
Negative, Scalar Multiplication, Transpose
https://www.mathsisfun.com/algebra/matrix-introduction.html
Matrix Multiplication
• For two matrices to be compatible for multiplication, the # of columns of the first must match the # of rows of the second, i.e., the inner dimensions must match
• E.g., if A is a matrix with dimensions m x n, and B is a matrix with dimensions p x q, then:
  – We can only form the product Y = AB if n = p
  – The dimensions of the product matrix Y would be m x q
  – Therefore, (m x n)(n x q) → m x q
• Each entry of the product is a "dot product": multiply matching members, then sum up:
  – (1, 2, 3) • (7, 9, 11) = 1×7 + 2×9 + 3×11 = 58
https://www.mathsisfun.com/algebra/matrix-introduction.html
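A quick NumPy check of the dimension rule and of the dot product example above (the two matrices are arbitrary illustrations; the dot product values are from the slide):

```python
import numpy as np

A = np.arange(6).reshape(2, 3)        # (m x n) = (2 x 3)
B = np.arange(12).reshape(3, 4)       # (n x q) = (3 x 4): inner dimensions match (3 = 3)
Y = A @ B                             # product has shape (m x q) = (2 x 4)
print(Y.shape)

# The dot product from the slide:
print(np.dot([1, 2, 3], [7, 9, 11]))  # 1*7 + 2*9 + 3*11 = 58
```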
Matrix: Hadamard Product
• The Hadamard product (element-wise product) takes two matrices of the same dimensions and produces another matrix of the same dimensions as the operands
• Each element (i, j) is the product of elements (i, j) of the original two matrices
• Denoted as ∘ or .*
https://en.wikipedia.org/wiki/Hadamard_product_(matrices)
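A short NumPy sketch contrasting the Hadamard (element-wise) product with ordinary matrix multiplication; the matrices are arbitrary examples:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

print(A * B)    # Hadamard product: element (i, j) is A[i, j] * B[i, j]
print(A @ B)    # ordinary matrix product, for comparison
```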
Sources
• Machine Learning for Intelligent Systems, Kilian Weinberger, Cornell, Video Lectures 35-37, https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote20.html
• Deep Learning Specialization, Andrew Ng, https://www.coursera.org/specializations/deep-learning?utm_source=deeplearningai&utm_medium=institutions&utm_campaign=SocialYoutubeDLSC1W1L1
  – Video Lectures C1W3L1 – C1W4L6: https://www.youtube.com/playlist?list=PLpFsSf5Dm-pd5d3rjNtIXUHT-v7bdaEIe
• A beginner's guide to deriving and implementing backpropagation, Pranav Budhwant, https://link.medium.com/Zp3zxNWpf6
• TensorFlow Playground: https://playground.tensorflow.org/
• Machine Learning Playground: https://ml-playground.com/#
The Artificial “Neuron”
• Remember logistic regression?
h(x) = σ(w^T x + b)
⇔
z = w^T x + b
h(x) = σ(z)
[Diagram: inputs x_1, x_2, x_3, …, x_n with weights w_1, w_2, w_3, …, w_n feed z = w^T x + b, then σ(z), producing h(x)]
Learning weights
• Contribution of w_1 to the cost L: ∂L/∂w_1
• Gradient descent update: w_1 := w_1 − α ∂L/∂w_1
  – w_1 := w_1 − α(negative gradient): increases w_1
  – w_1 := w_1 − α(positive gradient): decreases w_1
• Intuitively, each weight asks (a minimal sketch follows this slide):
  1. What is my contribution to this cost?
  2. Should I increase or decrease my value to lower the cost?
[Figure: the cost L plotted against w_1 – negative gradient to the left of the minimum, positive gradient to the right, gradient = 0 at the minimum]
[Diagram: the neuron – inputs x_1 … x_n, weights w_1 … w_n, z = w^T x + b, σ(z), h(x)]
• Example cost functions:
  – Squared error: L(h(x), y) = (1/(2m)) Σ_{i=1}^{m} (h(x^(i)) − y^(i))^2
  – Cross-entropy: L(h(x), y) = −(1/m) Σ_{i=1}^{m} [ y^(i) log h(x^(i)) + (1 − y^(i)) log(1 − h(x^(i))) ]
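To make the update rule concrete, here is a minimal, self-contained sketch (not from the slides) of gradient descent on a single weight for the squared-error cost of one linear neuron; the data and learning rate are made up for illustration:

```python
import numpy as np

# Toy data: one feature, one weight, no bias, squared-error cost
x = np.array([0.5, 1.0, 1.5, 2.0])
y = np.array([1.0, 2.0, 3.0, 4.0])   # generated by y = 2*x, so the optimum is w = 2

w, alpha = 0.0, 0.1                   # initial weight and (made-up) learning rate
for step in range(100):
    h = w * x                         # prediction h(x) = w*x
    grad = np.mean((h - y) * x)       # dL/dw for L = (1/(2m)) * sum((h - y)^2)
    w -= alpha * grad                 # negative gradient -> increase w, positive -> decrease
print(w)                              # converges towards 2.0
```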
Can we do better than linear decision boundaries?
(Image refs: https://towardsdatascience.com/understanding-support-vector-machine-part-2-kernel-trick-mercers-theorem-e1e6848c6c4d, https://towardsdatascience.com/the-kernel-trick-c98cdbcaeb3f, Weinberger, Lectures 35-37)
h(x) = σ(w^T x + b)
• Manipulate x:
  x → φ(x)
  h(x) = σ(w^T φ(x) + b)
• Kernels: predefined φ(x), e.g., φ(x) = x^2
Neural Networks (Weinberger, Lectures 35-37)
h(x) = σ(w^T x + b)
• Manipulate x:
  x → φ(x)
  h(x) = σ(w^T φ(x) + b)
• Neural networks: learn φ(x)
  – Learn w, b and the representation of the data φ(x)
  – φ(x) = g(v^T x + c), where g(z) is again a non-linear function like the Heaviside step function, sigmoid, hyperbolic tangent (tanh), Rectified Linear Unit (ReLU), Leaky ReLU, …
  h(x) = σ(w^T (g(v^T x + c)) + b)
• Why non-linear? If g were the identity (see the sketch after this slide):
  h(x) = w^T (v^T x + c) + b
       = w^T v^T x + w^T c + b
       = (w^T v^T) x + (w^T c + b)
• Only a linear function!
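A quick numerical check of that collapse (a sketch, not from the slides): with no non-linearity, two stacked "layers" are exactly one linear map. All matrices below are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(3,))                                # input
v, c = rng.normal(size=(4, 3)), rng.normal(size=(4,))   # "hidden layer" without activation
w, b = rng.normal(size=(4,)), rng.normal(size=())       # output layer

two_layer = w @ (v @ x + c) + b        # h(x) = w^T (v x + c) + b
collapsed = (w @ v) @ x + (w @ c + b)  # the same thing as a single linear function of x
print(np.allclose(two_layer, collapsed))   # True
```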
Neural Networks (Weinberger, Lectures 35-37)
• With some changes, NNs have been around for a while
• Multi-layer perceptron → Artificial Neural Network → Deep Learning
• Major changes
– GPUs – matrix multiplications
– Preference of activation functions
– Stochastic Gradient Descent (SGD) and Mini-batch Gradient
Descent
– Rebranding!
MLP: Who? Perceptron? I am an ANN!
φ(x) = [h_1(x); h_2(x); …; h_n(x)], where each h(x) is a linear classifier: h(x) = σ(v^T x + c)
The output is again a linear classifier on top of φ(x): H(x) = σ(w^T φ(x) + b)
[Diagram: inputs x_1 … x_k feed hidden units h_1(x) … h_n(x) (weight vectors v, u, k, …); the hidden outputs feed the output unit H(x) with weights w_1 … w_n. Layers: Input → Hidden Layer → Output]
Neural Networks (Weinberger, Lectures 35-37)
h(x) = σ(w^T g(v^T x + c) + b)
[Computation graph: inputs x_1 … x_n feed hidden units z_1 = v^T x + c, a_1 = σ(z_1); z_2 = v^T x + c, a_2 = σ(z_2); …; the hidden activations feed z = w^T a + b, h(x) = σ(z) = ŷ, which enters the cost L(ŷ, y) = (1/m) Σ_{i=1}^{m} l(ŷ, y)]
Intuition – How NNs work?
H(x) = σ(w^T φ(x) + b)
The NN learns φ(x):
φ(x) = [h_1(x); h_2(x); …; h_n(x)], where each h(x) is a linear classifier: h(x) = v^T x + c
[Diagram: x feeds hidden units h_1(x) … h_n(x), which feed H(x) with weights w_1 … w_n]
• The learned function becomes smoother with higher n
• NNs are universal approximators
More Intuition – Regression (Weinberger, Lectures 35-37)
ReLU
H(x) = w^T φ(x) + b
H(x) = w^T max(v^T x + c, 0) + b
L(H(x), y) = (1/(2m)) Σ_{i=1}^{m} (h(x^(i)) − y^(i))^2
[Figure: a 1-D regression target approximated by a sum of shifted ReLU pieces such as max(v^T x + c_1, 0) and max(v^T x + c_2, 0), giving a piece-wise linear fit]
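A minimal sketch of that idea (not from the slides): fit a 1-D curve with a handful of randomly placed ReLU features and a linear read-out solved by least squares. The number of features and the target function are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)
y = np.sin(x)                               # made-up regression target

# Hidden layer: n fixed random ReLU features phi(x) = max(v*x + c, 0)
n = 20
v = rng.normal(size=n)
c = rng.normal(size=n)
phi = np.maximum(np.outer(x, v) + c, 0.0)   # shape (200, n)

# Output layer: solve for w, b by least squares (instead of gradient descent)
design = np.hstack([phi, np.ones((len(x), 1))])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
w, b = coef[:-1], coef[-1]

pred = phi @ w + b                          # piece-wise linear approximation of sin(x)
print(np.max(np.abs(pred - y)))             # small with enough ReLU pieces
```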
Layers (Weinberger, Lectures 35-37)
h(x) = σ(w^T φ(x) + b)
φ(x) = σ(v^T x + c)
• To make this more powerful, either:
  – Make the matrix v really big, or
  – Add layers, like a Matryoshka doll:
    h(x) = σ(w^T φ(x) + b)
    φ(x) = σ(v^T φ'(x) + c)
    φ'(x) = σ(v'^T φ''(x) + c')
    φ''(x) = σ(v''^T x + c'')
• "Deep" learning
  – This has been known since the time of Frank Rosenblatt
  – Deep networks were not used for a long time because:
    • They used to take a long time to train
    • Any function that you can learn with a deep network, you can also learn with a shallow network
  – But in practice, a shallow network requires an exponentially wide matrix to compete with a deep network that does the same with fewer, smaller matrices
  – A deep network lets the earlier layers learn simple things (lines, hyperplanes, piece-wise linear functions) and the later layers build more complex, non-linear functions on top of them
Learning with Layers (Weinberger, Lectures 35-37)
h(x) = w^T φ(x)
φ(x) = σ(a(x)),   a(x) = v^T φ'(x)
φ'(x) = σ(a'(x)),  a'(x) = v'^T φ''(x)
φ''(x) = σ(a''(x)), a''(x) = v''^T x
• Forward pass and backpropagation
[Computation graph: inputs x_1 … x_n feed z_1 = v^T x + c, a_1 = σ(z_1); z_2 = v^T x + c, a_2 = σ(z_2); …; then z = w^T a + b, h(x) = σ(z) = ŷ, which enters the cost L(ŷ, y) = (1/m) Σ_{i=1}^{m} l(ŷ, y)]
Neural Networks - SGD (Weinberger, Lectures 35-37)
• Forget convex cost functions!
• Approximate the gradient of the cost function with one data point – in random
order
– Compared to batch gradient descent that averages over the whole dataset
• Clearly a bad approximation – and that’s why we use it
• It’s a bad approximation to find the exact minimum – which we no longer care
about
• Misses narrow (maybe deep) holes and converges onto wide holes
  – The narrow holes may not be in the same place in the test data
• Does not overfit the data easily
• Is faster, as SGD has already taken m update steps after one epoch
• The direction of the small steps remains correct on average
Neural Networks - SGD (Weinberger, Lectures 35-37)
• In practice, we use mini-batch gradient descent (a minimal sketch follows this slide)
• Initially use a very large learning rate – to prevent falling into narrow local minima
• Now you are jumping around in the wide holes
• Lower the learning rate – say by a factor of 10
• And the gradient descent will converge further
• We must remember that we have billions of local minima
  – Millions of wider holes
• In practice, reaching a decent minimum allows for a good enough error rate
  – A NN with a different local minimum may perform equally well
  – So, ensemble methods (that combine classifiers) work well
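A minimal sketch (not from the slides) of mini-batch SGD for a single logistic-regression neuron, with a learning-rate drop partway through; the data, batch size, and schedule are made-up illustrations:

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, m = 2, 1000
X = rng.normal(size=(n_x, m))                       # columns are training examples
Y = (X[0] + X[1] > 0).astype(float).reshape(1, m)   # made-up separable labels

w = np.zeros((n_x, 1)); b = 0.0
alpha, batch = 1.0, 32                              # start with a large learning rate

for epoch in range(20):
    if epoch == 10:
        alpha /= 10                                 # lower the learning rate to converge further
    idx = rng.permutation(m)                        # random order each epoch
    for start in range(0, m, batch):
        cols = idx[start:start + batch]
        Xb, Yb = X[:, cols], Y[:, cols]
        A = 1 / (1 + np.exp(-(w.T @ Xb + b)))       # forward pass on the mini-batch
        dZ = A - Yb                                 # gradient approximated on the batch only
        w -= alpha * (Xb @ dZ.T) / Xb.shape[1]
        b -= alpha * dZ.mean()

print(((1 / (1 + np.exp(-(w.T @ X + b))) > 0.5) == Y).mean())   # training accuracy
```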
Up next
• Formalize the notation
– Logistic regression – Vectorized version
– NN – Vectorized version
– Forward pass
• Activation functions
– Sigmoid, tanh, ReLU, leaky ReLU
– Pros and cons
– Gradients
• Forward and Backward passes
– Backpropagation
• Logistic Regression
• NN
Neural Networks
Notation and Forward Pass
Z = w^T X + b
dims: (1, m) = (1, n_x)(n_x, m) + (1, m)
Note: b would be implicitly converted to [b b b … b] (shape (1, m)) by NumPy's broadcasting.
Z = [z^(1) z^(2) z^(3) … z^(m)]
A = σ(Z)
A = [a^(1) a^(2) a^(3) … a^(m)]
Vectorizing Logistic Regression – The Forward Pass
(Deep Learning Specialization, Andrew Ng )
[Diagram: inputs x_1, x_2, …, x_{n_x} with weights w_1, w_2, …, w_{n_x} and bias b feed z = w_1 x_1 + w_2 x_2 + ⋯ + w_{n_x} x_{n_x} + b, then a = σ(z)]
A complete forward pass over all m examples:
Z = w^T X + b
A = σ(Z)
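A sketch of the vectorized forward pass with the shapes from the slide (the data here is random filler):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

n_x, m = 3, 5                       # 3 features, 5 training examples
rng = np.random.default_rng(0)
X = rng.normal(size=(n_x, m))       # each column is one training example
w = rng.normal(size=(n_x, 1))
b = 0.1                             # scalar; broadcast to (1, m) by NumPy

Z = w.T @ X + b                     # (1, m) = (1, n_x)(n_x, m) + (1, m)
A = sigmoid(Z)                      # (1, m), one activation per example
print(Z.shape, A.shape)
```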
Neural Networks: Terminology
(https://www.codeproject.com/Articles/1261763/ANNT-Feed-forward-fully-connected-neural-networks,
https://laptrinhx.com/deep-learning-using-keras-3760648021/ )
[Diagram: a fully connected network with inputs x_1, x_2, x_3, hidden-layer-1 activations a_1^[1] … a_4^[1], hidden-layer-2 activations a_1^[2] … a_3^[2], and output activation a_1^[3]]
• Notation: a_i^[l] is the activation of unit i in layer l; the input layer is a^[0] = x
NN - Forward pass with 1 training example (Deep Learning Specialization, Andrew Ng)
For each neuron i in layer 1 (the input is a^[0] = x):
z_1^[1] = w_1^[1]T a^[0] + b_1^[1],  a_1^[1] = σ(z_1^[1])
z_2^[1] = w_2^[1]T a^[0] + b_2^[1],  a_2^[1] = σ(z_2^[1])
z_3^[1] = w_3^[1]T a^[0] + b_3^[1],  a_3^[1] = σ(z_3^[1])
…
Note: w_i^[1] is the entire weight vector of this neuron (one row of the layer's weight matrix W^[1]).
b^[l] = [b_1^[l]; b_2^[l]; …; b_{n^[l]}^[l]],  dims = (n^[l], 1)
Forward pass
[Diagram: the 3–4–3–1 network: x_1, x_2, x_3 (i.e., a_1^[0], a_2^[0], a_3^[0]) → a_1^[1] … a_4^[1] → a_1^[2] … a_3^[2] → a_1^[3] = ŷ]
z^[1] = W^[1] a^[0] + b^[1]   (4,1) = (4,3)(3,1) + (4,1)
a^[1] = σ(z^[1])              (4,1) = (4,1)
z^[2] = W^[2] a^[1] + b^[2]   (3,1) = (3,4)(4,1) + (3,1)
a^[2] = σ(z^[2])              (3,1) = (3,1)
z^[3] = W^[3] a^[2] + b^[3]   (1,1) = (1,3)(3,1) + (1,1)
a^[3] = σ(z^[3])              (1,1) = (1,1)
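A sketch of that forward pass in NumPy for the 3–4–3–1 network above (random weights as placeholders), with the shapes from the slide asserted along the way:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
a0 = rng.normal(size=(3, 1))                       # one training example, x = a^[0]

# Layer sizes 3 -> 4 -> 3 -> 1; W^[l] has shape (n^[l], n^[l-1]), b^[l] has shape (n^[l], 1)
W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))
W2, b2 = rng.normal(size=(3, 4)), np.zeros((3, 1))
W3, b3 = rng.normal(size=(1, 3)), np.zeros((1, 1))

z1 = W1 @ a0 + b1; a1 = sigmoid(z1); assert a1.shape == (4, 1)
z2 = W2 @ a1 + b2; a2 = sigmoid(z2); assert a2.shape == (3, 1)
z3 = W3 @ a2 + b3; a3 = sigmoid(z3); assert a3.shape == (1, 1)
print(a3)                                          # y_hat for this example
```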
NN - Forward pass with m training examples (Deep Learning Specialization, Andrew Ng)
X^(1) → ŷ^(1) = a^[3](1)
X^(2) → ŷ^(2) = a^[3](2)
…
X^(m) → ŷ^(m) = a^[3](m)
Sigmoid (Deep Learning Specialization, Andrew Ng)
g(z) = 1 / (1 + e^(−z))
• Range: (0, +1)
• Gradient → 0 for larger values of |z|
• Outputs centered around 0.5
• Derivative:
  g'(z) = d/dz [1/(1 + e^(−z))] = (1/(1 + e^(−z)))(1 − 1/(1 + e^(−z)))
  g'(z) = g(z)(1 − g(z))
  If a = g(z), then g'(z) = a(1 − a)
• Examples:
  – z = 10:  g(z) ≈ 1,   g'(z) ≈ 0
  – z = −10: g(z) ≈ 0,   g'(z) ≈ 0
  – z = 0:   g(z) = 1/2, g'(z) = (1/2)(1 − 1/2) = 1/4
Hyperbolic Tangent (tanh) (Deep Learning Specialization, Andrew Ng )
g(z) = (e^z − e^(−z)) / (e^z + e^(−z))
• Shifted (and rescaled) version of the sigmoid
• Range: (−1, +1)
• Gradient → 0 for larger values of |z|
• Outputs centered around 0
• Derivative:
  g'(z) = d/dz tanh(z) = 1 − tanh^2(z)
  g'(z) = 1 − g(z)^2
  If a = g(z), then g'(z) = 1 − a^2
• Examples:
  – z = 10:  g(z) ≈ 1,  g'(z) ≈ 0
  – z = −10: g(z) ≈ −1, g'(z) ≈ 0
  – z = 0:   g(z) = 0,  g'(z) = 1
Rectified Linear Unit (ReLU) (Deep Learning Specialization, Andrew Ng )
g(z) = max(0, z)
• Range: [0, ∞)
• Solves the problem of diminished gradients for larger values of |z|
• Gradient undefined at z = 0, and 0 for z < 0
• Derivative:
  g'(z) = 0 if z < 0;  1 if z > 0;  undefined if z = 0
  In practice we use: g'(z) = 0 if z < 0;  1 if z ≥ 0
Leaky ReLU (Deep Learning Specialization, Andrew Ng )
g(z) = max(0.01z, z)
• Range: (−∞, ∞)
• Solves the problem of diminished gradients for larger values of |z|
• Gradient undefined at z = 0
• Derivative:
  g'(z) = 0.01 if z < 0;  1 if z > 0;  undefined if z = 0
  In practice we use: g'(z) = 0.01 if z < 0;  1 if z ≥ 0
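The four activations and the derivatives listed above, as a small NumPy sketch (using the practical convention g'(0) = 1 for ReLU and Leaky ReLU):

```python
import numpy as np

def sigmoid(z):        return 1 / (1 + np.exp(-z))
def d_sigmoid(z):      a = sigmoid(z); return a * (1 - a)

def tanh(z):           return np.tanh(z)
def d_tanh(z):         return 1 - np.tanh(z) ** 2

def relu(z):           return np.maximum(0.0, z)
def d_relu(z):         return np.where(z >= 0, 1.0, 0.0)        # convention: g'(0) = 1

def leaky_relu(z):     return np.maximum(0.01 * z, z)
def d_leaky_relu(z):   return np.where(z >= 0, 1.0, 0.01)

z = np.array([-10.0, 0.0, 10.0])
print(sigmoid(z), d_sigmoid(z))        # ~[0, 0.5, 1], ~[0, 0.25, 0]
print(tanh(z), d_tanh(z))              # ~[-1, 0, 1], ~[0, 1, 0]
print(relu(z), d_relu(z))              # [0, 0, 10], [0, 1, 1]
print(leaky_relu(z), d_leaky_relu(z))
```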
Table of Activation Functions
(Activation Functions: Sigmoid, ReLU, Leaky ReLU and Softmax basics for Neural Networks and Deep Learning, Himanshu Sharma, https://medium.com/@himanshuxd/activation-functions-sigmoid-relu-leaky-relu-and-softmax-basics-for-neural-networks-and-deep-8d9c70eed91e)
[Table not reproduced here; see the linked article]
Up next
• Forward and Backward passes
– Backpropagation
• Logistic Regression
• NN
Neural Networks
Training the NN – Backpropagation
[Diagram: the 3–4–3–1 network – x_1, x_2, x_3 (i.e., a_1^[0], a_2^[0], a_3^[0]) → a_1^[1] … a_4^[1] → a_1^[2] … a_3^[2] → a_1^[3] = ŷ]
Logistic Regression Derivatives w.r.t Cost 𝑳
(Deep Learning Specialization, Andrew Ng )
[Computation graph: x_1, x_2 with weights w_1, w_2 and bias b → z = w_1 x_1 + w_2 x_2 + b → a = σ(z) → L(a, y)]
L(a, y) = −y log(a) − (1 − y) log(1 − a)

dz = ∂L(a, y)/∂z = (∂L/∂a)(∂a/∂z)
  da = ∂L/∂a = −y/a + (1 − y)/(1 − a)
  Since ∂a/∂z = a(1 − a) for σ(z):
  dz = a − y

dw_i = ∂L(a, y)/∂w_i = (∂L/∂a)(∂a/∂z)(∂z/∂w_i)
  Since ∂z/∂w_i = x_i:
  dw_i = dz · x_i = (a − y) x_i

db = dz, since ∂z/∂b = 1

Weight update:
w_i := w_i − α dw_i
b := b − α db
Vectorizing Logistic Regression – The Backward Pass
(Deep Learning Specialization, Andrew Ng )
[Computation graph: x_1 … x_{n_x}, w_1 … w_{n_x}, b → z = w_1 x_1 + w_2 x_2 + ⋯ + w_{n_x} x_{n_x} + b → a = σ(z) → L(a, y)]
For one training instance: dz^(1) = a^(1) − y^(1)
For m instances (columns of X):
dZ = A − Y          dims: (1, m) = (1, m) − (1, m)
db = (1/m) Σ_{i=1}^{m} dz^(i)  ⇒  db = (1/m) sum(dZ),  dims = (1, 1)
dW = (1/m) X dZ^T   dims: (n_x, 1) = (n_x, m)(m, 1)
W := W − α dW
b := b − α db

where x = [x_1; x_2; …; x_{n_x}], w = [w_1; w_2; …; w_{n_x}], and b is a scalar.

Recap (scalar case):
• ∂a/∂z = a(1 − a) for σ(z)
• ∂z/∂w_i = x_i
• da = −y/a + (1 − y)/(1 − a)
• dz = a − y
• dw_i = dz · x_i = (a − y) x_i
• db = dz
Vectorizing Logistic Regression – The Backward Pass
(Deep Learning Specialization, Andrew Ng )
[Computation graph: x_1 … x_{n_x}, w_1 … w_{n_x}, b → z = w_1 x_1 + w_2 x_2 + ⋯ + w_{n_x} x_{n_x} + b → a = σ(z) → L(a, y)]
One complete iteration of gradient descent:

Forward pass:
Z = W^T X + b
A = σ(Z)

Backward pass:
dZ = A − Y
dW = (1/m) X dZ^T
db = (1/m) sum(dZ)

Update:
W := W − α dW
b := b − α db

Recap (scalar case):
• ∂a/∂z = a(1 − a) for σ(z)
• ∂z/∂w_i = x_i
• da = −y/a + (1 − y)/(1 − a)
• dz = a − y
• dw_i = dz · x_i = (a − y) x_i
• db = dz
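Putting the forward and backward passes together, here is a minimal sketch of the full gradient-descent loop for vectorized logistic regression (toy data, made-up learning rate and iteration count):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
n_x, m = 2, 500
X = rng.normal(size=(n_x, m))                      # (n_x, m): columns are examples
Y = (X[0] - X[1] > 0).astype(float).reshape(1, m)  # made-up labels

W = np.zeros((n_x, 1)); b = 0.0; alpha = 0.5

for it in range(1000):
    # Forward pass
    Z = W.T @ X + b              # (1, m)
    A = sigmoid(Z)               # (1, m)
    # Backward pass
    dZ = A - Y                   # (1, m)
    dW = (X @ dZ.T) / m          # (n_x, 1)
    db = dZ.sum() / m            # scalar
    # Update
    W -= alpha * dW
    b -= alpha * db

print(((A > 0.5) == Y).mean())   # training accuracy approaches 1 on this toy problem
```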
Now the NN?
[Diagram: the 3–4–3–1 network – x_1, x_2, x_3 (i.e., a_1^[0], a_2^[0], a_3^[0]) → a_1^[1] … a_4^[1] → a_1^[2] … a_3^[2] → a_1^[3] = ŷ]
The forward pass
(A beginner’s guide to deriving and implementing backpropagation
Pranav Budhwant, https://medium.com/binaryandmore/beginners-guide-to-deriving-and-implementing-backpropagation-e3c1a5a1e536)
[Diagram: the 3–4–3–1 network – the forward pass propagates x = a^[0] through the layers to ŷ = a^[3]]
The backward pass
(A beginner’s guide to deriving and implementing backpropagation
Pranav Budhwant, https://medium.com/binaryandmore/beginners-guide-to-deriving-and-implementing-backpropagation-e3c1a5a1e536)
The backward pass (Layer 𝑳)
(A beginner’s guide to deriving and implementing backpropagation
Pranav Budhwant, https://medium.com/binaryandmore/beginners-guide-to-deriving-and-implementing-backpropagation-e3c1a5a1e536)
[Diagram: the 3–4–3–1 network – the backward pass starts at the output layer L and propagates gradients back towards the input]
Recap (logistic-regression derivatives, reused at the output layer):
• ∂a/∂z = a(1 − a) for σ(z)
• ∂z/∂w_i = x_i
• ∂z/∂b = 1
• da = −y/a + (1 − y)/(1 − a)
• dz = a − y
• dw_i = dz · x_i = (a − y) x_i
• db = dz
The backward pass (Notes)
(A beginner’s guide to deriving and implementing backpropagation
Pranav Budhwant, https://medium.com/binaryandmore/beginners-guide-to-deriving-and-implementing-backpropagation-e3c1a5a1e536)
The backward pass (Layer 𝐿 − 1 and beyond)
(A beginner’s guide to deriving and implementing backpropagation
Pranav Budhwant, https://medium.com/binaryandmore/beginners-guide-to-deriving-and-implementing-backpropagation-e3c1a5a1e536)
[Diagram: the 3–4–3–1 network – gradients flow from layer L back through layer L − 1 and beyond]
• Recursive structure: the backward pass reuses ∂C/∂z^[l+1] to compute ∂C/∂z^[l]; at the output layer, ∂C/∂z^[L] becomes a^[L] − y
Recap (scalar case):
• ∂a/∂z = a(1 − a) for σ(z)
• ∂z/∂w_i = x_i
• da = −y/a + (1 − y)/(1 − a)
• dz = a − y
• dw_i = dz · x_i = (a − y) x_i
• db = dz
One Iteration of the Forward and Backward Passes for a 2-Layered NN:

Forward Propagation:
Z^[1] = W^[1] X + b^[1]        (n^[1], m) = (n^[1], n^[0])(n^[0], m) + (n^[1], m)
A^[1] = g^[1](Z^[1])           (n^[1], m) = (n^[1], m)
Z^[2] = W^[2] A^[1] + b^[2]    (n^[2], m) = (n^[2], n^[1])(n^[1], m) + (n^[2], m)
A^[2] = g^[2](Z^[2])           (n^[2], m) = (n^[2], m)

Back Propagation:
dZ^[2] = A^[2] − Y             (n^[2], m) = (n^[2], m) − (n^[2], m), with n^[2] = 1 if the output layer has 1 node
dW^[2] = (1/m) dZ^[2] A^[1]T   (n^[2], n^[1]) = (n^[2], m)(m, n^[1])
db^[2] = (1/m) sum(dZ^[2])     (n^[2], 1): sum over the m columns, broadcast back across examples when used
dZ^[1] = W^[2]T dZ^[2] ∗ g^[1]'(Z^[1])   (n^[1], m) = (n^[1], n^[2])(n^[2], m) ∗ (n^[1], m), where ∗ is the element-wise (Hadamard) product
dW^[1] = (1/m) dZ^[1] X^T      (n^[1], n^[0]) = (n^[1], m)(m, n^[0])
db^[1] = (1/m) sum(dZ^[1])     (n^[1], 1)

Update:
W^[2] := W^[2] − α dW^[2]
b^[2] := b^[2] − α db^[2]
W^[1] := W^[1] − α dW^[1]
b^[1] := b^[1] − α db^[1]
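A sketch of those equations in NumPy for one iteration of a 2-layer network with a sigmoid output (layer sizes and data are arbitrary placeholders; g^[1] is taken to be tanh here):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
n0, n1, n2, m = 3, 4, 1, 200                            # layer sizes and number of examples
X = rng.normal(size=(n0, m))
Y = (X.sum(axis=0, keepdims=True) > 0).astype(float)    # made-up labels, shape (1, m)

W1, b1 = rng.normal(size=(n1, n0)) * 0.01, np.zeros((n1, 1))
W2, b2 = rng.normal(size=(n2, n1)) * 0.01, np.zeros((n2, 1))
alpha = 0.5

# Forward propagation
Z1 = W1 @ X + b1                  # (n1, m)
A1 = np.tanh(Z1)                  # g^[1] = tanh (an assumption; any activation works)
Z2 = W2 @ A1 + b2                 # (n2, m)
A2 = sigmoid(Z2)                  # g^[2] = sigmoid for a binary output

# Back propagation
dZ2 = A2 - Y                                  # (n2, m)
dW2 = (dZ2 @ A1.T) / m                        # (n2, n1)
db2 = dZ2.sum(axis=1, keepdims=True) / m      # (n2, 1)
dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)            # (n1, m); tanh'(Z1) = 1 - A1^2
dW1 = (dZ1 @ X.T) / m                         # (n1, n0)
db1 = dZ1.sum(axis=1, keepdims=True) / m      # (n1, 1)

# Update
W2 -= alpha * dW2; b2 -= alpha * db2
W1 -= alpha * dW1; b1 -= alpha * db1
```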
The Vanishing Gradients Problem
(https://towardsdatascience.com/the-vanishing-gradient-problem-69bf08b15484)
• The problem: As more layers using certain activation functions are added to a neural network, the gradient of the loss function approaches zero, making the network hard to train.
• The reason: Activation functions like the sigmoid and tanh squish a large input space into a small output space (e.g., between 0 and 1 for the sigmoid).
  – A large change in the input of the sigmoid function causes only a small change in the output. Hence, the derivative is small.
  – The derivatives are small for large values of the input |x|.
The Vanishing Gradients Problem
(https://towardsdatascience.com/the-vanishing-gradient-problem-69bf08b15484)
• For shallow networks with only a few layers that use these activations, this
isn’t a big problem.
• However, when more layers are used, it can cause the gradient to be too
small for training to work effectively.
• Gradients of neural networks are found using backpropagation.
– Backpropagation finds the derivatives of the network by moving layer by layer
from the final layer to the initial one.
– By the chain rule, the derivatives of each layer are multiplied down the network
(from the final layer to the initial) to compute the derivatives of the initial
layers.
– When n hidden layers use an activation like the sigmoid function, n small
derivatives are multiplied together. Thus, the gradient decreases exponentially
as we propagate down to the initial layers.
• A small gradient means that the weights and biases of the initial layers will not be updated effectively with each training iteration. Since these initial layers are often crucial to recognizing the core elements of the input data, this can lead to overall inaccuracy of the whole network. (A small numerical illustration follows.)
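As a small numerical illustration (not from the source article): the chain-rule factor contributed by each sigmoid layer is at most 0.25, so the product of those derivatives shrinks roughly exponentially with depth. The depth, weights, and inputs below are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
depth = 20
z = 0.5
grad_product = 1.0
for layer in range(depth):
    a = sigmoid(z)
    grad_product *= a * (1 - a)     # each sigmoid layer contributes sigma'(z) <= 0.25
    w = rng.normal()                # made-up scalar weight for the next layer
    grad_product *= w               # the chain rule also multiplies by the weight
    z = w * a
print(grad_product)                 # typically a vanishingly small number after 20 layers
```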
http://aghaaliraza.com
Thank you!