009 Neural Networks Complete

The document provides an overview of neural networks, focusing on the mathematical concepts of scalars, vectors, and matrices, which are foundational for understanding machine learning. It explains matrix operations, including addition, multiplication, and the Hadamard product, as well as the structure and learning mechanisms of artificial neurons and neural networks. The content also highlights the evolution of neural networks, their capabilities as universal approximators, and the significance of non-linear activation functions in enhancing model performance.

Neural Networks

Introduction

Agha Ali Raza

CS535/EE514 – Machine Learning


The Matrix
Scalars, Vectors and Matrices
(https://www.mathsisfun.com/algebra/scalar-vector-matrix.html)

• A scalar is a single number, e.g., 9, -25, 0.579, …

• A matrix is a rectangular array of numbers (one or more rows, one or more columns)
• CS people: think of multidimensional arrays

$\begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \\ a_{41} & a_{42} & a_{43} \end{bmatrix}$ is a 4x3 matrix, i.e., 4 rows and 3 columns

$\begin{bmatrix} a_{11} & a_{12} & \dots & a_{1n} \\ a_{21} & a_{22} & \dots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \dots & a_{mn} \end{bmatrix}$ is an m x n matrix, i.e., m rows and n columns

• A vector is a list of numbers (a single row or a single column)
• CS people: think of a 1D array

• $\begin{bmatrix} 25 & 4 & 3 \end{bmatrix}$ is a 1x3 vector
• $\begin{bmatrix} 4 \\ -12 \\ 18 \end{bmatrix}$ is a 3x1 vector

• All vectors are also matrices (a matrix can have a single row or column).
• Therefore, the rules we develop for matrices also work for vectors.
Addition/Subtraction
• The two matrices to be added/subtracted must be the same size

https://www.mathsisfun.com/algebra/matrix-introduction.html
Negative, Scalar Multiplication, Transpose

https://www.mathsisfun.com/algebra/matrix-introduction.html
Matrix Multiplication
• For two matrices to be compatible for multiplication, the number of columns of the first must
match the number of rows of the second, i.e., the inner dimensions must match
• E.g., if A is a matrix with dimensions m x n, and B is a matrix with dimensions p x q, then:
• We can only form the product Y = AB if n = p
• The dimensions of the product matrix Y would be m x q
• Therefore, (m x n)(n x q) → m x q

• The "dot product" is where we multiply matching members, then sum up:
• (1, 2, 3) • (7, 9, 11) = 1×7 + 2×9 + 3×11 = 58

https://www.mathsisfun.com/algebra/matrix-introduction.html
Matrix Multiplication

• And by the way, matrix multiplication is not commutative

https://www.mathsisfun.com/algebra/matrix-introduction.html
Matrix: Hadamard Product
• The Hadamard product (element-wise product) takes two matrices of the same
dimensions and produces another matrix of the same dimensions as the operands
• Each element (i, j) of the result is the product of elements (i, j) of the original two matrices.
• Denoted as ∘ or .*

https://en.wikipedia.org/wiki/Hadamard_product_(matrices)
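As a quick sanity check of these rules, here is a minimal NumPy sketch (the array values are arbitrary examples, not from the slides) contrasting the matrix product with the Hadamard product:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])          # 2x3 matrix
B = np.array([[7, 8],
              [9, 10],
              [11, 12]])           # 3x2 matrix

# Matrix product: inner dimensions must match -> (2x3)(3x2) = 2x2
Y = A @ B
print(Y.shape)                     # (2, 2)

# Hadamard (element-wise) product: both operands must have the same shape
C = np.array([[1, 0, 2],
              [3, 1, 0]])          # 2x3 matrix
H = A * C                          # element-wise multiply, result is 2x3
print(H)
```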
Sources
• Machine Learning for Intelligent Systems, Kilian Weinberger, Cornell, Video Lectures 35-37,
  https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote20.html
• Deep Learning Specialization, Andrew Ng,
  https://www.coursera.org/specializations/deep-learning?utm_source=deeplearningai&utm_medium=institutions&utm_campaign=SocialYoutubeDLSC1W1L1
  – Video Lectures: C1W3L1 – C1W4L6:
    https://www.youtube.com/playlist?list=PLpFsSf5Dm-pd5d3rjNtIXUHT-v7bdaEIe
• A beginner's guide to deriving and implementing backpropagation, by Pranav Budhwant,
  https://link.medium.com/Zp3zxNWpf6
• Tensorflow playground: https://playground.tensorflow.org/
• Machine Learning Playground: https://ml-playground.com/#
The Artificial "Neuron"
• Remember logistic regression?

$h(x) = \sigma(w^T x + b)$

$z = w^T x + b$
$h(x) = \sigma(z)$

[Figure: a single artificial neuron — inputs $x_1, x_2, \dots, x_n$ with weights $w_1, w_2, \dots, w_n$ feed into $z = w^T x + b$, followed by $\sigma(z)$ to produce $h(x)$.]
Learning Weights

Contribution of $w_1$ to the loss $L$: $\frac{\partial L}{\partial w_1}$

$w_1 := w_1 - \alpha \frac{\partial L}{\partial w_1}$

$w_1 := w_1 - \alpha(\text{negative})$: increases $w_1$
$w_1 := w_1 - \alpha(\text{positive})$: decreases $w_1$

Each weight asks:
1. What is my contribution to this cost?
2. Should I increase or decrease my value to lower the cost?

[Figure: a bowl-shaped loss curve $L$ versus $w_1$ — the gradient is negative to the left of the minimum, positive to the right, and zero at the minimum.]

[Figure: the single-neuron diagram — inputs $x_1, \dots, x_n$ with weights $w_1, \dots, w_n$ feed into $z = w^T x + b$, then $\sigma(z)$, producing $h(x)$.]

Example cost functions:

$L(h(x), y) = \frac{1}{2m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right)^2$

$L(h(x), y) = \frac{1}{m} \sum_{i=1}^{m} \left[ -y^{(i)} \log h(x^{(i)}) - (1 - y^{(i)}) \log\left(1 - h(x^{(i)})\right) \right]$
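To make the update rule concrete, here is a minimal sketch of one gradient step for a sigmoid neuron with the cross-entropy loss above; the data values and learning rate are invented purely for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: m examples, n features (values are illustrative only)
X = np.array([[0.5, 1.2], [1.5, -0.3], [-0.7, 0.8]])   # shape (m, n)
y = np.array([1, 0, 1])                                  # shape (m,)

w = np.zeros(2)
b = 0.0
alpha = 0.1                                              # learning rate

# One gradient-descent step on the cross-entropy loss
h = sigmoid(X @ w + b)              # predictions h(x) for all m examples
grad_w = X.T @ (h - y) / len(y)     # dL/dw = (1/m) * sum (h - y) x
grad_b = np.mean(h - y)             # dL/db = (1/m) * sum (h - y)
w -= alpha * grad_w                 # subtracting a negative gradient increases w
b -= alpha * grad_b
```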
Can we do better than linear decision boundaries?
(Image refs: https://towardsdatascience.com/understanding-support-vector-machine-part-2-kernel-trick-mercers-theorem-e1e6848c6c4d, https://towardsdatascience.com/the-kernel-trick-c98cdbcaeb3f, Weinberger, Lectures 35-37)

$h(x) = \sigma(w^T x + b)$

• Manipulate $x$:
  $x \rightarrow \phi(x)$
  $h(x) = \sigma(w^T \phi(x) + b)$
• Kernels: a predefined $\phi(x)$, e.g., $\phi(x) = x^2$
Neural Networks (Weinberger, Lectures 35-37)

$h(x) = \sigma(w^T x + b)$

• Manipulate $x$
• Neural networks learn $\phi(x)$:
  $x \rightarrow \phi(x)$
  $h(x) = \sigma(w^T \phi(x) + b)$
• Learn $w$, $b$ and the representation of the data $\phi(x)$
• $\phi(x) = g(v^T x + c)$, where $g(z)$ is again a non-linear function like
  the Heaviside step function, sigmoid, hyperbolic tangent (tanh),
  Rectified Linear Unit (ReLU), Leaky ReLU, …

$h(x) = \sigma(w^T (g(v^T x + c)) + b)$

• Why non-linear? If $g$ were the identity:
  $h(x) = w^T (v^T x + c) + b$
  $h(x) = w^T v^T x + w^T c + b$
  $h(x) = (w^T v^T) x + (w^T c + b)$
• Only a linear function!
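A quick numerical sketch of this collapse (random matrices invented for illustration): two stacked layers with no non-linearity are indistinguishable from a single linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with no non-linearity (shapes chosen arbitrarily for the demo)
v, c = rng.normal(size=(3, 5)), rng.normal(size=3)   # first layer: R^5 -> R^3
w, b = rng.normal(size=(1, 3)), rng.normal(size=1)   # second layer: R^3 -> R^1

x = rng.normal(size=5)

# Composition of the two linear layers
h_stacked = w @ (v @ x + c) + b

# Equivalent single linear layer: W = w v, B = w c + b
W, B = w @ v, w @ c + b
h_single = W @ x + B

print(np.allclose(h_stacked, h_single))   # True: stacking linear layers stays linear
```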
Neural Networks (Weinberger, Lectures 35-37)
• With some changes, NNs have been around for a while
• Multi-layer perceptron → Artificial Neural Network → Deep Learning
• Major changes
  – GPUs – fast matrix multiplications
  – Different preferences for activation functions
  – Stochastic Gradient Descent (SGD) and Mini-batch Gradient Descent
  – Rebranding!
    MLP: Who? Perceptron? I am an ANN!
    That is Mr. ANN for you!
Neural Networks (Weinberger, Lectures 35-37)

$H(x) = \sigma(w^T \phi(x) + b)$

The NN learns $\phi(x)$:

$\phi(x) = \begin{bmatrix} h_1(x) \\ h_2(x) \\ \vdots \\ h_n(x) \end{bmatrix}$, where each $h(x)$ is a linear classifier, $h(x) = \sigma(v^T x + c)$

[Figure: a network with input $\vec{x} = (x_1, \dots, x_k)$, a hidden layer of units $h_1(\vec{x}), \dots, h_n(\vec{x})$ (each with its own weight vector), and an output unit $H(\vec{x}) = \sigma(w^T \phi(\vec{x}) + b)$ combining them with weights $w_1, \dots, w_n$. Labels: Input, Hidden Layer, Output.]
Neural Networks (Weinberger, Lectures 35-37)

$h(x) = \sigma(w^T g(v^T x + c) + b)$

[Figure: inputs $x_1, \dots, x_n$ feed hidden units that compute $z_1 = v_1^T x + c_1$, $a_1 = \sigma(z_1)$ and $z_2 = v_2^T x + c_2$, $a_2 = \sigma(z_2)$; the output unit computes $z = w^T a + b$, $h(x) = \sigma(z)$, giving the prediction $\hat{y}$. The loss is $L(\hat{y}, y) = \frac{1}{m} \sum_{i=1}^{m} l(\hat{y}, y)$.]
Intuition – How do NNs work?

$H(x) = \sigma(w^T \phi(x) + b)$

The NN learns $\phi(x)$:

$\phi(x) = \begin{bmatrix} h_1(x) \\ h_2(x) \\ \vdots \\ h_n(x) \end{bmatrix}$, where each $h(x)$ is a linear classifier, $h(x) = v^T x + c$

[Figure: the hidden units $h_1(\vec{x}), \dots, h_n(\vec{x})$ are combined with weights $w_1, \dots, w_n$ into $H(\vec{x})$; the resulting function becomes smoother with higher $n$. Neural networks are universal approximators.]
More Intuition – Regression (Weinberger, Lectures 35-37)

With ReLU hidden units:

$H(x) = w^T \phi(x) + b$

$H(x) = w^T \max(v^T x + c, 0) + b$

$L(H(x), y) = \frac{1}{2m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right)^2$

[Figure: two ReLU features, $\max(v^T x + c_1, 0)$ and $\max(v^T x + c_2, 0)$, with kinks at different places; their weighted sum produces a piecewise-linear fit to the data.]
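A minimal sketch of this idea (toy 1-D data and randomly initialized ReLU features, all invented for illustration): fit the output weights $w, b$ on top of fixed ReLU features $\max(vx + c, 0)$ by least squares and get a piecewise-linear approximation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1-D regression data (illustrative only)
x = np.linspace(-3, 3, 100)
y = np.sin(x)

# Fixed ReLU features: phi_j(x) = max(v_j * x + c_j, 0)
n_hidden = 20
v = rng.normal(size=n_hidden)
c = rng.normal(size=n_hidden)
Phi = np.maximum(np.outer(x, v) + c, 0.0)          # shape (100, n_hidden)

# Fit the output layer H(x) = w^T phi(x) + b by least squares
Phi_aug = np.hstack([Phi, np.ones((len(x), 1))])   # append a bias column
coef, *_ = np.linalg.lstsq(Phi_aug, y, rcond=None)
y_hat = Phi_aug @ coef                             # piecewise-linear fit

print(np.mean((y_hat - y) ** 2))                   # small squared error
```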
Layers (Weinberger, Lectures 35-37)

$h(x) = \sigma(w^T \phi(x) + b)$
$\phi(x) = \sigma(v^T x + c)$

• To make this more powerful:
  – Either make the matrix $v$ really big
  – Or add layers, like a Matryoshka doll

$h(x) = \sigma(w^T \phi(x) + b)$
$\phi(x) = \sigma(v^T \phi'(x) + c)$
$\phi'(x) = \sigma(v'^T \phi''(x) + c')$
$\phi''(x) = \sigma(v''^T x + c'')$

• "Deep" learning
  – This has been known since the time of Frank Rosenblatt
  – Deep networks were not used because:
    • They used to take a long time to train
    • Any function that you can learn with a deep network, you can also learn with a shallow network
    • But in practice, a shallow network requires an exponentially wide matrix to compete with
      a deep network that needs fewer, smaller matrices to do the same
    • A deep network lets us learn simple things (lines, hyperplanes, piece-wise linear functions) in the
      earlier layers and build more complex tools (non-linear functions) in the later layers
Learning with Layers (Weinberger, Lectures 35-37)

$h(x) = w^T \phi(x)$
$\phi(x) = \sigma(a(x)), \quad a(x) = v^T \phi'(x)$
$\phi'(x) = \sigma(a'(x)), \quad a'(x) = v'^T \phi''(x)$
$\phi''(x) = \sigma(a''(x)), \quad a''(x) = v''^T x$

• Forward pass and backpropagation

[Figure: inputs $x_1, \dots, x_n$ feed hidden units computing $z_1 = v_1^T x + c_1$, $a_1 = \sigma(z_1)$ and $z_2 = v_2^T x + c_2$, $a_2 = \sigma(z_2)$; the output unit computes $z = w^T a + b$, $h(x) = \sigma(z)$, producing $\hat{y}$, with loss $L(\hat{y}, y) = \frac{1}{m} \sum_{i=1}^{m} l(\hat{y}, y)$.]
Neural Networks - SGD (Weinberger, Lectures 35-37)
• Forget convex cost functions!
• Approximate the gradient of the cost function with one data point at a time – in random order
  – Compared to batch gradient descent, which averages over the whole dataset
• Clearly a bad approximation – and that's exactly why we use it
• It's a bad approximation for finding the exact minimum – which we no longer care about
• Misses narrow (maybe deep) holes and converges onto wide holes
  – The narrow holes may not be in the same place in the test data
• Does not overfit the data easily
• Is faster: SGD has already taken m steps after one epoch
• The direction of the small steps remains correct on average
Neural Networks - SGD (Weinberger, Lectures 35-37)
• In practice, we use mini-batch gradient descent
• Initially use a very large learning rate – to avoid falling into narrow local minima
• Now you are jumping around in the wide holes
• Lower the learning rate – say, by a factor of 10
• And gradient descent will converge further
• Remember that we have billions of local minima
  – Millions of wider holes
• In practice, reaching a decent minimum allows for a good enough error rate
  – A NN that finds a different local minimum may perform equally well
  – So ensemble methods (that combine classifiers) work well
Up next
• Formalize the notation
– Logistic regression – Vectorized version
– NN – Vectorized version
– Forward pass
• Activation functions
– Sigmoid, tanh, ReLU, leaky ReLU
– Pros and cons
– Gradients
• Forward and Backward passes
– Backpropagation
• Logistic Regression
• NN
Neural Networks
Notation and Forward Pass

Agha Ali Raza

CS535/EE514 – Machine Learning


Sources
• Machine Learning for Intelligent Systems, Kilian Weinberger, Cornell, Video Lectures 35-37,
  https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote20.html
• Deep Learning Specialization, Andrew Ng,
  https://www.coursera.org/specializations/deep-learning?utm_source=deeplearningai&utm_medium=institutions&utm_campaign=SocialYoutubeDLSC1W1L1
  – Video Lectures: C1W3L1 – C1W4L6:
    https://www.youtube.com/playlist?list=PLpFsSf5Dm-pd5d3rjNtIXUHT-v7bdaEIe
• A beginner's guide to deriving and implementing backpropagation, by Pranav Budhwant,
  https://link.medium.com/Zp3zxNWpf6
• Tensorflow playground: https://playground.tensorflow.org/
• Machine Learning Playground: https://ml-playground.com/#
Vectorizing Logistic Regression – The Forward Pass
(Deep Learning Specialization, Andrew Ng)

For a single neuron:

$z = w_1 x_1 + w_2 x_2 + \dots + w_{n_x} x_{n_x} + b, \quad a = \sigma(z)$

$z = w^T x + b, \quad \text{dims}: (1, n_x)(n_x, 1) = (1, 1)$

This is for one training instance:

$z^{(i)} = w^T x^{(i)} + b$
$a^{(i)} = \sigma(z^{(i)})$

We can do better. Stack the features and weights into column vectors, and the training instances into a matrix:

$x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_{n_x} \end{bmatrix}, \quad w = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_{n_x} \end{bmatrix}, \quad b = b$

$X = \begin{bmatrix} x^{(1)} & x^{(2)} & x^{(3)} & \dots & x^{(m)} \end{bmatrix}, \quad \text{dims} = (n_x, m)$

$Z = w^T X + b$
$\text{dims}: (1, m) = (1, n_x)(n_x, m) + (1, m)$

Note: $b$ is implicitly converted to $[b\ b\ \dots\ b]$ by NumPy broadcasting in Python.

$Z = [z^{(1)}\ z^{(2)}\ z^{(3)}\ \dots\ z^{(m)}]$

$A = \sigma(Z)$
$A = [a^{(1)}\ a^{(2)}\ a^{(3)}\ \dots\ a^{(m)}]$
Vectorizing Logistic Regression – The Forward Pass
(Deep Learning Specialization, Andrew Ng)

A complete forward pass:

$Z = w^T X + b$
$A = \sigma(Z)$
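A minimal NumPy sketch of this vectorized forward pass (shapes follow the slides; the data values are invented for illustration):

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

n_x, m = 3, 5                       # 3 features, 5 training examples
rng = np.random.default_rng(0)

X = rng.normal(size=(n_x, m))       # each column is one training instance x^(i)
w = rng.normal(size=(n_x, 1))       # weight column vector
b = 0.5                             # scalar bias, broadcast across columns

Z = w.T @ X + b                     # dims: (1, m) = (1, n_x)(n_x, m) + broadcast b
A = sigmoid(Z)                      # dims: (1, m), one activation per example
print(A.shape)                      # (1, 5)
```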
Neural Networks: Terminology
(https://www.codeproject.com/Articles/1261763/ANNT-Feed-forward-fully-connected-neural-networks,
https://laptrinhx.com/deep-learning-using-keras-3760648021/)

[Figure: side-by-side diagrams of a 3-layered fully-connected network and a 2-layered fully-connected network.]
Formalizing Notation – NN (Deep Learning Specialization, Andrew Ng)
Consider the following 3-layer NN:
• Number of nodes in layer $l$: $n^{[l]}$
• $a_i^{[l]}$: $i$th node in layer $l$
• $a^{[0]} = x$
• $a^{[L]} = h(x) = \hat{y}$

[Figure: inputs $x_1, x_2, x_3$ feed a first hidden layer $a_1^{[1]}, a_2^{[1]}, a_3^{[1]}, a_4^{[1]}$, then a second hidden layer $a_1^{[2]}, a_2^{[2]}, a_3^{[2]}$, then the output $a_1^{[3]}$.]
NN – Forward pass with 1 training example
(Deep Learning Specialization, Andrew Ng)

Note: $w_1^{[1]}$ is the entire weight vector of this neuron; we have been using the term differently earlier. Also, $a^{[0]} = x$ (i.e., $x_1 = a_1^{[0]}$, $x_2 = a_2^{[0]}$, $x_3 = a_3^{[0]}$).

For each neuron of the first hidden layer:

$z_1^{[1]} = w_1^{[1]T} a^{[0]} + b_1^{[1]}, \quad a_1^{[1]} = \sigma(z_1^{[1]})$
$z_2^{[1]} = w_2^{[1]T} a^{[0]} + b_2^{[1]}, \quad a_2^{[1]} = \sigma(z_2^{[1]})$
$z_3^{[1]} = w_3^{[1]T} a^{[0]} + b_3^{[1]}, \quad a_3^{[1]} = \sigma(z_3^{[1]})$
$z_4^{[1]} = w_4^{[1]T} a^{[0]} + b_4^{[1]}, \quad a_4^{[1]} = \sigma(z_4^{[1]})$

Stack the per-neuron weight vectors into a weight matrix and the biases into a column vector:

$W^{[l]} = \begin{bmatrix} w_1^{[l]T} \\ w_2^{[l]T} \\ \vdots \\ w_{n^{[l]}}^{[l]T} \end{bmatrix} = \begin{bmatrix} w_{1,1}^{[l]} & w_{1,2}^{[l]} & \dots & w_{1,n^{[l-1]}}^{[l]} \\ w_{2,1}^{[l]} & w_{2,2}^{[l]} & \dots & w_{2,n^{[l-1]}}^{[l]} \\ \vdots & \vdots & \ddots & \vdots \\ w_{n^{[l]},1}^{[l]} & w_{n^{[l]},2}^{[l]} & \dots & w_{n^{[l]},n^{[l-1]}}^{[l]} \end{bmatrix}, \quad \text{dims} = (n^{[l]}, n^{[l-1]})$

$b^{[l]} = \begin{bmatrix} b_1^{[l]} \\ b_2^{[l]} \\ \vdots \\ b_{n^{[l]}}^{[l]} \end{bmatrix}, \quad \text{dims} = (n^{[l]}, 1)$

Then the per-neuron equations collapse into one matrix equation per layer:

$\begin{bmatrix} w_1^{[l]T} \\ w_2^{[l]T} \\ \vdots \\ w_{n^{[l]}}^{[l]T} \end{bmatrix} a^{[l-1]} + \begin{bmatrix} b_1^{[l]} \\ b_2^{[l]} \\ \vdots \\ b_{n^{[l]}}^{[l]} \end{bmatrix} = \begin{bmatrix} z_1^{[l]} \\ z_2^{[l]} \\ \vdots \\ z_{n^{[l]}}^{[l]} \end{bmatrix}, \quad a_i^{[l]} = \sigma(z_i^{[l]})$

$z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}, \quad a^{[l]} = \sigma(z^{[l]})$
NN – Forward pass with 1 training example
(Deep Learning Specialization, Andrew Ng)

[Figure: the 3-layer network with inputs $x_1, x_2, x_3$ (i.e., $a_1^{[0]}, a_2^{[0]}, a_3^{[0]}$), hidden layers $a^{[1]}$ (4 nodes) and $a^{[2]}$ (3 nodes), and output $a_1^{[3]} = \hat{y}$.]

Forward pass:

$z^{[1]} = W^{[1]} a^{[0]} + b^{[1]}$   (4,1) = (4,3)(3,1) + (4,1)
$a^{[1]} = \sigma(z^{[1]})$             (4,1) = (4,1)

$z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}$   (3,1) = (3,4)(4,1) + (3,1)
$a^{[2]} = \sigma(z^{[2]})$             (3,1) = (3,1)

$z^{[3]} = W^{[3]} a^{[2]} + b^{[3]}$   (1,1) = (1,3)(3,1) + (1,1)
$a^{[3]} = \sigma(z^{[3]})$             (1,1) = (1,1)
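A minimal sketch of this forward pass for the same 3-4-3-1 architecture (randomly initialized parameters, purely for shape checking):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
layer_sizes = [3, 4, 3, 1]                    # n[0]=3 inputs, two hidden layers, one output

# Randomly initialized parameters: W[l] has shape (n[l], n[l-1]), b[l] has shape (n[l], 1)
W = [rng.normal(size=(layer_sizes[l], layer_sizes[l - 1])) for l in range(1, 4)]
b = [np.zeros((layer_sizes[l], 1)) for l in range(1, 4)]

a = rng.normal(size=(3, 1))                   # a[0] = x, one training example
for Wl, bl in zip(W, b):
    z = Wl @ a + bl                           # z[l] = W[l] a[l-1] + b[l]
    a = sigmoid(z)                            # a[l] = sigma(z[l])

print(a.shape)                                # (1, 1) -> y_hat for this example
```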
NN – Forward pass with m training examples
(Deep Learning Specialization, Andrew Ng)

Looping over instances:

$X^{(1)} \rightarrow \hat{y}^{(1)} = a^{[3](1)}$
$X^{(2)} \rightarrow \hat{y}^{(2)} = a^{[3](2)}$
…
$X^{(m)} \rightarrow \hat{y}^{(m)} = a^{[3](m)}$

Forward pass with the $i$th instance (the parameters $W^{[l]}, b^{[l]}$ are shared across all instances):

$z^{[1](i)} = W^{[1]} a^{[0](i)} + b^{[1]}$   (4,1) = (4,3)(3,1) + (4,1)
$a^{[1](i)} = \sigma(z^{[1](i)})$             (4,1) = (4,1)

$z^{[2](i)} = W^{[2]} a^{[1](i)} + b^{[2]}$   (3,1) = (3,4)(4,1) + (3,1)
$a^{[2](i)} = \sigma(z^{[2](i)})$             (3,1) = (3,1)

$z^{[3](i)} = W^{[3]} a^{[2](i)} + b^{[3]}$   (1,1) = (1,3)(3,1) + (1,1)
$a^{[3](i)} = \sigma(z^{[3](i)})$             (1,1) = (1,1)
NN – Forward pass with m training examples
(Deep Learning Specialization, Andrew Ng)

Using:

$X = A^{[0]} = \begin{bmatrix} x^{(1)} & x^{(2)} & x^{(3)} & \dots & x^{(m)} \end{bmatrix} = \begin{bmatrix} x_1^{(1)} & x_1^{(2)} & x_1^{(3)} & \dots & x_1^{(m)} \\ x_2^{(1)} & x_2^{(2)} & x_2^{(3)} & \dots & x_2^{(m)} \\ \vdots & \vdots & \vdots & & \vdots \\ x_{n^{[0]}}^{(1)} & x_{n^{[0]}}^{(2)} & x_{n^{[0]}}^{(3)} & \dots & x_{n^{[0]}}^{(m)} \end{bmatrix}, \quad \text{dims} = (n^{[0]}, m)$

$Z^{[l]} = \begin{bmatrix} z^{[l](1)} & z^{[l](2)} & \dots & z^{[l](m)} \end{bmatrix} = \begin{bmatrix} z_1^{[l](1)} & z_1^{[l](2)} & \dots & z_1^{[l](m)} \\ z_2^{[l](1)} & z_2^{[l](2)} & \dots & z_2^{[l](m)} \\ \vdots & \vdots & & \vdots \\ z_{n^{[l]}}^{[l](1)} & z_{n^{[l]}}^{[l](2)} & \dots & z_{n^{[l]}}^{[l](m)} \end{bmatrix}$

$A^{[l]} = \begin{bmatrix} a^{[l](1)} & a^{[l](2)} & \dots & a^{[l](m)} \end{bmatrix} = \begin{bmatrix} a_1^{[l](1)} & a_1^{[l](2)} & \dots & a_1^{[l](m)} \\ a_2^{[l](1)} & a_2^{[l](2)} & \dots & a_2^{[l](m)} \\ \vdots & \vdots & & \vdots \\ a_{n^{[l]}}^{[l](1)} & a_{n^{[l]}}^{[l](2)} & \dots & a_{n^{[l]}}^{[l](m)} \end{bmatrix}$

Forward pass with m instances:

$Z^{[1]} = W^{[1]} A^{[0]} + b^{[1]}$   (4,m) = (4,3)(3,m) + (4,m)
$A^{[1]} = \sigma(Z^{[1]})$             (4,m) = (4,m)

$Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}$   (3,m) = (3,4)(4,m) + (3,m)
$A^{[2]} = \sigma(Z^{[2]})$             (3,m) = (3,m)

$Z^{[3]} = W^{[3]} A^{[2]} + b^{[3]}$   (1,m) = (1,3)(3,m) + (1,m)
$A^{[3]} = \sigma(Z^{[3]})$             (1,m) = (1,m)
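A minimal batched version of the previous sketch (same hypothetical 3-4-3-1 network, random data), where each column of X is one training instance and the bias is broadcast across columns:

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

rng = np.random.default_rng(0)
layer_sizes = [3, 4, 3, 1]
m = 8                                         # number of training examples

W = [rng.normal(size=(layer_sizes[l], layer_sizes[l - 1])) for l in range(1, 4)]
b = [np.zeros((layer_sizes[l], 1)) for l in range(1, 4)]

A = rng.normal(size=(3, m))                   # A[0] = X, shape (n[0], m)
for Wl, bl in zip(W, b):
    Z = Wl @ A + bl                           # Z[l] = W[l] A[l-1] + b[l], bias broadcast over m columns
    A = sigmoid(Z)                            # A[l] = sigma(Z[l])

print(A.shape)                                # (1, m): one prediction per example
```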
Up next
• Activation functions
– Sigmoid, tanh, ReLU, leaky ReLU
– Pros and cons
– Gradients
• Forward and Backward passes
– Backpropagation
• Logistic Regression
• NN
Neural Networks
Activation Functions

Agha Ali Raza

CS535/EE514 – Machine Learning


Sources
• Machine Learning for Intelligent Systems, Kilian Weinberger, Cornell, Video Lectures 35-37,
  https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote20.html
• Deep Learning Specialization, Andrew Ng,
  https://www.coursera.org/specializations/deep-learning?utm_source=deeplearningai&utm_medium=institutions&utm_campaign=SocialYoutubeDLSC1W1L1
  – Video Lectures: C1W3L1 – C1W4L6:
    https://www.youtube.com/playlist?list=PLpFsSf5Dm-pd5d3rjNtIXUHT-v7bdaEIe
• A beginner's guide to deriving and implementing backpropagation, by Pranav Budhwant,
  https://link.medium.com/Zp3zxNWpf6
• Tensorflow playground: https://playground.tensorflow.org/
• Machine Learning Playground: https://ml-playground.com/#
Sigmoid (σ) (Deep Learning Specialization, Andrew Ng)

$g(z) = \frac{1}{1 + e^{-z}}$

• Range: (0, +1)
• Gradient → 0 for large values of $|z|$
• Mean around 0.5
• Derivative:

$\frac{d}{dz} g(z) = g'(z) = \frac{1}{1 + e^{-z}} \left( 1 - \frac{1}{1 + e^{-z}} \right)$
$g'(z) = g(z)\,(1 - g(z))$

If $a = g(z)$, then $g'(z) = a(1 - a)$

• Examples:
  ▪ $z = 10$: $g(z) \approx 1$, $g'(z) \approx 0$
  ▪ $z = -10$: $g(z) \approx 0$, $g'(z) \approx 0$
  ▪ $z = 0$: $g(z) = \frac{1}{2}$, $g'(z) = \frac{1}{2}\left(1 - \frac{1}{2}\right) = \frac{1}{4}$
Hyperbolic Tangent (tanh) (Deep Learning Specialization, Andrew Ng)

$g(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$

• A shifted and rescaled version of the sigmoid
• Range: (−1, +1)
• Gradient → 0 for large values of $|z|$
• Mean around 0
• Derivative:

$\frac{d}{dz} g(z) = g'(z) = 1 - \tanh^2(z)$
$g'(z) = 1 - g(z)^2$

If $a = g(z)$, then $g'(z) = 1 - a^2$

• Examples:
  ▪ $z = 10$: $g(z) \approx 1$, $g'(z) \approx 0$
  ▪ $z = -10$: $g(z) \approx -1$, $g'(z) \approx 0$
  ▪ $z = 0$: $g(z) = 0$, $g'(z) = 1$
Rectified Linear Unit (ReLU) (Deep Learning Specialization, Andrew Ng)

$g(z) = \max(0, z)$

• Range: [0, ∞)
• Solves the problem of diminished gradients for large values of $|z|$ (for $z > 0$)
• Gradient undefined for $z = 0$, and 0 for $z < 0$
• Derivative:

$g'(z) = \begin{cases} 0 & \text{if } z < 0 \\ 1 & \text{if } z > 0 \\ \text{undefined} & \text{if } z = 0 \end{cases}$

In practice we use:

$g'(z) \approx \begin{cases} 0 & \text{if } z < 0 \\ 1 & \text{if } z \geq 0 \end{cases}$
Leaky ReLU (Deep Learning Specialization, Andrew Ng)

$g(z) = \max(0.01z, z)$

• Range: (−∞, ∞)
• Solves the problem of diminished gradients for large values of $|z|$
• Gradient undefined for $z = 0$
• Derivative:

$g'(z) = \begin{cases} 0.01 & \text{if } z < 0 \\ 1 & \text{if } z > 0 \\ \text{undefined} & \text{if } z = 0 \end{cases}$

In practice we use:

$g'(z) \approx \begin{cases} 0.01 & \text{if } z < 0 \\ 1 & \text{if } z \geq 0 \end{cases}$
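A minimal NumPy sketch of these activations and their (sub)gradients, following the formulas above (the z ≥ 0 convention at zero matches the "in practice" approximations):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    a = sigmoid(z)
    return a * (1.0 - a)                    # g'(z) = a(1 - a)

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2            # g'(z) = 1 - tanh^2(z)

def relu(z):
    return np.maximum(0.0, z)

def d_relu(z):
    return (z >= 0).astype(float)           # 0 for z < 0, 1 for z >= 0

def leaky_relu(z):
    return np.maximum(0.01 * z, z)

def d_leaky_relu(z):
    return np.where(z >= 0, 1.0, 0.01)      # 0.01 for z < 0, 1 for z >= 0

z = np.array([-10.0, 0.0, 10.0])
print(sigmoid(z), d_sigmoid(z))             # sigmoid gradient vanishes for large |z|
print(relu(z), d_relu(z))                   # ReLU gradient stays 1 for z >= 0
```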
Table of Activation Functions
(Activation Functions: Sigmoid, ReLU, Leaky ReLU and Softmax basics for Neural Networks and Deep Learning, Himanshu Sharma,
https://medium.com/@himanshuxd/activation-functions-sigmoid-relu-leaky-relu-and-softmax-basics-for-neural-networks-and-deep-8d9c70eed91e)
Up next
• Forward and Backward passes
– Backpropagation
• Logistic Regression
• NN
Neural Networks
Training the NN – Backpropagation

Agha Ali Raza

CS535/EE514 – Machine Learning


Sources
• Machine Learning for Intelligent Systems, Kilian Weinberger, Cornell, Video Lectures 35-37,
  https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote20.html
• Deep Learning Specialization, Andrew Ng,
  https://www.coursera.org/specializations/deep-learning?utm_source=deeplearningai&utm_medium=institutions&utm_campaign=SocialYoutubeDLSC1W1L1
  – Video Lectures: C1W3L1 – C1W4L6:
    https://www.youtube.com/playlist?list=PLpFsSf5Dm-pd5d3rjNtIXUHT-v7bdaEIe
• A beginner's guide to deriving and implementing backpropagation, by Pranav Budhwant,
  https://link.medium.com/Zp3zxNWpf6
• Tensorflow playground: https://playground.tensorflow.org/
• Machine Learning Playground: https://ml-playground.com/#
How to train your NN?
Consider the following 2-layer NN:
• Number of nodes in layer $l$: $n^{[l]}$
• $a_i^{[l]}$: $i$th node in layer $l$
• $a^{[0]} = X$
• $a^{[L]} = h(x) = \hat{y}$

[Figure: inputs $x_1, x_2, x_3$ (i.e., $a_1^{[0]}, a_2^{[0]}, a_3^{[0]}$) feed hidden layer $a_1^{[1]}, \dots, a_4^{[1]}$, then layer $a_1^{[2]}, a_2^{[2]}, a_3^{[2]}$, then output $a_1^{[3]} = \hat{y}$.]
Logistic Regression Derivatives w.r.t. Cost $L$
(Deep Learning Specialization, Andrew Ng)

[Figure: $x_1, x_2$ with weights $w_1, w_2$ and bias $b$ feed $z = w_1 x_1 + w_2 x_2 + b$, then $a = \sigma(z)$, then the loss $L(a, y)$.]

$L(a, y) = -y \log a - (1 - y) \log(1 - a)$

$da = \frac{\partial L(a, y)}{\partial a} = -\frac{y}{a} + \frac{1 - y}{1 - a}$

$dz = \frac{\partial L(a, y)}{\partial z} = \frac{\partial L}{\partial a} \frac{\partial a}{\partial z}$

Since $\frac{\partial a}{\partial z} = a(1 - a)$ for $\sigma(z)$:

$dz = a - y$

$dw_i = \frac{\partial L(a, y)}{\partial w_i} = \frac{\partial L}{\partial a} \frac{\partial a}{\partial z} \frac{\partial z}{\partial w_i} = dz \cdot x_i$

Since $\frac{\partial z}{\partial w_i} = x_i$:

$dw_i = (a - y)\, x_i$

And $db = dz$, since $\frac{\partial z}{\partial b} = 1$.

Weight update:
$w_i := w_i - \alpha\, dw_i$
$b := b - \alpha\, db$
Vectorizing Logistic Regression – The Backward Pass
(Deep Learning Specialization, Andrew Ng)

$z = w_1 x_1 + w_2 x_2 + \dots + w_{n_x} x_{n_x} + b, \quad a = \sigma(z), \quad L(a, y)$

For one training instance: $dz^{(1)} = a^{(1)} - y^{(1)}$. For $m$ instances:

$dZ = A - Y$
$\text{dims}: (1, m) = (1, m) - (1, m)$

$db = \frac{1}{m} \sum_{i=1}^{m} dz^{(i)} \;\Rightarrow\; db = \frac{1}{m}\, \text{sum}(dZ), \quad \text{dims} = (1, 1)$

$dW = \frac{1}{m} X\, dZ^T, \quad \text{dims}: (n_x, 1) = (n_x, m)(m, 1)$

$W := W - \alpha\, dW$
$b := b - \alpha\, db$

Recall (per instance):
• $\frac{\partial a}{\partial z} = a(1 - a)$ for $\sigma(z)$
• $\frac{\partial z}{\partial w_i} = x_i$
• $da = -\frac{y}{a} + \frac{1 - y}{1 - a}$
• $dz = a - y$
• $dw_i = dz \cdot x_i = (a - y)\, x_i$
• $db = dz$
Vectorizing Logistic Regression – The Backward Pass
(Deep Learning Specialization, Andrew Ng)

One complete iteration of gradient descent:

Forward pass:
$Z = W^T X + b$
$A = \sigma(Z)$

Backward pass:
$dZ = A - Y$
$dW = \frac{1}{m} X\, dZ^T$
$db = \frac{1}{m}\, \text{sum}(dZ)$

Update:
$W := W - \alpha\, dW$
$b := b - \alpha\, db$

Recall (per instance):
• $\frac{\partial a}{\partial z} = a(1 - a)$ for $\sigma(z)$
• $\frac{\partial z}{\partial w_i} = x_i$
• $da = -\frac{y}{a} + \frac{1 - y}{1 - a}$
• $dz = a - y$
• $dw_i = dz \cdot x_i = (a - y)\, x_i$
• $db = dz$
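Putting the forward and backward passes together, a minimal NumPy training loop for vectorized logistic regression (toy data generated for illustration; shapes follow the slides, with X of shape (n_x, m) and Y of shape (1, m)):

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

rng = np.random.default_rng(0)
n_x, m = 2, 200
X = rng.normal(size=(n_x, m))                    # columns are training instances
Y = (X[0:1, :] + X[1:2, :] > 0).astype(float)    # toy linearly separable labels, shape (1, m)

W = np.zeros((n_x, 1))
b = 0.0
alpha = 0.5

for _ in range(1000):
    # Forward pass
    Z = W.T @ X + b                              # (1, m)
    A = sigmoid(Z)                               # (1, m)
    # Backward pass
    dZ = A - Y                                   # (1, m)
    dW = X @ dZ.T / m                            # (n_x, 1)
    db = np.sum(dZ) / m                          # scalar
    # Update
    W -= alpha * dW
    b -= alpha * db

print(np.mean((A > 0.5) == Y))                   # training accuracy, should be close to 1.0
```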
Now the NN?

[Figure: the same network — inputs $x_1, x_2, x_3$ (i.e., $a_1^{[0]}, a_2^{[0]}, a_3^{[0]}$), hidden layers $a_1^{[1]}, \dots, a_4^{[1]}$ and $a_1^{[2]}, a_2^{[2]}, a_3^{[2]}$, output $a_1^{[3]} = \hat{y}$.]
The forward pass
(A beginner's guide to deriving and implementing backpropagation, Pranav Budhwant, https://medium.com/binaryandmore/beginners-guide-to-deriving-and-implementing-backpropagation-e3c1a5a1e536)

[Figure: the same network, annotated with the forward-pass computations from input $a^{[0]} = x$ to output $a^{[3]} = \hat{y}$.]
The backward pass
(A beginner's guide to deriving and implementing backpropagation, Pranav Budhwant, https://medium.com/binaryandmore/beginners-guide-to-deriving-and-implementing-backpropagation-e3c1a5a1e536)
The backward pass (Layer $L$)
(A beginner's guide to deriving and implementing backpropagation, Pranav Budhwant, https://medium.com/binaryandmore/beginners-guide-to-deriving-and-implementing-backpropagation-e3c1a5a1e536)

[Figure: the same network, annotated with the gradients flowing back from the loss into the output layer $L$.]

Recall (per instance):
• $\frac{\partial a}{\partial z} = a(1 - a)$ for $\sigma(z)$
• $\frac{\partial z}{\partial w_i} = x_i$
• $\frac{\partial z}{\partial b} = 1$
• $da = -\frac{y}{a} + \frac{1 - y}{1 - a}$
• $dz = a - y$
• $dw_i = dz \cdot x_i = (a - y)\, x_i$
• $db = dz$
The backward pass (Notes)
(A beginner's guide to deriving and implementing backpropagation, Pranav Budhwant, https://medium.com/binaryandmore/beginners-guide-to-deriving-and-implementing-backpropagation-e3c1a5a1e536)
The backward pass (Layer $L-1$ and beyond)
(A beginner's guide to deriving and implementing backpropagation, Pranav Budhwant, https://medium.com/binaryandmore/beginners-guide-to-deriving-and-implementing-backpropagation-e3c1a5a1e536)

[Figure: the same network, annotated with the gradients flowing back through the hidden layers.]

Recall (per instance):
• $\frac{\partial a}{\partial z} = a(1 - a)$ for $\sigma(z)$
• $\frac{\partial z}{\partial w_i} = x_i$
• $da = -\frac{y}{a} + \frac{1 - y}{1 - a}$
• $dz = a - y$
• $dw_i = dz \cdot x_i = (a - y)\, x_i$
• $db = dz$

This gives a recursive structure that lets us incorporate $\frac{\partial C}{\partial z^{[l+1]}}$, which becomes $a^{[L]} - y$ for $\frac{\partial C}{\partial z^{[L]}}$.
The backward pass (Notes)
(A beginner's guide to deriving and implementing backpropagation, Pranav Budhwant, https://medium.com/binaryandmore/beginners-guide-to-deriving-and-implementing-backpropagation-e3c1a5a1e536)
The Forward and Backward Pass
(Deep Learning Specialization, Andrew Ng)

One iteration of the forward and backward passes for a 2-layered NN:

Forward propagation:
$Z^{[1]} = W^{[1]} X + b^{[1]}$        $(n^{[1]}, m) = (n^{[1]}, n^{[0]})(n^{[0]}, m) + (n^{[1]}, m)$
$A^{[1]} = g^{[1]}(Z^{[1]})$           $(n^{[1]}, m) = (n^{[1]}, m)$
$Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}$  $(n^{[2]}, m) = (n^{[2]}, n^{[1]})(n^{[1]}, m) + (n^{[2]}, m)$
$A^{[2]} = g^{[2]}(Z^{[2]})$           $(n^{[2]}, m) = (n^{[2]}, m)$

Back propagation:
$dZ^{[2]} = A^{[2]} - Y$               $(n^{[2]}, m) = (n^{[2]}, m) - (n^{[2]}, m)$, with $n^{[2]} = 1$ if the output layer has 1 node
$dW^{[2]} = \frac{1}{m}\, dZ^{[2]} A^{[1]T}$   $(n^{[2]}, n^{[1]}) = (n^{[2]}, m)(m, n^{[1]})$
$db^{[2]} = \frac{1}{m}\, \text{sum}(dZ^{[2]})$   $(n^{[2]}, m) = (n^{[2]}, m)$, just replicate the sum $m$ times
$dZ^{[1]} = W^{[2]T} dZ^{[2]} * g^{[1]\prime}(Z^{[1]})$   $(n^{[1]}, m) = (n^{[1]}, n^{[2]})(n^{[2]}, m) * (n^{[1]}, m)$
$dW^{[1]} = \frac{1}{m}\, dZ^{[1]} X^{T}$   $(n^{[1]}, n^{[0]}) = (n^{[1]}, m)(m, n^{[0]})$
$db^{[1]} = \frac{1}{m}\, \text{sum}(dZ^{[1]})$   $(n^{[1]}, m) = (n^{[1]}, m)$

Update:
$W^{[2]} := W^{[2]} - \alpha\, dW^{[2]}$
$b^{[2]} := b^{[2]} - \alpha\, db^{[2]}$
$W^{[1]} := W^{[1]} - \alpha\, dW^{[1]}$
$b^{[1]} := b^{[1]} - \alpha\, db^{[1]}$
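A minimal NumPy sketch of one such iteration for a 2-layered network with sigmoid activations in both layers (toy data and random initialization, purely for illustration; the sum over dZ is taken per row so db has one entry per node, matching the broadcasting used in the forward pass):

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

rng = np.random.default_rng(0)
n0, n1, n2, m = 3, 4, 1, 50                    # layer sizes and number of examples
alpha = 0.1

X = rng.normal(size=(n0, m))
Y = (X.sum(axis=0, keepdims=True) > 0).astype(float)   # toy labels, shape (1, m)

W1, b1 = rng.normal(size=(n1, n0)) * 0.01, np.zeros((n1, 1))
W2, b2 = rng.normal(size=(n2, n1)) * 0.01, np.zeros((n2, 1))

# Forward propagation
Z1 = W1 @ X + b1;  A1 = sigmoid(Z1)
Z2 = W2 @ A1 + b2; A2 = sigmoid(Z2)

# Back propagation (g' for the sigmoid is a(1 - a))
dZ2 = A2 - Y
dW2 = dZ2 @ A1.T / m
db2 = np.sum(dZ2, axis=1, keepdims=True) / m
dZ1 = (W2.T @ dZ2) * (A1 * (1 - A1))
dW1 = dZ1 @ X.T / m
db1 = np.sum(dZ1, axis=1, keepdims=True) / m

# Update
W2 -= alpha * dW2; b2 -= alpha * db2
W1 -= alpha * dW1; b1 -= alpha * db1
```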
The Vanishing Gradients Problem
(https://towardsdatascience.com/the-vanishing-gradient-problem-69bf08b15484)
• The problem: as more layers using certain activation functions are added to a neural
network, the gradients of the loss function approach zero, making the network hard to train.
• The reason: activation functions like the sigmoid and tanh squash a large input
space into a small output space, e.g., between 0 and 1 for the sigmoid.
  – A large change in the input of the sigmoid function causes only a small change in the output. Hence,
    the derivative becomes small.
  – The derivatives are small for large values of the input |x|.
The Vanishing Gradients Problem
(https://towardsdatascience.com/the-vanishing-gradient-problem-69bf08b15484)
• For shallow networks with only a few layers that use these activations, this
isn’t a big problem.
• However, when more layers are used, it can cause the gradient to be too
small for training to work effectively.
• Gradients of neural networks are found using backpropagation.
– Backpropagation finds the derivatives of the network by moving layer by layer
from the final layer to the initial one.
– By the chain rule, the derivatives of each layer are multiplied down the network
(from the final layer to the initial) to compute the derivatives of the initial
layers.
– When n hidden layers use an activation like the sigmoid function, n small
derivatives are multiplied together. Thus, the gradient decreases exponentially
as we propagate down to the initial layers.
• A small gradient means that the weights and biases of the initial layers
will not be updated effectively with each training session. Since these
initial layers are often crucial to recognizing the core elements of the
input data, it can lead to overall inaccuracy of the whole network.
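A quick numerical illustration of this chain-rule effect (a hypothetical chain of n sigmoid layers; this ignores the weight factors in the full chain rule and only shows how multiplying many small sigmoid derivatives, each at most 0.25, shrinks the product):

```python
import numpy as np

def d_sigmoid(z):
    a = 1.0 / (1.0 + np.exp(-z))
    return a * (1.0 - a)

rng = np.random.default_rng(0)
for n_layers in [2, 5, 10, 20]:
    z = rng.normal(size=n_layers)          # pre-activations at each layer (illustrative)
    grad = np.prod(d_sigmoid(z))           # chain rule: multiply n small derivatives
    print(n_layers, grad)                  # shrinks roughly exponentially with depth
```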
The Vanishing Gradients Problem
(https://towardsdatascience.com/the-vanishing-gradient-problem-69bf08b15484)

• The simplest solution is to use other activation functions, such as
  ReLU, which do not cause small derivatives.
• Another solution is batch normalization, i.e., normalize the input so
  that |x| doesn't reach the outer edges of the sigmoid function.
  – Normalize the input so that most of it falls in the region where the
    derivative isn't too small.
For more details please visit
https://aghaaliraza.com

Thank you!