009 Neural Networks - Complete
Introduction
• [25 4 3] is a 1x3 vector (a row vector)
• [4; −12; 18] is a 3x1 vector (a column vector)
• All vectors are also matrices (a vector is simply a matrix with a single row or a single column)
• Therefore, the rules that we develop for matrices also work for vectors
Addition/subtraction
• The two matrices to be added/subtracted must be the same size
https://www.mathsisfun.com/algebra/matrix-introduction.html
Negative, Scalar Multiplication, Transpose
https://www.mathsisfun.com/algebra/matrix-introduction.html
Matrix Multiplication
• For two matrices to be compatible for multiplication, the # of columns of the first must match the # of rows of the second, i.e., the inner dimensions must match
• E.g., if A is a matrix with dimensions m x n, and B is a matrix with dimensions p x q, then:
  – We can only form the product Y = AB if n = p
  – The dimensions of the product matrix Y would be m x q
  – Therefore, (m x n)(n x q) → m x q
• Each entry of the product is a "dot product": multiply matching members, then sum up:
  – (1, 2, 3) • (7, 9, 11) = 1×7 + 2×9 + 3×11 = 58
https://www.mathsisfun.com/algebra/matrix-introduction.html
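A quick NumPy check of the dimension rule and of the dot product example above (the two matrices are arbitrary illustrations; the dot product values are from the slide):

```python
import numpy as np

A = np.arange(6).reshape(2, 3)        # (m x n) = (2 x 3)
B = np.arange(12).reshape(3, 4)       # (n x q) = (3 x 4): inner dimensions match (3 = 3)
Y = A @ B                             # product has shape (m x q) = (2 x 4)
print(Y.shape)

# The dot product from the slide:
print(np.dot([1, 2, 3], [7, 9, 11]))  # 1*7 + 2*9 + 3*11 = 58
```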
Matrix: Hadamard Product
• The Hadamard product (element-wise product) takes two matrices of the same dimensions and produces another matrix of the same dimensions as the operands
• Each element (i, j) is the product of elements (i, j) of the original two matrices
• Denoted as ∘ or .*
https://en.wikipedia.org/wiki/Hadamard_product_(matrices)
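A short NumPy sketch contrasting the Hadamard (element-wise) product with ordinary matrix multiplication; the matrices are arbitrary examples:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

print(A * B)    # Hadamard product: element (i, j) is A[i, j] * B[i, j]
print(A @ B)    # ordinary matrix product, for comparison
```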
Sources
• Machine Learning for Intelligent Systems, Kilian Weinberger, Cornell, Video Lectures 35-37, https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote20.html
• Deep Learning Specialization, Andrew Ng, https://www.coursera.org/specializations/deep-learning?utm_source=deeplearningai&utm_medium=institutions&utm_campaign=SocialYoutubeDLSC1W1L1
  – Video Lectures C1W3L1 – C1W4L6: https://www.youtube.com/playlist?list=PLpFsSf5Dm-pd5d3rjNtIXUHT-v7bdaEIe
• A beginner's guide to deriving and implementing backpropagation, Pranav Budhwant, https://link.medium.com/Zp3zxNWpf6
• TensorFlow Playground: https://playground.tensorflow.org/
• Machine Learning Playground: https://ml-playground.com/#
The Artificial “Neuron”
• Remember logistic regression?
h(x) = σ(w^T x + b)
⇔
z = w^T x + b
h(x) = σ(z)
[Diagram: inputs x_1, x_2, x_3, …, x_n with weights w_1, w_2, w_3, …, w_n feed z = w^T x + b, then σ(z), producing h(x)]
Learning weights
• Contribution of w_1 to the cost L: ∂L/∂w_1
• Gradient descent update: w_1 := w_1 − α ∂L/∂w_1
  – w_1 := w_1 − α(negative gradient): increases w_1
  – w_1 := w_1 − α(positive gradient): decreases w_1
• Intuitively, each weight asks (a minimal sketch follows this slide):
  1. What is my contribution to this cost?
  2. Should I increase or decrease my value to lower the cost?
[Figure: the cost L plotted against w_1 – negative gradient to the left of the minimum, positive gradient to the right, gradient = 0 at the minimum]
[Diagram: the neuron – inputs x_1 … x_n, weights w_1 … w_n, z = w^T x + b, σ(z), h(x)]
• Example cost functions:
  – Squared error: L(h(x), y) = (1/(2m)) Σ_{i=1}^{m} (h(x^(i)) − y^(i))^2
  – Cross-entropy: L(h(x), y) = −(1/m) Σ_{i=1}^{m} [ y^(i) log h(x^(i)) + (1 − y^(i)) log(1 − h(x^(i))) ]
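To make the update rule concrete, here is a minimal, self-contained sketch (not from the slides) of gradient descent on a single weight for the squared-error cost of one linear neuron; the data and learning rate are made up for illustration:

```python
import numpy as np

# Toy data: one feature, one weight, no bias, squared-error cost
x = np.array([0.5, 1.0, 1.5, 2.0])
y = np.array([1.0, 2.0, 3.0, 4.0])   # generated by y = 2*x, so the optimum is w = 2

w, alpha = 0.0, 0.1                   # initial weight and (made-up) learning rate
for step in range(100):
    h = w * x                         # prediction h(x) = w*x
    grad = np.mean((h - y) * x)       # dL/dw for L = (1/(2m)) * sum((h - y)^2)
    w -= alpha * grad                 # negative gradient -> increase w, positive -> decrease
print(w)                              # converges towards 2.0
```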
Can we do better than linear decision boundaries?
(Image refs: https://towardsdatascience.com/understanding-support-vector-machine-part-2-kernel-trick-mercers-theorem-e1e6848c6c4d, https://towardsdatascience.com/the-kernel-trick-c98cdbcaeb3f, Weinberger, Lectures 35-37)
h(x) = σ(w^T x + b)
• Manipulate x:
  x → φ(x)
  h(x) = σ(w^T φ(x) + b)
• Kernels: predefined φ(x), e.g., φ(x) = x^2
Neural Networks (Weinberger, Lectures 35-37)
h(x) = σ(w^T x + b)
• Manipulate x:
  x → φ(x)
  h(x) = σ(w^T φ(x) + b)
• Neural networks: learn φ(x)
  – Learn w, b and the representation of the data φ(x)
  – φ(x) = g(v^T x + c), where g(z) is again a non-linear function like the Heaviside step function, sigmoid, hyperbolic tangent (tanh), Rectified Linear Unit (ReLU), Leaky ReLU, …
  h(x) = σ(w^T (g(v^T x + c)) + b)
• Why non-linear? If g were the identity (see the sketch after this slide):
  h(x) = w^T (v^T x + c) + b
       = w^T v^T x + w^T c + b
       = (w^T v^T) x + (w^T c + b)
• Only a linear function!
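A quick numerical check of that collapse (a sketch, not from the slides): with no non-linearity, two stacked "layers" are exactly one linear map. All matrices below are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(3,))                                # input
v, c = rng.normal(size=(4, 3)), rng.normal(size=(4,))   # "hidden layer" without activation
w, b = rng.normal(size=(4,)), rng.normal(size=())       # output layer

two_layer = w @ (v @ x + c) + b        # h(x) = w^T (v x + c) + b
collapsed = (w @ v) @ x + (w @ c + b)  # the same thing as a single linear function of x
print(np.allclose(two_layer, collapsed))   # True
```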
Neural Networks (Weinberger, Lectures 35-37)
• With some changes, NNs have been around for a while
• Multi-layer perceptron → Artificial Neural Network → Deep Learning
• Major changes
– GPUs – matrix multiplications
– Preference of activation functions
– Stochastic Gradient Descent (SGD) and Mini-batch Gradient
Descent
– Rebranding!
MLP: Who? Perceptron? I am an ANN!
φ(x) = [h_1(x); h_2(x); …; h_n(x)], where each h(x) is a linear classifier: h(x) = σ(v^T x + c)
The output is again a linear classifier on top of φ(x): H(x) = σ(w^T φ(x) + b)
[Diagram: inputs x_1 … x_k feed hidden units h_1(x) … h_n(x) (weight vectors v, u, k, …); the hidden outputs feed the output unit H(x) with weights w_1 … w_n. Layers: Input → Hidden Layer → Output]
Neural Networks (Weinberger, Lectures 35-37)
h(x) = σ(w^T g(v^T x + c) + b)
[Computation graph: inputs x_1 … x_n feed hidden units z_1 = v^T x + c, a_1 = σ(z_1); z_2 = v^T x + c, a_2 = σ(z_2); …; the hidden activations feed z = w^T a + b, h(x) = σ(z) = ŷ, which enters the cost L(ŷ, y) = (1/m) Σ_{i=1}^{m} l(ŷ, y)]
Intuition – How NNs work?
H(x) = σ(w^T φ(x) + b)
The NN learns φ(x):
φ(x) = [h_1(x); h_2(x); …; h_n(x)], where each h(x) is a linear classifier: h(x) = v^T x + c
[Diagram: x feeds hidden units h_1(x) … h_n(x), which feed H(x) with weights w_1 … w_n]
• The learned function becomes smoother with higher n
• NNs are universal approximators
More Intuition – Regression (Weinberger, Lectures 35-37)
ReLU
H(x) = w^T φ(x) + b
H(x) = w^T max(v^T x + c, 0) + b
L(H(x), y) = (1/(2m)) Σ_{i=1}^{m} (h(x^(i)) − y^(i))^2
[Figure: a 1-D regression target approximated by a sum of shifted ReLU pieces such as max(v^T x + c_1, 0) and max(v^T x + c_2, 0), giving a piece-wise linear fit]
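A minimal sketch of that idea (not from the slides): fit a 1-D curve with a handful of randomly placed ReLU features and a linear read-out solved by least squares. The number of features and the target function are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)
y = np.sin(x)                               # made-up regression target

# Hidden layer: n fixed random ReLU features phi(x) = max(v*x + c, 0)
n = 20
v = rng.normal(size=n)
c = rng.normal(size=n)
phi = np.maximum(np.outer(x, v) + c, 0.0)   # shape (200, n)

# Output layer: solve for w, b by least squares (instead of gradient descent)
design = np.hstack([phi, np.ones((len(x), 1))])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
w, b = coef[:-1], coef[-1]

pred = phi @ w + b                          # piece-wise linear approximation of sin(x)
print(np.max(np.abs(pred - y)))             # small with enough ReLU pieces
```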
Layers (Weinberger, Lectures 35-37)
h(x) = σ(w^T φ(x) + b)
φ(x) = σ(v^T x + c)
• To make this more powerful, either:
  – Make the matrix v really big, or
  – Add layers, like a Matryoshka doll:
    h(x) = σ(w^T φ(x) + b)
    φ(x) = σ(v^T φ'(x) + c)
    φ'(x) = σ(v'^T φ''(x) + c')
    φ''(x) = σ(v''^T x + c'')
• "Deep" learning
  – This has been known since the time of Frank Rosenblatt
  – Deep networks were not used for a long time because:
    • They used to take a long time to train
    • Any function that you can learn with a deep network, you can also learn with a shallow network
  – But in practice, a shallow network requires an exponentially wide matrix to compete with a deep network that does the same with fewer, smaller matrices
  – A deep network lets the earlier layers learn simple things (lines, hyperplanes, piece-wise linear functions) and the later layers build more complex, non-linear functions on top of them
Learning with Layers (Weinberger, Lectures 35-37)
h(x) = w^T φ(x)
φ(x) = σ(a(x)),   a(x) = v^T φ'(x)
φ'(x) = σ(a'(x)),  a'(x) = v'^T φ''(x)
φ''(x) = σ(a''(x)), a''(x) = v''^T x
• Forward pass and backpropagation
[Computation graph: inputs x_1 … x_n feed z_1 = v^T x + c, a_1 = σ(z_1); z_2 = v^T x + c, a_2 = σ(z_2); …; then z = w^T a + b, h(x) = σ(z) = ŷ, which enters the cost L(ŷ, y) = (1/m) Σ_{i=1}^{m} l(ŷ, y)]
Neural Networks - SGD (Weinberger, Lectures 35-37)
• Forget convex cost functions!
• Approximate the gradient of the cost function with one data point – in random
order
– Compared to batch gradient descent that averages over the whole dataset
• Clearly a bad approximation – and that’s why we use it
• It’s a bad approximation to find the exact minimum – which we no longer care
about
• Misses narrow (maybe deep) holes and converges onto wide holes
  – The narrow holes may not be in the same place in the test data
• Does not overfit the data easily
• Is faster, as SGD has already taken m update steps after one epoch
• The direction of the small steps remains correct on average
Neural Networks - SGD (Weinberger, Lectures 35-37)
• In practice, we use mini-batch gradient descent (a minimal sketch follows this slide)
• Initially use a very large learning rate – to prevent falling into narrow local minima
• Now you are jumping around in the wide holes
• Lower the learning rate – say by a factor of 10
• And the gradient descent will converge further
• We must remember that we have billions of local minima
  – Millions of wider holes
• In practice, reaching a decent minimum allows for a good enough error rate
  – A NN with a different local minimum may perform equally well
  – So, ensemble methods (that combine classifiers) work well
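A minimal sketch (not from the slides) of mini-batch SGD for a single logistic-regression neuron, with a learning-rate drop partway through; the data, batch size, and schedule are made-up illustrations:

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, m = 2, 1000
X = rng.normal(size=(n_x, m))                       # columns are training examples
Y = (X[0] + X[1] > 0).astype(float).reshape(1, m)   # made-up separable labels

w = np.zeros((n_x, 1)); b = 0.0
alpha, batch = 1.0, 32                              # start with a large learning rate

for epoch in range(20):
    if epoch == 10:
        alpha /= 10                                 # lower the learning rate to converge further
    idx = rng.permutation(m)                        # random order each epoch
    for start in range(0, m, batch):
        cols = idx[start:start + batch]
        Xb, Yb = X[:, cols], Y[:, cols]
        A = 1 / (1 + np.exp(-(w.T @ Xb + b)))       # forward pass on the mini-batch
        dZ = A - Yb                                 # gradient approximated on the batch only
        w -= alpha * (Xb @ dZ.T) / Xb.shape[1]
        b -= alpha * dZ.mean()

print(((1 / (1 + np.exp(-(w.T @ X + b))) > 0.5) == Y).mean())   # training accuracy
```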
Up next
• Formalize the notation
– Logistic regression – Vectorized version
– NN – Vectorized version
– Forward pass
• Activation functions
– Sigmoid, tanh, ReLU, leaky ReLU
– Pros and cons
– Gradients
• Forward and Backward passes
– Backpropagation
• Logistic Regression
• NN
Neural Networks
Notation and Forward Pass
Z = w^T X + b
dims: (1, m) = (1, n_x)(n_x, m) + (1, m)
Note: b would be implicitly converted to [b b b … b] (shape (1, m)) by NumPy's broadcasting.
Z = [z^(1) z^(2) z^(3) … z^(m)]
A = σ(Z)
A = [a^(1) a^(2) a^(3) … a^(m)]
Vectorizing Logistic Regression – The Forward Pass
(Deep Learning Specialization, Andrew Ng )
[Diagram: inputs x_1, x_2, …, x_{n_x} with weights w_1, w_2, …, w_{n_x} and bias b feed z = w_1 x_1 + w_2 x_2 + ⋯ + w_{n_x} x_{n_x} + b, then a = σ(z)]
A complete forward pass over all m examples:
Z = w^T X + b
A = σ(Z)
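A sketch of the vectorized forward pass with the shapes from the slide (the data here is random filler):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

n_x, m = 3, 5                       # 3 features, 5 training examples
rng = np.random.default_rng(0)
X = rng.normal(size=(n_x, m))       # each column is one training example
w = rng.normal(size=(n_x, 1))
b = 0.1                             # scalar; broadcast to (1, m) by NumPy

Z = w.T @ X + b                     # (1, m) = (1, n_x)(n_x, m) + (1, m)
A = sigmoid(Z)                      # (1, m), one activation per example
print(Z.shape, A.shape)
```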
Neural Networks: Terminology
(https://www.codeproject.com/Articles/1261763/ANNT-Feed-forward-fully-connected-neural-networks,
https://laptrinhx.com/deep-learning-using-keras-3760648021/ )
[Diagram: a fully connected network with inputs x_1, x_2, x_3, hidden-layer-1 activations a_1^[1] … a_4^[1], hidden-layer-2 activations a_1^[2] … a_3^[2], and output activation a_1^[3]]
• Notation: a_i^[l] is the activation of unit i in layer l; the input layer is a^[0] = x
NN - Forward pass with 1 training example (Deep Learning Specialization, Andrew Ng)
For each neuron i in layer 1 (the input is a^[0] = x):
z_1^[1] = w_1^[1]T a^[0] + b_1^[1],  a_1^[1] = σ(z_1^[1])
z_2^[1] = w_2^[1]T a^[0] + b_2^[1],  a_2^[1] = σ(z_2^[1])
z_3^[1] = w_3^[1]T a^[0] + b_3^[1],  a_3^[1] = σ(z_3^[1])
…
Note: w_i^[1] is the entire weight vector of this neuron (one row of the layer's weight matrix W^[1]).
b^[l] = [b_1^[l]; b_2^[l]; …; b_{n^[l]}^[l]],  dims = (n^[l], 1)
Forward pass
[Diagram: the 3–4–3–1 network: x_1, x_2, x_3 (i.e., a_1^[0], a_2^[0], a_3^[0]) → a_1^[1] … a_4^[1] → a_1^[2] … a_3^[2] → a_1^[3] = ŷ]
z^[1] = W^[1] a^[0] + b^[1]   (4,1) = (4,3)(3,1) + (4,1)
a^[1] = σ(z^[1])              (4,1) = (4,1)
z^[2] = W^[2] a^[1] + b^[2]   (3,1) = (3,4)(4,1) + (3,1)
a^[2] = σ(z^[2])              (3,1) = (3,1)
z^[3] = W^[3] a^[2] + b^[3]   (1,1) = (1,3)(3,1) + (1,1)
a^[3] = σ(z^[3])              (1,1) = (1,1)
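A sketch of that forward pass in NumPy for the 3–4–3–1 network above (random weights as placeholders), with the shapes from the slide asserted along the way:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
a0 = rng.normal(size=(3, 1))                       # one training example, x = a^[0]

# Layer sizes 3 -> 4 -> 3 -> 1; W^[l] has shape (n^[l], n^[l-1]), b^[l] has shape (n^[l], 1)
W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))
W2, b2 = rng.normal(size=(3, 4)), np.zeros((3, 1))
W3, b3 = rng.normal(size=(1, 3)), np.zeros((1, 1))

z1 = W1 @ a0 + b1; a1 = sigmoid(z1); assert a1.shape == (4, 1)
z2 = W2 @ a1 + b2; a2 = sigmoid(z2); assert a2.shape == (3, 1)
z3 = W3 @ a2 + b3; a3 = sigmoid(z3); assert a3.shape == (1, 1)
print(a3)                                          # y_hat for this example
```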
NN - Forward pass with m training examples (Deep Learning Specialization, Andrew Ng)
X^(1) → ŷ^(1) = a^[3](1)
X^(2) → ŷ^(2) = a^[3](2)
…
X^(m) → ŷ^(m) = a^[3](m)
Sigmoid (Deep Learning Specialization, Andrew Ng)
g(z) = 1 / (1 + e^(−z))
• Range: (0, +1)
• Gradient → 0 for larger values of |z|
• Outputs centered around 0.5
• Derivative:
  g'(z) = d/dz [1/(1 + e^(−z))] = (1/(1 + e^(−z)))(1 − 1/(1 + e^(−z)))
  g'(z) = g(z)(1 − g(z))
  If a = g(z), then g'(z) = a(1 − a)
• Examples:
  – z = 10:  g(z) ≈ 1,   g'(z) ≈ 0
  – z = −10: g(z) ≈ 0,   g'(z) ≈ 0
  – z = 0:   g(z) = 1/2, g'(z) = (1/2)(1 − 1/2) = 1/4
Hyperbolic Tangent (tanh) (Deep Learning Specialization, Andrew Ng )
g(z) = (e^z − e^(−z)) / (e^z + e^(−z))
• Shifted (and rescaled) version of the sigmoid
• Range: (−1, +1)
• Gradient → 0 for larger values of |z|
• Outputs centered around 0
• Derivative:
  g'(z) = d/dz tanh(z) = 1 − tanh^2(z)
  g'(z) = 1 − g(z)^2
  If a = g(z), then g'(z) = 1 − a^2
• Examples:
  – z = 10:  g(z) ≈ 1,  g'(z) ≈ 0
  – z = −10: g(z) ≈ −1, g'(z) ≈ 0
  – z = 0:   g(z) = 0,  g'(z) = 1
Rectified Linear Unit (ReLU) (Deep Learning Specialization, Andrew Ng )
g(z) = max(0, z)
• Range: [0, ∞)
• Solves the problem of diminished gradients for larger values of |z|
• Gradient undefined at z = 0, and 0 for z < 0
• Derivative:
  g'(z) = 0 if z < 0;  1 if z > 0;  undefined if z = 0
  In practice we use: g'(z) = 0 if z < 0;  1 if z ≥ 0
Leaky ReLU (Deep Learning Specialization, Andrew Ng )
g(z) = max(0.01z, z)
• Range: (−∞, ∞)
• Solves the problem of diminished gradients for larger values of |z|
• Gradient undefined at z = 0
• Derivative:
  g'(z) = 0.01 if z < 0;  1 if z > 0;  undefined if z = 0
  In practice we use: g'(z) = 0.01 if z < 0;  1 if z ≥ 0
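The four activations and the derivatives listed above, as a small NumPy sketch (using the practical convention g'(0) = 1 for ReLU and Leaky ReLU):

```python
import numpy as np

def sigmoid(z):        return 1 / (1 + np.exp(-z))
def d_sigmoid(z):      a = sigmoid(z); return a * (1 - a)

def tanh(z):           return np.tanh(z)
def d_tanh(z):         return 1 - np.tanh(z) ** 2

def relu(z):           return np.maximum(0.0, z)
def d_relu(z):         return np.where(z >= 0, 1.0, 0.0)        # convention: g'(0) = 1

def leaky_relu(z):     return np.maximum(0.01 * z, z)
def d_leaky_relu(z):   return np.where(z >= 0, 1.0, 0.01)

z = np.array([-10.0, 0.0, 10.0])
print(sigmoid(z), d_sigmoid(z))        # ~[0, 0.5, 1], ~[0, 0.25, 0]
print(tanh(z), d_tanh(z))              # ~[-1, 0, 1], ~[0, 1, 0]
print(relu(z), d_relu(z))              # [0, 0, 10], [0, 1, 1]
print(leaky_relu(z), d_leaky_relu(z))
```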
Table of Activation Functions
(Activation Functions: Sigmoid, ReLU, Leaky ReLU and Softmax basics for Neural Networks and Deep Learning, Himanshu Sharma, https://medium.com/@himanshuxd/activation-functions-sigmoid-relu-leaky-relu-and-softmax-basics-for-neural-networks-and-deep-8d9c70eed91e)
[Table not reproduced here; see the linked article]
Up next
• Forward and Backward passes
– Backpropagation
• Logistic Regression
• NN
Neural Networks
Training the NN – Backpropagation
[Diagram: the 3–4–3–1 network – x_1, x_2, x_3 (i.e., a_1^[0], a_2^[0], a_3^[0]) → a_1^[1] … a_4^[1] → a_1^[2] … a_3^[2] → a_1^[3] = ŷ]
Logistic Regression Derivatives w.r.t Cost 𝑳
(Deep Learning Specialization, Andrew Ng )
[Computation graph: x_1, x_2 with weights w_1, w_2 and bias b → z = w_1 x_1 + w_2 x_2 + b → a = σ(z) → L(a, y)]
L(a, y) = −y log(a) − (1 − y) log(1 − a)

dz = ∂L(a, y)/∂z = (∂L/∂a)(∂a/∂z)
  da = ∂L/∂a = −y/a + (1 − y)/(1 − a)
  Since ∂a/∂z = a(1 − a) for σ(z):
  dz = a − y

dw_i = ∂L(a, y)/∂w_i = (∂L/∂a)(∂a/∂z)(∂z/∂w_i)
  Since ∂z/∂w_i = x_i:
  dw_i = dz · x_i = (a − y) x_i

db = dz, since ∂z/∂b = 1

Weight update:
w_i := w_i − α dw_i
b := b − α db
Vectorizing Logistic Regression – The Backward Pass
(Deep Learning Specialization, Andrew Ng )
[Computation graph: x_1 … x_{n_x}, w_1 … w_{n_x}, b → z = w_1 x_1 + w_2 x_2 + ⋯ + w_{n_x} x_{n_x} + b → a = σ(z) → L(a, y)]
For one training instance: dz^(1) = a^(1) − y^(1)
For m instances (columns of X):
dZ = A − Y          dims: (1, m) = (1, m) − (1, m)
db = (1/m) Σ_{i=1}^{m} dz^(i)  ⇒  db = (1/m) sum(dZ),  dims = (1, 1)
dW = (1/m) X dZ^T   dims: (n_x, 1) = (n_x, m)(m, 1)
W := W − α dW
b := b − α db

where x = [x_1; x_2; …; x_{n_x}], w = [w_1; w_2; …; w_{n_x}], and b is a scalar.

Recap (scalar case):
• ∂a/∂z = a(1 − a) for σ(z)
• ∂z/∂w_i = x_i
• da = −y/a + (1 − y)/(1 − a)
• dz = a − y
• dw_i = dz · x_i = (a − y) x_i
• db = dz
Vectorizing Logistic Regression – The Backward Pass
(Deep Learning Specialization, Andrew Ng )
[Computation graph: x_1 … x_{n_x}, w_1 … w_{n_x}, b → z = w_1 x_1 + w_2 x_2 + ⋯ + w_{n_x} x_{n_x} + b → a = σ(z) → L(a, y)]
One complete iteration of gradient descent:

Forward pass:
Z = W^T X + b
A = σ(Z)

Backward pass:
dZ = A − Y
dW = (1/m) X dZ^T
db = (1/m) sum(dZ)

Update:
W := W − α dW
b := b − α db

Recap (scalar case):
• ∂a/∂z = a(1 − a) for σ(z)
• ∂z/∂w_i = x_i
• da = −y/a + (1 − y)/(1 − a)
• dz = a − y
• dw_i = dz · x_i = (a − y) x_i
• db = dz
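Putting the forward and backward passes together, here is a minimal sketch of the full gradient-descent loop for vectorized logistic regression (toy data, made-up learning rate and iteration count):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
n_x, m = 2, 500
X = rng.normal(size=(n_x, m))                      # (n_x, m): columns are examples
Y = (X[0] - X[1] > 0).astype(float).reshape(1, m)  # made-up labels

W = np.zeros((n_x, 1)); b = 0.0; alpha = 0.5

for it in range(1000):
    # Forward pass
    Z = W.T @ X + b              # (1, m)
    A = sigmoid(Z)               # (1, m)
    # Backward pass
    dZ = A - Y                   # (1, m)
    dW = (X @ dZ.T) / m          # (n_x, 1)
    db = dZ.sum() / m            # scalar
    # Update
    W -= alpha * dW
    b -= alpha * db

print(((A > 0.5) == Y).mean())   # training accuracy approaches 1 on this toy problem
```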
Now the NN?
[Diagram: the 3–4–3–1 network – x_1, x_2, x_3 (i.e., a_1^[0], a_2^[0], a_3^[0]) → a_1^[1] … a_4^[1] → a_1^[2] … a_3^[2] → a_1^[3] = ŷ]
The forward pass
(A beginner’s guide to deriving and implementing backpropagation
Pranav Budhwant, https://medium.com/binaryandmore/beginners-guide-to-deriving-and-implementing-backpropagation-e3c1a5a1e536)
[Diagram: the 3–4–3–1 network – the forward pass propagates x = a^[0] through the layers to ŷ = a^[3]]
The backward pass
(A beginner’s guide to deriving and implementing backpropagation
Pranav Budhwant, https://medium.com/binaryandmore/beginners-guide-to-deriving-and-implementing-backpropagation-e3c1a5a1e536)
The backward pass (Layer 𝑳)
(A beginner’s guide to deriving and implementing backpropagation
Pranav Budhwant, https://medium.com/binaryandmore/beginners-guide-to-deriving-and-implementing-backpropagation-e3c1a5a1e536)
[Diagram: the 3–4–3–1 network – the backward pass starts at the output layer L and propagates gradients back towards the input]
Recap (logistic-regression derivatives, reused at the output layer):
• ∂a/∂z = a(1 − a) for σ(z)
• ∂z/∂w_i = x_i
• ∂z/∂b = 1
• da = −y/a + (1 − y)/(1 − a)
• dz = a − y
• dw_i = dz · x_i = (a − y) x_i
• db = dz
The backward pass (Notes)
(A beginner’s guide to deriving and implementing backpropagation
Pranav Budhwant, https://medium.com/binaryandmore/beginners-guide-to-deriving-and-implementing-backpropagation-e3c1a5a1e536)
The backward pass (Layer 𝐿 − 1 and beyond)
(A beginner’s guide to deriving and implementing backpropagation
Pranav Budhwant, https://medium.com/binaryandmore/beginners-guide-to-deriving-and-implementing-backpropagation-e3c1a5a1e536)
[Diagram: the 3–4–3–1 network – gradients flow from layer L back through layer L − 1 and beyond]
• Recursive structure: the backward pass reuses ∂C/∂z^[l+1] to compute ∂C/∂z^[l]; at the output layer, ∂C/∂z^[L] becomes a^[L] − y
Recap (scalar case):
• ∂a/∂z = a(1 − a) for σ(z)
• ∂z/∂w_i = x_i
• da = −y/a + (1 − y)/(1 − a)
• dz = a − y
• dw_i = dz · x_i = (a − y) x_i
• db = dz
One Iteration of the Forward and Backward Passes for a 2-Layered NN:

Forward Propagation:
Z^[1] = W^[1] X + b^[1]        (n^[1], m) = (n^[1], n^[0])(n^[0], m) + (n^[1], m)
A^[1] = g^[1](Z^[1])           (n^[1], m) = (n^[1], m)
Z^[2] = W^[2] A^[1] + b^[2]    (n^[2], m) = (n^[2], n^[1])(n^[1], m) + (n^[2], m)
A^[2] = g^[2](Z^[2])           (n^[2], m) = (n^[2], m)

Back Propagation:
dZ^[2] = A^[2] − Y             (n^[2], m) = (n^[2], m) − (n^[2], m), with n^[2] = 1 if the output layer has 1 node
dW^[2] = (1/m) dZ^[2] A^[1]T   (n^[2], n^[1]) = (n^[2], m)(m, n^[1])
db^[2] = (1/m) sum(dZ^[2])     (n^[2], 1): sum over the m columns, broadcast back across examples when used
dZ^[1] = W^[2]T dZ^[2] ∗ g^[1]'(Z^[1])   (n^[1], m) = (n^[1], n^[2])(n^[2], m) ∗ (n^[1], m), where ∗ is the element-wise (Hadamard) product
dW^[1] = (1/m) dZ^[1] X^T      (n^[1], n^[0]) = (n^[1], m)(m, n^[0])
db^[1] = (1/m) sum(dZ^[1])     (n^[1], 1)

Update:
W^[2] := W^[2] − α dW^[2]
b^[2] := b^[2] − α db^[2]
W^[1] := W^[1] − α dW^[1]
b^[1] := b^[1] − α db^[1]
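A sketch of those equations in NumPy for one iteration of a 2-layer network with a sigmoid output (layer sizes and data are arbitrary placeholders; g^[1] is taken to be tanh here):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
n0, n1, n2, m = 3, 4, 1, 200                            # layer sizes and number of examples
X = rng.normal(size=(n0, m))
Y = (X.sum(axis=0, keepdims=True) > 0).astype(float)    # made-up labels, shape (1, m)

W1, b1 = rng.normal(size=(n1, n0)) * 0.01, np.zeros((n1, 1))
W2, b2 = rng.normal(size=(n2, n1)) * 0.01, np.zeros((n2, 1))
alpha = 0.5

# Forward propagation
Z1 = W1 @ X + b1                  # (n1, m)
A1 = np.tanh(Z1)                  # g^[1] = tanh (an assumption; any activation works)
Z2 = W2 @ A1 + b2                 # (n2, m)
A2 = sigmoid(Z2)                  # g^[2] = sigmoid for a binary output

# Back propagation
dZ2 = A2 - Y                                  # (n2, m)
dW2 = (dZ2 @ A1.T) / m                        # (n2, n1)
db2 = dZ2.sum(axis=1, keepdims=True) / m      # (n2, 1)
dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)            # (n1, m); tanh'(Z1) = 1 - A1^2
dW1 = (dZ1 @ X.T) / m                         # (n1, n0)
db1 = dZ1.sum(axis=1, keepdims=True) / m      # (n1, 1)

# Update
W2 -= alpha * dW2; b2 -= alpha * db2
W1 -= alpha * dW1; b1 -= alpha * db1
```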
The Vanishing Gradients Problem
(https://towardsdatascience.com/the-vanishing-gradient-problem-69bf08b15484)
• The problem: As more layers using certain activation functions are added to a neural network, the gradient of the loss function approaches zero, making the network hard to train.
• The reason: Activation functions like the sigmoid and tanh squish a large input space into a small output space (e.g., between 0 and 1 for the sigmoid).
  – A large change in the input of the sigmoid function causes only a small change in the output. Hence, the derivative is small.
  – The derivatives are small for large values of the input |x|.
The Vanishing Gradients Problem
(https://towardsdatascience.com/the-vanishing-gradient-problem-69bf08b15484)
• For shallow networks with only a few layers that use these activations, this
isn’t a big problem.
• However, when more layers are used, it can cause the gradient to be too
small for training to work effectively.
• Gradients of neural networks are found using backpropagation.
– Backpropagation finds the derivatives of the network by moving layer by layer
from the final layer to the initial one.
– By the chain rule, the derivatives of each layer are multiplied down the network
(from the final layer to the initial) to compute the derivatives of the initial
layers.
– When n hidden layers use an activation like the sigmoid function, n small
derivatives are multiplied together. Thus, the gradient decreases exponentially
as we propagate down to the initial layers.
• A small gradient means that the weights and biases of the initial layers will not be updated effectively with each training iteration. Since these initial layers are often crucial to recognizing the core elements of the input data, this can lead to overall inaccuracy of the whole network. (A small numerical illustration follows.)
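As a small numerical illustration (not from the source article): the chain-rule factor contributed by each sigmoid layer is at most 0.25, so the product of those derivatives shrinks roughly exponentially with depth. The depth, weights, and inputs below are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
depth = 20
z = 0.5
grad_product = 1.0
for layer in range(depth):
    a = sigmoid(z)
    grad_product *= a * (1 - a)     # each sigmoid layer contributes sigma'(z) <= 0.25
    w = rng.normal()                # made-up scalar weight for the next layer
    grad_product *= w               # the chain rule also multiplies by the weight
    z = w * a
print(grad_product)                 # typically a vanishingly small number after 20 layers
```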
http://aghaaliraza.com
Thank you!