Topic 4 (Part 2) - NN Learning
Outline
❖ Neural Network Learning Problem
❖ Neural Network Learning Problem Setup
❖ Gradient Descent
❖ Training Neural Nets through Gradient Descent
❖ Stochastic and Mini-Batch Gradient Descent
❖ Avoiding Overfitting during Neural Network Training: Dropout
Neural Networks
❖ Neural networks are universal function approximators; they can:
1. Model any Boolean function
2. Model any classification boundary
3. Model any continuous-valued function
[Figure: a layered network with an input layer, 1st, 2nd, …, nth hidden layers, and an output layer]
Neural Network Learning
❖ Neural Network Architecture
➢ Involves determining the number of layers in the network, the number of neurons in each layer, and how the neurons are connected
❖ Feed-Forward Neural Network
➢ Has no loops: Neuron outputs do not feed back to their inputs directly
or indirectly
❖ Neural Network Parameters
➢ The weights and biases
❖ Neural Network Learning
➢ Determining the values of the network parameters such that the
network computes the desired function
Neural Network Learning
❖ The overall goal of training a neural network is to model (approximate) some function g(X)
❖ To model g(X), we need to derive the parameters of the network, i.e., the weights within the NN, such that if we apply the input data (X) to the NN we get something close to g(X)
➢ Formally, we say that the NN is a function f(X, W) that maps the inputs X to predict g(X) using the weights W
❖ To learn the neural network weights (W) that model g(X), we would need the function g(X) to be fully specified
➢ This is not feasible in practice, or sometimes we do not even know what the actual g(X) function is!
Neural Network Learning
❖ The solution is to sample g(X) and use these samples to estimate a new function Y = f(X, W) that is close to g(X)
➢ In the sampling process, we get input–output pairs for several samples of the input: (Xi, di = g(Xi))
➢ After obtaining enough samples, we estimate the network parameters (W) to “fit” the training points exactly, with the hope that the function will not change much between these samples
➢ Note: we will need many sample points to make sure that Y faithfully represents g(X), hence the need for big data to train neural networks
MLP Neural Network Learning
❖ Given a neural network architecture and a training set of input-
output pairs (X1, d1), (X2, d2), …, (XN, dN)
➢ di is the desired output of the network in response to Xi
❖ We need to find the network parameters, W, such that the network produces the desired output, or a close approximation of it, for each training input
Neural Network Learning
❖ To determine how good our estimating function Y = f(X, W) is at approximating g(X), we calculate the divergence between g(X) and Y, then attempt to make the divergence as small as possible
❖ What we are looking for are the values of W such that the divergence is minimized; formally we write this as:

$$\widehat{W} = \arg\min_{W} \int_{X} div\big(f(X, W), g(X)\big)\, dX$$

❖ div() is a divergence function that goes to zero when f(X, W) = g(X)
❖ A divergence function represents an error function, such as the squared Euclidean distance:

$$div(Y, d) = \frac{1}{2}\lVert Y - d \rVert^2 = \frac{1}{2}\sum_{i}(y_i - d_i)^2$$
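As a concrete illustration (not from the slides), a minimal NumPy sketch of this squared Euclidean divergence and its gradient with respect to the network output Y:

```python
import numpy as np

def squared_euclidean_div(Y, d):
    """div(Y, d) = 1/2 * ||Y - d||^2 for one training sample."""
    diff = np.asarray(Y, dtype=float) - np.asarray(d, dtype=float)
    return 0.5 * np.dot(diff, diff)

def squared_euclidean_grad(Y, d):
    """Gradient of the divergence with respect to the network output Y: (Y - d)."""
    return np.asarray(Y, dtype=float) - np.asarray(d, dtype=float)

# Example: a 3-class one-hot target and a network output
Y = np.array([0.7, 0.2, 0.1])
d = np.array([1.0, 0.0, 0.0])
print(squared_euclidean_div(Y, d))   # 0.5 * (0.09 + 0.04 + 0.01) = 0.07
print(squared_euclidean_grad(Y, d))  # [-0.3  0.2  0.1]
```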
Neural Network Learning Problem Setup
❖ We assume a “layered” network for simplicity
➢ The input layer has no neurons – the “layer” simply refers to the inputs x1, …, xD
❖ We refer to the outputs y1, …, yL as the output layer
❖ Intermediate layers are “hidden” layers
[Figure: network with an input layer, 1st, 2nd, and 3rd hidden layers, and an output layer]
Neural Network Learning Problem Setup
❖ The input layer is the 0th layer; the input to the network is yi(0) = xi
❖ We will represent the output of the ith perceptron of the kth layer as yi(k)
[Figure: the same layered network, with the layer and neuron indices marked]
Neural Network Learning Problem Setup
❖ Examples:
➢ w21(2) is shown in red
➢ y2(0) is shown in blue
➢ y1(4) is shown in brown
[Figure: the layered network with the weight w21(2) and the outputs y2(0) and y1(4) highlighted]
Neural Network Learning Problem Setup
❖ Vector Notation:
➢ Xn = [xn1, xn2, …, xnD] is the nth input vector
➢ dn = [dn1, dn2, …, dnL] is the nth vector of desired output
➢ Yn = [yn1, yn2, …, ynL] is the nth vector of actual output
❖ Input representation: vectors of numbers, e.g., a vector of pixel values, or a real-valued vector representing text
❖ Output representation
➢ A single real value or a vector of real values
➢ For binary classification: one binary output 1/0, e.g., cat/no cat
➢ For multi-class output: a one-hot representation is used, e.g., [1 0 0], [0 1 0], [0 0 1]
Neural Network Learning Problem Setup
❖ Given a neural network architecture and a training set of input-
output pairs (X1, d1), (X2, d2), …, (XT, dT)
❖ The error on the ith instance is div(Yi, di)
❖ Optimize network parameters to minimize the average error over
all training inputs
❖ The average error over the training set:

$$Err = \frac{1}{T}\sum_{i} div(Y_i, d_i)$$

where T is the number of input samples
A Simple Network Example
Neural Networks Learning
❖ To minimize the error, we need to determine for
each sample how the weights need to be
adjusted to reduce the error
❖ This can be done by examining the derivative of
the error function with respect to each weight
and determine whether the weight needs to be
increased or decreased
❖ The derivative of the error function with respect to a given weight determines the slope [written as dy/dx or as f′(x)]
➢ If the slope is positive, decreasing the weight will decrease the error
➢ If the slope is negative, increasing the weight will decrease the error
Iterative Solutions for Function Minimization
❖ Often it is not possible to simply solve f′(x) = 0
❖ The function to minimize may have an intractable form
❖ In these situations, iterative solutions are used
➢ Start from an initial guess X0 for the optimal X
➢ Update the guess towards a (hopefully) “better” value of f(X)
➢ Stop when f(X) no longer decreases
Gradient Descent
❖ Iterative solution:
➢ Start at some point
➢ Find direction in which to shift this point to decrease error
➢ This can be found from the derivative of the function
▪ A positive derivative → moving left decreases error
▪ A negative derivative → moving right decreases error
➢ Shift point in this direction
Gradient Descent – X ∈ ℝ
❖ Initialize X^(0), k = 0
❖ While |f(X^(k+1)) − f(X^(k))| > ε:
    X^(k+1) = X^(k) − η^(k) f′(X^(k))
    k = k + 1
❖ where:
➢ X^(k) is the kth estimate of X
➢ η^(k) is the kth step size – if the step size is fixed, then η^(k) = η
❖ There are many ways of choosing the step size η^(k); it is usually iteration dependent
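A minimal Python sketch of this loop (illustrative only; the function f, its derivative, the starting point, and the stopping threshold are stand-ins):

```python
def gradient_descent_1d(f, f_prime, x0, eta=0.1, eps=1e-6, max_iters=10_000):
    """Iteratively move x against the derivative until f stops decreasing noticeably."""
    x = x0
    for _ in range(max_iters):
        x_new = x - eta * f_prime(x)          # step in the direction that decreases f
        if abs(f(x_new) - f(x)) <= eps:       # stop when the improvement is negligible
            return x_new
        x = x_new
    return x

# Example: minimize f(x) = (x - 3)^2, whose minimum is at x = 3
print(gradient_descent_1d(lambda x: (x - 3) ** 2, lambda x: 2 * (x - 3), x0=0.0))
```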
Gradient Descent Example
❖ Given the function f(x1, x2, x3) = x1² + x1(1 − x2) + x2² − x2x3 + x3² + x3, apply the gradient descent algorithm to compute the next value for minimizing the function, starting with the initial solution (x1, x2, x3) = (1, 1, 1) and using a learning rate η = 0.3:

∂f/∂x1 = 2x1 + (1 − x2) = 2  →  x1 = x1 − η·∂f/∂x1 = 1 − 0.3·2 = 0.4
∂f/∂x2 = −x1 + 2x2 − x3 = 0  →  x2 = x2 − η·∂f/∂x2 = 1
∂f/∂x3 = −x2 + 2x3 + 1 = 2  →  x3 = x3 − η·∂f/∂x3 = 1 − 0.3·2 = 0.4
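A short sketch (assuming NumPy) that reproduces this single update step numerically:

```python
import numpy as np

def grad_f(x):
    """Gradient of f(x1,x2,x3) = x1^2 + x1(1 - x2) + x2^2 - x2*x3 + x3^2 + x3."""
    x1, x2, x3 = x
    return np.array([2*x1 + 1 - x2,
                     -x1 + 2*x2 - x3,
                     -x2 + 2*x3 + 1])

x = np.array([1.0, 1.0, 1.0])
eta = 0.3
x_next = x - eta * grad_f(x)
print(x_next)   # [0.4 1.  0.4], matching the hand calculation above
```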
Gradient Descent Example
❖ For f(x1, x2, x3) above, the behavior of gradient descent depends on the step size η relative to the optimal step size η_opt:
➢ η = η_opt: the optimum is reached quickly
➢ η < η_opt: too many steps are needed to reach the optimum solution
➢ η > 2η_opt: we get divergence
[Figure: plots of the iterates for the different step-size regimes]
Gradient Descent Example
❖ In most real-life applications, the
error function is not convex. What to
do then?
❖ Consider the function:
   f(x) = (x² cos(x) − x) / 10
❖ This function has many local minima.
❖ Gradient descent will find different
ones depending on our initial guess
and our step size
Gradient Descent Example
❖ If we choose x0 = 6 and η = 0.2, for example, gradient descent moves as shown in the graph. The first point is x0, and the gradient suggests moving to the left. After only 10 steps, we have converged to the local minimum near x = 4
Gradient Descent Example
❖ We then get stuck in the local minimum
❖ We can choose η > 2η_opt to get a new point outside the local minimum valley, and then run gradient descent again to find other local minima
❖ We repeat this multiple times and then choose the minimum that is the lowest
❖ Note: in real applications, and depending on the complexity of the error curve, you may not reach the global minimum value
Neural Network (NN) Units
❖ Each neuron in the NN is represented by a perceptron with the following setting:
1. A set of weights applied to the inputs to form the weighted sum

   $$Z = \sum_{i=1}^{N} x_i w_i + b, \qquad y = f\Big(\sum_{i=1}^{N} x_i w_i + b\Big)$$

2. A bias representing a threshold to trigger the perceptron
   ▪ The bias can be viewed as the weight of another input with fixed value 1:

   $$Z = \sum_{i=1}^{N+1} x_i w_i$$

3. Activation functions are not necessarily threshold functions
[Figure: a perceptron with inputs x1 … xN, weights w1 … wN, bias b, an activation function, and output y]
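As an illustration (not part of the original slides), a minimal NumPy sketch of a single perceptron's forward computation, with the bias either kept separate or folded in as an extra weight on a constant input of 1; the tanh activation is just a placeholder:

```python
import numpy as np

def perceptron_forward(x, w, b, activation=np.tanh):
    """y = f(sum_i x_i * w_i + b)."""
    z = np.dot(x, w) + b
    return activation(z)

def perceptron_forward_folded_bias(x, w_with_bias, activation=np.tanh):
    """Same computation with the bias treated as the weight of a fixed input 1."""
    x_aug = np.append(x, 1.0)                 # append the constant input
    return activation(np.dot(x_aug, w_with_bias))

x = np.array([0.5, -1.0])
w = np.array([2.0, 0.5])
b = -0.25
print(perceptron_forward(x, w, b))
print(perceptron_forward_folded_bias(x, np.append(w, b)))  # identical output
```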
Backpropagation
❖ During the training process, the output of the neural network is compared to
the expected output, and the difference between the two is calculated using an
error function. Backpropagation is then used to propagate this error backward
through the network and update the weights and biases to minimize the error
function
❖ Backpropagation is used in gradient descent based neural network learning to
calculate the gradient of the error function with respect to the weights and
biases of the network
❖ This is achieved by propagating the gradient of the error function through
network parameters using the chain rule of calculus. This gradient is then used
to update the weights and biases through an optimization algorithm such as
gradient descent
❖ Backpropagation is a crucial component of training a neural network, and it
allows the network to learn from its mistakes and improve its performance over
time
Activation Functions
❖ Why do we need Activation Functions in Neural Networks?
➢ The purpose of an activation function is to add non-linearity to the
neural network
❖ Let’s suppose we have a neural network working with linear activation
functions. In that case:
1. Every neuron will only be performing a linear transformation on the
inputs using the weights and biases
2. All layers will behave in the same way because the composition of two
linear functions is a linear function itself
3. Learning complex tasks is impossible, and our model would be just a linear regression model
Activation Functions
❖Linear activation functions:
1. It’s not possible to use backpropagation as the derivative of the
function is a constant and has no relation to the input x
2. A linear activation function effectively turns the neural network into a single layer: all layers of the network collapse into one
Activation Functions (Sigmoid/logistic Activation)
❖ The limitations of sigmoid function:
➢ The derivative of the function is: f′(z) = f(z)(1 − f(z))
➢ As we can see from the derivative figure, the
gradients are only significant for range -4 to 4
➢ The output of the sigmoid function is not
symmetric around zero → the output of all
the neurons will be of the same sign!
▪ This makes the training of the neural network more
difficult and unstable
➢ Vanishing Gradient problem (discussed next)
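A small sketch (illustrative, not from the slides) of the sigmoid and its derivative, showing how quickly the gradient shrinks outside roughly the range -4 to 4:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)        # f'(z) = f(z) * (1 - f(z))

for z in [0.0, 2.0, 4.0, 10.0]:
    print(z, sigmoid_grad(z))   # 0.25, ~0.105, ~0.018, ~4.5e-05: the gradient vanishes for large |z|
```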
Activation Functions (Sigmoid/logistic Activation)
❖ Vanishing Gradient problem in Sigmoid:
➢ Sigmoid squishes the entire input space into a small
output space between 0 and 1
➢ Therefore, a large change in the input of the sigmoid function causes only a small change in the output; hence, the derivative becomes small
➢ For shallow networks with only a few hidden layers that use these activations, this isn’t a big problem. However, when more layers are used, it can cause the gradient to become too small for training to work effectively
Activation Functions (Tanh Activation)
❖ Tanh Function (Hyperbolic Tangent)
➢ Tanh function is very similar to the
sigmoid/logistic activation function, with
the difference in output range of -1 to 1
❖ Advantages:
1. The output is Zero centered; hence we can
map the output values as strongly
negative, neutral, or strongly positive.
2. Usually used in hidden layers of a neural network, as its values lie between -1 and 1; this helps center the data and makes learning for the next layer much easier
Activation Functions (Tanh Activation)
❖ The derivative of the function is: f′(z) = 1 − f²(z)
❖ Disadvantages:
➢ Tanh faces the problem of vanishing gradients
like the sigmoid activation function
➢ The gradient of the tanh function is much
steeper as compared to the sigmoid function
Activation Functions (ReLU Activation)
❖ ReLU stands for Rectified Linear Unit
❖ ReLU has a derivative function and allows for
backpropagation while simultaneously
making it computationally efficient
❖ ReLU function does not activate all the
neurons at the same time, i.e., the neurons
will only be deactivated if the output of the
linear transformation is less than 0
❖ Mathematically ReLU is expressed as:
𝑓 𝑧 = max(0, 𝑧)
Activation Functions (ReLU Activation)
❖ The derivative of the function is: f′(z) = 1 if z ≥ 0, and 0 otherwise
❖ The advantages of using ReLU are:
1. ReLU is more computationally efficient when compared
to the sigmoid and tanh functions since only a certain
number of neurons are activated in a neural network
2. ReLU accelerates the convergence of gradient descent
towards the global minimum of the loss function due
to its linearity
❖ Mostly used as an activation function for the hidden
layers
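A minimal sketch (illustrative only) of ReLU and its derivative as used in backpropagation:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # Subgradient convention matching the slide: 1 for z >= 0, 0 otherwise
    return (z >= 0).astype(float)

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z))       # [0.  0.  0.  1.5]
print(relu_grad(z))  # [0. 0. 1. 1.]
```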
Activation Functions (ReLU Activation)
❖ Limitations:
➢ All the negative input values become zero
immediately, which decreases the model’s
ability to fit or train from the data properly
➢ The negative side of the graph makes the
gradient value zero. Due to this reason, during
the backpropagation process, the weights and
biases for some neurons are not updated. This
can create dead neurons which never get
activated
Activation Functions (Softmax Activation)
❖ Softmax function is described as a combination of multiple
sigmoids
❖ It calculates the relative probabilities. Like the sigmoid/logistic
activation function, the SoftMax function returns the probability of
each class
❖ It is mostly used as an activation function for the last layer of the
neural network in the case of multi-class classification
❖ Mathematically it can be represented as:

$$Softmax(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$$
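A short sketch (not from the slides) of the softmax computation; subtracting the maximum score before exponentiating is a common numerical-stability trick added here, not something stated on the slide:

```python
import numpy as np

def softmax(z):
    """Softmax(z_i) = exp(z_i) / sum_j exp(z_j), computed in a numerically stable way."""
    z = np.asarray(z, dtype=float)
    shifted = z - z.max()            # does not change the result, avoids overflow
    e = np.exp(shifted)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))               # ~[0.659 0.242 0.099], the probabilities sum to 1
```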
How to Choose the Right Activation Function?
❖ Match your activation function for your output layer based on the
type of prediction problem that you are solving
❖ Begin with using the ReLU activation function and then move over
to other activation functions if ReLU doesn’t provide optimum
results
❖ Guidelines of choosing Activation Functions:
➢ ReLU activation function should only be used in the hidden layers
➢ Sigmoid/Logistic and Tanh functions should not be used in hidden
layers for high depth networks as they make the model more
susceptible to problems during training (due to vanishing
gradients)
Activation Functions Summary
❖ Activation Functions are used to introduce non-linearity in the
network
❖ A neural network will almost always have the same activation
function in all hidden layers. This activation function should be
differentiable so that the parameters of the network are learned in
backpropagation
❖ ReLU is the most commonly used activation function for hidden
layers
❖ While selecting an activation function, you must consider the
problems it might face such as vanishing gradients
Typical Problem Statement: Classification
Binary Classification
Multi-Class Classification
Examples of Divergence Functions
❖ For real-valued output vectors, the (scaled) divergence representing the squared Euclidean distance between the true and desired output is popular:

$$div(Y, d) = \frac{1}{2}\sum_{i}(y_i - d_i)^2$$

Gradient: ∂div(Y, d)/∂y_i = y_i − d_i
Perceptrons with Differentiable Activation Functions
❖ Activation functions need to be differentiable
➢ The activation f(z) is a differentiable function of z
❖ Using the chain rule, y is a differentiable function of both inputs x𝒊 and weights w𝒊
❖ We can compute the change in the output for small changes in either the input or the
weights
Backpropagation Simple Example
❖ Suppose:
1. f(x, y, z) = (x + y)·z
2. x = -2, y = 5, z = -4
❖ Forward calculation, using the intermediate node q = x + y:
➢ q = x + y = 3
➢ f = q × z = 3 × (-4) = -12
❖ Local derivatives of each node:
➢ q = x + y → ∂q/∂x = 1, ∂q/∂y = 1
➢ f = q × z → ∂f/∂q = z, ∂f/∂z = q
❖ We want: ∂f/∂x, ∂f/∂y, ∂f/∂z
❖ Backward calculation, propagating gradients from the output back through the graph:
➢ ∂f/∂f = 1
➢ ∂f/∂z = q = 3
➢ ∂f/∂q = z = -4
➢ ∂f/∂x = (∂f/∂q)(∂q/∂x) = -4 × 1 = -4
➢ ∂f/∂y = (∂f/∂q)(∂q/∂y) = -4 × 1 = -4
[Figure: computational graph with an addition node (x, y → q) feeding a multiplication node (q, z → f), annotated with the forward values and the backward gradients listed above]
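A minimal Python sketch (illustrative, not from the slides) that reproduces this forward and backward pass by hand:

```python
# Forward pass for f(x, y, z) = (x + y) * z
x, y, z = -2.0, 5.0, -4.0
q = x + y                 # q = 3
f = q * z                 # f = -12

# Backward pass: apply the chain rule from the output back to the inputs
df_df = 1.0               # gradient of f with respect to itself
df_dz = q * df_df         # ∂f/∂z = q = 3
df_dq = z * df_df         # ∂f/∂q = z = -4
df_dx = df_dq * 1.0       # ∂q/∂x = 1, so ∂f/∂x = -4
df_dy = df_dq * 1.0       # ∂q/∂y = 1, so ∂f/∂y = -4

print(f, df_dx, df_dy, df_dz)   # -12.0 -4.0 -4.0 3.0
```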
Example: Single Perceptron with Sigmoid Activation
❖ Another example:

$$f(w, x) = \frac{1}{1 + e^{-z}} = \frac{1}{1 + e^{-(w_2 x_2 + w_1 x_1 + w_0)}}$$

❖ with w0 = -3.00, x1 = -2.00, w1 = -3.00, x2 = -1.00, w2 = 2.00
❖ Forward calculation through the computational graph:
➢ w1·x1 = 6.00, w2·x2 = -2.00
➢ w1·x1 + w2·x2 = 4.00
➢ + w0 → 1.00
➢ × (-1) → -1.00
➢ e^x → 0.37
➢ + 1 → 1.37
➢ 1/x → 0.73 (the sigmoid output)
❖ Elementary derivatives used in the backward pass:
➢ f(x) = e^x → df/dx = e^x
➢ f(x) = 1/x → df/dx = -1/x²
➢ f(x) = a·x → df/dx = a
➢ f(x) = c + x → df/dx = 1
❖ Backward calculation: at each node, multiply the upstream gradient by the local gradient:
➢ Output: gradient = 1.00
➢ 1/x node: 1.00 × (-1/1.37²) = -0.53
➢ +1 node: -0.53 × 1 = -0.53
➢ e^x node: -0.53 × e^(-1) = -0.53 × 0.37 = -0.20
➢ ×(-1) node: -0.20 × (-1) = 0.20
➢ + nodes: the gradient 0.20 flows unchanged to w0 and to both branches w1·x1 and w2·x2
➢ ∂f/∂w0 = 0.20
➢ ∂f/∂w2 = 0.20 × x2 = -0.20,  ∂f/∂x2 = 0.20 × w2 = 0.40
➢ ∂f/∂w1 = 0.20 × x1 = -0.40,  ∂f/∂x1 = 0.20 × w1 = -0.60
❖ Summary in a table:

| Operation | Derivative | Value of x at the point | Derivative at the point | Chain rule equation | Chain rule value |
|---|---|---|---|---|---|
| 1/x | -1/x² | 1.37 | -0.53 | -0.53 × 1.00 | -0.53 |
| x + 1 | 1 | - | 1 | 1 × -0.53 | -0.53 |
| e^x | e^x | -1 | 0.37 | 0.37 × -0.53 | -0.2 |
| x × -1 | -1 | - | -1 | -1 × -0.2 | 0.2 |
| + w0 | 1 | - | 1 | 1 × 0.2 | 0.2 |
| + | 1 | - | 1 | 1 × 0.2 | 0.2 |

[Figure: computational graph of the sigmoid perceptron, annotated with the forward values and backward gradients listed above]
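A small sketch (illustrative, not from the slides) that reproduces these forward and backward values; the slides round the upstream gradient to 0.20 before the final products, so the exact numbers below agree with -0.20 / 0.40 / -0.40 / -0.60 to within rounding:

```python
import math

w0, x1, w1, x2, w2 = -3.0, -2.0, -3.0, -1.0, 2.0

# Forward pass
z = w2 * x2 + w1 * x1 + w0      # 1.0
neg = -z                        # -1.0
e = math.exp(neg)               # 0.3679
denom = 1.0 + e                 # 1.3679
out = 1.0 / denom               # 0.7311 (sigmoid output)

# Backward pass: multiply the upstream gradient by the local gradient at every node
d_denom = -1.0 / denom**2       # 1/x node        -> -0.53
d_e = d_denom * 1.0             # +1 node         -> -0.53
d_neg = d_e * e                 # e^x node        -> -0.20
d_z = d_neg * -1.0              # ×(-1) node      ->  0.20
d_w0 = d_z                      #  0.20
d_w2, d_x2 = d_z * x2, d_z * w2 # -0.20, 0.39
d_w1, d_x1 = d_z * x1, d_z * w1 # -0.39, -0.59

print(round(out, 2), round(d_w0, 2), round(d_w2, 2), round(d_x2, 2),
      round(d_w1, 2), round(d_x1, 2))
```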
Training Neural Nets through Gradient Descent
❖ The average training loss function (error):

$$Err = \frac{1}{T}\sum_{i} div(Y_i, d_i)$$
Where:
𝑇 is the number of input samples
𝑑𝑖 is the desired output for training sample i
𝑌𝑖 is the output produced by the neural network for training
sample i
Algorithm: Training of NN using GD (Input: η, {dᵢ}, T, Y)
1 :  Initialize all w_{i,j}^{(k)}
2 :  Repeat
3 :    Initialize Err = 0
4 :    For all i, j, k do: initialize ∂Err/∂w_{i,j}^{(k)} = 0; end
5 :    For all t = 1 to T do                                              ← training epoch
6–8:     Compute Yt ; Err = Err + div(Yt, dt); For all i, j, k do          ← forward pass
9 :        Compute ∂div(Yt, dt)/∂w_{i,j}^{(k)}                             ← backward pass
10:        ∂Err/∂w_{i,j}^{(k)} = ∂Err/∂w_{i,j}^{(k)} + ∂div(Yt, dt)/∂w_{i,j}^{(k)}
11:      end
12:    end
13:    For all i, j, k do                                                  ← gradient descent update
14:      Update: w_{i,j}^{(k)} = w_{i,j}^{(k)} − (η/T)·∂Err/∂w_{i,j}^{(k)}      (η is the learning rate)
15:    end
16:  Until Err has converged
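A schematic Python realization of the algorithm above; the functions `forward`, `backward` (returning per-sample weight gradients as a dict), `div`, and the data themselves are placeholders, not an implementation given on the slides:

```python
def train_batch_gd(weights, data, eta, forward, backward, div, max_epochs=1000, tol=1e-6):
    prev_err = float("inf")
    for _ in range(max_epochs):
        err = 0.0
        grad = {k: 0.0 for k in weights}               # dErr/dw accumulated over all samples
        for x_t, d_t in data:                          # one training epoch
            y_t = forward(weights, x_t)                # forward pass
            err += div(y_t, d_t)
            sample_grad = backward(weights, x_t, d_t)  # backward pass
            for k in weights:
                grad[k] += sample_grad[k]
        for k in weights:                              # single "batch" update per epoch
            weights[k] -= (eta / len(data)) * grad[k]
        if abs(prev_err - err) < tol:                  # until Err has converged
            break
        prev_err = err
    return weights
```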
Forward Computation
Gradient Descent Training Example
❖ Consider using gradient descent for training a neural network that has a single output and 5 training samples. The table below shows the computed ∂Div/∂w_{1,1}^{(3)} for each of the input samples. Assuming that w_{1,1}^{(3)} = 1 and a learning rate η = 0.3, the updated value for w_{1,1}^{(3)} is:

| Training sample | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| ∂Div/∂w_{1,1}^{(3)} | -0.2 | -0.3 | -0.1 | -0.2 | -0.2 |

w_{1,1}^{(3)} = w_{1,1}^{(3)} − (0.3/5)·(−0.2 − 0.3 − 0.1 − 0.2 − 0.2) = 1 − 0.3·(−0.2) = 1.06
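A quick check of this update in Python (the variable names are just for illustration):

```python
grads = [-0.2, -0.3, -0.1, -0.2, -0.2]   # per-sample dDiv/dw for the 5 training samples
w, eta = 1.0, 0.3
w_new = w - (eta / len(grads)) * sum(grads)
print(w_new)   # 1.06
```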
Stochastic Gradient Descent (SGD)
❖ Problem with conventional gradient descent: we try to
simultaneously adjust the function at all training points
➢ We must process all training points before making a single
adjustment; “Batch” update
❖ Alternative: adjust the function at one training point at a time
➢ Keep adjustments small
➢ Eventually, when we have processed all the training points, we
will have adjusted the entire function
▪ With greater overall adjustment than we would if we made a single “Batch”
update
Stochastic Gradient Descent (SGD)
Algorithm: Training of NN using SGD (Input: η, {dᵢ}, T, Y)
1 :  Initialize all w_{i,j}^{(k)}
2 :  Repeat
3 :    Randomly permute the training samples; Initialize Err = 0
4 :    For all t = 1 to T do                                      ← loop over training instances
5 :      Compute Yt ; Err = Err + div(Yt, dt)                      ← forward pass
6 :      For all i, j, k do
7 :        Compute ∂div(Yt, dt)/∂w_{i,j}^{(k)}                     ← backward pass
8 :        w_{i,j}^{(k)} = w_{i,j}^{(k)} − η·∂div(Yt, dt)/∂w_{i,j}^{(k)}
9 :      end
10:    end
11:  Until Err has converged
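A compact sketch of one SGD epoch under the same placeholder assumptions as the batch sketch above; the only change is that the weights are adjusted immediately after every (shuffled) training sample:

```python
import random

def sgd_epoch(weights, data, eta, backward):
    random.shuffle(data)                              # randomly permute the training samples
    for x_t, d_t in data:
        sample_grad = backward(weights, x_t, d_t)     # backward pass for this one sample
        for k in weights:
            weights[k] -= eta * sample_grad[k]        # per-sample update
    return weights
```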
Mini-Batch Gradient Descent
❖ Alternative: adjust the function at a small, randomly chosen subset
of points
➢ Keep adjustments small
➢ If the subsets cover the training set, we will have adjusted the
entire function
❖ As before, vary the subsets randomly in different passes through
the training data
❖ In practice, training is usually performed using mini-batches
➢ The mini-batch size is a hyperparameter to be optimized
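A compact mini-batch variant under the same placeholder assumptions; gradients are averaged over a small random subset and one update is made per batch (the batch size here is arbitrary):

```python
import random

def minibatch_epoch(weights, data, eta, backward, batch_size=32):
    random.shuffle(data)                              # vary the subsets on every pass
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        grad = {k: 0.0 for k in weights}
        for x_t, d_t in batch:                        # accumulate gradients over the batch
            g = backward(weights, x_t, d_t)
            for k in weights:
                grad[k] += g[k]
        for k in weights:
            weights[k] -= (eta / len(batch)) * grad[k]   # one update per mini-batch
    return weights
```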
Various NN Training Optimization Algorithms
❖ There are several algorithms used in practice for optimization of
parameters during neural networks training
➢ Momentum
➢ Nesterov’s Accelerated Gradient (NAG)
➢ RMSProp
➢ AdaGrad
➢ AdaDelta
➢ ADAM: very popular in practice
➢ AdaMax
Avoiding Overfitting during NN Training
❖ Test error for different architectures on MNIST* with and without
dropout (Srivastava et al., 2013)
➢ 2-4 hidden layers with 1024-2048 units
[Figure: test error rate on MNIST for the different architectures, with and without dropout]
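A minimal sketch (my illustration, not from the slides) of "inverted" dropout applied to a layer's activations during training; at test time the layer is simply used unchanged:

```python
import numpy as np

def dropout(activations, drop_prob=0.5, training=True):
    """Randomly zero each unit with probability drop_prob; scale the survivors so the
    expected activation stays the same (inverted dropout)."""
    if not training or drop_prob == 0.0:
        return activations
    keep_prob = 1.0 - drop_prob
    mask = (np.random.rand(*activations.shape) < keep_prob)
    return activations * mask / keep_prob

h = np.array([0.2, 1.5, -0.7, 0.9])
print(dropout(h, drop_prob=0.5))   # roughly half the units zeroed, survivors scaled by 2
```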
Neural Network Demo
❖ https://playground.tensorflow.org/
Acknowledgments
❖ Slides have been used from:
➢ https://www.cs.cmu.edu/~bhiksha/courses/deeplearning/Spring.2019/www/
➢ https://canvas.northwestern.edu/courses/75723/assignments/syllabus
➢ http://cs231n.stanford.edu/slides/2022/lecture_4_ruohan.pdf
➢ https://www.v7labs.com/blog/neural-networks-activation-functions