Topic 4 (Part 2) - NN Learning
Outline
❖ Neural Network Learning Problem
❖ Neural Network Learning Problem Setup
❖ Gradient Descent
❖ Training Neural Nets through Gradient Descent
❖ Stochastic and Mini-Batch Gradient Descent
❖ Avoiding Overfitting during Neural Network Training: Dropout
Neural Networks
❖ Neural networks are universal function approximators; they can:
1. Model any Boolean function
2. Model any classification boundary
3. Model any continuous-valued function
[Figure: a layered network with an input layer, 1st, 2nd, …, nth hidden layers, and an output layer]
Neural Network Learning
❖ Neural Network Architecture
➢ Involves determining the number of layers in the network, the number of neurons in each layer, and how the neurons are connected
❖ Feed-Forward Neural Network
➢ Has no loops: Neuron outputs do not feed back to their inputs directly
or indirectly
❖ Neural Network Parameters
➢ The weights and biases
❖ Neural Network Learning
➢ Determining the values of the network parameters such that the
network computes the desired function
Neural Network Learning
❖ The overall goal of training a neural network is to model (approximate) some function g(X)
❖ To model g(X), we need to derive the parameters of the network, i.e., the weights within the NN, such that if we apply the input data (X) to the NN we get something close to g(X)
➢ Formally, we say that the NN is a function f(X, W) that maps the inputs X to predict g(X) using the weights W
❖ To learn the neural network weights (W) that model g(X), we would need the function g(X) to be fully specified
➢ This is not feasible in practice, or sometimes we do not even know what the actual g(X) function is!
Neural Network Learning
❖ The solution is to sample g(X) and use these samples to estimate a new function Y = f(X, W) that is close to g(X)
➢ In the sampling process, we get input–output pairs for several samples of the input: (Xi, di = g(Xi))
➢ After obtaining enough samples, we estimate the network parameters (W) to “fit” the training points exactly, with the hope that the function will not change much between these samples
➢ Note: we will need many sample points to make sure that Y faithfully represents g(X), hence the need for big data to train neural networks
MLP Neural Network Learning
❖ Given a neural network architecture and a training set of input-
output pairs (X1, d1), (X2, d2), …, (XN, dN)
➢ di is the desired output of the network in response to Xi
❖ We need to find the network parameters, W, such that the network produces the desired output, or a close approximation of it, for each training input
Neural Network Learning
❖ To determine how good our estimating function Y = f(X, W) is at approximating g(X), we calculate the divergence between g(X) and Y, then attempt to make the divergence as small as possible
❖ What we are looking for are the values of W such that the divergence is minimized; formally we write this as:

$$\widehat{W} = \arg\min_{W} \int_{X} div\big(f(X, W), g(X)\big)\, dX$$

❖ div() is a divergence function that goes to zero when f(X, W) = g(X)
❖ A divergence function represents an error function, such as the squared Euclidean distance:

$$div(Y, d) = \frac{1}{2}\lVert Y - d \rVert^2 = \frac{1}{2}\sum_{i}(y_i - d_i)^2$$
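As a concrete illustration (not from the slides), a minimal NumPy sketch of this squared Euclidean divergence and its gradient with respect to the network output Y:

```python
import numpy as np

def squared_euclidean_div(Y, d):
    """div(Y, d) = 1/2 * ||Y - d||^2 for one training sample."""
    diff = np.asarray(Y, dtype=float) - np.asarray(d, dtype=float)
    return 0.5 * np.dot(diff, diff)

def squared_euclidean_grad(Y, d):
    """Gradient of the divergence with respect to the network output Y: (Y - d)."""
    return np.asarray(Y, dtype=float) - np.asarray(d, dtype=float)

# Example: a 3-class one-hot target and a network output
Y = np.array([0.7, 0.2, 0.1])
d = np.array([1.0, 0.0, 0.0])
print(squared_euclidean_div(Y, d))   # 0.5 * (0.09 + 0.04 + 0.01) = 0.07
print(squared_euclidean_grad(Y, d))  # [-0.3  0.2  0.1]
```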
Neural Network Learning Problem Setup
❖ We assume a “layered” network for simplicity
➢ The input layer has no neurons – the “layer” simply refers to the inputs x1, …, xD
❖ We refer to the outputs y1, …, yL as the output layer
❖ Intermediate layers are “hidden” layers
[Figure: network with an input layer, 1st, 2nd, and 3rd hidden layers, and an output layer]
Neural Network Learning Problem Setup
❖ The input layer is the 0th layer; the input to the network is yi(0) = xi
❖ We will represent the output of the ith perceptron of the kth layer as yi(k)
[Figure: the same layered network, with the layer and neuron indices marked]
Neural Network Learning Problem Setup
❖ Examples:
➢ w21(2) is shown in red
➢ y2(0) is shown in blue
➢ y1(4) is shown in brown
[Figure: the layered network with the weight w21(2) and the outputs y2(0) and y1(4) highlighted]
Neural Network Learning Problem Setup
❖ Vector Notation:
➢ Xn = [xn1, xn2, …, xnD] is the nth input vector
➢ dn = [dn1, dn2, …, dnL] is the nth vector of desired output
➢ Yn = [yn1, yn2, …, ynL] is the nth vector of actual output
❖ Input representation: vectors of numbers, e.g., a vector of pixel values, or a real-valued vector representing text
❖ Output representation
➢ A single real value or a vector of real values
➢ For binary classification: one binary output 1/0, e.g., cat/no cat
➢ For multi-class output: a one-hot representation is used, e.g., [1 0 0], [0 1 0], [0 0 1]
Neural Network Learning Problem Setup
❖ Given a neural network architecture and a training set of input-
output pairs (X1, d1), (X2, d2), …, (XT, dT)
❖ The error on the ith instance is div(Yi, di)
❖ Optimize network parameters to minimize the average error over
all training inputs
❖ The average error over the training set:

$$Err = \frac{1}{T}\sum_{i} div(Y_i, d_i)$$

where T is the number of input samples
A Simple Network Example
Neural Networks Learning
❖ To minimize the error, we need to determine for
each sample how the weights need to be
adjusted to reduce the error
❖ This can be done by examining the derivative of
the error function with respect to each weight
and determine whether the weight needs to be
increased or decreased
❖ The derivative of the error function with respect to a given weight determines the slope [written as dy/dx or as f′(x)]
➢ If the slope is positive, decreasing the weight will decrease the error
➢ If the slope is negative, increasing the weight will decrease the error
Iterative Solutions for Function Minimization
❖ Often it is not possible to simply solve f′(x) = 0
❖ The function to minimize may have an intractable form
❖ In these situations, iterative solutions are used
➢ Start from an initial guess X0 for the optimal X
➢ Update the guess towards a (hopefully) “better” value of f(X)
➢ Stop when f(X) no longer decreases
Gradient Descent
❖ Iterative solution:
➢ Start at some point
➢ Find direction in which to shift this point to decrease error
➢ This can be found from the derivative of the function
▪ A positive derivative → moving left decreases error
▪ A negative derivative → moving right decreases error
➢ Shift point in this direction
Gradient Descent – X ∈ ℝ
❖ Initialize X^(0), k = 0
❖ While |f(X^(k+1)) − f(X^(k))| > ε:
    X^(k+1) = X^(k) − η^(k) f′(X^(k))
    k = k + 1
❖ where:
➢ X^(k) is the kth estimate of X
➢ η^(k) is the kth step size – if the step size is fixed, then η^(k) = η
❖ There are many ways of choosing the step size η^(k); it is usually iteration dependent
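A minimal Python sketch of this loop (illustrative only; the function f, its derivative, the starting point, and the stopping threshold are stand-ins):

```python
def gradient_descent_1d(f, f_prime, x0, eta=0.1, eps=1e-6, max_iters=10_000):
    """Iteratively move x against the derivative until f stops decreasing noticeably."""
    x = x0
    for _ in range(max_iters):
        x_new = x - eta * f_prime(x)          # step in the direction that decreases f
        if abs(f(x_new) - f(x)) <= eps:       # stop when the improvement is negligible
            return x_new
        x = x_new
    return x

# Example: minimize f(x) = (x - 3)^2, whose minimum is at x = 3
print(gradient_descent_1d(lambda x: (x - 3) ** 2, lambda x: 2 * (x - 3), x0=0.0))
```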
Gradient Descent Example
❖ Given the function f(x1, x2, x3) = x1² + x1(1 − x2) + x2² − x2x3 + x3² + x3, apply the gradient descent algorithm to compute the next value for minimizing the function, starting with the initial solution (x1, x2, x3) = (1, 1, 1) and using a learning rate η = 0.3:

∂f/∂x1 = 2x1 + (1 − x2) = 2  →  x1 = x1 − η·∂f/∂x1 = 1 − 0.3·2 = 0.4
∂f/∂x2 = −x1 + 2x2 − x3 = 0  →  x2 = x2 − η·∂f/∂x2 = 1
∂f/∂x3 = −x2 + 2x3 + 1 = 2  →  x3 = x3 − η·∂f/∂x3 = 1 − 0.3·2 = 0.4
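A short sketch (assuming NumPy) that reproduces this single update step numerically:

```python
import numpy as np

def grad_f(x):
    """Gradient of f(x1,x2,x3) = x1^2 + x1(1 - x2) + x2^2 - x2*x3 + x3^2 + x3."""
    x1, x2, x3 = x
    return np.array([2*x1 + 1 - x2,
                     -x1 + 2*x2 - x3,
                     -x2 + 2*x3 + 1])

x = np.array([1.0, 1.0, 1.0])
eta = 0.3
x_next = x - eta * grad_f(x)
print(x_next)   # [0.4 1.  0.4], matching the hand calculation above
```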
Gradient Descent Example
❖ For f(x1, x2, x3) above, the behavior of gradient descent depends on the step size η relative to the optimal step size η_opt:
➢ η = η_opt: the optimum is reached quickly
➢ η < η_opt: too many steps are needed to reach the optimum solution
➢ η > 2η_opt: we get divergence
[Figure: plots of the iterates for the different step-size regimes]
Gradient Descent Example
❖ In most real-life applications, the
error function is not convex. What to
do then?
❖ Consider the function:
   f(x) = (x² cos(x) − x) / 10
❖ This function has many local minima.
❖ Gradient descent will find different
ones depending on our initial guess
and our step size
Gradient Descent Example
❖ If we choose x0 = 6 and η = 0.2, for example, gradient descent moves as shown in the graph. The first point is x0, and the gradient suggests moving to the left. After only 10 steps, we have converged to the local minimum near x = 4
Gradient Descent Example
❖ We then get stuck in the local minimum
❖ We can choose η > 2η_opt to get a new point outside the local minimum valley, and then run gradient descent again to find other local minima
❖ We repeat this multiple times and then choose the minimum that is the lowest
❖ Note: in real applications, and depending on the complexity of the error curve, you may not reach the global minimum value
Neural Network (NN) Units
❖ Each neuron in the NN is represented by a perceptron with the following setting:
1. A set of weights applied to the inputs to form the weighted sum

   $$Z = \sum_{i=1}^{N} x_i w_i + b, \qquad y = f\Big(\sum_{i=1}^{N} x_i w_i + b\Big)$$

2. A bias representing a threshold to trigger the perceptron
   ▪ The bias can be viewed as the weight of another input with fixed value 1:

   $$Z = \sum_{i=1}^{N+1} x_i w_i$$

3. Activation functions are not necessarily threshold functions
[Figure: a perceptron with inputs x1 … xN, weights w1 … wN, bias b, an activation function, and output y]
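As an illustration (not part of the original slides), a minimal NumPy sketch of a single perceptron's forward computation, with the bias either kept separate or folded in as an extra weight on a constant input of 1; the tanh activation is just a placeholder:

```python
import numpy as np

def perceptron_forward(x, w, b, activation=np.tanh):
    """y = f(sum_i x_i * w_i + b)."""
    z = np.dot(x, w) + b
    return activation(z)

def perceptron_forward_folded_bias(x, w_with_bias, activation=np.tanh):
    """Same computation with the bias treated as the weight of a fixed input 1."""
    x_aug = np.append(x, 1.0)                 # append the constant input
    return activation(np.dot(x_aug, w_with_bias))

x = np.array([0.5, -1.0])
w = np.array([2.0, 0.5])
b = -0.25
print(perceptron_forward(x, w, b))
print(perceptron_forward_folded_bias(x, np.append(w, b)))  # identical output
```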
Backpropagation
❖ During the training process, the output of the neural network is compared to
the expected output, and the difference between the two is calculated using an
error function. Backpropagation is then used to propagate this error backward
through the network and update the weights and biases to minimize the error
function
❖ Backpropagation is used in gradient descent based neural network learning to
calculate the gradient of the error function with respect to the weights and
biases of the network
❖ This is achieved by propagating the gradient of the error function through
network parameters using the chain rule of calculus. This gradient is then used
to update the weights and biases through an optimization algorithm such as
gradient descent
❖ Backpropagation is a crucial component of training a neural network, and it
allows the network to learn from its mistakes and improve its performance over
time
Activation Functions
❖ Why do we need Activation Functions in Neural Networks?
➢ The purpose of an activation function is to add non-linearity to the
neural network
❖ Let’s suppose we have a neural network working with linear activation
functions. In that case:
1. Every neuron will only be performing a linear transformation on the
inputs using the weights and biases
2. All layers will behave in the same way because the composition of two
linear functions is a linear function itself
3. Learning complex tasks is impossible, and our model would be just a linear regression model
Activation Functions
❖Linear activation functions:
1. It’s not possible to use backpropagation as the derivative of the
function is a constant and has no relation to the input x
2. A linear activation function effectively turns the neural network into a single layer: all layers of the network collapse into one
Activation Functions (Sigmoid/logistic Activation)
❖ The limitations of sigmoid function:
➢ The derivative of the function is: f′(z) = f(z)(1 − f(z))
➢ As we can see from the derivative figure, the
gradients are only significant for range -4 to 4
➢ The output of the sigmoid function is not
symmetric around zero → the output of all
the neurons will be of the same sign!
▪ This makes the training of the neural network more
difficult and unstable
➢ Vanishing Gradient problem (discussed next)
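A small sketch (illustrative, not from the slides) of the sigmoid and its derivative, showing how quickly the gradient shrinks outside roughly the range -4 to 4:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)        # f'(z) = f(z) * (1 - f(z))

for z in [0.0, 2.0, 4.0, 10.0]:
    print(z, sigmoid_grad(z))   # 0.25, ~0.105, ~0.018, ~4.5e-05: the gradient vanishes for large |z|
```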
Activation Functions (Sigmoid/logistic Activation)
❖ Vanishing Gradient problem in Sigmoid:
➢ Sigmoid squishes the entire input space into a small
output space between 0 and 1
➢ Therefore, a large change in the input of the sigmoid function causes only a small change in the output; hence, the derivative becomes small
➢ For shallow networks with only a few hidden layers that use these activations, this isn’t a big problem. However, when more layers are used, it can cause the gradient to become too small for training to work effectively
Activation Functions (Tanh Activation)
❖ Tanh Function (Hyperbolic Tangent)
➢ Tanh function is very similar to the
sigmoid/logistic activation function, with
the difference in output range of -1 to 1
❖ Advantages:
1. The output is Zero centered; hence we can
map the output values as strongly
negative, neutral, or strongly positive.
2. Usually used in hidden layers of a neural network, as its values lie between -1 and 1; this helps center the data and makes learning for the next layer much easier
Activation Functions (Tanh Activation)
❖ The derivative of the function is: f′(z) = 1 − f²(z)
❖ Disadvantages:
➢ Tanh faces the problem of vanishing gradients
like the sigmoid activation function
➢ The gradient of the tanh function is much
steeper as compared to the sigmoid function
Activation Functions (ReLU Activation)
❖ ReLU stands for Rectified Linear Unit
❖ ReLU has a derivative function and allows for
backpropagation while simultaneously
making it computationally efficient
❖ ReLU function does not activate all the
neurons at the same time, i.e., the neurons
will only be deactivated if the output of the
linear transformation is less than 0
❖ Mathematically ReLU is expressed as:
𝑓 𝑧 = max(0, 𝑧)
Activation Functions (ReLU Activation)
❖ The derivative of the function is: f′(z) = 1 if z ≥ 0, and 0 otherwise
❖ The advantages of using ReLU are:
1. ReLU is more computationally efficient when compared
to the sigmoid and tanh functions since only a certain
number of neurons are activated in a neural network
2. ReLU accelerates the convergence of gradient descent
towards the global minimum of the loss function due
to its linearity
❖ Mostly used as an activation function for the hidden
layers
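A minimal sketch (illustrative only) of ReLU and its derivative as used in backpropagation:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # Subgradient convention matching the slide: 1 for z >= 0, 0 otherwise
    return (z >= 0).astype(float)

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z))       # [0.  0.  0.  1.5]
print(relu_grad(z))  # [0. 0. 1. 1.]
```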
Activation Functions (ReLU Activation)
❖ Limitations:
➢ All the negative input values become zero
immediately, which decreases the model’s
ability to fit or train from the data properly
➢ The negative side of the graph makes the
gradient value zero. Due to this reason, during
the backpropagation process, the weights and
biases for some neurons are not updated. This
can create dead neurons which never get
activated
Activation Functions (Softmax Activation)
❖ Softmax function is described as a combination of multiple
sigmoids
❖ It calculates the relative probabilities. Like the sigmoid/logistic
activation function, the SoftMax function returns the probability of
each class
❖ It is mostly used as an activation function for the last layer of the
neural network in the case of multi-class classification
❖ Mathematically it can be represented as:

$$Softmax(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$$
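A short sketch (not from the slides) of the softmax computation; subtracting the maximum score before exponentiating is a common numerical-stability trick added here, not something stated on the slide:

```python
import numpy as np

def softmax(z):
    """Softmax(z_i) = exp(z_i) / sum_j exp(z_j), computed in a numerically stable way."""
    z = np.asarray(z, dtype=float)
    shifted = z - z.max()            # does not change the result, avoids overflow
    e = np.exp(shifted)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))               # ~[0.659 0.242 0.099], the probabilities sum to 1
```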
How to Choose the Right Activation Function?
❖ Match your activation function for your output layer based on the
type of prediction problem that you are solving
❖ Begin with using the ReLU activation function and then move over
to other activation functions if ReLU doesn’t provide optimum
results
❖ Guidelines of choosing Activation Functions:
➢ ReLU activation function should only be used in the hidden layers
➢ Sigmoid/Logistic and Tanh functions should not be used in hidden
layers for high depth networks as they make the model more
susceptible to problems during training (due to vanishing
gradients)
Activation Functions Summary
❖ Activation Functions are used to introduce non-linearity in the
network
❖ A neural network will almost always have the same activation
function in all hidden layers. This activation function should be
differentiable so that the parameters of the network are learned in
backpropagation
❖ ReLU is the most commonly used activation function for hidden
layers
❖ While selecting an activation function, you must consider the
problems it might face such as vanishing gradients
Typical Problem Statement: Classification
Binary Classification
Multi-Class Classification
Examples of Divergence Functions
❖ For real-valued output vectors, the (scaled) divergence representing the squared Euclidean distance between the true and desired output is popular:

$$div(Y, d) = \frac{1}{2}\sum_{i}(y_i - d_i)^2$$

Gradient: ∂div(Y, d)/∂y_i = y_i − d_i
Perceptrons with Differentiable Activation Functions
❖ Activation functions need to be differentiable
➢ The activation f(z) is a differentiable function of z
❖ Using the chain rule, y is a differentiable function of both inputs x𝒊 and weights w𝒊
❖ We can compute the change in the output for small changes in either the input or the
weights
Backpropagation Simple Example
❖ Suppose:
1. f(x, y, z) = (x + y)·z
2. x = -2, y = 5, z = -4
❖ Forward calculation, using the intermediate node q = x + y:
➢ q = x + y = 3
➢ f = q × z = 3 × (-4) = -12
❖ Local derivatives of each node:
➢ q = x + y → ∂q/∂x = 1, ∂q/∂y = 1
➢ f = q × z → ∂f/∂q = z, ∂f/∂z = q
❖ We want: ∂f/∂x, ∂f/∂y, ∂f/∂z
❖ Backward calculation, propagating gradients from the output back through the graph:
➢ ∂f/∂f = 1
➢ ∂f/∂z = q = 3
➢ ∂f/∂q = z = -4
➢ ∂f/∂x = (∂f/∂q)(∂q/∂x) = -4 × 1 = -4
➢ ∂f/∂y = (∂f/∂q)(∂q/∂y) = -4 × 1 = -4
[Figure: computational graph with an addition node (x, y → q) feeding a multiplication node (q, z → f), annotated with the forward values and the backward gradients listed above]
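A minimal Python sketch (illustrative, not from the slides) that reproduces this forward and backward pass by hand:

```python
# Forward pass for f(x, y, z) = (x + y) * z
x, y, z = -2.0, 5.0, -4.0
q = x + y                 # q = 3
f = q * z                 # f = -12

# Backward pass: apply the chain rule from the output back to the inputs
df_df = 1.0               # gradient of f with respect to itself
df_dz = q * df_df         # ∂f/∂z = q = 3
df_dq = z * df_df         # ∂f/∂q = z = -4
df_dx = df_dq * 1.0       # ∂q/∂x = 1, so ∂f/∂x = -4
df_dy = df_dq * 1.0       # ∂q/∂y = 1, so ∂f/∂y = -4

print(f, df_dx, df_dy, df_dz)   # -12.0 -4.0 -4.0 3.0
```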
Example: Single Perceptron with Sigmoid Activation
❖ Another example:

$$f(w, x) = \frac{1}{1 + e^{-z}} = \frac{1}{1 + e^{-(w_2 x_2 + w_1 x_1 + w_0)}}$$

❖ with w0 = -3.00, x1 = -2.00, w1 = -3.00, x2 = -1.00, w2 = 2.00
❖ Forward calculation through the computational graph:
➢ w1·x1 = 6.00, w2·x2 = -2.00
➢ w1·x1 + w2·x2 = 4.00
➢ + w0 → 1.00
➢ × (-1) → -1.00
➢ e^x → 0.37
➢ + 1 → 1.37
➢ 1/x → 0.73 (the sigmoid output)
❖ Elementary derivatives used in the backward pass:
➢ f(x) = e^x → df/dx = e^x
➢ f(x) = 1/x → df/dx = -1/x²
➢ f(x) = a·x → df/dx = a
➢ f(x) = c + x → df/dx = 1
❖ Backward calculation: at each node, multiply the upstream gradient by the local gradient:
➢ Output: gradient = 1.00
➢ 1/x node: 1.00 × (-1/1.37²) = -0.53
➢ +1 node: -0.53 × 1 = -0.53
➢ e^x node: -0.53 × e^(-1) = -0.53 × 0.37 = -0.20
➢ ×(-1) node: -0.20 × (-1) = 0.20
➢ + nodes: the gradient 0.20 flows unchanged to w0 and to both branches w1·x1 and w2·x2
➢ ∂f/∂w0 = 0.20
➢ ∂f/∂w2 = 0.20 × x2 = -0.20,  ∂f/∂x2 = 0.20 × w2 = 0.40
➢ ∂f/∂w1 = 0.20 × x1 = -0.40,  ∂f/∂x1 = 0.20 × w1 = -0.60
❖ Summary in a table:

| Operation | Derivative | Value of x at the point | Derivative at the point | Chain rule equation | Chain rule value |
|---|---|---|---|---|---|
| 1/x | -1/x² | 1.37 | -0.53 | -0.53 × 1.00 | -0.53 |
| x + 1 | 1 | - | 1 | 1 × -0.53 | -0.53 |
| e^x | e^x | -1 | 0.37 | 0.37 × -0.53 | -0.2 |
| x × -1 | -1 | - | -1 | -1 × -0.2 | 0.2 |
| + w0 | 1 | - | 1 | 1 × 0.2 | 0.2 |
| + | 1 | - | 1 | 1 × 0.2 | 0.2 |

[Figure: computational graph of the sigmoid perceptron, annotated with the forward values and backward gradients listed above]
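A small sketch (illustrative, not from the slides) that reproduces these forward and backward values; the slides round the upstream gradient to 0.20 before the final products, so the exact numbers below agree with -0.20 / 0.40 / -0.40 / -0.60 to within rounding:

```python
import math

w0, x1, w1, x2, w2 = -3.0, -2.0, -3.0, -1.0, 2.0

# Forward pass
z = w2 * x2 + w1 * x1 + w0      # 1.0
neg = -z                        # -1.0
e = math.exp(neg)               # 0.3679
denom = 1.0 + e                 # 1.3679
out = 1.0 / denom               # 0.7311 (sigmoid output)

# Backward pass: multiply the upstream gradient by the local gradient at every node
d_denom = -1.0 / denom**2       # 1/x node        -> -0.53
d_e = d_denom * 1.0             # +1 node         -> -0.53
d_neg = d_e * e                 # e^x node        -> -0.20
d_z = d_neg * -1.0              # ×(-1) node      ->  0.20
d_w0 = d_z                      #  0.20
d_w2, d_x2 = d_z * x2, d_z * w2 # -0.20, 0.39
d_w1, d_x1 = d_z * x1, d_z * w1 # -0.39, -0.59

print(round(out, 2), round(d_w0, 2), round(d_w2, 2), round(d_x2, 2),
      round(d_w1, 2), round(d_x1, 2))
```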
Training Neural Nets through Gradient Descent
❖ The average training loss function (error):

$$Err = \frac{1}{T}\sum_{i} div(Y_i, d_i)$$
Where:
𝑇 is the number of input samples
𝑑𝑖 is the desired output for training sample i
𝑌𝑖 is the output produced by the neural network for training
sample i
Algorithm: Training of NN using GD (Input: η, {dᵢ}, T, Y)
1 :  Initialize all w_{i,j}^{(k)}
2 :  Repeat
3 :    Initialize Err = 0
4 :    For all i, j, k do: initialize ∂Err/∂w_{i,j}^{(k)} = 0; end
5 :    For all t = 1 to T do                                              ← training epoch
6–8:     Compute Yt ; Err = Err + div(Yt, dt); For all i, j, k do          ← forward pass
9 :        Compute ∂div(Yt, dt)/∂w_{i,j}^{(k)}                             ← backward pass
10:        ∂Err/∂w_{i,j}^{(k)} = ∂Err/∂w_{i,j}^{(k)} + ∂div(Yt, dt)/∂w_{i,j}^{(k)}
11:      end
12:    end
13:    For all i, j, k do                                                  ← gradient descent update
14:      Update: w_{i,j}^{(k)} = w_{i,j}^{(k)} − (η/T)·∂Err/∂w_{i,j}^{(k)}      (η is the learning rate)
15:    end
16:  Until Err has converged
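A schematic Python realization of the algorithm above; the functions `forward`, `backward` (returning per-sample weight gradients as a dict), `div`, and the data themselves are placeholders, not an implementation given on the slides:

```python
def train_batch_gd(weights, data, eta, forward, backward, div, max_epochs=1000, tol=1e-6):
    prev_err = float("inf")
    for _ in range(max_epochs):
        err = 0.0
        grad = {k: 0.0 for k in weights}               # dErr/dw accumulated over all samples
        for x_t, d_t in data:                          # one training epoch
            y_t = forward(weights, x_t)                # forward pass
            err += div(y_t, d_t)
            sample_grad = backward(weights, x_t, d_t)  # backward pass
            for k in weights:
                grad[k] += sample_grad[k]
        for k in weights:                              # single "batch" update per epoch
            weights[k] -= (eta / len(data)) * grad[k]
        if abs(prev_err - err) < tol:                  # until Err has converged
            break
        prev_err = err
    return weights
```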
Forward Computation
Gradient Descent Training Example
❖ Consider using gradient descent for training a neural network that has a single output and 5 training samples. The table below shows the computed ∂Div/∂w_{1,1}^{(3)} for each of the input samples. Assuming that w_{1,1}^{(3)} = 1 and a learning rate η = 0.3, the updated value for w_{1,1}^{(3)} is:

| Training sample | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| ∂Div/∂w_{1,1}^{(3)} | -0.2 | -0.3 | -0.1 | -0.2 | -0.2 |

w_{1,1}^{(3)} = w_{1,1}^{(3)} − (0.3/5)·(−0.2 − 0.3 − 0.1 − 0.2 − 0.2) = 1 − 0.3·(−0.2) = 1.06
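A quick check of this update in Python (the variable names are just for illustration):

```python
grads = [-0.2, -0.3, -0.1, -0.2, -0.2]   # per-sample dDiv/dw for the 5 training samples
w, eta = 1.0, 0.3
w_new = w - (eta / len(grads)) * sum(grads)
print(w_new)   # 1.06
```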
Stochastic Gradient Descent (SGD)
❖ Problem with conventional gradient descent: we try to
simultaneously adjust the function at all training points
➢ We must process all training points before making a single
adjustment; “Batch” update
❖ Alternative: adjust the function at one training point at a time
➢ Keep adjustments small
➢ Eventually, when we have processed all the training points, we
will have adjusted the entire function
▪ With greater overall adjustment than we would if we made a single “Batch”
update
Stochastic Gradient Descent (SGD)
Algorithm: Training of NN using SGD (Input: η, {dᵢ}, T, Y)
1 :  Initialize all w_{i,j}^{(k)}
2 :  Repeat
3 :    Randomly permute the training samples; Initialize Err = 0
4 :    For all t = 1 to T do                                      ← loop over training instances
5 :      Compute Yt ; Err = Err + div(Yt, dt)                      ← forward pass
6 :      For all i, j, k do
7 :        Compute ∂div(Yt, dt)/∂w_{i,j}^{(k)}                     ← backward pass
8 :        w_{i,j}^{(k)} = w_{i,j}^{(k)} − η·∂div(Yt, dt)/∂w_{i,j}^{(k)}
9 :      end
10:    end
11:  Until Err has converged
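A compact sketch of one SGD epoch under the same placeholder assumptions as the batch sketch above; the only change is that the weights are adjusted immediately after every (shuffled) training sample:

```python
import random

def sgd_epoch(weights, data, eta, backward):
    random.shuffle(data)                              # randomly permute the training samples
    for x_t, d_t in data:
        sample_grad = backward(weights, x_t, d_t)     # backward pass for this one sample
        for k in weights:
            weights[k] -= eta * sample_grad[k]        # per-sample update
    return weights
```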
Mini-Batch Gradient Descent
❖ Alternative: adjust the function at a small, randomly chosen subset
of points
➢ Keep adjustments small
➢ If the subsets cover the training set, we will have adjusted the
entire function
❖ As before, vary the subsets randomly in different passes through
the training data
❖ In practice, training is usually performed using mini-batches
➢ The mini-batch size is a hyperparameter to be optimized
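A compact mini-batch variant under the same placeholder assumptions; gradients are averaged over a small random subset and one update is made per batch (the batch size here is arbitrary):

```python
import random

def minibatch_epoch(weights, data, eta, backward, batch_size=32):
    random.shuffle(data)                              # vary the subsets on every pass
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        grad = {k: 0.0 for k in weights}
        for x_t, d_t in batch:                        # accumulate gradients over the batch
            g = backward(weights, x_t, d_t)
            for k in weights:
                grad[k] += g[k]
        for k in weights:
            weights[k] -= (eta / len(batch)) * grad[k]   # one update per mini-batch
    return weights
```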
Various NN Training Optimization Algorithms
❖ There are several algorithms used in practice for optimization of
parameters during neural networks training
➢ Momentum
➢ Nesterov’s Accelerated Gradient (NAG)
➢ RMSProp
➢ AdaGrad
➢ AdaDelta
➢ ADAM: very popular in practice
➢ AdaMax
Avoiding Overfitting during NN Training
❖ Test error for different architectures on MNIST* with and without
dropout (Srivastava et al., 2013)
➢ 2-4 hidden layers with 1024-2048 units
[Figure: test error rate on MNIST for the different architectures, with and without dropout]
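A minimal sketch (my illustration, not from the slides) of "inverted" dropout applied to a layer's activations during training; at test time the layer is simply used unchanged:

```python
import numpy as np

def dropout(activations, drop_prob=0.5, training=True):
    """Randomly zero each unit with probability drop_prob; scale the survivors so the
    expected activation stays the same (inverted dropout)."""
    if not training or drop_prob == 0.0:
        return activations
    keep_prob = 1.0 - drop_prob
    mask = (np.random.rand(*activations.shape) < keep_prob)
    return activations * mask / keep_prob

h = np.array([0.2, 1.5, -0.7, 0.9])
print(dropout(h, drop_prob=0.5))   # roughly half the units zeroed, survivors scaled by 2
```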
Neural Network Demo
❖ https://playground.tensorflow.org/
Acknowledgments
❖ Slides have been used from:
➢ https://www.cs.cmu.edu/~bhiksha/courses/deeplearning/Spring.2019/www/
➢ https://canvas.northwestern.edu/courses/75723/assignments/syllabus
➢ http://cs231n.stanford.edu/slides/2022/lecture_4_ruohan.pdf
➢ https://www.v7labs.com/blog/neural-networks-activation-functions