CHAPTER 3.4.1 - Backpropagation - Updated

Chapter 4 discusses back propagation and optimization in artificial intelligence using Python. It covers concepts such as partial derivatives, gradients, the gradient descent algorithm, and the chain rule, emphasizing their roles in training neural networks. The chapter also provides examples of calculating gradients for various functions and the softmax activation function.

ARTIFICIAL INTELLIGENCE

in Python LANGUAGE

Chapter 4: Back Propagation & Optimization

Mechatronics – Robot & AI Department. For Internal Circulation only.


4.1. Back Propagation
• Partial derivative:
• The partial derivative measures how much impact a single input has on a function’s output.
• Euler’s notation:

Example:


Examples:
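As a minimal sketch of the idea (the function f below is a hypothetical choice, not taken from the original slides), a partial derivative can be approximated numerically by nudging one input while holding the other fixed:

```python
def f(x, y):
    """Hypothetical example function: f(x, y) = 3x^2 + 2xy."""
    return 3 * x**2 + 2 * x * y


def partial_x(func, x, y, h=1e-6):
    """Central-difference estimate of df/dx, with y held constant."""
    return (func(x + h, y) - func(x - h, y)) / (2 * h)


def partial_y(func, x, y, h=1e-6):
    """Central-difference estimate of df/dy, with x held constant."""
    return (func(x, y + h) - func(x, y - h)) / (2 * h)


# Analytic results for comparison: df/dx = 6x + 2y, df/dy = 2x
df_dx = partial_x(f, 1.0, 2.0)
df_dy = partial_y(f, 1.0, 2.0)
print(df_dx, df_dy)
```

At the point (1, 2) the analytic values are 10 and 2, so the printed estimates should be close to those numbers.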

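Partial derivative of the max function:

$$f(x, y) = \max(x, y) \;\Rightarrow\; \frac{\partial f(x, y)}{\partial x} = \frac{\partial \max(x, y)}{\partial x} = \begin{cases} 1 & \text{if } x \ge y \\ 0 & \text{if } x < y \end{cases}$$

$$f(x, 0) = \max(x, 0) \;\Rightarrow\; \frac{\partial f(x, 0)}{\partial x} = \begin{cases} 1 & \text{if } x \ge 0 \\ 0 & \text{if } x < 0 \end{cases}$$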

The gradient is the vector of all partial derivatives of a function, each taken with respect to one of its input variables.

• Gradient:
The gradient is the vector composed of all partial derivatives of the function. Notation: nabla, 𝛻

Example:
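A hedged sketch of a gradient computed numerically with NumPy. The function f and the helper name numerical_gradient are assumptions made for illustration; they are not part of the original slides.

```python
import numpy as np


def f(v):
    """Hypothetical function f(x, y, z) = x^2 + 3y + xz."""
    x, y, z = v
    return x**2 + 3 * y + x * z


def numerical_gradient(func, v, h=1e-6):
    """Stack the central-difference partial derivatives into one vector."""
    v = np.asarray(v, dtype=float)
    grad = np.zeros_like(v)
    for i in range(v.size):
        step = np.zeros_like(v)
        step[i] = h  # perturb only the i-th input
        grad[i] = (func(v + step) - func(v - step)) / (2 * h)
    return grad


# Analytic gradient: [2x + z, 3, x], which is [5, 3, 1] at (1, 2, 3)
grad = numerical_gradient(f, [1.0, 2.0, 3.0])
print(grad)
```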

Gradient descent algorithm: an optimization algorithm that helps find a local minimum of a
function (for a neural network: the loss function) by making the model's parameters converge
to optimal values.
• Gradient Descent Algorithm:
Gradient descent algorithm: starting from a point that may be close to the solution, an
iterative update gradually approaches the desired point (a local minimum), i.e., the point
where the derivative converges to 0.
To converge to the local minimum, one has to move in the direction opposite to the gradient
vector. The update rule can be written as follows:

$$\text{for each } x_i \text{ in } \mathbf{x}: \quad x_i(t + 1) = x_i(t) - LR \cdot \frac{\partial f}{\partial x_i}\big(\mathbf{x}(t)\big)$$

where LR is the learning rate.

For a neural network, f is the loss function.
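The update rule can be sketched in a few lines of Python. The quadratic function, the starting point, and the learning rate below are assumptions chosen so that the minimum, at (3, -1), is known in advance.

```python
def f(x):
    """Hypothetical function with a single minimum at (3, -1)."""
    return (x[0] - 3) ** 2 + (x[1] + 1) ** 2


def grad_f(x):
    """Analytic gradient of f."""
    return [2 * (x[0] - 3), 2 * (x[1] + 1)]


LR = 0.1          # learning rate (assumed value)
x = [0.0, 0.0]    # starting point (assumed value)

for t in range(200):
    g = grad_f(x)
    # Move each coordinate against its partial derivative
    x = [xi - LR * gi for xi, gi in zip(x, g)]

print(x, f(x))
```

With this learning rate each step shrinks the distance to the minimum by a constant factor, so x ends up very close to (3, -1). A learning rate that is too large would make the iterates diverge instead.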

• The chain rule:
During a forward pass, the data is passed through successive layers of neurons. At each layer, the
outputs go through the activation function before reaching the next layer. The loss
function is calculated at the last layer of neurons (the output layer). Its expression can be written
as a function of all the parameters of the network. Example with the Categorical Cross-Entropy (CCE) loss:
$$L = -\sum_{k=1}^{n_3} y_k \log \frac{e^{z_k}}{\sum_{l=1}^{n_3} e^{z_l}}$$

where $z_k$ abbreviates the input of the k-th output neuron:

$$z_k = \sum_{m=1}^{n_2} \max\!\left(0,\; \sum_{j=1}^{n_1} \max\!\left(0,\; \sum_{i=1}^{n_0} x_i\,\omega_{1,i,j} + b_{1,j}\right)\omega_{2,j,m} + b_{2,m}\right)\omega_{3,m,k} + b_{3,k}$$

The loss is therefore a chain of functions.
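To make the nesting concrete, here is a hedged NumPy sketch of the same chain: two ReLU layers, a softmax output, and the CCE loss. The layer sizes, the random parameters, and the variable names (W1, h1, probs, ...) are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n0, n1, n2, n3 = 4, 5, 5, 3  # hypothetical layer sizes

x = rng.normal(size=n0)                           # one input sample
W1, b1 = rng.normal(size=(n0, n1)), np.zeros(n1)  # layer 1 parameters
W2, b2 = rng.normal(size=(n1, n2)), np.zeros(n2)  # layer 2 parameters
W3, b3 = rng.normal(size=(n2, n3)), np.zeros(n3)  # output layer parameters
y = np.array([0.0, 1.0, 0.0])                     # one-hot target

h1 = np.maximum(0, x @ W1 + b1)     # innermost max(0, ...)
h2 = np.maximum(0, h1 @ W2 + b2)    # second max(0, ...)
z = h2 @ W3 + b3                    # inputs of the output neurons

exp_z = np.exp(z - z.max())         # shift by the max for numerical stability
probs = exp_z / exp_z.sum()         # softmax
loss = -np.sum(y * np.log(probs))   # categorical cross-entropy
print(loss)
```

Each line feeds the next one, which is why the derivative of the loss with respect to any weight is obtained by chaining the derivatives of these steps.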

The derivative of a chain of functions is the product of the derivatives of all the functions in the chain, each evaluated at its own input.

Examples:
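A small worked sketch, assuming the hypothetical functions g(x) = 3x + 1 and f(u) = u², so that h(x) = f(g(x)). The chain rule gives dh/dx = f'(g(x)) · g'(x) = 2(3x + 1) · 3, which the code compares with a numerical estimate.

```python
def g(x):
    """Inner function (hypothetical): g(x) = 3x + 1."""
    return 3 * x + 1


def f(u):
    """Outer function (hypothetical): f(u) = u^2."""
    return u ** 2


def h(x):
    """Chain of the two functions."""
    return f(g(x))


x0 = 2.0
df_du = 2 * g(x0)        # derivative of the outer function at u = g(x0)
dg_dx = 3.0              # derivative of the inner function
dh_dx = df_du * dg_dx    # chain rule: product of the two derivatives

eps = 1e-6
dh_dx_numeric = (h(x0 + eps) - h(x0 - eps)) / (2 * eps)
print(dh_dx, dh_dx_numeric)
```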

• Back Propagation:
We want to back-propagate the gradients by calculating the derivatives and partial derivatives with
respect to each of the parameters and inputs. To do so, we apply the chain rule to the neural network. We
start with a single neuron:

Calculate Gradient of the ReLU activation function:


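The ReLU function is equivalent to the function max(y, 0). Therefore, its derivative is:

$$\mathrm{ReLU}(y) = \max(y, 0) \;\Rightarrow\; \frac{\partial\,\mathrm{ReLU}}{\partial y} = \begin{cases} 1 & \text{if } y \ge 0 \\ 0 & \text{if } y < 0 \end{cases}$$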

drelu_dxw0: the partial derivative of the ReLU w.r.t. the first weighted input, w0x0
drelu_dxw1: the partial derivative of the ReLU w.r.t. the second weighted input, w1x1
drelu_dxw2: the partial derivative of the ReLU w.r.t. the third weighted input, w2x2
drelu_db: the partial derivative of the ReLU w.r.t. the bias, b
Calculate Gradient of the sum function:
The partial derivative of the sum operation with respect to any of its terms is always equal to 1:
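Written out for a two-term sum (a standard result, given here as a reminder; the symbols x and y are generic placeholders):

```latex
f(x, y) = x + y
\quad\Rightarrow\quad
\frac{\partial f}{\partial x} = 1,
\qquad
\frac{\partial f}{\partial y} = 1
```

The same holds for the neuron's sum z = w0x0 + w1x1 + w2x2 + b: the derivative of z with respect to each weighted input and with respect to the bias is 1.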

Calculate Gradient of the multiplication function:
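For a product of two variables, the partial derivative with respect to one factor is the other factor (a standard result; x and y are generic placeholders):

```latex
f(x, y) = x \cdot y
\quad\Rightarrow\quad
\frac{\partial f}{\partial x} = y,
\qquad
\frac{\partial f}{\partial y} = x
```

Applied to a weighted input such as w0x0, the derivative with respect to x0 is w0 and the derivative with respect to w0 is x0.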

Example of backpropagation on a single neuron:

x = [1, -2, 3]   # inputs
w = [-3, -1, 2]  # weights
b = 1            # bias

# Forward pass
xw0 = x[0] * w[0]
xw1 = x[1] * w[1]
xw2 = x[2] * w[2]
z = xw0 + xw1 + xw2 + b  # weighted sum plus bias (here z = 6)

# ReLU activation
output = max(z, 0)

# Backward pass
# The derivative received from the next layer is 1.0 here
dvalue = 1.0

# Derivative of the ReLU, chained with dvalue
dReLU_dz = dvalue * (1. if z > 0 else 0.)

# Partial derivatives of the sum
dsum_dxw0 = 1  # = dz_dxw0
dsum_dxw1 = 1  # = dz_dxw1
dsum_dxw2 = 1  # = dz_dxw2
dsum_db = 1

# Chain rule through the sum
dReLU_dxw0 = dReLU_dz * dsum_dxw0
dReLU_dxw1 = dReLU_dz * dsum_dxw1
dReLU_dxw2 = dReLU_dz * dsum_dxw2
dReLU_db = dReLU_dz * dsum_db

# Partial derivatives of the multiplication
dmul_dx0 = w[0]
dmul_dx1 = w[1]
dmul_dx2 = w[2]
dmul_dw0 = x[0]
dmul_dw1 = x[1]
dmul_dw2 = x[2]

# Chain rule through the multiplication
dReLU_dx0 = dReLU_dxw0 * dmul_dx0
dReLU_dw0 = dReLU_dxw0 * dmul_dw0
dReLU_dx1 = dReLU_dxw1 * dmul_dx1
dReLU_dw1 = dReLU_dxw1 * dmul_dw1
dReLU_dx2 = dReLU_dxw2 * dmul_dx2
dReLU_dw2 = dReLU_dxw2 * dmul_dw2

# Determine the gradient vectors
dx = [dReLU_dx0, dReLU_dx1, dReLU_dx2]  # gradient of inputs
dw = [dReLU_dw0, dReLU_dw1, dReLU_dw2]  # gradient of weights
db = dReLU_db                           # gradient of bias

# Update the weights and the bias (learning rate 0.001)
w[0] += -0.001 * dw[0]
w[1] += -0.001 * dw[1]
w[2] += -0.001 * dw[2]
b += -0.001 * db

# Now, forward pass again
xw0 = x[0] * w[0]
xw1 = x[1] * w[1]
xw2 = x[2] * w[2]
z = xw0 + xw1 + xw2 + b
output = max(z, 0)
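As a cross-check of the single-neuron example, the sketch below computes the same gradients in vector form with NumPy and compares the weight gradient with a finite-difference estimate. The helper name neuron and the step size h are assumptions introduced for this check.

```python
import numpy as np

x = np.array([1.0, -2.0, 3.0])   # inputs from the example
w = np.array([-3.0, -1.0, 2.0])  # weights from the example
b = 1.0                          # bias from the example


def neuron(x, w, b):
    """Forward pass of one neuron: ReLU(x . w + b)."""
    return max(float(np.dot(x, w) + b), 0.0)


# Analytic gradients via the chain rule
z = np.dot(x, w) + b
dReLU_dz = 1.0 if z > 0 else 0.0
dw = dReLU_dz * x   # d(output)/d(w_i) = dReLU_dz * x_i
dx = dReLU_dz * w   # d(output)/d(x_i) = dReLU_dz * w_i
db = dReLU_dz       # d(output)/d(b)   = dReLU_dz * 1

# Finite-difference estimate of the weight gradient
h = 1e-6
dw_numeric = np.array([
    (neuron(x, w + h * e, b) - neuron(x, w - h * e, b)) / (2 * h)
    for e in np.eye(3)
])
print(dw, dw_numeric)
```

Because z = 6 is positive, the ReLU passes the gradient through unchanged, so dw equals x and dx equals w.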

Calculate Gradient of the CCE loss:
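A short derivation sketch, using assumed notation: y_{i,j} is the target and \hat{y}_{i,j} the predicted probability for sample i and class j.

```latex
L_i = -\sum_{j} y_{i,j} \log \hat{y}_{i,j}
\quad\Rightarrow\quad
\frac{\partial L_i}{\partial \hat{y}_{i,j}}
= -\,y_{i,j} \cdot \frac{1}{\hat{y}_{i,j}}
= -\frac{y_{i,j}}{\hat{y}_{i,j}}
```

Only the term of the sum that contains \hat{y}_{i,j} depends on it, so all other terms vanish when differentiating.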

# Method of the loss class; requires: import numpy as np
def backward(self, dvalues, y_true):
    # Determine the number of samples
    samples = len(dvalues)
    # Determine the number of labels in each sample
    # We use the first sample to count
    labels = len(dvalues[0])
    # If labels are sparse, turn them into one-hot vectors
    if len(y_true.shape) == 1:
        y_true = np.eye(labels)[y_true]
    # Calculate gradient
    self.dinputs = -y_true / dvalues
    # Normalize gradient by the number of samples
    self.dinputs = self.dinputs / samples
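A hedged usage sketch of the same computation as a standalone function, so it can be run without the surrounding class. The function name cce_backward and the sample values are assumptions for illustration.

```python
import numpy as np


def cce_backward(dvalues, y_true):
    """Gradient of the CCE loss w.r.t. the predicted probabilities."""
    dvalues = np.asarray(dvalues, dtype=float)
    y_true = np.asarray(y_true)
    samples = len(dvalues)
    labels = len(dvalues[0])
    # Sparse labels (class indices) are converted to one-hot vectors
    if y_true.ndim == 1:
        y_true = np.eye(labels)[y_true]
    # Gradient, normalized by the number of samples
    return (-y_true / dvalues) / samples


# Hypothetical predictions for 2 samples and 3 classes
softmax_outputs = np.array([[0.7, 0.1, 0.2],
                            [0.1, 0.5, 0.4]])
class_targets = np.array([0, 1])  # sparse labels

dinputs = cce_backward(softmax_outputs, class_targets)
print(dinputs)
```

Only the entry of the true class is non-zero in each row, and it grows in magnitude as the predicted probability of the true class gets smaller.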

Calculate Gradient of the softmax activation function:

The Softmax Output is defined by:

This is a function with n inputs and n outputs. Thus, calculating all the gradients of Sj results in a Jacobian matrix:
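For reference, the standard softmax definition and the shape of its Jacobian can be written as follows. The input symbol z is an assumption; S_j follows the notation of the text.

```latex
S_j = \frac{e^{z_j}}{\sum_{l=1}^{n} e^{z_l}}, \qquad j = 1, \dots, n

J =
\begin{pmatrix}
\dfrac{\partial S_1}{\partial z_1} & \cdots & \dfrac{\partial S_1}{\partial z_n} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial S_n}{\partial z_1} & \cdots & \dfrac{\partial S_n}{\partial z_n}
\end{pmatrix}
```

Every output S_j depends on every input z_k through the shared denominator, which is why a full n × n matrix is needed rather than a single vector.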


The calculation of each term of this Jacobian matrix gives:
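The standard result for the softmax derivative, written with the Kronecker delta (z_k denotes the k-th softmax input, an assumed symbol):

```latex
\frac{\partial S_j}{\partial z_k}
= S_j \left( \delta_{jk} - S_k \right)
= \begin{cases}
S_j \,(1 - S_j) & \text{if } j = k \\
-\,S_j\, S_k & \text{if } j \neq k
\end{cases}
```

In matrix form this is diag(S) − S Sᵀ, i.e. a diagonal matrix of the outputs minus the outer product of the output vector with itself.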


Therefore:

# Method of the softmax activation class; requires: import numpy as np
def backward(self, dvalues):
    # Create uninitialized array
    self.dinputs = np.empty_like(dvalues)

    # Enumerate outputs and gradients
    for index, (single_output, single_dvalues) in enumerate(zip(self.output, dvalues)):
        # Reshape the output into a column vector
        single_output = single_output.reshape(-1, 1)
        # Calculate the Jacobian matrix of the output
        jacobian_matrix = np.diagflat(single_output) - np.dot(single_output, single_output.T)
        # Calculate the sample-wise gradient
        # and add it to the array of sample gradients
        self.dinputs[index] = np.dot(jacobian_matrix, single_dvalues)
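A hedged, standalone sketch of the same backward step, runnable outside the class. The function names softmax and softmax_backward and the sample values are assumptions for illustration.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    exp_z = np.exp(z - np.max(z, axis=1, keepdims=True))
    return exp_z / np.sum(exp_z, axis=1, keepdims=True)


def softmax_backward(output, dvalues):
    """Multiply each sample's Jacobian by the incoming gradient."""
    dinputs = np.empty_like(dvalues)
    for index, (single_output, single_dvalues) in enumerate(zip(output, dvalues)):
        single_output = single_output.reshape(-1, 1)  # column vector
        jacobian = np.diagflat(single_output) - single_output @ single_output.T
        dinputs[index] = jacobian @ single_dvalues
    return dinputs


z = np.array([[1.0, 2.0, 3.0]])        # hypothetical softmax inputs
output = softmax(z)
dvalues = np.array([[1.0, 0.0, 0.0]])  # hypothetical incoming gradient

dinputs = softmax_backward(output, dvalues)
print(dinputs)
```

With this incoming gradient the result is the first column of the Jacobian. Its entries sum to zero because the softmax outputs always sum to 1, so raising one output must lower the others by the same total amount.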



Artificial Intelligence

END OF CHAPTER 4.1
