Part 1.2. Back Propagation
Instructor:
Assoc. Prof. Dr. Truong Ngoc Son
Chapter 2
Back Propagation
Outline
Multiclass Classification With Softmax regression
Loss function – cross entropy
Stochastic gradient descent – batch and
mini-batch gradient descent
Translating math into code
Multiclass Classification with Softmax Regression
Multiclass classification example
[Figure: fully connected network. The inputs x1 … xn feed a hidden layer of neurons with weights W(1)j,i, biases b(1)j, pre-activations z(1)j, activations a(1)j and activation function f; the hidden layer feeds the output layer with weights W(2)k,j, biases b(2)k and outputs o1 … ok.]
Multiclass classification example
[Figure: the same network, with the predictive outputs o compared against the desired outputs y.]

Predictive output, o    Desired output, y
0.9                     1
0.7                     0
0.5                     0

With N training samples and K outputs, the cost / loss is

L = (1/N) Σ_{t=1}^{N} Σ_{k=1}^{K} (y_k^(t) − o_k^(t))^2
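As a quick numeric check of this cost on the single sample shown in the table above (so N = 1 and K = 3), a minimal sketch:

y = [1, 0, 0]          # desired outputs from the table
o = [0.9, 0.7, 0.5]    # predictive outputs from the table

N = 1
L = (1/N) * sum((yk - ok)**2 for yk, ok in zip(y, o))
print(L)               # 0.1^2 + 0.7^2 + 0.5^2 = 0.75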
Multiclass classification with logistic regression
The sigmoid function is used for the neurons at the output layer.
Softmax is used for multiclass classification in the logistic regression model, whereas sigmoid is used for binary classification in the logistic regression model.
The softmax enforces that the probabilities of the output classes sum to one:

σ(z)_j = e^(z_j) / Σ_{k=1}^{K} e^(z_k)
Multiclass classification with softmax regression
The softmax function is mostly used in the final layer of a neural network.
The outputs form a probability distribution:

σ(z)_j = e^(z_j) / Σ_{k=1}^{K} e^(z_k)
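A minimal sketch of this formula in NumPy (the logit values are made up for illustration), confirming that the outputs sum to one:

import numpy as np

z = np.array([2.0, 1.0, 0.1])          # example logits
o = np.exp(z) / np.sum(np.exp(z))      # softmax: e^(z_j) / sum_k e^(z_k)
print(o)                               # approx [0.659, 0.242, 0.099]
print(o.sum())                         # 1.0 – a probability distribution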
Loss function – cross entropy
Cross-entropy takes the negative log likelihood of the predicted probability.
Cross-entropy loss:

Loss = − Σ_{i=1}^{M} y_i log(o_i)

M – number of classes
y – class label (one-hot encoded)
o – predicted probability for the observation
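A minimal numeric sketch, reusing the illustrative softmax outputs from above: with a one-hot label, only the log-probability of the true class contributes to the loss.

import numpy as np

y = np.array([1, 0, 0])                # one-hot label, true class is class 0
o = np.array([0.659, 0.242, 0.099])    # predicted probabilities (illustrative)

loss = -np.sum(y * np.log(o))          # cross-entropy = -log(o_true)
print(loss)                            # approx 0.417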
Back propagation
Feedforward propagation with softmax activation
[Figure: the network with inputs x1 … xn, hidden activations a(1)1 … a(1)m and softmax outputs o1 … ok.]

z_j^(1) = Σ_{i=1}^{n} x_i w_{j,i}^(1) + b_j^(1)

a_j^(1) = 1 / (1 + e^(−z_j^(1)))

z_k^(2) = Σ_{j=1}^{m} a_j^(1) w_{k,j}^(2) + b_k^(2)

o_k = e^(z_k^(2)) / Σ_{j=1}^{K} e^(z_j^(2))
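A minimal per-sample sketch of these four equations in NumPy. The toy sizes and the random initialisation are assumptions for illustration only; the names Wh, bh, Wo, bo mirror the ones used in the lecture code later on.

import numpy as np

n, m, K = 4, 5, 3                      # inputs, hidden neurons, classes (toy sizes)
rng = np.random.default_rng(0)

x  = rng.random(n)                     # one input sample
Wh = rng.uniform(-0.5, 0.5, (m, n))    # hidden weights w^(1)_{j,i}
bh = np.zeros(m)                       # hidden biases b^(1)_j
Wo = rng.uniform(-0.5, 0.5, (K, m))    # output weights w^(2)_{k,j}
bo = np.zeros(K)                       # output biases b^(2)_k

z1 = Wh @ x + bh                       # z^(1)_j = sum_i x_i w^(1)_{j,i} + b^(1)_j
a  = 1.0 / (1.0 + np.exp(-z1))         # a^(1)_j = sigmoid(z^(1)_j)
z2 = Wo @ a + bo                       # z^(2)_k = sum_j a^(1)_j w^(2)_{k,j} + b^(2)_k
o  = np.exp(z2) / np.sum(np.exp(z2))   # o_k = softmax(z^(2))_k
print(o, o.sum())                      # class probabilities, summing to 1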
Feedforward propagation with softmax activation

[Figure: the hidden activations a1 … am feed the output neurons z1 … zk through the weights W, producing the softmax outputs o1 … ok.]

Gradient descent:   w_{k,j} = w_{k,j} − η ∂L/∂w_{k,j}

L(y, o) = − Σ_{k=1}^{K} y_k log(o_k)

o_k = e^(z_k) / Σ_{j=1}^{K} e^(z_j)

Apply the chain rule:

∂L/∂w_{k,j} = (∂L/∂z_k) (∂z_k/∂w_{k,j})

∂L/∂z_k = (∂L/∂o_k) (∂o_k/∂z_k)
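Carrying the chain rule through (every output o_j depends on z_k through the softmax denominator, so the full sum over j is needed) gives the standard softmax-with-cross-entropy gradient; this is what the later code computes as d = o − y. Here δ_{jk} denotes the Kronecker delta:

∂L/∂z_k = Σ_{j=1}^{K} (∂L/∂o_j)(∂o_j/∂z_k)
        = Σ_{j=1}^{K} (−y_j/o_j) · o_j (δ_{jk} − o_k)
        = −y_k + o_k Σ_j y_j
        = o_k − y_k                    (since Σ_j y_j = 1 for a one-hot label)

∂L/∂w_{k,j} = (∂L/∂z_k)(∂z_k/∂w_{k,j}) = (o_k − y_k) a_j^(1)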
[Figure: the hidden activation a(1)j is connected to every output neuron o1 … ok through the weights W(2)k,j.]

∂L/∂a_j^(1) = δ_j^(1)

a_j^(1) affects all neurons of the next layer via the weights, so δ_j^(1) collects the errors propagated back from every output neuron.
Back-propagation

[Figure: the same network, with the error flowing back from the output layer to the hidden layer.]

Back propagate the error:

δ_1^(1) = δ_1^(2) w_{1,1}^(2) + … + δ_k^(2) w_{k,1}^(2)

δ_j^(1) = Σ_{k=1}^{K} δ_k^(2) w_{k,j}^(2)

Back propagate through the sigmoid function:

∂L/∂z_j^(1) = δ_j^(1) a_j^(1) (1 − a_j^(1))

Update the weights:

w_{j,i}^(1) = w_{j,i}^(1) − η ∂L/∂w_{j,i}^(1)

w_{k,j}^(2) = w_{k,j}^(2) − η ∂L/∂w_{k,j}^(2) = w_{k,j}^(2) − η δ_k^(2) ∂z_k^(2)/∂w_{k,j}^(2)
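A minimal per-sample sketch of these update rules, continuing the toy forward pass above (x, a, o, Wh, bh, Wo, bo). The one-hot target and learning rate are made-up values; the lecture's own code below applies the same rules to whole batches.

y = np.array([1.0, 0.0, 0.0])          # one-hot target for the sample
eta = 0.5                              # learning rate

d2 = o - y                             # delta^(2)_k = dL/dz^(2)_k = o_k - y_k
d1 = Wo.T @ d2                         # delta^(1)_j = sum_k delta^(2)_k w^(2)_{k,j}
d1s = d1 * a * (1.0 - a)               # back propagate through the sigmoid

Wo -= eta * np.outer(d2, a)            # dL/dw^(2)_{k,j} = delta^(2)_k a^(1)_j
bo -= eta * d2
Wh -= eta * np.outer(d1s, x)           # dL/dw^(1)_{j,i} = (dL/dz^(1)_j) x_i
bh -= eta * d1s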
MNIST Dataset
[Figure: the fully connected network (input layer – hidden layer – output layer) applied to the MNIST handwritten digit images.]
Feed forward propagation
[Figure: the same network, written with the weight matrices Wh (hidden layer) and Wo (output layer).]

Define:
Wh = W(1)
Wo = W(2)

Forward:
z1 = X Whᵀ
a = σ(z1) = 1 / (1 + e^(−z1))
z2 = a Woᵀ
o_k = e^(z2_k) / Σ_{j=1}^{K} e^(z2_j)
Back-propagation error (element-wise product)

Parameters:
n: number of inputs
m: number of neurons in the hidden layer
k: number of neurons in the output layer
t: number of training samples per batch (batch_size)
M: total number of training samples

[Figure: for a batch, the output-layer errors form a t × k matrix d, one row (d1, d2, …, dk) per training sample; the output weight matrix Wo is k × m and the hidden weight matrix Wh is m × n.]

Back propagate the error:

δ_1^(1) = δ_1^(2) w_{1,1}^(2) + … + δ_k^(2) w_{k,1}^(2)

δ_j^(1) = Σ_{k=1}^{K} δ_k^(2) w_{k,j}^(2)

In matrix form, for the whole batch: dh = d Wo, a (t × k)(k × m) = t × m matrix holding the back-propagated error of every hidden neuron for every training sample.
[Figure: the same network, annotated with the matrix update rules.]

Update the output weights:

ΔWo = −(η/t) dᵀ a
Wo = Wo + ΔWo

Back propagate through the sigmoid (element-wise product):

dhs = dh ⊙ a ⊙ (1 − a)

Update the hidden weights:

ΔWh = −(η/t) dhsᵀ X
Wh = Wh + ΔWh
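A minimal vectorized sketch of these matrix equations on random data (the toy sizes, random targets and single update step are assumptions for illustration); it mirrors what the training loop below does on MNIST.

import numpy as np

t, n, m, k = 8, 4, 5, 3                   # batch size, inputs, hidden units, classes
rng = np.random.default_rng(1)

X  = rng.random((t, n))                   # one batch of inputs, t x n
Y  = np.eye(k)[rng.integers(0, k, t)]     # one-hot targets, t x k
Wh = rng.uniform(-0.5, 0.5, (m, n))
Wo = rng.uniform(-0.5, 0.5, (k, m))
eta = 0.5

# forward pass in matrix form (as defined earlier)
a  = 1.0 / (1.0 + np.exp(-(X @ Wh.T)))    # t x m hidden activations
z2 = a @ Wo.T                             # t x k output pre-activations
o  = np.exp(z2) / np.exp(z2).sum(axis=1, keepdims=True)

# back-propagation in matrix form
d   = o - Y                               # t x k output-layer error
dh  = d @ Wo                              # t x m back-propagated error
dhs = dh * a * (1.0 - a)                  # element-wise product through the sigmoid

Wo += -(eta / t) * (d.T @ a)              # delta Wo = -(eta/t) d^T a    (k x m)
Wh += -(eta / t) * (dhs.T @ X)            # delta Wh = -(eta/t) dhs^T X  (m x n)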
import numpy as np
import tensorflow as tf
#load dataset
print("Load MNIST Database")
mnist = tf.keras.datasets.mnist
(x_train,y_train),(x_test,y_test)= mnist.load_data()
x_train=np.reshape(x_train,(60000,784))/255.0
x_test= np.reshape(x_test,(10000,784))/255.0
y_train = np.matrix(np.eye(10)[y_train])
y_test = np.matrix(np.eye(10)[y_test])
print("----------------------------------")
print(x_train.shape)
print(y_train.shape)
Python code
Define functions
def sigmoid(x):
    return 1./(1.+np.exp(-x))

def softmax(x):
    return np.divide(np.matrix(np.exp(x)),np.mat(np.sum(np.exp(x),axis=1)))

def Forwardpass(X,Wh,bh,Wo,bo):
    zh = [email protected] + bh
    a = sigmoid(zh)
    z = [email protected] + bo
    o = softmax(z)
    return o

def AccTest(label,prediction): # calculate the matching score
    OutMaxArg = np.argmax(prediction,axis=1)
    LabelMaxArg = np.argmax(label,axis=1)
    Accuracy = np.mean(OutMaxArg==LabelMaxArg)
    return Accuracy
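A quick sanity check of the helpers on random data (the toy batch of 100 samples and the random weights are assumptions); with untrained weights the accuracy should sit near chance level, about 0.1 for 10 classes.

Xs = np.random.uniform(0, 1, (100, 784))                    # 100 fake "images"
Ys = np.matrix(np.eye(10)[np.random.randint(0, 10, 100)])   # random one-hot labels
Wh0 = np.matrix(np.random.uniform(-0.5, 0.5, (512, 784)))
bh0 = np.zeros((1, 512))
Wo0 = np.random.uniform(-0.5, 0.5, (10, 512))
bo0 = np.zeros((1, 10))

out = Forwardpass(Xs, Wh0, bh0, Wo0, bo0)                   # 100 x 10 class probabilities
print(AccTest(Ys, out))                                     # roughly 0.1 with random weights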
Python code
Define network architecture, initialize weights
learningRate = 0.5
Epoch=50
NumTrainSamples=60000
NumTestSamples=10000
NumInputs=784
NumHiddenUnits=512
NumClasses=10
#initial weights
#hidden layer
Wh=np.matrix(np.random.uniform(-0.5,0.5,(NumHiddenUnits,NumInputs)))
bh= np.random.uniform(0,0.5,(1,NumHiddenUnits))
dWh= np.zeros((NumHiddenUnits,NumInputs))
dbh= np.zeros((1,NumHiddenUnits))
#Output layer
Wo=np.random.uniform(-0.5,0.5,(NumClasses,NumHiddenUnits))
bo= np.random.uniform(0,0.5,(1,NumClasses))
dWo= np.zeros((NumClasses,NumHiddenUnits))
dbo= np.zeros((1,NumClasses))
Python code – Batch Gradient Descent
Training the model
from IPython.display import clear_output
import matplotlib.pyplot as plt   # needed for the accuracy plot below

loss = []
Acc = []
for ep in range(Epoch):
    #feed forward propagation
    x = x_train
    y = y_train
    zh = [email protected] + bh
    a = sigmoid(zh)
    z = [email protected] + bo
    o = softmax(z)
    #calculate loss
    loss.append(-np.sum(np.multiply(y,np.log10(o))))
    #calculate the error for the output layer
    d = o-y
    #Back propagate error
    dh = d@Wo
    dhs = np.multiply(np.multiply(dh,a),(1-a))
    #update weights
    dWo = np.matmul(np.transpose(d),a)
    dbo = np.mean(d) # consider a is 1 for the bias input
    dWh = np.matmul(np.transpose(dhs),x)
    dbh = np.mean(dhs) # consider a is 1 for the bias input
    Wo = Wo - learningRate*dWo/NumTrainSamples
    bo = bo - learningRate*dbo
    Wh = Wh - learningRate*dWh/NumTrainSamples
    bh = bh - learningRate*dbh
    #Test accuracy on the test set after each epoch
    prediction = Forwardpass(x_test,Wh,bh,Wo,bo)
    Acc.append(AccTest(y_test,prediction))
    clear_output(wait=True)
    plt.plot([i for i, _ in enumerate(Acc)],Acc,'o')
    plt.show()
Python code
prediction = Forwardpass(x_test,Wh,bh,Wo,bo)
Rate = AccTest(y_test,prediction)
print(Rate)
Python code – Mini-Batch Gradient Descent
Training the model
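The mini-batch training loop itself is not included in this extract. Below is a minimal sketch of how it could look, reusing the variables defined earlier (learningRate, Epoch, NumTrainSamples, Wh, bh, Wo, bo, x_train, y_train, x_test, y_test); the batch size of 128 and the per-epoch shuffling are assumptions, not taken from the lecture.

BatchSize = 128                     # assumed batch size
Acc = []
for ep in range(Epoch):
    # shuffle the training set at the start of every epoch (assumed)
    idx = np.random.permutation(NumTrainSamples)
    for start in range(0, NumTrainSamples, BatchSize):
        batch = idx[start:start+BatchSize]
        x = x_train[batch]
        y = y_train[batch]
        # forward pass on the mini-batch
        zh = [email protected] + bh
        a = sigmoid(zh)
        z = [email protected] + bo
        o = softmax(z)
        # back propagate the error for this mini-batch
        d = o - y
        dh = d@Wo
        dhs = np.multiply(np.multiply(dh,a),(1-a))
        # update weights, averaging over the mini-batch instead of the full training set
        Wo = Wo - learningRate*np.matmul(np.transpose(d),a)/len(batch)
        bo = bo - learningRate*np.mean(d)
        Wh = Wh - learningRate*np.matmul(np.transpose(dhs),x)/len(batch)
        bh = bh - learningRate*np.mean(dhs)
    # track test accuracy once per epoch
    prediction = Forwardpass(x_test,Wh,bh,Wo,bo)
    Acc.append(AccTest(y_test,prediction))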