Part 1.2. Back Propagation

1. Softmax regression is used for multiclass classification problems where the predicted probabilities of the classes sum to 1. It is commonly used as the final layer in neural networks.
2. The cross-entropy loss function measures the performance of a classification model whose output is a probability value between 0 and 1. It takes the negative log likelihood of the predicted probabilities.
3. Backpropagation is used to calculate the gradient of the loss function with respect to the weights in each layer and to update the weights to reduce the loss during training. This allows neural networks to learn through iterative weight updates.


AI - FOUNDATION AND APPLICATION

Instructor:
Assoc. Prof. Dr. Truong Ngoc Son
Chapter 2
Back Propagation
Outline
• Multiclass classification with softmax regression
• Loss function – cross entropy
• Stochastic gradient descent – batch and mini-batch gradient descent
• Translating math into code
Multiclass Classification with Softmax Regression

Multiclass classification example

[Figure: a fully connected network with inputs x1 … xn, a hidden layer of m sigmoid neurons (weights W(1)j,i, biases b(1)j, activations a(1)j), and an output layer of k neurons (weights W(2)k,j, biases b(2)k, outputs o1 … ok).]
Multiclass classification example

[Figure: the same network, comparing the predictive output o with the desired one-hot output y, e.g. o = (0.9, 0.7, 0.5) against y = (1, 0, 0).]

With N training samples and K outputs, the cost/loss is the mean squared error

    $L = \frac{1}{N} \sum_{t=1}^{N} \sum_{k=1}^{K} \left( y_k^{(t)} - o_k^{(t)} \right)^2$
Multiclass classification with logistic regression
The sigmoid function is used for the neurons at the output layer:

    $\sigma(x) = \frac{1}{1 + e^{-x}}$

• It is ideal for two-class classification
• The outputs are independent
• The sigmoid may produce high probability for all classes, for some of them, or for none of them, e.g. (0.9, 0.9, 0.3), (0.8, 0.2, 0.2), or (0.6, 0.5, 0.1)
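A minimal sketch (not part of the original slides) showing that independent sigmoid outputs need not sum to one:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    logits = np.array([2.2, 2.2, -0.8])
    probs = sigmoid(logits)
    print(probs, probs.sum())  # ~[0.90 0.90 0.31], sum ~2.11 -- not a probability distribution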
Multiclass classification with softmax regression
We expect that there is only one right answer: the outputs are mutually exclusive.

    $\sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}$

• Softmax enforces that the probabilities of the output classes sum to one
• Softmax is used for multiclass classification in the logistic regression model, whereas the sigmoid is used for binary classification
• The softmax function is mostly used in the final layer of a neural network
• The outputs form a probability distribution
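As a sketch (the helper name softmax simply mirrors the function defined later in the Python code), the usual max-subtraction trick keeps the exponentials from overflowing without changing the result:

    import numpy as np

    def softmax(z):
        # subtract the row-wise maximum before exponentiating; the ratio is unchanged
        z = z - np.max(z, axis=-1, keepdims=True)
        e = np.exp(z)
        return e / np.sum(e, axis=-1, keepdims=True)

    print(softmax(np.array([2.0, 1.0, 0.1])))  # ~[0.659 0.242 0.099], sums to 1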
Loss function – cross entropy
Cross-entropy takes the negative log likelihood of the predicted probability.

Cross-entropy loss:

    $\mathrm{Loss} = -\sum_{i=1}^{M} y_i \log(o_i)$

M: number of classes
y: class label (one-hot)
o: predicted probability for the observation
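A small sketch (assumed, not from the slides) of the cross-entropy loss for a one-hot label; the epsilon guards against log(0):

    import numpy as np

    def cross_entropy(y, o, eps=1e-12):
        # y: one-hot label, o: predicted probabilities of the same shape
        return -np.sum(y * np.log(o + eps))

    y = np.array([1.0, 0.0, 0.0])
    o = np.array([0.7, 0.2, 0.1])
    print(cross_entropy(y, o))  # ~0.357, i.e. -log(0.7)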
Back propagation
Feedforward propagation with softmax activation

[Figure: the two-layer network; the cross-entropy loss $\mathrm{Loss} = -\sum_{k=1}^{K} y_k \log(o_k)$ is applied to the outputs.]

Hidden layer:
    $z_j^{(1)} = \sum_{i=1}^{n} x_i w_{j,i}^{(1)} + b_j^{(1)}$,   $a_j^{(1)} = \frac{1}{1 + e^{-z_j^{(1)}}}$

Output layer:
    $z_k^{(2)} = \sum_{j=1}^{m} a_j^{(1)} w_{k,j}^{(2)} + b_k^{(2)}$,   $o_k = \frac{e^{z_k^{(2)}}}{\sum_{j=1}^{K} e^{z_j^{(2)}}}$
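To make the notation concrete, here is a minimal per-sample forward pass in NumPy (a sketch under assumed shapes: x has n entries, W1 is m x n, W2 is k x m), mirroring the equations above:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def softmax(z):
        e = np.exp(z - np.max(z))
        return e / e.sum()

    def forward(x, W1, b1, W2, b2):
        z1 = W1 @ x + b1   # z_j^(1) = sum_i w_(j,i)^(1) x_i + b_j^(1)
        a1 = sigmoid(z1)   # hidden activations a_j^(1)
        z2 = W2 @ a1 + b2  # z_k^(2) = sum_j w_(k,j)^(2) a_j^(1) + b_k^(2)
        o = softmax(z2)    # output probabilities o_k
        return z1, a1, z2, o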
Feedforward propagation with softmax activation

[Figure: hidden activations a_1 … a_m feeding the output layer through weights W_k,j.]

Gradient descent:
    $w_{k,j} = w_{k,j} - \eta \frac{\partial L}{\partial w_{k,j}}$

Cross-entropy loss with softmax output:
    $L(y, o) = -\sum_{k=1}^{K} y_k \log(o_k)$,   $o_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$

Apply the chain rule:
    $\frac{\partial L}{\partial w_{k,j}} = \frac{\partial L}{\partial z_k} \frac{\partial z_k}{\partial w_{k,j}}$,   $\frac{\partial L}{\partial z_k} = \frac{\partial L}{\partial o_k} \frac{\partial o_k}{\partial z_k}$

Error of the kth neuron of the output layer:
    $\delta_k = \frac{\partial L}{\partial z_k}$. Using calculus, we obtain $\frac{\partial L}{\partial z_k} = o_k - y_k$, so $\delta_k = o_k - y_k$,

where $o_k$ is the kth component of the network's prediction and $y_k$ is the kth component of the label.
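The slide states the result without the intermediate step; combining the softmax Jacobian with the cross-entropy loss gives it directly (a standard derivation added here for completeness; $\delta_{ki}$ below is the Kronecker delta, not the error term):

    % L = -\sum_k y_k \log o_k with o_k = e^{z_k} / \sum_j e^{z_j}
    \frac{\partial o_k}{\partial z_i} = o_k(\delta_{ki} - o_i)
    \quad\Rightarrow\quad
    \frac{\partial L}{\partial z_i}
      = -\sum_k \frac{y_k}{o_k}\, o_k(\delta_{ki} - o_i)
      = -y_i + o_i \sum_k y_k
      = o_i - y_i
    % using that the one-hot label satisfies \sum_k y_k = 1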
Back-propagation

$\delta_j^{(1)}$ is the error of the jth neuron in the hidden layer:

    $\frac{\partial L}{\partial a_j^{(1)}} = \delta_j^{(1)}$

[Figure: the two-layer network; the hidden activation $a_j^{(1)}$ feeds every output neuron through the weights $w_{k,j}^{(2)}$, so its error collects contributions from all output neurons.]
Back-propagation

Back-propagate the error through the weights:

    $\delta_1^{(1)} = \delta_1 w_{1,1}^{(2)} + \dots + \delta_K w_{K,1}^{(2)}$,   in general   $\delta_j^{(1)} = \sum_{k=1}^{K} \delta_k w_{k,j}^{(2)}$

Back-propagate through the sigmoid function:

    $\frac{\partial L}{\partial z_j^{(1)}} = \delta_j^{(1)} a_j^{(1)} (1 - a_j^{(1)})$

Update the weights with $w_{j,i} = w_{j,i} - \eta \frac{\partial L}{\partial w_{j,i}}$:

    Output layer:   $w_{k,j}^{(2)} = w_{k,j}^{(2)} - \eta \frac{\partial L}{\partial w_{k,j}^{(2)}} = w_{k,j}^{(2)} - \eta \delta_k a_j^{(1)}$,   with $\delta_k = o_k - y_k$

    Hidden layer:   $w_{j,i}^{(1)} = w_{j,i}^{(1)} - \eta \delta_j^{(1)} a_j^{(1)} (1 - a_j^{(1)}) x_i$,   with $\delta_j^{(1)} = \sum_{k=1}^{K} \delta_k w_{k,j}^{(2)}$

These equations are the core of back-propagation; you should spend time mastering them.
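A minimal single-sample sketch in NumPy (assumed shapes and names, consistent with the per-sample forward pass sketched earlier) that applies exactly these update equations:

    import numpy as np

    def backward_update(x, y, W1, b1, W2, b2, lr=0.5):
        # forward pass
        z1 = W1 @ x + b1
        a1 = 1.0 / (1.0 + np.exp(-z1))
        z2 = W2 @ a1 + b2
        e = np.exp(z2 - z2.max())
        o = e / e.sum()

        # output-layer error: delta_k = o_k - y_k
        delta = o - y
        # hidden-layer error: sum_k delta_k w_(k,j)^(2), then through the sigmoid derivative
        delta_h = (W2.T @ delta) * a1 * (1.0 - a1)

        # gradient-descent updates of both layers
        W2 -= lr * np.outer(delta, a1)
        b2 -= lr * delta
        W1 -= lr * np.outer(delta_h, x)
        b1 -= lr * delta_h
        return W1, b1, W2, b2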
Stochastic gradient descent – batch and mini-batch gradient descent

• Stochastic gradient descent (SGD): use 1 sample in each iteration
• Batch gradient descent (GD): use all samples in each iteration
• Mini-batch gradient descent (Mini-batch GD): use b samples in each iteration, where b is the batch size
• Mini-batch stochastic gradient descent (Mini-batch SGD): use b samples in each iteration, with the batch of training samples randomly selected (see the sketch below)
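A schematic sketch (the helper names grad and params are assumptions, not from the slides) showing that the three variants differ only in how many samples each update sees:

    import numpy as np

    def train_epoch(X, Y, params, grad, lr, batch_size=None):
        # batch_size=None -> batch GD, 1 -> SGD, b -> mini-batch SGD
        N = X.shape[0]
        b = N if batch_size is None else batch_size
        order = np.random.permutation(N)      # random selection of samples
        for start in range(0, N, b):
            idx = order[start:start + b]
            g = grad(X[idx], Y[idx], params)  # gradient computed on this batch only
            params = [p - lr * gp for p, gp in zip(params, g)]
        return params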
PYTHON CODE
Translating Math into Code

[Figure: the same two-layer network applied to the MNIST dataset and trained with the cross-entropy loss.]
Feed forward propagation

[Figure: the two-layer network; X is the batch of inputs, W_h the hidden-layer weights, W_o the output-layer weights.]

Define:
    $W_h = W^{(1)}$,   $W_o = W^{(2)}$

Forward:
    $z_1 = X W_h^T$,   $a = \sigma(z_1) = \frac{1}{1 + e^{-z_1}}$,   $z_2 = a W_o^T$,   $o_k = \frac{e^{z_{2,k}}}{\sum_{j=1}^{K} e^{z_{2,j}}}$
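In matrix form this maps almost line for line onto the NumPy code that follows; a sketch with assumed shapes (X: t x n, Wh: m x n, Wo: k x m, bh: 1 x m, bo: 1 x k):

    import numpy as np

    def forward_batch(X, Wh, bh, Wo, bo):
        z1 = X @ Wh.T + bh                     # (t, m) hidden pre-activations
        a = 1.0 / (1.0 + np.exp(-z1))          # (t, m) sigmoid activations
        z2 = a @ Wo.T + bo                     # (t, k) output pre-activations
        e = np.exp(z2 - z2.max(axis=1, keepdims=True))
        o = e / e.sum(axis=1, keepdims=True)   # (t, k) row-wise softmax
        return a, o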
Back-propagation error (element-wise product)

Parameters:
    n: number of inputs
    m: number of neurons in the hidden layer
    k: number of neurons in the output layer
    t: number of training samples in a batch (batch_size)
    M: total number of training samples

Back-propagate the error:
    $\delta_1^{(1)} = \delta_1^{(2)} w_{1,1}^{(2)} + \dots + \delta_k^{(2)} w_{k,1}^{(2)}$,   $\delta_j^{(1)} = \sum_{k=1}^{K} \delta_k^{(2)} w_{k,j}^{(2)}$

In matrix form, the output-layer errors d (one row per training sample, shape t x k) are propagated through the output weights $W_o$ (shape k x m) to give the hidden-layer errors dh = d W_o (shape t x m). The hidden-layer weights $W_h$ have shape m x n.
Update weights

Output layer:
    $L(y, o) = -\sum_{k=1}^{K} y_k \log(o_k)$,   $w_{k,j}^{(2)} = w_{k,j}^{(2)} - \eta \frac{\partial L}{\partial w_{k,j}^{(2)}} = w_{k,j}^{(2)} - \eta \delta_k a_j^{(1)}$

    In matrix form (softmax output):   $d = o - y$,   $\Delta W_o = -\frac{\eta}{t} d^T a$,   $W_o = W_o + \Delta W_o$

Hidden layer:
    Back-propagate through the output weights:   $dh = d\, W_o$
    Back-propagate through the sigmoid:   $dhs = dh \odot a(1 - a)$
    $\Delta W_h = -\frac{\eta}{t}\, dhs^T X$,   $W_h = W_h + \Delta W_h$

    t: number of training samples in the batch

[Figure: the two-layer network annotated with these quantities.]
PYTHON CODE
Python code – Batch Gradient Descent
Load dataset

import numpy as np
import tensorflow as tf
# load dataset
print("Load MNIST Database")
mnist = tf.keras.datasets.mnist
(x_train,y_train),(x_test,y_test)= mnist.load_data()
x_train=np.reshape(x_train,(60000,784))/255.0
x_test= np.reshape(x_test,(10000,784))/255.0
y_train = np.matrix(np.eye(10)[y_train])  # one-hot encode the labels
y_test = np.matrix(np.eye(10)[y_test])
print("----------------------------------")
print(x_train.shape)
print(y_train.shape)
Python code
Define functions

def sigmoid(x):
    return 1./(1.+np.exp(-x))

def softmax(x):
    return np.divide(np.matrix(np.exp(x)),np.mat(np.sum(np.exp(x),axis=1)))

def Forwardpass(X,Wh,bh,Wo,bo):
    zh = X@Wh.T + bh
    a = sigmoid(zh)
    z = a@Wo.T + bo
    o = softmax(z)
    return o

def AccTest(label,prediction): # calculate the matching score
    OutMaxArg=np.argmax(prediction,axis=1)
    LabelMaxArg=np.argmax(label,axis=1)
    Accuracy=np.mean(OutMaxArg==LabelMaxArg)
    return Accuracy
Python code
Define network architecture, initialize weights

learningRate = 0.5
Epoch=50
NumTrainSamples=60000
NumTestSamples=10000

NumInputs=784
NumHiddenUnits=512
NumClasses=10
# initial weights
# hidden layer
Wh=np.matrix(np.random.uniform(-0.5,0.5,(NumHiddenUnits,NumInputs)))
bh= np.random.uniform(0,0.5,(1,NumHiddenUnits))
dWh= np.zeros((NumHiddenUnits,NumInputs))
dbh= np.zeros((1,NumHiddenUnits))
#Output layer
Wo=np.random.uniform(-0.5,0.5,(NumClasses,NumHiddenUnits))
bo= np.random.uniform(0,0.5,(1,NumClasses))
dWo= np.zeros((NumClasses,NumHiddenUnits))
dbo= np.zeros((1,NumClasses))
Python code – Batch Gradient Descent
Training the model

from IPython.display import clear_output
import matplotlib.pyplot as plt

loss = []
Acc = []
for ep in range(Epoch):
    # feedforward propagation
    x = x_train
    y = y_train
    zh = x@Wh.T + bh
    a = sigmoid(zh)
    z = a@Wo.T + bo
    o = softmax(z)
    # calculate loss
    loss.append(-np.sum(np.multiply(y,np.log10(o))))
    # calculate the error for the output layer
    d = o - y
    # back-propagate the error
    dh = d@Wo
    dhs = np.multiply(np.multiply(dh,a),(1-a))
    # update weights
    dWo = np.matmul(np.transpose(d),a)
    dbo = np.mean(d)    # consider a as 1 for the bias
    dWh = np.matmul(np.transpose(dhs),x)
    dbh = np.mean(dhs)  # consider a as 1 for the bias
    Wo = Wo - learningRate*dWo/NumTrainSamples
    bo = bo - learningRate*dbo
    Wh = Wh - learningRate*dWh/NumTrainSamples
    bh = bh - learningRate*dbh
    # test accuracy with the current weights
    prediction = Forwardpass(x_test,Wh,bh,Wo,bo)
    Acc.append(AccTest(y_test,prediction))
    clear_output(wait=True)
    plt.plot([i for i, _ in enumerate(Acc)],Acc,'o')
    plt.show()
Python code

Test the model

prediction = Forwardpass(x_test,Wh,bh,Wo,bo)
Rate = AccTest(y_test,prediction)
print(Rate)
Python code – Mini-Batch Gradient Descent
Training the model

from IPython.display import clear_output

loss = []
Acc = []
Batch_size = 200
Stochastic_samples = np.arange(NumTrainSamples)
for ep in range(Epoch):
    np.random.shuffle(Stochastic_samples)
    for ite in range(0,NumTrainSamples,Batch_size):
        # feedforward propagation
        Batch_samples = Stochastic_samples[ite:ite+Batch_size]
        x = x_train[Batch_samples,:]
        y = y_train[Batch_samples,:]
        zh = x@Wh.T + bh
        a = sigmoid(zh)
        z = a@Wo.T + bo
        o = softmax(z)
        # calculate loss
        loss.append(-np.sum(np.multiply(y,np.log10(o))))
        # calculate the error for the output layer
        d = o - y
        # back-propagate the error
        dh = d@Wo
        dhs = np.multiply(np.multiply(dh,a),(1-a))
        # update weights
        dWo = np.matmul(np.transpose(d),a)
        dbo = np.mean(d)    # consider a as 1 for the bias
        dWh = np.matmul(np.transpose(dhs),x)
        dbh = np.mean(dhs)  # consider a as 1 for the bias
        Wo = Wo - learningRate*dWo/Batch_size
        bo = bo - learningRate*dbo
        Wh = Wh - learningRate*dWh/Batch_size
        bh = bh - learningRate*dbh
        # test accuracy with the current weights
        prediction = Forwardpass(x_test,Wh,bh,Wo,bo)
        Acc.append(AccTest(y_test,prediction))
        clear_output(wait=True)
        plt.plot([i for i, _ in enumerate(Acc)],Acc,'o')
        plt.show()
        print('Epoch:', ep)
        print('Accuracy:', AccTest(y_test,prediction))
Python code – Mini-Batch Gradient Descent
Just calculate the loss and accuracy after each epoch

from IPython.display import clear_output

loss = []
Acc = []
Batch_size = 200
Stochastic_samples = np.arange(NumTrainSamples)
for ep in range(Epoch):
    np.random.shuffle(Stochastic_samples)
    for ite in range(0,NumTrainSamples,Batch_size):
        # feedforward propagation
        Batch_samples = Stochastic_samples[ite:ite+Batch_size]
        x = x_train[Batch_samples,:]
        y = y_train[Batch_samples,:]
        zh = x@Wh.T + bh
        a = sigmoid(zh)
        z = a@Wo.T + bo
        o = softmax(z)
        # calculate loss
        loss.append(-np.sum(np.multiply(y,np.log10(o))))
        # calculate the error for the output layer
        d = o - y
        # back-propagate the error
        dh = d@Wo
        dhs = np.multiply(np.multiply(dh,a),(1-a))
        # update weights
        dWo = np.matmul(np.transpose(d),a)
        dbo = np.mean(d)    # consider a as 1 for the bias
        dWh = np.matmul(np.transpose(dhs),x)
        dbh = np.mean(dhs)  # consider a as 1 for the bias
        Wo = Wo - learningRate*dWo/Batch_size
        bo = bo - learningRate*dbo
        Wh = Wh - learningRate*dWh/Batch_size
        bh = bh - learningRate*dbh
    # test accuracy once per epoch with the current weights
    prediction = Forwardpass(x_test,Wh,bh,Wo,bo)
    Acc.append(AccTest(y_test,prediction))
    print('Epoch:', ep)
    print('Accuracy:', AccTest(y_test,prediction))

Sample output:
Epoch: 0
Accuracy: 0.8762
Epoch: 1
Accuracy: 0.9013
Epoch: 2
Accuracy: 0.9136
Epoch: 3
Accuracy: 0.9165
Epoch: 4
Accuracy: 0.9251