Chapter 6
13) There are 6 weights connecting the Hidden Layer 2 nodes to the Output Layer nodes:

$w^3_{11}$ = the weight connecting the 1st node of the 3rd layer to the 1st node of the next layer
$w^3_{12}$ = the weight connecting the 1st node of the 3rd layer to the 2nd node of the next layer
$w^3_{21}$ = the weight connecting the 2nd node of the 3rd layer to the 1st node of the next layer
$w^3_{22}$ = the weight connecting the 2nd node of the 3rd layer to the 2nd node of the next layer
$w^3_{31}$ = the weight connecting the 3rd node of the 3rd layer to the 1st node of the next layer
$w^3_{32}$ = the weight connecting the 3rd node of the 3rd layer to the 2nd node of the next layer
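As a concrete illustration with made-up values, the six weights above can be stored in a single array, laid out the same way as the weight matrices defined later in this chapter (row $j$ holds the weights going into the $j$-th node of the next layer). This is only a sketch; the numbers are arbitrary.

```python
import numpy as np

# Hypothetical values for the six weights w^3_{ij}.
# Row j holds the weights going INTO the j-th output node.
W3 = np.array([
    [0.10, 0.20, 0.30],   # w^3_11, w^3_21, w^3_31
    [0.40, 0.50, 0.60],   # w^3_12, w^3_22, w^3_32
])
print(W3.size)   # 6 weights in total
```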
17) These are the pre-activation and activation values of Hidden Layer 2:

$a^2_1$ = pre-activation of the 1st node of Hidden Layer 2
$a^2_2$ = pre-activation of the 2nd node of Hidden Layer 2
$a^2_3$ = pre-activation of the 3rd node of Hidden Layer 2
$h^2_1$ = activation of the 1st node of Hidden Layer 2
$h^2_2$ = activation of the 2nd node of Hidden Layer 2
$h^2_3$ = activation of the 3rd node of Hidden Layer 2
18) These are the pre-activation and activation values of the Output Layer:

$a^3_1$ = pre-activation of the 1st node of the Output Layer
$a^3_2$ = pre-activation of the 2nd node of the Output Layer
$h^3_1$ = activation of the 1st node of the Output Layer
$h^3_2$ = activation of the 2nd node of the Output Layer
Pre-activation and Activation vectors

Pre-activation vector of layer $L$:
$$a^L = \begin{bmatrix} a^L_1 \\ a^L_2 \\ a^L_3 \\ \vdots \\ a^L_n \end{bmatrix}$$

Activation vector of layer $L$:
$$H^L = \begin{bmatrix} h^L_1 \\ h^L_2 \\ h^L_3 \\ \vdots \\ h^L_n \end{bmatrix}$$
6) The weights between the $(L-1)$th layer and the $L$th layer are stored in the matrix $W^L$:
$$W^L = \begin{bmatrix} w^L_{11} & w^L_{21} & w^L_{31} & \cdots & w^L_{n1} \\ w^L_{12} & w^L_{22} & w^L_{32} & \cdots & w^L_{n2} \\ w^L_{13} & w^L_{23} & w^L_{33} & \cdots & w^L_{n3} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ w^L_{1n} & w^L_{2n} & w^L_{3n} & \cdots & w^L_{nn} \end{bmatrix}, \qquad W^i \in \mathbb{R}^{n \times n}$$
7) The biases at the $L$th layer to the $(L+1)$th layer, with $n$ neurons, are in the vector $B^L$:
$$B^L = \begin{bmatrix} b^L_1 \\ b^L_2 \\ b^L_3 \\ \vdots \\ b^L_n \end{bmatrix}, \qquad B^i \in \mathbb{R}^{n}$$
8) The weights between the output layer with $k$ neurons and the last hidden layer with $n$ neurons:
$$W^L = \begin{bmatrix} w^L_{11} & w^L_{21} & w^L_{31} & \cdots & w^L_{n1} \\ w^L_{12} & w^L_{22} & w^L_{32} & \cdots & w^L_{n2} \\ w^L_{13} & w^L_{23} & w^L_{33} & \cdots & w^L_{n3} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ w^L_{1k} & w^L_{2k} & w^L_{3k} & \cdots & w^L_{nk} \end{bmatrix}, \qquad W^i \in \mathbb{R}^{k \times n}$$
9) The biases between the output layer with $k$ neurons and the last hidden layer with $n$ neurons:
$$B^L = \begin{bmatrix} b^L_1 \\ b^L_2 \\ b^L_3 \\ \vdots \\ b^L_k \end{bmatrix}, \qquad B^i \in \mathbb{R}^{k}$$
Pre-activation at Layer L
$$a^L = W^{L^T} H^{L-1} + B^L$$
$$\begin{bmatrix} a^L_1 \\ a^L_2 \\ a^L_3 \\ \vdots \\ a^L_n \end{bmatrix} = \begin{bmatrix} w^L_{11} & w^L_{12} & w^L_{13} & \cdots & w^L_{1n} \\ w^L_{21} & w^L_{22} & w^L_{23} & \cdots & w^L_{2n} \\ w^L_{31} & w^L_{32} & w^L_{33} & \cdots & w^L_{3n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ w^L_{n1} & w^L_{n2} & w^L_{n3} & \cdots & w^L_{nn} \end{bmatrix} \begin{bmatrix} h^{L-1}_1 \\ h^{L-1}_2 \\ h^{L-1}_3 \\ \vdots \\ h^{L-1}_n \end{bmatrix} + \begin{bmatrix} b^L_1 \\ b^L_2 \\ b^L_3 \\ \vdots \\ b^L_n \end{bmatrix}$$
Activation at Layer L
$$H^L = g\left(a^L\right)$$
$$\begin{bmatrix} h^L_1 \\ h^L_2 \\ h^L_3 \\ \vdots \\ h^L_n \end{bmatrix} = g\left( \begin{bmatrix} a^L_1 \\ a^L_2 \\ a^L_3 \\ \vdots \\ a^L_n \end{bmatrix} \right)$$
Here $g(\cdot)$ is the activation function, applied element-wise. For example, $g(x)$ can be the logistic sigmoid activation function:
$$g(x) = \sigma(x) = \frac{1}{1 + e^{-x}}$$
That means
$$H^L = \frac{1}{1 + e^{-a^L}}$$
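To make the two steps above concrete, here is a minimal NumPy sketch of one layer's forward pass under the chapter's convention $a^L = W^{L^T} H^{L-1} + B^L$ followed by the element-wise sigmoid. The layer sizes, random values, and function names are illustrative assumptions (equal layer sizes are used so the matrix orientation does not matter).

```python
import numpy as np

def sigmoid(x):
    # Logistic sigmoid g(x) = 1 / (1 + e^(-x)), applied element-wise
    return 1.0 / (1.0 + np.exp(-x))

def layer_forward(W, b, h_prev):
    # Pre-activation: a^L = W^{L^T} H^{L-1} + B^L
    a = W.T @ h_prev + b
    # Activation: H^L = g(a^L)
    return a, sigmoid(a)

# Illustrative sizes: both the previous layer and this layer have 3 neurons
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 3))    # weight matrix W^L
b = rng.normal(size=3)         # bias vector B^L
h_prev = rng.normal(size=3)    # activation vector H^{L-1}

a, h = layer_forward(W, b, h_prev)
print(a)   # pre-activation vector a^L
print(h)   # activation vector H^L, each entry in (0, 1)
```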
Loss function
How "bad" the network's performance is gets measured using some notion of the difference between the actual output and the predicted output. One popular loss function is the cross-entropy loss:
$$L(\theta) = \sum_{i=1}^{n} P(X) \cdot \log\!\left(\frac{1}{Q(X)}\right)$$
Here
P(X) is the actual probability distribution
Q(X) is the predicted distribution
More about Cross Entropy in Chapter 2
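As a quick numerical check of this formula, here is a minimal sketch that computes the cross-entropy between an actual distribution P and a predicted distribution Q over three classes; the vectors and the clipping constant are made-up illustrative choices.

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    # L(theta) = sum_i p_i * log(1 / q_i) = -sum_i p_i * log(q_i)
    q = np.clip(q, eps, 1.0)      # avoid log(0) for numerical safety
    return np.sum(p * np.log(1.0 / q))

p = np.array([0.0, 1.0, 0.0])     # actual distribution (one-hot label)
q = np.array([0.2, 0.7, 0.1])     # predicted distribution
print(cross_entropy(p, q))        # ~0.357
```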
General Representation of weights
The weights at the $L$th layer (with $n$ neurons) to the $(L+1)$th layer (with $m$ neurons) are in the matrix $W^L$:
$$W^L = \begin{pmatrix} w^L_{11} & \cdots & w^L_{n1} \\ \vdots & \ddots & \vdots \\ w^L_{1m} & \cdots & w^L_{nm} \end{pmatrix}$$
General Representation of biases
The biases at the $L$th layer to the $(L+1)$th layer, with $m$ neurons, are in the vector $B^L$:
$$B^L = \begin{pmatrix} b^L_1 \\ b^L_2 \\ \vdots \\ b^L_m \end{pmatrix}$$
Gradient Descent algorithm

t ← 0;
epochs ← 1000;
while t < epochs do
    w_{t+1} = w_t − η∇w;
    b_{t+1} = b_t − η∇b;
    t ← t + 1;
end
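The same loop, written as a runnable Python sketch. Since the actual network gradients have not been derived yet, a toy one-dimensional loss L(w, b) = (w − 3)² + (b + 1)² stands in for the network loss; the learning rate η and the epoch count are illustrative assumptions.

```python
# Gradient descent on a toy loss L(w, b) = (w - 3)^2 + (b + 1)^2,
# whose gradients are dL/dw = 2(w - 3) and dL/db = 2(b + 1).
def gradient_descent(eta=0.1, epochs=1000):
    w, b = 0.0, 0.0                # initial parameters
    for t in range(epochs):
        grad_w = 2.0 * (w - 3.0)   # ∇w
        grad_b = 2.0 * (b + 1.0)   # ∇b
        w = w - eta * grad_w       # w_{t+1} = w_t - η ∇w
        b = b - eta * grad_b       # b_{t+1} = b_t - η ∇b
    return w, b

print(gradient_descent())          # converges toward (3.0, -1.0)
```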
For representational simplicity, let's store our learnable parameters in another vector $\theta$:
$$\theta = \begin{bmatrix} W^1 \\ W^2 \\ W^3 \\ \vdots \\ W^{L-1} \\ B^1 \\ B^2 \\ B^3 \\ \vdots \\ B^{L-1} \end{bmatrix} = \begin{bmatrix} \begin{pmatrix} w^1_{11} & \cdots & w^1_{n_1 1} \\ \vdots & \ddots & \vdots \\ w^1_{1 n_2} & \cdots & w^1_{n_1 n_2} \end{pmatrix} \\[2ex] \begin{pmatrix} w^2_{11} & \cdots & w^2_{n_2 1} \\ \vdots & \ddots & \vdots \\ w^2_{1 n_3} & \cdots & w^2_{n_2 n_3} \end{pmatrix} \\[2ex] \begin{pmatrix} w^3_{11} & \cdots & w^3_{n_3 1} \\ \vdots & \ddots & \vdots \\ w^3_{1 n_4} & \cdots & w^3_{n_3 n_4} \end{pmatrix} \\[2ex] \vdots \\[2ex] \begin{pmatrix} w^{L-1}_{11} & \cdots & w^{L-1}_{n_{L-1} 1} \\ \vdots & \ddots & \vdots \\ w^{L-1}_{1 n_L} & \cdots & w^{L-1}_{n_{L-1} n_L} \end{pmatrix} \\[2ex] \begin{pmatrix} b^1_1 \\ b^1_2 \\ \vdots \\ b^1_{n_2} \end{pmatrix} \\[2ex] \begin{pmatrix} b^2_1 \\ b^2_2 \\ \vdots \\ b^2_{n_3} \end{pmatrix} \\[2ex] \begin{pmatrix} b^3_1 \\ b^3_2 \\ \vdots \\ b^3_{n_4} \end{pmatrix} \\[2ex] \vdots \\[2ex] \begin{pmatrix} b^{L-1}_1 \\ b^{L-1}_2 \\ \vdots \\ b^{L-1}_{n_L} \end{pmatrix} \end{bmatrix}$$
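A minimal sketch of this bookkeeping in NumPy: every weight matrix and every bias vector is flattened and concatenated into a single parameter vector θ. The layer sizes n₁ … n₄ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [4, 3, 3, 2]        # n_1, n_2, n_3, n_4

# Following the layout above: W^i is written with n_{i+1} rows and n_i columns,
# and B^i has n_{i+1} entries.
Ws = [rng.normal(size=(n_out, n_in))
      for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
Bs = [np.zeros(n_out) for n_out in layer_sizes[1:]]

# theta stacks every W^i followed by every B^i, as in the definition above
theta = np.concatenate([W.ravel() for W in Ws] + [B.ravel() for B in Bs])
print(theta.shape)                # (4*3 + 3*3 + 3*2) + (3 + 3 + 2) = (35,)
```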
Here
$$\nabla_\theta = \begin{bmatrix} \dfrac{\partial L(\theta)}{\partial \begin{bmatrix} W^1 \\ W^2 \\ W^3 \\ \vdots \\ W^{L-1} \end{bmatrix}} \\[8ex] \dfrac{\partial L(\theta)}{\partial \begin{bmatrix} B^1 \\ B^2 \\ B^3 \\ \vdots \\ B^{L-1} \end{bmatrix}} \end{bmatrix} \qquad \ldots (1)$$
And
$$\frac{\partial L(\theta)}{\partial \begin{bmatrix} W^1 \\ W^2 \\ W^3 \\ \vdots \\ W^{L-1} \end{bmatrix}} = \begin{bmatrix} \dfrac{\partial L(\theta)}{\partial W^1} \\[1ex] \dfrac{\partial L(\theta)}{\partial W^2} \\[1ex] \dfrac{\partial L(\theta)}{\partial W^3} \\[1ex] \dfrac{\partial L(\theta)}{\partial W^4} \\[1ex] \vdots \\[1ex] \dfrac{\partial L(\theta)}{\partial W^{L-1}} \end{bmatrix} \qquad \ldots (2)$$
Solving for each element in the vector:
$$\frac{\partial L(\theta)}{\partial \begin{bmatrix} W^1 \\ W^2 \\ W^3 \\ \vdots \\ W^{L-1} \end{bmatrix}} = \begin{bmatrix} \begin{bmatrix} \dfrac{\partial L(\theta)}{\partial w^1_{11}} & \dfrac{\partial L(\theta)}{\partial w^1_{21}} & \cdots & \dfrac{\partial L(\theta)}{\partial w^1_{n_1 1}} \\[1ex] \dfrac{\partial L(\theta)}{\partial w^1_{12}} & \dfrac{\partial L(\theta)}{\partial w^1_{22}} & \cdots & \dfrac{\partial L(\theta)}{\partial w^1_{n_1 2}} \\[1ex] \vdots & \vdots & \ddots & \vdots \\[1ex] \dfrac{\partial L(\theta)}{\partial w^1_{1 n_2}} & \dfrac{\partial L(\theta)}{\partial w^1_{2 n_2}} & \cdots & \dfrac{\partial L(\theta)}{\partial w^1_{n_1 n_2}} \end{bmatrix} \\[6ex] \begin{bmatrix} \dfrac{\partial L(\theta)}{\partial w^2_{11}} & \dfrac{\partial L(\theta)}{\partial w^2_{21}} & \cdots & \dfrac{\partial L(\theta)}{\partial w^2_{n_2 1}} \\[1ex] \dfrac{\partial L(\theta)}{\partial w^2_{12}} & \dfrac{\partial L(\theta)}{\partial w^2_{22}} & \cdots & \dfrac{\partial L(\theta)}{\partial w^2_{n_2 2}} \\[1ex] \vdots & \vdots & \ddots & \vdots \\[1ex] \dfrac{\partial L(\theta)}{\partial w^2_{1 n_3}} & \dfrac{\partial L(\theta)}{\partial w^2_{2 n_3}} & \cdots & \dfrac{\partial L(\theta)}{\partial w^2_{n_2 n_3}} \end{bmatrix} \\[6ex] \vdots \\[2ex] \begin{bmatrix} \dfrac{\partial L(\theta)}{\partial w^{L-1}_{11}} & \dfrac{\partial L(\theta)}{\partial w^{L-1}_{21}} & \cdots & \dfrac{\partial L(\theta)}{\partial w^{L-1}_{n_{L-1} 1}} \\[1ex] \dfrac{\partial L(\theta)}{\partial w^{L-1}_{12}} & \dfrac{\partial L(\theta)}{\partial w^{L-1}_{22}} & \cdots & \dfrac{\partial L(\theta)}{\partial w^{L-1}_{n_{L-1} 2}} \\[1ex] \vdots & \vdots & \ddots & \vdots \\[1ex] \dfrac{\partial L(\theta)}{\partial w^{L-1}_{1 n_L}} & \dfrac{\partial L(\theta)}{\partial w^{L-1}_{2 n_L}} & \cdots & \dfrac{\partial L(\theta)}{\partial w^{L-1}_{n_{L-1} n_L}} \end{bmatrix} \end{bmatrix} \qquad \ldots (2)$$
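The expansion above says, in effect, that $\partial L(\theta)/\partial W^i$ contains one partial derivative per entry of $W^i$, so each gradient block has exactly the same shape as the weight matrix it corresponds to (the bias blocks below behave the same way). A small sketch of that shape bookkeeping, with placeholder gradients standing in for what backpropagation would produce; the layer sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
layer_sizes = [4, 3, 3, 2]        # n_1, n_2, n_3, n_4

Ws = [rng.normal(size=(n_out, n_in))
      for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

# Placeholder gradients: in a real network these come from backpropagation.
grad_Ws = [np.ones_like(W) for W in Ws]

for i, (W, dW) in enumerate(zip(Ws, grad_Ws), start=1):
    # dL/dW^i must have exactly one entry per weight in W^i
    assert dW.shape == W.shape
    print(f"W^{i}: {W.shape}, dL/dW^{i}: {dW.shape}")
```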
[[ ] ]
Now ,
∂ L (θ)
1
∂B
∂ L (θ)
∂ L(θ ) 2
= ∂B
B1 … …(6)
B
2 …
∂ B3 ∂ L (θ)
… ∂ B L−1
L−1
B
$$\frac{\partial L(\theta)}{\partial B^1} = \frac{\partial L(\theta)}{\partial \begin{pmatrix} b^1_1 \\ b^1_2 \\ \vdots \\ b^1_{n_2} \end{pmatrix}} = \begin{bmatrix} \dfrac{\partial L(\theta)}{\partial b^1_1} \\[1ex] \dfrac{\partial L(\theta)}{\partial b^1_2} \\[1ex] \vdots \\[1ex] \dfrac{\partial L(\theta)}{\partial b^1_{n_2}} \end{bmatrix} \qquad \ldots (7)$$
$$\frac{\partial L(\theta)}{\partial B^2} = \frac{\partial L(\theta)}{\partial \begin{pmatrix} b^2_1 \\ b^2_2 \\ \vdots \\ b^2_{n_3} \end{pmatrix}} = \begin{bmatrix} \dfrac{\partial L(\theta)}{\partial b^2_1} \\[1ex] \dfrac{\partial L(\theta)}{\partial b^2_2} \\[1ex] \vdots \\[1ex] \dfrac{\partial L(\theta)}{\partial b^2_{n_3}} \end{bmatrix} \qquad \ldots (8)$$
$$\frac{\partial L(\theta)}{\partial B^{L-1}} = \frac{\partial L(\theta)}{\partial \begin{pmatrix} b^{L-1}_1 \\ b^{L-1}_2 \\ \vdots \\ b^{L-1}_{n_L} \end{pmatrix}} = \begin{bmatrix} \dfrac{\partial L(\theta)}{\partial b^{L-1}_1} \\[1ex] \dfrac{\partial L(\theta)}{\partial b^{L-1}_2} \\[1ex] \vdots \\[1ex] \dfrac{\partial L(\theta)}{\partial b^{L-1}_{n_L}} \end{bmatrix} \qquad \ldots (9)$$
$$\frac{\partial L(\theta)}{\partial \begin{bmatrix} B^1 \\ B^2 \\ B^3 \\ \vdots \\ B^{L-1} \end{bmatrix}} = \begin{bmatrix} \dfrac{\partial L(\theta)}{\partial b^1_1} \\[1ex] \vdots \\[1ex] \dfrac{\partial L(\theta)}{\partial b^1_{n_2}} \\[1ex] \dfrac{\partial L(\theta)}{\partial b^2_1} \\[1ex] \dfrac{\partial L(\theta)}{\partial b^2_2} \\[1ex] \vdots \\[1ex] \dfrac{\partial L(\theta)}{\partial b^2_{n_3}} \\[1ex] \vdots \\[1ex] \dfrac{\partial L(\theta)}{\partial b^{L-1}_1} \\[1ex] \dfrac{\partial L(\theta)}{\partial b^{L-1}_2} \\[1ex] \vdots \\[1ex] \dfrac{\partial L(\theta)}{\partial b^{L-1}_{n_L}} \end{bmatrix} \qquad \ldots (6)$$
Therefore, putting the values of (2) and (6) into (1), (1) becomes:
$$\nabla_\theta = \frac{\partial L(\theta)}{\partial \begin{bmatrix} W^1 \\ W^2 \\ W^3 \\ \vdots \\ W^{L-1} \\ B^1 \\ B^2 \\ B^3 \\ \vdots \\ B^{L-1} \end{bmatrix}} = \begin{bmatrix} \begin{bmatrix} \dfrac{\partial L(\theta)}{\partial w^1_{11}} & \dfrac{\partial L(\theta)}{\partial w^1_{21}} & \cdots & \dfrac{\partial L(\theta)}{\partial w^1_{n_1 1}} \\[1ex] \dfrac{\partial L(\theta)}{\partial w^1_{12}} & \dfrac{\partial L(\theta)}{\partial w^1_{22}} & \cdots & \dfrac{\partial L(\theta)}{\partial w^1_{n_1 2}} \\[1ex] \vdots & \vdots & \ddots & \vdots \\[1ex] \dfrac{\partial L(\theta)}{\partial w^1_{1 n_2}} & \dfrac{\partial L(\theta)}{\partial w^1_{2 n_2}} & \cdots & \dfrac{\partial L(\theta)}{\partial w^1_{n_1 n_2}} \end{bmatrix} \\[6ex] \begin{bmatrix} \dfrac{\partial L(\theta)}{\partial w^2_{11}} & \dfrac{\partial L(\theta)}{\partial w^2_{21}} & \cdots & \dfrac{\partial L(\theta)}{\partial w^2_{n_2 1}} \\[1ex] \dfrac{\partial L(\theta)}{\partial w^2_{12}} & \dfrac{\partial L(\theta)}{\partial w^2_{22}} & \cdots & \dfrac{\partial L(\theta)}{\partial w^2_{n_2 2}} \\[1ex] \vdots & \vdots & \ddots & \vdots \\[1ex] \dfrac{\partial L(\theta)}{\partial w^2_{1 n_3}} & \dfrac{\partial L(\theta)}{\partial w^2_{2 n_3}} & \cdots & \dfrac{\partial L(\theta)}{\partial w^2_{n_2 n_3}} \end{bmatrix} \\[6ex] \vdots \\[2ex] \begin{bmatrix} \dfrac{\partial L(\theta)}{\partial w^{L-1}_{11}} & \dfrac{\partial L(\theta)}{\partial w^{L-1}_{21}} & \cdots & \dfrac{\partial L(\theta)}{\partial w^{L-1}_{n_{L-1} 1}} \\[1ex] \dfrac{\partial L(\theta)}{\partial w^{L-1}_{12}} & \dfrac{\partial L(\theta)}{\partial w^{L-1}_{22}} & \cdots & \dfrac{\partial L(\theta)}{\partial w^{L-1}_{n_{L-1} 2}} \\[1ex] \vdots & \vdots & \ddots & \vdots \\[1ex] \dfrac{\partial L(\theta)}{\partial w^{L-1}_{1 n_L}} & \dfrac{\partial L(\theta)}{\partial w^{L-1}_{2 n_L}} & \cdots & \dfrac{\partial L(\theta)}{\partial w^{L-1}_{n_{L-1} n_L}} \end{bmatrix} \\[6ex] \dfrac{\partial L(\theta)}{\partial b^1_1} \\[1ex] \dfrac{\partial L(\theta)}{\partial b^1_2} \\[1ex] \vdots \\[1ex] \dfrac{\partial L(\theta)}{\partial b^1_{n_2}} \\[1ex] \dfrac{\partial L(\theta)}{\partial b^2_1} \\[1ex] \dfrac{\partial L(\theta)}{\partial b^2_2} \\[1ex] \vdots \\[1ex] \dfrac{\partial L(\theta)}{\partial b^2_{n_3}} \\[1ex] \vdots \\[1ex] \dfrac{\partial L(\theta)}{\partial b^{L-1}_1} \\[1ex] \vdots \\[1ex] \dfrac{\partial L(\theta)}{\partial b^{L-1}_{n_L}} \end{bmatrix} \qquad \ldots (1)$$
This is our final form of $\nabla_\theta$. Using it, we can update each weight and bias according to the gradient of the loss function with respect to that weight and bias.
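A minimal sketch of that update step, keeping the gradient blocks in the same per-layer arrangement as $\nabla_\theta$ above. The gradient values here are placeholders standing in for whatever backpropagation would return, and the layer sizes and learning rate are illustrative assumptions.

```python
import numpy as np

def sgd_update(Ws, Bs, grad_Ws, grad_Bs, eta=0.1):
    # Each parameter block moves against its own gradient block:
    # W^i <- W^i - eta * dL/dW^i,  B^i <- B^i - eta * dL/dB^i
    new_Ws = [W - eta * dW for W, dW in zip(Ws, grad_Ws)]
    new_Bs = [B - eta * dB for B, dB in zip(Bs, grad_Bs)]
    return new_Ws, new_Bs

rng = np.random.default_rng(2)
layer_sizes = [4, 3, 3, 2]
Ws = [rng.normal(size=(n_out, n_in))
      for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
Bs = [np.zeros(n_out) for n_out in layer_sizes[1:]]

# Placeholder gradients with the same shapes as the parameters
grad_Ws = [np.full_like(W, 0.01) for W in Ws]
grad_Bs = [np.full_like(B, 0.01) for B in Bs]

Ws, Bs = sgd_update(Ws, Bs, grad_Ws, grad_Bs)
print([W.shape for W in Ws], [B.shape for B in Bs])
```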