NNDL Cheatsheet
1 Matrix based notation

The activation $z^l_j$ of the $j$-th neuron of the $l$-th layer is

$$ z^l_j = \sigma\!\left( \sum_k w^l_{jk} z^{l-1}_k + b^l_j \right). $$

Now take $W^l$ as the matrix

$$ W^l = \begin{pmatrix} w^l_{00} & w^l_{01} & w^l_{02} & \cdots \\ w^l_{10} & w^l_{11} & w^l_{12} & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{pmatrix} $$

and we define

$$ a^l := W^l z^{l-1} + b^l $$

so that $z^l = \sigma(a^l)$.

Hadamard product: the element-wise product of matrices, e.g.

$$ \begin{pmatrix} 1 \\ 3 \end{pmatrix} \odot \begin{pmatrix} 2 \\ 4 \end{pmatrix} = \begin{pmatrix} 1 \times 2 \\ 3 \times 4 \end{pmatrix} = \begin{pmatrix} 2 \\ 12 \end{pmatrix}. $$
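A minimal numpy sketch of this notation, assuming illustrative layer sizes and a logistic `sigmoid` for $\sigma$ (none of these choices come from the cheatsheet): it computes $a^l = W^l z^{l-1} + b^l$, $z^l = \sigma(a^l)$, and uses `*` for the Hadamard product.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)

# Illustrative layer sizes: 3 inputs -> 4 neurons in layer l.
W = rng.standard_normal((4, 3))   # W^l, one row per neuron j
b = rng.standard_normal(4)        # b^l
z_prev = rng.standard_normal(3)   # z^{l-1}

a = W @ z_prev + b                # a^l = W^l z^{l-1} + b^l
z = sigmoid(a)                    # z^l = sigma(a^l)

# Hadamard (element-wise) product is numpy's * operator.
print(np.array([1, 3]) * np.array([2, 4]))   # -> [2 12], i.e. [1*2, 3*4]
```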
2 Cost function

It must be:

• expressed as a mean over the single inputs;
• a function of the outputs of the network.

Example: quadratic cost function

$$ C = \frac{1}{2n} \sum_x \left\| y(x) - z^L(x) \right\|^2. $$
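As a quick sanity check, a possible implementation of the quadratic cost over a toy batch (shapes and values are made up):

```python
import numpy as np

def quadratic_cost(Y, Z):
    """C = 1/(2n) * sum_x ||y(x) - z^L(x)||^2, one row per training input."""
    n = Y.shape[0]
    return np.sum((Y - Z) ** 2) / (2 * n)

# Toy batch of n=2 targets and network outputs (illustrative values).
Y = np.array([[1.0, 0.0], [0.0, 1.0]])
Z = np.array([[0.8, 0.1], [0.2, 0.7]])
print(quadratic_cost(Y, Z))
```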
3 The Four Fundamental Equations

Define $\delta^l_j$ as the error at level $l$ of neuron $j$:

$$ \delta^l_j := \frac{\partial C}{\partial a^l_j}. $$

3.1 BP1

$$ \delta^L_j = \frac{\partial C}{\partial z^L_j} \cdot \sigma'(a^L_j) $$

⇓

$$ \delta^L = \nabla_z C \odot \sigma'(a^L). $$

3.2 BP2

$$ \delta^l_j = \sum_k w^{l+1}_{kj} \delta^{l+1}_k \, \sigma'(a^l_j) $$

⇓

$$ \delta^l = \left( (W^{l+1})^T \delta^{l+1} \right) \odot \sigma'(a^l). $$

3.3 BP3

$$ \frac{\partial C}{\partial b^l_j} = \delta^l_j. $$

3.4 BP4

$$ \frac{\partial C}{\partial w^l_{jk}} = z^{l-1}_k \, \delta^l_j = z_{\text{in}} \, \delta_{\text{out}}. $$

Proof 3.1: BP1

$$ \delta^L_j = \frac{\partial C}{\partial a^L_j} = \sum_k \frac{\partial C}{\partial z^L_k} \frac{\partial z^L_k}{\partial a^L_j} = \frac{\partial C}{\partial z^L_j} \frac{\partial z^L_j}{\partial a^L_j} = \frac{\partial C}{\partial z^L_j} \, \sigma'(a^L_j), $$

where the sum collapses because $z^L_k = \sigma(a^L_k)$ depends only on $a^L_k$.

Proof 3.2: BP2

Here we must show that

$$ \delta^l_j := \frac{\partial C}{\partial a^l_j} = \left[ (W^{l+1})^T \delta^{l+1} \odot \sigma'(a^l) \right]_j = \sum_k w^{l+1}_{kj} \delta^{l+1}_k \cdot \sigma'(a^l_j). $$

By the chain rule,

$$ \delta^l_j := \frac{\partial C}{\partial a^l_j} = \sum_k \underbrace{\frac{\partial C}{\partial a^{l+1}_k}}_{=\,\delta^{l+1}_k \text{ by def.}} \frac{\partial a^{l+1}_k}{\partial a^l_j} = \sum_k \delta^{l+1}_k \frac{\partial a^{l+1}_k}{\partial a^l_j}. $$

But we know that

$$ a^{l+1}_k = \sum_j w^{l+1}_{kj} z^l_j + b^{l+1}_k = \sum_j w^{l+1}_{kj} \sigma(a^l_j) + b^{l+1}_k, $$

so we have that

$$ \frac{\partial a^{l+1}_k}{\partial a^l_j} = w^{l+1}_{kj} \, \sigma'(a^l_j). $$

So, putting it all together, we get

$$ \delta^l_j = \sum_k w^{l+1}_{kj} \delta^{l+1}_k \, \sigma'(a^l_j). $$

Proof 3.3: BP3

We must show that $\frac{\partial C}{\partial b^l_j} = \delta^l_j$. Think of $C$ as a function of $a^l_j$ and use the chain rule:

$$ \frac{\partial C}{\partial b^l_j} = \sum_k \frac{\partial C}{\partial a^l_k} \frac{\partial a^l_k}{\partial b^l_j} = \frac{\partial C}{\partial a^l_j} \underbrace{\frac{\partial a^l_j}{\partial b^l_j}}_{=\,1} = \delta^l_j. $$
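Below is a sketch of how BP1–BP4 translate to numpy for a fully connected network with quadratic cost; the 2-3-1 architecture, random initialization, and sigmoid activation are assumptions made for the example, not prescribed by the cheatsheet.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_prime(a):
    s = sigmoid(a)
    return s * (1.0 - s)

def backprop(x, y, Ws, bs):
    """Gradients (dC/dW, dC/db) for one input, quadratic cost C = 1/2 ||y - z^L||^2."""
    # Forward pass: store a^l and z^l for every layer.
    zs, activations = [x], []
    for W, b in zip(Ws, bs):
        a = W @ zs[-1] + b
        activations.append(a)
        zs.append(sigmoid(a))
    # BP1: delta^L = grad_z C (.) sigma'(a^L); for the quadratic cost grad_z C = z^L - y.
    delta = (zs[-1] - y) * sigmoid_prime(activations[-1])
    dWs, dbs = [None] * len(Ws), [None] * len(bs)
    for l in range(len(Ws) - 1, -1, -1):
        dbs[l] = delta                       # BP3: dC/db^l = delta^l
        dWs[l] = np.outer(delta, zs[l])      # BP4: dC/dw^l_{jk} = z^{l-1}_k delta^l_j
        if l > 0:
            # BP2: delta^l = (W^{l+1})^T delta^{l+1} (.) sigma'(a^l)
            delta = (Ws[l].T @ delta) * sigmoid_prime(activations[l - 1])
    return dWs, dbs

# Tiny 2-3-1 network with random weights (illustrative only).
rng = np.random.default_rng(1)
Ws = [rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
bs = [rng.standard_normal(3), rng.standard_normal(1)]
dWs, dbs = backprop(np.array([0.5, -0.2]), np.array([1.0]), Ws, bs)
print([g.shape for g in dWs])   # [(3, 2), (1, 3)]
```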
4 Cross-entropy cost

For a single sigmoid neuron the cross-entropy cost yields the gradients

$$ \frac{\partial C}{\partial w_j} = \frac{1}{n} \sum_x x_j \,(\sigma(a) - y), \qquad \frac{\partial C}{\partial b} = \frac{1}{n} \sum_x (\sigma(a) - y). $$

We can generalize to multi-layer networks:

$$ C = -\frac{1}{n} \sum_x \sum_j \left[ y_j \ln z^L_j + (1 - y_j) \ln(1 - z^L_j) \right]. $$

Softmax activation with log-likelihood cost function:

$$ z^L_j = \frac{e^{a^L_j}}{\sum_k e^{a^L_k}}, \qquad C = -\ln z^L_y. $$
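A small numerical check of the softmax / log-likelihood pair (the logits and target class index are made-up values; the max-subtraction trick is only for numerical stability):

```python
import numpy as np

def softmax(a):
    # Subtracting the max does not change the result but avoids overflow.
    e = np.exp(a - np.max(a))
    return e / e.sum()

def log_likelihood_cost(a, y):
    """C = -ln z^L_y, where y is the index of the correct class."""
    return -np.log(softmax(a)[y])

a = np.array([2.0, 1.0, 0.1])   # illustrative output activations a^L
print(softmax(a), log_likelihood_cost(a, y=0))
```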
5 Convergence

Consider the quadratic approximation of the error function around the minimum point $w^\star$:

$$ E(w) = E(w^\star) + \nabla E(w^\star)^T (w - w^\star) + \frac{1}{2} (w - w^\star)^T H (w - w^\star). $$

Since $w^\star$ is a minimum, $\nabla E(w^\star) = 0$, so

$$ E(w) = E(w^\star) + \frac{1}{2} (w - w^\star)^T H (w - w^\star). $$

Since $\{u_i\}_i$ (the eigenvectors of $H$, with eigenvalues $\lambda_i$) is an orthonormal basis, we can write any vector as a linear combination of the $u_i$, which allows us to write

$$ E(w) = E(w^\star) + \frac{1}{2} \sum_i \lambda_i \alpha_i^2. $$

To compute $\nabla \alpha_i$ we use the fact that $w - w^\star = \sum_j \alpha_j u_j$:

$$ u_i^T (w - w^\star) = u_i^T \Big( \sum_j \alpha_j u_j \Big) = \alpha_i, \qquad \sum_j w_j u_{ij} - \sum_j w^\star_j u_{ij} = \alpha_i, $$

so

$$ \frac{\partial}{\partial w_k} \Big( \sum_j w_j u_{ij} - \sum_j w^\star_j u_{ij} \Big) = u_{ik}, \qquad \frac{\partial \alpha_i}{\partial w_k} = u_{ik} \;\Longrightarrow\; \nabla \alpha_i = u_i. $$

We have that:

• $\delta w = \sum_i \delta\alpha_i \, u_i$;
• $\nabla E = \sum_i \alpha_i \lambda_i u_i$.
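A numerical sanity check of this decomposition for a purely quadratic error; the Hessian `H`, the minimum `w_star`, and the test point are arbitrary choices for the example:

```python
import numpy as np

rng = np.random.default_rng(2)

# A symmetric positive-definite Hessian and a minimum point (illustrative).
A = rng.standard_normal((3, 3))
H = A @ A.T + 3 * np.eye(3)
w_star = rng.standard_normal(3)

def E(w):
    # Quadratic error with minimum at w_star and E(w_star) = 0.
    d = w - w_star
    return 0.5 * d @ H @ d

lam, U = np.linalg.eigh(H)        # columns of U are the eigenvectors u_i

w = rng.standard_normal(3)
alpha = U.T @ (w - w_star)        # alpha_i = u_i^T (w - w_star)

print(E(w), 0.5 * np.sum(lam * alpha**2))   # the two values coincide
```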
Normally we have

$$ w^{(\tau)} = w^{(\tau-1)} + \Delta w^{(\tau-1)}. $$

7 Learning rate scheduling

• Linear: $\eta^{(\tau)} = \left(1 - \frac{\tau}{K}\right) \eta_0 + \frac{\tau}{K} \eta_K$.

• Power law: $\eta^{(\tau)} = \eta_0 \left(1 + \frac{\tau}{s}\right)^c$.

• Exponential decay: $\eta^{(\tau)} = \eta_0 \, c^{\tau/s}$.
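A sketch of the three schedules; the constants $\eta_0$, $\eta_K$, $K$, $s$, $c$ below are illustrative, and the sign convention for $c$ in the power law (negative so the rate decays) is an assumption.

```python
import numpy as np

def eta_linear(tau, eta0, etaK, K):
    """Linear: interpolate from eta0 to etaK over K steps."""
    return (1 - tau / K) * eta0 + (tau / K) * etaK

def eta_power(tau, eta0, s, c):
    """Power law: eta0 * (1 + tau/s)^c."""
    return eta0 * (1 + tau / s) ** c

def eta_exp(tau, eta0, s, c):
    """Exponential decay: eta0 * c^(tau/s), with 0 < c < 1."""
    return eta0 * c ** (tau / s)

for tau in (0, 50, 100):
    print(eta_linear(tau, 0.1, 0.001, 100),
          eta_power(tau, 0.1, 25, -0.5),
          eta_exp(tau, 0.1, 25, 0.5))
```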
For batch normalization we standardize each dimension of the input,

$$ \tilde{x}_{ni} = \frac{x_{ni} - \mu_i}{\sigma_i} $$

for each dimension $i$. After training we use a moving average of the mean and variance:

$$ \mu_i^{(\tau)} = \alpha \mu_i^{(\tau-1)} + (1 - \alpha)\mu_i, \qquad \sigma_i^{(\tau)} = \alpha \sigma_i^{(\tau-1)} + (1 - \alpha)\sigma_i, \qquad 0 \leqslant \alpha \leqslant 1. $$
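A possible implementation of the normalization step with running statistics; the momentum $\alpha = 0.9$ and the small `eps` are assumed values, not from the cheatsheet.

```python
import numpy as np

def batchnorm_train(X, running_mu, running_sigma, alpha=0.9, eps=1e-8):
    """Standardize each column of X and update the running statistics."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    X_tilde = (X - mu) / (sigma + eps)
    # Moving averages, used in place of batch statistics at inference time.
    running_mu = alpha * running_mu + (1 - alpha) * mu
    running_sigma = alpha * running_sigma + (1 - alpha) * sigma
    return X_tilde, running_mu, running_sigma

rng = np.random.default_rng(3)
X = rng.normal(loc=5.0, scale=2.0, size=(32, 4))      # a toy mini-batch
mu0, sigma0 = np.zeros(4), np.ones(4)
X_tilde, mu1, sigma1 = batchnorm_train(X, mu0, sigma0)
print(X_tilde.mean(axis=0).round(3), X_tilde.std(axis=0).round(3))
```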
10 Autoencoders

Here $f$ is the encoder and $g$ the decoder; for a linear decoder, $g(h) = Dh$. We are interested in measuring the loss of the reconstruction

$$ L(x, g(f(x))). $$

Sparse autoencoders add a sparsity penalty $\Omega(h)$ on the code $h = f(x)$ to this loss. Marginalizing,

$$ p_{\text{model}}(x) = \sum_h p_{\text{model}}(h, x) $$

⇓

$$ \log p_{\text{model}}(x) = \log \sum_h p_{\text{model}}(h, x). $$

If we set $\Omega(h) = \lambda \sum_i |h_i|$ (the L1 norm of $h$), then minimizing the sparsity term is equal to maximizing the log likelihood of $p(h)$ assuming a Laplace prior over each component independently.

10.3 Contractive autoencoders

They minimize

$$ L(x, g(f(x))) + \Omega(h) \quad \text{with} \quad \Omega(h, x) = \lambda \sum_i \left\| \nabla_x h_i \right\|^2. $$
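A toy sketch of the two penalties for a linear encoder/decoder pair; the matrices `E` and `D`, the sizes, and $\lambda = 0.1$ are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(4)
D_in, D_h = 6, 3
E = rng.standard_normal((D_h, D_in))   # encoder weights, h = f(x) = E x
D = rng.standard_normal((D_in, D_h))   # decoder weights, g(h) = D h

def reconstruction_loss(x):
    h = E @ x
    return np.sum((x - D @ h) ** 2)           # L(x, g(f(x)))

def omega_sparse(x, lam=0.1):
    return lam * np.sum(np.abs(E @ x))        # lambda * sum_i |h_i|

def omega_contractive(x, lam=0.1):
    # grad_x h_i is the i-th row of E for a linear encoder (independent of x).
    return lam * np.sum(np.linalg.norm(E, axis=1) ** 2)

x = rng.standard_normal(D_in)
print(reconstruction_loss(x) + omega_sparse(x))       # sparse objective
print(reconstruction_loss(x) + omega_contractive(x))  # contractive objective
```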
11 Transformers

The simplest form of self-attention uses the inputs themselves:

$$ Y = \text{SoftMax}\left[ X X^T \right] X, $$

but with learned projections

$$ Q = X W^{(q)}, \qquad K = X W^{(k)}, \qquad V = X W^{(v)} $$

it becomes

$$ Y = \text{SoftMax}\left[ Q K^T \right] V. $$

Then the dot products are scaled by the square root of the key dimensionality $D_k$:

$$ Y = \text{Attention}(Q, K, V) = \text{SoftMax}\left( \frac{Q K^T}{\sqrt{D_k}} \right) V. $$
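A compact numpy version of scaled dot-product attention (the token count and widths are arbitrary choices for the example):

```python
import numpy as np

def softmax(A, axis=-1):
    e = np.exp(A - A.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: SoftMax(Q K^T / sqrt(D_k)) V."""
    Dk = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(Dk)) @ V

rng = np.random.default_rng(5)
N, D, Dk, Dv = 4, 8, 5, 6             # N tokens of width D (illustrative)
X = rng.standard_normal((N, D))
Wq, Wk, Wv = (rng.standard_normal((D, d)) for d in (Dk, Dk, Dv))
Y = attention(X @ Wq, X @ Wk, X @ Wv)
print(Y.shape)                        # (4, 6)
```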
In a multi-head scenario, where $H_h = \text{Attention}(Q_h, K_h, V_h)$, we have

$$ \underbrace{Y(X)}_{N \times D} = \underbrace{\text{Concat}\left[ H_1, \dots, H_H \right]}_{N \times H D_v} \; \underbrace{W^{(o)}}_{H D_v \times D}. $$

The output then goes through a residual connection and layer normalization, in either the post-norm or the pre-norm variant:

$$ Z = \text{LayerNorm}\left[ Y(X) + X \right] \qquad \text{or} \qquad Z = Y(\text{LayerNorm}(X)) + X, $$

and then through an MLP with ReLU activation:

$$ \tilde{X} = \text{LayerNorm}\left[ \text{MLP}(Z) + Z \right] \qquad \text{or} \qquad \tilde{X} = \text{MLP}\left[ \text{LayerNorm}(Z) \right] + Z. $$
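A minimal sketch of the multi-head combination followed by a pre-norm residual block; the head count, widths, and the two-layer ReLU MLP are assumptions made for the example.

```python
import numpy as np

def softmax(A, axis=-1):
    e = np.exp(A - A.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def layer_norm(X, eps=1e-6):
    mu = X.mean(axis=-1, keepdims=True)
    sd = X.std(axis=-1, keepdims=True)
    return (X - mu) / (sd + eps)

rng = np.random.default_rng(6)
N, D, H, Dk, Dv = 4, 8, 2, 4, 4       # illustrative sizes
Wq = [rng.standard_normal((D, Dk)) for _ in range(H)]
Wk = [rng.standard_normal((D, Dk)) for _ in range(H)]
Wv = [rng.standard_normal((D, Dv)) for _ in range(H)]
Wo = rng.standard_normal((H * Dv, D))
W1, W2 = rng.standard_normal((D, 4 * D)), rng.standard_normal((4 * D, D))

def block(X):
    # Pre-norm attention sub-layer: Z = Y(LayerNorm(X)) + X
    Xn = layer_norm(X)
    heads = [attention(Xn @ Wq[h], Xn @ Wk[h], Xn @ Wv[h]) for h in range(H)]
    Z = np.concatenate(heads, axis=-1) @ Wo + X
    # Pre-norm MLP sub-layer: X~ = MLP(LayerNorm(Z)) + Z
    return np.maximum(layer_norm(Z) @ W1, 0) @ W2 + Z

X = rng.standard_normal((N, D))
print(block(X).shape)                 # (4, 8)
```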
11.1 Positional encoding

We concatenate the input $x$ with a positional encoding $r$, obtaining the representation $x | r$. We can then apply a linear transformation $[w_x \; w_r]$:

$$ \begin{bmatrix} w_x & w_r \end{bmatrix} \begin{bmatrix} x \\ r \end{bmatrix} = w_x x + w_r r, $$

which reduces to $w(x + r)$ when the two blocks share the same weights $w_x = w_r = w$.

The encoding must be:

• unique for each position;
• bounded;

With a sinusoidal encoding, the addition formulas

$$ \sin(w_i (n + k)) = \sin(w_i n)\cos(w_i k) + \cos(w_i n)\sin(w_i k), $$
$$ \cos(w_i (n + k)) = \cos(w_i n)\cos(w_i k) - \sin(w_i n)\sin(w_i k) $$

mean that

$$ \begin{bmatrix} \sin(w_i (n + k)) \\ \cos(w_i (n + k)) \end{bmatrix} = \begin{bmatrix} \cos(w_i k) & \sin(w_i k) \\ -\sin(w_i k) & \cos(w_i k) \end{bmatrix} \begin{bmatrix} \sin(w_i n) \\ \cos(w_i n) \end{bmatrix}, $$

i.e. the encoding of position $n + k$ is a linear transformation (a rotation) of the encoding of position $n$.

11.2 GPTs

The goal is to use transformers to build an autoregressive model of the form

$$ p(x_1, \dots, x_N) = \prod_{n=1}^{N} p(x_n \mid x_1, \dots, x_{n-1}). $$

Here the attention weights are computed using $QK^T$ as before, but we set the attention weights to zero for all future tokens: the pre-softmax score $(QK^T)_{nm}$ between tokens $n$ and $m$ is combined with a mask matrix $M$ that has $-\infty$ in the upper triangular part,

$$ Y = \text{SoftMax}\left( \frac{Q K^T}{\sqrt{D_k}} + M \right) V. $$

Temperature scaling:

$$ y_i = \frac{\exp(a_i / T)}{\sum_j \exp(a_j / T)}. $$
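A small sketch combining the causal mask with temperature-scaled softmax (sizes and the temperature value are arbitrary):

```python
import numpy as np

def softmax(A, axis=-1, T=1.0):
    """Temperature-scaled softmax: exp(a_i/T) / sum_j exp(a_j/T)."""
    A = A / T
    e = np.exp(A - A.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(Q, K, V):
    N, Dk = Q.shape[0], K.shape[-1]
    # Mask with -inf strictly above the diagonal so future tokens get zero weight.
    M = np.triu(np.full((N, N), -np.inf), k=1)
    return softmax(Q @ K.T / np.sqrt(Dk) + M) @ V

rng = np.random.default_rng(7)
Q, K, V = (rng.standard_normal((5, 4)) for _ in range(3))
print(causal_attention(Q, K, V).shape)            # (5, 4)
print(softmax(np.array([2.0, 1.0, 0.1]), T=2.0))  # flatter than with T = 1
```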