
Deep Learning Cheatsheet
K Kotatsu
1 Matrix based notation

The activation $z_j^l$ of the $j$-th neuron of the $l$-th layer is

$$z_j^l = \sigma\left(\sum_k w_{jk}^l z_k^{l-1} + b_j^l\right).$$

Now take $\mathbf{W}^l$ as the matrix

$$\mathbf{W}^l = \begin{pmatrix} w_{00}^l & w_{01}^l & w_{02}^l & \cdots \\ w_{10}^l & w_{11}^l & w_{12}^l & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{pmatrix}.$$

In matrix notation we then write

$$\mathbf{z}^l = \sigma(\mathbf{W}^l \mathbf{z}^{l-1} + \mathbf{b}^l)$$

and we define

$$\mathbf{a}^l := \mathbf{W}^l \mathbf{z}^{l-1} + \mathbf{b}^l$$

so that $\mathbf{z}^l = \sigma(\mathbf{a}^l)$.

Hadamard product: the element-wise product of vectors or matrices, e.g.

$$\begin{pmatrix} 1 \\ 3 \end{pmatrix} \odot \begin{pmatrix} 2 \\ 4 \end{pmatrix} = \begin{pmatrix} 1 \times 2 \\ 3 \times 4 \end{pmatrix}.$$
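A minimal NumPy sketch of the forward pass $\mathbf{z}^l = \sigma(\mathbf{W}^l \mathbf{z}^{l-1} + \mathbf{b}^l)$ defined above; the layer sizes and the random initialisation are illustrative assumptions, not part of the cheatsheet.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, weights, biases):
    """Forward pass: a^l = W^l z^{l-1} + b^l, z^l = sigma(a^l).

    Returns all pre-activations a^l and activations z^l, which is
    exactly what backpropagation needs later on.
    """
    a_list, z_list = [], [x]   # z^0 is the input
    z = x
    for W, b in zip(weights, biases):
        a = W @ z + b
        z = sigmoid(a)
        a_list.append(a)
        z_list.append(z)
    return a_list, z_list

# Illustrative 2-3-1 network with random parameters (assumed sizes).
rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
biases = [rng.standard_normal(3), rng.standard_normal(1)]
a_list, z_list = forward(np.array([0.5, -0.2]), weights, biases)
print(z_list[-1])   # network output z^L
```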
2 Cost function

It must be:

• expressed as a mean over the single inputs;

• a function of the outputs of the network.

Example: quadratic cost function

$$C = \frac{1}{2n} \sum_x \left\| \mathbf{y}(x) - \mathbf{z}^L(x) \right\|^2.$$
3 The Four Fundamental Equations

Define $\delta_j^l$ as the error at level $l$ of neuron $j$:

$$\delta_j^l = \frac{\partial C}{\partial a_j^l}.$$

3.1 BP1

$$\delta_j^L = \frac{\partial C}{\partial z_j^L}\,\sigma'(a_j^L) \quad\Longrightarrow\quad \boldsymbol{\delta}^L = \nabla_{\mathbf{z}} C \odot \sigma'(\mathbf{a}^L).$$

3.2 BP2

$$\delta_j^l = \sum_k w_{kj}^{l+1} \delta_k^{l+1}\,\sigma'(a_j^l) \quad\Longrightarrow\quad \boldsymbol{\delta}^l = \left((\mathbf{W}^{l+1})^T \boldsymbol{\delta}^{l+1}\right) \odot \sigma'(\mathbf{a}^l).$$

3.3 BP3

$$\frac{\partial C}{\partial b_j^l} = \delta_j^l.$$

3.4 BP4

$$\frac{\partial C}{\partial w_{jk}^l} = z_k^{l-1} \delta_j^l = z_{\text{in}}\,\delta_{\text{out}}.$$

Proof 3.1: BP1

Show that $\delta_j^L := \frac{\partial C}{\partial a_j^L} = \frac{\partial C}{\partial z_j^L}\,\sigma'(a_j^L)$. Use the chain rule:

$$\delta_j^L = \frac{\partial C}{\partial a_j^L} = \sum_k \frac{\partial C}{\partial z_k^L}\frac{\partial z_k^L}{\partial a_j^L} = \frac{\partial C}{\partial z_j^L}\frac{\partial z_j^L}{\partial a_j^L} = \frac{\partial C}{\partial z_j^L}\,\sigma'(a_j^L),$$

where the sum collapses because $z_k^L = \sigma(a_k^L)$ depends only on $a_k^L$, so $\partial z_k^L / \partial a_j^L = 0$ for $k \neq j$.

Proof 3.2: BP2

Here we must show that

$$\delta_j^l := \frac{\partial C}{\partial a_j^l} = \left[(\mathbf{W}^{l+1})^T \boldsymbol{\delta}^{l+1} \odot \sigma'(\mathbf{a}^l)\right]_j = \sum_k w_{kj}^{l+1} \delta_k^{l+1}\,\sigma'(a_j^l).$$

Start from the fact that we can think of $C$ as a function of $\mathbf{a}^{l+1}$, so we can use the chain rule:

$$\delta_j^l := \frac{\partial C}{\partial a_j^l} = \sum_k \underbrace{\frac{\partial C}{\partial a_k^{l+1}}}_{=\,\delta_k^{l+1}\text{ by def.}} \frac{\partial a_k^{l+1}}{\partial a_j^l} = \sum_k \delta_k^{l+1}\frac{\partial a_k^{l+1}}{\partial a_j^l}.$$

But we know that

$$a_k^{l+1} = \sum_j w_{kj}^{l+1} z_j^l + b_k^{l+1} = \sum_j w_{kj}^{l+1} \sigma(a_j^l) + b_k^{l+1},$$

so we have that

$$\frac{\partial a_k^{l+1}}{\partial a_j^l} = w_{kj}^{l+1}\,\sigma'(a_j^l).$$

Putting it all together we get

$$\delta_j^l = \sum_k w_{kj}^{l+1} \delta_k^{l+1}\,\sigma'(a_j^l).$$



Proof 3.3: BP3

We must show that $\frac{\partial C}{\partial b_j^l} = \delta_j^l$. Think of $C$ as a function of $a_j^l$ and use the chain rule:

$$\frac{\partial C}{\partial b_j^l} = \sum_k \frac{\partial C}{\partial a_k^l}\frac{\partial a_k^l}{\partial b_j^l} = \underbrace{\frac{\partial C}{\partial a_j^l}}_{=\,\delta_j^l}\,\underbrace{\frac{\partial a_j^l}{\partial b_j^l}}_{=\,1} = \delta_j^l.$$

Proof 3.4: BP4

We must show that $\frac{\partial C}{\partial w_{jk}^l} = z_k^{l-1}\delta_j^l$. Use the chain rule:

$$\frac{\partial C}{\partial w_{jk}^l} = \sum_i \frac{\partial C}{\partial a_i^l}\frac{\partial a_i^l}{\partial w_{jk}^l} = \frac{\partial C}{\partial a_j^l}\frac{\partial a_j^l}{\partial w_{jk}^l} = \delta_j^l\,\frac{\partial a_j^l}{\partial w_{jk}^l},$$

but we know that

$$\frac{\partial a_j^l}{\partial w_{jk}^l} = \frac{\partial}{\partial w_{jk}^l}\left(\sum_k w_{jk}^l z_k^{l-1} + b_j^l\right) = z_k^{l-1}.$$

So

$$\frac{\partial C}{\partial w_{jk}^l} = z_k^{l-1}\delta_j^l.$$
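A compact NumPy sketch of BP1 to BP4 for a sigmoid network with quadratic cost on a single input; layer sizes, parameters and data are illustrative assumptions, not the cheatsheet's.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_prime(a):
    s = sigmoid(a)
    return s * (1.0 - s)

def backprop(x, y, weights, biases):
    """Return dC/dW^l and dC/db^l for one input, using BP1-BP4."""
    # Forward pass, storing a^l and z^l.
    a_list, z_list = [], [x]
    z = x
    for W, b in zip(weights, biases):
        a = W @ z + b
        z = sigmoid(a)
        a_list.append(a)
        z_list.append(z)
    # BP1: delta^L = grad_z C (.) sigma'(a^L); for quadratic cost grad_z C = z^L - y.
    delta = (z_list[-1] - y) * sigmoid_prime(a_list[-1])
    grads_W = [None] * len(weights)
    grads_b = [None] * len(weights)
    grads_b[-1] = delta                       # BP3 at the output layer
    grads_W[-1] = np.outer(delta, z_list[-2]) # BP4 at the output layer
    # BP2 propagates the error backwards, then BP3/BP4 again per layer.
    for l in range(len(weights) - 2, -1, -1):
        delta = (weights[l + 1].T @ delta) * sigmoid_prime(a_list[l])
        grads_b[l] = delta                        # BP3
        grads_W[l] = np.outer(delta, z_list[l])   # BP4: delta_j * z_k^{l-1}
    return grads_W, grads_b

# Tiny 2-3-1 example with assumed random parameters.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
biases = [rng.standard_normal(3), rng.standard_normal(1)]
gW, gb = backprop(np.array([0.5, -0.2]), np.array([1.0]), weights, biases)
print([g.shape for g in gW])
```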
4 Improving learning

Cross-entropy cost function:

$$C = -\frac{1}{n}\sum_x \left[ y \ln z + (1 - y)\ln(1 - z) \right].$$

This yields:

$$\frac{\partial C}{\partial w_j} = \frac{1}{n}\sum_x x_j\,(\sigma(a) - y), \qquad \frac{\partial C}{\partial b} = \frac{1}{n}\sum_x (\sigma(a) - y).$$

We can generalize for multi-layer networks:

$$C = -\frac{1}{n}\sum_x \sum_j \left[ y_j \ln z_j^L + (1 - y_j)\ln(1 - z_j^L) \right].$$

Softmax activation with log-likelihood cost function:

$$z_j^L = \frac{e^{a_j^L}}{\sum_k e^{a_k^L}}, \qquad C = -\ln z_y^L.$$
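A small NumPy sketch contrasting the two output layers above: sigmoid with binary cross-entropy and softmax with log-likelihood. In both cases the gradient with respect to the pre-activations reduces to prediction minus target, which is why these pairings avoid the learning slowdown of the quadratic cost. The toy values are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max())          # shift for numerical stability
    return e / e.sum()

# Sigmoid output + binary cross-entropy (single output, toy numbers).
a, y = 0.3, 1.0
z = sigmoid(a)
C_bce = -(y * np.log(z) + (1 - y) * np.log(1 - z))
dC_da = z - y                        # the sigma'(a) factor cancels

# Softmax output + log-likelihood cost C = -ln z_y.
a_vec = np.array([1.0, 2.0, 0.5])
y_idx = 1                            # index of the true class
z_vec = softmax(a_vec)
C_ll = -np.log(z_vec[y_idx])
one_hot = np.eye(len(a_vec))[y_idx]
dC_da_vec = z_vec - one_hot          # gradient w.r.t. the pre-activations

print(C_bce, dC_da, C_ll, dC_da_vec)
```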
5 Convergence

Consider the quadratic approximation of the error function around the minimum point $\mathbf{w}^\star$:

$$E(\mathbf{w}) = E(\mathbf{w}^\star) + \nabla E(\mathbf{w}^\star)^T(\mathbf{w} - \mathbf{w}^\star) + \frac{1}{2}(\mathbf{w} - \mathbf{w}^\star)^T \mathbf{H} (\mathbf{w} - \mathbf{w}^\star),$$

but since $\nabla E(\mathbf{w}^\star) = 0$ we get

$$E(\mathbf{w}) = E(\mathbf{w}^\star) + \frac{1}{2}(\mathbf{w} - \mathbf{w}^\star)^T \mathbf{H} (\mathbf{w} - \mathbf{w}^\star).$$

Since the eigenvectors $\{\mathbf{u}_i\}_i$ of $\mathbf{H}$ form an orthonormal basis, we can write any vector as a linear combination of the $\mathbf{u}_i$ (in particular $\mathbf{w} - \mathbf{w}^\star = \sum_i \alpha_i \mathbf{u}_i$), which allows us to write

$$E(\mathbf{w}) = E(\mathbf{w}^\star) + \frac{1}{2}\sum_i \lambda_i \alpha_i^2.$$

Proof 5.1: Taylor expansion

We need to show that $E(\mathbf{w}) = E(\mathbf{w}^\star) + \frac{1}{2}\sum_i \lambda_i \alpha_i^2$. We know that:

$$\begin{aligned}
E(\mathbf{w}) &= E(\mathbf{w}^\star) + \frac{1}{2}(\mathbf{w} - \mathbf{w}^\star)^T \mathbf{H} (\mathbf{w} - \mathbf{w}^\star) \\
&= E(\mathbf{w}^\star) + \frac{1}{2}\left(\sum_i \alpha_i \mathbf{u}_i\right)^T \mathbf{H} \left(\sum_i \alpha_i \mathbf{u}_i\right) \\
&= E(\mathbf{w}^\star) + \frac{1}{2}\left(\sum_i \alpha_i \mathbf{u}_i\right)^T \left(\sum_i \alpha_i \mathbf{H}\mathbf{u}_i\right) \\
&= E(\mathbf{w}^\star) + \frac{1}{2}\left(\sum_i \alpha_i \mathbf{u}_i\right)^T \left(\sum_i \alpha_i \lambda_i \mathbf{u}_i\right) \\
&= E(\mathbf{w}^\star) + \frac{1}{2}\sum_i \lambda_i \alpha_i^2.
\end{aligned}$$

Proof 5.2: Gradient in the eigenbasis

Show that $\nabla E = \sum_i \alpha_i \lambda_i \mathbf{u}_i$.

$$\nabla E(\mathbf{w}) = \nabla\left(E(\mathbf{w}^\star) + \frac{1}{2}\sum_i \lambda_i \alpha_i^2\right) = \frac{1}{2}\sum_i \lambda_i\, 2\alpha_i \nabla\alpha_i.$$

To compute $\nabla\alpha_i$ we use the fact that $\mathbf{w} - \mathbf{w}^\star = \sum_j \alpha_j \mathbf{u}_j$:

$$\mathbf{u}_i^T(\mathbf{w} - \mathbf{w}^\star) = \mathbf{u}_i^T\left(\sum_j \alpha_j \mathbf{u}_j\right) = \alpha_i
\quad\Longrightarrow\quad
\sum_j w_j u_{ij} - \sum_j w_j^\star u_{ij} = \alpha_i,$$

so

$$\frac{\partial}{\partial w_k}\left(\sum_j w_j u_{ij} - \sum_j w_j^\star u_{ij}\right) = u_{ik} = \frac{\partial\alpha_i}{\partial w_k} \quad\Longrightarrow\quad \nabla\alpha_i = \mathbf{u}_i,$$

and this implies $\nabla E = \sum_i \alpha_i \lambda_i \mathbf{u}_i$.



We have that:

• $\delta\mathbf{w} = \sum_i \delta\alpha_i\,\mathbf{u}_i$;

• $\nabla E = \sum_i \alpha_i \lambda_i \mathbf{u}_i$;

• $\Delta\mathbf{w} = -\eta\nabla E$;

• $\Delta\alpha_i = -\eta\lambda_i\alpha_i \;\Longrightarrow\; \alpha_i^{\text{new}} = (1 - \eta\lambda_i)\,\alpha_i^{\text{old}}$.

This means that if $|1 - \eta\lambda_i| < 1$ then $\alpha_i$ decreases at each of the $T$ steps.

• Fastest convergence: $\eta = \frac{1}{\lambda_{\max}}$ (convergence rate 0 along that direction, minimum reached in one step).

• Direction of slowest convergence: $\lambda_{\min}$, where the rate is $1 - \frac{\lambda_{\min}}{\lambda_{\max}}$.

• Condition number of the Hessian matrix: $\frac{\lambda_{\max}}{\lambda_{\min}}$. The bigger it is, the slower the convergence (see the sketch below).
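A tiny NumPy sketch of these rates: gradient descent on a quadratic $E = \frac{1}{2}\sum_i \lambda_i \alpha_i^2$, where each coefficient shrinks by the factor $(1 - \eta\lambda_i)$ and the slow direction is governed by the condition number. The eigenvalues are illustrative assumptions.

```python
import numpy as np

# Illustrative eigenvalues of the Hessian (assumed): condition number 100.
lambdas = np.array([1.0, 10.0, 100.0])
eta = 1.0 / lambdas.max()          # fastest stable choice discussed above

alpha = np.array([1.0, 1.0, 1.0])  # coefficients of w - w* in the eigenbasis
for step in range(10):
    alpha = (1 - eta * lambdas) * alpha   # alpha_new = (1 - eta*lambda) * alpha_old

# The lambda_max direction is killed in one step; the lambda_min direction
# decays by (1 - lambda_min/lambda_max) = 0.99 per step, i.e. very slowly.
print(alpha)
```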
6 Momentum

Normally we have

$$\mathbf{w}^{(\tau)} = \mathbf{w}^{(\tau-1)} + \Delta\mathbf{w}^{(\tau-1)},$$

but adding the momentum term we get

$$\Delta\mathbf{w}^{(\tau-1)} = -\eta\nabla E(\mathbf{w}^{(\tau-1)}) + \mu\Delta\mathbf{w}^{(\tau-2)},$$

so (in a region where the gradient is roughly constant)

$$\Delta\mathbf{w} = -\frac{\eta}{1 - \mu}\nabla E.$$
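A minimal sketch of the momentum update above, applied to the same quadratic error used earlier; the values of η and μ are illustrative assumptions.

```python
import numpy as np

lambdas = np.array([1.0, 10.0, 100.0])    # assumed Hessian eigenvalues

def grad_E(alpha):
    return lambdas * alpha                # gradient of E = 1/2 sum_i lambda_i alpha_i^2

eta, mu = 0.005, 0.9
alpha = np.array([1.0, 1.0, 1.0])
delta = np.zeros_like(alpha)
for step in range(200):
    delta = -eta * grad_E(alpha) + mu * delta   # momentum update
    alpha = alpha + delta

# With a near-constant gradient the accumulated step tends to -eta/(1-mu) * grad E,
# i.e. a 10x larger effective learning rate for mu = 0.9.
print(alpha)
```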
7 Learning rate scheduling

• Linear: $\eta^{(\tau)} = \left(1 - \frac{\tau}{K}\right)\eta_0 + \frac{\tau}{K}\,\eta_K$.

• Power law: $\eta^{(\tau)} = \eta_0\left(1 + \frac{\tau}{s}\right)^c$.

• Exponential decay: $\eta^{(\tau)} = \eta_0\, c^{\tau/s}$ (see the sketch below).
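The three schedules as short NumPy functions; the constants $\eta_0$, $\eta_K$, $K$, $s$ and the exponent/decay factors below are illustrative assumptions.

```python
import numpy as np

eta0, etaK, K = 0.1, 0.001, 100    # assumed start/end rates and horizon
s = 20.0                           # assumed decay scale
c_pow = -0.5                       # assumed power-law exponent
c_exp = 0.5                        # assumed exponential decay factor

def eta_linear(tau):
    return (1 - tau / K) * eta0 + (tau / K) * etaK

def eta_power_law(tau):
    return eta0 * (1 + tau / s) ** c_pow

def eta_exponential(tau):
    return eta0 * c_exp ** (tau / s)

taus = np.arange(0, K + 1)
for f in (eta_linear, eta_power_law, eta_exponential):
    print(f.__name__, f(taus)[:3], "...", f(taus)[-1])
```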

8 Normalization

8.1 Data normalization

$$\tilde{x}_{ni} = \frac{x_{ni} - \mu_i}{\sigma_i}$$

for each dimension $i$.

8.2 Batch normalization

$$\mu_i = \frac{1}{K}\sum_{n=1}^{K} a_{ni}, \qquad
\sigma_i^2 = \frac{1}{K}\sum_{n=1}^{K} (a_{ni} - \mu_i)^2, \qquad
\hat{a}_{ni} = \frac{a_{ni} - \mu_i}{\sqrt{\sigma_i^2 + \delta}}.$$

After training we use a moving average of the mean and variance:

$$\mu_i^{(\tau)} = \alpha\mu_i^{(\tau-1)} + (1 - \alpha)\mu_i, \qquad
\sigma_i^{(\tau)} = \alpha\sigma_i^{(\tau-1)} + (1 - \alpha)\sigma_i, \qquad
0 \leqslant \alpha \leqslant 1.$$
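A NumPy sketch of batch normalization over one mini-batch, including the moving averages used at inference time. Batch size, feature count and α are illustrative assumptions, and the learnable scale/shift parameters of the full method are omitted, matching the formulas above.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((32, 8))         # assumed mini-batch: K=32 samples, 8 features
delta = 1e-5                             # small constant for numerical stability
alpha = 0.9                              # assumed moving-average coefficient

mu = A.mean(axis=0)                      # mu_i over the batch
var = A.var(axis=0)                      # sigma_i^2 over the batch
A_hat = (A - mu) / np.sqrt(var + delta)  # normalized pre-activations

# Running statistics used after training (in place of batch statistics).
running_mu = np.zeros(8)
running_sigma = np.ones(8)
running_mu = alpha * running_mu + (1 - alpha) * mu
running_sigma = alpha * running_sigma + (1 - alpha) * np.sqrt(var)

print(A_hat.mean(axis=0).round(6), A_hat.std(axis=0).round(3))
```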
9 CNNs

The cross-entropy between two discrete distributions $p$ and $q$ measures how much $q$ differs from $p$:

$$H(p, q) = -\sum_v p(v)\cdot\log(q(v)).$$

CNNs employ the cross-entropy loss:

$$-\sum_{i=1}^{S} y_i\cdot\log(p_i).$$

10 Autoencoders

Remember how PCA works:

$$f(x) = \arg\min_h \|x - g(h)\|^2,$$

where $g(h) = Dh$. So we are interested in measuring the loss of the reconstruction

$$L(x, g(f(x))).$$

10.1 Sparse autoencoders

Here the loss function has a penalty on $h$:

$$L(x, g(f(x))) + \Omega(h).$$

Consider the distribution

$$p_{\text{model}}(h, x) = p_{\text{model}}(h)\,p_{\text{model}}(x \mid h)$$

and, marginalizing,

$$p_{\text{model}}(x) = \sum_h p_{\text{model}}(h, x)
\quad\Longrightarrow\quad
\log p_{\text{model}}(x) = \log \sum_h p_{\text{model}}(h, x).$$

So, given an $\tilde{h}$ generated by the encoder, we have

$$\log p_{\text{model}}(x) = \log \sum_h p_{\text{model}}(h, x) \approx \log p(\tilde{h}, x) = \log p_{\text{model}}(\tilde{h}) + \log p_{\text{model}}(x \mid \tilde{h}).$$

If we set $\Omega(h) = \lambda \sum_i |h_i|$ (L1 norm of $h$), then minimizing the sparsity term is equal to maximizing the log likelihood of $p(h)$ assuming a Laplace prior over each component independently:

$$p_{\text{model}}(h_i) = \frac{\lambda}{2}\,e^{-\lambda|h_i|}
\quad\Longrightarrow\quad
-\log p_{\text{model}}(h) = \sum_i\left(\lambda|h_i| - \log\frac{\lambda}{2}\right) = \Omega(h) + \text{const}.$$

10.2 Denoising autoencoders

They minimize

$$L(x, g(f(\tilde{x}))),$$

where $\tilde{x}$ is a corrupted copy of $x$.

10.3 Contractive autoencoders

They minimize

$$L(x, g(f(x))) + \Omega(h) \quad\text{with}\quad \Omega(h, x) = \lambda\sum_i \|\nabla_x h_i\|^2.$$
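A minimal NumPy sketch of the three objectives above for a one-layer linear autoencoder; the encoder/decoder weights, the noise level and λ are illustrative assumptions, and no training loop is shown.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 8, 3                        # assumed input and code dimensions
W_enc = rng.standard_normal((H, D)) * 0.1
W_dec = rng.standard_normal((D, H)) * 0.1
lam = 1e-3                         # assumed penalty weight

def f(x):                          # encoder h = f(x)
    return W_enc @ x

def g(h):                          # decoder reconstruction g(h)
    return W_dec @ h

x = rng.standard_normal(D)
h = f(x)

recon_loss = np.sum((x - g(h)) ** 2)                 # L(x, g(f(x)))
sparse_loss = recon_loss + lam * np.sum(np.abs(h))   # + Omega(h), L1 penalty

x_noisy = x + 0.1 * rng.standard_normal(D)           # corrupted input x~
denoise_loss = np.sum((x - g(f(x_noisy))) ** 2)      # L(x, g(f(x~)))

# Contractive penalty: for a linear encoder, grad_x h_i is simply row i of W_enc.
contractive_loss = recon_loss + lam * np.sum(W_enc ** 2)

print(recon_loss, sparse_loss, denoise_loss, contractive_loss)
```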



11 Transformers

Consider the attention to embedding $\mathbf{y}_n$ as

$$a_{nm} = \frac{\exp(\mathbf{x}_n^T \mathbf{x}_m)}{\sum_{m'=1}^{N}\exp(\mathbf{x}_n^T \mathbf{x}_{m'})}.$$

Therefore we can express our new embeddings $\mathbf{Y}$ as

$$\mathbf{Y} = \mathrm{SoftMax}\!\left[\mathbf{X}\mathbf{X}^T\right]\mathbf{X} = \mathrm{SoftMax}\!\left[\mathbf{Q}\mathbf{K}^T\right]\mathbf{V},$$

where queries, keys and values are trainable:

$$\mathbf{Q} = \mathbf{X}\mathbf{W}^{(q)}, \qquad \mathbf{K} = \mathbf{X}\mathbf{W}^{(k)}, \qquad \mathbf{V} = \mathbf{X}\mathbf{W}^{(v)}.$$

Then the embeddings get scaled by the dimensionality of the key vectors:

$$\mathbf{Y} = \mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{SoftMax}\!\left[\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{D_k}}\right]\mathbf{V}.$$

In a multi-head scenario, where $\mathbf{H}_h = \mathrm{Attention}(\mathbf{Q}_h, \mathbf{K}_h, \mathbf{V}_h)$, we have

$$\underbrace{\mathbf{Y}(\mathbf{X})}_{N \times D} = \underbrace{\mathrm{Concat}[\mathbf{H}_1, \dots, \mathbf{H}_H]}_{N \times H D_v}\;\underbrace{\mathbf{W}^{(o)}}_{H D_v \times D}.$$

To improve learning it is possible to add a residual connection,

$$\mathbf{Z} = \mathrm{LayerNorm}[\mathbf{Y}(\mathbf{X}) + \mathbf{X}] \qquad\text{or}\qquad \mathbf{Z} = \mathbf{Y}(\mathrm{LayerNorm}(\mathbf{X})) + \mathbf{X},$$

and then pass through an MLP with ReLU activation:

$$\widetilde{\mathbf{X}} = \mathrm{LayerNorm}[\mathrm{MLP}(\mathbf{Z}) + \mathbf{Z}] \qquad\text{or}\qquad \widetilde{\mathbf{X}} = \mathrm{MLP}[\mathrm{LayerNorm}(\mathbf{Z})] + \mathbf{Z}.$$
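A NumPy sketch of single-head scaled dot-product self-attention as defined above; the sequence length, model width and random projection matrices are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, D_k, D_v = 5, 16, 8, 8           # assumed sequence length and widths
X = rng.standard_normal((N, D))        # token embeddings

W_q = rng.standard_normal((D, D_k)) * 0.1
W_k = rng.standard_normal((D, D_k)) * 0.1
W_v = rng.standard_normal((D, D_v)) * 0.1

def softmax_rows(S):
    S = S - S.max(axis=1, keepdims=True)   # numerical stability
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

Q, K, V = X @ W_q, X @ W_k, X @ W_v
A = softmax_rows(Q @ K.T / np.sqrt(D_k))   # attention weights a_nm, rows sum to 1
Y = A @ V                                  # new embeddings, shape (N, D_v)
print(Y.shape, A.sum(axis=1))
```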
11.1 Positional encoding

We concatenate the input $\mathbf{x}$ to a positional encoding $\mathbf{r}$, obtaining the representation $\mathbf{x} | \mathbf{r}$. We can apply a linear transformation $[\mathbf{w}_x | \mathbf{w}_r]$:

$$\begin{pmatrix} \mathbf{w}_x & \mathbf{w}_r \end{pmatrix}\begin{pmatrix} \mathbf{x} \\ \mathbf{r} \end{pmatrix} = \mathbf{w}_x\mathbf{x} + \mathbf{w}_r\mathbf{r},$$

which reduces to $\mathbf{w}(\mathbf{x} + \mathbf{r})$ when $\mathbf{w}_x = \mathbf{w}_r = \mathbf{w}$.

The encoding must be:

• unique for each position;

• bounded;

• generalizable to sequences of arbitrary length;

• capable of expressing relative positions.

Sinusoidal positional encoding (see the sketch below):

$$\mathbf{r}_n = \begin{pmatrix} \sin(w_1 \cdot n) \\ \cos(w_1 \cdot n) \\ \sin(w_2 \cdot n) \\ \cos(w_2 \cdot n) \\ \vdots \\ \sin(w_{D/2} \cdot n) \\ \cos(w_{D/2} \cdot n) \end{pmatrix}, \qquad w_i = \frac{1}{10000^{2i/D}}.$$
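A short NumPy sketch of the sinusoidal encoding $\mathbf{r}_n$; it also checks numerically the dot-product property stated next, namely that $\mathbf{r}_n^T\mathbf{r}_m$ depends only on $n - m$. The embedding dimension is an illustrative assumption.

```python
import numpy as np

def positional_encoding(n, D):
    """Sinusoidal encoding r_n with w_i = 1 / 10000**(2*i/D)."""
    i = np.arange(1, D // 2 + 1)
    w = 1.0 / 10000 ** (2 * i / D)
    r = np.empty(D)
    r[0::2] = np.sin(w * n)   # sin(w_i * n) in the even slots
    r[1::2] = np.cos(w * n)   # cos(w_i * n) in the odd slots
    return r

D = 16                        # assumed embedding dimension
r = {n: positional_encoding(n, D) for n in range(10)}

# r_n . r_m = sum_i cos(w_i (n - m)): it depends only on the offset n - m.
print(np.isclose(r[7] @ r[4], r[5] @ r[2]))   # same offset 3 -> same value
```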

This is good because

$$\mathbf{r}_n^T\mathbf{r}_m = \sum_{i=1}^{D/2}\cos(w_i\cdot(n - m)).$$

The encoding of $n + m$ can always be expressed as a linear combination of the encodings of $n$ and $m$, and it is always possible to find a matrix $\mathbf{M}$ that depends only on $k$ such that $\mathbf{r}_{n+k} = \mathbf{M}\mathbf{r}_n$:

$$\begin{pmatrix} v_1 & v_2 \\ v_3 & v_4 \end{pmatrix}\cdot\begin{pmatrix} \sin(w_i\cdot n) \\ \cos(w_i\cdot n) \end{pmatrix} = \begin{pmatrix} \sin(w_i\cdot(n+k)) \\ \cos(w_i\cdot(n+k)) \end{pmatrix}.$$

Proof 11.1: The rotation matrix

We have

$$\begin{pmatrix} v_1 & v_2 \\ v_3 & v_4 \end{pmatrix}\cdot\begin{pmatrix} \sin(w_i\cdot n) \\ \cos(w_i\cdot n) \end{pmatrix} = \begin{pmatrix} v_1\sin(w_i\cdot n) + v_2\cos(w_i\cdot n) \\ v_3\sin(w_i\cdot n) + v_4\cos(w_i\cdot n) \end{pmatrix},$$

but

$$\begin{pmatrix} \sin(w_i\cdot(n+k)) \\ \cos(w_i\cdot(n+k)) \end{pmatrix} = \begin{pmatrix} \sin(w_i\cdot n)\cos(w_i\cdot k) + \cos(w_i\cdot n)\sin(w_i\cdot k) \\ \cos(w_i\cdot n)\cos(w_i\cdot k) - \sin(w_i\cdot n)\sin(w_i\cdot k) \end{pmatrix},$$

so

$$\begin{pmatrix} v_1\sin(w_i\cdot n) + v_2\cos(w_i\cdot n) \\ v_3\sin(w_i\cdot n) + v_4\cos(w_i\cdot n) \end{pmatrix} = \begin{pmatrix} \sin(w_i\cdot n)\cos(w_i\cdot k) + \cos(w_i\cdot n)\sin(w_i\cdot k) \\ \cos(w_i\cdot n)\cos(w_i\cdot k) - \sin(w_i\cdot n)\sin(w_i\cdot k) \end{pmatrix},$$

and this means

$$\begin{pmatrix} v_1 & v_2 \\ v_3 & v_4 \end{pmatrix} = \begin{pmatrix} \cos(w_i\cdot k) & \sin(w_i\cdot k) \\ -\sin(w_i\cdot k) & \cos(w_i\cdot k) \end{pmatrix}.$$

11.2 GPTs

The goal is to use transformers to build an autoregressive model of the form

$$p(x_1, x_2, \dots, x_N) = \prod_{n=1}^{N} p(x_n \mid x_1, x_2, \dots, x_{n-1}).$$

Here the attention weights are computed using $\mathbf{Q}\mathbf{K}^T$ as before, but we can set the attention weights to zero for all future tokens: $(\mathbf{Q}\mathbf{K}^T)_{nm}$, the attention weight between tokens $n$ and $m$, is multiplied by a mask matrix $\mathbf{M}$ that has $-\infty$ in the upper triangular part:

$$\mathbf{Y} = \mathrm{SoftMax}\!\left[\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{D_k}}\circ\mathbf{M}\right]\mathbf{V}.$$

Temperature scaling:

$$y_i = \frac{\exp\!\left(\frac{a_i}{T}\right)}{\sum_j\exp\!\left(\frac{a_j}{T}\right)}.$$
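A NumPy sketch of causal self-attention and temperature scaling. Here the mask is applied additively (adding $-\infty$ to the future-token scores before the softmax), a common equivalent way of realizing the masking described above; all sizes and values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D_k = 5, 8                              # assumed sequence length and key width
Q = rng.standard_normal((N, D_k))
K = rng.standard_normal((N, D_k))
V = rng.standard_normal((N, D_k))

def softmax_rows(S):
    S = S - S.max(axis=1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

# Causal mask: -inf above the diagonal, so future tokens get zero weight.
mask = np.triu(np.full((N, N), -np.inf), k=1)
scores = Q @ K.T / np.sqrt(D_k) + mask
A = softmax_rows(scores)                   # upper-triangular weights are exactly 0
Y = A @ V

# Temperature scaling of output logits: T < 1 sharpens, T > 1 flattens.
def temperature_softmax(a, T):
    return softmax_rows((a / T)[None, :])[0]

logits = np.array([2.0, 1.0, 0.5])
print(np.triu(A, k=1).max(), temperature_softmax(logits, 0.5), temperature_softmax(logits, 2.0))
```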
