NNDL Cheatsheet
1 Matrix based notation

The activation $z^l_j$ of the $j$-th neuron of the $l$-th layer is

$$ z^l_j = \sigma\!\left( \sum_k w^l_{jk} z^{l-1}_k + b^l_j \right). $$

Now take $W^l$ as the matrix

$$ W^l = \begin{pmatrix} w^l_{00} & w^l_{01} & w^l_{02} & \cdots \\ w^l_{10} & w^l_{11} & w^l_{12} & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{pmatrix} $$

and we define

$$ a^l := W^l z^{l-1} + b^l $$

so that $z^l = \sigma(a^l)$.

Hadamard product: the element-wise product of matrices, e.g.

$$ \begin{pmatrix} 1 \\ 3 \end{pmatrix} \odot \begin{pmatrix} 2 \\ 4 \end{pmatrix} = \begin{pmatrix} 1 \times 2 \\ 3 \times 4 \end{pmatrix} = \begin{pmatrix} 2 \\ 12 \end{pmatrix}. $$
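A minimal numpy sketch of this notation, assuming illustrative layer sizes and a logistic `sigmoid` for $\sigma$ (none of these choices come from the cheatsheet): it computes $a^l = W^l z^{l-1} + b^l$, $z^l = \sigma(a^l)$, and uses `*` for the Hadamard product.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)

# Illustrative layer sizes: 3 inputs -> 4 neurons in layer l.
W = rng.standard_normal((4, 3))   # W^l, one row per neuron j
b = rng.standard_normal(4)        # b^l
z_prev = rng.standard_normal(3)   # z^{l-1}

a = W @ z_prev + b                # a^l = W^l z^{l-1} + b^l
z = sigmoid(a)                    # z^l = sigma(a^l)

# Hadamard (element-wise) product is numpy's * operator.
print(np.array([1, 3]) * np.array([2, 4]))   # -> [2 12], i.e. [1*2, 3*4]
```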
2 Cost function

It must be:

• expressed as a mean over the single inputs;
• a function of the outputs of the network.

Example: quadratic cost function

$$ C = \frac{1}{2n} \sum_x \left\| y(x) - z^L(x) \right\|^2. $$
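As a quick sanity check, a possible implementation of the quadratic cost over a toy batch (shapes and values are made up):

```python
import numpy as np

def quadratic_cost(Y, Z):
    """C = 1/(2n) * sum_x ||y(x) - z^L(x)||^2, one row per training input."""
    n = Y.shape[0]
    return np.sum((Y - Z) ** 2) / (2 * n)

# Toy batch of n=2 targets and network outputs (illustrative values).
Y = np.array([[1.0, 0.0], [0.0, 1.0]])
Z = np.array([[0.8, 0.1], [0.2, 0.7]])
print(quadratic_cost(Y, Z))
```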
3 The Four Fundamental Equations

Define $\delta^l_j$ as the error at level $l$ of neuron $j$:

$$ \delta^l_j := \frac{\partial C}{\partial a^l_j}. $$

3.1 BP1

$$ \delta^L_j = \frac{\partial C}{\partial z^L_j} \cdot \sigma'(a^L_j) $$

⇓

$$ \delta^L = \nabla_z C \odot \sigma'(a^L). $$

3.2 BP2

$$ \delta^l_j = \sum_k w^{l+1}_{kj} \delta^{l+1}_k \, \sigma'(a^l_j) $$

⇓

$$ \delta^l = \left( (W^{l+1})^T \delta^{l+1} \right) \odot \sigma'(a^l). $$

3.3 BP3

$$ \frac{\partial C}{\partial b^l_j} = \delta^l_j. $$

3.4 BP4

$$ \frac{\partial C}{\partial w^l_{jk}} = z^{l-1}_k \, \delta^l_j = z_{\text{in}} \, \delta_{\text{out}}. $$

Proof 3.1: BP1

$$ \delta^L_j = \frac{\partial C}{\partial a^L_j} = \sum_k \frac{\partial C}{\partial z^L_k} \frac{\partial z^L_k}{\partial a^L_j} = \frac{\partial C}{\partial z^L_j} \frac{\partial z^L_j}{\partial a^L_j} = \frac{\partial C}{\partial z^L_j} \, \sigma'(a^L_j), $$

where the sum collapses because $z^L_k = \sigma(a^L_k)$ depends only on $a^L_k$.

Proof 3.2: BP2

Here we must show that

$$ \delta^l_j := \frac{\partial C}{\partial a^l_j} = \left[ (W^{l+1})^T \delta^{l+1} \odot \sigma'(a^l) \right]_j = \sum_k w^{l+1}_{kj} \delta^{l+1}_k \cdot \sigma'(a^l_j). $$

By the chain rule,

$$ \delta^l_j := \frac{\partial C}{\partial a^l_j} = \sum_k \underbrace{\frac{\partial C}{\partial a^{l+1}_k}}_{=\,\delta^{l+1}_k \text{ by def.}} \frac{\partial a^{l+1}_k}{\partial a^l_j} = \sum_k \delta^{l+1}_k \frac{\partial a^{l+1}_k}{\partial a^l_j}. $$

But we know that

$$ a^{l+1}_k = \sum_j w^{l+1}_{kj} z^l_j + b^{l+1}_k = \sum_j w^{l+1}_{kj} \sigma(a^l_j) + b^{l+1}_k, $$

so we have that

$$ \frac{\partial a^{l+1}_k}{\partial a^l_j} = w^{l+1}_{kj} \, \sigma'(a^l_j). $$

So, putting it all together, we get

$$ \delta^l_j = \sum_k w^{l+1}_{kj} \delta^{l+1}_k \, \sigma'(a^l_j). $$

Proof 3.3: BP3

We must show that $\frac{\partial C}{\partial b^l_j} = \delta^l_j$. Think of $C$ as a function of $a^l_j$ and use the chain rule:

$$ \frac{\partial C}{\partial b^l_j} = \sum_k \frac{\partial C}{\partial a^l_k} \frac{\partial a^l_k}{\partial b^l_j} = \frac{\partial C}{\partial a^l_j} \underbrace{\frac{\partial a^l_j}{\partial b^l_j}}_{=\,1} = \delta^l_j. $$
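Below is a sketch of how BP1–BP4 translate to numpy for a fully connected network with quadratic cost; the 2-3-1 architecture, random initialization, and sigmoid activation are assumptions made for the example, not prescribed by the cheatsheet.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_prime(a):
    s = sigmoid(a)
    return s * (1.0 - s)

def backprop(x, y, Ws, bs):
    """Gradients (dC/dW, dC/db) for one input, quadratic cost C = 1/2 ||y - z^L||^2."""
    # Forward pass: store a^l and z^l for every layer.
    zs, activations = [x], []
    for W, b in zip(Ws, bs):
        a = W @ zs[-1] + b
        activations.append(a)
        zs.append(sigmoid(a))
    # BP1: delta^L = grad_z C (.) sigma'(a^L); for the quadratic cost grad_z C = z^L - y.
    delta = (zs[-1] - y) * sigmoid_prime(activations[-1])
    dWs, dbs = [None] * len(Ws), [None] * len(bs)
    for l in range(len(Ws) - 1, -1, -1):
        dbs[l] = delta                       # BP3: dC/db^l = delta^l
        dWs[l] = np.outer(delta, zs[l])      # BP4: dC/dw^l_{jk} = z^{l-1}_k delta^l_j
        if l > 0:
            # BP2: delta^l = (W^{l+1})^T delta^{l+1} (.) sigma'(a^l)
            delta = (Ws[l].T @ delta) * sigmoid_prime(activations[l - 1])
    return dWs, dbs

# Tiny 2-3-1 network with random weights (illustrative only).
rng = np.random.default_rng(1)
Ws = [rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
bs = [rng.standard_normal(3), rng.standard_normal(1)]
dWs, dbs = backprop(np.array([0.5, -0.2]), np.array([1.0]), Ws, bs)
print([g.shape for g in dWs])   # [(3, 2), (1, 3)]
```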
4 Cross-entropy cost

For a single sigmoid neuron the cross-entropy cost yields the gradients

$$ \frac{\partial C}{\partial w_j} = \frac{1}{n} \sum_x x_j \,(\sigma(a) - y), \qquad \frac{\partial C}{\partial b} = \frac{1}{n} \sum_x (\sigma(a) - y). $$

We can generalize to multi-layer networks:

$$ C = -\frac{1}{n} \sum_x \sum_j \left[ y_j \ln z^L_j + (1 - y_j) \ln(1 - z^L_j) \right]. $$

Softmax activation with log-likelihood cost function:

$$ z^L_j = \frac{e^{a^L_j}}{\sum_k e^{a^L_k}}, \qquad C = -\ln z^L_y. $$
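A small numerical check of the softmax / log-likelihood pair (the logits and target class index are made-up values; the max-subtraction trick is only for numerical stability):

```python
import numpy as np

def softmax(a):
    # Subtracting the max does not change the result but avoids overflow.
    e = np.exp(a - np.max(a))
    return e / e.sum()

def log_likelihood_cost(a, y):
    """C = -ln z^L_y, where y is the index of the correct class."""
    return -np.log(softmax(a)[y])

a = np.array([2.0, 1.0, 0.1])   # illustrative output activations a^L
print(softmax(a), log_likelihood_cost(a, y=0))
```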
5 Convergence

Consider the quadratic approximation of the error function around the minimum point $w^\star$:

$$ E(w) = E(w^\star) + \nabla E(w^\star)^T (w - w^\star) + \frac{1}{2} (w - w^\star)^T H (w - w^\star). $$

Since $w^\star$ is a minimum, $\nabla E(w^\star) = 0$, so

$$ E(w) = E(w^\star) + \frac{1}{2} (w - w^\star)^T H (w - w^\star). $$

Since $\{u_i\}_i$ (the eigenvectors of $H$, with eigenvalues $\lambda_i$) is an orthonormal basis, we can write any vector as a linear combination of the $u_i$, which allows us to write

$$ E(w) = E(w^\star) + \frac{1}{2} \sum_i \lambda_i \alpha_i^2. $$

To compute $\nabla \alpha_i$ we use the fact that $w - w^\star = \sum_j \alpha_j u_j$:

$$ u_i^T (w - w^\star) = u_i^T \Big( \sum_j \alpha_j u_j \Big) = \alpha_i, \qquad \sum_j w_j u_{ij} - \sum_j w^\star_j u_{ij} = \alpha_i, $$

so

$$ \frac{\partial}{\partial w_k} \Big( \sum_j w_j u_{ij} - \sum_j w^\star_j u_{ij} \Big) = u_{ik}, \qquad \frac{\partial \alpha_i}{\partial w_k} = u_{ik} \;\Longrightarrow\; \nabla \alpha_i = u_i. $$

We have that:

• $\delta w = \sum_i \delta\alpha_i \, u_i$;
• $\nabla E = \sum_i \alpha_i \lambda_i u_i$.
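A numerical sanity check of this decomposition for a purely quadratic error; the Hessian `H`, the minimum `w_star`, and the test point are arbitrary choices for the example:

```python
import numpy as np

rng = np.random.default_rng(2)

# A symmetric positive-definite Hessian and a minimum point (illustrative).
A = rng.standard_normal((3, 3))
H = A @ A.T + 3 * np.eye(3)
w_star = rng.standard_normal(3)

def E(w):
    # Quadratic error with minimum at w_star and E(w_star) = 0.
    d = w - w_star
    return 0.5 * d @ H @ d

lam, U = np.linalg.eigh(H)        # columns of U are the eigenvectors u_i

w = rng.standard_normal(3)
alpha = U.T @ (w - w_star)        # alpha_i = u_i^T (w - w_star)

print(E(w), 0.5 * np.sum(lam * alpha**2))   # the two values coincide
```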
Normally we have

$$ w^{(\tau)} = w^{(\tau-1)} + \Delta w^{(\tau-1)}. $$

7 Learning rate scheduling

• Linear: $\eta^{(\tau)} = \left(1 - \frac{\tau}{K}\right) \eta_0 + \frac{\tau}{K} \eta_K$.

• Power law: $\eta^{(\tau)} = \eta_0 \left(1 + \frac{\tau}{s}\right)^c$.

• Exponential decay: $\eta^{(\tau)} = \eta_0 \, c^{\tau/s}$.
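A sketch of the three schedules; the constants $\eta_0$, $\eta_K$, $K$, $s$, $c$ below are illustrative, and the sign convention for $c$ in the power law (negative so the rate decays) is an assumption.

```python
import numpy as np

def eta_linear(tau, eta0, etaK, K):
    """Linear: interpolate from eta0 to etaK over K steps."""
    return (1 - tau / K) * eta0 + (tau / K) * etaK

def eta_power(tau, eta0, s, c):
    """Power law: eta0 * (1 + tau/s)^c."""
    return eta0 * (1 + tau / s) ** c

def eta_exp(tau, eta0, s, c):
    """Exponential decay: eta0 * c^(tau/s), with 0 < c < 1."""
    return eta0 * c ** (tau / s)

for tau in (0, 50, 100):
    print(eta_linear(tau, 0.1, 0.001, 100),
          eta_power(tau, 0.1, 25, -0.5),
          eta_exp(tau, 0.1, 25, 0.5))
```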
For batch normalization we standardize each dimension of the input,

$$ \tilde{x}_{ni} = \frac{x_{ni} - \mu_i}{\sigma_i} $$

for each dimension $i$. After training we use a moving average of the mean and variance:

$$ \mu_i^{(\tau)} = \alpha \mu_i^{(\tau-1)} + (1 - \alpha)\mu_i, \qquad \sigma_i^{(\tau)} = \alpha \sigma_i^{(\tau-1)} + (1 - \alpha)\sigma_i, \qquad 0 \leqslant \alpha \leqslant 1. $$
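A possible implementation of the normalization step with running statistics; the momentum $\alpha = 0.9$ and the small `eps` are assumed values, not from the cheatsheet.

```python
import numpy as np

def batchnorm_train(X, running_mu, running_sigma, alpha=0.9, eps=1e-8):
    """Standardize each column of X and update the running statistics."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    X_tilde = (X - mu) / (sigma + eps)
    # Moving averages, used in place of batch statistics at inference time.
    running_mu = alpha * running_mu + (1 - alpha) * mu
    running_sigma = alpha * running_sigma + (1 - alpha) * sigma
    return X_tilde, running_mu, running_sigma

rng = np.random.default_rng(3)
X = rng.normal(loc=5.0, scale=2.0, size=(32, 4))      # a toy mini-batch
mu0, sigma0 = np.zeros(4), np.ones(4)
X_tilde, mu1, sigma1 = batchnorm_train(X, mu0, sigma0)
print(X_tilde.mean(axis=0).round(3), X_tilde.std(axis=0).round(3))
```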
10 Autoencoders

Here $f$ is the encoder and $g$ the decoder; for a linear decoder, $g(h) = Dh$. We are interested in measuring the loss of the reconstruction

$$ L(x, g(f(x))). $$

Sparse autoencoders add a sparsity penalty $\Omega(h)$ on the code $h = f(x)$ to this loss. Marginalizing,

$$ p_{\text{model}}(x) = \sum_h p_{\text{model}}(h, x) $$

⇓

$$ \log p_{\text{model}}(x) = \log \sum_h p_{\text{model}}(h, x). $$

If we set $\Omega(h) = \lambda \sum_i |h_i|$ (the L1 norm of $h$), then minimizing the sparsity term is equal to maximizing the log likelihood of $p(h)$ assuming a Laplace prior over each component independently.

10.3 Contractive autoencoders

They minimize

$$ L(x, g(f(x))) + \Omega(h) \quad \text{with} \quad \Omega(h, x) = \lambda \sum_i \left\| \nabla_x h_i \right\|^2. $$
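A toy sketch of the two penalties for a linear encoder/decoder pair; the matrices `E` and `D`, the sizes, and $\lambda = 0.1$ are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(4)
D_in, D_h = 6, 3
E = rng.standard_normal((D_h, D_in))   # encoder weights, h = f(x) = E x
D = rng.standard_normal((D_in, D_h))   # decoder weights, g(h) = D h

def reconstruction_loss(x):
    h = E @ x
    return np.sum((x - D @ h) ** 2)           # L(x, g(f(x)))

def omega_sparse(x, lam=0.1):
    return lam * np.sum(np.abs(E @ x))        # lambda * sum_i |h_i|

def omega_contractive(x, lam=0.1):
    # grad_x h_i is the i-th row of E for a linear encoder (independent of x).
    return lam * np.sum(np.linalg.norm(E, axis=1) ** 2)

x = rng.standard_normal(D_in)
print(reconstruction_loss(x) + omega_sparse(x))       # sparse objective
print(reconstruction_loss(x) + omega_contractive(x))  # contractive objective
```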
11 Transformers

The simplest form of self-attention uses the inputs themselves:

$$ Y = \text{SoftMax}\left[ X X^T \right] X, $$

but with learned projections

$$ Q = X W^{(q)}, \qquad K = X W^{(k)}, \qquad V = X W^{(v)} $$

it becomes

$$ Y = \text{SoftMax}\left[ Q K^T \right] V. $$

Then the dot products are scaled by the square root of the key dimensionality $D_k$:

$$ Y = \text{Attention}(Q, K, V) = \text{SoftMax}\left( \frac{Q K^T}{\sqrt{D_k}} \right) V. $$
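A compact numpy version of scaled dot-product attention (the token count and widths are arbitrary choices for the example):

```python
import numpy as np

def softmax(A, axis=-1):
    e = np.exp(A - A.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: SoftMax(Q K^T / sqrt(D_k)) V."""
    Dk = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(Dk)) @ V

rng = np.random.default_rng(5)
N, D, Dk, Dv = 4, 8, 5, 6             # N tokens of width D (illustrative)
X = rng.standard_normal((N, D))
Wq, Wk, Wv = (rng.standard_normal((D, d)) for d in (Dk, Dk, Dv))
Y = attention(X @ Wq, X @ Wk, X @ Wv)
print(Y.shape)                        # (4, 6)
```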
In a multi-head scenario, where $H_h = \text{Attention}(Q_h, K_h, V_h)$, we have

$$ \underbrace{Y(X)}_{N \times D} = \underbrace{\text{Concat}\left[ H_1, \dots, H_H \right]}_{N \times H D_v} \; \underbrace{W^{(o)}}_{H D_v \times D}. $$

The output then goes through a residual connection and layer normalization, in either the post-norm or the pre-norm variant:

$$ Z = \text{LayerNorm}\left[ Y(X) + X \right] \qquad \text{or} \qquad Z = Y(\text{LayerNorm}(X)) + X, $$

and then through an MLP with ReLU activation:

$$ \tilde{X} = \text{LayerNorm}\left[ \text{MLP}(Z) + Z \right] \qquad \text{or} \qquad \tilde{X} = \text{MLP}\left[ \text{LayerNorm}(Z) \right] + Z. $$
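A minimal sketch of the multi-head combination followed by a pre-norm residual block; the head count, widths, and the two-layer ReLU MLP are assumptions made for the example.

```python
import numpy as np

def softmax(A, axis=-1):
    e = np.exp(A - A.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def layer_norm(X, eps=1e-6):
    mu = X.mean(axis=-1, keepdims=True)
    sd = X.std(axis=-1, keepdims=True)
    return (X - mu) / (sd + eps)

rng = np.random.default_rng(6)
N, D, H, Dk, Dv = 4, 8, 2, 4, 4       # illustrative sizes
Wq = [rng.standard_normal((D, Dk)) for _ in range(H)]
Wk = [rng.standard_normal((D, Dk)) for _ in range(H)]
Wv = [rng.standard_normal((D, Dv)) for _ in range(H)]
Wo = rng.standard_normal((H * Dv, D))
W1, W2 = rng.standard_normal((D, 4 * D)), rng.standard_normal((4 * D, D))

def block(X):
    # Pre-norm attention sub-layer: Z = Y(LayerNorm(X)) + X
    Xn = layer_norm(X)
    heads = [attention(Xn @ Wq[h], Xn @ Wk[h], Xn @ Wv[h]) for h in range(H)]
    Z = np.concatenate(heads, axis=-1) @ Wo + X
    # Pre-norm MLP sub-layer: X~ = MLP(LayerNorm(Z)) + Z
    return np.maximum(layer_norm(Z) @ W1, 0) @ W2 + Z

X = rng.standard_normal((N, D))
print(block(X).shape)                 # (4, 8)
```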
11.1 Positional encoding

We concatenate the input $x$ with a positional encoding $r$, obtaining the representation $x | r$. We can then apply a linear transformation $[w_x \; w_r]$:

$$ \begin{bmatrix} w_x & w_r \end{bmatrix} \begin{bmatrix} x \\ r \end{bmatrix} = w_x x + w_r r, $$

which reduces to $w(x + r)$ when the two blocks share the same weights $w_x = w_r = w$.

The encoding must be:

• unique for each position;
• bounded;

With a sinusoidal encoding, the addition formulas

$$ \sin(w_i (n + k)) = \sin(w_i n)\cos(w_i k) + \cos(w_i n)\sin(w_i k), $$
$$ \cos(w_i (n + k)) = \cos(w_i n)\cos(w_i k) - \sin(w_i n)\sin(w_i k) $$

mean that

$$ \begin{bmatrix} \sin(w_i (n + k)) \\ \cos(w_i (n + k)) \end{bmatrix} = \begin{bmatrix} \cos(w_i k) & \sin(w_i k) \\ -\sin(w_i k) & \cos(w_i k) \end{bmatrix} \begin{bmatrix} \sin(w_i n) \\ \cos(w_i n) \end{bmatrix}, $$

i.e. the encoding of position $n + k$ is a linear transformation (a rotation) of the encoding of position $n$.

11.2 GPTs

The goal is to use transformers to build an autoregressive model of the form

$$ p(x_1, \dots, x_N) = \prod_{n=1}^{N} p(x_n \mid x_1, \dots, x_{n-1}). $$

Here the attention weights are computed using $QK^T$ as before, but we set the attention weights to zero for all future tokens: the pre-softmax score $(QK^T)_{nm}$ between tokens $n$ and $m$ is combined with a mask matrix $M$ that has $-\infty$ in the upper triangular part,

$$ Y = \text{SoftMax}\left( \frac{Q K^T}{\sqrt{D_k}} + M \right) V. $$

Temperature scaling:

$$ y_i = \frac{\exp(a_i / T)}{\sum_j \exp(a_j / T)}. $$
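A small sketch combining the causal mask with temperature-scaled softmax (sizes and the temperature value are arbitrary):

```python
import numpy as np

def softmax(A, axis=-1, T=1.0):
    """Temperature-scaled softmax: exp(a_i/T) / sum_j exp(a_j/T)."""
    A = A / T
    e = np.exp(A - A.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(Q, K, V):
    N, Dk = Q.shape[0], K.shape[-1]
    # Mask with -inf strictly above the diagonal so future tokens get zero weight.
    M = np.triu(np.full((N, N), -np.inf), k=1)
    return softmax(Q @ K.T / np.sqrt(Dk) + M) @ V

rng = np.random.default_rng(7)
Q, K, V = (rng.standard_normal((5, 4)) for _ in range(3))
print(causal_attention(Q, K, V).shape)            # (5, 4)
print(softmax(np.array([2.0, 1.0, 0.1]), T=2.0))  # flatter than with T = 1
```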