Machine Learning Cheat Sheet HCMUT K
Banh Tan Thuan ([email protected]), Le Quang Khai, Vu Chau Duy Quang, Le Viet Tung,
Vo Hoang Nhat Khang, La Cam Huy
August 1, 2024
Contents

1 Introduction
  1.1 Type of Machine Learning
  1.2 Performance Measures
  1.3 Inductive Bias
2 Decision Tree
  2.1 Theory
  2.2 Problem with Decision Tree
3 Bayesian Learning
  3.1 Theory
4 Linear Regression
  4.1 Theory
  4.2 Weakness of linear regression
5 Genetic Algorithm
  5.1 Theory
6 Graphical Models
  6.1 Theory
7 Support Vector Machine (SVM)
  7.1 Theory
8 Dimensionality Reduction
  8.1 Matrix calculus Theory
9 Discriminative Models
  9.1 Feature-based linear classifiers
  9.2 Logistic regression
  9.3 Maximum entropy model
  9.4 Conditional Random Field
10 Artificial Neural Networks (ANN)
  10.3 Neural network architectures
11 Deep Feedforward Networks
  11.2 Hidden Units
  11.4 Back-Propagation
12 Regularization
  12.1 Parameter Norm Penalties
  12.2 Constrained Optimization
  12.3 Dataset Augmentation
  12.4 Other regularization approaches
  12.5 Dropout
13 Optimization for training deep models
  13.1 Learning vs. Pure Optimization
  13.2 Challenges in Neural Network Optimization
  13.3 Algorithms
14 Convolutional Neural Network (CNN)
  14.1 Theory
  14.2 Examples
15 Recurrent Neural Network (RNN)
  15.1 RNN
  15.3 Encoder-Decoder Architectures
Abstract
Machine Learning materials.
Books:
• Bishop, Pattern Recognition and Machine Learning, 2006
• Deep Learning (Adaptive Computation and Machine Learning series)
Web: https://machinelearningcoban.com/
Bishop lectures: https://www.youtube.com/playlist?list=PL8FnQMH2k7jzhtVYbKmvrMyXDYMmgjj_n
Examples: https://www.youtube.com/@MaheshHuddar
Deep Learning from Chapter 6: https://www.youtube.com/playlist?list=PLbBjZEwyU7W1CDs3Vx_GOJ9b3EgYQB3GE
Slides: https://www.deeplearningbook.org/lecture_slides.html
1 Introduction
1.1 Type of Machine Learning
1. Supervised learning: the learner (learning algorithm) is trained on labeled examples, i.e., inputs for which the desired output is known.
• Classification: find the class of an instance given its selected features.
• Regression: find a function whose curve passes as close as possible to all of the given data points.
2. Unsupervised learning: the learner operates on unlabeled examples, i.e., inputs for which the desired output is unknown.
• Clustering: group data points into clusters based on their patterns, without the need for labels.
3. Reinforcement learning: sits between supervised and unsupervised learning. The learner is told when an answer is wrong, but not how to correct it.
4. Evolutionary learning: biological evolution can be seen as a learning process that improves survival rates and the chance of having offspring.
1.2 Performance Measures
Worked example with a confusion matrix over 91 samples:
• TP = 43
• FP = 8
• TN = 33
• FN = 7
    Precision = TP / (TP + FP) = 43 / (43 + 8) = 43/51 = 0.8431
    Recall = TP / (TP + FN) = 43 / (43 + 7) = 43/50 = 0.86
    Accuracy = (TP + TN) / (TP + TN + FP + FN) = (43 + 33) / (43 + 33 + 8 + 7) = 76/91 = 0.8352
F1 Score:
The F1 score is a measure that combines both precision and recall into a single metric, providing a balance between them. It is defined as the harmonic mean of precision and recall:
    F1 = 2 × (Precision × Recall) / (Precision + Recall)
With 100% precision but 0% recall, or 0% precision but 100% recall, the F1 score is 0. The F1 score reaches its best value at 1 (perfect precision and recall) and its worst at 0. It provides a useful way to evaluate a classifier's overall performance, considering both false positives and false negatives.
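A minimal Python sketch that reproduces the metrics above (the counts match the worked example; the variable names are our own):

# Compute precision, recall, accuracy and F1 from confusion-matrix counts.
tp, fp, tn, fn = 43, 8, 33, 7

precision = tp / (tp + fp)                   # 43/51 = 0.8431...
recall    = tp / (tp + fn)                   # 43/50 = 0.86
accuracy  = (tp + tn) / (tp + tn + fp + fn)  # 76/91 = 0.8351...
f1        = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.4f} recall={recall:.2f} "
      f"accuracy={accuracy:.4f} f1={f1:.4f}")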
Table 1: Machine Learning Algorithms
2 Decision Tree
2.1 Theory
Data: Learning set S, attribute set A, attribute values V
Result: Decision tree
Function ID3(S, A, V ):
    Load learning set S and create decision tree root node rootNode, adding learning set S into rootNode as its subset;
    Compute Entropy(rootNode.subset);
    if Entropy(rootNode.subset) == 0 then
        ▷ subset is homogeneous
        return a leaf node;
    end
    if Entropy(rootNode.subset) != 0 then
        ▷ subset is not homogeneous
        Compute Information Gain for each attribute A not yet used for splitting;
        Find the attribute A with maximum Gain(S, A);
        Create child nodes for rootNode based on attribute A and add them to the decision tree;
    end
    foreach child of rootNode do
        ID3(child.subset, remaining attributes, V );
        Continue until a node with Entropy of 0 or a leaf node is reached;
    end
Algorithm 1: ID3 Algorithm
Functions
Entropy:
    E(S) = −p+ log2 p+ − p− log2 p−
S is a set of examples. p+ is the proportion of examples in class +. p− = 1 − p+ is the proportion of examples in class −.
Entropy for n > 2 classes:
    E(S) = −p1 log p1 − p2 log p2 − ... − pn log pn = − Σ_{i=1}^{n} pi log pi
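As an illustration, a small Python sketch of the entropy and information-gain computations that ID3 relies on (the toy rows, labels, and attribute names below are invented for the example):

import math
from collections import Counter

def entropy(labels):
    """E(S) = -sum_i p_i log2 p_i over the class proportions of S."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Gain(S, A) = E(S) - sum_v |S_v|/|S| * E(S_v), splitting on attribute attr."""
    n = len(labels)
    gain = entropy(labels)
    for value in set(r[attr] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[attr] == value]
        gain -= len(subset) / n * entropy(subset)
    return gain

# Hypothetical toy data: two attributes, binary labels.
rows = [{"outlook": "sunny", "windy": True}, {"outlook": "rain", "windy": False},
        {"outlook": "sunny", "windy": False}, {"outlook": "rain", "windy": True}]
labels = ["+", "-", "+", "-"]
print(information_gain(rows, labels, "outlook"))  # 1.0: the split is perfectly pure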
2.2 Problem with Decision Tree
Problem: attributes with a large number of values.
More attribute values mean more, smaller subsets, which are more likely to be pure −→ Gain() is biased towards choosing attributes with a large number of values.
This may cause several problems:
• Overfitting: selection of an attribute that is non-optimal for prediction.
• Fragmentation: data are fragmented into (too) many small sets −→ a tree that is too deep and complex.
3 Bayesian Learning
3.1 Theory
Functions
Bayes Theorem:
    P(h|D) = P(D|h) P(h) / P(D)
where P(h): prior probability of hypothesis h, P(D): prior probability of training data D, P(h|D): probability that h holds given D, P(D|h): probability that D is observed given h.
Maximum A-posteriori hypothesis (MAP):
    h_MAP = argmax_{h∈H} P(h|D) = argmax_{h∈H} P(D|h) P(h)
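A tiny Python sketch of picking the MAP hypothesis via Bayes' theorem (the priors and likelihoods below are made up for illustration):

# Hypothetical priors P(h) and likelihoods P(D|h) for three hypotheses.
priors      = {"h1": 0.6, "h2": 0.3, "h3": 0.1}
likelihoods = {"h1": 0.2, "h2": 0.5, "h3": 0.9}

# P(h|D) is proportional to P(D|h) P(h); P(D) is the same for every h,
# so it can be ignored when we only need the argmax.
posterior_unnorm = {h: likelihoods[h] * priors[h] for h in priors}
h_map = max(posterior_unnorm, key=posterior_unnorm.get)
print(h_map)  # "h2": 0.5 * 0.3 = 0.15 beats 0.12 and 0.09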
4 Linear Regression
4.1 Theory
In general, linear regression is used to perform regression on linearly dependent variables.
The hypothesis has the form:
    h(x) = θ^T x
optimized via the cost function:
    J(θ) = (1/2) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))^2
Closed-form formula for the hyperplane (normal equation):
    θ = (X^T X)^{−1} X^T y
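A numpy sketch of the normal equation on synthetic data (the data and the true parameters are invented for the example):

import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.uniform(-1, 1, 100)])  # bias column + 1 feature
true_theta = np.array([2.0, -3.0])
y = X @ true_theta + 0.01 * rng.standard_normal(100)

theta = np.linalg.inv(X.T @ X) @ X.T @ y   # theta = (X^T X)^{-1} X^T y
print(theta)                               # close to [2, -3]
# In practice np.linalg.lstsq(X, y, rcond=None) is numerically safer.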
5 Genetic Algorithm
5.1 Theory
Functions
1. Initialize population: P = p randomly generated hypotheses.
Selection: Probabilistically select (1 − r)p hypotheses of P to add to the new generation. The selection probability of a hypothesis is:
    Pr(h_i) = Fitness(h_i) / Σ_{h∈P} Fitness(h)
Crossover:
1. Probabilistically select (r/2)·p pairs of hypotheses from P according to Pr(h).
2. For each pair (h1, h2), produce two offspring by applying a Crossover operator.
where p is the number of hypotheses and r is the portion of hypotheses that will be discarded. Example: r = 0.2 −→ discard 20% of the population.
Fitness Function Example:
    Fitness(h) = (correct(h))^2
Good individuals do not always survive: selection is probabilistic, so there is a chance a fit hypothesis dies; it merely has a better chance than a normal individual.
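A short Python sketch of fitness-proportionate (roulette-wheel) selection; the population and fitness values are invented for illustration:

import random

population = ["h1", "h2", "h3", "h4"]
fitness = {"h1": 1.0, "h2": 4.0, "h3": 9.0, "h4": 16.0}  # e.g. correct(h)^2

def select(population, fitness, k):
    """Probabilistically pick k hypotheses with Pr(h_i) = Fitness(h_i)/sum Fitness."""
    weights = [fitness[h] for h in population]
    return random.choices(population, weights=weights, k=k)

p, r = len(population), 0.5
survivors = select(population, fitness, round((1 - r) * p))
print(survivors)  # fitter hypotheses are more likely, but survival is not guaranteed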
6 Graphical Models
6.1 Theory
Functions
If we have a joint distribution P(C, a1, a2, ..., an), we can calculate its cdf. Then, drawing a uniform random number in [0, 1] for each a_i, we can generate a new sample [a'_1, a'_2, ..., a'_n] by mapping it back through the joint distribution to get P(C, a'_1, a'_2, ..., a'_n). Therefore, Naive Bayes is a generative model.
Three basic problems of HMMs. Once we have an HMM, there are three problems of interest.
1. The Evaluation Problem: Given an HMM λ and a sequence of observations O = o1, o2, ..., oT, what is the probability that the observations are generated by the model, p(O|λ)?
2. The Decoding Problem: Given an HMM λ and a sequence of observations O = o1, o2, ..., oT, what is the most likely state sequence in the model that produced the observations?
3. The Learning Problem: Given an HMM λ and a sequence of observations O = o1, o2, ..., oT, how should we adjust the model parameters A, B, π in order to maximize p(O|λ)?
Functions
Markov assumptions:
1. The state at time t depends only on the state at time t − 1: p(y_t | y_{t−1}, Z) = p(y_t | y_{t−1})
2. The observation at time t depends only on the state at time t: p(x_t | y_t, Z) = p(x_t | y_t)
Joint distribution:
    p(Y, X) = p(y1, ..., yT, x1, ..., xT) = Π_{t=1..T} p(y_t | y_{t−1}) · p(x_t | y_t)
Forward algorithm:
    α_t(y_t) = p(y_t, x1, x2, ..., x_t) = Σ_{y_{t−1}} p(y_t, y_{t−1}, x1, x2, ..., x_t)
    = Σ_{y_{t−1}} p(x_t | y_t, y_{t−1}, x1, ..., x_{t−1}) · p(y_t | y_{t−1}, x1, ..., x_{t−1}) · p(y_{t−1}, x1, ..., x_{t−1})
    = Σ_{y_{t−1}} p(x_t | y_t) · p(y_t | y_{t−1}, x1, ..., x_{t−1}) · p(y_{t−1}, x1, ..., x_{t−1})   (2nd Markov assumption)
    = Σ_{y_{t−1}} p(x_t | y_t) · p(y_t | y_{t−1}) · p(y_{t−1}, x1, ..., x_{t−1})   (1st Markov assumption)
    = p(x_t | y_t) Σ_{y_{t−1}} p(y_t | y_{t−1}) · α_{t−1}(y_{t−1})
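A small numpy sketch of this forward recursion (the HMM parameters and observations below are invented for illustration):

import numpy as np

A  = np.array([[0.7, 0.3], [0.4, 0.6]])   # A[i, j] = p(y_t = j | y_{t-1} = i)
B  = np.array([[0.9, 0.1], [0.2, 0.8]])   # B[i, k] = p(x_t = k | y_t = i)
pi = np.array([0.5, 0.5])                 # p(y_1)
obs = [0, 1, 0]                           # observed symbol indices x_1..x_T

alpha = pi * B[:, obs[0]]                 # alpha_1(y) = p(y_1) p(x_1 | y_1)
for x in obs[1:]:
    alpha = B[:, x] * (A.T @ alpha)       # the recursion derived above
print(alpha.sum())                        # p(O | lambda): the evaluation problem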
Viterbi algorithm:
Dynamic programming, where V_t(y_t) denotes the probability of the best state sequence ending in y_t:
1. Compute:
    argmax_{y1} p(y1, x1) = argmax_{y1} p(x1 | y1) · p(y1)
2. Recurse: for each t > 1 and each state y_t, keep the best previous state:
    V_t(y_t) = p(x_t | y_t) · max_{y_{t−1}} p(y_t | y_{t−1}) · V_{t−1}(y_{t−1})
3. Select:
    argmax_{y_{1:T}} p(y1, y2, ..., yT, x1, x2, ..., xT)
As t advances, the values in α_t (and V_t) become vanishingly small. In practice the Viterbi algorithm is therefore run in log space, as in the trick below.
Fast Viterbi using Casio
Working in log space, one update of the 2-state Viterbi scores can be written as a matrix expression:
    [ α1 + log(t11) + log(e1(x))   α2 + log(t21) + log(e1(x)) ]
    [ α1 + log(t12) + log(e2(x))   α2 + log(t22) + log(e2(x)) ]
    = [ α1  α2 ]   [ log(t11)  log(t21) ]   [ log(e1(x))  log(e1(x)) ]
      [ α1  α2 ] + [ log(t12)  log(t22) ] + [ log(e2(x))  log(e2(x)) ]
    = o a^T + T + e o^T
where
    a = [α1; α2]   (previous timestep scores)
    o = [1; 1]     (vector of 1s)
    T = [log(t11) log(t21); log(t12) log(t22)]   (log transition matrix)
    e = [log(e1(x)); log(e2(x))]                 (log emission vector)
In the calculator: let MatA(4×1) = a, MatB(4×4) = T, MatC(4×1) = e, MatD(4×1) = o.
Then compute: MatD × Trn(MatA) + Trn(MatB) + MatC × Trn(MatD).
Then take the max of each row to get the new MatA, recalculate MatC for the new emission, and repeat.
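For checking by computer rather than calculator, here is a numpy sketch of the same log-space Viterbi update (a hypothetical 2-state HMM; all probabilities are invented):

import numpy as np

A = np.log(np.array([[0.7, 0.3], [0.4, 0.6]]))   # A[i, j] = log t_ij (state i to j)
E = np.log(np.array([[0.9, 0.1], [0.2, 0.8]]))   # E[j, k] = log e_j(x = k)
alpha = np.log(np.array([0.5, 0.5])) + E[:, 0]   # log-scores after the first symbol

for x in [1, 0]:                                  # remaining observed symbols
    scores = alpha[:, None] + A + E[:, x][None, :]  # alpha_i + log t_ij + log e_j(x)
    alpha = scores.max(axis=0)                      # keep the best previous state
print(alpha)   # log-probability of the best path ending in each final state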
7 Support Vector Machine (SVM)
7.1 Theory
Functions
Signed distance between the decision boundary and a sample x_n:
    y(x_n) / ||w||
and, using the target t_n:
    t_n · y(x_n) / ||w||
where t_n = +1 iff y(x_n) > 0 and t_n = −1 iff y(x_n) < 0.
Maximum margin:
    argmax_{w,b} { (1/||w||) min_n [t_n · (w · x_n + b)] }
with the constraint:
    t_n · (w · x_n + b) ≥ 1
Lagrange function for the maximum margin classifier:
    L(w, b, a) = (1/2)||w||^2 − Σ_{n=1}^{N} a_n · (t_n · (w · x_n + b) − 1)
Solution for w:
    ∂L(w, b, a)/∂w = 0  ⟹  w = Σ_{n=1}^{N} a_n · t_n · x_n
    ∂L(w, b, a)/∂b = 0  ⟹  Σ_{n=1}^{N} a_n · t_n = 0
Dual representation:
    L*(a) = Σ_{n=1}^{N} a_n − (1/2) Σ_{n=1}^{N} Σ_{m=1}^{N} a_n · a_m · t_n · t_m · (x_n · x_m)
with the constraints:
    a_n ≥ 0
    Σ_{n=1}^{N} a_n · t_n = 0
Solution for b (NOTE: the version on Dr. Dung's slide is wrong; Dr. Sach's slide is correct):
    b = (1/|S|) Σ_{n∈S} (t_n − Σ_{m∈S} a_m · t_m · (x_m · x_n)) = (1/|S|) Σ_{n∈S} (t_n − w · x_n)
where S is the set of support vectors.
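A numpy sketch of recovering w and b from the dual coefficients (the data and the multipliers a_n below are invented; in practice the a_n come from solving the dual QP with a solver):

import numpy as np

X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -2.0]])
t = np.array([1.0, 1.0, -1.0, -1.0])
a = np.array([0.25, 0.0, 0.25, 0.0])      # hypothetical multipliers, sum a_n t_n = 0

w = (a * t) @ X                           # w = sum_n a_n t_n x_n
S = a > 1e-9                              # support vectors have a_n > 0
b = np.mean(t[S] - X[S] @ w)              # b = (1/|S|) sum_{n in S} (t_n - w.x_n)
print(w, b)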
8 Dimensionality Reduction
8.1 Matrix calculus Theory
Functions
A vector is treated as a column matrix.
Dot product in matrix notation:
    a^T b
Vector projection of a on b:
    r = (a · b) / ||b||
a · b is a linear combination of a's dimensions.
Matrix differentiation:
    y = Ψ(x)
where y is an m × 1 matrix and x is a 1 × n matrix.
Proposition 1:
    a = x^T A x  ⟹  ∂a/∂x = (A + A^T) x
where x is n × 1, A is n × n, A does not depend on x, and A is the matrix of the quadratic form.
Proposition 2: A is symmetric
    a = x^T A x  ⟹  ∂a/∂x = 2 x^T A
where x is n × 1, A is n × n, A does not depend on x. (Convert to a symmetric matrix first.)
Proposition 3: A is symmetric
    a = x^T A x  ⟹  (∂a/∂x)^T = 2 A x
where x is n × 1, A is n × n, A does not depend on x.
Eigenvalues and eigenvectors:
    A v = λ v
where A is n × n (a linear transformation), v is n × 1, λ is an eigenvalue of A, and v is an eigenvector of A.
Proposition 4: If A is an n × n symmetric matrix, all of its eigenvalues are real, and there are n linearly independent eigenvectors of A.
Proposition 5: If v1, v2, ..., vn are linearly independent eigenvectors of A, and λ1, λ2, ..., λn are their corresponding eigenvalues, then
    A = P D P^{−1}
where D = diag(λ1, λ2, ..., λn) and P = [v1 v2 ... vn].
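A numpy sketch verifying the eigendecomposition A = P D P^{−1} for a symmetric matrix (the matrix is chosen arbitrarily for the example):

import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])
eigvals, P = np.linalg.eigh(A)            # eigh: symmetric case, real eigenvalues
D = np.diag(eigvals)

print(np.allclose(A, P @ D @ np.linalg.inv(P)))  # True
# For symmetric A the eigenvectors are orthonormal, so P^{-1} = P^T here.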
9 Discriminative Models
A discriminative model discriminates y given x, calculating the conditional distribution p(y|x).
9.1 Feature-based linear classifiers
Linear classifier:
    argmax_{c∈C} Σ_{m=1}^{M} λ_m · f_m(c, x)
SVM:
    f_m(c, x) = x_m
In SVM terms, scaling λ down moves the middle line (the decision boundary) closer to the margin; the margin itself stays the same.
Naive Bayes Classifier:
    f_m(c, x) = log p(x_m | c)
NLP (feature expectations):
Equations
Empirical expectation of a feature:
    Ẽ(f_m) ≈ (1/D) Σ_{(c,x)∈observed(c,x)} f_m(c, x)
Model expectation of a feature:
    E(f_m) ≈ (1/D) Σ_{x∈observed(x)} Σ_{c∈C} p(c|x) · f_m(c, x)
Consistency constraint:
    E(f_m) = Ẽ(f_m)
9.2 Logistic regression
    p(⊖|x) = 1 / (1 + exp Σ_{m=1}^{M} λ_m · f_m(y, x))
Two candidate normalizations:
    p(c_i|x) = z_i / Σ_j z_j   vs   p(c_i|x) = e^{z_i} / Σ_j e^{z_j}   (Softmax)
The maximum stays the maximum under either. Why prefer softmax? Because of the derivative: it is more convenient than the derivative of z_i / Σ_j z_j:
    σ_i = e^{z_i} / Σ_j e^{z_j}  ⟹  ∂σ_i/∂z_i = ... = (e^{z_i} / Σ_j e^{z_j}) · (1 − e^{z_i} / Σ_j e^{z_j}) = σ_i (1 − σ_i)
And the partial derivative with respect to z_y (y ≠ i):
    ∂σ_i/∂z_y = ... = −(e^{z_i} / Σ_j e^{z_j}) · (e^{z_y} / Σ_j e^{z_j}) = −σ_i σ_y
These outputs look like probabilities, but they are not "the" probability: there is no proof that Σ_{c_i} p(c_i|x) = 1 reflects the data in the real world; it is just what we think it should be. Both p(c_i|x) = z_i / Σ_j z_j and e^{z_i} / Σ_j e^{z_j} sum to 1 over the classes by construction.
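A numpy sketch of softmax and a finite-difference check of the cross-derivative ∂σ_i/∂z_y = −σ_i σ_y (the input values are invented):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())               # shift for numerical stability
    return e / e.sum()

z = np.array([1.0, 2.0, 0.5])
s = softmax(z)
eps = 1e-6
i, y = 0, 1
num = (softmax(z + eps * np.eye(3)[y])[i] - s[i]) / eps   # numeric dσ_i/dz_y
print(num, -s[i] * s[y])                  # both ≈ -0.145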
9.3 Maximum entropy model
Formulas
Conditional entropy:
    H(y|x) = − Σ_{(x,y)∈(X,Y)} p(y, x) · log p(y|x)
    H(y|x) ≈ − Σ_{(x,y)∈(X,Y)} p̃(x) · p(y|x) · log p(y|x)
Constraints:
    E(f_m) = Ẽ(f_m)
    Σ_{y∈Y} p(y|x) = 1
Optimization (of the Lagrangian):
    ∂L(p(y|x)) / ∂p(y|x) = 0
Solution:
    p(y|x) = exp(Σ_{m=1}^{M} λ_m · f_m(y, x)) / Σ_{y*∈Y} exp(Σ_{m=1}^{M} λ_m · f_m(y*, x))
10 Artificial Neural Networks (ANN)
This part is rewritten based on lecture 4 of Stanford University (Fei-Fei Li, Ranjay Krishna, Danfei Xu - April 16, 2020), since Mr. Sach's slide was missing.
The function max(0, z) is called the activation function. If we remove the activation functions, the network collapses into a linear classifier (W2 × W1 × x).
NOTE: What is the purpose of activation functions? To introduce non-linearity to the classifier.
ReLU is a good default choice for most problems. (Figure: derivatives of some activation functions.)
10.3 Neural network architectures
• If the data is linearly separable, then you don't need any hidden layers at all.
• If the data is less complex and has fewer dimensions or features, then a neural network with 1 to 2 hidden layers will work.
• If the data has many dimensions or features, then 3 to 5 hidden layers can be used to get an optimum solution.
Keep in mind that increasing the number of hidden layers also increases the complexity of the model, and choosing hidden layers such as 8, 9, or in the two digits may sometimes lead to overfitting.
Once the hidden layers have been decided, the next task is to choose the number of nodes in each hidden layer. The number of hidden neurons should be between the size of the input layer and the size of the output layer.
The most appropriate number of hidden neurons is:
    √(input layer nodes × output layer nodes)
The number of hidden neurons should keep decreasing in subsequent layers, to get closer and closer to pattern and feature extraction and to identify the target class. A quick numeric check of this heuristic follows.
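A one-line Python example of the heuristic (the node counts are invented; 3072 would be, e.g., a flattened 32 × 32 × 3 input):

import math
print(math.isqrt(3072 * 27))  # 288 hidden neurons as a starting point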
11 Deep Feedforward Networks
Watch this video: https://www.youtube.com/watch?v=kWOPkec1RSQ
Deep feedforward networks (multilayer perceptrons (MLPs))
• Goal: approximate some function f ∗
• Information flow through the function being evaluated
• No feedback connection
Linear models
• Logistic regression, linear regression
• Can be fit efficiently and reliably
• Can obtain closed form solution or with convex optimization
• Limitation: capacity is limited to linear functions
A sufficiently powerful neural network can represent any function from a wide class of functions.
11.2 Hidden Units
Rectified linear units (ReLU)
• Can't learn via gradient-based methods on examples for which their activation is 0
• Generalization (Leaky ReLU):
    h_i = g(z, α)_i = max(0, z_i) + α_i min(0, z_i)
Maxout units
• Divide z into groups of k values
• Each maxout unit outputs the maximum element of one of these groups:
    g(z)_i = max_{j∈G^(i)} z_j
11.4 Back-Propagation
Formulas
Chain Rule of Calculus: Let x be a real number, and let f and g both be functions mapping from a real number to a real number. Suppose that y = g(x) and z = f(g(x)) = f(y).
The chain rule:
    dz/dx = (dz/dy)(dy/dx)
Generalization: x ∈ R^m, y ∈ R^n, g : R^m → R^n, f : R^n → R:
    ∂z/∂x_i = Σ_j (∂z/∂y_j)(∂y_j/∂x_i)
Vector notation:
    ∇_x z = (∂y/∂x)^T ∇_y z
where ∂y/∂x is the n × m Jacobian matrix of g.
The gradient of a variable x can be obtained by multiplying the Jacobian matrix ∂y/∂x by the gradient ∇_y z.
How much memory does back-propagation need? The forward pass has to store the intermediate values, because they appear in the derivatives (e.g., ∂z_1/∂w_1 = x_1); at least every value of some importance that is not 1.
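A numpy sketch of the vector chain rule ∇_x z = (∂y/∂x)^T ∇_y z, checked against finite differences (the functions g(x) = Wx and z = Σ y_i^2 and all values are invented):

import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((3, 2))           # g: R^2 -> R^3, Jacobian dy/dx = W
x = rng.standard_normal(2)

y = W @ x
grad_y = 2 * y                            # dz/dy for z = sum(y_i^2)
grad_x = W.T @ grad_y                     # Jacobian-transpose times upstream gradient

eps = 1e-6                                # finite-difference check of grad_x
num = np.array([(np.sum((W @ (x + eps * e))**2) - np.sum(y**2)) / eps
                for e in np.eye(2)])
print(np.allclose(grad_x, num, atol=1e-4))  # True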
12 Regularization
Regularization strategies are explicitly designed to reduce the test error, possibly at the expense of increased training error; they are based on a trade-off between increased bias and reduced variance.
12.1 Parameter Norm Penalties
L2 regularization (weight decay):
    Ω(θ) = (1/2) ||w||_2^2
L1 regularization:
    Ω(θ) = ||w||_1 = Σ_i |w_i|
Parameter gradient (L1):
    ∇_w J̃(w; X, y) = α sign(w) + ∇_w J(w; X, y)
L2 is smooth and dense: parameters vary over every possible range of values. L1 is not dense (sparse): a weight is driven exactly to 0 if it is too small, and stays non-zero otherwise, which selects a few significant features but is harder to optimize.
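A small numpy sketch contrasting one gradient step under the two penalties (the weights, loss gradient, and hyperparameters are invented for the example):

import numpy as np

w = np.array([0.5, -0.01, 2.0])
grad_J = np.array([0.1, 0.0, -0.2])            # gradient of the unregularized loss
alpha, lr = 0.1, 0.5

w_l2 = w - lr * (alpha * w + grad_J)           # L2: shrinks every weight smoothly
w_l1 = w - lr * (alpha * np.sign(w) + grad_J)  # L1: constant pull toward exactly 0
print(w_l2, w_l1)  # small weights like -0.01 get pushed to/through 0 under L1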
12.2 Constrained Optimization
• If we use a line search, we can search only over step sizes ϵ that yield new x points that are feasible, or we can project each point on the line back into the constraint region.
A general solution to the constrained optimization problem is Karush-Kuhn-Tucker (KKT). This solution introduces KKT multipliers λ_i and α_j for each constraint. The generalized Lagrangian is then defined as:
    L(x, λ, α) = f(x) + Σ_i λ_i g^(i)(x) + Σ_j α_j h^(j)(x)
12.4 Other regularization approaches
Semi-supervised learning: learn a representation so that examples from the same class have similar representations.
• Construct models in which a generative model of either P(x) or P(x, y) shares parameters with a discriminative model of P(y|x)
• Trade off the supervised criterion log P(y|x) against the unsupervised criterion (−log P(x) or −log P(x, y))
• Principal components analysis (PCA): a pre-processing step before applying a classifier (on the projected data). Projecting the data removes some of the noise: PCA extracts the important structure and discards the less important parts.
Multitask Learning
12.5 Dropout
Bagging (bootstrap aggregating):
Given a standard training set D of size n, bagging generates m new training sets D_i, each of size n', by sampling from D uniformly and with replacement. Then, m models are fitted using the above m bootstrap samples and combined by averaging the output (for regression) or voting (for classification).
• The prediction of the ensemble is given by the arithmetic mean of all of these distributions:
    (1/k) Σ_{i=1}^{k} p^(i)(y|x)
Dropout:
Dropout trains the ensemble consisting of all sub-networks that can be formed by removing non-output units from an underlying base network.
• Models share parameters and are not explicitly trained
• It is infeasible to sample all possible sub-networks within the lifetime of the universe
• Each sub-model defined by a mask vector µ defines a probability distribution p(y|x, µ)
• The arithmetic mean over all masks is:
    Σ_µ p(µ) p(y|x, µ)
For very large datasets, the computational cost may outweigh the benefit of regularization. Conversely, for datasets with very few samples, dropout is less effective.
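A numpy sketch of (inverted) dropout on a layer's activations; the keep probability and the activations are invented for the example:

import numpy as np

rng = np.random.default_rng(0)
h = rng.standard_normal(8)                # activations of some hidden layer
keep = 0.8

mu = rng.random(8) < keep                 # mask vector µ: one sampled sub-network
h_train = h * mu / keep                   # scale so E[h_train] = h; test time uses h as-is
print(mu.astype(int), h_train)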
13 Optimization for training deep models
13.1 Learning vs. Pure Optimization
The cost function can be written as an average over the training set, such as
    J(θ) = E_{(x,y)∼p̂_data} L(f(x; θ), y)
where L is the per-example loss function, f(x; θ) is the predicted output when the input is x, and p̂_data is the empirical distribution. In the supervised learning case, y is the target output.
The empirical risk to be minimized:
    E_{(x,y)∼p̂_data}[L(f(x; θ), y)] = (1/m) Σ_{i=1}^{m} L(f(x^(i); θ), y^(i))
A surrogate loss function can be used as a proxy, and is minimized until a convergence criterion based on early stopping is satisfied.
Batch and minibatch algorithms
Optimization updates the parameters based on an expected value of the cost function estimated using only a subset of the terms of the full cost function.
• Deterministic gradient methods: process all training examples simultaneously in a large batch.
• Stochastic gradient methods: use only a single example at a time.
• Minibatch: larger batches give a more accurate estimate of the gradient, but with less than linear returns.
13.3 Algorithms
Stochastic Gradient Descent
Require: Learning rate schedule ϵ1, ϵ2, ...
Require: Initial parameter θ
k ← 1
while stopping criterion not met do
    Sample a minibatch of m examples from the training set {x^(1), ..., x^(m)} with corresponding targets y^(i)
    Compute gradient estimate: ĝ ← (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i))
    Apply update: θ ← θ − ϵ_k ĝ
    k ← k + 1
end while
Momentum
Require: Learning rate ϵ, momentum parameter α
Require: Initial parameter θ, initial velocity v
while stopping criterion not met do
    Sample a minibatch of m examples from the training set {x^(1), ..., x^(m)} with corresponding targets y^(i)
    Compute gradient estimate: g ← (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i))
    Compute velocity update: v ← αv − ϵg
    Apply update: θ ← θ + v
end while
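A numpy sketch of SGD with momentum on a toy quadratic loss J(θ) = ||θ||^2 / 2 (so the gradient is just θ); the hyperparameters are chosen arbitrarily:

import numpy as np

eps, alpha = 0.1, 0.9
theta = np.array([1.0, -2.0])
v = np.zeros_like(theta)

for _ in range(100):
    g = theta                     # gradient estimate of the toy loss
    v = alpha * v - eps * g       # velocity update
    theta = theta + v             # parameter update
print(theta)                      # approaches the minimum at [0, 0]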
Nesterov Momentum
Require: Learning rate ϵ, momentum parameter α
Require: Initial parameter θ, initial velocity v
while stopping criterion not met do
    Sample a minibatch of m examples from the training set {x^(1), ..., x^(m)} with corresponding targets y^(i)
    Apply interim update: θ̃ ← θ + αv
    Compute gradient (at interim point): g ← (1/m) ∇_θ̃ Σ_i L(f(x^(i); θ̃), y^(i))
    Compute velocity update: v ← αv − ϵg
    Apply update: θ ← θ + v
end while
AdaGrad
Require: Global learning rate ϵ
Require: Initial parameter θ
Require: Small constant δ, perhaps 10^{−7}, for numerical stability
Initialize gradient accumulation variable r = 0
while stopping criterion not met do
    Sample a minibatch of m examples from the training set {x^(1), ..., x^(m)} with corresponding targets y^(i)
    Compute gradient estimate: g ← (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i))
    Accumulate squared gradient: r ← r + g ⊙ g
    Compute update: ∆θ ← −(ϵ / (δ + √r)) ⊙ g
    Apply update: θ ← θ + ∆θ
end while
RMSProp
Require: Global learning rate ϵ, decay rate ρ
Require: Initial parameter θ
Require: Small constant δ, usually 10^{−6}, for numerical stability
Initialize accumulation variable r = 0
while stopping criterion not met do
    Sample a minibatch of m examples from the training set {x^(1), ..., x^(m)} with corresponding targets y^(i)
    Compute gradient estimate: g ← (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i))
    Accumulate squared gradient: r ← ρr + (1 − ρ) g ⊙ g
    Compute update: ∆θ ← −(ϵ / √(δ + r)) ⊙ g
    Apply update: θ ← θ + ∆θ
end while
Adam
Require: Step size ϵ
Require: Exponential decay rates for moment estimates, ρ1 and ρ2 in [0, 1)
Require: Small constant δ
Initialize 1st and 2nd moment variables s = 0, r = 0
Initialize time step t = 0
while stopping criterion not met do
    Sample a minibatch of m examples from the training set {x^(1), ..., x^(m)} with corresponding targets y^(i)
    Compute gradient estimate: g ← (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i))
    t ← t + 1
    Update biased first moment estimate: s ← ρ1 s + (1 − ρ1) g
    Update biased second moment estimate: r ← ρ2 r + (1 − ρ2) g ⊙ g
    Correct bias in first moment: ŝ ← s / (1 − ρ1^t)
    Correct bias in second moment: r̂ ← r / (1 − ρ2^t)
    Compute update: ∆θ ← −ϵ ŝ / (√r̂ + δ)
    Apply update: θ ← θ + ∆θ
end while
Which algorithm should one choose? (SGD, SGD with momentum, RMSProp, RMSProp with momentum, AdaDelta, Adam)
In practice Adam is a robust default; the choice among the others matters much less.
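A numpy sketch of the Adam update on the same toy quadratic loss J(θ) = ||θ||^2 / 2 (gradient = θ); the hyperparameters are the commonly used defaults:

import numpy as np

eps, rho1, rho2, delta = 0.1, 0.9, 0.999, 1e-8
theta = np.array([1.0, -2.0])
s = np.zeros_like(theta); r = np.zeros_like(theta)

for t in range(1, 201):
    g = theta                              # gradient of the toy loss
    s = rho1 * s + (1 - rho1) * g
    r = rho2 * r + (1 - rho2) * g * g
    s_hat = s / (1 - rho1 ** t)            # bias-corrected first moment
    r_hat = r / (1 - rho2 ** t)            # bias-corrected second moment
    theta = theta - eps * s_hat / (np.sqrt(r_hat) + delta)
print(theta)   # near the minimum at [0, 0], up to oscillation on the order of eps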
14 Convolutional Neural Network (CNN)
14.1 Theory
Convolution operation
Signal processing:
    s(t) = (x ∗ w)(t) = ∫ x(τ) w(t − τ) dτ = ∫ x(t − τ) w(τ) dτ
Discrete convolution:
    s(t) = (x ∗ w)(t) = Σ_{τ=−∞}^{∞} x(τ) w(t − τ)
2D image I, 2D kernel K:
    S(i, j) = (I ∗ K)(i, j) = Σ_m Σ_n I(m, n) K(i − m, j − n)
Convolution is commutative:
    S(i, j) = (K ∗ I)(i, j) = Σ_m Σ_n I(i − m, j − n) K(m, n)
An algorithm based on convolution will learn a kernel that is flipped relative to the kernel learned by an algorithm without flipping (cross-correlation).
Motivation:
• Sparse interactions: an output unit only interacts with a small number of input units, through the convolution kernel.
• Parameter sharing: the same kernel weights are reused at every position.
• Equivariance: convolution is not naturally equivariant to transformations, except for translation. (f is equivariant to g if f(g(x)) = g(f(x)).)
Convolution layer
Assume an input of size W1 × H1 × D1 and K filters with spatial extent F (kernel size F × F), stride S, and amount of zero padding P.
Then we produce an output of size W2 × H2 × D2:
    W2 = (W1 − F + 2P)/S + 1
    H2 = (H1 − F + 2P)/S + 1
    D2 = K
Number of parameters: (F · F · D1 + 1) · K, i.e., F · F · D1 weights plus one bias per filter.
In the output volume, the d-th depth slice (of size W2 × H2) is the result of performing a valid convolution of the d-th filter over the input volume with a stride of S, and then offsetting by the d-th bias.
Memory of 1 convolution layer with output of size W2 × H2 × D2:
Source: https://stackoverflow.com/questions/59282135/how-do-we-approximately-calculate-how-much-memory-is
    Bytes of 1 instance = K × W2 × H2 × b
where b is usually 4 (if the feature map uses 32-bit floats; otherwise it must be stated in the question).
Pooling replaces the output of the net at a certain location with a summary statistic of the nearby outputs. This makes the representation approximately invariant to small translations of the input.
Pooling layer
Source: https://cs231n.github.io/convolutional-networks/#case
Assume an input of size W1 × H1 × D1 and a pooling kernel of spatial extent F with stride S (padding is not commonly used for pooling).
Then we produce an output of size W2 × H2 × D2:
    W2 = (W1 − F)/S + 1
    H2 = (H1 − F)/S + 1
    D2 = D1
Number of parameters: zero, since the layer computes a fixed function of the input.
Memory of 1 pooling layer with output of size W2 × H2 × D2:
    Bytes of 1 instance = D2 × W2 × H2 × b
where b is usually 4 (if the feature map uses 32-bit floats; otherwise it must be stated in the question).
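A small Python sketch of the output-size and parameter-count formulas above (the layer hyperparameters in the calls are invented for illustration):

def conv_out(w1, h1, d1, f, k, s, p):
    """Conv layer: output volume and parameter count."""
    w2 = (w1 - f + 2 * p) // s + 1
    h2 = (h1 - f + 2 * p) // s + 1
    params = (f * f * d1 + 1) * k          # weights + one bias per filter
    return (w2, h2, k), params

def pool_out(w1, h1, d1, f, s):
    """Pooling layer: output volume; no parameters."""
    return ((w1 - f) // s + 1, (h1 - f) // s + 1, d1), 0

print(conv_out(32, 32, 3, f=3, k=64, s=1, p=1))   # ((32, 32, 64), 1792)
print(pool_out(32, 32, 64, f=2, s=2))             # ((16, 16, 64), 0)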
14.2 Examples
Question 1: To identify letters in the alphabet (26 letters), people preprocess them into 32 x 32 color images.
Let input x be a tensor with size 32 x 32 x 3. This input is passed through layers in a neural network, creating
the following outputs:
[32 × 32 × 3] → [32 × 32 × 64] → [16 × 16 × 64] → [8 × 8 × 128] → [8 × 8 × 32] → [2048 × 1] → [256 × 1] → [27 × 1]
1. Identify the layers in the network, given that they can be designed using convolution, pooling, or fully
connected. Identify the structure of each layer (size, stride, etc)
2. Calculate the size (number of parameters, including bias) for each layer
1. From the output shapes: Layer 1 is a convolution layer (64 filters of size 3, padding 1, stride 1); Layer 2 is a pooling layer (e.g., 2 × 2 with stride 2); Layer 3 is a convolution layer (128 filters of size 3, padding 1, stride 2); Layer 4 is a convolution layer (32 filters of size 3, padding 1, stride 1); Layer 5 is a flatten layer; Layers 6 and 7 are fully connected layers, with input and output channels being 2048 and 256, and 256 and 27, respectively.
2. Layer 1 is a convolution layer with 64 filters of size 3, padding 1 and stride 1, with input of size 32 × 32 × 3, so the number of parameters is:
    (3 × 3 × 3 + 1) × 64 = 1792
Layer 2 is a pooling layer, so the number of parameters is 0.
Layer 3 is a convolution layer with 128 filters of size 3, padding 1, stride 2, with input of size 16 × 16 × 64, so the number of parameters is:
    (3 × 3 × 64 + 1) × 128 = 73856
Layer 4 is a convolution layer with 32 filters of size 3, padding 1, stride 1, with input of size 8 × 8 × 128, so the number of parameters is:
    (3 × 3 × 128 + 1) × 32 = 36896
Layer 5 is a flatten layer, so the number of parameters is 0.
Layer 6 is a fully connected layer with 2048 input and 256 output channels, so the number of parameters is:
    2048 × 256 + 256 = 524544
Layer 7 is a fully connected layer with 256 input and 27 output channels, so the number of parameters is:
    256 × 27 + 27 = 6939
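A Python sketch that verifies the counts above (the layer list encodes the hyperparameters from part 1; the tuple layout is our own):

layers = [
    ("conv", 3, 64, 3),       # (kind, in_depth, filters, kernel): (3*3*3+1)*64
    ("pool", 64, 64, 2),
    ("conv", 64, 128, 3),     # (3*3*64+1)*128
    ("conv", 128, 32, 3),     # (3*3*128+1)*32
    ("flatten", 2048, 2048, 0),
    ("fc", 2048, 256, 0),     # 2048*256 + 256
    ("fc", 256, 27, 0),       # 256*27 + 27
]
total = 0
for kind, n_in, n_out, f in layers:
    if kind == "conv":
        p = (f * f * n_in + 1) * n_out
    elif kind == "fc":
        p = n_in * n_out + n_out
    else:
        p = 0                 # pooling and flatten have no parameters
    total += p
    print(kind, p)
print("total:", total)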
15 Recurrent Neural Network (RNN)
15.1 RNN
Types of RNN
• RNNs that produce an output at each time step and have recurrent connections only from the output at one time
step to the hidden units at the next time step
• RNNs with recurrent connections between hidden units that read an entire sequence and then produce a single
output
Teacher forcing
• A procedure that emerges from the maximum likelihood criterion: during training, the model receives the ground truth output y^(t) as input at time t + 1
• The conditional maximum likelihood criterion (for two steps) is:
    log p(y^(1), y^(2) | x^(1), x^(2)) = log p(y^(2) | y^(1), x^(1), x^(2)) + log p(y^(1) | x^(1), x^(2))
Disadvantages:
• Problems arise when the network is going to be used in an open-loop mode, where the model feeds its own outputs back into itself instead of receiving the ground truth
Back-propagation through time
Parameters: U, V, W, b, c
Sequence of nodes at time t: x^(t), h^(t), o^(t), L^(t)
Start the recursion with the nodes immediately preceding the final loss:
    ∂L/∂L^(t) = 1
Assume that the outputs o^(t) are used as the argument to the softmax function to obtain the vector ŷ of probabilities over the outputs, and that the loss is the negative log-likelihood of the true target y^(t).
The gradient ∇_{o^(t)} L on the outputs at time step t, for all i, t:
    (∇_{o^(t)} L)_i = ∂L/∂o_i^(t) = (∂L/∂L^(t)) (∂L^(t)/∂o_i^(t)) = ŷ_i^(t) − 1_{i,y^(t)}
Back-propagation through the hidden states (at t = T only the second term is present):
    ∇_{h^(t)} L = (∂h^(t+1)/∂h^(t))^T (∇_{h^(t+1)} L) + (∂o^(t)/∂h^(t))^T (∇_{o^(t)} L)
                = W^T diag(1 − (h^(t+1))^2) (∇_{h^(t+1)} L) + V^T (∇_{o^(t)} L)
Parameter gradients:
    ∇_c L = Σ_t (∂o^(t)/∂c)^T ∇_{o^(t)} L = Σ_t ∇_{o^(t)} L
    ∇_b L = Σ_t (∂h^(t)/∂b^(t))^T ∇_{h^(t)} L = Σ_t diag(1 − (h^(t))^2) ∇_{h^(t)} L
    ∇_V L = Σ_t Σ_i (∂L/∂o_i^(t)) ∇_V o_i^(t) = Σ_t (∇_{o^(t)} L) (h^(t))^T
    ∇_W L = Σ_t Σ_i (∂L/∂h_i^(t)) ∇_{W^(t)} h_i^(t) = Σ_t diag(1 − (h^(t))^2) (∇_{h^(t)} L) (h^(t−1))^T
    ∇_U L = Σ_t Σ_i (∂L/∂h_i^(t)) ∇_{U^(t)} h_i^(t) = Σ_t diag(1 − (h^(t))^2) (∇_{h^(t)} L) (x^(t))^T
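A numpy sketch checking the first step of this recursion, ∇_{o} L = ŷ − 1_{y}, for softmax outputs with negative log-likelihood loss (the output values are invented):

import numpy as np

def loss(o, y):
    e = np.exp(o - o.max())
    yhat = e / e.sum()
    return -np.log(yhat[y])

o, y = np.array([0.2, -1.0, 0.7]), 2
e = np.exp(o - o.max()); yhat = e / e.sum()
grad = yhat.copy(); grad[y] -= 1.0        # yhat_i - 1_{i,y}

eps = 1e-6                                # finite-difference check
num = np.array([(loss(o + eps * d, y) - loss(o, y)) / eps for d in np.eye(3)])
print(np.allclose(grad, num, atol=1e-4))  # True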
15.3 Encoder-Decoder Architectures
• Input to RNN: context
• Limitation: the output of encoder has a dimension that is too small to properly summarize a long sequence
Recursive Neural Networks
• Advantage: for a sequence of the same length τ , the depth can be drastically reduced from τ to O(log τ )
• Might help deal with long-term dependencies
• Time scale of integration can be changed dynamically based on the input sequence
• Time constants are output by the model itself
LSTM
Forget gate:
    f_i^(t) = σ(b_i^f + Σ_j U_{i,j}^f x_j^(t) + Σ_j W_{i,j}^f h_j^(t−1))
External input gate (analogous form):
    g_i^(t) = σ(b_i^g + Σ_j U_{i,j}^g x_j^(t) + Σ_j W_{i,j}^g h_j^(t−1))
Internal state:
    s_i^(t) = f_i^(t) s_i^(t−1) + g_i^(t) σ(b_i + Σ_j U_{i,j} x_j^(t) + Σ_j W_{i,j} h_j^(t−1))
Output h_i^(t) and output gate q_i^(t):
    h_i^(t) = tanh(s_i^(t)) q_i^(t)
    q_i^(t) = σ(b_i^o + Σ_j U_{i,j}^o x_j^(t) + Σ_j W_{i,j}^o h_j^(t−1))
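A numpy sketch of one LSTM cell step following the gate equations above; the weights are random stand-ins and sigma is the logistic sigmoid:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
nx, nh = 3, 4
x, h_prev, s_prev = rng.standard_normal(nx), np.zeros(nh), np.zeros(nh)

# One (U, W, b) triple per gate: forget f, external input g, cell candidate c, output o.
params = {k: (rng.standard_normal((nh, nx)), rng.standard_normal((nh, nh)),
              np.zeros(nh)) for k in "fgco"}

def gate(k, squash):
    U, W, b = params[k]
    return squash(b + U @ x + W @ h_prev)

f = gate("f", sigmoid)                    # forget gate f^(t)
g = gate("g", sigmoid)                    # external input gate g^(t)
s = f * s_prev + g * gate("c", sigmoid)   # internal state s^(t), sigma candidate as above
q = gate("o", sigmoid)                    # output gate q^(t)
h = np.tanh(s) * q                        # hidden output h^(t)
print(h)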