
Machine Learning

Banh Tan Thuan([email protected]), Le Quang Khai, Vu Chau Duy Quang, Le Viet Tung,
Vo Hoang Nhat Khang, La Cam Huy
August 1, 2024

Contents
1 Introduction
   1.1 Types of Machine Learning
   1.2 Performance Measures
   1.3 Inductive Bias
2 Decision Tree
   2.1 Theory
   2.2 Problems with Decision Trees
3 Bayesian Learning
   3.1 Theory
4 Linear Regression
   4.1 Theory
   4.2 Weaknesses of linear regression
5 Genetic Algorithm
   5.1 Theory
6 Graphical Models
   6.1 Theory
7 Support Vector Machine (SVM)
   7.1 Theory
8 Dimensionality Reduction
   8.1 Matrix Calculus Theory
9 Discriminative Models
   9.1 Feature-based linear classifiers
   9.2 Logistic regression
   9.3 Maximum entropy model
   9.4 Conditional Random Field
10 Artificial Neural Networks (ANN)
   10.1 Neural networks: Without the brain stuff
   10.2 Activation functions
   10.3 Neural network architectures
11 Deep Feedforward Networks
   11.1 Gradient-Based Learning
   11.2 Hidden Units
   11.3 Architecture Design
   11.4 Back-Propagation
12 Regularization
   12.1 Parameter Norm Penalties
   12.2 Constrained Optimization
   12.3 Dataset Augmentation
   12.4 Other regularization approaches
   12.5 Dropout
13 Optimization for training deep models
   13.1 Learning vs. Pure Optimization
   13.2 Challenges in Neural Network Optimization
   13.3 Algorithms
14 Convolutional Neural Network (CNN)
   14.1 Theory
   14.2 Examples
15 Recurrent Neural Network (RNN)
   15.1 RNN
   15.2 Bidirectional RNN
   15.3 Encoder-Decoder Architectures
   15.4 Other RNNs
   15.5 Long Short-Term Memory (LSTM)

Abstract
Machine Learning Materials.
Books:
• Bishop, Pattern Recognition and Machine Learning, 2006
• Deep Learning (Adaptive Computation and Machine Learning series)
Web: https://machinelearningcoban.com/
Bishop: https://www.youtube.com/playlist?list=PL8FnQMH2k7jzhtVYbKmvrMyXDYMmgjj_n
Examples: https://www.youtube.com/@MaheshHuddar
Deep Learning from Chapter 6: https://www.youtube.com/playlist?list=PLbBjZEwyU7W1CDs3Vx_GOJ9b3EgYQB3GE
Slides: https://www.deeplearningbook.org/lecture_slides.html

1 Introduction
1.1 Types of Machine Learning
1. Supervised learning: the learner (learning algorithm) is trained on labeled examples, i.e., inputs where the desired output is known.
• Classification: find the class of an instance given its selected features.
• Regression: find a function whose curve passes as close as possible to all of the given data points.

2. Unsupervised learning: the learner operates on unlabeled examples, i.e., inputs where the desired output is unknown.
• Clustering: group data points into clusters based on their patterns, without the need for labels.
3. Reinforcement learning: sits between supervised and unsupervised learning. The learner is told when an answer is wrong, but not how to correct it.
4. Evolutionary learning: biological evolution can be seen as a learning process that improves survival rates and the chance of having offspring.

1.2 Performance Measures


Precision:

Precision = TP / (TP + FP) = (number of correct system answers) / (number of system answers)

Recall:

Recall = TP / (TP + FN) = (number of correct system answers) / (number of correct problem answers)

Accuracy:

Accuracy = (TP + TN) / (TP + TN + FP + FN) = (number of correct predictions) / (total number of predictions)

Example:
• TP = 43

• FP = 8
• TN = 33
• FN = 7

Precision = TP / (TP + FP) = 43 / (43 + 8) = 43/51 = 0.8431
Recall = TP / (TP + FN) = 43 / (43 + 7) = 43/50 = 0.86
Accuracy = (TP + TN) / (TP + TN + FP + FN) = (43 + 33) / (43 + 33 + 8 + 7) = 76/91 = 0.8352

Note that 100% precision with 0% recall, or 0% precision with 100% recall, both give F1 score = 0.
F1 Score:
The F1 score is a measure that combines both precision and recall into a single metric, providing a balance between them. It is defined as the harmonic mean of precision and recall:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

The F1 score reaches its best value at 1 (perfect precision and recall) and worst at 0. It provides a useful way to evaluate a classifier's overall performance, considering both false positives and false negatives.
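As a quick check, the worked example above can be reproduced in a few lines of Python (a minimal sketch; the function name is ours):

# Verify the precision/recall/accuracy/F1 example above.
def precision_recall_f1(tp, fp, tn, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f1

p, r, acc, f1 = precision_recall_f1(tp=43, fp=8, tn=33, fn=7)
print(f"precision={p:.4f} recall={r:.2f} accuracy={acc:.4f} f1={f1:.4f}")
# precision=0.8431 recall=0.86 accuracy=0.8352 f1=0.8515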

1.3 Inductive Bias


With no prior assumptions, a learner cannot classify any unseen instances: there is no rational basis for classifying them, so any prediction would be random or arbitrary, and therefore unreliable and meaningless.
The inductive bias (learning bias) is the set of assumptions that the learner uses to predict outputs for inputs it has not encountered.

Table 1: Machine Learning Algorithms

Decision Trees (Supervised)
Inductive bias: prefers simple, high-information features; splits the data to maximize information gain or minimize impurity.
Why choose this model? Decision trees are easily interpretable models that handle non-linear relationships and missing values, and they do not need much data preprocessing. However, they are prone to overfitting, and small variations in the data can significantly change the outcome, making them unstable. To avoid overfitting, stop growing the tree early to keep it compact, or use pruning or Random Forests. Training is also computationally costly because entropy or Gini must be evaluated for every candidate split, so a large attribute set slows it down significantly.

Naive Bayes classifier (Supervised)
Inductive bias: maximum conditional independence: if the hypothesis can be cast in a Bayesian framework, try to maximize conditional independence. The attributes are assumed independent given the class (not correlated with each other); for example, skin color and clothes are treated as independent evidence for "rich".
Why choose this model? Naive Bayes works quickly and can save a lot of time, and it is suitable for solving multi-class prediction problems. If its assumption of feature independence holds true, it can perform better than other models and requires much less training data. However, Naive Bayes assumes that all predictors (or features) are independent, which rarely happens in real life; this limits the applicability of the algorithm in real-world use cases. It also faces the "zero-frequency problem": it assigns zero probability to a categorical value that appears in the test set but was not available in the training set, so a smoothing technique is needed to overcome this issue.

Hidden Markov Models (Supervised, Unsupervised)
Inductive bias: Markov assumption: the state at time t depends only on the state at time t − 1, p(y_t | y_{t−1}, Z) = p(y_t | y_{t−1}), and the observation at time t depends only on the state at time t, p(x_t | y_t, Z) = p(x_t | y_t), where Z = y_{1..t−1} plus everything else.
Why choose this model? HMMs are interpretable and transparent, providing a clear and intuitive representation of hidden states and their transitions, as well as allowing for uncertainty quantification and confidence estimation. They are restrictive and simplistic, however: they assume that the hidden states are discrete and finite and that the observations are conditionally independent given the hidden states, which may not hold, e.g., the relationship between different aspects of speech such as pronunciation and intonation. Additionally, they are prone to overfitting and underfitting, since they need a careful choice of the number of hidden states and of the prior distributions over the parameters. Furthermore, HMMs are not very good at generalizing to unseen data.

Support Vector Machines (SVM) (Supervised)
Inductive bias: maximum margin: when drawing a boundary between two classes, attempt to maximize the width of the boundary. The assumption is that distinct classes tend to be separated by wide boundaries.

K Nearest Neighbor (Supervised)
Inductive bias: nearest neighbors: assume that most of the cases in a small neighborhood in feature space belong to the same class.
Why choose this model? KNN is simple and has no training phase, but prediction is slow on large datasets and the results are sensitive to the choice of k and of the distance metric.

2 Decision Tree
2.1 Theory
Data: learning set S, attribute set A, attribute values V
Result: decision tree
Function ID3(S, A, V):
  Load learning set S and create decision tree root node rootNode, adding learning set S into rootNode as its subset;
  Compute Entropy(rootNode.subset);
  if Entropy(rootNode.subset) == 0 then
    ▷ subset is homogeneous
    return a leaf node;
  end
  if Entropy(rootNode.subset) ≠ 0 then
    ▷ subset is not homogeneous
    Compute the information gain for each attribute A not yet used for splitting;
    Find the attribute A with maximum Gain(S, A);
    Create child nodes for rootNode based on attribute A and add them to the decision tree;
  end
  foreach child of rootNode do
    ID3(S, A, V);
    Continue until a node with entropy 0 (a leaf node) is reached
  end
Algorithm 1: ID3 Algorithm

Functions
Entropy:

E(S) = −p_+ log2 p_+ − p_− log2 p_−

S is a set of examples, p_+ is the proportion of examples in class +, and p_− = 1 − p_+ is the proportion of examples in class −.
Entropy for n > 2 classes:

E(S) = −p_1 log p_1 − p_2 log p_2 − ... − p_n log p_n = −Σ_{i=1..n} p_i log p_i

where p_i is the proportion of examples in S that belong to the i-th class.

Weighted sum:

I(S, A) = Σ_i (|S_i| / |S|) E(S_i)

Information gain for attribute A:

Gain(S, A) = E(S) − I(S, A) = E(S) − Σ_i (|S_i| / |S|) E(S_i)

Intrinsic information of a split:

IntI(S, A) = −Σ_i (|S_i| / |S|) log (|S_i| / |S|)

Gain ratio:

GR(S, A) = Gain(S, A) / IntI(S, A)

Impurity measure:

Gini(S) = 1 − Σ_i p_i^2

Average Gini index (instead of average entropy / information):

Gini(S, A) = Σ_i (|S_i| / |S|) Gini(S_i)
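The split measures above are straightforward to compute; a minimal Python sketch (function names and the toy split are ours):

import math
from collections import Counter

# `labels` is a list of class labels; `partition` maps each attribute value
# to the list of labels in its subset S_i.
def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(labels, partition):
    n = len(labels)
    weighted = sum(len(s) / n * entropy(s) for s in partition.values())  # I(S, A)
    return entropy(labels) - weighted                                    # Gain(S, A)

def gain_ratio(labels, partition):
    n = len(labels)
    int_info = -sum(len(s) / n * math.log2(len(s) / n) for s in partition.values())
    return information_gain(labels, partition) / int_info               # GR(S, A)

# Example: a classic 9-positive / 5-negative set split on a three-valued attribute.
S = ["+"] * 9 + ["-"] * 5
split = {"v1": ["+"] * 2 + ["-"] * 3, "v2": ["+"] * 4, "v3": ["+"] * 3 + ["-"] * 2}
print(entropy(S), information_gain(S, split), gain_ratio(S, split), gini(S))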

2.2 Problems with Decision Trees
Problem: attributes with a large number of values.
More subsets and more attribute values → subsets are more likely to be pure → Gain() is biased towards choosing attributes with a large number of values.
This may cause several problems:
• Overfitting: selection of an attribute that is non-optimal for prediction.
• Fragmentation: data are fragmented into (too) many small sets → too deep and complex a tree.

We only want a compact tree.

Solution 1: intrinsic information and gain ratio. This reduces the bias towards multi-valued attributes.
Solution 2: Gini index. Replace E(S) with Gini(S) and I(S, A) with Gini(S, A).
Gini impurity is computationally efficient compared to entropy. It is not ideal to use entropy with continuous feature values or larger datasets, as it takes more time; Gini impurity is more efficient to compute (entropy requires a log2, whereas Gini is purely algebraic).

3 Bayesian Learning
3.1 Theory
Functions
Bayes theorem:

P(h|D) = P(D|h) P(h) / P(D)

where P(h): prior probability of hypothesis h; P(D): prior probability of training data D; P(h|D): probability that h holds given D; P(D|h): probability that D is observed given h.
Maximum a-posteriori hypothesis (MAP):

h_MAP = argmax_{h∈H} P(h|D) = argmax_{h∈H} P(D|h) P(h)

used when P(h) is not a uniform distribution over H.

Maximum likelihood hypothesis (ML):

h_ML = argmax_{h∈H} P(h|D) = argmax_{h∈H} P(D|h)

used when P(h) is a uniform distribution over H.

Bayes optimal classifier:

argmax_{c∈C} P(c|D) = argmax_{c∈C} Σ_{h∈H} P(c|h) P(h|D)

Naive Bayes classifier:

C_MAP = argmax_{c∈C} P(c|a_1, a_2, ..., a_n) = argmax_{c∈C} P(a_1, a_2, ..., a_n|c) P(c)

C_NB = argmax_{c∈C} P(c) Π_{i=1..n} P(a_i|c)

assuming that a_1, a_2, ..., a_n are independent given c.

Estimating probabilities (m-estimate):

(n_c + m p) / (n + m)

where n: total number of training examples of a particular class; n_c: number of training examples having a particular attribute value in that class; m: equivalent sample size; p: prior estimate of the probability (equal to 1/k, where k is the number of possible values of the attribute).
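A minimal sketch of a Naive Bayes classifier using the m-estimate above (the class name and toy data are illustrative assumptions):

from collections import Counter, defaultdict

# X is a list of attribute-value tuples, y the class labels.
# Per-attribute priors p = 1/k follow the m-estimate (n_c + m*p) / (n + m).
class NaiveBayes:
    def __init__(self, m=1.0):
        self.m = m

    def fit(self, X, y):
        self.classes = Counter(y)                 # class frequencies
        self.total = len(y)
        self.counts = defaultdict(Counter)        # (attr index, class) -> value counts
        self.values = defaultdict(set)            # attr index -> observed values
        for xs, c in zip(X, y):
            for i, v in enumerate(xs):
                self.counts[(i, c)][v] += 1
                self.values[i].add(v)
        return self

    def predict(self, xs):
        def score(c):
            p = self.classes[c] / self.total      # P(c)
            for i, v in enumerate(xs):
                prior = 1 / len(self.values[i])   # p = 1/k
                n, nc = self.classes[c], self.counts[(i, c)][v]
                p *= (nc + self.m * prior) / (n + self.m)
            return p
        return max(self.classes, key=score)

X = [("sunny", "hot"), ("rain", "mild"), ("sunny", "mild"), ("rain", "hot")]
y = ["no", "yes", "yes", "no"]
print(NaiveBayes(m=2.0).fit(X, y).predict(("sunny", "hot")))  # -> 'no'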

4 Linear Regression
4.1 Theory
In general, linear regression is used to perform regression on linearly dependent variables.
It has the form:

h(x) = θ^T x

optimized via the cost function:

J(θ) = (1/2) Σ_{i=1..m} (h_θ(x^(i)) − y^(i))^2

Formula to calculate the parameters directly (the normal equation):

θ = (X^T X)^{-1} X^T y

4.2 Weaknesses of linear regression

Not really fit to work with nonlinear data (a job for kernel methods such as SVM).
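A minimal numpy sketch of the normal equation on synthetic data (names and numbers are ours):

import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.uniform(0, 10, 50)])  # bias column + one feature
y = 4.0 + 2.5 * X[:, 1] + rng.normal(0, 0.5, 50)            # y = 4 + 2.5x + noise

theta = np.linalg.inv(X.T @ X) @ X.T @ y   # theta = (X^T X)^-1 X^T y
print(theta)                               # approximately [4.0, 2.5]
# In practice np.linalg.lstsq(X, y) is preferred: it avoids the explicit inverse.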

5 Genetic Algorithm
5.1 Theory
Functions
1. Initialize population: P = p randomly generated hypotheses.

2. Evaluate fitness: compute Fitness(h) for each h ∈ P.

3. While max_{h∈P} Fitness(h) < Fitness_threshold do:
(a) Create a new generation (selection, crossover, mutation)
(b) Evaluate fitness

Selection: probabilistically select (1 − r)p hypotheses of P to add to the new generation. The selection probability of a hypothesis is:

Pr(h_i) = Fitness(h_i) / Σ_{h∈P} Fitness(h)

Crossover:
1. Probabilistically select (r/2)p pairs of hypotheses from P according to Pr(h).
2. For each pair (h_1, h_2), produce two offspring by applying a crossover operator.

3. Add all offspring to the new generation.

Mutation:
1. Choose m percent of the added hypotheses with uniform distribution.
2. For each, invert one randomly selected bit in its representation.

where p is the number of hypotheses and r is the portion of hypotheses that will be discarded. Example: r = 0.2 → discard 20% of the population.
Fitness function example:

Fitness(h) = (correct(h))^2

A good individual does not always survive; there is a chance it will die, it just has a better chance than an average individual.
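One generation of the loop above might look as follows; a minimal sketch over bit-string hypotheses, with illustrative sizes and fitness:

import random

def step(population, fitness, r=0.2, m=0.05):
    total = sum(fitness(h) for h in population)
    weights = [fitness(h) / total for h in population]       # Pr(h_i)
    p = len(population)
    # Selection: probabilistically carry over (1 - r) * p hypotheses.
    new_gen = random.choices(population, weights, k=round((1 - r) * p))
    # Crossover: (r/2) * p pairs, single-point crossover producing two offspring.
    for _ in range(round(r * p / 2)):
        h1, h2 = random.choices(population, weights, k=2)
        cut = random.randrange(1, len(h1))
        new_gen += [h1[:cut] + h2[cut:], h2[:cut] + h1[cut:]]
    # Mutation: flip one random bit in a fraction m of the new generation.
    for i in random.sample(range(len(new_gen)), round(m * len(new_gen))):
        j = random.randrange(len(new_gen[i]))
        h = list(new_gen[i]); h[j] = "1" if h[j] == "0" else "0"
        new_gen[i] = "".join(h)
    return new_gen

pop = ["".join(random.choice("01") for _ in range(8)) for _ in range(20)]
pop = step(pop, fitness=lambda h: h.count("1") ** 2)  # analogue of (correct(h))^2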

6 Graphical Models
6.1 Theory
Functions

Bayesian networks (revisited). Advantages of graphical modeling:

Conditional independence: p(D|C, E, A, B) = p(D|C)
Factorization: p(A, B, C, D, E) = p(D|C) p(E|C) p(C|A, B) p(A) p(B)
Each instance x is described by a conjunction of attribute values <a_1, a_2, ..., a_n> (independent of each other). The task is to assign the most probable class c to an instance (the attributes depend only on C).

C_NB = argmax_{c∈C} P(c) Π_{i=1..n} P(a_i|c)

If we have a joint distribution P(C, a_1, a_2, ..., a_n), we can calculate its CDF, draw a uniform random number in [0, 1] for each a_i, and map [a′_1, a′_2, ..., a′_n] through the joint distribution to generate a new sample P(C, a′_1, a′_2, ..., a′_n). Therefore, Naive Bayes is a generative model.
Three basic problems of HMMs. Once we have an HMM, there are three problems of interest.
1. The evaluation problem. Given an HMM λ and a sequence of observations O = o_1, o_2, ..., o_T, what is the probability that the observations are generated by the model, p(O|λ)?
2. The decoding problem. Given an HMM λ and a sequence of observations O = o_1, o_2, ..., o_T, what is the most likely state sequence in the model that produced the observations?
3. The learning problem. Given an HMM λ and a sequence of observations O = o_1, o_2, ..., o_T, how should we adjust the model parameters A, B, π in order to maximize p(O|λ)?

Functions
Markov assumptions:

1. State at time t depends only on the state at time t − 1: p(y_t | y_{t−1}, Z) = p(y_t | y_{t−1})
2. Observation at time t depends only on the state at time t: p(x_t | y_t, Z) = p(x_t | y_t)
Joint distribution:

p(Y, X) = p(y_1, ..., y_T, x_1, ..., x_T) = Π_{t=1..T} p(y_t | y_{t−1}) p(x_t | y_t)

with p(y_1 | y_0) = p(y_1).

Forward algorithm:

a_t(y_t) = p(y_t, x_1, x_2, ..., x_t) = Σ_{y_{t−1}} p(y_t, y_{t−1}, x_1, x_2, ..., x_t)
 = Σ_{y_{t−1}} p(x_t | y_t, y_{t−1}, x_1, ..., x_{t−1}) p(y_t, y_{t−1}, x_1, ..., x_{t−1})
 = Σ_{y_{t−1}} p(x_t | y_t) p(y_t | y_{t−1}, x_1, ..., x_{t−1}) p(y_{t−1}, x_1, ..., x_{t−1})    (2nd Markov assumption)
 = Σ_{y_{t−1}} p(x_t | y_t) p(y_t | y_{t−1}) p(y_{t−1}, x_1, ..., x_{t−1})    (1st Markov assumption)
 = p(x_t | y_t) Σ_{y_{t−1}} p(y_t | y_{t−1}) a_{t−1}(y_{t−1})

a_1(y_1) = p(y_1, x_1) = p(x_1 | y_1) p(y_1)


To compute the most probable sequence of states y_1, y_2, ..., y_T given a sequence of observations x_1, x_2, ..., x_T:

Y* = argmax_Y p(Y|X) = argmax_Y p(Y|X) p(X) = argmax_Y p(Y, X)

Viterbi algorithm:

max_{y_{1:T}} p(y_1, ..., y_T, x_1, ..., x_T) = max_{y_T} max_{y_{1:T−1}} p(y_1, ..., y_T, x_1, ..., x_T)
 = max_{y_T} max_{y_{1:T−1}} p(x_T | y_T) p(y_T | y_{T−1}) p(y_1, ..., y_{T−1}, x_1, ..., x_{T−1})
 = max_{y_T} max_{y_{T−1}} { p(x_T | y_T) p(y_T | y_{T−1}) max_{y_{1:T−2}} p(y_1, ..., y_{T−1}, x_1, ..., x_{T−1}) }

Dynamic programming:

1. Compute:

argmax_{y_1} p(y_1, x_1) = argmax_{y_1} p(x_1 | y_1) p(y_1)

2. For each t from 2 to T, and for each state y_t, compute:

argmax_{y_{1:t−1}} p(y_1, y_2, ..., y_t, x_1, x_2, ..., x_t)

3. Select:

argmax_{y_{1:T}} p(y_1, y_2, ..., y_T, x_1, x_2, ..., x_T)

As t advances, the values a_t become very, very small (numerical underflow); hence the log-space Viterbi computation below.

Fast Viterbi using a Casio calculator

[log(p_11)  log(p_21)]
[log(p_12)  log(p_22)]
 = [α_1 + log(t_11) + log(e_1(x))   α_2 + log(t_21) + log(e_1(x))]
   [α_1 + log(t_12) + log(e_2(x))   α_2 + log(t_22) + log(e_2(x))]
 = [α_1 α_2; α_1 α_2] + [log(t_11) log(t_21); log(t_12) log(t_22)] + [log(e_1(x)) log(e_1(x)); log(e_2(x)) log(e_2(x))]
 = o a^T + T^T + e o^T

where
a = [α_1; α_2] (previous timestep)
o = [1; 1] (vector of ones)
T = [log(t_11) log(t_12); log(t_21) log(t_22)] (log transition matrix; entry (i, j) = log t_ij)
e = [log(e_1(x)); log(e_2(x))] (log emission vector)
In the calculator: let MatA(4×1) = a, MatB(4×4) = T, MatC(4×1) = e, MatD(4×1) = o.
Then calculate: MatD × Trn(MatA) + Trn(MatB) + MatC × Trn(MatD).
Then take the max of each row to get the new MatA, recalculate MatC to get the new emissions, and repeat.
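The same log-space update is easy to write in numpy; a minimal sketch for a two-state model with made-up probabilities:

import numpy as np

a = np.log(np.array([0.6, 0.4]))          # log alpha at the previous timestep
Tt = np.log(np.array([[0.7, 0.4],         # T^T: entry [i, j] = log t_ji
                      [0.3, 0.6]]))       # (row = current state, column = previous)
e = np.log(np.array([0.5, 0.1]))          # log e_1(x), log e_2(x)

scores = a[None, :] + Tt + e[:, None]     # o a^T + T^T + e o^T
a_next = scores.max(axis=1)               # Viterbi: max over the previous state
backpointer = scores.argmax(axis=1)       # remember the argmax for decoding
print(a_next, backpointer)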

7 Support Vector Machine (SVM)
7.1 Theory
Functions
Signed distance between the decision boundary and a sample x_n:

y(x_n) / ||w||

Absolute distance between the decision boundary and a sample x_n:

t_n y(x_n) / ||w||

where t_n = +1 iff y(x_n) > 0 and t_n = −1 iff y(x_n) < 0.
Maximum margin:

argmax_{w,b} { (1 / ||w||) min_n [ t_n (w·x_n + b) ] }

with the constraint:

t_n (w·x_n + b) ≥ 1

Lagrange function for the maximum margin classifier:

L(w, b, a) = (1/2) ||w||^2 − Σ_{n=1..N} a_n [ t_n (w·x_n + b) − 1 ]

Solution for w:

∂L(w, b, a)/∂w = 0
w = Σ_{n=1..N} a_n t_n x_n

∂L(w, b, a)/∂b = Σ_{n=1..N} a_n t_n = 0

Dual representation:

L*(a) = Σ_{n=1..N} a_n − (1/2) Σ_{n=1..N} Σ_{m=1..N} a_n a_m t_n t_m (x_n·x_m)

with the constraints:

a_n ≥ 0
Σ_{n=1..N} a_n t_n = 0

Solution for b (NOTE: the struck-out version from Dr. Dung's slide is wrong; Dr. Sach's slide is correct):

b = (1/|S|) Σ_{n∈S} Σ_{m∈S} a_m t_m x_m·x_n = (1/|S|) Σ_{n∈S} w·x_n    (incorrect, struck out)

b = (1/|S|) Σ_{n∈S} ( t_n − Σ_{m∈S} a_m t_m x_m·x_n ) = (1/|S|) Σ_{n∈S} ( t_n − w·x_n )    (correct)

where S is the set of support vectors (a_n ≠ 0).
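A minimal sketch of recovering w and b from a solved dual; the toy data are ours, and using scikit-learn's SVC to obtain the dual coefficients is an assumption (any QP solver works):

import numpy as np
from sklearn import svm

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0], [3.0, 1.0]])
t = np.array([-1, -1, 1, 1])
clf = svm.SVC(kernel="linear", C=1e6).fit(X, t)    # near hard-margin SVM

sv = clf.support_                                  # indices of support vectors
an_tn = clf.dual_coef_[0]                          # a_n * t_n for each support vector
w = an_tn @ X[sv]                                  # w = sum_n a_n t_n x_n
b = np.mean(t[sv] - X[sv] @ w)                     # b = (1/|S|) sum (t_n - w.x_n)
print(w, b, clf.intercept_)                        # b should match clf.intercept_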

8 Dimensionality Reduction
8.1 Matrix Calculus Theory
Functions
A vector is treated as a column matrix.
Dot product in matrix notation:

a^T b

Vector projection of a on b:

r = (a·b) / ||b||

The dot product a·b is a linear combination of a's components.
Matrix differentiation:

y = Ψ(x)

where y is an m × 1 matrix and x is a 1 × n matrix.
Proposition 1:

a = x^T A x
∂a/∂x = (A + A^T) x

where x is n × 1, A is n × n, A does not depend on x, and A is the matrix of a quadratic form.
Proposition 2: A is symmetric

a = x^T A x
∂a/∂x = 2 x^T A

where x is n × 1, A is n × n, and A does not depend on x. (If A is not symmetric, first convert it to a symmetric matrix.)
Proposition 3: A is symmetric

a = x^T A x
(∂a/∂x)^T = 2 A x

where x is n × 1, A is n × n, and A does not depend on x.
Eigenvalues and eigenvectors:

A v = λ v

where A is n × n (a linear transformation), v is n × 1, λ is an eigenvalue of A, and v is an eigenvector of A.
Proposition 4: if A is an n × n symmetric matrix, then all of its eigenvalues are real and there are n linearly independent eigenvectors of A.
Proposition 5: if v_1, v_2, ..., v_n are linearly independent eigenvectors of A, and λ_1, λ_2, ..., λ_n are their corresponding eigenvalues, then

A = P D P^{-1}

where D = diag(λ_1, λ_2, ..., λ_n) and P = [v_1 v_2 ... v_n].
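Proposition 5 is easy to verify numerically; a minimal numpy sketch (the example matrix is ours), with the PCA connection at the end:

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigvals, P = np.linalg.eigh(A)        # eigh: symmetric A -> real eigenvalues
D = np.diag(eigvals)
A_rebuilt = P @ D @ np.linalg.inv(P)  # A = P D P^-1
print(np.allclose(A, A_rebuilt))      # True

# PCA uses exactly this: eigenvectors of the data covariance matrix.
X = np.random.default_rng(0).normal(size=(100, 2)) @ A
cov = np.cov(X, rowvar=False)
_, components = np.linalg.eigh(cov)   # columns = principal directions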

9 Discriminative Models
A discriminative model discriminates y given x, calculating the conditional distribution p(y|x).

9.1 Feature-based linear classifiers


A feature-based linear classifier makes decisions based on a linear combination of features, where a feature can be defined as a bounded real-valued function of a class and an observation, f(c, x).

Features of classifiers
Linear classifier:

argmax_{c∈C} Σ_{m=1..M} λ_m f_m(c, x)

SVM:

f_m(c, x) = x_m

In an SVM, scaling λ (e.g., from 0.6 to 0.3) moves the middle line closer to one margin; the margin itself stays the same.
Naive Bayes classifier:

f_m(c, x) = log p(x_m|c)

NLP:

f(c, x) = (c is a certain class) ∧ (x has a certain property)

where λ_m is a weight, c a class, and f_m(c, x) a feature function.

Equations

Empirical count of a feature:

Ẽ(f_m) = Σ_{(c,x)∈observed(C,X)} p̃(c, x) f_m(c, x)

Ẽ(f_m) ≈ (1/D) Σ_{(c,x)∈observed(C,X)} f_m(c, x)

Model expectation of a feature:

E(f_m) = Σ_{(c,x)∈(C,X)} p(c, x) f_m(c, x)
E(f_m) = Σ_{(c,x)∈(C,X)} p(x) p(c|x) f_m(c, x)
E(f_m) ≈ Σ_{(c,x)∈(C,X)} p̃(x) p(c|x) f_m(c, x)
E(f_m) ≈ (1/D) Σ_{x∈observed(X)} Σ_{c∈C} p(c|x) f_m(c, x)

Consistency constraint:

E(f_m) = Ẽ(f_m)

9.2 Logistic regression


Logistic regression is a discriminative model:

p(y|x) = exp( Σ_{m=1..M} λ_m f_m(y, x) ) / Σ_{y′∈Y} exp( Σ_{m=1..M} λ_m f_m(y′, x) )

It can be seen as a linear classifier, since up to a normalization constant:

log p(y|x) = Σ_{m=1..M} λ_m f_m(y, x)

In the case of two classes, we have the following equations:

p(⊕|x) = exp( Σ_{m=1..M} λ_m f_m(y, x) ) / ( 1 + exp( Σ_{m=1..M} λ_m f_m(y, x) ) )

p(⊖|x) = 1 / ( 1 + exp( Σ_{m=1..M} λ_m f_m(y, x) ) )

Two candidate normalizations of scores z_i:

p(c_i|x) = z_i / Σ_j z_j    vs.    p(c_i|x) = e^{z_i} / Σ_j e^{z_j}    (softmax)

The maximum one is still the maximum one under either. Why prefer softmax? Because its derivative is more convenient than the derivative of z_i / Σ_j z_j:

σ_i = e^{z_i} / Σ_j e^{z_j}  ⇒  ∂σ_i/∂z_i = (e^{z_i} / Σ_j e^{z_j}) (1 − e^{z_i} / Σ_j e^{z_j}) = σ_i (1 − σ_i)

and the partial derivative with respect to z_y (for y ≠ i):

∂σ_i/∂z_y = −(e^{z_i} / Σ_j e^{z_j}) (e^{z_y} / Σ_j e^{z_j}) = −σ_i σ_y

These values look like probabilities but are not true probabilities: there is no proof that Σ_{c_i} p(c_i|x) = 1 describes the data in the real world; it is just what we think it should be. This holds for both p(c_i|x) = z_i / Σ_j z_j and e^{z_i} / Σ_j e^{z_j}.
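The softmax derivatives above can be checked numerically; a minimal numpy sketch (the stability shift by max(z) is a standard trick, not part of the formulas):

import numpy as np

def softmax(z):
    ez = np.exp(z - z.max())
    return ez / ez.sum()

z = np.array([2.0, 1.0, 0.1])
s = softmax(z)
# Jacobian: d s_i / d z_y = s_i (1 - s_i) if i == y, else -s_i * s_y
J = np.diag(s) - np.outer(s, s)

# Check against a finite-difference approximation.
eps = 1e-6
num = np.stack([(softmax(z + eps * np.eye(3)[y]) - softmax(z - eps * np.eye(3)[y])) / (2 * eps)
                for y in range(3)], axis=1)
print(np.allclose(J, num, atol=1e-6))  # True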

9.3 Maximum entropy model


According to the principle of maximum entropy, the only unbiased assumption is a distribution that is as uniform as
possible given the available information. The maximum entropy model is based on this principle. In particular, in this
model, the proper probability distribution is the one that maximizes the entropy given the constraints from the training
data.

Formulas
Conditional entropy:

H(y|x) = −Σ_{(x,y)∈(X,Y)} p(y, x) log p(y|x)

H(y|x) ≈ −Σ_{(x,y)∈(X,Y)} p̃(x) p(y|x) log p(y|x)

Maximum entropy model:

p*(y|x) = argmax_{p(y|x)∈P} H(y|x)

Constraints:

E(f_m) = Ẽ(f_m)
Σ_{y∈Y} p(y|x) = 1

where E(f_m) is the model expectation.

Lagrange function:

L(p(y|x)) = H(y|x) + Σ_{m=1..M} λ_m (E(f_m) − Ẽ(f_m)) + λ_{M+1} ( Σ_{y∈Y} p(y|x) − 1 )

Optimization:

∂L(p(y|x)) / ∂p(y|x) = 0

Solution:

p(y|x) = exp( Σ_{m=1..M} λ_m f_m(y, x) ) / Σ_{y*∈Y} exp( Σ_{m=1..M} λ_m f_m(y*, x) )

9.4 Conditional Random Field


A conditional random field (CRF) is the conditional extension of the Hidden Markov Model (HMM) and the sequential extension of logistic regression.
Formula:

p(y|x) = Π_{t=1..T} exp( Σ_{m=1..M} λ_m f_m(y_t, y_{t−1}, x_t) ) / Σ_{y*∈Y} Π_{t=1..T} exp( Σ_{m=1..M} λ_m f_m(y*_t, y*_{t−1}, x_t) )

10 Artificial Neural Networks (ANN)
This part is rewritten based on lecture 4 of Stanford University (Fei-Fei Li, Ranjay Krishna, Danfei Xu - April 16, 2020), as Mr. Sach's slide was missing.

10.1 Neural networks: Without the brain stuff


(Before) Linear score function: y = W x
(Now) 2-layer neural network: y = W_2 max(0, W_1 x), with x ∈ R^D, W_1 ∈ R^{H×D}, W_2 ∈ R^{C×H}
3-layer neural network: f = W_3 max(0, W_2 max(0, W_1 x)), with x ∈ R^D, W_1 ∈ R^{H_1×D}, W_2 ∈ R^{H_2×H_1}, W_3 ∈ R^{C×H_2}

The function max(0, z) is called the activation function. Removing the activation functions leaves a linear classifier (W_2 W_1 x).
NOTE: What is the purpose of activation functions? To introduce non-linearity into the classifier.
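A minimal numpy sketch of the 2-layer network above (sizes are illustrative):

import numpy as np

D, H, C = 4, 8, 3                 # D inputs, H hidden units, C classes
rng = np.random.default_rng(0)
W1 = rng.normal(size=(H, D))
W2 = rng.normal(size=(C, H))

x = rng.normal(size=D)
h = np.maximum(0, W1 @ x)         # ReLU activation: max(0, z)
y = W2 @ h                        # class scores

# Without the ReLU this collapses to a single linear map W2 @ W1 @ x.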

10.2 Activation functions

ReLU is a good default choice for most problems. (See the derivatives of the common activation functions.)

10.3 Neural network architectures

An n-layer neural net includes 1 input layer + (n − 1) hidden layers.

When training, how many layers should we add? Answer: just keep adding layers until the test error does not improve anymore.

• If the data is linearly separable, you don't need any hidden layers at all.

• If the data is less complex and has fewer dimensions or features, a neural network with 1 to 2 hidden layers would work.
• If the data has many dimensions or features, then 3 to 5 hidden layers can be used to get an optimal solution.

Keep in mind that increasing the number of hidden layers also increases the complexity of the model, and choosing 8, 9, or double-digit numbers of hidden layers may sometimes lead to overfitting.
Once the hidden layers have been decided, the next task is to choose the number of nodes in each hidden layer. The number of hidden neurons should be between the size of the input layer and the size of the output layer.
A common heuristic for the number of hidden neurons is:

sqrt(input layer nodes × output layer nodes)

The number of hidden neurons should keep decreasing in subsequent layers, to get closer and closer to pattern and feature extraction and to identify the target class.

11 Deep Feedforward Networks
Watch this video: https://www.youtube.com/watch?v=kWOPkec1RSQ
Deep feedforward networks (multilayer perceptrons, MLPs):
• Goal: approximate some function f ∗
• Information flow through the function being evaluated
• No feedback connection
Linear models
• Logistic regression, linear regression
• Can be fit efficiently and reliably
• Can obtain a closed-form solution or be fit with convex optimization
• Limitation: capacity is limited to linear functions

11.1 Gradient-Based Learning

Most modern neural networks are trained using maximum likelihood. The cost function for training is simply the negative log-likelihood, equivalently described as the cross-entropy between the training data and the model distribution. It depends on the specific form of log p_model:

J(θ) = −E_{x,y∼p̂_data} log p_model(y|x)

A sufficiently powerful neural network can represent any function from a wide class of functions.

Gaussian output distributions


• Output: ŷ = W T h + b
• Mean of a conditional Gaussian distribution:

p(y|x) = N(y; ŷ, I)

• Minimize the mean squared error


Bernoulli output distributions
• Predict the value of a binary variable y
• Predict P (y = 1|x)
• Sigmoid output: ŷ = σ(wT h + b)
Multinoulli output distributions:
• Softmax function: often used as the output of the classifier, to represent the probability distribution over n different classes:

softmax(z)_i = exp(z_i) / Σ_j exp(z_j)

• Log-likelihood can undo the exp of the softmax

• Softmax output is invariant to adding a scalar to all of its inputs

Table 2: Output Types

Output Type | Output Distribution | Output Layer    | Cost Function
Binary      | Bernoulli           | Sigmoid         | Binary cross-entropy
Discrete    | Multinoulli         | Softmax         | Discrete cross-entropy
Continuous  | Gaussian            | Linear          | Gaussian cross-entropy (MSE)
Continuous  | Mixture of Gaussian | Mixture Density | Cross-entropy
Continuous  | Arbitrary           | GAN, VAE, FVBN  | Various

11.2 Hidden Units
Rectified linear units (ReLU)

• Activation function: g(z) = max(0, z)

• Used on top of an affine transformation:

h = g(W^T x + b)

• Can't be learned via gradient-based methods on examples for which their activation is 0
• Generalization (Leaky ReLU):

h_i = g(z, α)_i = max(0, z_i) + α_i min(0, z_i)

Maxout units
• Divide z into groups of k values

• Each maxout unit outputs the maximum element of one of these groups:

g(z)_i = max_{j∈G(i)} z_j

11.3 Architecture Design


By the universal approximation theorem we could use a single hidden layer, but how many neurons does it need? That depends on the complexity of the function. A shallow network is hard to train because it may need exponentially many neurons, so we use a deep network (with many layers) instead; depth reduces generalization error and also acts as a regularization method. A shallow net may overfit more than a deep network.

11.4 Back-Propagation
Formulas
Chain rule of calculus. Let x be a real number, and let f and g both be functions mapping from a real number to a real number. Suppose that y = g(x) and z = f(g(x)) = f(y).
The chain rule:

dz/dx = (dz/dy)(dy/dx)

Generalization: x ∈ R^m, y ∈ R^n, g: R^m → R^n, f: R^n → R:

∂z/∂x_i = Σ_j (∂z/∂y_j)(∂y_j/∂x_i)

Vector notation:

∇_x z = (∂y/∂x)^T ∇_y z

where ∂y/∂x is the n × m Jacobian matrix of g.
The gradient of a variable x can be obtained by multiplying the (transposed) Jacobian matrix ∂y/∂x by the gradient ∇_y z.
How much memory does back-propagation need? During the forward pass we have to remember all of the intermediate values that appear as local derivatives (e.g., ∂z_1/∂w_1 = x_1), or at least the important ones that are not 1.

Memory ≈ (activation memory per sample) × (batch size)
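A minimal numpy sketch of these chain-rule steps for the 2-layer network of Section 10, with an MSE loss (all names and sizes are ours):

import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(8, 4)), rng.normal(size=(1, 8))
x, t = rng.normal(size=4), np.array([1.0])

# Forward pass: remember the intermediates needed by the backward pass.
z = W1 @ x                     # pre-activation
h = np.maximum(0, z)           # ReLU
y = W2 @ h
L = 0.5 * np.sum((y - t) ** 2)

# Backward pass: each gradient is (local Jacobian)^T times the upstream gradient.
dy = y - t                     # dL/dy
dW2 = np.outer(dy, h)          # dL/dW2
dh = W2.T @ dy                 # chain rule through W2
dz = dh * (z > 0)              # chain rule through ReLU
dW1 = np.outer(dz, x)          # dL/dW1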

12 Regularization
Regularization strategies are explicitly designed to reduce the test error, possibly at the expense of increased training error, and are based on a trade-off between increased bias and reduced variance.

12.1 Parameter Norm Penalties


The main regularization approaches are based on adding a parameter norm penalty Ω(θ) to the objective function J:

J̃(θ; X, y) = J(θ; X, y) + α Ω(θ)

L2 regularization

Regularized objective function:

J̃(w; X, y) = J(w; X, y) + α Ω(w) = J(w; X, y) + (α/2) w^T w

Parameter gradient:

∇_w J̃(w; X, y) = α w + ∇_w J(w; X, y)

Gradient step:

w ← w − ϵ(α w + ∇_w J(w; X, y)) = (1 − ϵα) w − ϵ ∇_w J(w; X, y)

L1 regularization

L1 penalty:

Ω(θ) = ||w||_1 = Σ_i |w_i|

Regularized objective function:

J̃(w; X, y) = J(w; X, y) + α Ω(w) = J(w; X, y) + α ||w||_1

Parameter gradient:

∇_w J̃(w; X, y) = α sign(w) + ∇_w J(w; X, y)

L2 is smooth and dense: every parameter can vary over its full range of values. L1 is sparse rather than dense: a weight is driven exactly to 0 if it is too small, or stays away from 0 otherwise, so L1 selects a few significant features, but it is harder to optimize.
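Both gradient steps fit in two lines of numpy; a minimal sketch with made-up values (grad_J stands in for the data gradient):

import numpy as np

w = np.array([0.5, -0.3, 1.2])
grad_J = np.array([0.1, -0.2, 0.05])       # placeholder data gradient
alpha, eps = 0.01, 0.1

w_l2 = (1 - eps * alpha) * w - eps * grad_J           # L2 (weight decay) step
w_l1 = w - eps * (alpha * np.sign(w) + grad_J)        # L1 step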

12.2 Constrained Optimization


Constrained optimization is finding the maximal or minimal value of f(x) for values of x in some set S. The points that satisfy the constraint are called feasible points. A common approach to this type of optimization is to impose a norm constraint, such as ||x|| ≤ 1.
Approaches to constrained optimization:
• Modify gradient descent taking the constraint into account.
• If we use a small constant step size ϵ, we can make gradient descent steps and then project the result back into S.

• If we use a line search, we can search only over step sizes ϵ that yield new x points that are feasible, or we can project each point on the line back into the constraint region.
A general solution to the constrained optimization problem is Karush-Kuhn-Tucker (KKT). This solution introduces KKT multipliers λ_i and α_j for each constraint. The generalized Lagrangian is then defined as:

L(x, λ, α) = f(x) + Σ_i λ_i g^(i)(x) + Σ_j α_j h^(j)(x)

Minimizing min_{x∈S} f(x) is equivalent to:

min_x max_λ max_{α≥0} L(x, λ, α)

12.3 Dataset Augmentation


Dataset augmentation means creating fake data for training. It is used in object recognition (rotating, scaling, affine transformations) and speech recognition (injecting noise).

12.4 Other regularization approaches
Semi-supervised learning: learn a representation so that examples from the same class have similar representations.

• Construct models in which a generative model of either P(x) or P(x, y) shares parameters with a discriminative model of P(y|x)
• Trade off the supervised criterion log P(y|x) against the unsupervised criterion (−log P(x) or −log P(x, y))
• Principal component analysis (PCA): a pre-processing step before applying a classifier (on the projected data). Projecting the data actually removes a bit of noise: PCA extracts the important structure and removes the less important components.

Multitask Learning

Early Stopping: terminate training while validation set performance is still at its best.

Parameter Tying and Parameter Sharing: force sets of parameters to be equal, as used in Convolutional Neural Networks (CNN).

12.5 Dropout
Bagging (bootstrap aggregating):
Given a standard training set D of size n, bagging generates m new training sets D_i, each of size n′, by sampling from D uniformly and with replacement. Then m models are fitted using the m bootstrap samples and combined by averaging the output (for regression) or voting (for classification).

• The models are independent

• Each model is trained to convergence on its respective training set.
• Each model i produces a probability distribution p^(i)(y|x)

• The prediction of the ensemble is given by the arithmetic mean of all of these distributions:

(1/k) Σ_{i=1..k} p^(i)(y|x)

Dropout:
Dropout trains the ensemble consisting of all sub-networks that can be formed by removing non-output units from an underlying base network.
• Models share parameters and are not explicitly trained separately

• It is infeasible to sample all possible sub-networks within the lifetime of the universe
• Each sub-model, defined by a mask vector µ, defines a probability distribution p(y|x, µ)
• Arithmetic mean over all masks:

Σ_µ p(µ) p(y|x, µ)

For very large datasets, the computational cost may outweigh the benefit of regularization. Conversely, for datasets with very few samples, dropout is less effective.
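A minimal numpy sketch of a dropout mask (the 1/keep_prob scaling is the common "inverted dropout" variant, an assumption beyond the text):

import numpy as np

def dropout(h, keep_prob=0.5, train=True, rng=np.random.default_rng(0)):
    if not train:
        return h                              # test time: use all units
    mu = rng.random(h.shape) < keep_prob      # mask mu defining one sub-network
    return h * mu / keep_prob                 # keep the expectation of h unchanged

h = np.ones(8)
print(dropout(h))   # roughly half the units zeroed, survivors scaled by 2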

13 Optimization for training deep models
13.1 Learning vs. Pure Optimization
The cost function can be written as an average over the training set, such as

J(θ) = E_{(x,y)∼p̂_data} L(f(x; θ), y)

where L is the per-example loss function, f(x; θ) is the predicted output when the input is x, and p̂_data is the empirical distribution. In the supervised learning case, y is the target output.
The empirical risk to be minimized:

E_{(x,y)∼p̂_data} [L(f(x; θ), y)] = (1/m) Σ_{i=1..m} L(f(x^(i); θ), y^(i))

A surrogate loss function can be used as a proxy, and is minimized until a convergence criterion based on early stopping is satisfied.
Batch and minibatch algorithms
Optimization updates the parameters based on an expected value of the cost function estimated using only a subset of the full training set:
• Deterministic (batch) gradient methods: process all training examples simultaneously in a large batch.
• Stochastic gradient methods: use only a single example at a time.
• Minibatch methods: larger batches give a more accurate estimate of the gradient, but with less than linear returns.

13.2 Challenges in Neural Network Optimization


• Ill-conditioning: In SGD, even very small steps increase the cost function, leading to getting stuck
• Local minima
• Plateaus, saddle points, and other flat regions
• Cliffs and exploding gradients

13.3 Algorithms
Stochastic Gradient Descent
Require: Learning rate schedule ϵ_1, ϵ_2, ...
Require: Initial parameter θ
k ← 1
while stopping criterion not met do
  Sample a minibatch of m examples from the training set {x^(1), ..., x^(m)} with corresponding targets y^(i)
  Compute gradient estimate: ĝ ← (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i))
  Apply update: θ ← θ − ϵ_k ĝ
  k ← k + 1
End while

Momentum
Require: Learning rate ϵ, momentum parameter α
Require: Initial parameter θ, initial velocity v
while stopping criterion not met do
  Sample a minibatch of m examples from the training set {x^(1), ..., x^(m)} with corresponding targets y^(i)
  Compute gradient estimate: g ← (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i))
  Compute velocity update: v ← αv − ϵg
  Apply update: θ ← θ + v
End while

Nesterov Momentum
Require: Learning rate ϵ, momentum parameter α
Require: Initial parameter θ, initial velocity v
while stopping criterion not met do
  Sample a minibatch of m examples from the training set {x^(1), ..., x^(m)} with corresponding targets y^(i)
  Apply interim update: θ̃ ← θ + αv
  Compute gradient (at interim point): g ← (1/m) ∇_θ̃ Σ_i L(f(x^(i); θ̃), y^(i))
  Compute velocity update: v ← αv − ϵg
  Apply update: θ ← θ + v
End while

AdaGrad
Require: Global learning rate ϵ
Require: Initial parameter θ
Require: Small constant δ, perhaps 10^-7, for numerical stability
Initialize gradient accumulation variable r = 0
while stopping criterion not met do
  Sample a minibatch of m examples from the training set {x^(1), ..., x^(m)} with corresponding targets y^(i)
  Compute gradient estimate: g ← (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i))
  Accumulate squared gradient: r ← r + g ⊙ g
  Compute update: ∆θ ← −(ϵ / (δ + √r)) ⊙ g
  Apply update: θ ← θ + ∆θ
End while

RMSProp
Require: Global learning rate ϵ, decay rate ρ
Require: Initial parameter θ
Require: Small constant δ, perhaps 10^-6, used to stabilize division by small numbers
Initialize accumulation variable r = 0
while stopping criterion not met do
  Sample a minibatch of m examples from the training set {x^(1), ..., x^(m)} with corresponding targets y^(i)
  Compute gradient estimate: g ← (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i))
  Accumulate squared gradient: r ← ρr + (1 − ρ) g ⊙ g
  Compute update: ∆θ ← −(ϵ / √(δ + r)) ⊙ g
  Apply update: θ ← θ + ∆θ
End while

Adam
Require: Step size ϵ
Require: Exponential decay rates for moment estimates, ρ_1 and ρ_2 in [0, 1)
Require: Small constant δ
Initialize 1st and 2nd moment variables s = 0, r = 0
Initialize time step t = 0
while stopping criterion not met do
  Sample a minibatch of m examples from the training set {x^(1), ..., x^(m)} with corresponding targets y^(i)
  Compute gradient estimate: g ← (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i))
  t ← t + 1
  Update biased first moment estimate: s ← ρ_1 s + (1 − ρ_1) g
  Update biased second moment estimate: r ← ρ_2 r + (1 − ρ_2) g ⊙ g
  Correct bias in first moment: ŝ ← s / (1 − ρ_1^t)
  Correct bias in second moment: r̂ ← r / (1 − ρ_2^t)
  Compute update: ∆θ ← −ϵ ŝ / (√r̂ + δ)
  Apply update: θ ← θ + ∆θ
End while

Which algorithm should one choose (SGD, SGD with momentum, RMSProp, RMSProp with momentum, AdaDelta, Adam)?
Adam is generally the best choice; the differences between the others do not matter much.
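A minimal numpy sketch of one Adam update following the pseudocode above, on a toy quadratic (hyperparameter defaults are the common ones):

import numpy as np

def adam_step(theta, g, s, r, t, eps=1e-3, rho1=0.9, rho2=0.999, delta=1e-8):
    s = rho1 * s + (1 - rho1) * g                 # biased first moment
    r = rho2 * r + (1 - rho2) * g * g             # biased second moment
    s_hat = s / (1 - rho1 ** t)                   # bias corrections
    r_hat = r / (1 - rho2 ** t)
    theta = theta - eps * s_hat / (np.sqrt(r_hat) + delta)
    return theta, s, r

theta, s, r = np.zeros(3), np.zeros(3), np.zeros(3)
for t in range(1, 5001):                          # minimize ||theta - 1||^2
    g = 2 * (theta - 1.0)
    theta, s, r = adam_step(theta, g, s, r, t)
print(theta)                                      # converges towards [1, 1, 1]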

14 Convolutional Neural Network (CNN)
14.1 Theory
Convolution operation

Signal processing:

s(t) = (x ∗ w)(t) = ∫ x(τ) w(t − τ) dτ = ∫ x(t − τ) w(τ) dτ

Discrete convolution:

s(t) = (x ∗ w)(t) = Σ_{τ=−∞..∞} x(τ) w(t − τ)

2D image I, 2D kernel K:

S(i, j) = (I ∗ K)(i, j) = Σ_m Σ_n I(m, n) K(i − m, j − n)

Convolution is commutative:

S(i, j) = (K ∗ I)(i, j) = Σ_m Σ_n I(i − m, j − n) K(m, n)

An algorithm based on convolution will learn a kernel that is flipped relative to the kernel learned by an algorithm without flipping.

Motivation:
• Sparse interactions: an output unit only interacts with a small number of input units through the convolution kernel.
• Parameter sharing: the same kernel weights are reused at every spatial position.

• Equivariance: convolution is naturally equivariant to translation, but not to other transformations. (f is equivariant to g if f(g(x)) = g(f(x)))

Convolution layer

Assume an input of size W1 × H1 × D1 and a kernel of size F × F with K filters (spatial extent F), stride S, and amount of zero padding P.
Then we produce an output of size W2 × H2 × D2:

W2 = (W1 − F + 2P)/S + 1
H2 = (H1 − F + 2P)/S + 1
D2 = K

Number of parameters:

(F × F × D1) × K (weights) + K (biases) = (F × F × D1 + 1) × K

In the output volume, the d-th depth slice (of size W2 × H2) is the result of performing a valid convolution of the d-th filter over the input volume with a stride of S, then offset by the d-th bias.
Memory of one convolution layer with output of size W2 × H2 × D2:
Source: https://stackoverflow.com/questions/59282135/how-do-we-approximately-calculate-how-much-memory-is

Bytes per instance = K × W2 × H2 × b

where b is usually 4 (if the feature maps use 32-bit floats; otherwise it should be stated in the question).

Pooling replaces the output of the net at a certain location with a summary statistic of the nearby outputs. This makes the representation approximately invariant to small translations of the input.

Pooling layer

Source: https://cs231n.github.io/convolutional-networks/#case
Assume an input of size W1 × H1 × D1, a pooling window of spatial extent F, and stride S (pooling rarely uses zero padding).
Then we produce an output of size W2 × H2 × D2:

W2 = (W1 − F)/S + 1
H2 = (H1 − F)/S + 1
D2 = D1

Number of parameters: zero, since pooling computes a fixed function of the input.
In the output volume, the d-th depth slice (of size W2 × H2) is the result of pooling the d-th depth slice of the input with a window of size F and a stride of S.
Memory of one pooling layer with output of size W2 × H2 × D2:

Bytes per instance = D2 × W2 × H2 × b

where b is usually 4 (if the feature maps use 32-bit floats; otherwise it should be stated in the question).
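The size and parameter formulas above fit in a few lines of Python; a minimal sketch, checked against the example in 14.2:

def conv_layer(w1, h1, d1, f, k, s=1, p=0):
    w2 = (w1 - f + 2 * p) // s + 1
    h2 = (h1 - f + 2 * p) // s + 1
    params = (f * f * d1 + 1) * k              # weights + one bias per filter
    return (w2, h2, k), params

def pool_layer(w1, h1, d1, f, s):
    return ((w1 - f) // s + 1, (h1 - f) // s + 1, d1), 0   # no parameters

# First layers of the example below: 32x32x3 -> 32x32x64 -> 16x16x64
shape, n = conv_layer(32, 32, 3, f=3, k=64, s=1, p=1)
print(shape, n)                                # (32, 32, 64) 1792
print(pool_layer(32, 32, 64, f=2, s=2))        # ((16, 16, 64), 0)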

14.2 Examples
Question 1: To identify letters in the alphabet (26 letters), people preprocess them into 32 x 32 color images.
Let input x be a tensor with size 32 x 32 x 3. This input is passed through layers in a neural network, creating
the following outputs:

[32 × 32 × 3] → [32 × 32 × 64] → [16 × 16 × 64] → [8 × 8 × 128] → [8 × 8 × 32] → [2048 × 1] → [256 × 1] → [27 × 1]

1. Identify the layers in the network, given that they can be designed using convolution, pooling, or fully
connected. Identify the structure of each layer (size, stride, etc)
2. Calculate the size (number of parameters, including bias) for each layer

3. Calculate the memory requirement for back-propagation

1. Suppose the layers are called as below:


• [32 × 32 × 3] → [32 × 32 × 64]: layer 1
• [32 × 32 × 64] → [16 × 16 × 64]: layer 2
• [16 × 16 × 64] → [8 × 8 × 128]: layer 3
• [8 × 8 × 128] → [8 × 8 × 32]: layer 4
• [8 × 8 × 32] → [2048 × 1]: layer 5
• [2048 × 1] → [256 × 1]: layer 6
• [256 × 1] → [27 × 1]: layer 7
For layer 1, to keep the height and width of the tensor at 32 and change the number of channels from 3 to 64, we need 64 filters of size N, padding P and stride S such that (32 − N + 2P)/S + 1 = 32. We will let N = 3, P = 1, S = 1.
For layer 2, to halve the height and width of the tensor from 32 to 16 without changing the number of channels, we need a pooling layer (e.g., average pooling) with size 2 and stride 2.
For layer 3, to change the height and width of the tensor from 16 to 8 and the number of channels from 64 to 128, we need 128 filters of size N, padding P and stride S such that (16 − N + 2P)/S + 1 = 8. We will choose S = 2, N = 3, P = 1.
For layer 4, to keep the height and width at 8 and change the number of channels from 128 to 32, we need 32 filters of size N, padding P and stride S such that (8 − N + 2P)/S + 1 = 8. We will choose S = 1, N = 3, P = 1.
S
For layer 5, because 8 × 8 × 32 = 2048, the input tensor has 3 dimensions, while the output tensor has 2 dimensions,
we will use Flatten() to create an output of size 2048 × 1.
For layer 6, to output a tensor with size 256 × 1 from size 2048 × 1, we use a fully connected layer with number of input
and output channels being 2048 and 256 respectively.
For layer 7, to output a tensor with size 27 × 1 from size 256 × 1, we use a fully connected layer with the number of input and output channels being 256 and 27 respectively.

2. Layer 1 is a convolution layer with 64 filters of size 3, padding 1 and stride 1, with input of size 32 × 32 × 3 so
the number of parameters is:
(3 × 3 × 3 + 1) × 64 = 1792
Layer 2 is a pooling layer, so the number of parameters is 0.
Layer 3 is a convolution layer with 128 filters of size 3, padding 1, stride 2, with an input of size 16 × 16 × 64, so the number of parameters is:
(3 × 3 × 64 + 1) × 128 = 73856
Layer 4 is a convolution layer with 32 filters of size 3, padding 1, stride 1, with an input of size 8 × 8 × 128, so the number of parameters is:
(3 × 3 × 128 + 1) × 32 = 36896
Layer 5 is a flatten layer, so the number of parameters is 0.
Layer 6 is a fully connected layer with input and output number of channels being 2048 and 256 respectively, so the
number of parameters is:
2048 × 256 + 256 = 524544

Layer 7 is a fully connected layer with input and output number of channels being 256 and 27 respectively, so the number
of parameters is:
256 × 27 + 27 = 6939

3. Total number of parameters in the model:

1792 + 0 + 73856 + 36896 + 0 + 524544 + 6939 = 644027

Model memory = 644027 × 4 = 2576108 bytes

Gradient memory = 2576108 bytes
Activations = 32 × 32 × 3 + 32 × 32 × 64 + 16 × 16 × 64 + 8 × 8 × 128 + 8 × 8 × 32 + 2048 + 256 + 27 = 97563
Total memory needed for back-propagation = 2576108 + 2576108 + 97563 × 4 × 2 × batch size = 5152216 + 780504 × batch size (bytes).

15 Recurrent Neural Network (RNN)
15.1 RNN

Types of RNN

Important design patterns of RNN:


• RNNs that produce an output at each time step and have recurrent connections between hidden units

• RNNs that produce an output at each time step and have recurrent connections only from the output at one time
step to the hidden units at the next time step
• RNNs with recurrent connections between hidden units that read an entire sequence and then produce a single
output
Teacher forcing

• A procedure that emerges from the maximum likelihood criterion: during training, the model receives the ground truth output y^(t) as input at time t + 1
• The conditional maximum likelihood criterion is:

log p(y^(1), y^(2) | x^(1), x^(2))

• At test time: the model's own output is fed back into itself
Disadvantages:
• Problems arise when the network is going to be used in an open-loop mode

• Inputs during training can be quite different from those at test time

Back-propagation through time

Parameters: U, V, W, b, c
Sequence of nodes at time t: x^(t), h^(t), o^(t), L^(t)
Start the recursion with the nodes immediately preceding the final loss:

∂L/∂L^(t) = 1

Assume that the outputs o^(t) are used as the argument to the softmax function to obtain the vector ŷ of probabilities over the outputs, and that the loss is the negative log-likelihood of the true target y^(t).
The gradient ∇_{o^(t)} L on the outputs at time step t, for all i, t:

(∇_{o^(t)} L)_i = ∂L/∂o_i^(t) = (∂L/∂L^(t)) (∂L^(t)/∂o_i^(t)) = ŷ_i^(t) − 1_{i,y^(t)}

Back-propagation through the hidden states:

∇_{h^(t)} L = (∂h^(t+1)/∂h^(t))^T (∇_{h^(t+1)} L) + (∂o^(t)/∂h^(t))^T (∇_{o^(t)} L)
 = W^T diag(1 − (h^(t+1))^2) (∇_{h^(t+1)} L) + V^T (∇_{o^(t)} L)

Gradients on the remaining parameters:

∇_c L = Σ_t (∂o^(t)/∂c)^T ∇_{o^(t)} L = Σ_t ∇_{o^(t)} L

∇_b L = Σ_t (∂h^(t)/∂b^(t))^T ∇_{h^(t)} L = Σ_t diag(1 − (h^(t))^2) ∇_{h^(t)} L

∇_V L = Σ_t Σ_i (∂L/∂o_i^(t)) ∇_V o_i^(t) = Σ_t (∇_{o^(t)} L) h^(t)T

∇_W L = Σ_t Σ_i (∂L/∂h_i^(t)) ∇_{W^(t)} h_i^(t) = Σ_t diag(1 − (h^(t))^2) (∇_{h^(t)} L) h^(t−1)T

∇_U L = Σ_t Σ_i (∂L/∂h_i^(t)) ∇_{U^(t)} h_i^(t) = Σ_t diag(1 − (h^(t))^2) (∇_{h^(t)} L) x^(t)T
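For reference, a minimal numpy sketch of the forward pass whose gradients are derived above (sizes and initialization are illustrative):

import numpy as np

# h(t) = tanh(b + W h(t-1) + U x(t)), o(t) = c + V h(t), yhat = softmax(o(t))
rng = np.random.default_rng(0)
n_in, n_hidden, n_out, T = 4, 8, 3, 5
U, W, V = (rng.normal(scale=0.1, size=s)
           for s in [(n_hidden, n_in), (n_hidden, n_hidden), (n_out, n_hidden)])
b, c = np.zeros(n_hidden), np.zeros(n_out)

h = np.zeros(n_hidden)
xs = rng.normal(size=(T, n_in))
for x in xs:
    h = np.tanh(b + W @ h + U @ x)
    o = c + V @ h
    y_hat = np.exp(o) / np.exp(o).sum()   # softmax over the output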

15.2 Bidirectional RNN


Remember the past and think about the future

15.3 Encoder-Decoder Architectures
• Input to RNN: context

• Produce a representation of context C


• Context C: a vector or sequence of vectors that summarizes the input sequence X = (x^(1), ..., x^(n_x))
• The encoder and decoder don't need to have the same size

• Limitation: the output of the encoder may have a dimension that is too small to properly summarize a long sequence

15.4 Other RNNs


Deep Recurrent Networks

Recursive Neural Networks

• A generalization of recurrent networks

• Advantage: for a sequence of the same length τ , the depth can be drastically reduced from τ to O(log τ )
• Might help deal with long-term dependencies

15.5 Long Short-Term Memory (LSTM)


• Introduces self-loops to produce paths where the gradient can flow for long durations
• Make the weight on this self-loop conditioned on the context
• Gates: Input gate, output gate, forget gate

• Time scale of integration can be changed dynamically based on the input sequence
• Time constants are output by the model itself

LSTM cell equations

Forget gate:

f_i^(t) = σ( b_i^f + Σ_j U_{i,j}^f x_j^(t) + Σ_j W_{i,j}^f h_j^(t−1) )

Internal state:

s_i^(t) = f_i^(t) s_i^(t−1) + g_i^(t) σ( b_i + Σ_j U_{i,j} x_j^(t) + Σ_j W_{i,j} h_j^(t−1) )

External input gate:

g_i^(t) = σ( b_i^g + Σ_j U_{i,j}^g x_j^(t) + Σ_j W_{i,j}^g h_j^(t−1) )

Output h_i^(t) and output gate q_i^(t):

h_i^(t) = tanh(s_i^(t)) q_i^(t)

q_i^(t) = σ( b_i^o + Σ_j U_{i,j}^o x_j^(t) + Σ_j W_{i,j}^o h_j^(t−1) )
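A minimal numpy sketch of one step of this cell, vectorized over units, following the equations above (weights are illustrative; note the text uses σ, not tanh, for the candidate update):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(x, h_prev, s_prev, p):
    f = sigmoid(p["bf"] + p["Uf"] @ x + p["Wf"] @ h_prev)   # forget gate
    g = sigmoid(p["bg"] + p["Ug"] @ x + p["Wg"] @ h_prev)   # external input gate
    q = sigmoid(p["bo"] + p["Uo"] @ x + p["Wo"] @ h_prev)   # output gate
    s = f * s_prev + g * sigmoid(p["b"] + p["U"] @ x + p["W"] @ h_prev)  # internal state
    h = np.tanh(s) * q                                      # hidden output
    return h, s

rng = np.random.default_rng(0)
n_in, n_h = 4, 6
p = {k: rng.normal(scale=0.1, size=(n_h, n_in if k.startswith("U") else n_h))
     for k in ["Uf", "Wf", "Ug", "Wg", "Uo", "Wo", "U", "W"]}
p.update({k: np.zeros(n_h) for k in ["bf", "bg", "bo", "b"]})

h, s = np.zeros(n_h), np.zeros(n_h)
h, s = lstm_step(rng.normal(size=n_in), h, s, p)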

