Machine Learning: Probabilistic Fundamentals
ML Course Slides
Azmi MAKHLOUF
October 19, 2021
These slides are for educational purposes only and are strictly meant for private use; do not distribute!
Azmi Makhlouf
Some references
Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer-Verlag, 2006.
Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer New York Inc., 2001.
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
Outline
1 Introduction
2 Example: Polynomial Curve Fitting
3 Probability Tools
4 Model Selection
5 Linear Models for Regression
6 Linear Models for Classification
7 Other Models
8 Neural Networks
9 To keep in "deep mind"!
Introduction
ML: x ∈ R^D ⟼ t = f(x) + noise ε
1 regression when t takes continuous values (R, typically);
2 classification when t takes discrete (finitely many) values (classes).
Some Applications
Learning types
Notation
General procedure in ML
1 error = loss = loss(t, y(X, w));
2 minimize this loss with respect to the parameters w (training).
Example: Polynomial Curve Fitting
E(w) = (1/N) Σ_{n=1}^{N} (y(x_n, w) − t_n)²,
E_train(w*) = E(w*) = (1/N) Σ_{n=1}^{N} (y(x_n, w*) − t_n)².
Test error: error on new test data (x_n^test, t_n^test), n = 1...N_test (e.g. N_test = 50 other data points):
E_test(w*) = (1/N_test) Σ_{n=1}^{N_test} (y(x_n^test, w*) − t_n^test)².
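For concreteness, a minimal NumPy sketch of this experiment (the true function, the noise level and the data sizes below are illustrative choices, not taken from the slides):

import numpy as np

rng = np.random.default_rng(0)

def f(x):                           # unknown "true" function (illustrative choice)
    return np.sin(2 * np.pi * x)

def make_data(n):
    x = rng.uniform(0, 1, n)
    t = f(x) + rng.normal(0, 0.3, n)   # t = f(x) + noise
    return x, t

def fit_poly(x, t, M):
    # design matrix with columns x^0 ... x^M, least-squares fit of w
    Phi = np.vander(x, M + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w

def mse(w, x, t):
    Phi = np.vander(x, len(w), increasing=True)
    return np.mean((Phi @ w - t) ** 2)

x_tr, t_tr = make_data(10)
x_te, t_te = make_data(50)              # "N_test = 50 other data points"
for M in [1, 3, 5, 9]:
    w = fit_poly(x_tr, t_tr, M)
    print(M, mse(w, x_tr, t_tr), mse(w, x_te, t_te))   # E_train vs E_test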
Figure: Training error and test error for different degrees of the fitting polynomial.
A degree that is too low does not sufficiently capture the complexity of the data: this is called underfitting.
We see that the test error first decreases with respect to the degree M until M = 5, after which it starts to increase!
A fitting polynomial with too high a degree "sticks" too closely to the training data, so that it will be difficult for it to generalize to new data: this is called overfitting.
Regularization
A usual technique to reduce overfitting is called regularization, which consists of adding a penalty to the error function.
Typical examples:
L2 regularization (or "Ridge"): penalty on ‖w‖²:
Ẽ(w) := E(w) + λ‖w‖₂² = E(w) + λ Σ_{j=1}^{M} w_j².
Probability Tools
Sum rule: p(x) = Σ_y p(x, y).
Product rule: p(x, y) = p(x|y) p(y).
Independence: iff p(x, y) = p(x) p(y).
Bayes rule:
p(y|x) = p(x|y) p(y) / p(x),
which links the posterior probability p(y|x) to the prior probability p(y).
Expectation/average/mean:
E[ϕ(x)] = Σ_x ϕ(x) p(x),
or
E[ϕ(x)] = ∫ ϕ(x) p(x) dx.
The variance of x :
V(x) := E[(x − E(x))²] = E(x²) − E(x)².
A fundamental example
Then,
p(t | x, w, β) := N(t | y(x, w), β⁻¹).
The likelihood of the i.i.d. data is p(t | X, w, β) = Π_{n=1}^{N} N(t_n | y(x_n, w), β⁻¹).
min_w −log p(t | X, w, β)
= min_w { (β/2) Σ_{n=1}^{N} (y(x_n, w) − t_n)² − (N/2) log(β) + (N/2) log(2π) },
β_ML⁻¹ = (1/N) Σ_{n=1}^{N} (y(x_n, w_ML) − t_n)².
By Bayes rule,
p(w |X , t , α, β) ∝ p(t |X , w , β)p(w |α).
⇒ Computable, since
p(w | X, t) = p(t | X, w) p(w) / ∫ p(t | X, w) p(w) dw   (Bayes).
Information Theory
For a discrete random variable x, the entropy is H(x) := −Σ_x p(x) log p(x).
If x is a continuous random variable, we analogously define its (differential) entropy by
H(x) := E[h(x)] = −∫ p(x) log p(x) dx.
Maximum entropy:
NB.
KL(p_data ‖ q_model) = −(1/N) log-likelihood + const.
So, maximizing the log-likelihood is minimizing the KL-divergence between the data and our model.
Model Selection
Changing the training data clearly alters the model weights w, and hence its loss.
→ How does the average square loss change when we change the training data set?
→ This is given by the "bias-variance decomposition".
t = f(x) + noise ε.
In theory, the best L²-approximation of t given x is the regression function f(x) = E[t|x].
It is estimated by the model y := y(x, w; D), based on an arbitrary random data set D.
The square expected loss (over all random data sets D) is given by
E[L] = E[(y − t)²]
= E[(y − f(x) − ε)²]
= E[(y − f(x))²] + E[ε²] + 2 E[ε] E[y − f(x)]
= E[(E[y|x] − f(x) + y − E[y|x])²] + E[ε²] + 2 × 0
= E[(E[y|x] − f(x))²] + E[V[y|x]] + V(ε)
= (bias)² + variance + noise.
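A small simulation sketch of this decomposition, under an illustrative setup (sinusoidal f, Gaussian noise, polynomial fits): it estimates the bias² and variance terms by refitting the same model on many independently drawn data sets D.

import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * np.pi * x)          # regression function f(x) = E[t|x]
sigma = 0.3                                  # noise std, so V(eps) = sigma**2
x_test = np.linspace(0, 1, 100)
M, N, n_sets = 3, 25, 500                    # degree, data-set size, number of data sets D

preds = np.empty((n_sets, x_test.size))
for d in range(n_sets):
    x = rng.uniform(0, 1, N)
    t = f(x) + rng.normal(0, sigma, N)
    w = np.polyfit(x, t, M)                  # least-squares polynomial fit on data set D
    preds[d] = np.polyval(w, x_test)

bias2 = np.mean((preds.mean(axis=0) - f(x_test)) ** 2)   # E[(E_D[y] - f)^2]
variance = np.mean(preds.var(axis=0))                    # E[V_D[y]]
print(bias2, variance, sigma ** 2)           # expected loss ≈ bias² + variance + noise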
Figure: Cross-validation
Drawbacks:
Training S times ⇒ expensive.
Need for a number of trainings that is exponential with respect to the number of parameters (to test all combinations of these parameters).
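A minimal sketch of S-fold cross-validation used to pick one hyperparameter (here the polynomial degree; the data generator and S = 5 are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 40)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 40)

def cv_error(x, t, M, S=5):
    # split indices into S folds; train on S-1 folds, validate on the held-out one
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, S)
    errs = []
    for s in range(S):
        val = folds[s]
        tr = np.concatenate([folds[j] for j in range(S) if j != s])
        w = np.polyfit(x[tr], t[tr], M)
        errs.append(np.mean((np.polyval(w, x[val]) - t[val]) ** 2))
    return np.mean(errs)

best_M = min(range(1, 10), key=lambda M: cv_error(x, t, M))
print("degree chosen by cross-validation:", best_M)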
- Cross-validation is
computationally expensive, and
wasteful of valuable data (need for a training set, a validation set and a test set).
- The Bayesian approach
avoids the overfitting problem of maximum likelihood, and
has automatic methods of determining the model complexity using the training data alone.
Linear Models for Regression
Goal: model the predictive distribution p(t|x).
Figure: A linear regression example with real-valued input and target data.
Here, y (x, w ) = w1 x + w0 .
(we have omitted, as we may do from now on, the obvious variable x
from conditioning, for notational simplicity). Now, the parameters w
and β are to be determined.
min_w MSE = (1/N) Σ_{n=1}^{N} (t_n − w^T φ(x_n))².
The least-squares solution is w_ML = (Φ^T Φ)^{−1} Φ^T t,
where
Φ = (φ_j(x_n))_{1≤n≤N, 0≤j≤M−1} ∈ R^{N×M}.
(β_ML)⁻¹ = (1/N) Σ_{n=1}^{N} (t_n − w_ML^T φ(x_n))².
E_D(w) := (1/2) Σ_{n=1}^{N} (t_n − w^T φ(x_n))²,
For ridge regularization, E_W(w) = (1/2) Σ_{j=0}^{M−1} w_j², and we can explicitly solve for the optimal w to obtain
w_ML^ridge = (λI + Φ^T Φ)^{−1} Φ^T t.
Notice that it does not need the matrix Φ^T Φ to be non-singular.
For lasso regularization, E_W(w) = Σ_{j=0}^{M−1} |w_j|, and the coordinates of the optimal w (not explicit here) become more sparse as λ is increased.
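A direct NumPy sketch of the ridge solution above, with the polynomial basis φ_j(x) = x^j and the value of λ as illustrative choices:

import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 1, 15)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 15)

M, lam = 9, 1e-3                                    # number of basis functions, regularization strength
Phi = np.vander(x, M, increasing=True)              # Phi[n, j] = phi_j(x_n) = x_n**j, j = 0..M-1
w_ridge = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)   # (λI + ΦᵀΦ)⁻¹ Φᵀ t
print(w_ridge)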
where t̄ := (1/N) Σ_n t_n.
As before, we have
t = y (x , w ) + ε = w T φ(x ) + ε,
Now, we have training data t = (t1 , . . . , tN ) and (Φnj ) = (φj (xn )),
and we ask for the posterior distribution of w given t (and X ):
p(w |t ) =?
with Σ := (Λ + A^T L A)^{−1}.
where
m_N = S_N (S_0^{−1} m_0 + β Φ^T t);
S_N^{−1} = S_0^{−1} + β Φ^T Φ.
Thus,
log p(w|t) = −(β/2) Σ_{n=1}^{N} (w^T φ(x_n) − t_n)² − (α/2) w^T w + const,
where σ_N²(x) = β⁻¹ + φ(x)^T S_N φ(x).
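A NumPy sketch of these update formulas, assuming a zero-mean isotropic prior (m_0 = 0, S_0 = α⁻¹I) and polynomial basis functions; the values of α, β and the data are illustrative:

import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 1, 20)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 20)

alpha, beta, M = 2.0, 25.0, 6                       # prior precision, noise precision, nb of features
phi = lambda x: np.vander(np.atleast_1d(x), M, increasing=True)   # phi_j(x) = x**j

Phi = phi(x)
S0_inv = alpha * np.eye(M)                          # S0 = alpha^{-1} I, m0 = 0
SN = np.linalg.inv(S0_inv + beta * Phi.T @ Phi)     # S_N^{-1} = S_0^{-1} + beta Phi^T Phi
mN = SN @ (beta * Phi.T @ t)                        # m_N = S_N (S_0^{-1} m_0 + beta Phi^T t)

x_new = 0.5
ph = phi(x_new).ravel()
mean = ph @ mN                                      # posterior predictive mean
var = 1 / beta + ph @ SN @ ph                       # sigma_N^2(x) = beta^{-1} + phi(x)^T S_N phi(x)
print(mean, var)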
Gaussian Processes
Thus,
E[y(x, w)|t] = Σ_{n=1}^{N} k(x, x_n) t_n,
Moreover,
Cov[y(x, w), y(x′, w)|t] = β⁻¹ k(x, x′).
Examples of kernels:
Linear kernel: k(x, x′) := x^T x′ + c;
Polynomial kernel: k(x, x′) := (x^T x′ + c)^M;
Gaussian kernel (RBF): k(x, x′) := exp(−‖x − x′‖² / (2s²)).
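The same three kernels written as Python functions (the hyperparameters c, M and s are free choices):

import numpy as np

def linear_kernel(x, xp, c=1.0):
    return x @ xp + c

def polynomial_kernel(x, xp, c=1.0, M=3):
    return (x @ xp + c) ** M

def rbf_kernel(x, xp, s=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2 * s ** 2))

x, xp = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(x, xp), polynomial_kernel(x, xp), rbf_kernel(x, xp))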
→ characterized by
mean: m(x) = E[Y(x)]
covariance (kernel): k(x, x′) = Cov(Y(x), Y(x′))
Here,
t = t(x) = y(x) + ε(x),
where
y(·) is a GP with mean zero and covariance function k₀(x, x′) (chosen kernel);
ε(·) is an independent white noise with distribution N(0, β⁻¹).
Then, (t(x))_{x∈R^D} is a GP with mean zero and covariance function k(x, x′) = k₀(x, x′) + β⁻¹ 1_{x=x′}.
Remark.
In Bayesian linear regression: prior on the parameter w as a random variable.
In (Bayesian) GP regression: prior on the function y(·) as a stochastic process.
Set
K_N := (k(x_n, x_m))_{n,m=1...N}.
Clearly, the train-test data t_{N+1} = (t_N, t_{N+1})^T have joint distribution p(t_{N+1}) = N(t_{N+1} | 0, K_{N+1}), with
K_{N+1} = [[K_N, k_{N,N+1}], [k_{N,N+1}^T, k_{N+1}]].
µ = (µ_a^T, µ_b^T)^T;
Σ^{−1} = Λ := [[Λ_aa, Λ_ab], [Λ_ba, Λ_bb]]
(with the sizes of all the above blocks matching those of x_a and x_b).
Then,
p(x_a | x_b) = N(x_a | µ_{a|b}, Λ_aa^{−1}),
where
µ_{a|b} := µ_a − Λ_aa^{−1} Λ_ab (x_b − µ_b).
In particular, the conditional mean µ_{a|b} is linear with respect to x_b.
m(x_{N+1}) = k_{N,N+1}^T K_N^{−1} t_N;
σ²(x_{N+1}) = k_{N+1} − k_{N,N+1}^T K_N^{−1} k_{N,N+1}.
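A compact NumPy sketch of GP regression with these formulas, using an RBF kernel k₀ and a noise precision β chosen for illustration:

import numpy as np

rng = np.random.default_rng(5)
X = rng.uniform(0, 1, 12)                        # training inputs x_1..x_N
t = np.sin(2 * np.pi * X) + rng.normal(0, 0.1, 12)
beta, s = 100.0, 0.15                            # noise precision and RBF length-scale

def k0(a, b):                                    # chosen kernel k0(x, x')
    return np.exp(-(a - b) ** 2 / (2 * s ** 2))

# K_N includes the noise term: k(x, x') = k0(x, x') + beta^{-1} 1_{x = x'}
KN = k0(X[:, None], X[None, :]) + np.eye(len(X)) / beta

def predict(x_new):
    kv = k0(X, x_new)                            # k_{N,N+1}
    KN_inv_kv = np.linalg.solve(KN, kv)
    mean = KN_inv_kv @ t                         # m(x_{N+1}) = k^T K_N^{-1} t_N  (K_N is symmetric)
    var = k0(x_new, x_new) + 1 / beta - kv @ KN_inv_kv   # sigma^2(x_{N+1}) = k_{N+1} - k^T K_N^{-1} k
    return mean, var

print(predict(0.3))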
Remarks.
A 95%-confidence interval for the test target is given by
[m(x_{N+1}) − 2σ(x_{N+1}); m(x_{N+1}) + 2σ(x_{N+1})].
Figure: Notice how the posterior mean and variance are updated compared
to the prior ones
Check that
k(u, v) = φ̃^T(u) φ̃(v),
with
φ̃(u) = (u₁², u₂², 1, √2 u₁, √2 u₂, √2 u₁u₂)^T ∈ R⁶ !
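A quick numerical check of this identity, assuming the kernel in question is k(u, v) = (u^T v + 1)² on R² (the kernel this feature map corresponds to):

import numpy as np

def k(u, v):
    return (u @ v + 1) ** 2          # polynomial kernel (u^T v + 1)^2 on R^2

def phi(u):
    u1, u2 = u
    return np.array([u1**2, u2**2, 1.0,
                     np.sqrt(2)*u1, np.sqrt(2)*u2, np.sqrt(2)*u1*u2])

u, v = np.array([1.0, 2.0]), np.array([-0.5, 3.0])
print(k(u, v), phi(u) @ phi(v))      # the two values coincide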
k(x, x′) := exp{ −(1/2) Σ_{i=1}^{D} η_i (x^(i) − x′^(i))² }.
If η_i is small, it means that the component x^(i) has little effect on the predictive distribution. Thus, we can determine the relative importance ("relevance") of the input variables only from the data ("automatically").
Linear Models for Classification
i.e. decide (x ∈ C₁) if Σ_{j=0}^{M−1} w_j φ_j(x) > 0 (else x ∈ C₀).
• For all data x_n ∈ R^d and (t_n) ∈ {0, 1}, the likelihood is given by
p(t | X, w) = Π_{n=1}^{N} p(t_n | x_n, w) = Π_{n=1}^{N} y_n^{t_n} (1 − y_n)^{1−t_n},
with y_n := σ(w^T x_n).
• The negative log-likelihood (cross-entropy error) is
E(w) = −log p(t | X, w) = −Σ_{n=1}^{N} { t_n log σ(w^T x_n) + (1 − t_n) log(1 − σ(w^T x_n)) }.
• ∂/∂w {t log σ(wx)} = t x σ′(wx) / σ(wx) = t x (1 − σ(wx));
• ∂/∂w {(1 − t) log(1 − σ(wx))} = −(1 − t) x σ′(wx) / (1 − σ(wx)) = −(1 − t) x σ(wx);
Then,
∂/∂w {t log σ(wx) + (1 − t) log(1 − σ(wx))} = t x − x σ(wx) = −(y − t) x.
Thus, the gradient is ∇E(w) = Σ_{n=1}^{N} (y_n − t_n) x_n.
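A minimal gradient-descent sketch for logistic regression built on this gradient (the synthetic data and the learning rate are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(6)
N = 200
X = np.c_[rng.normal(0, 1, (N, 2)), np.ones(N)]      # features with a constant column for the bias
true_w = np.array([2.0, -1.0, 0.5])
t = (X @ true_w + rng.normal(0, 0.5, N) > 0).astype(float)

sigma = lambda a: 1 / (1 + np.exp(-a))

w = np.zeros(3)
lr = 0.1
for _ in range(500):
    y = sigma(X @ w)
    grad = X.T @ (y - t)             # ∇E(w) = Σ_n (y_n − t_n) x_n
    w -= lr * grad / N
print(w)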
p(C_k | x) = y^(k)(x, W) := exp(a^(k)) / Σ_{l=1}^{K} exp(a^(l)) =: "softmax function",
where
a^(k) = a(w^(k), x) := w^(k)T φ = Σ_{j=0}^{M−1} w_j^(k) φ_j(x).
t = (0, . . . , 0, 1, 0, . . . , 0),
where 1 is in the k-th position if the class is C_k.
The cross-entropy error is then
E(W) = −Σ_{n=1}^{N} Σ_{k=1}^{K} t_n^(k) log [ exp(a(w^(k), x_n)) / Σ_{l=1}^{K} exp(a(w^(l), x_n)) ],
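The softmax and this cross-entropy error written directly in NumPy; the scores a^(k) and one-hot targets below are made-up toy values:

import numpy as np

def softmax(A):
    A = A - A.max(axis=1, keepdims=True)           # numerical stabilization
    e = np.exp(A)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(T, A):
    # E(W) = − Σ_n Σ_k t_n^(k) log softmax(a_n)_k
    return -np.sum(T * np.log(softmax(A)))

A = np.array([[2.0, 0.5, -1.0],                    # a^(k) = w^(k)T phi(x_n) for K = 3 classes
              [0.1, 0.2, 0.3]])
T = np.array([[1, 0, 0],                           # one-hot targets t_n
              [0, 0, 1]])
print(softmax(A), cross_entropy(T, A))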
with yn = σ(w T φn ).
"Laplace approximation":
p(w |t ) ≈ q(w ) := N (w |wMAP , SN ), (1)
where
wMAP := arg max log p(w |t );
w
N
SN−1 := −∇∇ log p(w |t ) = S0−1 + yn (1 − yn )φn φTn .
X
n=1
Let a := w^T φ. We get
p(a|t) ≈ N(a | µ_a, σ_a²),
where
µ_a := w_MAP^T φ;
σ_a² := φ^T S_N φ.
p(C₁ | φ, t) = E[σ(a)|t]
= ∫ σ(a) p(a|t) da
≈ ∫ σ(a) N(a | µ_a, σ_a²) da,
We take σ(a(x)) as a model for p(t = 1|x), with a(·) a Gaussian process (and σ(·) the sigmoid function).
Then t ∈ {0, 1} has Bernoulli distribution
p(t|a) = σ(a)^t (1 − σ(a))^{1−t}.
By Bayes formula,
p(C₁|x) = p(x|C₁)p(C₁) / p(x) = p(x|C₁)p(C₁) / [p(x|C₁)p(C₁) + p(x|C₂)p(C₂)].
If we set
a(x) := log [ p(x|C₁)p(C₁) / (p(x|C₂)p(C₂)) ],
we get
p(C₁|x) = 1 / (1 + exp(−a(x))) = σ(a(x)).
Computing a(x) from the Gaussians p(x|C₁) and p(x|C₂) gives
a(x) = w^T x + w₀,
where
w = Σ^{−1}(µ₁ − µ₂);
w₀ = −(1/2) µ₁^T Σ^{−1} µ₁ + (1/2) µ₂^T Σ^{−1} µ₂ + log[p(C₁)/p(C₂)].
Thus,
p(C1 |x ) = σ(a(x )) = σ(w T x + w0 ),
which is... the logistic regression model!
For K ≥ 2:
Similarly, we get
a_k(x) := log(p(x|C_k)p(C_k)) = w_k^T x + w_{k0} + c(x).
The decision boundaries are then the hyperplanes
{(w_k − w_j)^T x + (w_{k0} − w_{j0}) = 0}, j, k = 1...K,
Remark.
If the class-covariance matrices are all the same (Σ), the decision boundaries are linear (LDA).
If Σ = I (independent, equal-variance features within each class), this is "Naive Bayes".
If the K classes have different covariance matrices (Σ₁, . . . , Σ_K), the decision boundaries become quadratic with respect to x (QDA), i.e. of the form x^T A x + b^T x + c = 0.
Other Models
Figure: K-means
Entropy:
H = −Σ_i p_i log₂(p_i);
Information gain:
IG(split) = Entropy before split − avg Entropy after split
= H(parent node) − EH(children).
Example.
30 students;
Two input variables: Gender ("Boy" / "Girl") and Classroom ("IX" / "X");
Output: playing cricket ("Yes" / "No");
Observation numbers for the different splits are in the Figure below:
Figure: Two different splits: which to choose?? (which has higher IG?)
Parent node = All students, both playing cricket and not playing;
H (parent node) = -[15/30*log2 (15/30) + 15/30*log2 (15/30)] = 1;
IG(Gender split) = ?
= H (parent node) - EH (Gender split);
IG(Classroom split) = ?
= H (parent node) - EH (Classroom split);
Only the metric changes: replace entropy H by the Gini index (or impurity)
G := 1 − Σ_i p_i²,
Gini gain:
GG(split) = Gini before split − avg Gini after split
= G(parent node) − EG(children).
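A small helper sketch for these criteria; the child counts below are hypothetical stand-ins, since the actual numbers of the example are in the (omitted) figure:

import numpy as np

def entropy(counts):
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gini(counts):
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return 1 - np.sum(p ** 2)

def gain(parent_counts, children_counts, metric):
    n = sum(sum(c) for c in children_counts)
    avg_child = sum(sum(c) / n * metric(c) for c in children_counts)  # weighted (expected) child impurity
    return metric(parent_counts) - avg_child

parent = [15, 15]                      # 15 play cricket, 15 do not (as on the slide)
split = [[9, 6], [6, 9]]               # hypothetical child counts for one split
print(gain(parent, split, entropy), gain(parent, split, gini))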
Example. (idem)
Parent node = All students, play cricket and don't play;
G (parent node) = 1 − [(15/30)² + (15/30)²] = 0.5;
GG(Gender split) = ?
= G (parent node) - EG (Gender split);
GG(Classroom split) = ?
= G (parent node) - EG (Classroom split);
3 - Random forests
Figure: SVM
Hard-margin SVM:
min_{w,b} (1/2)‖w‖²
such that: t_i(w^T x_i + b) ≥ 1, ∀i = 1...N
H(w, b) := (1/N) Σ_{i=1}^{N} max{1 − t_i(w^T x_i + b), 0},
s.t. H(w, b) = 0.
Soft-margin SVM:
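A rough subgradient-descent sketch of the usual soft-margin objective, min_{w,b} ½‖w‖² + C Σ_i max{0, 1 − t_i(w^T x_i + b)} (the data, C and the step size are illustrative assumptions; in practice a dedicated solver such as sklearn.svm.SVC would be used):

import numpy as np

rng = np.random.default_rng(7)
N = 100
X = np.r_[rng.normal(2, 1, (N // 2, 2)), rng.normal(-2, 1, (N // 2, 2))]
t = np.r_[np.ones(N // 2), -np.ones(N // 2)]     # labels in {+1, -1}

C, lr = 1.0, 0.01
w, b = np.zeros(2), 0.0
for _ in range(1000):
    margins = t * (X @ w + b)
    viol = margins < 1                           # points with non-zero hinge loss
    # subgradient of 0.5*||w||^2 + C * sum_i max(0, 1 - t_i(w^T x_i + b))
    gw = w - C * (t[viol, None] * X[viol]).sum(axis=0)
    gb = -C * t[viol].sum()
    w -= lr * gw
    b -= lr * gb
print(w, b)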
Remark.
Figure: Different classification losses when t = ±1: hinge for SVM, logit for logistic regression, and misclassification (zero-one) for accuracy
Duality
min_{w,b} (1/2)‖w‖²
s.t. 1 ≤ t_i(w^T x_i + b), ∀i = 1...N
Lagrangian:
L(w, b, α) = (1/2)‖w‖² + Σ_{i=1}^{N} α_i [1 − t_i(w^T x_i + b)]; α_i ≥ 0.
∂L/∂b = −Σ_{i=1}^{N} α_i t_i = 0.
(Dual) max_{α∈R^N} { −(1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j t_i t_j x_i^T x_j + Σ_{i=1}^{N} α_i }
s.t. Σ_{i=1}^{N} α_i t_i = 0; (α_i) ≥ 0.
The optimal weights are w* = Σ_{i=1}^{N} α_i* t_i x_i.
Remarks.
Sparse solution, since α_i* = 0 on non-support vectors (outside the margin) ⇒ removing them will not change the solution!
Kernel method
max_{α∈R^N} { −(1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j t_i t_j φ(x_i)^T φ(x_j) + Σ_{i=1}^{N} α_i }
s.t. Σ_{i=1}^{N} α_i t_i = 0; (α_i) ≥ 0.
The function k(x, x′) := φ(x)^T φ(x′) is called kernel.
Reminder:
For linear regression,
y(x, w*) = w*^T x = Σ_{n=1}^{N} k(x, x_n) t_n,
for some kernel k(·, ·): the prediction for input x is a weighted sum of the data (t_n), giving more weight to those t_n for which x_n are closer to x.
In Gaussian Processes (GP) regression, the idea is to assume a nonparametric model y(x) as a GP with covariance function equal to some chosen kernel k(x, x′).
Other Models
Examples of kernels:
Linear kernels: k(x , x 0 ) := x T x 0 + c ;
Polynomial kernel: k(x , x 0 ) := (x T x 0 + c)M ;
Gaussian kernel (RBF): k(x , x 0 ) := exp(− kx −2sx2 k );
0 2
Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 194 / 250
Other Models
Remarks.
With kernels: no need to compute either the features φ(·) or the weights w!
Mercer's theorem states that kernels implicitly represent features "k(x, x′) = φ̃^T(x) φ̃(x′)", for some possibly infinite-dimensional φ̃!
Good for nonlinearly separable data.
We control overfitting by choosing the (hyper)parameters of the kernel (e.g. the scale s of the Gaussian kernel).
Neural Networks
Reminder
Linear Regression:
y(x, w) = Σ_{j=0}^{M−1} w_j φ_j(x) = w^T φ(x).
Logistic regression:
y_k(x, w) := p(C_k|x) := σ(w_k^T φ(x)).
Examples
sup_{x∈K} ‖y(x, w_ε) − f(x)‖ ≤ ε.
Error/loss L = E(w) = (1/N) Σ_{n=1}^{N} E_n(w): → to be minimized (w*).
Gradient descent produces iterates w_0, w_1, . . . , w_K ≃ w*.
The initialization w_0 is often random.
Remark
1 See for example: https://towardsdatascience.com/10-gradient-descent-optimisation-algorithms-86989510b5e9
SGD
w_{k+1} = w_k − α ∂L/∂w_k
Momentum
w_{k+1} = w_k − α m_k
m_k = β m_{k−1} + (1 − β) ∂L/∂w_k
Adagrad
w_{k+1} = w_k − α/√(v_k + ε) · ∂L/∂w_k
v_k = v_{k−1} + (∂L/∂w_k)²
RMSprop
w_{k+1} = w_k − α/√(v_k + ε) · ∂L/∂w_k
v_k = β v_{k−1} + (1 − β) (∂L/∂w_k)²
Adadelta
w_{k+1} = w_k − √(D_{k−1} + ε)/√(v_k + ε) · ∂L/∂w_k
D_k = β D_{k−1} + (1 − β) [Δw_k]²
v_k = β v_{k−1} + (1 − β) (∂L/∂w_k)²
Nesterov
w_{k+1} = w_k − α m_k
m_k = β m_{k−1} + (1 − β) ∂L/∂w*
w* = w_k − α m_{k−1}
Adam
w_{k+1} = w_k − α/(√v̂_k + ε) · m̂_k
m̂_k = m_k / (1 − β₁^k);  v̂_k = v_k / (1 − β₂^k)
m_k = β₁ m_{k−1} + (1 − β₁) ∂L/∂w_k
v_k = β₂ v_{k−1} + (1 − β₂) (∂L/∂w_k)²
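The SGD, Momentum and Adam updates above, written as plain NumPy update rules and run on a toy one-dimensional quadratic loss (the loss, α and the β's are illustrative choices):

import numpy as np

grad = lambda w: 2 * (w - 3.0)        # gradient of the toy loss L(w) = (w - 3)^2

def sgd(w, g, state, alpha=0.1):
    return w - alpha * g, state

def momentum(w, g, m, alpha=0.1, beta=0.9):
    m = beta * m + (1 - beta) * g
    return w - alpha * m, m

def adam(w, g, state, alpha=0.1, b1=0.9, b2=0.999, eps=1e-8):
    m, v, k = state
    k += 1
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat, v_hat = m / (1 - b1 ** k), v / (1 - b2 ** k)   # bias-corrected moments
    return w - alpha * m_hat / (np.sqrt(v_hat) + eps), (m, v, k)

for step, state in [(sgd, None), (momentum, 0.0), (adam, (0.0, 0.0, 0))]:
    w = 0.0
    for _ in range(200):
        w, state = step(w, grad(w), state)
    print(step.__name__, w)           # all approach the minimizer w* = 3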
Remark
For losses that are convex but nondifferentiable (e.g. lasso penalty, hinge loss, ...), the gradient is replaced by a subgradient, giving the proximal gradient methods.
Remark
For non-convex losses (often with ANNs!), gradient descent algorithms may get stuck at local minima:
A toy example: a network with scalar input x, two layers with weights w^(1), w^(2) and activations h^(1), h^(2) (computed below), and a loss
E = (y − t)².
Computing y and E:
a^(1) = w^(1) x;
z^(1) = h^(1)(a^(1));
a^(2) = w^(2) z^(1);
y = z^(2) = h^(2)(a^(2));
E = (y − t)².
Computing ∇y and ∇E:
∂E/∂w^(l) = 2(y − t) ∂y/∂w^(l);
∂y/∂w^(2) = h^(2)′(a^(2)) z^(1);
∂y/∂w^(1) = [h^(2)′(a^(2)) w^(2)] · h^(1)′(a^(1)) x,   where the bracketed factor is already computed from layer 2.
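The same toy network in NumPy, taking h^(1) = tanh and h^(2) = identity as illustrative activations, and checking the backpropagated gradients against finite differences:

import numpy as np

h1, h1p = np.tanh, lambda a: 1 - np.tanh(a) ** 2     # h^(1) and its derivative
h2, h2p = (lambda a: a), (lambda a: 1.0)             # h^(2) = identity

def forward(w1, w2, x):
    a1 = w1 * x; z1 = h1(a1)
    a2 = w2 * z1; y = h2(a2)
    return a1, z1, a2, y

def grads(w1, w2, x, t):
    a1, z1, a2, y = forward(w1, w2, x)
    dE_dy = 2 * (y - t)                              # dE/dy for E = (y - t)^2
    dy_dw2 = h2p(a2) * z1
    dy_dw1 = h2p(a2) * w2 * h1p(a1) * x              # reuses the factor from layer 2
    return dE_dy * dy_dw1, dE_dy * dy_dw2

w1, w2, x, t = 0.7, -1.3, 0.5, 0.2
g1, g2 = grads(w1, w2, x, t)

eps = 1e-6                                           # finite-difference check
E = lambda w1, w2: (forward(w1, w2, x)[-1] - t) ** 2
print(g1, (E(w1 + eps, w2) - E(w1 - eps, w2)) / (2 * eps))
print(g2, (E(w1, w2 + eps) - E(w1, w2 - eps)) / (2 * eps))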
z_j^(l) = h^(l)(a_j^(l)),   l = 1...L;
a_j^(l) = Σ_i w_ji^(l) z_i^(l−1)   (2)
= Σ_i w_ji^(l) h^(l−1)(a_i^(l−1)),   l = 1...L;   (3)
y_nj = z_j^(L) = h^(L)(a_j^(L)).
We define
δ_j^(l) := ∂E_n / ∂a_j^(l).   (4)
Now,
∂E_n / ∂w_ji^(l) = (∂E_n / ∂a_j^(l)) (∂a_j^(l) / ∂w_ji^(l)).
Then, using (4) and (2), we get, for l = 1...L,
∂E_n / ∂w_ji^(l) = δ_j^(l) z_i^(l−1).   (5)
In conclusion:
Backpropagation: from last layer to first layer,
δ_j^(L) = ∂E_n / ∂a_j^(L);
δ_j^(l) = h^(l)′(a_j^(l)) Σ_k w_kj^(l+1) δ_k^(l+1),   l = L−1...1.
And,
∂E_n / ∂w_ji^(l) = δ_j^(l) z_i^(l−1).
Remarks
Common ways to regularize neural networks:
ridge/lasso/elastic net;
early stopping: stop at the number of iterations (epochs) from which the test error does not improve or starts to increase;
dropout: remove some units at random when training;
data augmentation;
tangent propagation (e.g. Tikhonov);
label smoothing ("{0,1} −→ {0 + rand ε, 1 − rand ε}");
...
To keep in "deep mind"!
Summary
1 error = loss = L(t, y(X, w)) (+ regul.);
2 minimize this loss with respect to the parameters w.
To control under-/overfitting:
data preparation: removing outliers and duplicates, filling in missing data, feature scaling (standardization/normalization), feature engineering/data mining, ...
tuning of the model: choice of the algorithm and its hyperparameters
tuning of the loss: choice of the loss, the regularization, the optimization algorithm and its hyperparameters
For a long time, this has been the story; but it's only half the story!
Then maybe you'll design your own algorithm that will be the best!
Figure: Fooling AI: after only some negligible noise, the same algorithm no longer sees a panda but a monkey!
Source: Explaining and Harnessing Adversarial Examples, Goodfellow et al., ICLR 2015
"Interpretability issue" in ML