
Machine Learning: Probabilistic Fundamentals

Azmi MAKHLOUF

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 1 / 250
These slides are for educational purposes only, and strictly meant for private usage; do not distribute!
Azmi Makhlouf

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 2 / 250
Some references

Christopher M. Bishop.
Pattern Recognition and Machine Learning.
Springer-Verlag, 2006.
Kevin P. Murphy.
Machine Learning: A Probabilistic Perspective.
MIT Press, 2013.
Trevor Hastie, Robert Tibshirani, and Jerome Friedman.
The Elements of Statistical Learning.
Springer Series in Statistics. Springer New York Inc., 2001.
Ian Goodfellow, Yoshua Bengio, and Aaron Courville.
Deep Learning.
MIT Press, 2016.
Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 3 / 250
Outline
1 Introduction
2 Example: Polynomial Curve Fitting
3 Probability Tools
4 Model Selection
5 Linear Models for Regression
6 Linear Models for Classification
7 Other Models
8 Neural Networks
9 To keep in "deep mind"!

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 4 / 250
Introduction

Outline
1 Introduction
2 Example: Polynomial Curve Fitting
3 Probability Tools
4 Model Selection
5 Linear Models for Regression
6 Linear Models for Classification
7 Other Models
8 Neural Networks
9 To keep in "deep mind"!

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 5 / 250
Introduction

Smells like IQ tests

Complete the following sequences:


1, 4, 9, 16, ?
1, 2, 3, 5, ?
1, 5, 19, 49, ?
Answers:

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 6 / 250
Introduction

Smells like IQ tests

Complete the following sequences:


1, 4, 9, 16, ?
1, 2, 3, 5, ?
1, 5, 19, 49, ?
Answers:
1, 4, 9, 16, 25 (x^2)

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 6 / 250
Introduction

Smells like IQ tests

Complete the following sequences:


1, 4, 9, 16, ?
1, 2, 3, 5, ?
1, 5, 19, 49, ?
Answers:
1, 4, 9, 16, 25 (x^2)
1, 2, 3, 5, 8 (Fibonacci)

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 6 / 250
Introduction

Smells like IQ tests

Complete the following sequences:


1, 4, 9, 16, ?
1, 2, 3, 5, ?
1, 5, 19, 49, ?
Answers:
1, 4, 9, 16, 25 (x^2)
1, 2, 3, 5, 8 (Fibonacci)
1, 5, 19, 49, 101 (x^3 − x^2 + 1)

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 6 / 250
Introduction

Back to "1, 4, 9, 16, ?": what if one answers 10 ??

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 7 / 250
Introduction

Back to "1, 4, 9, 16, ?": what if one answers 10 ??

→ also suitable! Take

P(x) := −(5/8) x^4 + (25/4) x^3 − (167/8) x^2 + (125/4) x − 15,

then P(1, 2, 3, 4, 5) = 1, 4, 9, 16, 10.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 7 / 250
Introduction

Back to "1, 4, 9, 16, ?": what if one answers 10 ??

→ also suitable! Take

P(x) := −(5/8) x^4 + (25/4) x^3 − (167/8) x^2 + (125/4) x − 15,

then P(1, 2, 3, 4, 5) = 1, 4, 9, 16, 10.

Patterns may be difficult to guess and not unique!
−→ errors, probabilities, ...

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 7 / 250
Introduction

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 8 / 250
Introduction

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 9 / 250
Introduction

ML

Machine Learning (ML) is the automatic discovery of patterns in data by algorithms, in order to predict or classify:

Find / model f(.) such that

input ↦ output/target

x ∈ R^D ↦ t = f(x) + noise ε

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 10 / 250
Introduction

The prediction procedure is called


1 regression when the output t takes values in a continuous set (R, typically);
2 classification when t takes discrete (finitely many) values (classes).

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 11 / 250
Introduction

Figure: Temperature prediction (Regression)

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 12 / 250
Introduction

Figure: MNIST handwritten digits recognition (Classification)

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 13 / 250
Introduction

Some Applications

computer vision (robots, self-driving cars, ...)
natural language processing (chatbots, translation, ...)
IT and Telecom (mobile apps, network automation, 5G infrastructure optimization, predictive maintenance, ...)
financial market analysis (optimal portfolio, efficient pricing, fraud detection, ...)
medical diagnosis
user behavior analytics (ad placement, ...)
...

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 14 / 250
Introduction

Learning types

Supervised learning: when each training input x_n is labeled by a known training output t_n.
Unsupervised learning deals with unlabeled input data → clustering/labeling based on similarity.
Reinforcement learning: goal-oriented learning based on interaction with an environment to achieve a reward.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 15 / 250
Introduction

Notation

input variable: x = (x^(1), . . . , x^(D))^T ∈ R^D;
D is the dimension of the data;
The components x^(1), . . . , x^(D) are called features;
output / target variable: t ∈ R;

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 16 / 250
Introduction

(training) data set: N input data;
N is the size of the data set;
The (training) input matrix is X := (x_1, . . . , x_N)^T ∈ R^{N×D}, whose n-th row contains the features of x_n:

    X = [ x_1^(1)  x_1^(2)  · · ·  x_1^(D)
          x_2^(1)  x_2^(2)  · · ·  x_2^(D)
            ...      ...     ...     ...
          x_N^(1)  x_N^(2)  · · ·  x_N^(D) ];

(training) targets / labels: t := (t_1, . . . , t_N)^T ∈ R^N.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 17 / 250
Introduction

Example: for the digits (MNIST) dataset:

we may have N = 50000 labeled images x1 , . . . , x50000 of handwritten


digits, labeled by t1 , . . . , t50000 ∈ {0, 1, . . . , 9}, that we use to train
our algorithm.
Each image xn ∈ RD is of dimension D = 784 (28 × 28-pixel each).

If x is a new image of a digit, we have to find the corresponding class t = t(x) among {0, 1, . . . , 9}.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 18 / 250
Introduction

General procedure in ML

The task of ML is to build an algorithm that predicts


(generalizes), for any given new data x , its corresponding target t .

This prediction is usually done through an estimation of the actual target t = t(x) = f(x) + ε by a function (model)

y(x) = y(x, w),

where w = (w_1, . . . , w_M)^T ∈ R^M is a vector of weights (parameters) that have to be determined from the training data.
We make a good fit when

y(x, w) ≈ f(x).

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 19 / 250
Introduction

As we will see, the algorithm often consists of a minimization of an error (or loss, or cost) function, representing a certain distance between the output data t and a theoretical model y(x, w) for them:

1 error = loss = loss(t, y(X, w));

2 w* = arg min_w loss;

3 t̂(x) := y(x, w*), the prediction for t(x).

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 20 / 250
Introduction

Practical tools you might use

Python, on Jupyter Notebook, either


offline −→ download Anaconda (www.anaconda.com)
online −→ Google Colab, with free GPU
(colab.research.google.com)
Kaggle: a platform for datasets, notebooks (editing and
running) and competitions (www.kaggle.com)

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 21 / 250
Introduction

scikit-learn: a machine learning library in Python, with many methods for regression and classification (https://scikit-learn.org/stable/).
pymc-learn: another ML Python library, with more advanced probabilistic models (http://docs.pymc-learn.org/).

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 22 / 250
Example: Polynomial Curve Fitting

Outline
1 Introduction
2 Example: Polynomial Curve Fitting
3 Probability Tools
4 Model Selection
5 Linear Models for Regression
6 Linear Models for Classification
7 Other Models
8 Neural Networks
9 To keep in "deep mind"!

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 23 / 250
Example: Polynomial Curve Fitting

We have the following data (x_n, t_n)_{n=1...N} (generated according to a noisy sinusoid t = sin(2πx) + ε = f(x) + ε, unknown to you!).

For a new x̂, you have to predict "the" corresponding target t̂.
Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 24 / 250
Example: Polynomial Curve Fitting

Let's choose a prediction function (model) y(x) as a polynomial of (fixed) degree M with respect to the input x:

y(x) = y(x, w) := Σ_{j=0}^{M} w_j x^j,

which is a linear model with respect to the coefficients w = (w_j)_j to be determined (this is called linear regression).

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 25 / 250
Example: Polynomial Curve Fitting

We also choose an error/loss function:

E(w) = (1/N) Σ_{n=1}^{N} (y(x_n, w) − t_n)^2,

called mean square error (MSE).

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 26 / 250
Example: Polynomial Curve Fitting

We also choose an error/loss function:

E(w) = (1/N) Σ_{n=1}^{N} (y(x_n, w) − t_n)^2,

called mean square error (MSE).

We determine the vector of optimal weights w* by minimizing the error E:

w* := arg min_w E(w).

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 26 / 250
Example: Polynomial Curve Fitting

This means that w* is the vector of coefficients of the polynomial of degree M that fits best (in the mean square sense) to the training data.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 27 / 250
Example: Polynomial Curve Fitting

This means that w* is the vector of coefficients of the polynomial of degree M that fits best (in the mean square sense) to the training data.
Then, the predicted target for any new input x̂ is

t̂ := y(x̂, w*).

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 27 / 250
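A minimal NumPy sketch of this fitting procedure (illustrative only: the synthetic sinusoidal data, the degree M and the variable names are assumptions, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: noisy sinusoid t = sin(2*pi*x) + eps (the "unknown" f)
N = 50
x = rng.uniform(0.0, 1.0, size=N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=N)

# Fit a degree-M polynomial by least squares (this minimizes the MSE in w)
M = 5
w_star = np.polyfit(x, t, deg=M)   # optimal coefficients w*
predict = np.poly1d(w_star)        # the model y(., w*)

# Prediction for a new input x_hat
x_hat = 0.3
t_hat = predict(x_hat)
print(f"t_hat = {t_hat:.3f}, true f(x_hat) = {np.sin(2*np.pi*x_hat):.3f}")
```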
Example: Polynomial Curve Fitting

Figure: A polynomial fit of degree 1 (with 50 data points).

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 28 / 250
Example: Polynomial Curve Fitting

Figure: A polynomial fit of degree 21 (with 50 data points).

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 29 / 250
Example: Polynomial Curve Fitting

Training error: error on the training data (x_n, t_n)_{n=1...N}:

E^train(w*) = E(w*) = (1/N) Σ_{n=1}^{N} (y(x_n, w*) − t_n)^2.

Test error: error on new test data (x_n^test, t_n^test)_{n=1...N^test} (e.g. N^test = 50 other data points):

E^test(w*) = (1/N^test) Σ_{n=1}^{N^test} (y(x_n^test, w*) − t_n^test)^2.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 30 / 250
Example: Polynomial Curve Fitting

Figure: Training error and test error for different degrees of the fitting polynomial.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 31 / 250
Example: Polynomial Curve Fitting

Figure: A polynomial fit of degree 5 (with 50 data points)

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 32 / 250
Example: Polynomial Curve Fitting

The training error E(w*) decreases with the degree M of the fitting polynomial: by increasing M, we have more ability to get closer to the training data (x_n, t_n)_{n=1...N}.

A too low degree does not sufficiently capture the complexity of the data: this is called underfitting.
Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 33 / 250
Example: Polynomial Curve Fitting

We see that the test error first decreases with respect to the degree M until M = 5, after which it starts to increase!

A fitting polynomial with a too high degree "sticks" too much to the training data, so that it will be difficult for it to generalize to new data: this is called overfitting.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 34 / 250
Example: Polynomial Curve Fitting

Figure: Underfitting of parabolic data by a straight line model

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 35 / 250
Example: Polynomial Curve Fitting

Figure: Overfitting of straight line data by a high-degree polynomial

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 36 / 250
Example: Polynomial Curve Fitting

Some other losses (for regression)


RMSE (Root Mean Square Error):

RMSE = √MSE = sqrt( (1/N) Σ_{n=1}^{N} (y_n − t_n)^2 );

MAE (Mean Absolute Error):

MAE = (1/N) Σ_{n=1}^{N} |y_n − t_n|;

MAPE (Mean Absolute Percentage Error):

MAPE = (1/N) Σ_{n=1}^{N} |(y_n − t_n) / t_n|.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 37 / 250
Example: Polynomial Curve Fitting

Regularization
A usual technique to reduce overfitting is called regularization, which consists of adding a penalty to the error function.
Typical examples:
L2 regularization (or "Ridge"): penalty on ‖w‖_2:

Ẽ(w) := E(w) + λ ‖w‖_2^2 = E(w) + λ Σ_{j=1}^{M} w_j^2.

L1 regularization (or "Lasso"): penalty on ‖w‖_1:

Ẽ(w) := E(w) + λ ‖w‖_1 = E(w) + λ Σ_{j=1}^{M} |w_j|.

Both aim at reducing the parameters (w_j) (this is shrinkage or weight decay).
Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 38 / 250
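A hedged scikit-learn illustration of these two penalties (the polynomial degree, data and penalty value are placeholders; note that scikit-learn names the regularization strength `alpha`, playing the role of λ here):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=(50, 1))
t = np.sin(2 * np.pi * x).ravel() + rng.normal(scale=0.2, size=50)

# A high-degree polynomial prone to overfitting, tamed by the penalty lambda (alpha)
for Model in (Ridge, Lasso):
    model = make_pipeline(PolynomialFeatures(degree=9),
                          Model(alpha=1e-3, max_iter=50000))
    model.fit(x, t)
    coefs = model.named_steps[Model.__name__.lower()].coef_
    print(Model.__name__, "near-zero coefficients:",
          int(np.sum(np.abs(coefs) < 1e-6)))   # Lasso tends to zero out more of them
```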
Example: Polynomial Curve Fitting

Remark

Figure: Ridge vs Lasso (blue contours are those of the unregularized quadratic error; left disc: ‖w‖_2 ≤ c; right diamond: ‖w‖_1 ≤ c): Lasso tends to shrink some components of the optimal w* to 0 ("feature selection").
Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 39 / 250
Example: Polynomial Curve Fitting

Remark

In general, to control under-/overfitting:

data preparation: removing outliers and duplicates, filling in missing data, feature scaling (standardization/normalization), feature engineering/data mining, ...
tuning of the model: choice of the algorithm and its hyperparameters
tuning of the loss: choice of the loss, the regularization, the optimization algorithm and its hyperparameters

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 40 / 250
Probability Tools

Outline
1 Introduction
2 Example: Polynomial Curve Fitting
3 Probability Tools
4 Model Selection
5 Linear Models for Regression
6 Linear Models for Classification
7 Other Models
8 Neural Networks
9 To keep in "deep mind"!

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 41 / 250
Probability Tools

Sum rule (marginalization):

p(x) = Σ_y p(x, y), or p(x) = ∫ p(x, y) dy.

Product rule:
p(x, y) = p(x|y) p(y).
Independence: x and y are independent iff p(x, y) = p(x) p(y).
Bayes rule:

p(y|x) = p(x|y) p(y) / p(x),

which links the posterior probability p(y|x) to the prior probability p(y).

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 42 / 250
Probability Tools

Expectation/average/mean:

E[ϕ(x)] = Σ_x ϕ(x) p(x),

or

E[ϕ(x)] = ∫ ϕ(x) p(x) dx.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 43 / 250
Probability Tools

The variance of x:
V(x) := E[(x − E(x))^2] = E(x^2) − E(x)^2.

The standard deviation of x:
σ(x) := √V(x).

The covariance between x and y:
Cov(x, y) := E[(x − E(x))(y − E(y))] = E(xy) − E(x)E(y).

The correlation coefficient between x and y:
ρ(x, y) := Cov(x, y) / (σ(x) σ(y)).

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 44 / 250
Probability Tools

A fundamental example

Gaussian/Normal distribution of mean µ and variance σ^2:

N(x | µ, σ^2) := (1 / √(2πσ^2)) exp( −(x − µ)^2 / (2σ^2) ).

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 45 / 250
Probability Tools

Figure: Two Gaussian/Normal densities N (x|µ, σ 2 )

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 46 / 250
Probability Tools

Figure: Bivariate Gaussian N (x |µ, Σ) data with (from left to right)


identity, diagonal and general covariance matrices Σ

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 47 / 250
Probability Tools

Curve tting from a probabilistic point of view

First approach: Frequentist maximum likelihood approach:

We suppose that the noise

ε ∼ N(·|0, σ^2) = N(·|0, β^{-1}),

with σ^2 = β^{-1} and β := precision.

Since

t ≈ y(x, w) + ε = average model + white noise,

then

p(t | x, w, β) := N(t | y(x, w), β^{-1}).
Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 48 / 250
Probability Tools

Thus, the likelihood (jointly for all training data, assumed independent) is given by

p(t|X, w, β) = Π_{n=1}^{N} N(t_n | y(x_n, w), β^{-1}).

Then, maximum likelihood (ML) amounts to maximizing the likelihood, i.e.

min_w (− log p(t|X, w, β))
= min_w ( − Σ_{n=1}^{N} log N(t_n | y(x_n, w), β^{-1}) )
= min_w ( − Σ_{n=1}^{N} log( (1/√(2πβ^{-1})) exp( −(t_n − y(x_n, w))^2 / (2β^{-1}) ) ) )

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 49 / 250
Probability Tools

= min_w ( (β/2) Σ_{n=1}^{N} (y(x_n, w) − t_n)^2 − (N/2) log(β) + (N/2) log(2π) ),

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 50 / 250
Probability Tools

= min_w ( (β/2) Σ_{n=1}^{N} (y(x_n, w) − t_n)^2 − (N/2) log(β) + (N/2) log(2π) ),

which is exactly the least-squares problem! This gives us a minimizer w_ML.
We can also minimize with respect to β to obtain

β_ML^{-1} = (1/N) Σ_{n=1}^{N} (y(x_n, w_ML) − t_n)^2.

Finally, the predictive distribution of t from x is

p(t|x, w_ML, β_ML) = N(t | y(x, w_ML), β_ML^{-1}).

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 50 / 250
Probability Tools

Second approach: Bayesian approach:

In the above (frequentist) approach, the parameter w is supposed deterministic a priori.
In the Bayesian approach, it is supposed non-deterministic and we put a prior distribution on it.
Let us suppose here that a prior for w, with a constant hyperparameter α (precision), is given by

p(w|α) = N(w|0, α^{-1} I) = (α/(2π))^{(M+1)/2} exp{ −(α/2) ‖w‖_2^2 }

(with ‖w‖_2^2 = Σ_{j=1}^{M} w_j^2).

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 51 / 250
Probability Tools

By Bayes rule,
p(w|X, t, α, β) ∝ p(t|X, w, β) p(w|α).

The maximum a posteriori (MAP) estimate is then given by ("min −log")

min_w ( (β/2) Σ_{n=1}^{N} (y(x_n, w) − t_n)^2 + (α/2) ‖w‖_2^2 ),

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 52 / 250
Probability Tools

By Bayes rule,
p(w|X, t, α, β) ∝ p(t|X, w, β) p(w|α).

The maximum a posteriori (MAP) estimate is then given by ("min −log")

min_w ( (β/2) Σ_{n=1}^{N} (y(x_n, w) − t_n)^2 + (α/2) ‖w‖_2^2 ),

which is equivalent to a ridge regularization with a regularizing coefficient λ = α/β!

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 52 / 250
Probability Tools

For the predictive distribution, we use marginalization:

p(t|x, X, t) = ∫ p(t, w|x, X, t) dw
             = ∫ p(t|w, x, X, t) p(w|x, X, t) dw
             = ∫ p(t|x, w) p(w|X, t) dw.

⇒ Computable, since

p(t|x, w) = N(t | y(x, w), β^{-1}) (model);

p(w|X, t) = p(t|X, w) p(w) / ∫ p(t|X, w) p(w) dw (Bayes).

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 53 / 250
Probability Tools

Information Theory

Entropy point of view:

Let x be a discrete random variable. We would like to define an information function h(x) that represents an "amount of information" contained in x, or a "degree of surprise" on learning the value of x:

h(x) = − log p(x).

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 54 / 250
Probability Tools

The entropy of x is defined by the average amount of information in transmitting x, which is then

H(x) := E[h(x)] = − Σ_x p(x) log p(x).

If x is a continuous random variable, we analogously define its (differential) entropy by

H(x) := E[h(x)] = − ∫ p(x) log p(x) dx.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 55 / 250
Probability Tools

Maximum entropy:

• If x is a discrete random variable with M states, the only discrete finite distribution that has maximum entropy is the uniform one (which is intuitive: no value is privileged).

• If x is a continuous random variable, the only distribution that maximizes the entropy of x under the constraints E[x] = µ and Cov[x] = Σ is the Gaussian distribution N(x|µ, Σ).
This is a reason (together with the Central Limit Theorem) for the frequent use of this distribution as a model for continuous random data.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 56 / 250
Probability Tools

Relative entropy or KL-divergence:

The Kullback-Leibler divergence from a probability distribution q to another p is defined by

KL(p ‖ q) := E_p[− log q(x)] − E_p[− log p(x)] = E_p[− log (q(x)/p(x))].

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 57 / 250
Probability Tools

We can show (by convexity of − log) that KL(p ‖ q) ≥ 0, and that KL(p ‖ q) = 0 if and only if p = q.

The Kullback-Leibler divergence is a measure of dissimilarity between q and p (however, it is not symmetric!).

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 58 / 250
Probability Tools

NB.

KL(p_data ‖ q_model) = −(1/N) log-likelihood + const.

So, maximizing the log-likelihood amounts to minimizing the KL-divergence between the data and our model.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 59 / 250
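A small numerical check of these definitions in the discrete case (the distributions p and q below are arbitrary example values, not taken from the slides):

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])   # "data" distribution (example values)
q = np.array([0.4, 0.4, 0.2])   # "model" distribution (example values)

entropy_p = -np.sum(p * np.log(p))    # H(p)
kl_pq = np.sum(p * np.log(p / q))     # KL(p || q) >= 0
kl_qp = np.sum(q * np.log(q / p))     # differs from KL(p || q): not symmetric

print(entropy_p, kl_pq, kl_qp)
```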
Model Selection

Outline
1 Introduction
2 Example: Polynomial Curve Fitting
3 Probability Tools
4 Model Selection
5 Linear Models for Regression
6 Linear Models for Classification
7 Other Models
8 Neural Networks
9 To keep in "deep mind"!

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 60 / 250
Model Selection

The model complexity is measured by the number of its free parameters.

The square loss is a usual criterion for determining a learning model and comparing complexities.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 61 / 250
Model Selection

Changing the training data clearly alters the model weights w, and hence its loss.

−→ How does the average square loss change when we change the training data set?
−→ This is given by the "bias-variance decomposition".

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 62 / 250
Model Selection

The Bias-Variance Decomposition

t = f (x) + noise ε.
In theory, the best L2 -approximation of t given x is the regression
function
f (x ) = E[t|x ].
It is estimated by the model y := y (x , w ; D), based on an arbitrary
random data set D.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 63 / 250
Model Selection

The expected square loss (over all random data sets D) is given by

E[L] = E[(y − t)^2]
     = E[(y − f(x) − ε)^2]
     = E[(y − f(x))^2] + E[ε^2] + 2 E[ε] E[y − f(x)]
     = E[(E[y|x] − f(x) + y − E[y|x])^2] + E[ε^2] + 2 × 0
     = E[(E[y|x] − f(x))^2] + E[V[y|x]] + V(ε)

average squared loss = (model bias)^2 + model variance + noise variance

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 64 / 250
Model Selection

average squared loss = (model bias)^2 + model variance + noise variance

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 65 / 250
Model Selection

average squared loss = (model bias)^2 + model variance + noise variance

If large bias ⇒ underfitting

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 65 / 250
Model Selection

average squared loss = (model bias)^2 + model variance + noise variance

If large bias ⇒ underfitting

If large variance ⇒ overfitting
⇒ "bias-variance tradeoff"

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 65 / 250
Model Selection

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 66 / 250
Model Selection

Figure: When the bias-variance tradeoff occurs

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 67 / 250
Model Selection

However, in order to approximate the bias-variance decomposition (by the law of large numbers), we would need a large number P of data sets (and even if we had many, we would rather combine them into one!).

Cross-validation is often used instead: it divides the data into S subsets, with
a training set (to learn the parameters of a model)
a validation set (to choose a model)
a test set (to test the chosen model)
The S subsets are then interchanged ("S-fold cross-validation").

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 68 / 250
Model Selection

Figure: Cross-validation

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 69 / 250
Model Selection

Example. We can choose the λ coefficient of ridge regularization among some values (λ_1, ..., λ_K), and validate the best λ_k by cross-validation. This is grid-search cross-validation.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 70 / 250
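A sketch of grid-search cross-validation for the ridge coefficient λ with scikit-learn (scikit-learn names it `alpha`; the candidate grid, the data and the number of folds S = 5 are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
t = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=100)

# S-fold cross-validation (here S = 5) over candidate lambdas (alpha in scikit-learn)
grid = GridSearchCV(Ridge(),
                    param_grid={"alpha": [1e-3, 1e-2, 1e-1, 1.0, 10.0]},
                    cv=5)
grid.fit(X, t)
print("best lambda:", grid.best_params_["alpha"],
      "mean CV score (R^2):", grid.best_score_)
```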
Model Selection

Drawbacks:
Training S times ⇒ expensive.
Need for a number of trainings that is exponential with respect to the number of parameters (to test all combinations of these parameters).

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 71 / 250
Model Selection

For determining the model complexity:

- Cross-validation is
computationally expensive, and
wasteful of valuable data (need for a training set, a validation
set and a test set).
- The Bayesian approach
avoids the overfitting problem of maximum likelihood, and
has automatic methods of determining the model complexity
using the training data alone.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 72 / 250
Linear Models for Regression

Outline
1 Introduction
2 Example: Polynomial Curve Fitting
3 Probability Tools
4 Model Selection
5 Linear Models for Regression
6 Linear Models for Classification
7 Other Models
8 Neural Networks
9 To keep in "deep mind"!

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 73 / 250
Linear Models for Regression

The purpose of regression:


predict the value of a continuous target variable t given the value of a
D -dimensional input variable x , through a model for the distribution

p(t|x )

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 74 / 250
Linear Models for Regression

x = (x_1, . . . , x_D)^T ∈ R^D: the input variable.
t ∈ R: the target (output) variable.
y(x, w) ∈ R: the regression model (t = y(x, w) + ε).
w = (w_0, . . . , w_{M−1})^T ∈ R^M: the vector of parameters.
Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 75 / 250
Linear Models for Regression

We suppose that we have a linear regression model, that is

y(x, w) = Σ_{j=0}^{M−1} w_j φ_j(x) = w^T φ(x),

with φ = (φ_0, . . . , φ_{M−1})^T a vector of basis functions (features), and φ_0(x) := 1.

For example, φ(x) = x; φ_j(x) = x^j for a real polynomial model; φ_j(x) = cos(jx) for a Fourier-type expansion.

In many applications, the {φ_j} come from feature extraction (pre-processing) of the original variables in x.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 76 / 250
Linear Models for Regression

Figure: A linear regression example with real-valued input and target data.
Here, y (x, w ) = w1 x + w0 .

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 77 / 250
Linear Models for Regression

The target t = y(x, w) + ε, and we suppose that the independent noise ε is normally distributed with zero mean and precision β: p(ε) = N(ε|0, β^{-1}). Thus,

p(t|x, w, β) := p(t|w, β) = N(t | w^T φ(x), β^{-1})

(we have omitted, as we may do from now on, the obvious variable x from the conditioning, for notational simplicity). Now, the parameters w and β are to be determined.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 78 / 250
Linear Models for Regression

From the normal distribution of t, we derive the "−log-likelihood":

− log p(t|w, β) = (β/2) Σ_{n=1}^{N} (t_n − w^T φ(x_n))^2 − (N/2) log β + (N/2) log 2π,

=⇒ We have to minimize the MSE with respect to w:

min_w MSE = min_w (1/N) Σ_{n=1}^{N} (t_n − w^T φ(x_n))^2.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 79 / 250
Linear Models for Regression

The gradient with respect to w is given by

∇_w(− log p(t|w, β)) = −β Σ_{n=1}^{N} (t_n − w^T φ(x_n)) φ(x_n)^T = −β (t^T Φ − w^T Φ^T Φ),

where

Φ = (φ_j(x_n))_{1≤n≤N, 0≤j≤M−1} ∈ R^{N×M}.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 80 / 250
Linear Models for Regression

The optimal w_ML (obtained by setting the gradient equal to zero) is then given by the normal equations:

w_ML = (Φ^T Φ)^{-1} Φ^T t = Φ† t,

where we suppose that the matrix Φ^T Φ is non-singular, and Φ† := (Φ^T Φ)^{-1} Φ^T denotes the Moore-Penrose pseudo-inverse of Φ.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 81 / 250
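A minimal sketch of the normal equations with a polynomial basis (the basis choice, the data and the degree are illustrative assumptions; `np.linalg.pinv` computes the Moore-Penrose pseudo-inverse Φ†):

```python
import numpy as np

rng = np.random.default_rng(3)
N, M = 50, 6                      # N data points, M basis functions
x = rng.uniform(0, 1, size=N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=N)

# Design matrix Phi with polynomial basis functions phi_j(x) = x**j, j = 0..M-1
Phi = np.vander(x, M, increasing=True)        # shape (N, M)

w_ml = np.linalg.pinv(Phi) @ t                # w_ML = (Phi^T Phi)^{-1} Phi^T t
beta_ml_inv = np.mean((t - Phi @ w_ml) ** 2)  # 1/beta_ML: noise-variance estimate
print(w_ml, beta_ml_inv)
```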
Linear Models for Regression

NB. Similarly, the optimal β_ML is given by

(β_ML)^{-1} = (1/N) Σ_{n=1}^{N} (t_n − w_ML^T φ(x_n))^2

(which corresponds to a usual statistical estimator of the variance of t).

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 82 / 250
Linear Models for Regression

We may add a regularization (penalty) term with a coefficient λ > 0, and consider

min_w (E_D(w) + λ E_W(w)),

where E_D(w) is the "data" error:

E_D(w) = (1/2) Σ_{n=1}^{N} (t_n − w^T φ(x_n))^2,

and E_W(w) is a penalty on w.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 83 / 250
Linear Models for Regression

For ridge regularization, E_W(w) = (1/2) Σ_{j=0}^{M−1} w_j^2, and we can explicitly solve for the optimal w to obtain

w_ML^ridge = (λI + Φ^T Φ)^{-1} Φ^T t.

Notice that it does not need the matrix Φ^T Φ to be non-singular.
For lasso regularization, E_W(w) = Σ_{j=0}^{M−1} |w_j|, and the coordinates of the optimal w (not explicit here) become more sparse as λ is increased.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 84 / 250
Linear Models for Regression

Remark. In practice, we evaluate a regression model by its "coefficient of determination" or "R^2-score", defined by

R^2 = 1 − MSE / (variance of the data) = 1 − Σ_n (t_n − y_n)^2 / Σ_n (t_n − t̄)^2,

where t̄ := (1/N) Σ_n t_n.

=⇒ A good fit between the predicted target (y) and the actual target (t) is when R^2 is close to 1.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 85 / 250
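For instance, with predictions y and targets t stored as NumPy arrays (the values below are arbitrary placeholders), the R^2-score can be computed directly or with scikit-learn:

```python
import numpy as np
from sklearn.metrics import r2_score

t = np.array([1.0, 2.0, 3.0, 4.0])   # actual targets (example values)
y = np.array([1.1, 1.9, 3.2, 3.9])   # predicted targets (example values)

r2_manual = 1.0 - np.sum((t - y) ** 2) / np.sum((t - t.mean()) ** 2)
print(r2_manual, r2_score(t, y))     # the two values agree
```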
Linear Models for Regression

Remark. The R^2-score has to be compared between training and test data:
For linear models, R^{2,train} ∈ [0, 1] and R^{2,test} ∈ (−∞, 1];
For nonlinear models, R^{2,train} ∈ (−∞, 1] and R^{2,test} ∈ (−∞, 1].

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 86 / 250
Linear Models for Regression

Bayesian linear regression

As before, we have

t = y(x, w) + ε = w^T φ(x) + ε,

with p(ε) = N(ε|0, β^{-1}), and let us suppose that β is known.

The difference between the Bayesian approach and the previous frequentist approach is that we now put a prior on w.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 87 / 250
Linear Models for Regression

We suppose, at the beginning (independently of any data), that w is random with a multivariate normal distribution given by

p(w) = N(w | m_0, S_0),

where m_0 ∈ R^M and S_0 ∈ R^{M×M} are given.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 88 / 250
Linear Models for Regression

Now, we have training data t = (t1 , . . . , tN ) and (Φnj ) = (φj (xn )),
and we ask for the posterior distribution of w given t (and X ):
p(w |t ) =?

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 89 / 250
Linear Models for Regression

Theorem (Bayes theorem for Gaussian variables)

Let A ∈ R^{M×D} be a constant matrix, b ∈ R^M a constant vector, and let x ∈ R^D and y ∈ R^M be two random variables such that

p(x) = N(x | µ, Λ^{-1});
p(y|x) = N(y | Ax + b, L^{-1}).

Then,

p(y) = N(y | Aµ + b, L^{-1} + A Λ^{-1} A^T);
p(x|y) = N(x | Σ{A^T L (y − b) + Λµ}, Σ),

with

Σ := (Λ + A^T L A)^{-1}.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 90 / 250
Linear Models for Regression

From Bayes theorem for Gaussian variables:

p(w|t) = N(w | m_N, S_N),

where

m_N = S_N (S_0^{-1} m_0 + β Φ^T t);
S_N^{-1} = S_0^{-1} + β Φ^T Φ.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 91 / 250
Linear Models for Regression

To simplify, we take m_0 = 0 and S_0 = α^{-1} I, i.e. a prior p(w|α) = N(w|0, α^{-1} I). Then, the posterior parameters are given by

m_N = β S_N Φ^T t;
S_N^{-1} = α I + β Φ^T Φ.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 92 / 250
Linear Models for Regression

Thus,

log p(w|t) = −(β/2) Σ_{n=1}^{N} (w^T φ(x_n) − t_n)^2 − (α/2) w^T w + const,

so maximizing the posterior is equivalent to a ridge regularization with a regularizing coefficient λ = α/β.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 93 / 250
Linear Models for Regression

We now know both Gaussians p(t|x, w, β) and p(w|t, α, β) with their parameters. Then, from Bayes theorem for Gaussian variables,

p(t|x, t, α, β) = N(t | m_N^T φ(x), σ_N^2(x)),

where

σ_N^2(x) = β^{-1} + φ(x)^T S_N φ(x).

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 94 / 250
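A compact sketch of these update formulas (the prior precision α, the noise precision β and the polynomial basis are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
alpha, beta, M = 2.0, 25.0, 6    # prior precision, noise precision, nb of basis fns
x = rng.uniform(0, 1, size=30)
t = np.sin(2 * np.pi * x) + rng.normal(scale=beta ** -0.5, size=30)

Phi = np.vander(x, M, increasing=True)                        # design matrix
S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)   # posterior covariance
m_N = beta * S_N @ Phi.T @ t                                  # posterior mean

# Predictive mean and variance at a new input x_hat
x_hat = 0.5
phi = np.vander([x_hat], M, increasing=True).ravel()
mean = m_N @ phi
var = 1.0 / beta + phi @ S_N @ phi    # sigma_N^2(x) = 1/beta + phi^T S_N phi
print(mean, var)
```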
Linear Models for Regression

σ_N^2(x) = β^{-1} + φ(x)^T S_N φ(x)

=⇒ the predictive variance = a sum of the intrinsic noise in the data and a noise coming from the uncertainty on w.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 95 / 250
Linear Models for Regression

σ_N^2(x) = β^{-1} + φ(x)^T S_N φ(x)

=⇒ the predictive variance = a sum of the intrinsic noise in the data and a noise coming from the uncertainty on w.

Moreover, it can be shown that σ_{N+1}^2(x) ≤ σ_N^2(x): the predictive variance decreases with respect to the size N of the data.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 95 / 250
Linear Models for Regression

NB. If β is supposed unknown, a conjugate prior p(w, β) is of Gauss-Gamma type, and the predictive distribution is of Student type.

More generally, Generalized Linear Models (GLM) are linear


regression models with non-normal noise distributions.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 96 / 250
Linear Models for Regression

Limitations of Fixed Basis Functions:

Linear models have some advantages:
a closed-form solution of least squares;
a tractable Bayesian treatment.
However, the number of basis functions needs to grow rapidly, often exponentially, with the dimension D of the input space (curse of dimensionality).
But
real data may have some properties to help us;
more complex (nonlinear) models can also be used, such as Gaussian processes (GP), support vector machines (SVM) and artificial neural networks (ANN).

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 97 / 250
Linear Models for Regression

Gaussian Processes

The kernel trick:

In Bayesian linear regression, we've shown that

E[y(x, w)|t] = E[w^T φ(x)|t] = m_N^T φ(x) = (β S_N Φ^T t)^T φ(x).

Thus,

E[y(x, w)|t] = Σ_{n=1}^{N} k(x, x_n) t_n,

where we've set

k(x, x′) := β φ(x)^T S_N φ(x′),

called the equivalent kernel (or smoother matrix).


Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 98 / 250
Linear Models for Regression

As a function of x′, the kernel k(x, ·) is localized around x: the regressor y(x, w) is a weighted sum of the output data (t_n), giving more weight to those t_n for which x_n is closer to x.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 99 / 250
Linear Models for Regression

Moreover,

Cov[y(x, w), y(x′, w)|t] = β^{-1} k(x, x′).

That is, the closer x is to x′, the more correlated y(x, w) and y(x′, w) are.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 100 / 250
Linear Models for Regression

Another approach is to define any desired localized kernel directly and use it for prediction. This is the basic idea of Gaussian Process Regression (GPR).
−→ a Bayesian, nonparametric method

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 101 / 250
Linear Models for Regression

Figure: Gaussian Process Regression: from prior (before learning: zero mean, some covariance kernel k(·, ·)), to posterior (after learning data (black circles): updated mean, updated covariance), to prediction with uncertainty (confidence interval)

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 102 / 250
Linear Models for Regression

A function k(·, ·) is a kernel if it is symmetric and positive semi-definite:

Σ_{i=1}^{n} Σ_{j=1}^{n} c_i c_j k(x_i, x_j) ≥ 0, ∀c_i ∈ R, x_i ∈ R^D, n ∈ N*.

−→ a generalization of the covariance matrix

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 103 / 250
Linear Models for Regression

In machine learning in general, a kernel k(x, x′) is chosen a priori to measure similarity between two inputs x and x′.

Examples of kernels:
Linear kernel: k(x, x′) := x^T x′ + c;
Polynomial kernel: k(x, x′) := (x^T x′ + c)^M;
Gaussian kernel (RBF): k(x, x′) := exp(−‖x − x′‖^2 / (2s^2));
k(x, x′) := p(x) p(x′), for a probability measure p(·).

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 104 / 250
Linear Models for Regression

A Gaussian process (GP) is a stochastic process (Y(x))_{x ∈ R^D} such that, for any x_1, ..., x_n, the vector (Y(x_1), ..., Y(x_n)) has a multivariate normal distribution.

−→ characterized by
mean: m(x) = E[Y(x)]
covariance (kernel): k(x, x′) = Cov(Y(x), Y(x′))

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 105 / 250
Linear Models for Regression

Here,
t = t(x) = y(x) + ε(x),
where
y(·) is a GP with mean zero and covariance function k_0(x, x′) (a chosen kernel);
ε(·) is an independent white noise with distribution N(0, β^{-1}).

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 106 / 250
Linear Models for Regression

Here,
t = t(x) = y(x) + ε(x),
where
y(·) is a GP with mean zero and covariance function k_0(x, x′) (a chosen kernel);
ε(·) is an independent white noise with distribution N(0, β^{-1}).

Then, (t(x))_{x ∈ R^D} is a GP with mean zero and covariance function

k(x, x′) = k_0(x, x′) + β^{-1} 1_{x = x′}.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 106 / 250
Linear Models for Regression

Remark.
In Bayesian linear regression: prior on the parameter w as a random variable
In (Bayesian) GP regression: prior on the function y(·) as a stochastic process

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 107 / 250
Linear Models for Regression

We have training data, inputs X_N = X = (x_1, ..., x_N)^T and outputs t_N = t = (t_1, . . . , t_N)^T.
We want to predict a new t_{N+1} from an input x_{N+1}, i.e. we look for the posterior p(t_{N+1}|t_N, X_N, x_{N+1}):

p(t_{N+1}|t_N) = ?

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 108 / 250
Linear Models for Regression

Set
K_N := (k(x_n, x_m))_{n,m=1...N}.
Clearly, the train-test data t_{N+1} = (t_N, t_{N+1})^T have joint distribution p(t_{N+1}) = N(t_{N+1} | 0, K_{N+1}), with

K_{N+1} = [ K_N          k_{N,N+1}
            k_{N,N+1}^T  k_{N+1}  ],

where k_{N,N+1} = (k(x_n, x_{N+1}))_{n=1...N} and k_{N+1} = k(x_{N+1}, x_{N+1}).

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 109 / 250
Linear Models for Regression

Thus, we know the joint train-test distribution p(t_N, t_{N+1}), but we want the conditional distribution of test given train, p(t_{N+1}|t_N): we use the following theorem.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 110 / 250
Linear Models for Regression

Theorem (Conditional Gaussian distribution)

Let x = (x_a^T, x_b^T)^T ∈ R^D with x_a ∈ R^M and x_b ∈ R^{D−M}, with

p(x) = N(x | µ, Σ);
µ = (µ_a^T, µ_b^T)^T;
Σ^{-1} = Λ := [ Λ_aa  Λ_ab
                Λ_ba  Λ_bb ]

(with the sizes of all the above blocks matching those of x_a and x_b).
Then,

p(x_a|x_b) = N(x_a | µ_{a|b}, Λ_aa^{-1}),

where

µ_{a|b} := µ_a − Λ_aa^{-1} Λ_ab (x_b − µ_b).

In particular, the conditional mean µ_{a|b} is linear with respect to x_b.
Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 111 / 250
Linear Models for Regression

Using the Conditional Gaussian distribution theorem, we get the posterior distribution

p(t_{N+1}|t_N) = N(t_{N+1} | m(x_{N+1}), σ^2(x_{N+1})),

with mean and variance:

m(x_{N+1}) = k_{N,N+1}^T K_N^{-1} t_N;

σ^2(x_{N+1}) = k_{N+1} − k_{N,N+1}^T K_N^{-1} k_{N,N+1}.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 112 / 250
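A minimal NumPy sketch of these GPR equations, assuming an RBF kernel k_0 with scale s and noise precision β (all numerical values are illustrative):

```python
import numpy as np

def k0(a, b, s=0.2):
    """RBF kernel between two sets of 1-D inputs."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * s ** 2))

rng = np.random.default_rng(5)
beta = 25.0
x_train = rng.uniform(0, 1, size=20)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=beta ** -0.5, size=20)

K_N = k0(x_train, x_train) + np.eye(len(x_train)) / beta   # k = k0 + beta^{-1} 1_{x=x'}

x_new = np.array([0.5])
k_vec = k0(x_train, x_new).ravel()                         # k_{N,N+1}
k_new = k0(x_new, x_new).item() + 1.0 / beta               # k_{N+1}

mean = k_vec @ np.linalg.solve(K_N, t_train)               # m(x_{N+1})
var = k_new - k_vec @ np.linalg.solve(K_N, k_vec)          # sigma^2(x_{N+1})
print(mean, var, "95% CI:", (mean - 2 * np.sqrt(var), mean + 2 * np.sqrt(var)))
```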
Linear Models for Regression

Remarks.
A 95%-confidence interval for the test target is given by [m(x_{N+1}) − 2σ(x_{N+1}); m(x_{N+1}) + 2σ(x_{N+1})].
m(x_{N+1}) is of the form Σ_{n=1}^{N} a_n k(x_n, x_{N+1}): the mean is updated according to the covariances between the test and the training data.
σ^2(x_{N+1}) ≤ k_{N+1}: the posterior variance is less than the prior one.
It can be shown that σ^2(x_{N+2}) ≤ σ^2(x_{N+1}): the more you learn, the more confident you are.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 113 / 250
Linear Models for Regression

Figure: Notice how the posterior mean and variance are updated compared
to the prior ones

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 114 / 250
Linear Models for Regression

Figure: GPR with 2+1-dimensional data, and RBF kernel

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 115 / 250
Linear Models for Regression

Remarks. Advantages of GPR:

With kernels: no need to compute either the features φ(·) or the weights w!
Mercer's theorem states that kernels implicitly represent features: ∃φ̃(·) such that k(x, x′) = φ̃^T(x) φ̃(x′), for some possibly infinite-dimensional φ̃!
Good for nonlinearly separable data.
We control overfitting by choosing the (hyper)parameters of the kernel (e.g. the scale s of the Gaussian kernel).

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 116 / 250
Linear Models for Regression

Example (of how a kernel implicitly maps the input space to a higher-dimensional feature space).
For inputs u = (u_1, u_2), v = (v_1, v_2) ∈ R^2, take

k(u, v) = (u^T v + 1)^2 = (u_1 v_1 + u_2 v_2 + 1)^2.

Check that

k(u, v) = φ̃^T(u) φ̃(v),

with

φ̃(u) = (u_1^2, u_2^2, 1, √2 u_1, √2 u_2, √2 u_1 u_2)^T ∈ R^6!

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 117 / 250
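A quick numerical check of this identity (the vectors u and v below are arbitrary example values):

```python
import numpy as np

def phi_tilde(u):
    u1, u2 = u
    return np.array([u1**2, u2**2, 1.0,
                     np.sqrt(2) * u1, np.sqrt(2) * u2, np.sqrt(2) * u1 * u2])

u, v = np.array([0.3, -1.2]), np.array([2.0, 0.5])
k_uv = (u @ v + 1.0) ** 2
print(k_uv, phi_tilde(u) @ phi_tilde(v))   # the two numbers coincide
```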
Linear Models for Regression

Remark. GPs have even been proven to be equivalent to infinitely wide neural networks! For an interesting article (Quanta Magazine, October 2021) about the power of GPs and kernels in relation to the mysteries of modern deep learning, see here.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 118 / 250
Linear Models for Regression

Remark. A drawback of GPR: it needs O(N^3) operations to invert K_N. But there are some accelerating methods (like "subset of data")...

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 119 / 250
Linear Models for Regression

Remark. Learning the hyperparameters:

Let θ be the vector of hyperparameters of the GP model. We find the optimal θ by maximum likelihood, i.e. max_θ p(t|θ). We have:

log p(t|θ) = −(1/2) log |K_N| − (1/2) t^T K_N^{-1} t − (N/2) log(2π),

and

∂/∂θ_i log p(t|θ) = −(1/2) Tr(K_N^{-1} ∂K_N/∂θ_i) + (1/2) t^T K_N^{-1} (∂K_N/∂θ_i) K_N^{-1} t.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 120 / 250
Linear Models for Regression

Remark. Automatic Relevance Determination (ARD):

In the kernel, we may assign a separate hyperparameter η_i to each component x^(i) of the input x = (x^(1), ..., x^(D)). For example,

k(x, x′) := exp{ −(1/2) Σ_{i=1}^{D} η_i (x^(i) − x′^(i))^2 }.

If η_i is small, it means that the component x^(i) has little effect on the predictive distribution. Thus, we can determine the relative importance ("relevance") of the input variables from the data alone ("automatically").

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 121 / 250
Linear Models for Classification

Outline
1 Introduction
2 Example: Polynomial Curve Fitting
3 Probability Tools
4 Model Selection
5 Linear Models for Regression
6 Linear Models for Classification
7 Other Models
8 Neural Networks
9 To keep in "deep mind"!

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 122 / 250
Linear Models for Classification

For a K-class problem, we have K known classes. Each input x ∈ R^D is assumed to belong to one class C_k = C(x) for some k among 1 . . . K, and we have to guess this class.

We look for decision boundaries that separate the inputs according to their classes.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 123 / 250
Linear Models for Classification

Figure: Binary classification

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 124 / 250
Linear Models for Classification

Figure: Multiclass classification

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 125 / 250
Linear Models for Classification

Figure: Email x := D words (x^(1), . . . , x^(D)) ↦ t ∈ {spam := 1, not spam := 0}

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 126 / 250
Linear Models for Classification

Figure: Transaction x := features (x^(1), . . . , x^(9)) ↦ t ∈ {fraud := 1, not fraud := 0}

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 127 / 250
Linear Models for Classification

Figure: Image x := D pixels (x^(1), . . . , x^(D)) ↦ t ∈ {shoes, trousers, . . . }

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 128 / 250
Linear Models for Classification

Remark

For multiple classes: confusion matrix (number of predictions of class i for actual class j).

Accuracy := Score := True/All = (TP+TN)/(TP+TN+FP+FN).

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 129 / 250
Linear Models for Classification

Remark

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 130 / 250
Linear Models for Classification

Remark

Three approaches for classification:

probabilistic discriminative approach: directly models the conditional probability p(C_k|x); e.g. logistic regression
probabilistic generative approach: first models p(x|C_k) and p(C_k); e.g. Naive Bayes, LDA, QDA
deterministic discriminant function: directly assigns a class to each input x; e.g. KNN, trees, SVM

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 131 / 250
Linear Models for Classification

Logistic regression for two classes

We have K = 2 classes, and a vector φ ∈ R^M of M features (φ_0(x), . . . , φ_{M−1}(x)).

Logistic regression ≡ find the "most likely" (in the maximum likelihood sense) parameter vector w ∈ R^M, setting

p(C_1|x) = y(x, w) := σ(w^T φ) = σ( Σ_{j=0}^{M−1} w_j φ_j(x) ),

where σ(·) is the logistic or sigmoid function defined by

σ(a) = 1 / (1 + exp(−a)) ∈ [0, 1], ∀a ∈ R.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 132 / 250
Linear Models for Classification

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 133 / 250
Linear Models for Classification

We decide that x ∈ C_1 iff

p(C_1|x) = σ(w^T φ) > 0.5,

i.e.

decide (x ∈ C_1) if Σ_{j=0}^{M−1} w_j φ_j(x) > 0 (else x ∈ C_0)

=⇒ linear decision boundary (with respect to φ(x), not necessarily with respect to x!).

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 134 / 250
Linear Models for Classification

Figure: Feature engineering may help for linear separability (e.g. φ : R^2 → R^3, (x_1, x_2) ↦ (x_1, x_2, x_1^2 + x_2^2))

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 135 / 250
Linear Models for Classification

NB. Only for notational simplicity, we will suppose φ = x (otherwise, just replace x by φ(x)).

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 136 / 250
Linear Models for Classification

Loss function? −→ minimize the "−log-likelihood":

For (x, t) ∈ R^D × {0, 1},

p(t = 1|x, w) = p(C_1|x) = y := σ(w^T x);
p(t = 0|x, w) = p(C_0|x) = 1 − y = 1 − σ(w^T x);

=⇒ In compact form, for any t ∈ {0, 1},

p(t|x, w) = y^t (1 − y)^{1−t} = σ(w^T x)^t (1 − σ(w^T x))^{1−t}.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 137 / 250
Linear Models for Classification

• For all data x_n ∈ R^D and t_n ∈ {0, 1}, the likelihood is given by

p(t|X, w) = Π_{n=1}^{N} p(t_n|x_n, w) = Π_{n=1}^{N} y_n^{t_n} (1 − y_n)^{1−t_n},

with y_n := σ(w^T x_n).

• We have to minimize (with respect to w) the "cross-entropy loss/error":

E(w) = − log p(t|X, w) = − Σ_{n=1}^{N} { t_n log y_n + (1 − t_n) log(1 − y_n) }
     = − Σ_{n=1}^{N} { t_n log σ(w^T x_n) + (1 − t_n) log(1 − σ(w^T x_n)) }.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 138 / 250
Linear Models for Classification

We then need to compute ∇E(w).

Only for illustration, suppose w and x are scalar (∈ R).

• σ′(a) = ∂/∂a { 1/(1 + exp(−a)) } = exp(−a)/(1 + exp(−a))^2 = σ(a)(1 − σ(a));

• ∂/∂w { t log σ(wx) } = t x σ′(wx)/σ(wx) = t x (1 − σ(wx));

• ∂/∂w { (1 − t) log(1 − σ(wx)) } = −(1 − t) x σ′(wx)/(1 − σ(wx)) = −(1 − t) x σ(wx);

Then,

∂/∂w { t log σ(wx) + (1 − t) log(1 − σ(wx)) } = tx − x σ(wx) = −(y − t) x.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 139 / 250
Linear Models for Classification

Similarly, in the multidimensional case,

∇E(w) = Σ_{n=1}^{N} (y_n − t_n) x_n = Σ_{n=1}^{N} (σ(w^T x_n) − t_n) x_n.

Computing the second derivatives (with respect to w), it can be shown that the Hessian matrix H(w) = ∇∇E is positive definite, so E(·) is convex and there exists a unique minimum w*.
However, w* cannot be computed exactly in closed form, unlike for linear regression! (we cannot solve ∇E(w*) = 0 explicitly)
−→ approximation algorithms for the optimal w* (Gradient Descent and variants)

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 140 / 250
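A minimal gradient-descent sketch for this logistic regression problem (the data, learning rate and number of iterations are illustrative assumptions; no intercept term is added, following the simplified notation φ = x):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(6)
N, D = 200, 2
X = rng.normal(size=(N, D))
w_true = np.array([2.0, -3.0])
t = (sigmoid(X @ w_true) > rng.uniform(size=N)).astype(float)   # Bernoulli labels

w = np.zeros(D)
lr = 0.1
for _ in range(500):
    y = sigmoid(X @ w)
    grad = X.T @ (y - t)      # gradient of the cross-entropy loss: sum_n (y_n - t_n) x_n
    w -= lr * grad / N        # (scaled) gradient-descent step
print("estimated w:", w)
```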
Linear Models for Classification

Multiclass logistic regression

The approach is similar with K ≥ 2 classes, but with a "softmax" function instead of the sigmoid:
we use a matrix W = (w_j^(k)) of weights and set, for k = 1, . . . , K,

p(C_k|x) = y^(k)(x, W) := exp(a^(k)) / Σ_{l=1}^{K} exp(a^(l)) := "softmax function",

where

a^(k) = a(w^(k), x) := w^(k)T φ = Σ_{j=0}^{M−1} w_j^(k) φ_j(x).

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 141 / 250
Linear Models for Classification

We use "one-hot encoding" (or "one-of-K") for the target t:

t = (0, . . . , 0, 1, 0, . . . , 0),

where the 1 is in the k-th position if the class is C_k.

In this case (K classes), the cross-entropy error becomes

E(W) := − log p(t|X, W) = − Σ_{n=1}^{N} Σ_{k=1}^{K} t_n^(k) log y_n^(k)
      = − Σ_{n=1}^{N} Σ_{k=1}^{K} t_n^(k) log [ exp(a(w^(k), x_n)) / Σ_{l=1}^{K} exp(a(w^(l), x_n)) ],

to be minimized with respect to all (w_j^(k)).

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 142 / 250
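A small sketch of the softmax model and its cross-entropy loss on one-hot targets (the activations and labels below are random placeholders, standing in for a^(k) = w^(k)T φ(x_n)):

```python
import numpy as np

def softmax(a):
    a = a - a.max(axis=1, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(7)
N, K = 5, 3
A = rng.normal(size=(N, K))                # activations a_n^(k) (placeholders)
Y = softmax(A)                             # y_n^(k) = p(C_k | x_n)

classes = rng.integers(0, K, size=N)
T = np.eye(K)[classes]                     # one-hot targets t_n

E = -np.sum(T * np.log(Y))                 # cross-entropy error E(W)
print(Y.sum(axis=1), E)                    # rows of Y sum to 1
```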
Linear Models for Classification

Bayesian Logistic Regression

In the Bayesian setting, we set a prior p(w) := N(w | m_0, S_0).

The posterior is p(w|t) ∝ p(w) p(t|w). Then,

log p(w|t) = −(1/2) (w − m_0)^T S_0^{-1} (w − m_0) + Σ_{n=1}^{N} { t_n log y_n + (1 − t_n) log(1 − y_n) } + const,

with y_n = σ(w^T φ_n).

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 143 / 250
Linear Models for Classification

The predictive distribution is

p(C_1|φ, t) = ∫ p(C_1|φ, w) p(w|t) dw
            = ∫ σ(w^T φ) p(w|t) dw.

In order to compute the last integral, we can use a numerical approximation, such as the "Laplace approximation":
Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 144 / 250
Linear Models for Classification

"Laplace approximation":

p(w|t) ≈ q(w) := N(w | w_MAP, S_N),    (1)

where

w_MAP := arg max_w log p(w|t);

S_N^{-1} := −∇∇ log p(w|t) = S_0^{-1} + Σ_{n=1}^{N} y_n (1 − y_n) φ_n φ_n^T.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 145 / 250
Linear Models for Classification

Let a := w^T φ. We get

p(a|t) ≈ N(a | µ_a, σ_a^2),

where

µ_a := w_MAP^T φ;
σ_a^2 := φ^T S_N φ.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 146 / 250
Linear Models for Classification

Then, the predictive distribution is

p(C_1|φ, t) = ∫ σ(w^T φ) p(w|t) dw
            = E[σ(a)|t]
            = ∫ σ(a) p(a|t) da
            ≈ ∫ σ(a) N(a | µ_a, σ_a^2) da,

which can be approximated again.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 147 / 250
Linear Models for Classification

Gaussian Processes for Classification

In GPC (Gaussian Processes for Classification), we consider

y(x) = σ(a(x))

as a model for p(t = 1|x), with a(·) a Gaussian process (and σ(·) the sigmoid function).
Then t ∈ {0, 1} has the Bernoulli distribution

p(t|a) = σ(a)^t (1 − σ(a))^{1−t}.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 148 / 250
Linear Models for Classification

We assume that the Gaussian process a(·) has zero mean and covariance function

k(x, x′) := k_0(x, x′) + ν 1_{x = x′},

where k_0(·, ·) is a positive semidefinite kernel and ν a positive parameter (to ensure k(·, ·) is positive definite).

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 149 / 250
Linear Models for Classification

We assume that the Gaussian process a(·) has zero mean and covariance function

k(x, x′) := k_0(x, x′) + ν 1_{x = x′},

where k_0(·, ·) is a positive semidefinite kernel and ν a positive parameter (to ensure k(·, ·) is positive definite).

Remember that the power of kernels is to implicitly perform high-dimensional feature engineering!

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 149 / 250
Linear Models for Classification

We have target data t_N := t = (t_1, . . . , t_N) for inputs (x_1, . . . , x_N)^T.

Set

a_N := (a(x_1), . . . , a(x_N))^T.

For a new data input x_{N+1}, we want to predict the corresponding target t_{N+1}.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 150 / 250
Linear Models for Classification

We set a_{N+1} := a(x_{N+1}). The predictive distribution is

p(t_{N+1} = 1|t_N) = ∫ p(t_{N+1} = 1|a_{N+1}) p(a_{N+1}|t_N) da_{N+1}
                   = ∫ σ(a_{N+1}) p(a_{N+1}|t_N) da_{N+1}.

−→ can be approximated (sampling, variational inference, expectation propagation, Laplace approximation...).

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 151 / 250
Linear Models for Classification

Linear/Quadratic Discriminant Analysis

In the LDA approach, we first assume that, within each class Ck, the input x is normally distributed: for k = 1 . . . K,

p(x|Ck) := N(x|µk, Σ)
         = 1/((2π)^{D/2} |Σ|^{1/2}) exp( −(1/2)(x − µk)^T Σ^{−1}(x − µk) ).

The parameters µk and Σ, as well as p(Ck), are assumed given; in practice they can be estimated from the data by maximum likelihood.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 152 / 250
Linear Models for Classification

For K = 2 classes (C1 and C2):

By Bayes' formula,

p(C1|x) = p(x|C1)p(C1) / p(x)
        = p(x|C1)p(C1) / [p(x|C1)p(C1) + p(x|C2)p(C2)].

If we set

a(x) := log [ p(x|C1)p(C1) / (p(x|C2)p(C2)) ],

we get

p(C1|x) = 1 / (1 + exp(−a(x))) = σ(a(x)).

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 153 / 250
Linear Models for Classification

Computing a(x) from the Gaussians p(x|C1) and p(x|C2) gives

a(x) = w^T x + w0,

where

w = Σ^{−1}(µ1 − µ2);
w0 = −(1/2) µ1^T Σ^{−1} µ1 + (1/2) µ2^T Σ^{−1} µ2 + log( p(C1)/p(C2) ).

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 154 / 250
Linear Models for Classification

Thus,

p(C1|x) = σ(a(x)) = σ(w^T x + w0),

which is... the logistic regression model!

The decision boundary is the set

{p(C1|x) = 0.5} = {w^T x + w0 = 0},

which is a linear boundary.
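To make the computation concrete, here is a minimal NumPy sketch of two-class LDA (synthetic data for illustration, labels t ∈ {0, 1}, hypothetical helper names): it estimates µ1, µ2, the shared covariance Σ and the class priors by maximum likelihood, then forms w and w0 exactly as in the formulas above.

```python
import numpy as np

def fit_lda_two_class(X, t):
    """Two-class LDA sketch: ML estimates of mu_1, mu_2, a pooled Sigma and
    the priors p(C1), p(C2), then w = Sigma^-1 (mu1 - mu2) and w0 as above."""
    X1, X2 = X[t == 1], X[t == 0]
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    Sigma = ((X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)) / len(X)
    p1, p2 = len(X1) / len(X), len(X2) / len(X)
    Sinv = np.linalg.inv(Sigma)
    w = Sinv @ (mu1 - mu2)
    w0 = -0.5 * mu1 @ Sinv @ mu1 + 0.5 * mu2 @ Sinv @ mu2 + np.log(p1 / p2)
    return w, w0

def posterior_c1(X, w, w0):
    return 1.0 / (1.0 + np.exp(-(X @ w + w0)))   # sigma(w^T x + w0)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 1.0, (50, 2)), rng.normal(-1.0, 1.0, (50, 2))])
t = np.array([1] * 50 + [0] * 50)
w, w0 = fit_lda_two_class(X, t)
print(w, w0, posterior_c1(X[:3], w, w0))
```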

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 155 / 250
Linear Models for Classification

Remark. We have shown that logistic regression can be seen as implicitly assuming a Gaussian distribution of the inputs within each class.
However, there are fewer parameters to estimate in logistic regression (w −→ D params) than in LDA/QDA (µ1, µ2, Σ and p(C1) −→ 2D + D^2 + 1 params).

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 156 / 250
Linear Models for Classification

For K ≥ 2:

Similarly, we get

ak(x) := log(p(x|Ck)p(Ck)) = wk^T x + wk0 + c(x).

The decision boundaries (p(Cj|x) = p(Ck|x)) are given by aj(x) = ak(x), i.e.

{(wk − wj)^T x + (wk0 − wj0) = 0},   j, k = 1 . . . K,

which are linear.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 157 / 250
Linear Models for Classification

Remark.
If the class-covariance matrices are all the same (Σ), the decision boundaries are linear (LDA).
If Σ = I (independent, equal-variance features within each class), this is "Naive Bayes".
If the K classes have different covariance matrices (Σ1, . . . , ΣK), the decision boundaries become quadratic with respect to x (QDA), i.e. of the form x^T A x + b^T x + c = 0.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 158 / 250
Linear Models for Classification

Figure: LDA and QDA with two classes. Source: http://scikit-learn.org/

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 159 / 250
Other Models

Outline
1 Introduction
2 Example: Polynomial Curve Fitting
3 Probability Tools
4 Model Selection
5 Linear Models for Regression
6 Linear Models for Classification
7 Other Models
8 Neural Networks
9 To keep in "deep mind"!

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 160 / 250
Other Models

K -Nearest Neighbors (K-NN)

Figure: K-NN (Here: K = 3 → B ; K = 7 → A)

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 161 / 250
Other Models

Figure: Effect of K on the boundary smoothness (K = 1 vs K = 20)
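A minimal scikit-learn sketch of this effect (illustrative data and parameter values, not those of the figure): the hyperparameter n_neighbors plays the role of K, and small K fits the noise while large K gives a smoother boundary.

```python
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier

X, t = make_moons(n_samples=200, noise=0.3, random_state=0)
for K in (1, 20):                                  # K controls boundary smoothness
    knn = KNeighborsClassifier(n_neighbors=K).fit(X, t)
    print(K, knn.score(X, t))                      # training accuracy (K = 1 memorizes)
```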

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 162 / 250
Other Models

Remark.

K-NN (a supervised classification method) is different from K-means (an unsupervised clustering method):

Figure: K-means

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 163 / 250
Other Models

Figure: K-means corresponds to a Gaussian-Mixture Model (GMM):
p(x) := Σ_{i=1}^K αi N(x|µi, Σi)
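A short scikit-learn sketch contrasting the two (synthetic data for illustration): KMeans gives hard cluster centres, while GaussianMixture fits the mixture weights αi, means µi and covariances Σi of the GMM above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(5.0, 1.0, (100, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(km.cluster_centers_)              # hard cluster centres
print(gmm.weights_, gmm.means_)         # mixture weights alpha_i and means mu_i
```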

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 164 / 250
Other Models

Decision Trees (CART)

Figure: A decision tree

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 165 / 250
Other Models

1- Trees based on information gain (entropy)

Figure: Information gain (High vs Low)

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 166 / 250
Other Models

Entropy:
H = − Σ_i pi log2(pi);

Information gain:
IG(split) = Entropy before split − avg. Entropy after split
          = H(parent node) − EH(children).

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 167 / 250
Other Models

Example.
30 students;
Two input variables: Gender ("Boy" / "Girl") and Classroom ("IX" /
"X");
Output: playing cricket ("Yes" / "No");
Observation numbers for the different splits are given in the Figure below:

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 168 / 250
Other Models

Figure: Two different splits: which to choose?? (which has higher IG?)

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 169 / 250
Other Models

Parent node = All students, both playing cricket and not playing;
H (parent node) = -[15/30*log2 (15/30) + 15/30*log2 (15/30)] = 1;

IG(Gender split) = ?
= H (parent node) - EH (Gender split);

H(Female node) = -[2/10*log2(2/10) + 8/10*log2(8/10)] = 0.72;
H(Male node) = -[13/20*log2(13/20) + 7/20*log2(7/20)] = 0.93;
EH(Gender split) = 10/30*0.72 + 20/30*0.93 = 0.86;
IG(Gender split) = 1 - 0.86 = 0.14.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 170 / 250
Other Models

IG(Classroom split) = ?
= H (parent node) - EH (Classroom split);

H(IX node) = -[6/14*log2 (6/14) + 8/14*log2 (8/14)] = 0.99;


H(X node) = -[9/16*log2 (9/16) + 7/16*log2 (7/16)] = 0.99;
EH(Classroom split) = 14/30*0.99 + 16/30*0.99 = 0.99;
IG(Classroom split) = 1 - 0.99 = 0.01.

Conclusion. IG(Gender split) > IG(Classroom split). We choose the tree splitting on Gender.
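The same numbers can be reproduced with a few lines of Python (a small sketch; the counts [play, don't play] per node are those of the 30-student example above):

```python
import numpy as np

def entropy(counts):
    p = np.array(counts, dtype=float) / sum(counts)
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def information_gain(parent, children):
    n = sum(parent)
    expected_h = sum(sum(c) / n * entropy(c) for c in children)
    return entropy(parent) - expected_h

print(information_gain([15, 15], [[2, 8], [13, 7]]))   # Gender split    ~ 0.14
print(information_gain([15, 15], [[6, 8], [9, 7]]))    # Classroom split ~ 0.01
```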

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 171 / 250
Other Models

2- Trees based on Gini index

Only the metric changes: replace entropy H by the Gini index (or
impurity)
G := 1 − Σ_i pi^2,

which, like entropy, measures heterogeneity (G = 1 − P(two items have the same class)).

Gini gain:
GG(split) = Gini before split − avg. Gini after split
          = G(parent node) − EG(children).

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 172 / 250
Other Models

Example. (idem)
Parent node = All students, play cricket and don't play;
G(parent node) = 1 − [(15/30)^2 + (15/30)^2] = 0.5;

GG(Gender split) = ?
= G(parent node) − EG(Gender split);

G(Female node) = 1 − [(2/10)^2 + (8/10)^2] = 0.32;
G(Male node) = 1 − [(13/20)^2 + (7/20)^2] = 0.45;
EG(Gender split) = 10/30 ∗ 0.32 + 20/30 ∗ 0.45 = 0.41;
GG(Gender split) = 0.5 − 0.41 = 0.09.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 173 / 250
Other Models

GG(Classroom split) = ?
= G (parent node) - EG (Classroom split);

G(IX node) = 1 − [(6/14)^2 + (8/14)^2] = 0.49;
G(X node) = 1 − [(9/16)^2 + (7/16)^2] = 0.49;
EG(Classroom split) = 14/30 ∗ 0.49 + 16/30 ∗ 0.49 = 0.49;
GG(Classroom split) = 0.5 − 0.49 = 0.01.

Conclusion. GG(Gender split) > GG(Classroom split). We choose the tree splitting on Gender.
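The analogous sketch with the Gini impurity (same counts as above):

```python
import numpy as np

def gini(counts):
    p = np.array(counts, dtype=float) / sum(counts)
    return 1.0 - (p ** 2).sum()

def gini_gain(parent, children):
    n = sum(parent)
    expected_g = sum(sum(c) / n * gini(c) for c in children)
    return gini(parent) - expected_g

print(gini_gain([15, 15], [[2, 8], [13, 7]]))   # Gender split    ~ 0.09
print(gini_gain([15, 15], [[6, 8], [9, 7]]))    # Classroom split ~ 0.01
```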

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 174 / 250
Other Models

Remark. For continuous targets (regression), the split is usually chosen by variance reduction: the variance is computed for each child node, the variance of a split is the weighted average of the child-node variances, and the split with the lowest variance is selected.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 175 / 250
Other Models

3 - Random forests

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 176 / 250
Other Models

Remark. A random forest is used to counter the possible overfitting of a single decision tree.
Another technique is to perform "tree pruning" (reducing the size of the tree).

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 177 / 250
Other Models

Remark

Gradient Boosting is a sequential prediction method in the form of


an ensemble of weak prediction models, typically decision trees.
It acts by incremental improvement as a cumulative weighted sum of
models (weak learners), by minimizing the successive errors:
Model = γ1 Model 1 + γ2 Model 2 + ...

Popular recent variations are XGBoost and LightGBM.
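A minimal scikit-learn sketch of both ensemble ideas (synthetic data and illustrative hyperparameters; XGBoost and LightGBM expose a very similar fit/predict interface):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, t = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, t_tr, t_te = train_test_split(X, t, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, t_tr)
gb = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                random_state=0).fit(X_tr, t_tr)
print("random forest:", rf.score(X_te, t_te))
print("gradient boosting:", gb.score(X_te, t_te))
```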

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 178 / 250
Other Models

Support Vector Machine (SVM)

Figure: SVM

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 179 / 250
Other Models

Figure: SVM

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 180 / 250
Other Models

Hard-margin SVM:

For labels ti = ±1,

min_{w,b} ‖w‖^2

such that: ti(w^T xi + b) ≥ 1, ∀i = 1 . . . N

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 181 / 250
Other Models

Remark. If we define the "hinge loss"

H(w, b) := (1/N) Σ_{i=1}^N max{1 − ti(w^T xi + b), 0},

then hard-margin SVM is equivalent to

min_{w,b} ‖w‖^2
s.t. H(w, b) = 0.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 182 / 250
Other Models

Soft-margin SVM:

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 183 / 250
Other Models

For labels ti = ±1,

min_{w,b,ξ} ‖w‖^2 + c Σ_{i=1}^N ξi
s.t. ti(w^T xi + b) ≥ 1 − ξi, ∀i = 1 . . . N;
     ξi ≥ 0, ∀i = 1 . . . N.

Soft-margin SVM is equivalent to

min_{w,b} ‖w‖^2 + c H(w, b)

=⇒ ≡ a ridge-regularized hinge loss (with ridge coefficient 1/c).
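In scikit-learn, the C parameter of the SVC estimator plays the role of the constant c above (a small sketch with synthetic data; small C gives a softer margin, large C approaches the hard-margin case):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, t = make_blobs(n_samples=100, centers=2, random_state=0)
for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, t)
    print(C, clf.n_support_)        # number of support vectors per class
```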

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 184 / 250
Other Models

Remark.

Figure: Different classification losses when t = ±1: hinge for SVM, logit for logistic regression, and misclassification (zero-one) for accuracy

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 185 / 250
Other Models

Duality

Original optimization (hard SVM):

min_{w,b} (1/2)‖w‖^2
s.t. 1 ≤ ti(w^T xi + b), ∀i = 1 . . . N.

Lagrangian:

L(w, b, α) = (1/2)‖w‖^2 + Σ_{i=1}^N αi [1 − ti(w^T xi + b)];   αi ≥ 0.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 186 / 250
Other Models

(Primal)  min_{w,b} max_{α≥0} L(w, b, α),

equivalent (by Slater's condition) to

(Dual)  max_{α≥0} min_{w,b} L(w, b, α).

=⇒ we can solve for the optimal (w, b) as functions of α = (α1, . . . , αN).

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 187 / 250
Other Models

By the Karush-Kuhn-Tucker (KKT) conditions:

∂L/∂w = ∇w L = w − Σ_{i=1}^N αi ti xi = 0   =⇒   w = Σ_{i=1}^N αi ti xi;

∂L/∂b = − Σ_{i=1}^N αi ti = 0.

Substituting in (Dual) gives

(Dual)  max_{α∈R^N}  { −(1/2) Σ_{i=1}^N Σ_{j=1}^N αi αj ti tj xi^T xj + Σ_{i=1}^N αi }
        s.t.  Σ_{i=1}^N αi ti = 0;   (αi) ≥ 0.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 188 / 250
Other Models

Solve (Dual) directly to find the optimal α∗; then

w∗ = Σ_{i=1}^N αi∗ ti xi;

b∗ = tk − w∗^T xk,   for any k where αk∗ > 0.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 189 / 250
Other Models

Remarks.
Sparse solution, since αi∗ = 0 on non-support vectors (outside
the margin) =⇒ removing them will not change the solution!

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 190 / 250
Other Models

A similar formulation holds for soft-margin SVM, just changing the constraints on α from αi ≥ 0 to 0 ≤ αi ≤ c.
Solving (Dual) for α ∈ R^N instead of solving (Primal) for w ∈ R^D is advantageous when D ≫ N (high-dimensional problem).

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 191 / 250
Other Models

Kernel method

If we had features φ(x) = (φ1(x), . . . , φM(x)) in (Dual) SVM, we would get

max_{α∈R^N}  { −(1/2) Σ_{i=1}^N Σ_{j=1}^N αi αj ti tj φ(xi)^T φ(xj) + Σ_{i=1}^N αi }
s.t.  Σ_{i=1}^N αi ti = 0;   (αi) ≥ 0.

=⇒ No need to know φ(x), but only

k(x, x′) := φ(x)^T φ(x′),

called the kernel.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 192 / 250
Other Models

Reminder:
For linear regression,

y(x, w∗) = w∗^T x = Σ_{n=1}^N k(x, xn) tn,

for some kernel k(., .): the prediction for input x is a weighted sum of the data (tn), giving more weight to those tn for which xn is closer to x.
In Gaussian Processes (GP) regression, the idea is to assume a nonparametric model y(x) as a GP with covariance function equal to some chosen kernel k(x, x′).

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 193 / 250
Other Models

A function k(., .) is a kernel if it is symmetric and positive semi-definite.
In machine learning in general, a kernel k(x, x′) is chosen to measure similarity between two inputs x and x′.

Examples of kernels:
Linear kernel: k(x, x′) := x^T x′ + c;
Polynomial kernel: k(x, x′) := (x^T x′ + c)^M;
Gaussian kernel (RBF): k(x, x′) := exp(−‖x − x′‖^2 / (2s^2));
k(x, x′) := p(x)p(x′), for a probability measure p(.).
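A small NumPy sketch of the Gaussian (RBF) kernel (a sketch only; it builds the Gram matrix and checks symmetry and positive semi-definiteness numerically):

```python
import numpy as np

def rbf_gram(X, Xp, s=1.0):
    """Gram matrix K[i, j] = exp(-||x_i - x'_j||^2 / (2 s^2))."""
    sq_dists = ((X[:, None, :] - Xp[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * s ** 2))

X = np.random.default_rng(0).normal(size=(5, 3))
K = rbf_gram(X, X, s=2.0)
print(np.allclose(K, K.T), np.linalg.eigvalsh(K).min() > -1e-10)
```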

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 194 / 250
Other Models

Remarks.
With kernels: no need to compute either the features φ(.) or the weights w!
Mercer's theorem states that kernels implicitly represent features "k(x, x′) = φ̃^T(x) φ̃(x′)", for some possibly infinite-dimensional φ̃!
Good for nonlinearly separable data.
We control overfitting by choosing the (hyper)parameters of the kernel (e.g. the scale s of the Gaussian kernel).

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 195 / 250
Neural Networks

Outline
1 Introduction
2 Example: Polynomial Curve Fitting
3 Probability Tools
4 Model Selection
5 Linear Models for Regression
6 Linear Models for Classification
7 Other Models
8 Neural Networks
9 To keep in "deep mind"!

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 196 / 250
Neural Networks

Reminder

x = (x1 , . . . , xD )T ∈ RD : the input variable.


t ∈ R: the target (output) variable.
w = (w0 , . . . , wM−1 )T ∈ RM : the vector of weights (to
compute).
y(x, w) ∈ R: the model (t ≈ y(x, w) + ε for regression, and yk(x, w) = p(Ck|x, wk) for classification).
φ = (φ0 , . . . , φM−1 )T : a vector of basis functions (features).
E (w ): some error/cost function (loss).

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 197 / 250
Neural Networks

Linear Regression:

y(x, w) = Σ_{j=0}^{M−1} wj φj(x) = w^T φ(x).

For the least squares error,

min_w E(w) := Σ_{n=1}^N (tn − w^T φ(xn))^2.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 198 / 250
Neural Networks

Logistic regression:

yk(x, w) := p(Ck|x) := σ(wk^T φ(x)).

For the cross-entropy error,

min_w E(w) = − log p(t|Φ, w) = − Σ_{n=1}^N {tn log yn + (1 − tn) log(1 − yn)}.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 199 / 250
Neural Networks

Neural Networks

Neural networks are a nonlinear generalization: y(x, w) is (highly) nonlinear with respect to both x and w, as a composition of many nonlinear functions ("layers").

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 200 / 250
Neural Networks

Neural Networks

An artificial neural network (ANN) ≡ L successive layers, with output

y(x, W) := z^(L),

with
1. z^(0) := x;
2. from layer l − 1 to layer l, we define intermediate variables (hidden units) by

z^(l) := h^(l)( W^(l) z^(l−1) ),

where the (h^(l)) are chosen (nonlinear) activation functions.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 201 / 250
Neural Networks

Examples

Figure: Linear regression

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 202 / 250
Neural Networks

Figure: Logistic regression


Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 203 / 250
Neural Networks

Figure: A two-layer neural network

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 204 / 250
Neural Networks

For a general 2-layer network:

yk(x, W) = h^(2)( Σ_{j=0}^M wkj^(2) h^(1)( Σ_{i=0}^D wji^(1) xi ) ).

For classification: the final h^(L)(.) is typically a sigmoid/softmax function.
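A minimal NumPy forward pass of such a 2-layer network (a sketch with one particular choice of activations, h^(1) = tanh and h^(2) = sigmoid; the bias terms, i.e. the j = 0 and i = 0 entries, are omitted for brevity):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def two_layer_forward(x, W1, W2):
    z1 = np.tanh(W1 @ x)          # hidden units z^(1) = h^(1)(W^(1) x)
    return sigmoid(W2 @ z1)       # output y(x, W) = h^(2)(W^(2) z^(1))

rng = np.random.default_rng(0)
D, M = 3, 5                        # illustrative input and hidden dimensions
W1 = rng.normal(size=(M, D))
W2 = rng.normal(size=(1, M))
print(two_layer_forward(rng.normal(size=D), W1, W2))
```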

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 205 / 250
Neural Networks

Deep Learning: many layers, diverse ways of connecting the layers...

Figure: A deep neural network

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 206 / 250
Neural Networks

Universal Approximation Theorem

A 2-layer network with a continuous non-polynomial activation function h can approximate (up to any given small error ε) any continuous function f on a compact input domain K, with a sufficiently large number M (depending on ε) of hidden units:

Theorem (Universal Approximation Theorem, Cybenko 1989, Hornik 1991)
∀ε > 0, there exist Mε and wε = (w1, ..., wMε), and an ANN y(x, wε) with activation function h, such that

sup_{x∈K} ‖y(x, wε) − f(x)‖ ≤ ε.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 207 / 250
Neural Networks

Remark. The previous theorem holds for arbitrary width and


bounded depth. There are some analogous results for deep learning,
with bounded width and arbitrary depth (Z. Lu et al. (2017); B.
Hanin and M. Selke (2017), P. Kidger and T. Lyons (2020)).

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 208 / 250
Neural Networks

=⇒ ANN's implicitly and automatically perform feature engineering!


(the "φi (x )"'s).

But: crucial choice of

the number of units (neurons) per layer (how many??)
the number of layers (L = 2 may fit, but possibly L > 2 is better −→ Deep Learning!)
the activation functions h (Sigmoid, Tanh, Rectified Linear Unit ReLU(x) := max(x, 0), ...)

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 209 / 250
Neural Networks

Figure: Some activation functions in ANN's
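For reference, some of the activation functions mentioned above (sigmoid, tanh, ReLU) can be written in a few lines of NumPy (a small sketch):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))     # squashes to (0, 1)

def tanh(a):
    return np.tanh(a)                   # squashes to (-1, 1)

def relu(a):
    return np.maximum(a, 0.0)           # Rectified Linear Unit: max(a, 0)

a = np.linspace(-3, 3, 7)
print(sigmoid(a), tanh(a), relu(a), sep="\n")
```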

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 210 / 250
Neural Networks

Network training: optimization algorithm

Error/loss L = E(w) = (1/N) Σ_{n=1}^N En(w): → to be minimized (w∗):

we approximate the true optimum w∗ by a sequence of iterations (wk)_{k≥0}, hoping that, after K iterations (or K/N "epochs"),

wK ≈ w∗.

The initialization w0 is often random.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 211 / 250
Neural Networks

(Full) batch Gradient Descent with a learning rate α > 0:

wk+1 = wk − (α/N) Σ_{n=1}^N ∇En(wk).

Mini-batch Gradient Descent:

wk+1 = wk − (α/B) Σ_{i=1}^B ∇E_{ni}(wk),

where B < N is the batch size and {n1, ..., nB} a subset of {1, . . . , N}, chosen randomly at each iteration.

Stochastic Gradient Descent (SGD): B = 1:

wk+1 = wk − α ∇E_{nk}(wk).
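A self-contained NumPy sketch of mini-batch SGD on a least-squares loss En(w) = (tn − w^T xn)^2 (illustrative data and hyperparameters; B = N recovers full-batch GD and B = 1 recovers SGD):

```python
import numpy as np

def minibatch_sgd(X, t, alpha=0.1, B=10, epochs=50, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(epochs):
        for idx in np.array_split(rng.permutation(N), N // B):
            # average gradient of E_n(w) = (t_n - w^T x_n)^2 over the batch
            grad = -2.0 * X[idx].T @ (t[idx] - X[idx] @ w) / len(idx)
            w -= alpha * grad
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
t = X @ w_true + 0.1 * rng.normal(size=200)
print(minibatch_sgd(X, t, rng=rng))   # should be close to w_true
```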

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 212 / 250
Neural Networks

Figure: Remember that, for i.i.d. variables, V( (1/B) Σ_{i=1}^B Xi ) = V(X1)/B

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 213 / 250
Neural Networks

Method           Variance   Speed    Memory   On-line
Batch GD         Low        Low      High     No
Mini-batch GD    Medium     Medium   Medium   Yes
SGD              High       High     Low      Yes

Table: Pros and Cons of GD and SGD

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 214 / 250
Neural Networks

Remark

There are accelerating methods1 :


Momentum
Adagrad
RMSprop
Adadelta
Nesterov
Adam
...

1. See for example: https://towardsdatascience.com/10-gradient-descent-optimisation-algorithms-86989510b5e9
Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 215 / 250
Neural Networks

SGD
wk+1 = wk − α ∂L/∂wk

Momentum
wk+1 = wk − α mk
mk = β mk−1 + (1 − β) ∂L/∂wk

Adagrad
wk+1 = wk − ( α / √(vk + ε) ) ∂L/∂wk
vk = vk−1 + (∂L/∂wk)^2
Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 216 / 250
Neural Networks

RMSprop
wk+1 = wk − ( α / √(vk + ε) ) ∂L/∂wk
vk = β vk−1 + (1 − β) (∂L/∂wk)^2

Adadelta
wk+1 = wk − ( √(Dk−1 + ε) / √(vk + ε) ) ∂L/∂wk
Dk = β Dk−1 + (1 − β) [∆wk]^2
vk = β vk−1 + (1 − β) (∂L/∂wk)^2

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 217 / 250
Neural Networks

Nesterov
wk+1 = wk − α mk
mk = β mk−1 + (1 − β) ∂L/∂w∗
w∗ = wk − α mk−1

Adam
wk+1 = wk − ( α / √(v̂k + ε) ) m̂k
m̂k = mk / (1 − β1^k);   v̂k = vk / (1 − β2^k)
mk = β1 mk−1 + (1 − β1) ∂L/∂wk
vk = β2 vk−1 + (1 − β2) (∂L/∂wk)^2
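A compact NumPy sketch of one Adam step, following the update rules above (with the usual default hyperparameters α = 10^-3, β1 = 0.9, β2 = 0.999, ε = 10^-8), and a toy usage on L(w) = ‖w‖^2:

```python
import numpy as np

def adam_update(w, grad, m, v, k, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate m_k
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate v_k
    m_hat = m / (1 - beta1 ** k)                # bias corrections
    v_hat = v / (1 - beta2 ** k)
    w = w - alpha * m_hat / np.sqrt(v_hat + eps)
    return w, m, v

w = np.array([1.0, -2.0])
m, v = np.zeros_like(w), np.zeros_like(w)
for k in range(1, 2001):                        # gradient of ||w||^2 is 2w
    w, m, v = adam_update(w, 2.0 * w, m, v, k, alpha=0.01)
print(w)                                        # close to the minimizer (0, 0)
```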
Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 218 / 250
Neural Networks

Remark
For losses that are convex but nondifferentiable (e.g. lasso penalty, hinge loss, ...), the gradient is replaced by a subgradient, giving the proximal gradient methods.

Figure: The subdifferential of a convex real function f at x0 is ∂f(x0) := [f′(x0−), f′(x0+)] (e.g. for f(x) = |x|, ∂f(0) = [−1, 1])

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 219 / 250
Neural Networks

Remark
For non-convex losses (often with ANN's!), gradient descent
algorithms may be stuck at local minima:

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 220 / 250
Neural Networks

Network training: forward prop. - backprop.

For Gradient Descent algorithms, we need to compute the gradient of


the objective function (loss):
y (x , w ) and E (w ) are computed by forward propagation.
∇w y (x , w ) and ∇w E (w ) are computed by backpropagation.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 221 / 250
Neural Networks

A toy example:

Consider a 2-layer, one-neuron-per-layer network:

y = y(x, w) = h^(2)( w^(2) h^(1)( w^(1) x ) ),

and a loss

E = (y − t)^2.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 222 / 250
Neural Networks

Computing y and E:
a^(1) = w^(1) x;
z^(1) = h^(1)(a^(1));
a^(2) = w^(2) z^(1);
y = z^(2) = h^(2)(a^(2));
E = (y − t)^2.

⇒ forward propagation from the first layer to the last layer.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 223 / 250
Neural Networks

Computing ∇y and ∇E:

∂E/∂w^(l) = 2(y − t) ∂y/∂w^(l);

∂y/∂w^(2) = h^(2)′(a^(2)) z^(1);

∂y/∂w^(1) = h^(2)′(a^(2)) w^(2) h^(1)′(a^(1)) x,
where the factor h^(2)′(a^(2)) has already been computed for layer 2.

⇒ backpropagation from the last layer to the first layer.
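This toy example fits in a few lines of NumPy (a sketch with one particular choice of activations, h^(1) = tanh and h^(2) = identity, so that h^(2)′ = 1):

```python
import numpy as np

def forward_backward(x, t, w1, w2):
    # forward propagation
    a1 = w1 * x
    z1 = np.tanh(a1)
    a2 = w2 * z1
    y = a2                                      # h2 = identity
    E = (y - t) ** 2
    # backpropagation
    dy_dw2 = z1                                 # h2'(a2) * z1 with h2' = 1
    dy_dw1 = w2 * (1.0 - np.tanh(a1) ** 2) * x  # h2'(a2) * w2 * h1'(a1) * x
    dE_dw1 = 2.0 * (y - t) * dy_dw1
    dE_dw2 = 2.0 * (y - t) * dy_dw2
    return E, dE_dw1, dE_dw2

print(forward_backward(x=0.5, t=1.0, w1=0.3, w2=-1.2))
```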

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 224 / 250
Neural Networks

In general ANN: Forward propagation:

For general networks, we set

aj^(l) := (W^(l) z^(l−1))_j = Σ_i wji^(l) zi^(l−1);

zj^(l) := h^(l)(aj^(l));

yn := y(xn, w).

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 225 / 250
Neural Networks

Forward propagation: from the first layer to the last layer,

zj^(0) = xnj;

zj^(l) = h^(l)( Σ_i wji^(l) zi^(l−1) ) = h^(l)(aj^(l)),   l = 1 . . . L;

aj^(l) = Σ_i wji^(l) zi^(l−1)                                      (2)
       = Σ_i wji^(l) h^(l−1)(ai^(l−1)),   l = 1 . . . L;           (3)

ynj = zj^(L) = h^(L)(aj^(L)).

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 226 / 250
Neural Networks

In general ANN: Backpropagation:

Backpropagation is an efficient method for evaluating the derivatives of the error function with respect to the parameters w:

∂En / ∂wji^(l).

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 227 / 250
Neural Networks

We define

δj^(l) := ∂En / ∂aj^(l).                                           (4)

Now,

∂En / ∂wji^(l) = (∂En / ∂aj^(l)) (∂aj^(l) / ∂wji^(l)).

Then, using (4) and (2), we get, for l = 1 . . . L,

∂En / ∂wji^(l) = δj^(l) zi^(l−1).                                  (5)

It remains to compute the (δj^(l)), and the idea now is to use a backward scheme.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 228 / 250
Neural Networks

δj^(l) = ∂En / ∂aj^(l) = Σ_k (∂En / ∂ak^(l+1)) (∂ak^(l+1) / ∂aj^(l)) = Σ_k δk^(l+1) (∂ak^(l+1) / ∂aj^(l)).

From (3), we have

∂ak^(l+1) / ∂aj^(l) = wkj^(l+1) h^(l)′(aj^(l)).

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 229 / 250
Neural Networks

In conclusion:

Backpropagation: from the last layer to the first layer,

δj^(L) = ∂En / ∂aj^(L);

δj^(l) = h^(l)′(aj^(l)) Σ_k wkj^(l+1) δk^(l+1),   l = L − 1 . . . 1.

And,

∂En / ∂wji^(l) = δj^(l) zi^(l−1).
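The full recursion can be implemented compactly for a fully-connected network. The sketch below (one particular set of assumptions: tanh hidden layers, a linear output layer, squared error En = (y − t)^2, and no bias terms) returns the gradients ∂En/∂W^(l) via the δ recursion above:

```python
import numpy as np

def mlp_grads(x, t, Ws):
    """Backpropagation for a tanh MLP with linear output and squared error."""
    h, h_prime = np.tanh, lambda a: 1.0 - np.tanh(a) ** 2
    # forward pass: store the a^(l) and z^(l)
    zs, As = [x], []
    for l, W in enumerate(Ws):
        a = W @ zs[-1]
        As.append(a)
        zs.append(h(a) if l < len(Ws) - 1 else a)   # identity on the last layer
    y = zs[-1]
    # backward pass: delta^(L) = dEn/da^(L) = 2 (y - t) since h^(L) = identity
    deltas = [2.0 * (y - t)]
    for l in range(len(Ws) - 2, -1, -1):
        deltas.insert(0, h_prime(As[l]) * (Ws[l + 1].T @ deltas[0]))
    # dEn/dW^(l)[j, i] = delta_j^(l) z_i^(l-1)
    return [np.outer(d, z) for d, z in zip(deltas, zs[:-1])]

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 3)), rng.normal(size=(1, 4))]
grads = mlp_grads(rng.normal(size=3), np.array([1.0]), Ws)
print([g.shape for g in grads])   # [(4, 3), (1, 4)], matching the W^(l) shapes
```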

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 230 / 250
Neural Networks

Remarks

Regularization in ANN's can be performed by:

ridge/lasso/elastic net;
early stopping: stop at the number of iterations (epochs) from
which the test error does not improve or starts to increase;
dropout: remove some units at random when training (see the sketch after this list);
data augmentation;
tangent propagation (e.g. Tikhonov);
label smoothing ("{0,1} −→ {0 + rand ε, 1 - rand ε}");
...
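A minimal NumPy sketch of (inverted) dropout, as referenced in the list above: during training each unit is dropped with probability p_drop and the surviving activations are rescaled, so that their expectation is unchanged at test time.

```python
import numpy as np

def dropout(z, p_drop=0.5, training=True, rng=None):
    if not training:
        return z                       # no dropout at test time
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(z.shape) >= p_drop
    return z * mask / (1.0 - p_drop)   # inverted-dropout rescaling

z = np.ones(10)
print(dropout(z, p_drop=0.5, rng=np.random.default_rng(0)))
```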

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 231 / 250
Neural Networks

Some types of neural networks (see a Deep Learning course):

Convolutional Neural Networks (CNN)


Bayesian Neural Networks (BNN)
Recurrent Neural Networks (RNN), LSTM
Mixture Density Networks (MDN)
Generative Adversarial Networks (GAN)
...

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 232 / 250
Neural Networks

Remark. It has been shown that BNN's are related to GP's!

Figure: When parameters θ of an innite width network are sampled from


a prior p(θ), the network output is a GP. (A Video,
An article (2021): GP's and the mystery of DL)

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 233 / 250
To keep in "deep mind"!

Outline
1 Introduction
2 Example: Polynomial Curve Fitting
3 Probability Tools
4 Model Selection
5 Linear Models for Regression
6 Linear Models for Classification
7 Other Models
8 Neural Networks
9 To keep in "deep mind"!

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 234 / 250
To keep in "deep mind"!

Summary

Machine Learning tries to fit the data probability distribution (p(t|x) or p(x, t)).
Minimizing the loss amounts to

min_w − log p(t|X, w).

The average E[t|x] is modeled by a function y(x, w) that ranges from linear (e.g. linear regression) and generalized linear (e.g. logistic regression) to nonlinear (e.g. kernels, neural networks).

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 235 / 250
To keep in "deep mind"!

1. error = loss = L(t, y(X, w)) (+ regul.);
2. w∗ = arg min_w loss (often by GD, SGD, ...);
3. t̂(x) := y(x, w∗), the prediction for t(x).

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 236 / 250
To keep in "deep mind"!

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 237 / 250
To keep in "deep mind"!

To control under-/overfitting:
data preparation: removing outliers and duplicates, filling in missing data, feature scaling (standardization/normalization), feature engineering/data mining, ...
tuning of the model: choice of the algorithm and its hyperparameters;
tuning of the loss: choice of the loss, the regularization, the optimization algorithm and its hyperparameters.

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 238 / 250
To keep in "deep mind"!

Figure: Maybe often, but not always...

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 239 / 250
To keep in "deep mind"!

For a long time, this has been the story; but it's only half the story!

Recently (2017 - ?): overparameterized ML may not overfit!

Figure: Double-descent risk curve: "curse of dimensionality" or "blessing of dimensionality"? Source: [Belkin, Hsu, Ma, Mandal, 2019]

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 240 / 250
To keep in "deep mind"!

Then maybe you'll design your own algorithm that will be the best!

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 241 / 250
To keep in "deep mind"!

−→ NO, "no free lunch"!

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 242 / 250
To keep in "deep mind"!

"No Free Lunch" (NFL)!

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 243 / 250
To keep in "deep mind"!

Learning Theory ("NFL")

Theorem (NFL, Wolpert 1996)


For any two algorithms A and B , trained on some training set Dtrain
and tested on all possible outside test sets, the average (over all
functions f ) of the test-loss of their predictions is the same:

E[Ltest |Dtrain , A] = E[Ltest |Dtrain , B].

"If an algorithm performs well on a certain class of problems then it


necessarily pays for that with degraded performance on the set of all
remaining problems."

=⇒ when averaged over all probability distributions of data,


all algorithms are as good as "random guessing" !!

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 244 / 250
To keep in "deep mind"!

Figure: Fooling AI: after adding only some negligible noise, the same algorithm no longer sees a panda but a gibbon!
Source: Explaining and Harnessing Adversarial Examples, Goodfellow et al,
ICLR 2015

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 245 / 250
To keep in "deep mind"!

Figure: Fooling a face recognition system!


Source: Dodging Attack Using Carefully Crafted Natural Makeup, Guetta
et al, 2021

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 246 / 250
To keep in "deep mind"!

Probably Approximately Correct (PAC)


But, if we restrict the class of algorithms, then we can state that:

Theorem (PAC learnability, Valiant 1984, Vapnik 2000)
If the data are separable by a classifier from a set H of classifiers with finite "Vapnik-Chervonenkis (VC) dimension", then there is a classifier h∗ ∈ H such that, for all ε > 0 and δ ∈ (0, 1),

P( test error(h∗) ≤ ε ) ≥ 1 − δ.

To achieve it, the number N of training samples must be

N ≥ (c/ε) ( VC(H) + log(1/δ) ).
Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 247 / 250
To keep in "deep mind"!

Figure: The VC dimension (complexity) of the set of straight-line classifiers is 3

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 248 / 250
To keep in "deep mind"!

"Interpretability issue" in ML

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 249 / 250
To keep in "deep mind"!

Azmi MAKHLOUF (October 19, 2021) Machine Learning: Probabilistic Fundamentals 250 / 250
