Deep Learning As A Building Block in Probabilistic Models

Pierre-Alexandre Mattei
http://pamattei.github.io/
@pamattei
But actually, what is deep learning?

A deep neural network is essentially a composition F = fL ◦ · · · ◦ f1 of simple parametrised layers (affine functions interleaved with nonlinear activations).

The derivatives of F with respect to the tunable parameters can be computed using the chain rule via the backpropagation algorithm.
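As a concrete illustration (a minimal sketch assuming PyTorch; nothing here comes from the slides), automatic differentiation applies exactly this chain rule:

```python
import torch

# Backpropagation at toy scale: autograd applies the reverse-mode chain rule
# through the composition tanh(W x + b) to get gradients w.r.t. the
# tunable parameters W and b.
W = torch.randn(3, 2, requires_grad=True)
b = torch.zeros(3, requires_grad=True)
x = torch.randn(2)

out = torch.tanh(W @ x + b).sum()   # a scalar function of the parameters
out.backward()                      # backpropagation = chain rule through the graph

print(W.grad.shape, b.grad.shape)   # torch.Size([3, 2]) torch.Size([3])
```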
A glimpse at the zoology of layers

The simplest kind of affine layer is called a fully connected layer:

fl (x) = Wl x + bl ,

where Wl and bl are tunable parameters.

(Figure: two common activation functions, the hyperbolic tangent and the rectified linear unit (ReLU).)
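For instance, a fully connected layer and these two activations might look as follows in PyTorch (the dimensions are arbitrary illustrative choices):

```python
import torch
import torch.nn as nn

# A fully connected layer f_l(x) = W_l x + b_l mapping R^4 to R^3,
# followed by the two activations shown above.
fc = nn.Linear(in_features=4, out_features=3)   # W_l and b_l are its tunable parameters

x = torch.randn(4)
h = fc(x)
print(torch.tanh(h))   # hyperbolic tangent
print(torch.relu(h))   # rectified linear unit (ReLU)
```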
Why is it convenient to compose affine functions?

• Classical universal approximation results show that a network with a single, sufficiently wide hidden layer can approximate any continuous function arbitrarily well on a compact set.

• There are similar results for very thin but arbitrarily deep networks (Lin & Jegelka, NeurIPS 2018).

• Some prior knowledge can be distilled into the architecture (i.e. the type of affine functions/activations) of the network. For example, convolutional neural networks (CNNs, LeCun et al., NeurIPS 1990) leverage the fact that local information plays an important role in images/sound/sequence data. In that case, the affine functions are convolution operators with some learnt filters.
• Often, this prior knowledge can be based on known symmetries, leading to deep architectures that are equivariant or invariant to the action of some group (see e.g. the work of Taco Cohen or Stéphane Mallat). This is useful when dealing with images, sound, molecules... (a tiny equivariance check in code follows after this list).
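To make the equivariance idea concrete, here is a small PyTorch check (a sketch assuming circular padding, for which translation equivariance holds exactly; this example is not from the slides):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A 1D convolution with circular padding: an affine map whose weights are a
# learnt filter, exactly equivariant to circular translations of its input.
conv = nn.Conv1d(1, 1, kernel_size=3, padding=1, padding_mode="circular", bias=False)

x = torch.randn(1, 1, 16)                        # a toy 1D signal
shift = lambda t: torch.roll(t, shifts=3, dims=-1)

# Convolving a shifted signal gives the shifted convolution of the signal.
print(torch.allclose(conv(shift(x)), shift(conv(x)), atol=1e-6))  # True
```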
A simple example: nonlinear regression with a multilayer perceptron (MLP)

We want to perform regression on a data set D = ((x1 , y1 ), ..., (xn , yn )).

We can model the regression function using a multilayer perceptron (MLP): two fully connected layers with a hyperbolic tangent in-between:

y ≈ Fθ (x) = W1 tanh(W0 x + b0 ) + b1 .

A natural way to find the parameters θ (a.k.a. the weights) of the MLP Fθ is to minimise the mean squared error:

MSE(θ) = (1/n) ∑_{i=1}^{n} (yi − Fθ (xi ))² .
∀i ≤ n, yi ≈ Fθ (xi ) = W1 tanh(W0 xi + b0 ) + b1 ,

(Figure: the data together with the regression function Fθ learnt by the MLP.)
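A minimal training sketch of this MLP regression in PyTorch (the toy sinusoidal data, hidden width 32, Adam optimiser and number of steps are illustrative assumptions, not taken from the slides):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical 1D toy data (x_i, y_i); any regression data set would do.
x = torch.linspace(-10, 10, 200).unsqueeze(1)
y = torch.sin(x / 3) + 0.1 * torch.randn_like(x)

# The MLP of the slide: two fully connected layers with a tanh in-between,
# F_theta(x) = W1 tanh(W0 x + b0) + b1.
F = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))

optimiser = torch.optim.Adam(F.parameters(), lr=1e-2)
for _ in range(2000):
    optimiser.zero_grad()
    mse = ((y - F(x)) ** 2).mean()   # MSE(theta) = (1/n) sum_i (y_i - F_theta(x_i))^2
    mse.backward()                   # gradients via backpropagation
    optimiser.step()

print(mse.item())                    # final training MSE
```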
What’s "kind of wrong" with the approach we just saw?
This was of just minimising the MSE allows us to do predictions, but we cannot assess
the uncertainty of our predictions.
8
What’s "kind of wrong" with the approach we just saw?
This was of just minimising the MSE allows us to do predictions, but we cannot assess
the uncertainty of our predictions.
8
What’s "kind of wrong" with the approach we just saw?
This was of just minimising the MSE allows us to do predictions, but we cannot assess
the uncertainty of our predictions.
So what do we want?
8
What’s "kind of wrong" with the approach we just saw?
This was of just minimising the MSE allows us to do predictions, but we cannot assess
the uncertainty of our predictions.
So what do we want?
8
What’s "kind of wrong" with the approach we just saw?
This was of just minimising the MSE allows us to do predictions, but we cannot assess
the uncertainty of our predictions.
So what do we want?
To do that, we need to have a probabilistic model of our data, hence the need for
generative models.
8
What’s a generative model?

Let’s start with some data D. For example, in the regression case with p-dimensional continuous features,

D = ((x1 , y1 ), ..., (xn , yn )) ∈ (Rp × R)n .

In the binary classification case,

D = ((x1 , y1 ), ..., (xn , yn )) ∈ (Rp × {0, 1})n .

In the unsupervised case, the data usually looks like D = (x1 , ..., xn ) ∈ (Rp )n .

We call (x1 , ..., xn ) the features and (y1 , ..., yn ) the labels. The features are usually stored in an n × p matrix called the design matrix.

A generative model is then simply a probabilistic model of the data: a distribution p(D) meant to describe how D was generated.
Generative models for supervised learning: General assumptions

Although we’ll mostly focus on the unsupervised case in my lectures, let us begin with the (arguably simpler) supervised case D = ((x1 , y1 ), ..., (xn , yn )). It could be either a regression or a classification task, for example.

Most of the time, it makes sense to build generative models that assume that the observations are independent. This leads to

p(D) = p((x1 , y1 ), ..., (xn , yn )) = ∏_{i=1}^{n} p(xi , yi ).

Usually, we also further assume that the data are identically distributed. This means that all the (xi , yi ) will follow the same distribution, which we may denote p(x, y).

When these two assumptions are met, we say that the data are independent and identically distributed (i.i.d.). This is super useful in practice because, rather than having to find a distribution p((x1 , y1 ), ..., (xn , yn )) over a very large space (whose dimension grows linearly with n), we just have to find a much lower-dimensional distribution p(x, y).
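Spelling out one consequence that will be used later (a small worked identity, not stated on the slide): under these assumptions, taking the log of the joint density turns the product into a sum,

```latex
\log p(D) = \log \prod_{i=1}^{n} p(x_i, y_i) = \sum_{i=1}^{n} \log p(x_i, y_i).
```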
Generative models for supervised learning: Do we really have to be fully generative?

Being fully generative would mean modelling the whole joint distribution p(x, y) of features and labels.

But if we mainly want to do (probabilistic) predictions, knowing p(y|x) is enough. It’s exactly this conditional distribution that will give us statements like "the probability that this patient x has this kind of cancer is 56%".

Based on these insights, there are two main approaches for building p(x, y):

• The fully generative (or model-based) approach posits a joint distribution p(x, y) (often by specifying both p(y) and p(x|y)).

• The discriminative (or conditional) approach just specifies p(y|x) and completely ignores p(x).
Generative models for supervised learning: Discriminative vs fully generative

One of the wines the bad guys counterfeited was from the Barolo region. According to Wikipedia, those wines have "pronounced tannins and acidity", and "moderate to high alcohol levels (Minimum 13%)". This would help a trained human recognise them, but could we train an algorithm to learn those characteristics?
Generative vs Discriminative: a concrete example

(Figure: scatter plot of acidity versus alcohol for the wines, coloured by class: Barolo vs. Other.)
Generative vs Discriminative: a concrete example

The generative way would use the formula p(x, y) = p(y)p(x|y) and model the class-conditional distributions p(x|y) using a continuous bivariate distribution (e.g. 2D Gaussians).

Here is what we obtain using the R package Mclust (Scrucca, Fop, Murphy, and Raftery, R Journal, 2016).

(Figure: the fitted Gaussian class-conditional densities plotted over acidity versus alcohol.)
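To make the generative recipe concrete, here is a rough Python sketch of the same idea (not the Mclust model from the slide: just empirical class priors and one Gaussian per class, fitted by moment estimates on made-up numbers):

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_generative(X, y):
    """Fit p(x, y) = p(y) p(x|y) with one 2D Gaussian per class."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (np.mean(y == c),            # class prior p(y = c)
                     Xc.mean(axis=0),            # class-conditional mean
                     np.cov(Xc, rowvar=False))   # class-conditional covariance
    return params

def predict_proba(params, x):
    """Bayes' rule: p(y = c | x) is proportional to p(y = c) p(x | y = c)."""
    joint = {c: prior * multivariate_normal.pdf(x, mean=mu, cov=cov)
             for c, (prior, mu, cov) in params.items()}
    total = sum(joint.values())
    return {c: v / total for c, v in joint.items()}

# Hypothetical usage with made-up (alcohol, acidity) measurements:
X = np.array([[13.5, 100.], [14.0, 110.], [13.8, 95.],
              [11.5, 70.],  [12.0, 80.],  [12.5, 72.]])
y = np.array([1, 1, 1, 0, 0, 0])   # 1 = Barolo, 0 = Other
print(predict_proba(fit_generative(X, y), np.array([13.8, 105.])))
```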
Generative vs Discriminative: a concrete example

The discriminative way would only model p(y|x). Since there are only 2 classes, this means that p(y|x) will be a Bernoulli random variable whose parameter π(x) ∈ [0, 1] is a function of the features.

(Figure: scatter plot of acidity versus alcohol, coloured by class: Barolo vs. Other.)

Key idea: since we have an unknown function to estimate (namely π), we can model it using a neural net.
Generative vs Discriminative: last words

We’ll focus now on the discriminative approach using neural nets, because it is simpler. For more on the differences and links between the generative and discriminative schools, a wonderful reference is Tom Minka’s short note on the subject: Discriminative models, not discriminative training.²

Our discriminative model for binary classification is therefore

p(y|x) = B(y|π(x)),

where B(·|θ) denotes the density of a Bernoulli distribution with parameter θ ∈ [0, 1]. The key idea is then to model the function x 7→ π(x) using a neural net.

² https://tminka.github.io/papers/minka-discriminative.pdf
How to model π

Our discriminative model for binary classification is

p(y|x) = B(y|π(x)),

and we wish to model π using a neural net. But what kind of neural net?

The only really important constraint of the problem is that we need to have π(x) ∈ [0, 1] for all x. Can a neural net guarantee that?

Yes! By using a function that only outputs stuff in [0, 1] as the output layer. For example the logistic sigmoid function σ : a 7→ 1/(1 + exp(−a)).

(Figure: the logistic sigmoid function σ(x).)
How to model π

We can simply define

π(x) = σ(fθ (x)),

where σ is the sigmoid function and fθ : Rp −→ R is any neural network (whose weights are stored in a vector θ) that takes the features as input and returns an unconstrained real number.

We have a lot of flexibility to choose fθ . In particular, if the features x1 , ..., xn are images, we could use a CNN. In the case of time series, we could use a recurrent neural net. In the case of sets, we could use a Deep Sets architecture (Zaheer et al., NeurIPS 2017).

In our simple example, we could just reuse a small MLP:

fθ (x) = W1 tanh(W0 x + b0 ) + b1 .

Since the function π and the model p(y|x) now depend on some parameters θ, we’ll denote them by πθ and pθ (y|x) from now on.
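A corresponding sketch of πθ in PyTorch (the input dimension p = 2, the hidden width and the made-up wine features are illustrative assumptions, not taken from the slides):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# f_theta: a small MLP mapping R^p to an unconstrained real number
# (p = 2 here, e.g. alcohol and acidity; the hidden width 16 is arbitrary).
p = 2
f_theta = nn.Sequential(nn.Linear(p, 16), nn.Tanh(), nn.Linear(16, 1))

def pi_theta(x):
    """pi_theta(x) = sigmoid(f_theta(x)), guaranteed to lie in (0, 1)."""
    return torch.sigmoid(f_theta(x))

x = torch.tensor([[13.8, 105.0]])   # one hypothetical wine
print(pi_theta(x))                  # the model's probability that this wine is a Barolo
```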
How to find θ

There are many ways to find good parameter values for a generative model. One could use Bayesian inference, score matching, the method of moments, adversarial training...

Let us focus on one of the most traditional ways: maximum likelihood. The idea is to find a θ̂ that maximises the log-likelihood function log pθ (D).

We’ll call ℓ(θ) this log-likelihood (in fact, we’ll call any function that is equal to log pθ (D) up to an additive constant the log-likelihood).
How to find θ: from ML to XENT

We have

ℓ(θ) = ∑_{i=1}^{n} log pθ (yi |xi ) = ∑_{i=1}^{n} log[ π(xi )^yi (1 − π(xi ))^(1−yi) ],

which leads to

ℓ(θ) = ∑_{i=1}^{n} [yi ln π(xi ) + (1 − yi ) ln(1 − π(xi ))] .

We want to maximise this function, which is equivalent to minimising its opposite, which is called the cross-entropy loss.

The cross-entropy loss is the most commonly used loss for neural networks, and is a way of doing maximum likelihood without necessarily saying it.
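A quick numerical check of this equivalence, sketched in PyTorch with made-up probabilities and labels (F.binary_cross_entropy averages over the n observations, which only rescales the objective by 1/n):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical toy values: predicted probabilities pi(x_i) and binary labels y_i.
pi = torch.rand(5) * 0.9 + 0.05          # outputs of a sigmoid, kept away from 0 and 1
y = torch.randint(0, 2, (5,)).float()    # labels in {0, 1}

# Negative Bernoulli log-likelihood, written exactly as on the slide (averaged over i).
nll = -(y * torch.log(pi) + (1 - y) * torch.log(1 - pi)).mean()

# PyTorch's binary cross-entropy loss gives the same number.
xent = F.binary_cross_entropy(pi, y)

print(torch.allclose(nll, xent))  # True
```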