
Probabilistic generative and discriminative models for classification

Stephan Schmidt

March 13, 2024

This is a draft document. Please bring any errors or ambiguities to my attention.

1 Introduction
In supervised learning (or training), the unknown parameters of a model are estimated with a dataset
where the input and target variables are known. The parameters can be estimated using maximum likelihood estimators or Bayesian inference methods. In this document, we will only focus on maximum likelihood estimators for parameter estimation.
There are two classes of supervised learning problems: (1) regression and (2) classification. In the regression problem, the objective is to predict a continuous variable y ∈ R for an input x ∈ R^{D×1}.^i In the classification problem, we aim to predict a discrete variable C ∈ Z from an input x ∈ R^{D×1}. The discrete variable is often referred to as a class or category, and we can use C_k to denote the kth class in a multi-class classification problem. In probabilistic classification models, the objective is to infer p(C_k|x), where \sum_k p(C_k|x) = 1 and C_k denotes the kth class. The most probable class can be estimated with

\hat{k} = \arg\max_k \, p(C_k|x)    (1)

This means that x is classified to class k̂ with probability p(C_k̂|x). The benefit of probabilistic
classification models is that probabilities are assigned to the predictions. This document will treat the
two main classes of probabilistic classification models, namely,
• Probabilistic discriminative models (Section 4.3 of Bishop [1]), and

• Probabilistic generative models (Section 4.2 of Bishop [1]).


This document will closely follow the approach presented in Bishop [1] and is seen as supplementary.
The layout of the document is as follows: In Section 2, the dataset that will be used throughout this
document is presented, whereafter probabilistic discriminative models are presented in Section 3. In
Section 4, probabilistic generative models are presented and in Section 5 class recognition problems are
discussed. The document is concluded in Section 6 with additional information included in Appendix
A.

2 Dataset
The Iris dataset will be used for demonstration purposes in subsequent sections. More information
on the Iris dataset is included in Appendix A but, in summary, the objective of the classification task is to predict the class of Iris flower (Iris Setosa, Iris Versicolour, or Iris Virginica) from the available measurements. For demonstration purposes, the original four-dimensional data x ∈ R^{4×1} were reduced to two-dimensional data φ ∈ R^{2×1}, with the two-dimensional training data and testing data shown in Figure 1. The dimensionality reduction process is described in Appendix A, since the focus of this document is on classification models.

^i We could also develop models that aim to predict a vector of variables y ∈ R^{D_o×1}.
Figure 1: The training data (classes: setosa, versicolor, virginica) are presented in Figure 1(a) and the testing data are superimposed on the training data in Figure 1(b). A data point is denoted φ = [φ_1, φ_2].

In the next section, probabilistic discriminative models are presented.

3 Probabilistic discriminative model


In this section, a probabilistic discriminative model for classification will be discussed. This class of models aims to find a mapping directly from the data x to the probability distribution over the classes, i.e., p(C_k|x). We first need to revise Bernoulli distributions to develop probabilistic discriminative models.

3.1 Bernoulli distributions (Section 2 in Ref. [1])


A Bernoulli distribution is defined as follows over a discrete (binary) variable c:

p(c|\mu) = \mu^c \cdot (1 - \mu)^{1-c}    (2)

where c ∈ {0, 1} and 0 ≤ µ ≤ 1 is the mean of the distribution. The maximum likelihood estimate of µ using the dataset c = [c_1, c_2, \ldots, c_N] ∈ R^{N×1} is given by^ii

\hat{\mu} = \frac{1}{N} \sum_{n=1}^{N} c_n    (3)

A Bernoulli distribution can be used to model the outcome of a single coin flip. If c = 1 denotes the
heads class and c = 0 denotes the tails class, then µ denotes the probability that the coin will land on
heads. Hence, the coin is unbiased if µ = 0.5.
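As a small numerical illustration (a minimal sketch assuming numpy; the coin-flip outcomes are made up), the maximum likelihood estimate of Equation (3) is simply the sample mean of the binary outcomes:

import numpy as np

# Ten hypothetical coin flips: 1 = heads, 0 = tails.
c = np.array([1, 0, 0, 1, 1, 0, 1, 1, 1, 0])

# Maximum likelihood estimate of the Bernoulli mean, Equation (3).
mu_hat = c.mean()
print(mu_hat)  # 0.6, i.e. the estimated probability of heads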
We can extend this to the multi-dimensional case over a multidimensional discrete variable c as
follows:
p(c|\mu) = \prod_{k=1}^{K} \mu_k^{c_k}    (4)
where each component of the mean vector µ adheres to 0 ≤ µ_k ≤ 1, and each element of c is binary, i.e., c_k ∈ {0, 1}. Furthermore, \sum_{k=1}^{K} \mu_k = 1 and \sum_{k=1}^{K} c_k = 1, which means that one element is 1 and the rest are 0. These models can also be extended to have a mean that is input-data dependent, e.g.,

p(c|x, w_1, w_2, \ldots, w_K) = \prod_{k=1}^{K} \mu(w_k^T x)^{c_k}    (5)

where the mean function µ : R → [0, 1] and w_k is a parameter vector of the model that corresponds to the kth class. In this document, the notation R → [0, 1] means a real number is mapped to a real number between 0 and 1. In the next section, functions that can be used as the mean of the Bernoulli distribution are discussed.

^ii This needs to be derived by analytically calculating the stationary point of the likelihood function of the Bernoulli model.

3.2 Functions mapping R^D to [0, 1]


Candidate functions for the mean of the Bernoulli distribution need to have a range between 0 and 1. An example of a function that adheres to this is the sigmoid function [1]:

\sigma(x) = \frac{1}{1 + e^{-x}}    (6)

where σ : R → [0, 1] denotes a sigmoid function that maps an input x to the range [0, 1]. The sigmoid has the following properties:

• lim_{x→−∞} σ(x) = 0.0.
• σ(0) = 0.5.
• lim_{x→∞} σ(x) = 1.0.

The sigmoid function can be parametrised as follows:

\sigma(a \cdot x + b) = \frac{1}{1 + e^{-(a \cdot x + b)}}    (7)

where a and b are two parameters that control the behaviour of the function. In Figure 2, examples of these functions are shown for different combinations of parameters.

Figure 2: Examples of sigmoid functions of the form in Equation (7) are shown over x: 1/(1 + e^{-2x}), 1/(1 + e^{-2x+3}), 1/(1 + e^{-10x}), and 1/(1 + e^{-10x-15}).

The input data is often multi-dimensional, x ∈ R^{D×1}, and therefore the sigmoid function needs to be extended to multi-dimensional data. Here is an example of how we can accommodate multi-dimensional data using a sigmoid function:

\sigma(w^T x) = \frac{1}{1 + e^{-w^T x}}    (8)

where w ∈ R^{D×1} denotes a parameter vector and x ∈ R^{D×1} denotes the multi-dimensional data vector. An example of different sigmoid functions in a two-dimensional space is shown in Figure 3. The decision boundary between the two classes is obtained by solving σ(w^T x) = 0.5. The decision boundary is given by −w^T x = 0 and is also included in Figure 3. Furthermore, we can write w = ||w||_2 u, where ||w||_2 is the L_2-norm of the weight vector and u = w / ||w||_2 is the unit vector, which means that the decision function is given by

-||w||_2 \, u^T x = 0    (9)
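As a small illustration of Equation (8) (a minimal sketch assuming numpy; the weight vector and points are arbitrary), the sign of w^T x determines on which side of the decision boundary a point lies:

import numpy as np

def sigmoid(a):
    """Elementwise logistic sigmoid, Equation (6)."""
    return 1.0 / (1.0 + np.exp(-a))

# Arbitrary weight vector and two test points (D = 2).
w = np.array([1.0, 2.0])
x_pos = np.array([1.0, 1.0])    # w @ x > 0, so sigma > 0.5
x_neg = np.array([-1.0, -1.0])  # w @ x < 0, so sigma < 0.5

print(sigmoid(w @ x_pos))  # ~0.95, classified towards c = 1
print(sigmoid(w @ x_neg))  # ~0.05, classified towards c = 0
print(sigmoid(0.0))        # 0.5 on the decision boundary w @ x = 0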

Figure 3: Examples of sigmoid functions of the form σ(w_1 x_1 + w_2 x_2) = (1 + e^{-w_1 x_1 - w_2 x_2})^{-1}, shown for (w_1, w_2) = (1.0, 2.0), (-1.0, 2.0), (0.0, 2.0), (-1.0, 0.0), (-1.0, -1.0), and (5.0, -2.0), together with the decision boundary w_1 x_1 + w_2 x_2 = 0 in each case.

It is generally desirable to have an offset term in the sigmoid function, i.e.,

\sigma(w^T x + w_0) = \frac{1}{1 + e^{-w^T x - w_0}}    (10)

which means that the decision function is given by −w^T x = w_0. We can use Equation (8) to incorporate an offset term by re-parametrising the input data as x̃ = [1, x_1, x_2, \ldots, x_D] ∈ R^{(D+1)} and defining the parameter vector w̃ = [w_0, w_1, \ldots, w_D] ∈ R^{(D+1)}, i.e., a 1 is added to the data and the bias parameter w_0 is added to the parameter vector. Hence, σ(w^T x + w_0) is equivalent to σ(w̃^T x̃).
Since these functions have a range between 0 and 1, we can use the sigmoid function as the mean of the Bernoulli distribution:

p(c|x, w) = \sigma(w^T x)^c \cdot (1 - \sigma(w^T x))^{1-c}    (11)

Hence, this model can be used to predict a binary class label c ∈ {0, 1} from the multi-dimensional data x. If we return to the coin flip example, the probability that the coin will land on heads (c = 1) or tails (c = 0) is controlled by w^T x: if w^T x is a large positive number, then heads is more likely; if w^T x is a negative number, then tails is more likely; and w^T x = 0 is the boundary separating the two classes. We can use the following function as the mean function in a K-class classification problem [1]:

\mu(w_k^T x) = \frac{e^{w_k^T x}}{\sum_{m=1}^{K} e^{w_m^T x}}    (12)

where µ(w_k^T x) is the kth mean of a multi-class distribution [1]. Furthermore, the input data can be transformed using linear transformations, e.g.,

\phi = A x + b    (13)

or non-linear transformations to obtain new features that result in a more desirable feature space. Ex-
amples of the decision boundaries that correspond to non-linear transformations are shown in Figure
4. Hence, the model in Figure 3 can only be used to accurately predict classes that are near-linearly
separable, whereas non-linear functions (e.g., see examples in Figure 4) have the potential to perform
better for classification problems where the classes are not linearly separable. The non-linear transfor-
mations can be either manually constructed (e.g. by choosing appropriate non-linear basis functions)
or automatically constructed using non-linear models (e.g., feed-forward neural networks).
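As an implementation note (a minimal sketch assuming numpy; this detail is not discussed in the original text), the softmax mean function of Equation (12) is usually computed by subtracting the maximum activation before exponentiation, which leaves the result unchanged but avoids numerical overflow:

import numpy as np

def softmax_means(W, x):
    """Return the K class means mu(w_k^T x) of Equation (12).

    W : (K, D) matrix with one weight vector per row.
    x : (D,) input (or feature) vector.
    """
    a = W @ x           # activations w_k^T x, shape (K,)
    a = a - np.max(a)   # shift for numerical stability
    e = np.exp(a)
    return e / e.sum()  # normalised so the means sum to 1

W = np.array([[1.0, -2.0], [0.5, 0.5], [-1.0, 1.0]])
x = np.array([2.0, 1.0])
print(softmax_means(W, x))  # three probabilities summing to 1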

Figure 4: Examples of sigmoid functions of the form (1 + e^{-w_1 φ_1(x) - w_2 φ_2(x)})^{-1} are shown using different non-linear basis functions. 4(a): φ_1(x) = 0.5x_1^2, φ_2(x) = 0.5x_2^2, w_1 = 1 and w_2 = 1. 4(b): φ_1(x) = 0.5x_1^2, φ_2(x) = 0.5x_2^2, w_1 = −1 and w_2 = −1. 4(c): φ_1(x) = sin(x_1 x_2), φ_2(x) = cos(x_2), w_1 = 1 and w_2 = −1. The basis function in 4(c) is only used to demonstrate the complex behaviour that can be modelled using non-linear transformations; the specific transformation does not necessarily have practical significance.

3.3 Dataset
Let's consider the multi-dimensional training dataset X ∈ R^{N×D} of N observations of data with a dimensionality of D:

X = \begin{bmatrix} x_{11} & x_{12} & \ldots & x_{1D} \\ x_{21} & x_{22} & \ldots & x_{2D} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N1} & x_{N2} & \ldots & x_{ND} \end{bmatrix}    (14)

with the nth observation denoted

x_n = \begin{bmatrix} x_{n1} \\ x_{n2} \\ \vdots \\ x_{nD} \end{bmatrix} \in R^{D×1},    (15)
and the full dataset constructed with the N observations as follows:

X = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{bmatrix} \in R^{N×D}    (16)

The corresponding discrete labels or classes in the training dataset are defined as follows:

\ell = \begin{bmatrix} \mathrm{Label}_1 \\ \mathrm{Label}_2 \\ \vdots \\ \mathrm{Label}_N \end{bmatrix} \in R^{N}    (17)

where Label_n could denote, for example, the condition of a machine, a qualitative indicator of the quality of a product, or the class of Iris flower (e.g., Setosa, Versicolour, or Virginica). Let's assume there are K classes in the dataset and that the number of classes is fixed in the derivation. One-hot encoding will be used to convert the labels of the classes into numbers for subsequent modelling.
In one-hot encoding, a K-dimensional vector is defined for each observation n. The vector is denoted c_n = [c_{n1}, c_{n2}, \ldots, c_{nK}] ∈ R^K. Each column corresponds to a specific class, e.g., column 1 corresponds to class 1, column 2 corresponds to class 2, etc. Hence, if the nth observation belongs to the mth class, c_{nm} = 1. Since an observation can only belong to one class, we have the following properties for c_n:

\sum_{k=1}^{K} c_{nk} = 1, \quad c_{nk} \in \{0, 1\}    (18)

Hence, if c_{nk} = 1, it means that observation n belongs to the kth class, and if c_{nk} = 0, it means it does not belong to the kth class. We can use this to define a class matrix that corresponds to the data matrix in Equation (14):

C = \begin{bmatrix} c_{11} & c_{12} & \ldots & c_{1K} \\ c_{21} & c_{22} & \ldots & c_{2K} \\ \vdots & \vdots & \ddots & \vdots \\ c_{N1} & c_{N2} & \ldots & c_{NK} \end{bmatrix} \in R^{N×K}    (19)
where the nth row contains the elements of c_n. In the next section, logistic regression, which is an example of a probabilistic discriminative model, is discussed.
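Before moving on, here is a minimal sketch (assuming numpy; the integer label values are illustrative) of how the one-hot class matrix C of Equation (19) can be constructed from integer-coded labels:

import numpy as np

# Integer-coded labels for N = 5 observations and K = 3 classes,
# e.g. 0 = setosa, 1 = versicolor, 2 = virginica.
labels = np.array([0, 2, 1, 1, 0])
K = 3

# One-hot class matrix C of Equation (19): C[n, k] = 1 if observation n
# belongs to class k, and each row sums to one.
C = np.zeros((labels.size, K))
C[np.arange(labels.size), labels] = 1.0
print(C)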

3.4 Logistic regression


The objective is to develop a model that can be used to predict the label or class directly from the data x ∈ R^{D×1}. The nth observation of x, denoted x_n, is transformed with a basis function φ_n = φ(x_n), where φ_n ∈ R^{M×1}. If the nth observation belongs to the kth class, then c_{nk} = 1.
We can use Equation (4) to calculate the distribution over the class vector c_n for a given observation x_n. The distribution over the class vector is defined as follows^iii:

p(c_n | W, x_n) = \prod_{k=1}^{K} \mu(w_k^T \phi(x_n))^{c_{nk}}    (20)

^iii The notation is used so that we can leverage the one-hot class properties in the maximum likelihood estimation. However, this notation has the following equivalence to the notation used in the introduction: if c_k = 1 in c, then p(c|W, x_n) is equivalent to p(C_k|W, x_n).
where K denotes the number of classes, µ : R → [0, 1] denotes the mean function of Equation (4), w_k ∈ R^{M×1} denotes the parameter vector of the kth class, and φ(x) : R^{D×1} → R^{M×1} denotes a basis function (e.g., a polynomial function) of x. The unknown parameters of the model are denoted:

W = \begin{bmatrix} w_1^T \\ w_2^T \\ \vdots \\ w_K^T \end{bmatrix} \in R^{K×M}    (21)
We assume each observation is independently sampled. The likelihood of a model with parameters W for the data X with corresponding labels C is calculated as follows:

p(C | W, X) = \prod_{n=1}^{N} p(c_n | x_n, W)    (22)
            = \prod_{n=1}^{N} \prod_{k=1}^{K} \mu(w_k^T \phi_n)^{c_{nk}}    (23)

and the log-likelihood is given by:

\log p(C | W, X) = \sum_{n=1}^{N} \sum_{k=1}^{K} c_{nk} \cdot \log \mu_k(w_k^T \phi_n)    (24)

We would like to find a maximum likelihood estimator of the unknown parameters W. Hence, the maximum likelihood problem is formulated as follows:

\hat{W} = \arg\max_{W} \log p(C | W, X)    (25)

where Ŵ denotes the maximum likelihood estimate of the unknown parameters. Most optimisation algorithms assume that the design variables are vectors and not matrices. Hence, the weight matrix W defined in Equation (21) can be written as the following equivalent vector:

\check{w} = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_K \end{bmatrix} \in R^{(K \cdot M) \times 1}    (26)
We need gradients with respect to the unknown parameters to use gradient-based optimisation algorithms. The gradient of the likelihood function is denoted ∇_w̌ log p(C|w̌, X), and the gradient with respect to the jth class' parameters is calculated as follows:

\frac{\partial}{\partial w_j} \log p(C | \check{w}, X) = \sum_{n=1}^{N} \sum_{k=1}^{K} c_{nk} \frac{1}{\mu_k(w_k^T \phi_n)} \cdot \frac{\partial}{\partial w_j} \mu_k(w_k^T \phi_n)    (27)

which can be shown to simplify as follows:

\frac{\partial}{\partial w_j} \log p(C | \check{w}, X) = \sum_{n=1}^{N} \left( c_{nj} - \mu(w_j^T \phi_n) \right) \phi_n    (28)

if we define the mean of the model as follows:

\mu_k(w_k^T \phi_n) = \frac{e^{w_k^T \phi_n}}{\sum_{m=1}^{K} e^{w_m^T \phi_n}} = \frac{e^{w_k^T \phi_n}}{Z}    (29)

where Z = \sum_{m=1}^{K} e^{w_m^T \phi_n}. This gradient is derived in the next section.
When implementing these equations, it is important to verify that the implementations are correct.
An error in the gradient calculation could be detrimental to the performance and success of the
parameter estimation and the resulting model. Hence, gradient checks (e.g. using finite differences)
need to be performed.
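A minimal sketch of such a gradient check is given below (assuming numpy; the data are random toy values and the function names are illustrative, not part of the original text). It implements the log-likelihood of Equation (24) with the softmax mean of Equation (29), the analytical gradient of Equation (28), and compares the latter against a central finite-difference approximation:

import numpy as np

def softmax(A):
    """Row-wise softmax of the activation matrix A (N x K), Equation (29)."""
    A = A - A.max(axis=1, keepdims=True)   # numerical stability
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def log_likelihood(w_vec, Phi, C):
    """Log-likelihood of Equation (24); w_vec is the flattened (K*M,) weight vector."""
    K, M = C.shape[1], Phi.shape[1]
    W = w_vec.reshape(K, M)                # one weight vector per row
    Mu = softmax(Phi @ W.T)                # N x K class means
    return np.sum(C * np.log(Mu))

def gradient(w_vec, Phi, C):
    """Analytical gradient of Equation (28), flattened to (K*M,)."""
    K, M = C.shape[1], Phi.shape[1]
    W = w_vec.reshape(K, M)
    Mu = softmax(Phi @ W.T)
    G = (C - Mu).T @ Phi                   # K x M matrix of class-wise gradients
    return G.ravel()

# Random toy problem: N = 20 observations, M = 2 features, K = 3 classes.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(20, 2))
C = np.eye(3)[rng.integers(0, 3, size=20)]
w = rng.normal(size=3 * 2)

# Central finite-difference check of the analytical gradient.
eps = 1e-6
fd = np.array([
    (log_likelihood(w + eps * e, Phi, C) - log_likelihood(w - eps * e, Phi, C)) / (2 * eps)
    for e in np.eye(w.size)
])
print(np.max(np.abs(fd - gradient(w, Phi, C))))  # should be very small (roughly 1e-8 or less)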

3.5 Derivation of gradient of the log-likelihood function


Let's calculate the gradient of the log-likelihood function w.r.t. the jth class' weights:

\frac{\partial}{\partial w_j} \log p(C | W, X) = \sum_{n=1}^{N} \sum_{k=1}^{K} c_{nk} \frac{1}{\mu_k(w_k^T \phi_n)} \cdot \frac{\partial}{\partial w_j} \mu_k(w_k^T \phi_n)    (30)

and use the following mean function^iv:

\mu_k(w_k^T \phi_n) = \frac{e^{w_k^T \phi_n}}{\sum_{m=1}^{K} e^{w_m^T \phi_n}} = \frac{e^{w_k^T \phi_n}}{Z}    (31)

The gradient of the mean w.r.t. the unknown variables is derived as follows:

\frac{\partial}{\partial w_j} \mu(w_k^T \phi_n) = \frac{e^{w_k^T \phi_n} I_{kj}}{Z} \cdot \phi_n - \frac{e^{w_k^T \phi_n}}{Z^2} \cdot \frac{\partial Z}{\partial w_j}    (32)
                                               = \frac{e^{w_k^T \phi_n} I_{kj}}{Z} \cdot \phi_n - \frac{e^{w_k^T \phi_n}}{Z} \cdot \frac{e^{w_j^T \phi_n}}{Z} \cdot \phi_n    (33)
                                               = \mu(w_k^T \phi_n) I_{kj} \phi_n - \mu(w_k^T \phi_n) \cdot \mu(w_j^T \phi_n) \cdot \phi_n    (34)
                                               = \mu(w_k^T \phi_n) \left( I_{kj} - \mu(w_j^T \phi_n) \right) \phi_n    (35)
where I_{kj} denotes the element in the kth row and jth column of the identity matrix. Hence, the gradient of the log-likelihood function is given by:

\frac{\partial}{\partial w_j} \log p(C | W, X) = \sum_{n=1}^{N} \sum_{k=1}^{K} c_{nk} \frac{1}{\mu_k(w_k^T \phi_n)} \cdot \mu(w_k^T \phi_n) \left( I_{kj} - \mu(w_j^T \phi_n) \right) \phi_n    (36)

which can be simplified as follows:

\frac{\partial}{\partial w_j} \log p(C | W, X) = \sum_{n=1}^{N} \sum_{k=1}^{K} \left( c_{nk} I_{kj} - c_{nk} \mu(w_j^T \phi_n) \right) \phi_n    (37)

We can use the property of the identity matrix, \sum_{k=1}^{K} c_{nk} I_{kj} = c_{nj}, and the property of the class vector, \sum_{k=1}^{K} c_{nk} = 1, to simplify Equation (37) as follows:

\frac{\partial}{\partial w_j} \log p(C | W, X) = \sum_{n=1}^{N} \left( c_{nj} - \mu(w_j^T \phi_n) \right) \phi_n    (38)

The Hessian matrix H can be calculated with a similar procedure as discussed in Ref. [1]. In the next
section, two variations of logistic regression models will be applied to the Iris dataset.
^iv We can use a mean function that is non-linear in the weights, e.g., \mu_k\left((w_k^T \phi_n)^2\right), and the weight vectors might be shared between the different classes. The procedure, however, remains the same: use the chain rule to calculate the gradients of the log-likelihood with respect to the unknown parameters.
3.6 Demonstration of logistic regression
Two logistic regression models will be considered in this section:
• Model 1: A bias term is not incorporated in the model, i.e., e^{w^T \phi} is used in Equation (29).

• Model 2: A bias term is incorporated in the model, i.e., e^{w_0 + w^T \phi} = e^{\tilde{w}^T \tilde{\phi}} is used in Equation (29).
The models will be trained on the two-dimensional data shown in Figure 1(a).
For Model 1, φ ∈ R^{2×1} and W ∈ R^{3×2}. The maximum likelihood estimate was obtained by solving Equation (25) using a gradient-based optimisation algorithm. The conjugate gradient algorithm was used to minimise the negative log-likelihood, using the analytical gradient of the log-likelihood given in Equation (28). Since the scipy.optimize.minimize function requires the unknown parameters to be a vector, it is necessary to reshape the weight matrix W ∈ R^{3×2} into an equivalent weight vector w ∈ R^{6×1}, as shown in Equation (26).^v The output of the optimisation process is as follows:
message: Optimization terminated successfully.
success: True
status: 0
fun: 42.07349684710877
x: [-8.939e+01 1.343e+02 4.532e+01 -6.703e+01 4.614e+01 -6.629e+01]
nit: 329
jac: [ 6.442e-07 -1.425e-06 -7.199e-08 6.963e-07 -5.722e-07 7.282e-07]
nfev: 682
njev: 682
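For illustration, a minimal sketch of how such a fit might be set up is given below (assuming numpy and scipy; the data are random stand-ins for the two-dimensional features and one-hot labels, and the function names are illustrative). It wraps the negatives of Equations (24) and (28) and highlights the reshape round-trip between the weight matrix and the weight vector discussed in footnote v:

import numpy as np
from scipy.optimize import minimize

def softmax(A):
    A = A - A.max(axis=1, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def neg_log_likelihood(w_vec, Phi, C, K):
    # Reshape the flat vector back to K x M so each class' parameters occupy
    # one row, evaluate Equation (24), and negate for minimisation.
    W = w_vec.reshape(K, Phi.shape[1])
    Mu = softmax(Phi @ W.T)
    return -np.sum(C * np.log(Mu))

def neg_gradient(w_vec, Phi, C, K):
    W = w_vec.reshape(K, Phi.shape[1])
    Mu = softmax(Phi @ W.T)
    return -((C - Mu).T @ Phi).ravel()    # negative of Equation (28), flattened

# Toy data standing in for the two-dimensional Iris features and one-hot labels.
rng = np.random.default_rng(1)
Phi = rng.normal(size=(90, 2))
C = np.eye(3)[rng.integers(0, 3, size=90)]

w0 = np.zeros(3 * 2)                      # flattened initial weights, Equation (26)
result = minimize(neg_log_likelihood, w0, args=(Phi, C, 3),
                  jac=neg_gradient, method="CG")
W_hat = result.x.reshape(3, 2)            # maximum likelihood weight matrix
print(result.message, W_hat.shape)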
We can use the model to calculate p(C_k|x) and subsequently use Equation (1) to predict the class. The predictions of the model over the feature space are shown in Figure 5, with the decision boundaries between the classes visible. The training data are superimposed on the class predictions of the model in Figure 5(a), and the training and testing data are superimposed on the class predictions in Figure 5(b).
A confusion matrix can be used to quantify how accurately the model predicts each class. Furthermore, it can be used to identify classes where the model is "confused" (i.e., where the model consistently predicts the wrong class). The confusion matrix calculated using the training data is shown in Figure 6(a) and the confusion matrix calculated using the testing data is shown in Figure 6(b). Only one training dataset and one testing dataset are considered in this work; it is, however, important to understand the sensitivity of the results to the train-test split. Hence, the process is often repeated with new training and testing datasets (using a new train-test split) and the average accuracy and the standard deviation of the accuracy are reported.
For the second model, a bias term was incorporated for each class, which means there is one additional parameter per class. This was done by augmenting the features as φ̃ = [1, φ_1, φ_2] and augmenting the weight vector of each class as w̃_k = [w_{k0}, w_{k1}, w_{k2}]. Hence, φ̃ ∈ R^3 and W ∈ R^{3×3}. The same optimisation procedure as for the previous case was followed, with the output of scipy.optimize.minimize given here:
message: Optimization terminated successfully.
success: True
status: 0
fun: 6.123340099048337
x: [-7.912e+00 -1.481e+01 1.254e+01 1.327e+01 1.799e+00
-6.748e-01 -3.069e+00 1.484e+01 -1.089e+01]
nit: 135
jac: [ 6.467e-07 -3.975e-07 -4.576e-07 -4.850e-06 -5.547e-07
-2.810e-06 4.203e-06 9.522e-07 3.268e-06]
nfev: 281
njev: 281

^v This can be done with the reshape function in numpy. It is, however, important to ensure that the reshape operations preserve the order of the variables so that the location of each weight within the vector is known. It is sometimes beneficial to convert the weight vector in Equation (26) back to a matrix W inside the logistic regression function that is called by scipy.optimize.minimize, so that each class' parameters are in separate rows. This can be done by reversing the reshape operation with the reshape function.

Figure 5: The predicted classes over the full feature space are shown. The classes are predicted using the logistic regression model without the bias term. In Figure 5(a), the training data are shown on the predicted classes and in Figure 5(b), the testing and training data are superimposed on the predicted classes.
The decision boundaries identified by the model are shown in Figure 7. In Figure 7(a), the training
data are superimposed on the class predictions of the model and in Figure 7(b) the training and testing
data are superimposed on the class predictions of the model.
The confusion matrix of the training data is shown in Figure 8(a) and the confusion matrix of the
testing data is shown in Figure 8(b).
It is left to the reader to interpret the results and reflect on the impact of the modelling decisions. In the next section, probabilistic generative models for classification are discussed.

4 Probabilistic generative models


In this section, probabilistic generative models are combined with Bayes’ rule to solve the classification
problem. In the next section, we will look at some examples of generative models.

4.1 Generative models


Generative models are able to generate new data, i.e.,

x ∼ p(x) (39)

There are many types of generative models, with some examples including a univariate Laplacian
distribution, a multivariate Gaussian distribution, a Gaussian Mixture Model, hidden Markov models,
and variational autoencoders. We will consider two models in this document, namely, multivariate
Gaussian models and Gaussian mixture models.

Figure 6: The confusion matrices that correspond to the logistic regression model without the bias term. Figure 6(a) shows the confusion matrix for the training data (rows: true label; columns: predicted label setosa/versicolor/virginica): setosa [1, 0, 0]; versicolor [0, 0.37, 0.63]; virginica [0, 0.025, 0.97]. Figure 6(b) shows the confusion matrix for the testing data: setosa [1, 0, 0]; versicolor [0, 0.5, 0.5]; virginica [0, 0, 1]. For example, the true label versicolor in Figure 6(a) was correctly classified as versicolor 37% of the time and erroneously classified as virginica 63% of the time.

A multivariate Gaussian model is defined as follows:

\mathcal{N}(x | \mu, \Sigma) = \left( \frac{1}{2\pi} \right)^{D/2} |\Sigma|^{-1/2} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)    (40)

Hence, a new sample generated by the multivariate Gaussian model is denoted x ∼ N(x|µ, Σ). The unknown parameters of the multivariate Gaussian distribution can be obtained by maximising the likelihood function. The maximum likelihood estimation problem is formulated as follows:

\max_{\mu, \Sigma} \sum_{n=1}^{N} \log \mathcal{N}(x_n | \mu, \Sigma)    (41)

where the covariance matrix is symmetric and positive semi-definite^vi. The maximum likelihood estimates of the unknown parameters have closed-form solutions [1].
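As an illustration (a minimal sketch assuming numpy; the data are random and purely illustrative), the closed-form maximum likelihood estimates are simply the sample mean and the biased sample covariance:

import numpy as np

# Toy data standing in for one class' two-dimensional features (N x D).
rng = np.random.default_rng(2)
X_k = rng.normal(loc=[1.0, -0.5], scale=[0.3, 0.6], size=(40, 2))

# Closed-form maximum likelihood estimates for Equation (41):
mu_hat = X_k.mean(axis=0)                  # sample mean
diff = X_k - mu_hat
Sigma_hat = diff.T @ diff / X_k.shape[0]   # biased (maximum likelihood) covariance
print(mu_hat, Sigma_hat, sep="\n")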
Mixture models can be used to model more complex distributions (e.g., multi-modal distributions). A Gaussian mixture model of R multivariate Gaussian distributions is defined as follows:

\mathrm{GMM}(x | \{\alpha_r\}, \{\mu_r\}, \{\Sigma_r\}) = \sum_{r=1}^{R} \alpha_r \, \mathcal{N}(x | \mu_r, \Sigma_r)    (42)

where {α_r} = {α_1, α_2, \ldots, α_R}, {µ_r} = {µ_1, µ_2, \ldots, µ_R} and {Σ_r} = {Σ_1, Σ_2, \ldots, Σ_R}, with R denoting the number of mixture components. The weight of the rth mixture component is denoted α_r, with α_r ≥ 0 and \sum_{r=1}^{R} \alpha_r = 1. The mean and covariance of the rth Gaussian are denoted µ_r and Σ_r respectively. The GMM's unknown parameters ({α_r}, {µ_r}, {Σ_r}) can also be obtained by iteratively maximising the likelihood function using the expectation-maximisation algorithm [1]. Since the GMM is also a generative model, we can also generate new data: x ∼ GMM(x|{α_r}, {µ_r}, {Σ_r}).
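As a brief aside, a GMM can be fitted with the expectation-maximisation algorithm and sampled from using an off-the-shelf implementation; the sketch below assumes scikit-learn (which is not used elsewhere in this document) and the bimodal toy data are purely illustrative:

import numpy as np
from sklearn.mixture import GaussianMixture

# Toy bimodal data (N x D) standing in for a multi-modal distribution.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-2.0, 0.5, size=(100, 2)),
               rng.normal(2.0, 0.5, size=(100, 2))])

# Fit a two-component GMM with expectation-maximisation and draw new samples.
gmm = GaussianMixture(n_components=2, covariance_type="full").fit(X)
X_new, components = gmm.sample(50)   # generated data and their mixture components
print(gmm.weights_, X_new.shape)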
A separate multivariate Gaussian model (i.e., Equation (40)) was fitted to each class’ data in the
Iris dataset, with the class’ data superimposed on the PDFs of the models in Figure 9. Since we are
using generative models, we can draw new samples from the models. The actual data are superimposed
on samples from the respective models in Figure 10. The samples in Figure 10(a) do not match the actual data as well as those of the other models in Figures 10(b)-10(c). Model evaluation methods need to be used to determine the suitability of the fits. However, for the purpose of this document, we will assume that these models are adequate.

^vi This means that x^T Σ x ≥ 0 for all x.

Figure 7: The class predictions of the model with the bias term are shown over the entire feature space, with the training data superimposed in Figure 7(a) and the training and testing data superimposed in Figure 7(b).

4.2 Bayes’ rule


Probabilistic generative models use Bayes’ rule to infer the probability over the classes given the
observed data. Bayes’ rule follows from the product rule of probability:

p(x, y) = p(x|y)p(y) = p(y|x)p(x) (43)

which can be re-arranged as follows:

p(y|x) = \frac{p(x|y) p(y)}{p(x)}    (44)
where p(x) is obtained by marginalising the joint distribution p(x, y) over y:

p(x) = \int p(x, y) \, dy    (45)

For a fixed x, p(x) is constant and therefore p(y|x) ∝ p(x|y)p(y). An example of Bayes' rule is given in Figure 1.9 and the accompanying text of Section 1.2 of Bishop [1].

4.3 Classification
We can write the generative model as follows using one-hot encoding:

p(x | c) = \prod_{k=1}^{K} p(x | w_k)^{c_k}    (46)

The one-hot encoded vector is denoted c = [c1 , . . . , cK ] and p(x|wk ) denotes the generative model of
the kth class with a corresponding parameter vector wk . Hence, we can draw a sample from the kth
class as follows:
x ∼ p(x|c) (47)

Figure 8: The confusion matrices of the model with the bias term are shown for the training data in Figure 8(a) (rows: true label; columns: predicted label setosa/versicolor/virginica): setosa [1, 0, 0]; versicolor [0, 0.97, 0.026]; virginica [0, 0.025, 0.97], and for the testing data in Figure 8(b): setosa [1, 0, 0]; versicolor [0, 0.92, 0.083]; virginica [0, 0, 1].

Figure 9: A multivariate Gaussian model is used to separately model each class' data. The PDFs of classes 0, 1, and 2 are shown in Figures 9(a), 9(b) and 9(c) respectively, with the associated training data superimposed on the PDFs.

if ck = 1, or equivalently, we can draw a sample:

x ∼ p(x|wk ) (48)

Hence, each class has its own generative model, i.e. we have K generative models. We can use these
models and Bayes’ rule to infer the class label of an observation x.
To simplify the notation, the generative distribution of the kth class is denoted p(x|C_k, w_k), where C_k denotes the kth class. Bayes' rule can be used to calculate p(C_k|x, w_k) as follows:

p(C_k | x, w_k) = \frac{p(x | C_k, w_k) \cdot p(C_k)}{p(x)}    (49)

where p(x|C_k, w_k) is the generative model of the kth class and p(C_k) is the prior probability of the kth class, with \sum_{k=1}^{K} p(C_k) = 1. We can calculate p(x) as follows:

p(x) = \sum_{k=1}^{K} p(x | C_k, w_k) \cdot p(C_k)    (50)

Figure 10: Samples are generated using the multivariate Gaussian model of each class and compared to the actual data (Figure 10(a): setosa, Figure 10(b): versicolor, Figure 10(c): virginica).

We, however, have an unknown parameter vector for each class, and we need to use a parameter estimation or inference technique to determine the unknown parameters. A maximum likelihood estimator will be used in this document. The likelihood function of the model for the given dataset is defined as follows:

p(X | C, W) = \prod_{n=1}^{N} \prod_{k=1}^{K} p(x_n | w_k)^{c_{nk}}    (51)

with the log-likelihood given by:

\log p(X | C, W) = \sum_{n=1}^{N} \sum_{k=1}^{K} c_{nk} \log p(x_n | w_k)    (52)

The gradient w.r.t. the unknown parameters of the jth class is defined as follows:

\frac{\partial}{\partial w_j} \log p(X | C, W) = \sum_{n=1}^{N} \sum_{k=1}^{K} c_{nk} \left( \frac{\partial}{\partial w_j} \log p(x_n | w_k) \right)    (53)

We can show that the parameters of the model of the kth class can be estimated with

\hat{w}_k = \arg\max_{w_k} \log p(X_k | w_k)    (54)

where X_k denotes the data associated with the kth class.


Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) use multivariate
Gaussian models to infer the posterior class probabilities using Bayes’ rule. For LDA, each class has
the same covariance matrix and for QDA, each class’ covariance matrix is separately estimated. The
covariance matrix can be challenging to estimate for high-dimensional problems or for datasets where
there are few examples from a class in the training dataset. Hence, different strategies are available to
circumvent these problems (e.g., [2, 3]). If we assume that the variables are uncorrelated, then it means
that the covariance matrix is a diagonal matrix, which alleviates the numerical challenges of estimating
the covariance matrix for high-dimensional problems. Models utilising this simplification are referred
to as Naive Bayes, which can perform well for classification tasks despite the strong assumption that
the features are uncorrelated [4, 1].
The derivations in this section assumed that the raw data X, with nth observation x_n, are modelled using a generative model. However, we can also perform manual transformations (or feature extraction), or automatically learn transformations, that extract the important information from the data before developing the generative model. We can therefore also model the transformed data φ(x) and use this for classification. An example of a linear transformation is

\phi(x) = A x + b    (55)

with the corresponding generative model denoted φ(x) ∼ p(φ).

4.4 Demonstration of multivariate Gaussian model
A separate multivariate Gaussian model is used as the generative model of each class. The multivariate Gaussian models are shown in Figure 9 for the three classes and can be combined with Equation (49) to determine p(C_k|x, w_k). We can then use Equation (1) to predict the class of each observation, with the predictions shown in Figure 11 over the feature space. The training and testing data are superimposed on the models' class predictions. The models have non-linear boundaries between the classes.
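A minimal sketch of this generative classification procedure is given below (assuming numpy and scipy; the per-class data are random stand-ins for the two-dimensional Iris features). Each class gets its own Gaussian fitted with the closed-form estimates, the priors p(C_k) are taken as the class frequencies, and Equations (49), (50) and (1) give the predicted class:

import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(4)

# Random stand-ins for the two-dimensional features of K = 3 classes.
class_data = [rng.normal(loc=m, scale=0.4, size=(40, 2))
              for m in ([-2.0, 0.5], [0.5, 0.0], [2.5, -0.5])]

# Fit a Gaussian to each class (closed-form MLE) and use class frequencies as priors.
means = [X_k.mean(axis=0) for X_k in class_data]
covs = [np.cov(X_k, rowvar=False, bias=True) for X_k in class_data]
priors = np.array([len(X_k) for X_k in class_data], dtype=float)
priors /= priors.sum()

def posterior(x):
    """p(C_k | x) from Equations (49) and (50)."""
    lik = np.array([multivariate_normal.pdf(x, mean=m, cov=S)
                    for m, S in zip(means, covs)])
    joint = lik * priors        # p(x | C_k) p(C_k)
    return joint / joint.sum()  # normalise by p(x)

x_test = np.array([0.4, 0.1])
p = posterior(x_test)
print(p, "predicted class:", int(np.argmax(p)))  # Equation (1)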

2.0
2 Class 0
Class 1
Class 2 1.5
1
Testing
φ2

0 1.0

−1 0.5

−2
0.0
−2.5 0.0 2.5
φ1

Figure 11: The class prediction using a probabilistic generative classification model. Each class’ data
are modelled using a multivariate Gaussian distribution.

The confusion matrices calculated using the training data and calculated using the testing data are
shown in Figures 12(a) and 12(b) respectively.

Figure 12: The confusion matrices of the probabilistic generative classification model are shown for the training data in Figure 12(a) (rows: true label; columns: predicted label setosa/versicolor/virginica): setosa [1, 0, 0]; versicolor [0, 0.95, 0.053]; virginica [0, 0.025, 0.97], and for the testing data in Figure 12(b): setosa [1, 0, 0]; versicolor [0, 0.92, 0.083]; virginica [0, 0, 1].
5 Classification vs. Recognition
All the models considered in this document focused on classifying the data x into one of K classes. This makes sense for problems where the number of classes is known beforehand (e.g., recognising letters of the alphabet); however, for some problems the number of classes is not known beforehand. The former problem consists of a closed set of classes and the latter problem consists of an open set of classes. The problem with classification models is that the data will be allocated to one of the K classes, irrespective of whether the data are from a new (unknown) class or an existing class. The problem is better formulated as a class recognition problem if there is a possibility that some classes are not observed in the training dataset, i.e., we do not aim to classify, but aim to recognise the class label, and if new/unobserved classes are encountered, we rather assign them to a "do not know" class. The open set recognition problem is formulated in Ref. [5] using discriminative models and in Ref. [6] using generative models. Furthermore, some data are intrinsically continuous (e.g., a crack is described by a combination of a continuous crack length, a continuous crack position, and a crack orientation) and one should be careful when discretising the continuous problem into discrete classes [6].

6 Conclusion
In this document, probabilistic generative and discriminative classification models are discussed and
demonstrated on the Iris dataset. Both models make it possible to infer the distribution over the
classes for an observation and can therefore be used to classify the data to one of K classes. Reflect
on the following:
• What is the difference between a discriminative and generative classification model?
• What is the impact of the bias term in probabilistic discriminative models?
• Which assumptions are needed to use the discriminative and generative classification models
(e.g. assumption of the distribution of the data)?
• Referring to Section 5.1 and Section 5.2 of Bishop [1], how can neural networks be used in probabilistic discriminative models and how can they be used to find non-linear boundaries between classes?
• How are discriminative and generative classification models affected by high-dimensional data?
• Why could having separable classes (according to some linear or non-linear decision function)
potentially lead to parameters with infinite weight magnitudes in discriminative classification
models?
• What is the benefit of using discriminative instead of generative classification models and vice
versa?

A Iris dataset
In this section, a brief overview of the Iris dataset [7] is given. The Iris dataset consists of four
properties that were measured from three classes of Iris flowers. The following properties of each flower
are available: the sepal length, the sepal width, the petal length and the petal width. This constitutes our input data x ∈ R^{4×1}. The three classes of Iris flowers that are considered are the following: Setosa,
Versicolour, and Virginica. Hence, the objective of the classification problem is to predict the class of
the Iris flower (Setosa, Versicolour, and Virginica) using the four measurements from the flower (The
sepal length, the sepal width, the petal length and petal width). The four-dimensional raw data are
presented in Figure 13.
The dataset needs to be separated into training, validation and testing datasets. Since this document only investigates simple models and hyperparameters will not be selected, only one training dataset and one testing dataset were created. It is generally important to repeatedly generate training, validation and testing data and obtain an error band on the estimates (e.g., classification accuracy). The training and testing data were obtained using a 75%-25% train-test split of the original data with a fixed random seed.
To make the results easier to visualise in the document, the four-dimensional dataset was reduced to two dimensions using a linear transformation of the following form:
ϕ(xn ) = Axn + b (56)
where ϕn = ϕ(xn ) is used to denote the features. The parameters of the linear transformation
were obtained using principal component analysis (Chapter 12 of Bishop [1]). The parameters were
specifically obtained using the training data of the Iris dataset. The training features are shown in
Figure 1(a) and the testing features are superimposed on the training data in Figure 1(b).
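A minimal sketch of this preprocessing is given below (assuming scikit-learn; the document only states that PCA was fitted on the training data and that a fixed random seed was used, so the exact split settings and random seed here are illustrative):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA

# Load the four-dimensional Iris measurements and their class labels.
X, y = load_iris(return_X_y=True)

# Hold out 25% of the data for testing, with a fixed random seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Fit the linear transformation (PCA) on the training data only, then
# apply it to both the training and the testing data, as in Equation (56).
pca = PCA(n_components=2).fit(X_train)
phi_train = pca.transform(X_train)
phi_test = pca.transform(X_test)
print(phi_train.shape, phi_test.shape)  # (112, 2) (38, 2)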

Figure 13: The raw Iris dataset is presented as pairwise scatter plots of the four measurements (x_1: sepal length (cm), x_2: sepal width (cm), x_3: petal length (cm), x_4: petal width (cm)). The legend, shown in the first subplot, uses the following abbreviations: set. for Setosa, vers. for Versicolour, and virg. for Virginica. A picture of the three variations of Iris flowers under consideration can be found here: https://upload.wikimedia.org/wikipedia/commons/c/cb/Flores_de_%C3%8Dris.png
References
[1] C. Bishop, “Pattern recognition and machine learning,” Springer, vol. 2, 2006.
[2] C. Bouveyron, S. Girard, and C. Schmid, "High-dimensional discriminant analysis," Communications in Statistics - Theory and Methods, vol. 36, no. 14, pp. 2607-2623, 2007.

[3] J. Shao, Y. Wang, X. Deng, and S. Wang, “Sparse linear discriminant analysis by thresholding for
high dimensional data,” The Annals of Statistics, vol. 39, no. 2, pp. 1241–1265, 2011.
[4] I. Rish et al., “An empirical study of the naive bayes classifier,” in IJCAI 2001 workshop on
empirical methods in artificial intelligence, vol. 3, pp. 41–46, Citeseer, 2001.

[5] W. J. Scheirer, A. de Rezende Rocha, A. Sapkota, and T. E. Boult, “Toward open set recognition,”
IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 7, pp. 1757–1772,
2012.
[6] S. Schmidt and P. S. Heyns, “An open set recognition methodology utilising discrepancy analysis for
gear diagnostics under varying operating conditions,” Mechanical Systems and Signal Processing,
vol. 119, pp. 1–22, 2019.
[7] R. A. Fisher, “The use of multiple measurements in taxonomic problems,” Annals of eugenics,
vol. 7, no. 2, pp. 179–188, 1936.
