
Information Theory & Data Science

5th Assignment

Author: Gioacchino Mauri

Academic Year: 2021/2022


Abstract
The aim of this document is to describe the four assignments and to discuss the theory behind the achieved results. A chapter is dedicated to each of the four assignments, including figures and tables useful for discussing the results.

Contents

1 Introduction
2 First Assignment
  2.1 Entropy
  2.2 Joint Entropy
  2.3 Conditional Entropy
  2.4 Mutual Information
  2.5 Normalized Conditional Entropy
  2.6 Normalized Joint Entropy
  2.7 Normalized Mutual Information
    2.7.1 Type 1
    2.7.2 Type 2
    2.7.3 Type 3
  2.8 Python Script
3 Second Assignment
  3.1 First Part
  3.2 Second Part
  3.3 Third Part
    3.3.1 Kernel Functions
    3.3.2 Optimal Bandwidth
4 Third Assignment
  4.1 First Part
  4.2 Second Part
  4.3 Third Part
5 Fourth Assignment
  5.1 Bayes Classifier
  5.2 Naïve Bayes Classifier
  5.3 Gaussian Naïve Bayes Classifier
  5.4 Iris Dataset Classification
    5.4.1 Bandwidth Selection for Gaussian Naïve Bayes Classifier
6 Conclusion

1 Introduction
Information Theory lies at the intersection of different fields such as probability theory, statistics, computer science, statistical mechanics, information engineering, and electrical engineering.
This theory has gone well beyond the telecommunications field, finding applications in areas as diverse as neurobiology and information retrieval.
Due to its multidisciplinary nature, it can also be used in data analysis.
This document collects and discusses the assignments required by the Information Theory and Data Science course at the University of Rome Tor Vergata.
In Chapter 2 we build an information theory library in Python and apply its entropy function in a Python script that computes the entropy of a generic binary random variable as a function of its probability mass function and plots the entropy function.
Our conclusions are presented in Chapter 6, where we briefly summarize the results of the assignments.

2 First Assignment
First, we are required to write functions able to compute information theoretic measures.
The measures required by the assignment are: the entropy, the joint entropy, the conditional entropy, the mutual information and the normalized versions of the mutual information.
Second, we are required to write a Python script that computes the entropy of a generic binary random variable as a function of its probability mass function and plots the entropy function.

2.1 Entropy
The first function, called "entropy", computes the entropy of a discrete random variable given its probability mass function.
The entropy can be defined as the expected value of the self-information, as we can see in the following equation:

H(X) = -\sum_{k=1}^{N_X} p_k \log_2(p_k)    (1)
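
As a reference, a minimal sketch of how such a function could be implemented (the exact name and signature used in our library may differ):

import numpy as np

def entropy(pmf, base=2):
    """Entropy of a discrete random variable given its p.m.f. vector."""
    pmf = np.asarray(pmf, dtype=float)
    pmf = pmf[pmf > 0]                                 # 0 * log(0) is treated as 0
    return -np.sum(pmf * np.log(pmf)) / np.log(base)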

2.2 Joint Entropy


The second function, called "joint entropy", computes the joint entropy of two discrete random variables given their joint probability mass function. The joint entropy of two generic random variables X and Y can be expressed by the following equation:

H(X,Y) = -\sum_{i=1}^{N_X} \sum_{j=1}^{N_Y} p(x_i, y_j) \log_2 p(x_i, y_j)    (2)

2.3 Conditional Entropy


The "conditional entropy" function computes the conditional entropy of two discrete random variables given their joint and marginal probability mass functions.

H(X|Y) = -\sum_{i=1}^{N_X} \sum_{j=1}^{N_Y} p(x_i, y_j) \log_2 p(x_i|y_j)    (3)

2.4 Mutual Information


The ”mutual information” function computes the mutual information of two discrete
random variables given their joint and marginal probability mass functions.
I(X;Y) = \sum_{i=1}^{N_X} \sum_{j=1}^{N_Y} p(x_i, y_j) \log_2 \frac{p(x_i, y_j)}{p(x_i)\, p(y_j)}    (4)
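
A possible sketch of the corresponding function, assuming the joint p.m.f. is passed as a two-dimensional array whose rows and columns follow the order of the marginal p.m.f. vectors:

import numpy as np

def mutual_information(joint_pmf, pmf_x, pmf_y):
    """Mutual information I(X;Y) in bits, given the joint and marginal p.m.f.s."""
    joint_pmf = np.asarray(joint_pmf, dtype=float)
    outer = np.outer(pmf_x, pmf_y)          # p(x_i) * p(y_j)
    mask = joint_pmf > 0                    # zero entries contribute nothing to the sum
    return np.sum(joint_pmf[mask] * np.log2(joint_pmf[mask] / outer[mask]))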

2.5 Normalized Conditional Entropy
The Normalized Conditional Entropy can be expressed as:

\eta_{CE}(X|Y) = \frac{H(X|Y)}{H(X)}    (5)

2.6 Normalized Joint Entropy


The Normalized Joint Entropy can be expressed as:

\eta_{JE}(X,Y) = 1 - \frac{I(X;Y)}{H(X) + H(Y)}    (6)

2.7 Normalized Mutual Information


There are three different types of normalized mutual information:

2.7.1 Type 1
\eta_{MI1}(X;Y) = \frac{1}{\eta_{JE}} - 1    (7)

2.7.2 Type 2
\eta_{MI2}(X;Y) = 1 + \eta_{MI1}    (8)

2.7.3 Type 3
\eta_{MI3} = \sqrt{\frac{I(X;Y)}{H(X) + H(Y)}}    (9)

2.8 Python Script


The script "test entropy2" computes the entropy of a generic binary random variable as a function of its probability mass function p0 and plots the entropy function.
In this case we are dealing with a Bernoulli process, where p0 is the probability of one of the two possible outcomes.
Since we are using bits, the only maximum, which is also the global maximum, occurs at p0 = 1/2, where the entropy is 1 bit.
Entropy can be seen as the average level of "surprise" or "uncertainty" associated with the variable's possible outcomes. When p0 = 1/2, knowledge of the probabilities gives no advantage in predicting the outcome, and the "uncertainty" or "surprise" is at its maximum.
When p0 = 0 or p0 = 1 the outcome is certain and, as expected, the entropy is zero.
For the other values of p0, the entropy falls between these two extreme cases. For instance, if p0 = 1/3 we can still predict the result correctly more often than not, so the associated uncertainty is less than 1 bit.
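
A minimal sketch of such a script (the script name and the exact plotting details of our submission may differ):

import numpy as np
import matplotlib.pyplot as plt

p0 = np.linspace(0.0, 1.0, 201)                     # p.m.f. of the first outcome
# Binary entropy H(p0) = -p0*log2(p0) - (1 - p0)*log2(1 - p0), with 0*log2(0) = 0
with np.errstate(divide="ignore", invalid="ignore"):
    H = -(p0 * np.log2(p0) + (1 - p0) * np.log2(1 - p0))
H = np.nan_to_num(H)                                # the endpoints p0 = 0 and p0 = 1 give entropy 0

plt.plot(p0, H)
plt.xlabel("p0")
plt.ylabel("H(X) [bits]")
plt.title("Entropy of a binary random variable")
plt.show()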

Figure 1: Entropy as a function of the p.m.f. of a generic binary random variable. The maximum value is 1 bit, because the base-2 logarithm is used in the definition of the entropy function.

3 Second Assignment
The second assignment consists of three parts: in the first and third part it is required
to compute the difference between entropies and in the second part to write a function
which computes the differential entropy.
In the first part we have to compute the entropy of a discrete random variable given its
probability mass function vector, then we have to compute the entropy of an estimated
probability mass function from a set of samples generated through the preceding p.m.f.
vector. In the last step we have to compute the difference between the two entropies.
In the second part we write a function that can compute the differential entropy of a
continuous random variable once its probability density function is given.
In the third part we are required, first, to compute the differential entropy of a Gaussian continuous random variable once its probability density function vector is given.
Subsequently we have to compute the differential entropy of an estimated p.d.f. obtained from a set of samples generated through the preceding p.d.f. vector. Finally we have to compute the difference between the two differential entropies.

3.1 First Part


The first step is to choose a probability mass function vector, which represents the probabilities of all the possible outcomes. Once the possible outcomes corresponding to the p.m.f. have been fixed and the vector length has been chosen, we have all the ingredients to generate a discrete sample. We took care of the reproducibility of the experiment by fixing the random seed; the vector length is an important parameter, because the probability estimation generally improves as the vector length grows.
The p.m.f. estimation was made by summing the occurrences of the different outcomes and then dividing by the vector length.
We computed the entropies of the p.m.f. and of the estimated p.m.f., and their difference, using the function defined in equation 1 of Chapter 2.
The p.m.f. entropy is 1.8464393446710154 bits and the estimated p.m.f. entropy for a vector of length 1000 is 1.8483871018568563 bits.
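
The estimation step can be sketched as follows (the seed value and variable names are assumptions; entropy() is the function sketched in Chapter 2):

import numpy as np

rng = np.random.default_rng(0)                      # fixed seed for reproducibility
outcomes = np.array([1, 2, 3, 4])
pmf = np.array([0.1, 0.2, 0.3, 0.4])
n = 1000                                            # vector length

samples = rng.choice(outcomes, size=n, p=pmf)
# Estimate the p.m.f. by counting occurrences and dividing by the vector length
pmf_est = np.array([(samples == o).sum() for o in outcomes]) / n

entropy_diff = entropy(pmf) - entropy(pmf_est)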

Figure 2: Representation of the estimated p.m.f. In our case we considered a discrete list of four values [1, 2, 3, 4] as possible outcomes. The vector length chosen was 1000, which is sufficient to obtain a good p.m.f. estimate, since the discrete probabilities chosen to generate the sample were [0.1, 0.2, 0.3, 0.4].

Since the probability estimation depends on the vector length, we summarize in table 3.1 how the difference between the two entropies varies with the vector length.

vector length    entropies difference [b]
50 -0.030174700907082297
100 -0.03938319922881339
150 0.02190817554407598
200 0.012449447666063485
250 0.010429856597415021
500 0.016153501049261276
750 0.0073507609006733254
1000 0.020461044198866407
1500 0.007920557264722516
2000 0.013499610422643027
3000 0.00787053836722329
4000 0.007051179475053626
5000 -0.001620842628490493
6000 0.00928873857139001
7000 -0.005255171211856924
8000 -0.0022097544551082926
9000 -0.0021559013405689775
10000 0.004997655831053827
15000 -0.0006234580043351645
20000 -0.003033890465727218
30000 -0.001813452741423971
50000 -0.00385094853612733
100000 0.002210345256025592

In order to better show the trend of the entropy difference, we made a plot over 10000 different values of the vector length with a step of 100. Figure 3 shows that the absolute value of the entropy difference decreases as the vector length grows. The decay looks roughly exponential, but as the vector length increases the way the sample is generated or the presence of round-off errors could affect the entropy difference more strongly.

Figure 3: Trend of the entropy difference with respect to the vector length. The plot was made over 10000 different values of the vector length, ranging from 100 to 1,000,000 with a step of 100. The absolute value of the entropy difference is shown.

3.2 Second Part


The second part aims to build a function that computes the differential entropy of a continuous random variable given its probability density function.
The differential entropy is also referred to as continuous entropy and is defined, in analogy with the discrete case, as:

h(X) = -\int_a^b f_X(x)\, \ln f_X(x)\, dx    (10)

if the integral exists, where X is a continuous random variable with p.d.f. f_X(x) and support set S = (a, b).
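
On a discretized p.d.f. vector the integral can be approximated numerically; a minimal sketch, assuming the p.d.f. is sampled on a grid and using the trapezoidal rule (the implementation actually used in the assignment may differ):

import numpy as np

def differential_entropy(pdf, x):
    """Differential entropy h(X) = -∫ f(x) ln f(x) dx, approximated on the grid x."""
    pdf = np.asarray(pdf, dtype=float)
    integrand = np.where(pdf > 0, pdf * np.log(pdf), 0.0)   # 0 * ln(0) treated as 0
    return -np.trapz(integrand, x)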

3.3 Third Part


The third part of the second assignment focuses on the continuous case. First, we have to choose a p.d.f. vector of a Gaussian continuous random variable, whose mean and variance we can choose freely. Then, we have to generate a sample through this p.d.f. vector and estimate the p.d.f. Once the differential entropies of the p.d.f. and of the estimated p.d.f. have been computed, we have to compute their difference.
In Figure 4 we can clearly see that the p.d.f. has a Gaussian shape and respects the imposed constraint.

Figure 4: The chosen p.d.f. The chosen mean and variance are respectively 33 and 5. We can see that this p.d.f. clearly respects the constraint of being the p.d.f. vector of a Gaussian continuous random variable.

In order to estimate the p.d.f. of the generated sample we used a kernel method. Kernel Density Estimation is a non-parametric method which can be considered an extension of the concept of histogram. The advantages of using the kernel density estimator in place of the histogram are that it is smoother and that it converges to the true density faster as n → ∞. Histogram density estimation in the univariate case is achieved by splitting the range of the random variable into equally spaced intervals and counting the fraction of samples falling in each interval. These intervals of equal width h are also called bins. In what follows we summarize the theory of kernel density estimation following the lectures of the Information Theory & Data Science course [i]. Let us assume that there exists a continuous random variable X and that a sample set S = {s_1, s_2, ..., s_n} is drawn from X, whose probability density function f_X(x) is unknown. Using the rectangular weight function, or Parzen window function:

I(x) = \begin{cases} 1, & \text{if } |x| \le 1/2 \\ 0, & \text{if } |x| > 1/2 \end{cases}

We can write the histogram estimator formula as:


[i] https://didatticaweb.uniroma2.it/it/files/index/insegnamento/196377-Information-Theory-And-Data-Mining/

\hat{f}_X(x) = \frac{1}{nh} \sum_{k=1}^{n} I\left(\frac{x - s_k}{h}\right)    (11)

where \hat{f}_X(x) = \hat{f}_X(x; s_1, s_2, ..., s_n), h is the bin width and n is the number of samples.
The summation in equation 11 represents the fraction of samples falling into each bin.
Defining the kernel function K(x) as a non-negative function with the properties:

\int_{-\infty}^{+\infty} K(x)\, dx = 1    (12)

\int_{-\infty}^{+\infty} x\, K(x)\, dx = 0    (13)

We are able to define the kernel density estimator equation as:


\hat{f}_X(x) = \frac{1}{nh} \sum_{k=1}^{n} K\left(\frac{x - s_k}{h}\right)    (14)
where the width h of the interval is called bandwidth.
The kernel function used in this case is the Gaussian kernel, and the bandwidth used is the one that is optimal when Gaussian basis functions are used to approximate univariate data. This optimal bandwidth is the bandwidth that minimizes the mean integrated squared error and in our case can be expressed as [ii]:

h = \left(\frac{4\hat{\sigma}^5}{3n}\right)^{1/5} \approx 1.06\, \hat{\sigma}\, n^{-1/5}    (15)

Once we have the p.d.f. and the estimated p.d.f., we have to compute their differential entropies and their difference, which in this case are respectively 4.65705070962202, 4.437228716475945 and 0.2198219931460743.
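
The estimation step can be sketched as follows (sample size, seed and variable names are assumptions; the scikit-learn KernelDensity estimator is used):

import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
samples = rng.normal(loc=33.0, scale=np.sqrt(5.0), size=1000)   # mean 33, variance 5

# Optimal bandwidth for Gaussian data, equation 15
h = 1.06 * samples.std(ddof=1) * len(samples) ** (-1 / 5)

kde = KernelDensity(kernel="gaussian", bandwidth=h).fit(samples.reshape(-1, 1))
x = np.linspace(samples.min(), samples.max(), 500)
pdf_est = np.exp(kde.score_samples(x.reshape(-1, 1)))           # score_samples returns the log-density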

3.3.1 Kernel Functions


The choice of the kernel function surely plays an important role in p.d.f. estimation.
In our case, since the sample is drawn from a Gaussian continuous random variable, we expect the Gaussian kernel to perform better than the other kernel functions.
Keeping the bandwidth fixed at the optimal value given by equation 15, we will see how different kernel functions perform in terms of the mean squared error with respect to the p.d.f.
In Figure 6, borrowed from the scikit-learn documentation [iii], we can see the shape of the kernel functions available in the scikit-learn library; we apply all of them in Figure 7.
Before comparing how the kernel choice impacts the entropy difference, we computed the mean squared errors between the p.d.f. and the estimated p.d.f.
[ii] https://en.wikipedia.org/wiki/Kernel_density_estimation
[iii] https://scikit-learn.org/stable/auto_examples/neighbors/plot_kde_1d.html

Figure 5: The estimated p.d.f. compared to the p.d.f.

Figure 6: Available kernel functions in scikit learn library.

Our goal is to find the estimated p.d.f. that best fits the p.d.f., using the mean squared error as a criterion.
We report the results in the following table:

Figure 7: The p.d.f. (in black) compared to the estimated p.d.f.s generated with all the kernels available in the scikit-learn library, keeping the bandwidth fixed at the value given by equation 15.

kernels MSE
gaussian 3.863619081920247e-06
tophat 4.81709154169017e-06
epanechnikov 5.436396853980572e-06
exponential 5.993194851731685e-06
linear 6.170371980137539e-06
cosine 5.6070141839657405e-06
In table 3.3.1 we can note that, as expected, the Gaussian kernel has the smallest mean squared error.
The following table reports the absolute value of the difference between the differential entropies for each kernel:

kernels differences
gaussian 0.2198219931460743
tophat 0.25631972214292276
epanechnikov 0.26431680694603266
exponential 0.17286048456924874
linear 0.26707485654715235
cosine 0.2651241867353118

It is important to note that for the tophat, epanechnikov, linear and cosine kernels the zeroes appearing in the estimated probability density functions are smoothed, i.e. replaced with a small value (10^-16 in our case). We also note that the exponential kernel is the one with the smallest differential entropy difference, even though the mean squared error of its estimated p.d.f. is not the best one.
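
The comparison can be sketched as follows, continuing the previous sketch (samples, h and the grid x are reused; the true Gaussian p.d.f. is evaluated on the same grid):

import numpy as np
from sklearn.neighbors import KernelDensity

pdf_true = np.exp(-(x - 33.0) ** 2 / (2 * 5.0)) / np.sqrt(2 * np.pi * 5.0)   # mean 33, variance 5

mse = {}
for kernel in ["gaussian", "tophat", "epanechnikov", "exponential", "linear", "cosine"]:
    kde = KernelDensity(kernel=kernel, bandwidth=h).fit(samples.reshape(-1, 1))
    pdf_kernel = np.exp(kde.score_samples(x.reshape(-1, 1)))
    pdf_kernel = np.clip(pdf_kernel, 1e-16, None)    # replace exact zeroes with a small value
    mse[kernel] = np.mean((pdf_true - pdf_kernel) ** 2)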

10
The mean squared error represents a more robust metric for kernel selection than the entropy difference, and based on this metric we selected two kernel functions: gaussian and tophat.
We will use these two kernels when searching for the optimal bandwidth in the next section.

3.3.2 Optimal Bandwidth


In addition to the choice of the kernel function, the bandwidth choice strongly influences the p.d.f. estimation.
We chose to use the gaussian and tophat kernels.
First, we will try to reduce the mean squared error between the p.d.f. and the estimated p.d.f. by using, in place of formula 15, Silverman's rule of thumb. This formula, as mentioned in [iv], aims to make the value of h more robust, so that it also fits well in cases such as bimodal mixtures, skewed distributions and long-tailed distributions. The factor 1.06 is reduced to 0.9 in order to improve the fit, as we can see in the formula:

h = 0.9\, \min\left(\hat{\sigma}, \frac{IQR}{1.34}\right) n^{-1/5}    (16)

where \hat{\sigma} is the standard deviation of the samples, n is the sample size and IQR is the interquartile range.
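
A minimal sketch of the rule (the helper name is hypothetical):

import numpy as np

def silverman_bandwidth(samples):
    """Silverman's rule of thumb, equation 16."""
    samples = np.asarray(samples, dtype=float)
    sigma = samples.std(ddof=1)
    iqr = np.subtract(*np.percentile(samples, [75, 25]))
    return 0.9 * min(sigma, iqr / 1.34) * len(samples) ** (-1 / 5)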
We applied formula 16 to the p.d.f. estimation in Figure 8, to be compared with Figure 7. It is not possible to establish only visually which of the two bandwidths is better in our case, so we again resort to the calculation of the MSE.
The following table reports the mean squared errors for both kernels and for both bandwidths; we refer to the bandwidth given by equation 15 as "optimal gaussian" and to the one given by equation 16 as "Silverman rule":

kernels optimal gaussian MSE Silverman rule MSE
gaussian 3.863619081920247e-06 3.7501041007396715e-06
tophat 4.81709154169017e-06 5.589098271484399e-06

From the MSE calculations we can state that Silverman's rule works better than the previous bandwidth for the gaussian kernel, but worse for the tophat kernel. Using the best value of each bandwidth, rounded to the second decimal place, as a starting point, we searched for the best bandwidth for the two kernels. The procedure is a sifting process in which we select the best value and move around it in order to get the best bandwidth.
During the sifting process we plotted the MSE to make sure we were not trapped in a local minimum. We stopped for both kernels at the sixth decimal place and report the results in the following table:

kernels bandwidth MSE
gaussian 1.184066 3.745675666744457e-06
tophat 2.188016 4.0272368084708545e-06

[iv] https://en.wikipedia.org/wiki/Kernel_density_estimation

Figure 8: The p.d.f. compared to the estimated p.d.f.s generated with the gaussian and tophat kernels. The bandwidth was chosen using Silverman's rule of thumb.

4 Third Assignment
In this chapter we discuss the third assignment, which consists of three parts.
We will use the multivariate Iris dataset, which is well known in the literature. This dataset contains three classes, each referring to a type of iris plant: Setosa, Virginica and Versicolour.
Each class contains 50 instances and each instance has five attributes:

• Sepal length in cm

• Sepal width in cm

• Petal length in cm

• Petal width in cm

• Class:

1. Iris Setosa
2. Iris Versicolour
3. Iris Virginica

In the first part we are required to discretize the features of the Iris dataset as integers and compute their probability mass functions, in the second part to compute their entropies, and in the third part to compute the mutual information between any pair of features.

4.1 First Part
The technique used to discretize the features of the dataset is to multiply all the values, which are expressed in centimetres, by 10, obtaining integer values.
After the discretization step we computed the probability mass function of each feature by counting the number of occurrences of the different values present in the feature and dividing by the total number of values in that feature.
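
A sketch of this step, assuming the scikit-learn copy of the Iris dataset is used (the actual data source in the assignment may differ):

import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
features = np.rint(iris.data * 10).astype(int)       # e.g. 5.1 cm -> 51

# p.m.f. of each feature: occurrence counts divided by the number of instances
pmfs = []
for j in range(features.shape[1]):
    values, counts = np.unique(features[:, j], return_counts=True)
    pmfs.append(counts / counts.sum())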

4.2 Second Part


In the second part we are required to compute the entropy of each feature of the dataset. We applied the entropy function built in Chapter 2 to the four probability mass functions obtained in Section 4.1. We summarize the results in the following table, which shows the entropy obtained for each of the four features:

Features Entropy [bits]
Sepal Length 4.822018088381167
Sepal Width 4.023181025924308
Petal Length 5.034569953674171
Petal Width 4.0498270903914175

4.3 Third Part


The third part aims to compute the mutual information between any pair of features of the Iris dataset. To compute the mutual information we used the "mutual information" function defined in Section 2.4 of Chapter 2, which takes as input the joint and marginal probability mass functions of each pair of features. We used the marginal probabilities obtained in the preceding parts of the assignment, and for each pair of features we computed their joint probability mass function. The joint probability mass function was computed in the same way as in the univariate case, but taking into account that this time we are in a multivariate setting and the considered event is the joint occurrence of the two values.
In analogy with the univariate case, we counted the number of occurrences of the different value pairs that the two considered features can assume and divided by the total number of pairs.
When a pair of feature values does not occur in the dataset, we smoothed the corresponding probability by replacing the zero with 10^-20, and we exploited the fact that mutual information is a symmetric measure to avoid repeating calculations.
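
A sketch of the pairwise computation, continuing the discretization sketch of Section 4.1 (features and pmfs are reused; mutual_information() is the function sketched in Chapter 2; the helper name joint_pmf is hypothetical):

import numpy as np
from itertools import combinations

def joint_pmf(a, b):
    """Joint p.m.f. of two discretized features as a 2-D array of relative frequencies."""
    va, vb = np.unique(a), np.unique(b)
    counts = np.zeros((len(va), len(vb)))
    for x, y in zip(a, b):
        counts[np.searchsorted(va, x), np.searchsorted(vb, y)] += 1
    return counts / counts.sum()

for i, j in combinations(range(features.shape[1]), 2):        # each unordered pair once
    p_xy = np.clip(joint_pmf(features[:, i], features[:, j]), 1e-20, None)
    mi = mutual_information(p_xy, pmfs[i], pmfs[j])
    print(f"features {i}-{j}: I = {mi:.3f} bits")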
We resume in the following table the mutual information results:
Feature 1 Feature 2 Mutual Information
Sepal Length Sepal Width 2.089844090533955
Sepal Length Petal Length 3.002867101602726
Sepal Length Petal Width 2.2406847582216676
Sepal Width Petal Length 2.2274285391747153
Sepal Width Petal Width 1.6759771322899024
Petal Length Petal Width 2.6948471229825794

5 Fourth Assignment
The fourth assignment aims to build and compare three different types of Bayes classifiers.
As classifiers, their goal is to return a class label vector for the test dataset, given a training dataset with its class label vector and a test dataset.
These classifiers are based on Bayes' theorem; what distinguishes them from one another are their assumptions.
Bayes' theorem states that, for two events A and B of the sample space, the following equation holds:

P(A|B) = \frac{P(B|A)\, P(A)}{P(B)}    (17)

where P(A|B) and P(B|A) are conditional probabilities: respectively, the probability of event A occurring given that B is true, also called the posterior probability of A given B, and the probability of event B occurring given that A is true, also regarded as the likelihood of A given a fixed B. P(A) and P(B) are the marginal probabilities of observing A and B respectively.
In the rest of the text we will refer to the Bayes Classifier as the classifier in which we assume that the features are continuous random variables and we estimate their multivariate probability density function through a multivariate kernel density estimator.
From this point on we will refer to the Naïve Bayes Classifier as the classifier in which we assume that the variables are not only continuous and random, but also independent. In building the Naïve Bayes Classifier we used a univariate kernel density estimator to estimate the probability density function of each feature.
The last classifier is referred to as the Gaussian Naïve Bayes Classifier, in which we assume that the variables are not only continuous, random and independent, but also Gaussian distributed.
In the last step we compare the accuracy of the three classifiers on the Iris dataset, split in half between training set and test set.

5.1 Bayes Classifier


The Bayes classifier is based on Bayes' theorem (equation 17), in which we interpret the posterior probability P(A|B) as the posterior probability of the class given the instance, P(c_k|x). Our goal is to select the class c_k that maximizes the posterior probability, obtained by applying Bayes' theorem to express the posterior as a function of the likelihood and the prior probability. We can rewrite Bayes' theorem as follows:

P(c_k|x) = \frac{P(x|c_k)\, P(c_k)}{P(x)}    (18)

The Bayes classifier must calculate a posterior probability for each class, and the instance is assigned to the class with the maximum posterior probability value.
Since P(x) is not a function of k, it can be ignored when maximizing the posterior probability.

14
As we saw before, a Bayes classifier needs as inputs the prior probabilities of the classes P(c_k) and the likelihoods P(x|c_k).
We estimated the prior probabilities through the estimation of the p.m.f. of the class feature, and the likelihoods by first grouping the dataset by class and then using a multivariate probability density function estimate.
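
A minimal sketch of this classifier, with hypothetical helper names and the scikit-learn KernelDensity estimator (the actual implementation may differ):

import numpy as np
from sklearn.neighbors import KernelDensity

def fit_bayes_kde(X_train, y_train, bandwidth, kernel="gaussian"):
    """One multivariate KDE per class plus the class priors."""
    classes = np.unique(y_train)
    priors = {c: np.mean(y_train == c) for c in classes}       # p.m.f. of the class feature
    kdes = {c: KernelDensity(kernel=kernel, bandwidth=bandwidth).fit(X_train[y_train == c])
            for c in classes}
    return classes, priors, kdes

def predict_bayes_kde(X_test, classes, priors, kdes):
    """Assign each instance to the class with the maximum posterior (P(x) omitted)."""
    log_post = np.column_stack([kdes[c].score_samples(X_test) + np.log(priors[c]) for c in classes])
    return classes[np.argmax(log_post, axis=1)]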

5.2 Naı̈ve Bayes Classifier


The Naïve Bayes Classifier is a Bayes Classifier in which the assumption is made that the features are independent of each other. This naïve assumption reduces the computational cost, since the likelihood factorizes into the product of the univariate terms P(x_j|c_k) for 1 ≤ k ≤ N_c, i.e.:

P(x|c_k) = \prod_{j=1}^{M} P(x_j|c_k)    (19)

5.3 Gaussian Naı̈ve Bayes Classifier


The Gaussian Naïve Bayes Classifier differs from the Naïve Bayes Classifier in the assumption that the values associated with each class, which we already supposed to be continuous, random and independent, are also Gaussian distributed. The data are first split by class, and then the mean and variance are computed for each feature and each class. P(x_j|c_k) is then computed by plugging the obtained means and variances into the Gaussian density formula, and finally, assuming feature independence, P(x|c_k) is computed, where x is the (in general multidimensional) instance.
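
A minimal sketch, with hypothetical helper names (scikit-learn's GaussianNB implements the same idea):

import numpy as np

def fit_gaussian_nb(X_train, y_train):
    """Per-class, per-feature means and variances plus the class priors."""
    classes = np.unique(y_train)
    stats = {c: (X_train[y_train == c].mean(axis=0), X_train[y_train == c].var(axis=0))
             for c in classes}
    priors = {c: np.mean(y_train == c) for c in classes}
    return classes, stats, priors

def predict_gaussian_nb(X_test, classes, stats, priors):
    log_post = []
    for c in classes:
        mu, var = stats[c]
        # Sum of the log Gaussian densities over the features (independence assumption)
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (X_test - mu) ** 2 / var, axis=1)
        log_post.append(log_lik + np.log(priors[c]))
    return classes[np.argmax(np.column_stack(log_post), axis=1)]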

5.4 Iris Dataset Classification


The last part of the fourth assignment aims to compute and compare the average accuracy of the three classifiers previously built on the Iris dataset, discussed in the first part of Chapter 4.
For each class label we take 50% of the instances as the training set and the remaining 50% as the test set.
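
The per-class split and the accuracy evaluation can be sketched as follows (seed and shuffling are assumptions; the helpers come from the previous sketches):

import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target
rng = np.random.default_rng(0)

train_idx, test_idx = [], []
for c in np.unique(y):
    idx = rng.permutation(np.where(y == c)[0])       # the 50 instances of class c
    train_idx.extend(idx[:25])                       # first half for training
    test_idx.extend(idx[25:])                        # second half for testing

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

y_pred = predict_gaussian_nb(X_test, *fit_gaussian_nb(X_train, y_train))
accuracy = np.mean(y_pred == y_test)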
The following table reports the accuracy, precision and recall results for the three classifiers; for the Bayes and Naïve Bayes classifiers we used a gaussian kernel and a bandwidth chosen using equation 16.

Metrics Bayes Naïve Bayes Gaussian Naïve Bayes
accuracy 0.8933333333333333 0.9733333333333334 0.96
precision 0.9191919191919191 0.9753086419753086 0.9642857142857143
recall 0.8933333333333333 0.9733333333333334 0.96

In our specific case the Naïve Bayes Classifier works better in terms of accuracy than the others, even better than the Gaussian Naïve Bayes Classifier, which does not use k.d.e. and has no parameters, such as the kernel function or the bandwidth, that can be tuned.
We were surprised to see the Naïve Bayes Classifier performing better than the Bayes Classifier, since it makes more assumptions on the data. We will see in the next section that the accuracy scores are strongly dependent on the bandwidth.

5.4.1 Bandwidth Selection for Gaussian Naïve Bayes Classifier


In this section we search for the bandwidth that maximizes the accuracy of the Bayes and Naïve Bayes Classifiers. This kind of search may be of little use from the classification point of view, because it leads to a loss of generalization and possibly to poorer accuracy on new data, but it allows us to understand whether our models are capable of representing the data.
The bandwidth was found using a sifting process as in Section 3.3.2 of Chapter 3, but this time starting from the value 0.01. We summarize our findings in the following table:

Classifiers Bandwidth Accuracy
Bayes 1.395 ± 0.695 0.9733333333333334
Naïve Bayes 0.0591 ± 0.0005 0.9866666666666667

We can see that the accuracy is strongly dependent on the bandwidth, even if the Bayes classifier still does not seem to reach the Naïve Bayes accuracy. The bandwidth values reported in table 5.4.1 are values at which the accuracy has reached a plateau: for instance, for the Bayes Classifier the accuracy does not change within the bandwidth interval 1.395 ± 0.695.
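
The sifting search can be sketched, for the Bayes classifier, as a simple sweep over candidate bandwidths (grid and step are assumptions; the helpers and the split come from the previous sketches):

import numpy as np

best_h, best_acc = None, 0.0
for h in np.arange(0.01, 3.0, 0.005):
    classes, priors, kdes = fit_bayes_kde(X_train, y_train, bandwidth=h)
    acc = np.mean(predict_bayes_kde(X_test, classes, priors, kdes) == y_test)
    if acc > best_acc:
        best_h, best_acc = h, acc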

6 Conclusion
In this document we described the four assignments of the Information Theory and Data Science course. In the first assignment we built a small Python library of information theoretic measures and plotted the entropy of a binary random variable, which reaches its maximum of 1 bit at p0 = 1/2. In the second assignment we saw that the difference between the entropy of a distribution and that of its estimate decreases as the sample size grows, and that for Gaussian data the Gaussian kernel with a bandwidth close to Silverman's rule gives the smallest mean squared error. In the third assignment we discretized the Iris features and computed their entropies and pairwise mutual information. In the fourth assignment we compared three Bayes classifiers on the Iris dataset: the Naïve Bayes classifier achieved the best accuracy, and the accuracy of the kernel-based classifiers proved to be strongly dependent on the chosen bandwidth.
