ITDS Fifth Assignment
Contents
1 Introduction
2 First Assignment
  2.1 Entropy
  2.2 Joint Entropy
  2.3 Conditional Entropy
  2.4 Mutual Information
  2.5 Normalized Conditional Entropy
  2.6 Normalized Joint Entropy
  2.7 Normalized Mutual Information
    2.7.1 Type 1
    2.7.2 Type 2
    2.7.3 Type 3
  2.8 Python Script
3 Second Assignment
  3.1 First Part
  3.2 Second Part
  3.3 Third Part
    3.3.1 Kernel Functions
    3.3.2 Optimal Bandwidth
4 Third Assignment
  4.1 First Part
  4.2 Second Part
  4.3 Third Part
5 Fourth Assignment
  5.1 Bayes Classifier
  5.2 Naïve Bayes Classifier
  5.3 Gaussian Naïve Bayes Classifier
  5.4 Iris Dataset Classification
    5.4.1 Bandwidth Selection for Gaussian Naïve Bayes Classifier
6 Conclusion
1 Introduction
Information Theory represents an intersection of different fields such as probability theory,
statistics, computer science, statistical mechanics, information engineering, and electrical
engineering.
This theory has gone well beyond the telecommunications field, finding applications in areas spanning from neurobiology to information retrieval.
Due to its multidisciplinary nature, it can also be applied to data analysis.
This document collects and discusses the assignments required by the course of Information Theory and Data Science at the University of Rome Tor Vergata.
In Chapter 2 we build an information theory library in Python and apply its entropy function in a Python script that computes the entropy of a generic binary random variable as a function of its probability mass function and plots the resulting entropy curve.
In Chapter 6 we present our conclusions, briefly summarizing the results of the assignments.
2 First Assignment
The first task is to write functions able to compute information-theoretic measures.
The measures required by the assignment are: the entropy, the joint entropy, the conditional entropy, the mutual information and the normalized versions of the mutual information.
The second task is to write a Python script that computes the entropy of a generic binary random variable as a function of its probability mass function and plots the entropy function.
2.1 Entropy
The first function required is called "entropy" and computes the entropy of a discrete random variable given its probability mass function.
The entropy can be defined as the expected value of the self-information, as shown in the following equation:
H(X) = -\sum_{k=1}^{N_x} p_k \log(p_k)    (1)
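A minimal sketch of how such a function could look in Python is reported below (the function name and the use of NumPy are our assumptions; the library written for the assignment may differ):

import numpy as np

def entropy(pmf, base=2):
    """Entropy of a discrete random variable given its p.m.f. vector."""
    p = np.asarray(pmf, dtype=float)
    p = p[p > 0]                       # by convention 0 * log(0) = 0
    return -np.sum(p * np.log(p)) / np.log(base)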
2.5 Normalized Conditional Entropy
The Normalized Conditional Entropy can be expressed as:
\eta_{CE}(X|Y) = \frac{H(X|Y)}{H(X)}    (5)

\eta_{CE}(X|Y) = 1 - \frac{I(X;Y)}{H(X) + H(Y)}    (6)
2.7.1 Type 1
\eta_{MI1}(X;Y) = \frac{1}{\eta_{JE}} - 1    (7)
2.7.2 Type 2
\eta_{MI2}(X;Y) = 1 + \eta_{MI1}    (8)
2.7.3 Type 3
\eta_{MI3} = \frac{I(X;Y)}{\sqrt{H(X) + H(Y)}}    (9)
Figure 1: In this figure we can see the entropy as a function of the p.m.f. of a generic binary random variable. The maximum is 1 bit, reached at p = 0.5, because we use the base-2 logarithm in the definition of the entropy function.
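A possible version of the script behind Figure 1 is sketched below; it uses the closed-form binary entropy and Matplotlib (the grid size and plot labels are our choices):

import numpy as np
import matplotlib.pyplot as plt

# p.m.f. of a generic binary random variable: (p, 1 - p)
p = np.linspace(1e-3, 1 - 1e-3, 500)

# binary entropy in bits (base-2 logarithm)
H = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

plt.plot(p, H)
plt.xlabel("p")
plt.ylabel("H(X) [bit]")
plt.title("Entropy of a binary random variable")
plt.show()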
3 Second Assignment
The second assignment consists of three parts: in the first and third parts it is required to compute the difference between entropies, and in the second part to write a function which computes the differential entropy.
In the first part we have to compute the entropy of a discrete random variable given its
probability mass function vector, then we have to compute the entropy of an estimated
probability mass function from a set of samples generated through the preceding p.m.f.
vector. In the last step we have to compute the difference between the two entropies.
In the second part we write a function that can compute the differential entropy of a
continuous random variable once its probability density function is given.
In the third part it is required, firstly, to compute the differential entropy of a Gaussian continuous random variable once its probability density function vector is given.
Subsequently, we have to compute the differential entropy of a p.d.f. estimated from a set of samples generated through the preceding p.d.f. vector. Finally, we have to compute the difference between the two differential entropies.
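A minimal sketch of the first part is reported below, under the same setting used later in Figure 2 (outcomes [1, 2, 3, 4], probabilities [0.1, 0.2, 0.3, 0.4] and vector length 1000); the entropy function is the one sketched in Chapter 2 and the random seed is our choice:

import numpy as np

outcomes = np.array([1, 2, 3, 4])
pmf_true = np.array([0.1, 0.2, 0.3, 0.4])
n = 1000                                   # vector length

# generate a sample of length n through the given p.m.f. vector
rng = np.random.default_rng(0)
sample = rng.choice(outcomes, size=n, p=pmf_true)

# estimate the p.m.f. as the relative frequency of each outcome
pmf_est = np.array([np.mean(sample == x) for x in outcomes])

def entropy(pmf, base=2):
    p = np.asarray(pmf, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p)) / np.log(base)

print("entropy difference:", abs(entropy(pmf_true) - entropy(pmf_est)))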
Figure 2: In this figure we can see a representation of the estimated p.m.f. In our case we considered a discrete list of four values [1, 2, 3, 4] as possible outcomes. The vector length chosen was 1000, which is by far sufficient to obtain a good p.m.f. estimate, since the discrete probabilities previously chosen to generate the sample were [0.1, 0.2, 0.3, 0.4].
Since the probability estimation depends on the vector length, we summarize in Table 3.1 how the difference between the two entropies varies as the vector length varies.
In order to better show the trend of the entropy difference, we made a plot over 10000 different values of the vector length with a step of 100. Figure 3 shows that the absolute value of the entropy difference decreases as the vector length grows. The decay seems exponential, but as the vector length increases, the way in which the sample is generated or the presence of round-off errors could affect the entropy difference more strongly.
Figure 3: In this figure we show the trend of the entropy difference with respect to the vector length. The plot was made over 10000 different values of the vector length, ranging from 100 to 1000000 with a step of 100. We took the absolute value of the entropy difference.
The differential entropy of a continuous random variable X with p.d.f. f_X(x) and support set S = (a, b) is defined as

h(X) = -\int_{a}^{b} f_X(x) \log f_X(x) \, dx    (10)

if the integral exists.
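A sketch of such a function for a p.d.f. given as a vector sampled on a uniform grid is reported below (the rectangle-rule integration and the small constant used to avoid log(0) are our choices):

import numpy as np

def differential_entropy(pdf, x, base=np.e, eps=1e-16):
    """Differential entropy of a continuous r.v. whose p.d.f. is sampled on the uniform grid x."""
    f = np.maximum(np.asarray(pdf, dtype=float), eps)   # avoid log(0)
    dx = x[1] - x[0]                                     # uniform grid spacing
    return float(np.sum(-f * np.log(f) / np.log(base)) * dx)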
Given the chosen Gaussian p.d.f. vector, we then have to generate a sample through it and to estimate the p.d.f. from the sample. Once the differential entropies of the true p.d.f. and of the estimated p.d.f. have been computed, we have to compute their difference.
In Figure 4 we can clearly see that the chosen p.d.f. has a Gaussian shape and respects the imposed constraint.
Figure 4: In this figure we can see the chosen p.d.f. The mean and variance chosen are respectively 33 and 5. We can see that this p.d.f. clearly respects the constraint of being the p.d.f. vector of a Gaussian continuous random variable.
In order to estimate the p.d.f. of the generated sample we used a kernel method. Kernel Density Estimation is a non-parametric method which can be considered an extension of the concept of the histogram. The advantages of using the kernel density estimator in place of histograms are that it is smoother and converges to the true density faster as n → ∞. Histogram density estimation in the univariate case is achieved through the rudimentary concept of splitting the range of the random variable into equally spaced intervals and counting the fraction of samples in each interval. These intervals of equal width h are also called bins. In what follows we summarize the theory of kernel density estimation following the lectures of the Information Theory & Data Science course. Let us assume that there exists a continuous random variable X and that a sample set S = {s_1, s_2, ..., s_n} is drawn from X with unknown probability density function f_X(x). Using the rectangular weight function, or Parzen window function:
I(x) = \begin{cases} 1, & \text{if } |x| \le 1/2 \\ 0, & \text{if } |x| > 1/2 \end{cases}
\hat{f}_X(x) = \frac{1}{nh} \sum_{k=1}^{n} I\!\left(\frac{x - s_k}{h}\right)    (11)

where \hat{f}_X(x) = \hat{f}_X(x; s_1, s_2, ..., s_n), h is the bin width and n is the number of samples.
The summation in equation (11) represents the fraction of samples falling into each bin.
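Equation (11) can be implemented directly; a small sketch is given below (the function and variable names are ours):

import numpy as np

def parzen_estimate(x_grid, samples, h):
    """Rectangular-window estimate of equation (11) evaluated on the points of x_grid."""
    s = np.asarray(samples, dtype=float)
    n = len(s)
    # I((x - s_k) / h) equals 1 when |x - s_k| <= h/2, i.e. s_k falls in a window of width h around x
    counts = np.array([np.sum(np.abs(x - s) <= h / 2) for x in x_grid])
    return counts / (n * h)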
Defining the kernel function K(x) as a non-negative function with the properties:

\int_{-\infty}^{+\infty} K(x)\,dx = 1    (12)

\int_{-\infty}^{+\infty} x\,K(x)\,dx = 0    (13)
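In practice we used the KernelDensity estimator of scikit-learn; a sketch of how the estimate of Figure 5 could be obtained and scored against the true Gaussian p.d.f. (mean 33, variance 5) is given below. The sample size, bandwidth value and evaluation grid are our assumptions:

import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
samples = rng.normal(loc=33.0, scale=np.sqrt(5.0), size=1000)   # mean 33, variance 5

x_grid = np.linspace(20.0, 46.0, 500)
pdf_true = np.exp(-(x_grid - 33.0) ** 2 / (2 * 5.0)) / np.sqrt(2 * np.pi * 5.0)

kde = KernelDensity(kernel="gaussian", bandwidth=0.5)
kde.fit(samples.reshape(-1, 1))                                 # scikit-learn expects 2-D input
pdf_est = np.exp(kde.score_samples(x_grid.reshape(-1, 1)))      # score_samples returns the log-density

mse = np.mean((pdf_est - pdf_true) ** 2)
print("MSE:", mse)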
Figure 5: In this figure we can see the estimated p.d.f. compared to the true p.d.f.
Our goal is to find the estimated p.d.f. that best fits the true p.d.f., and as a criterion we consider the mean squared error.
We report the results in the following table:
Figure 7: In this figure we can see the true p.d.f. in black compared to the estimated p.d.f.s generated with all the kernels available in the scikit-learn library, keeping fixed the bandwidth given by equation 15.
kernel          MSE
gaussian        3.863619081920247e-06
tophat          4.81709154169017e-06
epanechnikov    5.436396853980572e-06
exponential     5.993194851731685e-06
linear          6.170371980137539e-06
cosine          5.6070141839657405e-06
In Table 3.3.1 we can note that, as expected, the Gaussian kernel has the smallest mean squared error.
In the following table we report the absolute value of the differences between the differential entropies for each kernel:
kernel          differential entropy difference
gaussian        0.2198219931460743
tophat          0.25631972214292276
epanechnikov    0.26431680694603266
exponential     0.17286048456924874
linear          0.26707485654715235
cosine          0.2651241867353118
It is important to note that, in the case of the tophat, epanechnikov, linear and cosine kernels, the zeros appearing in the estimated probability density functions are smoothed (i.e. replaced with a small value, in our case 10^{-16}). We also note that the exponential kernel is the one with the smallest differential entropy difference, even though the mean squared error of its estimated p.d.f. is not the best one.
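The smoothing step mentioned above can be sketched as follows (the function name is ours; 10^{-16} is the value used in the text):

import numpy as np

def smooth_zeros(pdf_est, eps=1e-16):
    """Replace the zeros produced by finite-support kernels (tophat, epanechnikov, linear, cosine)
    so that the logarithm in the differential entropy is always defined."""
    return np.where(pdf_est > 0, pdf_est, eps)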
The mean squared error represents a more robust metric for kernel selection than the differential entropy difference, and based on this metric we selected two kernel functions: gaussian and tophat.
We will use these two kernel functions in the search for the optimal bandwidth in the next section.
From the MSE calculations we can affirm that Silverman's rule works better than the preceding rule for the Gaussian kernel, but works worse for the tophat kernel. Using the best value of each bandwidth, rounded to the second decimal place, as a starting point, we search for the best bandwidth for the two kernels. The procedure is like a sifting process in which we select the best value and move around it in order to get the best bandwidth. Throughout the sifting process we made sure not to be trapped in local minima by plotting the MSE. For both kernels we stopped at the sixth decimal place and report the results in the following table:
Footnote iv: https://en.wikipedia.org/wiki/Kernel_density_estimation
Figure 8: In this figure we can see the true p.d.f. compared to the estimated p.d.f.s generated with the gaussian and tophat kernels. The bandwidth choice was made using Silverman's rule of thumb.
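A sketch of the bandwidth search described above is given below: it starts from Silverman's rule of thumb and refines the value by minimizing the MSE against the true p.d.f. on a shrinking grid. The refinement schedule and the helper names are our assumptions; the actual sifting procedure used for the assignment may differ:

import numpy as np
from sklearn.neighbors import KernelDensity

def silverman_bandwidth(samples):
    """Silverman's rule of thumb for a univariate sample."""
    s = np.asarray(samples, dtype=float)
    sigma = np.std(s, ddof=1)
    iqr = np.subtract(*np.percentile(s, [75, 25]))
    return 0.9 * min(sigma, iqr / 1.34) * len(s) ** (-1 / 5)

def mse_for_bandwidth(h, samples, x_grid, pdf_true, kernel="gaussian"):
    kde = KernelDensity(kernel=kernel, bandwidth=h).fit(samples.reshape(-1, 1))
    pdf_est = np.exp(kde.score_samples(x_grid.reshape(-1, 1)))
    return np.mean((pdf_est - pdf_true) ** 2)

def refine_bandwidth(samples, x_grid, pdf_true, kernel="gaussian", steps=4):
    """Sifting-like search: scan a shrinking interval around the current best bandwidth."""
    best = silverman_bandwidth(samples)
    width = best / 2
    for _ in range(steps):
        grid = np.linspace(max(best - width, 1e-6), best + width, 21)
        best = min(grid, key=lambda h: mse_for_bandwidth(h, samples, x_grid, pdf_true, kernel))
        width /= 10                      # zoom in around the new best value
    return best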
4 Third Assignment
In this chapter we discuss the third assignment, which consists of three parts.
We use the multivariate Iris dataset, which is well known in the literature. This dataset contains three classes, each referring to a type of iris plant: Setosa, Virginica and Versicolour.
Each class contains 50 instances and each instance has five attributes:
• Sepal length in cm
• Sepal width in cm
• Petal length in cm
• Petal width in cm
• Class:
1. Iris Setosa
2. Iris Versicolour
3. Iris Virginica
In the first part it is required to discretize the features of the Iris dataset as integers and to compute their probability mass functions; in the second part it is required to compute their entropies; and in the third part to compute the mutual information between any pair of features.
4.1 First Part
The technique used to discretize the features of the dataset is to multiply all the values, which are in centimetres, by 10, obtaining integer values.
After the discretization step we computed the probability mass function of each feature by counting the number of occurrences of each distinct value in the feature and dividing by the total number of values in that feature.
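A sketch of this step is reported below (loading the dataset through scikit-learn is our choice; the assignment may load it differently):

import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data                      # 150 x 4 matrix of feature values in cm

# discretization: multiply the centimetre values by 10 to obtain integers
X_int = np.rint(X * 10).astype(int)

# p.m.f. of each feature: occurrences of each value divided by the number of samples
pmfs = []
for j in range(X_int.shape[1]):
    values, counts = np.unique(X_int[:, j], return_counts=True)
    pmfs.append(dict(zip(values, counts / counts.sum())))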
5 Fourth Assignment
The fourth assignment aims to build and compare three different types of Bayes classifiers.
As classifiers, their goal is to return a class label vector for the test dataset, given a training dataset with its class label vector and a test dataset.
These classifiers are all based on Bayes' theorem; what differentiates them are the assumptions they make.
Bayes' theorem says that for two events A and B of the sample space, the following equation holds:

P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}    (17)
where P(A|B) and P(B|A) are conditional probabilities: respectively, the probability of event A occurring given that B is true, also called the posterior probability of A given B, and the probability of event B occurring given that A is true, also interpreted as the likelihood of A given a fixed B. P(A) and P(B) are the marginal probabilities of observing A and B respectively.
In the rest of the text we will refer to the Bayes Classifier as the classifier in which we assume that the features are continuous random variables and we estimate the multivariate probability density function through a multivariate kernel density estimator.
From this point on we will refer to the Naïve Bayes Classifier as the classifier in which we make the assumption that the variables are not only continuous and random, but also independent. In building the Naïve Bayes Classifier we used a univariate kernel density estimator to estimate the probability density function of each feature.
The last classifier is referred to as the Gaussian Naïve Bayes Classifier, in which we make the assumption that the variables are not only continuous, random and independent, but also Gaussian distributed.
In the last step we have to compare the accuracy of the three classifiers on the Iris dataset, split in half between training dataset and test dataset.
P(c_k|x) = \frac{P(x|c_k)\,P(c_k)}{P(x)}    (18)
The Bayes classifier must calculate the posterior probability for each class; the sample is then assigned to the class with the maximum posterior probability. Since P(x) does not depend on k, it can be ignored in the maximization of the posterior probability.
As we saw before, a Bayes classifier needs as inputs the prior probabilities of the classes P(c_k) and the likelihoods P(x|c_k).
We estimated the prior probabilities through the estimation of the p.m.f. of the class feature, and the likelihoods by first grouping the dataset by class and then using a multivariate probability density function estimate for each class.
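A sketch of how such a classifier could be assembled with the multivariate KernelDensity estimator of scikit-learn is given below (the bandwidth value and the function names are our assumptions):

import numpy as np
from sklearn.neighbors import KernelDensity

def fit_bayes_classifier(X_train, y_train, bandwidth=1.0, kernel="gaussian"):
    """One multivariate KDE per class, plus the prior probability of each class."""
    classes = np.unique(y_train)
    priors = {c: np.mean(y_train == c) for c in classes}
    kdes = {c: KernelDensity(kernel=kernel, bandwidth=bandwidth).fit(X_train[y_train == c])
            for c in classes}
    return classes, priors, kdes

def predict_bayes(X_test, classes, priors, kdes):
    # log P(c_k | x) is proportional to log P(x | c_k) + log P(c_k); P(x) does not depend on k
    log_post = np.column_stack([kdes[c].score_samples(X_test) + np.log(priors[c])
                                for c in classes])
    return classes[np.argmax(log_post, axis=1)]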
P(x|c_k) = \prod_{j=1}^{M} P(x_j|c_k)    (19)
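Under the independence assumption of equation (19) the multivariate likelihood factorizes into univariate terms; a sketch of this factorization with one univariate KDE per feature is given below (names are ours):

import numpy as np
from sklearn.neighbors import KernelDensity

def fit_feature_kdes(X_class, bandwidth=1.0, kernel="gaussian"):
    """One univariate KDE per feature, fitted on the training samples of a single class."""
    return [KernelDensity(kernel=kernel, bandwidth=bandwidth).fit(X_class[:, [j]])
            for j in range(X_class.shape[1])]

def naive_log_likelihood(X_test, feature_kdes):
    """log P(x | c_k) = sum_j log P(x_j | c_k), as in equation (19)."""
    return sum(kde.score_samples(X_test[:, [j]]) for j, kde in enumerate(feature_kdes))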
In our specific case the Naïve Bayes Classifier works better in terms of accuracy than the others, even better than the Gaussian Naïve Bayes Classifier, which does not use k.d.e. and has no parameters such as kernel function or bandwidth that can be modified.
We are surprised to see that the Naïve Bayes Classifier performs better than the Bayes Classifier, since it makes more assumptions on the data. We will see in the next section that the accuracy scores are strongly dependent on the bandwidth.
We can see that the accuracy is strongly dependent on the bandwidth, even if the Bayes classifier does not seem to reach the Naïve Bayes accuracy. The bandwidth values reported in Table 5.4.1 are values at which the accuracy has reached a plateau. For instance, for the Bayes Classifier the accuracy does not change in the bandwidth interval 1.395 ± 0.695.
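A sketch of how the dependence of the accuracy on the bandwidth can be explored is given below; it reuses the fit_bayes_classifier and predict_bayes functions sketched above and a 50/50 split of the Iris dataset (the bandwidth grid and random seed are our choices):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.5, random_state=0, stratify=iris.target)

# accuracy of the multivariate-KDE Bayes classifier as a function of the bandwidth
for h in np.arange(0.1, 2.1, 0.1):
    classes, priors, kdes = fit_bayes_classifier(X_train, y_train, bandwidth=h)
    y_pred = predict_bayes(X_test, classes, priors, kdes)
    print(f"h = {h:.1f}  accuracy = {accuracy_score(y_test, y_pred):.3f}")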
6 Conclusion
In this document we have built an information theory library in Python and applied it to the assignments of the course. We verified that the entropy of a binary random variable reaches its maximum of 1 bit when the base-2 logarithm is used, and that the absolute difference between the entropy of a given p.m.f. and the entropy of a p.m.f. estimated from samples decreases as the vector length grows. In the kernel density estimation of a Gaussian p.d.f., the Gaussian kernel gave the smallest mean squared error, and the bandwidth was refined starting from the rule-of-thumb values. Finally, on the Iris dataset the Naïve Bayes Classifier achieved the best accuracy among the three Bayes classifiers considered, with the accuracy strongly dependent on the chosen bandwidth.