ITDS Fifth Assignment
Contents
1 Introduction
2 First Assignment
  2.1 Entropy
  2.2 Joint Entropy
  2.3 Conditional Entropy
  2.4 Mutual Information
  2.5 Normalized Conditional Entropy
  2.6 Normalized Joint Entropy
  2.7 Normalized Mutual Information
    2.7.1 Type 1
    2.7.2 Type 2
    2.7.3 Type 3
  2.8 Python Script
3 Second Assignment
  3.1 First Part
  3.2 Second Part
  3.3 Third Part
    3.3.1 Kernel Functions
    3.3.2 Optimal Bandwidth
4 Third Assignment
  4.1 First Part
  4.2 Second Part
  4.3 Third Part
5 Fourth Assignment
  5.1 Bayes Classifier
  5.2 Naïve Bayes Classifier
  5.3 Gaussian Naïve Bayes Classifier
  5.4 Iris Dataset Classification
    5.4.1 Bandwidth Selection for Gaussian Naïve Bayes Classifier
6 Conclusion
1 Introduction
Information Theory represents an intersection of different fields such as probability theory,
statistics, computer science, statistical mechanics, information engineering, and electrical
engineering.
This theory has gone well beyond the telecommunications field, finding applications in areas spanning from neurobiology to information retrieval.
Due to its multidisciplinary nature, it can also be applied to data analysis.
This document collects and discusses the assignments required by the course of Information Theory and Data Science at the University of Rome Tor Vergata.
In Chapter 2 we build an information theory library in Python and apply its entropy function in a Python script that computes the entropy of a generic binary random variable as a function of its probability mass function and plots the resulting entropy curve.
In Chapter 6 we present our conclusions, briefly summarizing the results of the assignments.
2 First Assignment
The first task is to write functions able to compute information-theoretic measures.
The measures required by the assignment are: the entropy, the joint entropy, the conditional entropy, the mutual information and the normalized versions of the mutual information.
The second task is to write a Python script that computes the entropy of a generic binary random variable as a function of its probability mass function and plots the entropy function.
2.1 Entropy
The first function required is called "entropy" and computes the entropy of a discrete random variable given its probability mass function.
The entropy can be defined as the expected value of the self-information, as shown in the following equation:
H(X) = -\sum_{k=1}^{N_x} p_k \log(p_k)    (1)
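A minimal sketch of how such a function could look in Python is reported below (the function name and the use of NumPy are our assumptions; the library written for the assignment may differ):

import numpy as np

def entropy(pmf, base=2):
    """Entropy of a discrete random variable given its p.m.f. vector."""
    p = np.asarray(pmf, dtype=float)
    p = p[p > 0]                       # by convention 0 * log(0) = 0
    return -np.sum(p * np.log(p)) / np.log(base)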
2.5 Normalized Conditional Entropy
The Normalized Conditional Entropy can be expressed as:
\eta_{CE}(X|Y) = \frac{H(X|Y)}{H(X)}    (5)

\eta_{CE}(X|Y) = 1 - \frac{I(X;Y)}{H(X) + H(Y)}    (6)
2.7.1 Type 1
\eta_{MI1}(X;Y) = \frac{1}{\eta_{JE}} - 1    (7)
2.7.2 Type 2
\eta_{MI2}(X;Y) = 1 + \eta_{MI1}    (8)
2.7.3 Type 3
\eta_{MI3} = \frac{I(X;Y)}{\sqrt{H(X) + H(Y)}}    (9)
Figure 1: In this figure we can see the entropy as a function of the p.m.f. of a generic binary random variable. The maximum is 1 bit, reached at p = 0.5, because we use the base-2 logarithm in the definition of the entropy function.
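A possible version of the script behind Figure 1 is sketched below; it uses the closed-form binary entropy and Matplotlib (the grid size and plot labels are our choices):

import numpy as np
import matplotlib.pyplot as plt

# p.m.f. of a generic binary random variable: (p, 1 - p)
p = np.linspace(1e-3, 1 - 1e-3, 500)

# binary entropy in bits (base-2 logarithm)
H = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

plt.plot(p, H)
plt.xlabel("p")
plt.ylabel("H(X) [bit]")
plt.title("Entropy of a binary random variable")
plt.show()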
3 Second Assignment
The second assignment consists of three parts: in the first and third parts it is required to compute the difference between entropies, and in the second part to write a function which computes the differential entropy.
In the first part we have to compute the entropy of a discrete random variable given its
probability mass function vector, then we have to compute the entropy of an estimated
probability mass function from a set of samples generated through the preceding p.m.f.
vector. In the last step we have to compute the difference between the two entropies.
In the second part we write a function that can compute the differential entropy of a
continuous random variable once its probability density function is given.
In the third part it is required, firstly, to compute the differential entropy of a Gaussian continuous random variable once its probability density function vector is given.
Subsequently, we have to compute the differential entropy of a p.d.f. estimated from a set of samples generated through the preceding p.d.f. vector. Finally, we have to compute the difference between the two differential entropies.
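A minimal sketch of the first part is reported below, under the same setting used later in Figure 2 (outcomes [1, 2, 3, 4], probabilities [0.1, 0.2, 0.3, 0.4] and vector length 1000); the entropy function is the one sketched in Chapter 2 and the random seed is our choice:

import numpy as np

outcomes = np.array([1, 2, 3, 4])
pmf_true = np.array([0.1, 0.2, 0.3, 0.4])
n = 1000                                   # vector length

# generate a sample of length n through the given p.m.f. vector
rng = np.random.default_rng(0)
sample = rng.choice(outcomes, size=n, p=pmf_true)

# estimate the p.m.f. as the relative frequency of each outcome
pmf_est = np.array([np.mean(sample == x) for x in outcomes])

def entropy(pmf, base=2):
    p = np.asarray(pmf, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p)) / np.log(base)

print("entropy difference:", abs(entropy(pmf_true) - entropy(pmf_est)))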
Figure 2: In this figure we can see a representation of the estimated p.m.f. In our case we considered a discrete list of four values [1, 2, 3, 4] as possible outcomes. The vector length chosen was 1000, which is by far sufficient to obtain a good p.m.f. estimate, since the discrete probabilities previously chosen to generate the sample were [0.1, 0.2, 0.3, 0.4].
Since the probability estimation depends on the vector length, we summarize in Table 3.1 how the difference between the two entropies varies as the vector length varies.
In order to better show the trend of the entropy difference, we made a plot over 10000 different values of the vector length with a step of 100. Figure 3 shows that the absolute value of the entropy difference decreases as the vector length grows. The decay seems exponential, but as the vector length increases, the way in which the sample is generated or the presence of round-off errors could affect the entropy difference more strongly.
Figure 3: In this figure we show the trend of the entropy difference with respect to the vector length. The plot was made over 10000 different values of the vector length, ranging from 100 to 1000000 with a step of 100. We took the absolute value of the entropy difference.
The differential entropy of a continuous random variable X with p.d.f. f_X(x) and support set S = (a, b) is defined as

h(X) = -\int_{a}^{b} f_X(x) \log f_X(x) \, dx    (10)

if the integral exists.
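A sketch of such a function for a p.d.f. given as a vector sampled on a uniform grid is reported below (the rectangle-rule integration and the small constant used to avoid log(0) are our choices):

import numpy as np

def differential_entropy(pdf, x, base=np.e, eps=1e-16):
    """Differential entropy of a continuous r.v. whose p.d.f. is sampled on the uniform grid x."""
    f = np.maximum(np.asarray(pdf, dtype=float), eps)   # avoid log(0)
    dx = x[1] - x[0]                                     # uniform grid spacing
    return float(np.sum(-f * np.log(f) / np.log(base)) * dx)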
Given the chosen Gaussian p.d.f. vector, we then have to generate a sample through it and to estimate the p.d.f. from the sample. Once the differential entropies of the true p.d.f. and of the estimated p.d.f. have been computed, we have to compute their difference.
In Figure 4 we can clearly see that the chosen p.d.f. has a Gaussian shape and respects the imposed constraint.
Figure 4: In this figure we can see the chosen p.d.f. The mean and variance chosen are respectively 33 and 5. We can see that this p.d.f. clearly respects the constraint of being the p.d.f. vector of a Gaussian continuous random variable.
In order to estimate the p.d.f. of the generated sample we used a kernel method. Kernel Density Estimation is a non-parametric method which can be considered an extension of the concept of the histogram. The advantages of using the kernel density estimator in place of histograms are that it is smoother and converges to the true density faster as n → ∞. Histogram density estimation in the univariate case is achieved through the rudimentary concept of splitting the range of the random variable into equally spaced intervals and counting the fraction of samples in each interval. These intervals of equal width h are also called bins. In what follows we summarize the theory of kernel density estimation following the lectures of the Information Theory & Data Science course. Let us assume that there exists a continuous random variable X and that a sample set S = {s_1, s_2, ..., s_n} is drawn from X with unknown probability density function f_X(x). Using the rectangular weight function, or Parzen window function:
I(x) = \begin{cases} 1, & \text{if } |x| \le 1/2 \\ 0, & \text{if } |x| > 1/2 \end{cases}
\hat{f}_X(x) = \frac{1}{nh} \sum_{k=1}^{n} I\!\left(\frac{x - s_k}{h}\right)    (11)

where \hat{f}_X(x) = \hat{f}_X(x; s_1, s_2, ..., s_n), h is the bin width and n is the number of samples.
The summation in equation (11) represents the fraction of samples falling into each bin.
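Equation (11) can be implemented directly; a small sketch is given below (the function and variable names are ours):

import numpy as np

def parzen_estimate(x_grid, samples, h):
    """Rectangular-window estimate of equation (11) evaluated on the points of x_grid."""
    s = np.asarray(samples, dtype=float)
    n = len(s)
    # I((x - s_k) / h) equals 1 when |x - s_k| <= h/2, i.e. s_k falls in a window of width h around x
    counts = np.array([np.sum(np.abs(x - s) <= h / 2) for x in x_grid])
    return counts / (n * h)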
Defining the kernel function K(x) as a non-negative function with the properties:

\int_{-\infty}^{+\infty} K(x)\,dx = 1    (12)

\int_{-\infty}^{+\infty} x\,K(x)\,dx = 0    (13)
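In practice we used the KernelDensity estimator of scikit-learn; a sketch of how the estimate of Figure 5 could be obtained and scored against the true Gaussian p.d.f. (mean 33, variance 5) is given below. The sample size, bandwidth value and evaluation grid are our assumptions:

import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
samples = rng.normal(loc=33.0, scale=np.sqrt(5.0), size=1000)   # mean 33, variance 5

x_grid = np.linspace(20.0, 46.0, 500)
pdf_true = np.exp(-(x_grid - 33.0) ** 2 / (2 * 5.0)) / np.sqrt(2 * np.pi * 5.0)

kde = KernelDensity(kernel="gaussian", bandwidth=0.5)
kde.fit(samples.reshape(-1, 1))                                 # scikit-learn expects 2-D input
pdf_est = np.exp(kde.score_samples(x_grid.reshape(-1, 1)))      # score_samples returns the log-density

mse = np.mean((pdf_est - pdf_true) ** 2)
print("MSE:", mse)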
Figure 5: In this figure we can see the estimated p.d.f. compared to the true p.d.f.
Our goal is to find the estimated p.d.f. that best fits the true p.d.f., and as a criterion we consider the mean squared error.
We report the results in the following table:
Figure 7: In this figure we can see the true p.d.f. in black compared to the estimated p.d.f.s generated with all the kernels available in the scikit-learn library, keeping fixed the bandwidth given by equation 15.
kernel          MSE
gaussian        3.863619081920247e-06
tophat          4.81709154169017e-06
epanechnikov    5.436396853980572e-06
exponential     5.993194851731685e-06
linear          6.170371980137539e-06
cosine          5.6070141839657405e-06
In Table 3.3.1 we can note that, as expected, the Gaussian kernel has the smallest mean squared error.
In the following table we report the absolute value of the differences between the differential entropies for each kernel:
kernel          differential entropy difference
gaussian        0.2198219931460743
tophat          0.25631972214292276
epanechnikov    0.26431680694603266
exponential     0.17286048456924874
linear          0.26707485654715235
cosine          0.2651241867353118
It is important to note that, in the case of the tophat, epanechnikov, linear and cosine kernels, the zeros appearing in the estimated probability density functions are smoothed (i.e. replaced with a small value, in our case 10^{-16}). We also note that the exponential kernel is the one with the smallest differential entropy difference, even though the mean squared error of its estimated p.d.f. is not the best one.
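The smoothing step mentioned above can be sketched as follows (the function name is ours; 10^{-16} is the value used in the text):

import numpy as np

def smooth_zeros(pdf_est, eps=1e-16):
    """Replace the zeros produced by finite-support kernels (tophat, epanechnikov, linear, cosine)
    so that the logarithm in the differential entropy is always defined."""
    return np.where(pdf_est > 0, pdf_est, eps)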
The mean squared error represents a more robust metric for kernel selection than the differential entropy difference, and based on this metric we selected two kernel functions: gaussian and tophat.
We will use these two kernel functions in the search for the optimal bandwidth in the next section.
From the MSE calculations we can affirm that Silverman's rule works better than the preceding rule for the Gaussian kernel, but works worse for the tophat kernel. Using the best value of each bandwidth, rounded to the second decimal place, as a starting point, we search for the best bandwidth for the two kernels. The procedure is like a sifting process in which we select the best value and move around it in order to get the best bandwidth. Throughout the sifting process we made sure not to be trapped in local minima by plotting the MSE. For both kernels we stopped at the sixth decimal place and report the results in the following table:
Footnote iv: https://en.wikipedia.org/wiki/Kernel_density_estimation
Figure 8: In this figure we can see the true p.d.f. compared to the estimated p.d.f.s generated with the gaussian and tophat kernels. The bandwidth choice was made using Silverman's rule of thumb.
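A sketch of the bandwidth search described above is given below: it starts from Silverman's rule of thumb and refines the value by minimizing the MSE against the true p.d.f. on a shrinking grid. The refinement schedule and the helper names are our assumptions; the actual sifting procedure used for the assignment may differ:

import numpy as np
from sklearn.neighbors import KernelDensity

def silverman_bandwidth(samples):
    """Silverman's rule of thumb for a univariate sample."""
    s = np.asarray(samples, dtype=float)
    sigma = np.std(s, ddof=1)
    iqr = np.subtract(*np.percentile(s, [75, 25]))
    return 0.9 * min(sigma, iqr / 1.34) * len(s) ** (-1 / 5)

def mse_for_bandwidth(h, samples, x_grid, pdf_true, kernel="gaussian"):
    kde = KernelDensity(kernel=kernel, bandwidth=h).fit(samples.reshape(-1, 1))
    pdf_est = np.exp(kde.score_samples(x_grid.reshape(-1, 1)))
    return np.mean((pdf_est - pdf_true) ** 2)

def refine_bandwidth(samples, x_grid, pdf_true, kernel="gaussian", steps=4):
    """Sifting-like search: scan a shrinking interval around the current best bandwidth."""
    best = silverman_bandwidth(samples)
    width = best / 2
    for _ in range(steps):
        grid = np.linspace(max(best - width, 1e-6), best + width, 21)
        best = min(grid, key=lambda h: mse_for_bandwidth(h, samples, x_grid, pdf_true, kernel))
        width /= 10                      # zoom in around the new best value
    return best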
4 Third Assignment
In this chapter we discuss the third assignment, which consists of three parts.
We use the multivariate Iris dataset, which is well known in the literature. This dataset contains three classes, each referring to a type of iris plant: Setosa, Virginica and Versicolour.
Each class contains 50 instances and each instance has five attributes:
• Sepal length in cm
• Sepal width in cm
• Petal length in cm
• Petal width in cm
• Class:
1. Iris Setosa
2. Iris Versicolour
3. Iris Virginica
In the first part it is required to discretize the features of the Iris dataset as integers and to compute their probability mass functions; in the second part it is required to compute their entropies; and in the third part to compute the mutual information between any pair of features.
4.1 First Part
The technique used to discretize the features of the dataset is to multiply all the values, which are in centimetres, by 10, obtaining integer values.
After the discretization step we computed the probability mass function of each feature by counting the number of occurrences of each distinct value in the feature and dividing by the total number of values in that feature.
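A sketch of this step is reported below (loading the dataset through scikit-learn is our choice; the assignment may load it differently):

import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data                      # 150 x 4 matrix of feature values in cm

# discretization: multiply the centimetre values by 10 to obtain integers
X_int = np.rint(X * 10).astype(int)

# p.m.f. of each feature: occurrences of each value divided by the number of samples
pmfs = []
for j in range(X_int.shape[1]):
    values, counts = np.unique(X_int[:, j], return_counts=True)
    pmfs.append(dict(zip(values, counts / counts.sum())))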
5 Fourth Assignment
The fourth assignment aims to build and compare three different types of Bayes classifiers.
As classifiers, their goal is to return a class label vector for the test dataset, given a training dataset with its class label vector and a test dataset.
These classifiers are all based on Bayes' theorem; what differentiates them are the assumptions they make.
Bayes' theorem says that for two events A and B of the sample space, the following equation holds:

P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}    (17)
where P(A|B) and P(B|A) are conditional probabilities: respectively, the probability of event A occurring given that B is true, also called the posterior probability of A given B, and the probability of event B occurring given that A is true, also interpreted as the likelihood of A given a fixed B. P(A) and P(B) are the marginal probabilities of observing A and B respectively.
In the rest of the text we will refer to the Bayes Classifier as the classifier in which we assume that the features are continuous random variables and we estimate the multivariate probability density function through a multivariate kernel density estimator.
From this point on we will refer to the Naïve Bayes Classifier as the classifier in which we make the assumption that the variables are not only continuous and random, but also independent. In building the Naïve Bayes Classifier we used a univariate kernel density estimator to estimate the probability density function of each feature.
The last classifier is referred to as the Gaussian Naïve Bayes Classifier, in which we make the assumption that the variables are not only continuous, random and independent, but also Gaussian distributed.
In the last step we have to compare the accuracy of the three classifiers on the Iris dataset, split in half between training dataset and test dataset.
P(c_k|x) = \frac{P(x|c_k)\,P(c_k)}{P(x)}    (18)
The Bayes classifier must calculate the posterior probability for each class; the sample is then assigned to the class with the maximum posterior probability. Since P(x) does not depend on k, it can be ignored in the maximization of the posterior probability.
As we saw before, a Bayes classifier needs as inputs the prior probabilities of the classes P(c_k) and the likelihoods P(x|c_k).
We estimated the prior probabilities through the estimation of the p.m.f. of the class feature, and the likelihoods by first grouping the dataset by class and then using a multivariate probability density function estimate for each class.
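A sketch of how such a classifier could be assembled with the multivariate KernelDensity estimator of scikit-learn is given below (the bandwidth value and the function names are our assumptions):

import numpy as np
from sklearn.neighbors import KernelDensity

def fit_bayes_classifier(X_train, y_train, bandwidth=1.0, kernel="gaussian"):
    """One multivariate KDE per class, plus the prior probability of each class."""
    classes = np.unique(y_train)
    priors = {c: np.mean(y_train == c) for c in classes}
    kdes = {c: KernelDensity(kernel=kernel, bandwidth=bandwidth).fit(X_train[y_train == c])
            for c in classes}
    return classes, priors, kdes

def predict_bayes(X_test, classes, priors, kdes):
    # log P(c_k | x) is proportional to log P(x | c_k) + log P(c_k); P(x) does not depend on k
    log_post = np.column_stack([kdes[c].score_samples(X_test) + np.log(priors[c])
                                for c in classes])
    return classes[np.argmax(log_post, axis=1)]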
P(x|c_k) = \prod_{j=1}^{M} P(x_j|c_k)    (19)
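Under the independence assumption of equation (19) the multivariate likelihood factorizes into univariate terms; a sketch of this factorization with one univariate KDE per feature is given below (names are ours):

import numpy as np
from sklearn.neighbors import KernelDensity

def fit_feature_kdes(X_class, bandwidth=1.0, kernel="gaussian"):
    """One univariate KDE per feature, fitted on the training samples of a single class."""
    return [KernelDensity(kernel=kernel, bandwidth=bandwidth).fit(X_class[:, [j]])
            for j in range(X_class.shape[1])]

def naive_log_likelihood(X_test, feature_kdes):
    """log P(x | c_k) = sum_j log P(x_j | c_k), as in equation (19)."""
    return sum(kde.score_samples(X_test[:, [j]]) for j, kde in enumerate(feature_kdes))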
In our specific case the Naïve Bayes Classifier works better in terms of accuracy than the others, even better than the Gaussian Naïve Bayes Classifier, which does not use k.d.e. and has no parameters such as kernel function or bandwidth that can be modified.
We are surprised to see that the Naïve Bayes Classifier performs better than the Bayes Classifier, since it makes more assumptions on the data. We will see in the next section that the accuracy scores are strongly dependent on the bandwidth.
We can see that the accuracy is strongly dependent on the bandwidth, even if the Bayes classifier does not seem to reach the Naïve Bayes accuracy. The bandwidth values reported in Table 5.4.1 are values at which the accuracy has reached a plateau. For instance, for the Bayes Classifier the accuracy does not change in the bandwidth interval 1.395 ± 0.695.
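A sketch of how the dependence of the accuracy on the bandwidth can be explored is given below; it reuses the fit_bayes_classifier and predict_bayes functions sketched above and a 50/50 split of the Iris dataset (the bandwidth grid and random seed are our choices):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.5, random_state=0, stratify=iris.target)

# accuracy of the multivariate-KDE Bayes classifier as a function of the bandwidth
for h in np.arange(0.1, 2.1, 0.1):
    classes, priors, kdes = fit_bayes_classifier(X_train, y_train, bandwidth=h)
    y_pred = predict_bayes(X_test, classes, priors, kdes)
    print(f"h = {h:.1f}  accuracy = {accuracy_score(y_test, y_pred):.3f}")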
6 Conclusion
In this document we have built an information theory library in Python and applied it to the assignments of the course. We verified that the entropy of a binary random variable reaches its maximum of 1 bit when the base-2 logarithm is used, and that the absolute difference between the entropy of a given p.m.f. and the entropy of a p.m.f. estimated from samples decreases as the vector length grows. In the kernel density estimation of a Gaussian p.d.f., the Gaussian kernel gave the smallest mean squared error, and the bandwidth was refined starting from the rule-of-thumb values. Finally, on the Iris dataset the Naïve Bayes Classifier achieved the best accuracy among the three Bayes classifiers considered, with the accuracy strongly dependent on the chosen bandwidth.