
Information Theory and Entropy
Dr. Sayak Roychowdhury
Department of Industrial & Systems Engineering,
IIT Kharagpur
Reference
• Bishop, C. M. (2006). Pattern recognition and machine learning. Springer.
• Tan, P. N., Steinbach, M., & Kumar, V. (2016). Introduction to data mining. Pearson Education India.
What is information?
• CO_ _ ?
• It can be multiple things (COAT, COAL, COST, COOL etc.)
• Now suppose it is given that the third letter is ‘A’
• So your choices are narrowed.
• Suppose 𝑃 is a prime number less than 10: 𝑃 ∈ {2, 3, 5, 7}
• How many prime numbers are less than 10^100?
• Will there be congestion on my route to work?

• What will be the closing stock price of company XYZ after the day of
quarterly profit statement? {increase a lot, increase, decrease, crash}
• Information is reduction of uncertainty
What is information?
• Uncertainty can vary based on your knowledge and belief
• All elements in a guess list may not be equally likely
• There may be a notion of accuracy up to which
the guesses are indistinguishable, e.g. 𝜋 = 3.14159265359…
You can choose 𝜋 ≈ 3.142 or 3.1416, etc.
Degree of Surprise

[Images: triplets; snowfall in the desert of Saudi Arabia. Courtesy: freepik.com, accuweather.com]
Information Theory
• The field was originally established by the works of Harry Nyquist and Ralph Hartley in the 1920s, and Claude Shannon in the 1940s.
• Consider a discrete random variable x and ask how much information is
received when we observe a specific value for this variable.
• The amount of information can be viewed as the ‘degree of surprise’ on
learning the value of x.
• If a highly improbable event has just occurred, more information is received than if some very likely event had occurred.
• If we knew that an event was certain to happen, we would receive no information.
• How many bits will it take to encode one of the possible answers?
Information Theory
• The measure of information content ℎ(𝑥) will depend on the probability distribution 𝑝(𝑥), and is a monotonic function of the probability 𝑝(𝑥).
• If two events 𝑥 and 𝑦 are unrelated, then the information gained from observing both of them should be the sum of the information gained from each of them separately:
ℎ(𝑥, 𝑦) = ℎ(𝑥) + ℎ(𝑦)
• These two requirements are satisfied by
ℎ(𝑥) = −log₂ 𝑝(𝑥)
• The negative sign ensures that information is positive or zero.
• Note that low-probability events 𝑥 correspond to high information content.
• Since the log is taken with base 2, the unit of ℎ(𝑥) is bits.
• For equally likely outcomes, the number of bits required to store the answer is the log base 2 of the number of possible answers, i.e. the log of the cardinality of the answer set.
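As a quick illustration (a minimal Python sketch, not part of the original slides), the self-information of an event and its additivity for independent events can be computed directly:

```python
from math import log2

def self_information(p):
    """Information content in bits: h(x) = -log2 p(x)."""
    return -log2(p)

print(self_information(0.5))       # 1.0 bit  (a fair coin flip)
print(self_information(1 / 64))    # 6.0 bits (a much more surprising event)

# Additivity for independent events: h(x, y) = h(x) + h(y)
print(self_information(0.5 * 0.25))                     # 3.0
print(self_information(0.5) + self_information(0.25))   # 3.0
```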
Information Entropy
• Suppose that a sender wishes to transmit the value of a random variable
to a receiver.
• The average amount of information that they transmit in the process is obtained by taking the expectation of ℎ(𝑥) with respect to the distribution 𝑝(𝑥), and is given by
𝐻[𝑥] = −Σₓ 𝑝(𝑥) log₂ 𝑝(𝑥)
• This is called the entropy of the random variable 𝑥.
• Since lim_{𝑝→0} 𝑝 ln 𝑝 = 0, we take 𝑝(𝑥) ln 𝑝(𝑥) = 0 whenever 𝑝(𝑥) = 0.
Information Entropy
• Consider a random variable 𝑥 having 8 possible states, all equally likely.
• In order to communicate the value of 𝑥 to a receiver, we need to transmit a message of length 3 bits.
• Notice that the entropy of this variable is given by
𝐻[𝑥] = −8 × (1/8) log₂(1/8) = 3 bits
• Now consider a random variable with 8 possible states having probabilities
(1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64). What is the entropy?
• The non-uniform distribution has smaller entropy than the uniform distribution.
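A small Python sketch (not from the slides) that evaluates both entropies; the non-uniform distribution above works out to 2 bits:

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits: H = -sum p log2 p, with 0 log 0 treated as 0."""
    return -sum(p * log2(p) for p in probs if p > 0)

uniform = [1 / 8] * 8
nonuniform = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]

print(entropy(uniform))     # 3.0 bits
print(entropy(nonuniform))  # 2.0 bits -- smaller than the uniform case
```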
Information Theory
• We can take advantage of the nonuniform distribution to shorten
code length
• By using shorter codes for the more probable events, and longer
codes for the less probable events
• It can be done by representing the states {a, b, c, d, e, f, g, h} using,
for instance, the following set of code strings: 0, 10, 110, 1110,
111100, 111101, 111110, 111111.
• The average length of the code is
(1/2)×1 + (1/4)×2 + (1/8)×3 + (1/16)×4 + 4×(1/64)×6 = 2 bits
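The same average length can be checked with a short Python sketch (the code strings are the ones listed on the slide):

```python
# Average code length for the prefix code given on the slide.
probs = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
codes = ["0", "10", "110", "1110", "111100", "111101", "111110", "111111"]

avg_length = sum(p * len(c) for p, c in zip(probs, codes))
print(avg_length)  # 2.0 bits, equal to the entropy of this distribution
```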
Noiseless Coding Theorem (Shannon 1948)
• The noiseless coding theorem (Shannon, 1948) states that the entropy
is a lower bound on the number of bits needed to transmit the state
of a random variable.
• Natural logarithms can also be used in place of base-2 logarithms, in which case the entropy is measured in nats instead of bits.
Information Entropy
• For a discrete random variable 𝑋 with states 𝑥ᵢ, where 𝑝(𝑋 = 𝑥ᵢ) = 𝑝ᵢ, the entropy of 𝑋 is given by
𝐻[𝑝] = −Σᵢ 𝑝(𝑥ᵢ) ln 𝑝(𝑥ᵢ)
• Distributions that are sharply peaked around a few values will have a relatively low entropy,
• whereas those that are spread more evenly across many values will have higher entropy.
• The maximum-entropy configuration can be found by maximizing 𝐻 using a Lagrange multiplier to enforce the normalization constraint:
H̃ = −Σᵢ 𝑝(𝑥ᵢ) ln 𝑝(𝑥ᵢ) + 𝜆(Σᵢ 𝑝(𝑥ᵢ) − 1)
• Entropy is maximized when all 𝑝(𝑥ᵢ) = 1/𝑀, where 𝑀 is the number of states.
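A quick numerical sanity check (a Python sketch, assuming nothing beyond the formula above): the uniform distribution over 𝑀 states attains entropy ln 𝑀, and randomly drawn distributions over the same number of states never exceed it.

```python
import math
import random

def entropy_nats(probs):
    """Entropy in nats: H = -sum p ln p, with 0 ln 0 treated as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

M = 8
print(entropy_nats([1 / M] * M), math.log(M))  # both ln(8) ≈ 2.079

# Randomly drawn distributions over M states stay at or below ln(M).
for _ in range(5):
    weights = [random.random() for _ in range(M)]
    p = [w / sum(weights) for w in weights]
    assert entropy_nats(p) <= math.log(M) + 1e-12
```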
Information Entropy
[Figure: probability histograms of two distributions over 𝑋, illustrating lower and higher entropy.]
Information Entropy
• To specify a continuous variable very precisely requires a large number of bits.
• For a density defined over multiple continuous variables, denoted collectively by the vector 𝑥, the differential entropy is given by
𝐻[𝑥] = −∫ 𝑝(𝑥) ln 𝑝(𝑥) d𝑥
• To find the maximum of the differential entropy, the density is normalized and its first and second moments are constrained:
∫₋∞^∞ 𝑝(𝑥) d𝑥 = 1,  ∫₋∞^∞ 𝑥 𝑝(𝑥) d𝑥 = 𝜇,  ∫₋∞^∞ (𝑥 − 𝜇)² 𝑝(𝑥) d𝑥 = 𝜎²
Information Entropy
• Using Lagrange multipliers and maximizing, the density that maximizes the differential entropy is found to be the Gaussian:
𝑝(𝑥) = (1/√(2𝜋𝜎²)) exp(−(𝑥 − 𝜇)²/(2𝜎²))
• The differential entropy of the Gaussian is given by
𝐻[𝑥] = ½ {1 + ln(2𝜋𝜎²)}
• The entropy increases as the distribution becomes broader, i.e., as 𝜎 2
increases.
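A Python sketch (not part of the slides; 𝜇 = 0 and 𝜎 = 2 are arbitrary choices) comparing the closed-form Gaussian differential entropy with a Monte Carlo estimate of −E[ln 𝑝(𝑥)]:

```python
import math
import random

def gaussian_diff_entropy(sigma2):
    """Closed form: H[x] = 0.5 * (1 + ln(2*pi*sigma^2)), in nats."""
    return 0.5 * (1 + math.log(2 * math.pi * sigma2))

mu, sigma = 0.0, 2.0

def log_pdf(x):
    """ln p(x) for the Gaussian N(mu, sigma^2)."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

# Monte Carlo estimate of H[x] = -E[ln p(x)] from Gaussian samples.
samples = [random.gauss(mu, sigma) for _ in range(100_000)]
mc_estimate = -sum(log_pdf(x) for x in samples) / len(samples)

print(gaussian_diff_entropy(sigma**2), mc_estimate)  # both ≈ 2.11 nats
```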
Information Entropy in Statistical Learning

[Table: sample data of loan borrowers.]
Tan, P. N., Steinbach, M., & Kumar, V. (2016). Introduction to data mining. Pearson Education India.
Information Entropy in Statistical Learning
• A decision tree is to be constructed from the dataset.
• How do we select the attribute to test at each node?
Node Purity
• There are many measures that can be used to determine the
goodness of an attribute test condition.
• These measures give preference to attribute test conditions that partition the training instances into purer subsets in the child nodes, i.e. subsets in which most instances have the same class label.
• Having purer nodes is useful since a node that has all of its training
instances from the same class does not need to be expanded further.
Node Impurity Measures
• Entropy = −Σᵢ₌₀^(𝑐−1) 𝑝(𝑥ᵢ) log₂ 𝑝(𝑥ᵢ)
• Gini index = 1 − Σᵢ₌₀^(𝑐−1) 𝑝(𝑥ᵢ)²
• Classification error = 1 − maxᵢ 𝑝(𝑥ᵢ)
• where 𝑝(𝑥ᵢ) is the relative frequency of training instances that belong to class 𝑖 at node 𝑡, 𝑐 is the total number of classes, and 0 log₂ 0 = 0 in entropy calculations.
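For illustration, a minimal Python sketch of the three measures for a single node (the class counts are hypothetical, not taken from the Tan et al. dataset):

```python
from collections import Counter
from math import log2

def impurity_measures(labels):
    """Entropy, Gini index, and classification error for one node."""
    n = len(labels)
    freqs = [count / n for count in Counter(labels).values()]
    entropy = -sum(p * log2(p) for p in freqs if p > 0)
    gini = 1 - sum(p**2 for p in freqs)
    class_error = 1 - max(freqs)
    return entropy, gini, class_error

# A node with 3 defaulters and 7 non-defaulters (hypothetical counts).
print(impurity_measures(["yes"] * 3 + ["no"] * 7))
# ≈ (0.881, 0.42, 0.3)
```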
Collective Impurity
• Consider an attribute test condition that splits a node containing 𝑁 training instances into 𝑘 children, {𝑣₁, 𝑣₂, …, 𝑣ₖ}.
• Each child node represents a partition of the data resulting from one of the 𝑘 outcomes of the attribute test condition.
• Let 𝑁(𝑣ⱼ) be the number of training instances associated with child node 𝑣ⱼ, whose impurity value is 𝐼(𝑣ⱼ).
• A training instance in the parent node reaches node 𝑣ⱼ a fraction 𝑁(𝑣ⱼ)/𝑁 of the time.
• The collective impurity of the child nodes can therefore be computed as a weighted sum of the impurities of the child nodes:
𝐼(children) = Σⱼ₌₁^𝑘 (𝑁(𝑣ⱼ)/𝑁) 𝐼(𝑣ⱼ)
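A small Python sketch of this weighted sum, using entropy as the impurity measure (the two child nodes and their labels are hypothetical):

```python
from collections import Counter
from math import log2

def node_entropy(labels):
    """Entropy of a single node from its class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def collective_impurity(children, impurity=node_entropy):
    """Weighted impurity of child nodes: sum_j N(v_j)/N * I(v_j)."""
    total = sum(len(child) for child in children)
    return sum(len(child) / total * impurity(child) for child in children)

# Hypothetical split of 10 training instances into two child nodes.
children = [["yes", "yes", "no"], ["yes"] + ["no"] * 6]
print(collective_impurity(children))  # ≈ 0.69
```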
Weighted Entropy

Tan, P. N., Steinbach, M., & Kumar, V. (2016). Introduction to data mining. Pearson Education India.
Gain
• To determine the goodness of an attribute test condition, we need to
compare the degree of impurity of the parent node (before splitting)
with the weighted degree of impurity of the child nodes (after
splitting).
• The larger their difference, the better the test condition.
• This difference, Δ, is also termed the gain in purity of an attribute test condition:
Δ = 𝐼(parent) − 𝐼(children)
Information Gain
• Maximizing the gain at a given node is equivalent to minimizing the
weighted impurity measure of its children since I(parent) is the same
for all candidate attribute test conditions.
• When entropy is used as the impurity measure, the difference in entropy is commonly known as the information gain, Δ_info.
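A self-contained Python sketch of Δ_info for a candidate split (the parent labels and the binary split are hypothetical, not taken from the Tan et al. example):

```python
from collections import Counter
from math import log2

def node_entropy(labels):
    """Entropy of a node from its class labels (0 log 0 treated as 0)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Delta_info = I(parent) - weighted sum of the child entropies."""
    n = len(parent)
    weighted = sum(len(child) / n * node_entropy(child) for child in children)
    return node_entropy(parent) - weighted

# Hypothetical binary split of the parent's 10 class labels.
parent = ["yes"] * 3 + ["no"] * 7
children = [["yes", "yes", "no"], ["yes"] + ["no"] * 6]
print(information_gain(parent, children))  # ≈ 0.19 bits
```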
