Module 14: Information Theory and Entropy
Entropy
Dr. Sayak Roychowdhury
Department of Industrial & Systems Engineering,
IIT Kharagpur
Reference
• Bishop, C. M. (2006). Pattern recognition and machine learning. Springer.
• Tan, P. N., Steinbach, M., & Kumar, V. (2016). Introduction to data mining. Pearson Education India.
What is information?
• CO_ _ ?
• It can be multiple things (COAT, COAL, COST, COOL etc.)
• Now suppose it is given that the third letter is ‘A’
• So your choices are narrowed.
• P is a prime number less than 10? {2,3,5,7}
• How many prime numbers are less than $10^{100}$?
• Will there be congestion on my route to work?
• What will be the closing stock price of company XYZ after the day of
quarterly profit statement? {increase a lot, increase, decrease, crash}
• Information is reduction of uncertainty
What is information?
• Uncertainty can vary based on your knowledge and belief
• All elements in a guess list may not be equally likely
• There may be a notion of accuracy up to which the guesses are indistinguishable, e.g., 𝜋 = 3.14159265359…
You can choose 𝜋 = 3.142 or 3.1416, etc.
Degree of Surprise
[Images: triplets, and snowfall in the desert of Saudi Arabia — examples of surprising, improbable events. Courtesy: freepik.com, accuweather.com]
Information Theory
• The field was originally established by the works of Harry Nyquist and Ralph
Hartley, in the 1920s, and Claude Shannon in the 1940s.
• Consider a discrete random variable x and ask how much information is
received when we observe a specific value for this variable.
• The amount of information can be viewed as the ‘degree of surprise’ on
learning the value of x.
• If a highly improbable event has just occurred, more information is received
than occurrence of some very likely event.
• If it is known that an event was certain to happen, we would receive no information.
• How many bits will it take to store one of the possible answers?
Information Theory
• The measure of information content ℎ(𝑥) will depend on the probability distribution 𝑝(𝑥), and is a monotonic function of the probability 𝑝(𝑥).
• If two events 𝑥 and 𝑦 are unrelated, then the information gained from observing both of them should be the sum of the information gained from each of them separately,
ℎ(𝑥, 𝑦) = ℎ(𝑥) + ℎ(𝑦)
• These two relations hold if
$h(x) = -\log_2 p(x)$
• The negative sign ensures that information is positive or zero.
• Note that low probability events 𝑥 correspond to high information content.
• Here, since the log is taken with base 2, the unit of ℎ(𝑥) is bits.
• The number of bits required to store the uncertainty is given by the log base 2 of the number of possible answers, i.e., the log of the cardinality of the set of possible answers.
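A minimal Python sketch of the self-information $h(x) = -\log_2 p(x)$; the probabilities below are illustrative values only, chosen to show that rarer events carry more surprise:

```python
import math

def self_information(p):
    """Self-information h(x) = -log2 p(x), in bits, of an event with probability p."""
    return -math.log2(p)

# Illustrative probabilities only:
print(self_information(0.5))    # a likely event -> 1.0 bit
print(self_information(0.001))  # a rare event   -> ~9.97 bits (high surprise)
```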
Information Entropy
• Suppose that a sender wishes to transmit the value of a random variable
to a receiver.
• The average amount of information that they transmit in the process is obtained by taking the expectation of ℎ(𝑥) with respect to the distribution 𝑝(𝑥) and is given by
$H[x] = -\sum_x p(x) \log_2 p(x)$
• This is called the entropy of the random variable 𝑥.
• Since $\lim_{p \to 0} p \ln p = 0$, it is taken that $p(x) \ln p(x) = 0$ whenever $p(x) = 0$.
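A minimal Python sketch of this sum, using the convention that $p \log p = 0$ when $p = 0$; the function name and the base-2 choice are illustrative:

```python
import math

def entropy_bits(probs):
    """H[x] = -sum_x p(x) log2 p(x), with p log p taken as 0 when p = 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(entropy_bits([0.5, 0.5]))   # 1.0 bit for a fair coin
print(entropy_bits([1.0, 0.0]))   # 0 bits for a certain outcome
```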
Information Entropy
• Consider a random variable 𝑥 having 8 possible states, equally likely.
• In order to communicate the value of 𝑥 to a receiver, we need to transmit a message of length 3 bits.
• Notice that the entropy of this variable is given by
$H[x] = -8 \times \frac{1}{8} \log_2 \frac{1}{8} = 3$ bits
• Consider a random variable with 8 possible states with probabilities
$\left(\frac{1}{2}, \frac{1}{4}, \frac{1}{8}, \frac{1}{16}, \frac{1}{64}, \frac{1}{64}, \frac{1}{64}, \frac{1}{64}\right)$. What is the entropy?
• The non-uniform distribution has smaller entropy than the uniform
distribution.
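Filling in the sum (the answer is consistent with the 2-bit average code length computed on the next slide):

$H[x] = \frac{1}{2}\log_2 2 + \frac{1}{4}\log_2 4 + \frac{1}{8}\log_2 8 + \frac{1}{16}\log_2 16 + 4 \times \frac{1}{64}\log_2 64 = \frac{1}{2} + \frac{1}{2} + \frac{3}{8} + \frac{1}{4} + \frac{3}{8} = 2$ bits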
Information Theory
• We can take advantage of the nonuniform distribution to shorten
code length
• By using shorter codes for the more probable events, and longer
codes for the less probable events
• It can be done by representing the states {a, b, c, d, e, f, g, h} using,
for instance, the following set of code strings: 0, 10, 110, 1110,
111100, 111101, 111110, 111111.
• The average length of the code is
$\frac{1}{2} \times 1 + \frac{1}{4} \times 2 + \frac{1}{8} \times 3 + \frac{1}{16} \times 4 + 4 \times \frac{1}{64} \times 6 = 2$ bits
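A small Python check, using the probabilities and code strings from this slide, that the entropy and the expected code length agree at 2 bits:

```python
import math

# Probabilities and prefix code strings from the 8-state example above
probs = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
codes = ["0", "10", "110", "1110", "111100", "111101", "111110", "111111"]

entropy = -sum(p * math.log2(p) for p in probs)
avg_len = sum(p * len(c) for p, c in zip(probs, codes))

print(entropy, avg_len)   # both are 2.0 bits
```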
Noiseless Coding Theorem (Shannon 1948)
• The noiseless coding theorem (Shannon, 1948) states that the entropy
is a lower bound on the number of bits needed to transmit the state
of a random variable.
• Natural logarithms, giving units of nats in place of bits, can also be used.
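For reference, an entropy in bits is converted to nats by multiplying by $\ln 2$: $H_{\text{nats}} = (\ln 2)\, H_{\text{bits}}$, so, for example, 3 bits $= 3\ln 2 \approx 2.08$ nats.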
Information Entropy
• For a discrete probability distribution with states $x_i$ of a discrete random variable $X$, where $p(X = x_i) = p_i$, the entropy of $X$ is given by
$H[p] = -\sum_i p(x_i) \ln p(x_i)$
• Distributions that are sharply peaked around a few values will have a relatively low entropy, whereas those that are spread more evenly across many values will have higher entropy.
• The maximum entropy configuration can be found by maximizing H using a
Lagrange multiplier to enforce the normalization constraint
$\tilde{H} = -\sum_i p(x_i) \ln p(x_i) + \lambda \left( \sum_i p(x_i) - 1 \right)$
• Entropy is maximized when all $p(x_i) = \frac{1}{M}$, where $M$ is the number of states
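As a brief sketch of the maximization step (treating each $p(x_i)$ as a free variable):

$\frac{\partial \tilde{H}}{\partial p(x_i)} = -\ln p(x_i) - 1 + \lambda = 0 \;\Rightarrow\; p(x_i) = e^{\lambda - 1}$

which is the same constant for every $i$, so the normalization constraint forces $p(x_i) = \frac{1}{M}$.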
Information Entropy
[Figure: histograms of probabilities over 𝑋, contrasting a more sharply peaked (lower-entropy) distribution with a more evenly spread (higher-entropy) one.]
Information Entropy
• To specify a continuous variable very precisely requires a large number
of bits.
• For a density defined over multiple continuous variables, denoted
collectively by the vector 𝑥, the differential entropy is given by
$H[\mathbf{x}] = -\int p(\mathbf{x}) \ln p(\mathbf{x}) \, d\mathbf{x}$
• To obtain the distribution with maximum differential entropy, we normalize 𝑝(𝑥) and constrain its first and second moments:
$\int_{-\infty}^{\infty} p(x) \, dx = 1, \quad \int_{-\infty}^{\infty} x \, p(x) \, dx = \mu, \quad \int_{-\infty}^{\infty} (x - \mu)^2 \, p(x) \, dx = \sigma^2$
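The same Lagrange-multiplier recipe as in the discrete case can be sketched with one multiplier per constraint:

$\tilde{H} = -\int p(x) \ln p(x)\, dx + \lambda_1 \left( \int p(x)\, dx - 1 \right) + \lambda_2 \left( \int x\, p(x)\, dx - \mu \right) + \lambda_3 \left( \int (x-\mu)^2 p(x)\, dx - \sigma^2 \right)$

Setting the functional derivative to zero gives $p(x) = \exp\{-1 + \lambda_1 + \lambda_2 x + \lambda_3 (x-\mu)^2\}$, and the three constraints then reduce this to the Gaussian shown on the next slide.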
Information Entropy
• Using Lagrange multipliers and maximizing, the distribution with maximum differential entropy is found to be the Gaussian
$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)$
• The differential entropy of the Gaussian is given by
$H[x] = \frac{1}{2} \{ 1 + \ln(2\pi\sigma^2) \}$
• The entropy increases as the distribution becomes broader, i.e., as 𝜎 2
increases.
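A quick numerical check of this formula; it assumes SciPy is available, whose norm(...).entropy() returns the differential entropy in nats:

```python
import numpy as np
from scipy.stats import norm

sigma = 2.0
closed_form = 0.5 * (1 + np.log(2 * np.pi * sigma**2))  # H[x] in nats
scipy_value = norm(loc=0.0, scale=sigma).entropy()      # also in nats

print(closed_form, scipy_value)   # both ~2.112; grows as sigma increases
```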
Information Entropy in Statistical Learning
Tan, P. N., Steinbach, M., & Kumar, V. (2016). Introduction to data mining. Pearson Education India.
Gain
• To determine the goodness of an attribute test condition, we need to
compare the degree of impurity of the parent node (before splitting)
with the weighted degree of impurity of the child nodes (after
splitting).
• The larger their difference, the better the test condition.
• This difference, Δ, is also termed the gain in purity of an attribute test condition:
Δ = 𝐼(𝑝𝑎𝑟𝑒𝑛𝑡) − 𝐼(𝑐ℎ𝑖𝑙𝑑𝑟𝑒𝑛)
Information Gain
• Maximizing the gain at a given node is equivalent to minimizing the
weighted impurity measure of its children since I(parent) is the same
for all candidate attribute test conditions.
• When entropy is used as the impurity measure, the difference in entropy is commonly known as the information gain, $\Delta_{\text{info}}$.
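A minimal Python sketch of $\Delta_{\text{info}}$ for a single binary split; the class counts and helper names below are hypothetical, chosen only for illustration:

```python
import math

def entropy(counts):
    """Entropy (in bits) of the class counts at a node."""
    n = sum(counts)
    return sum(-(c / n) * math.log2(c / n) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    """Delta_info = I(parent) - weighted sum of the children's entropies."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted

# Hypothetical node with 10 records of each class, split into two children
parent = [10, 10]
children = [[8, 2], [2, 8]]
print(information_gain(parent, children))   # ~0.278 bits
```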