Module 14: Information Theory and Entropy
Entropy
Dr. Sayak Roychowdhury
Department of Industrial & Systems Engineering,
IIT Kharagpur
Reference
• Bishop, C. M. (2006). Pattern recognition and machine learning. Springer.
• Tan, P. N., Steinbach, M., & Kumar, V. (2016). Introduction to data mining. Pearson Education India.
What is information?
• CO_ _ ?
• It can be multiple things (COAT, COAL, COST, COOL etc.)
• Now suppose it is given that the third letter is ‘A’
• So your choices are narrowed.
• P is a prime number less than 10? {2,3,5,7}
• How many prime numbers are less than $10^{100}$?
• Will there be congestion on my route to work?
• What will be the closing stock price of company XYZ after the day of
quarterly profit statement? {increase a lot, increase, decrease, crash}
• Information is reduction of uncertainty
What is information?
• Uncertainty can vary based on your knowledge and belief
• All elements in a guess list may not be equally likely
• There may be a notion of accuracy up to which the guesses are indistinguishable, e.g., 𝜋 = 3.14159265359…
You can choose 𝜋 = 3.142 or 3.1416, etc.
Degree of Surprise
[Images: triplets, and snowfall in the desert of Saudi Arabia — examples of surprising, improbable events. Courtesy: freepik.com, accuweather.com]
Information Theory
• The field was originally established by the works of Harry Nyquist and Ralph
Hartley, in the 1920s, and Claude Shannon in the 1940s.
• Consider a discrete random variable x and ask how much information is
received when we observe a specific value for this variable.
• The amount of information can be viewed as the ‘degree of surprise’ on
learning the value of x.
• If a highly improbable event has just occurred, more information is received
than occurrence of some very likely event.
• If it is known that an event was certain to happen, we would receive no information.
• How many bits will it take to store one of the possible answers?
Information Theory
• The measure of information content ℎ(𝑥) will depend on the probability distribution 𝑝(𝑥), and is a monotonic function of the probability 𝑝(𝑥).
• If two events 𝑥 and 𝑦 are unrelated, then the information gained from observing both of them should be the sum of the information gained from each of them separately,
ℎ(𝑥, 𝑦) = ℎ(𝑥) + ℎ(𝑦)
• These two relations hold if
$h(x) = -\log_2 p(x)$
• The negative sign ensures that information is positive or zero.
• Note that low probability events 𝑥 correspond to high information content.
• Here, since the log is taken with base 2, the unit of ℎ(𝑥) is bits.
• The number of bits required to store the uncertainty is given by the log base 2 of the number of possible answers, i.e., the log of the cardinality of the set of possible answers.
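A minimal Python sketch of the self-information $h(x) = -\log_2 p(x)$; the probabilities below are illustrative values only, chosen to show that rarer events carry more surprise:

```python
import math

def self_information(p):
    """Self-information h(x) = -log2 p(x), in bits, of an event with probability p."""
    return -math.log2(p)

# Illustrative probabilities only:
print(self_information(0.5))    # a likely event -> 1.0 bit
print(self_information(0.001))  # a rare event   -> ~9.97 bits (high surprise)
```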
Information Entropy
• Suppose that a sender wishes to transmit the value of a random variable
to a receiver.
• The average amount of information that they transmit in the process is obtained by taking the expectation of ℎ(𝑥) with respect to the distribution 𝑝(𝑥) and is given by
$H[x] = -\sum_x p(x) \log_2 p(x)$
• This is called the entropy of the random variable 𝑥.
• Since $\lim_{p \to 0} p \ln p = 0$, it is taken that $p(x) \ln p(x) = 0$ whenever $p(x) = 0$.
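A minimal Python sketch of this sum, using the convention that $p \log p = 0$ when $p = 0$; the function name and the base-2 choice are illustrative:

```python
import math

def entropy_bits(probs):
    """H[x] = -sum_x p(x) log2 p(x), with p log p taken as 0 when p = 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(entropy_bits([0.5, 0.5]))   # 1.0 bit for a fair coin
print(entropy_bits([1.0, 0.0]))   # 0 bits for a certain outcome
```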
Information Entropy
• Consider a random variable 𝑥 having 8 possible states, equally likely.
• In order to communicate the value of 𝑥 to a receiver, we need to transmit a message of length 3 bits.
• Notice that the entropy of this variable is given by
$H[x] = -8 \times \frac{1}{8} \log_2 \frac{1}{8} = 3$ bits
• Consider a random variable with 8 possible states with probabilities
$\left(\frac{1}{2}, \frac{1}{4}, \frac{1}{8}, \frac{1}{16}, \frac{1}{64}, \frac{1}{64}, \frac{1}{64}, \frac{1}{64}\right)$. What is the entropy?
• The non-uniform distribution has smaller entropy than the uniform
distribution.
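Filling in the sum (the answer is consistent with the 2-bit average code length computed on the next slide):

$H[x] = \frac{1}{2}\log_2 2 + \frac{1}{4}\log_2 4 + \frac{1}{8}\log_2 8 + \frac{1}{16}\log_2 16 + 4 \times \frac{1}{64}\log_2 64 = \frac{1}{2} + \frac{1}{2} + \frac{3}{8} + \frac{1}{4} + \frac{3}{8} = 2$ bits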
Information Theory
• We can take advantage of the nonuniform distribution to shorten
code length
• By using shorter codes for the more probable events, and longer
codes for the less probable events
• It can be done by representing the states {a, b, c, d, e, f, g, h} using,
for instance, the following set of code strings: 0, 10, 110, 1110,
111100, 111101, 111110, 111111.
• The average length of the code is
$\frac{1}{2} \times 1 + \frac{1}{4} \times 2 + \frac{1}{8} \times 3 + \frac{1}{16} \times 4 + 4 \times \frac{1}{64} \times 6 = 2$ bits
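A small Python check, using the probabilities and code strings from this slide, that the entropy and the expected code length agree at 2 bits:

```python
import math

# Probabilities and prefix code strings from the 8-state example above
probs = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
codes = ["0", "10", "110", "1110", "111100", "111101", "111110", "111111"]

entropy = -sum(p * math.log2(p) for p in probs)
avg_len = sum(p * len(c) for p, c in zip(probs, codes))

print(entropy, avg_len)   # both are 2.0 bits
```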
Noiseless Coding Theorem (Shannon 1948)
• The noiseless coding theorem (Shannon, 1948) states that the entropy
is a lower bound on the number of bits needed to transmit the state
of a random variable.
• Natural logarithms, giving units of nats in place of bits, can also be used.
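For reference, an entropy in bits is converted to nats by multiplying by $\ln 2$: $H_{\text{nats}} = (\ln 2)\, H_{\text{bits}}$, so, for example, 3 bits $= 3\ln 2 \approx 2.08$ nats.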
Information Entropy
• For a discrete probability distribution with states $x_i$ of a discrete random variable $X$, where $p(X = x_i) = p_i$, the entropy of $X$ is given by
$H[p] = -\sum_i p(x_i) \ln p(x_i)$
• Distributions that are sharply peaked around a few values will have a relatively low entropy, whereas those that are spread more evenly across many values will have higher entropy.
• The maximum entropy configuration can be found by maximizing H using a
Lagrange multiplier to enforce the normalization constraint
$\tilde{H} = -\sum_i p(x_i) \ln p(x_i) + \lambda \left( \sum_i p(x_i) - 1 \right)$
• Entropy is maximized when all $p(x_i) = \frac{1}{M}$, where $M$ is the number of states
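As a brief sketch of the maximization step (treating each $p(x_i)$ as a free variable):

$\frac{\partial \tilde{H}}{\partial p(x_i)} = -\ln p(x_i) - 1 + \lambda = 0 \;\Rightarrow\; p(x_i) = e^{\lambda - 1}$

which is the same constant for every $i$, so the normalization constraint forces $p(x_i) = \frac{1}{M}$.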
Information Entropy
[Figure: histograms of probabilities over 𝑋, contrasting a more sharply peaked (lower-entropy) distribution with a more evenly spread (higher-entropy) one.]
Information Entropy
• To specify a continuous variable very precisely requires a large number
of bits.
• For a density defined over multiple continuous variables, denoted
collectively by the vector 𝑥, the differential entropy is given by
$H[\mathbf{x}] = -\int p(\mathbf{x}) \ln p(\mathbf{x}) \, d\mathbf{x}$
• To obtain the distribution with maximum differential entropy, we normalize 𝑝(𝑥) and constrain its first and second moments:
$\int_{-\infty}^{\infty} p(x) \, dx = 1, \quad \int_{-\infty}^{\infty} x \, p(x) \, dx = \mu, \quad \int_{-\infty}^{\infty} (x - \mu)^2 \, p(x) \, dx = \sigma^2$
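The same Lagrange-multiplier recipe as in the discrete case can be sketched with one multiplier per constraint:

$\tilde{H} = -\int p(x) \ln p(x)\, dx + \lambda_1 \left( \int p(x)\, dx - 1 \right) + \lambda_2 \left( \int x\, p(x)\, dx - \mu \right) + \lambda_3 \left( \int (x-\mu)^2 p(x)\, dx - \sigma^2 \right)$

Setting the functional derivative to zero gives $p(x) = \exp\{-1 + \lambda_1 + \lambda_2 x + \lambda_3 (x-\mu)^2\}$, and the three constraints then reduce this to the Gaussian shown on the next slide.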
Information Entropy
• Using Lagrange multipliers and maximizing, the distribution with maximum differential entropy is found to be the Gaussian
$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)$
• The differential entropy of the Gaussian is given by
$H[x] = \frac{1}{2} \{ 1 + \ln(2\pi\sigma^2) \}$
• The entropy increases as the distribution becomes broader, i.e., as 𝜎 2
increases.
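A quick numerical check of this formula; it assumes SciPy is available, whose norm(...).entropy() returns the differential entropy in nats:

```python
import numpy as np
from scipy.stats import norm

sigma = 2.0
closed_form = 0.5 * (1 + np.log(2 * np.pi * sigma**2))  # H[x] in nats
scipy_value = norm(loc=0.0, scale=sigma).entropy()      # also in nats

print(closed_form, scipy_value)   # both ~2.112; grows as sigma increases
```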
Information Entropy in Statistical Learning
Tan, P. N., Steinbach, M., & Kumar, V. (2016). Introduction to data mining. Pearson Education India.
Gain
• To determine the goodness of an attribute test condition, we need to
compare the degree of impurity of the parent node (before splitting)
with the weighted degree of impurity of the child nodes (after
splitting).
• The larger their difference, the better the test condition.
• This difference, Δ, is also termed the gain in purity of an attribute test condition:
Δ = 𝐼(𝑝𝑎𝑟𝑒𝑛𝑡) − 𝐼(𝑐ℎ𝑖𝑙𝑑𝑟𝑒𝑛)
Information Gain
• Maximizing the gain at a given node is equivalent to minimizing the
weighted impurity measure of its children since I(parent) is the same
for all candidate attribute test conditions.
• When entropy is used as the impurity measure, the difference in entropy is commonly known as the information gain, $\Delta_{\text{info}}$.
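A minimal Python sketch of $\Delta_{\text{info}}$ for a single binary split; the class counts and helper names below are hypothetical, chosen only for illustration:

```python
import math

def entropy(counts):
    """Entropy (in bits) of the class counts at a node."""
    n = sum(counts)
    return sum(-(c / n) * math.log2(c / n) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    """Delta_info = I(parent) - weighted sum of the children's entropies."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted

# Hypothetical node with 10 records of each class, split into two children
parent = [10, 10]
children = [[8, 2], [2, 8]]
print(information_gain(parent, children))   # ~0.278 bits
```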