
EE514a – Information Theory I

Fall Quarter 2023

Prof. Jeff Bilmes

University of Washington, Seattle


Department of Electrical & Computer Engineering
Fall Quarter, 2023
https://canvas.uw.edu/courses/1663370/

Lecture 1 - Sep 27th, 2023

Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F1/67 (pg.1/114)
Logistics Review

Information Theory I and II


This two-quarter course will be a thorough introduction to
information theory.
Information Theory I: entropy, mutual information, asymptotic
equipartition properties, data compression to the entropy limit
(source coding theorem), Huffman, Lempel-Ziv, convolutional codes,
communication at the channel capacity limit (channel coding
theorem), method of types, differential entropy, maximum entropy.
Information Theory II (EE515, Spring 2022 most probably) : ECC,
turbo, LDPC and other codes, Kolmogorov complexity, spectral
estimation, rate-distortion theory, alternating minimization for
computation of RD curve and channel capacity, more on the
Gaussian channel, network information theory, information geometry,
and some recent results on use of polymatroids in information theory.
Additional topics will include applications to machine learning,
artificial intelligence, natural language processing, computer science
and complexity, biological science, and communications.
Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F2/67 (pg.2/114)
Logistics Review

Course Web Pages

See our canvas page


(https://canvas.uw.edu/courses/1663370)
And our web page (https://canvas.uw.edu/courses/1663370/)
(not yet set up)
our assignment dropbox
(https://canvas.uw.edu/courses/1663370/assignments)
which is where all homework will be due. It will be due exclusively
electronically, in PDF format. No paper homework accepted.
our discussion board
(https://canvas.uw.edu/courses/1663370/discussion_topics) is
where you can ask questions. Please use this rather than email so
that all can benefit from answers to your questions.

Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F3/67 (pg.3/114)
Logistics Review

Religious Accommodations

Washington state law requires that UW develop a policy for


accommodation of student absences or significant hardship due to
reasons of faith or conscience, or for organized religious activities. The
UW’s policy, including more information about how to request an
accommodation, is available at Religious Accommodations Policy
(https://registrar.washington.edu/staffandfaculty/religious-accommodations-policy/).
Accommodations must be requested within the first two weeks of this course
using the Religious Accommodations Request form
(https://registrar.washington.edu/students/religious-accommodations-request/).

Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F4/67 (pg.4/114)
Logistics Review

Prerequisites

Basic probability and statistics & some convex analysis


Random processes (e.g., EE505 or a Stat 5xx class).
Knowledge of python (numpy and scipy)
The course is open to students in all UW departments.

Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F5/67 (pg.5/114)
Logistics Review

Homework

There will be a new problem set assignment every 1 to 2 weeks


(about 6-7 problem sets for the quarter).
You will have approximately 1 to 1.5 weeks to solve the problem set.
Problem sets might also include python exercises, so you will need
to have access to python (anaconda is recommended).
The problem sets that are longer will take longer to do, so please do
not wait until the night before they are due to start them.

Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F6/67 (pg.6/114)
Logistics Review

Exams

We will have an in-class midterm and an in-class final.


Midterm exam date: Monday, Nov 6th, 2023, in class.
Final exam date/time: Monday, December 11, 2023, 8:30am.

Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F7/67 (pg.7/114)
Logistics Review

Grading

Grades will be based on a combination of the final (33.3%) and


midterm (33.3%) exam, and on homework (33.3%).

Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F8/67 (pg.8/114)
Logistics Review

Our Main Text

“Elements of Information Theory” by Thomas Cover and Joy


Thomas, 2nd edition (2006).

It should be available at the UW bookstore, or you can get it via


any online bookstore.
Reading assignment: Read Chapters 1 and 2.

Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F9/67 (pg.9/114)
Logistics Review

Other Relevant Texts


“A First Course in Information Theory”, by Raymond W. Yeung,
2002. Very good chapters on information measures and network
information theory.
“Elements of information theory”, 1991, Thomas M. Cover, Joy A.
Thomas, Q360.C68 (First edition of our book).
“Information theory and reliable communication”, 1968, Robert G.
Gallager. Q360.G3 (classic text by foundational researcher).
“Information theory and statistics”, 1968, Solomon Kullback, Math
QA276.K8
“Information theory : coding theorems for discrete memoryless
systems”, 1981, Imre Csiszár and János Körner. Q360.C75 (another
key book, but a little harder to read).
“Information Theory, Inference, and Learning Algorithms”, David
J.C. MacKay 2003

Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F10/67 (pg.10/114)
Logistics Review

Still Other Relevant Texts

“The Theory of Information and Coding: 2nd Edition” Robert


McEliece, Cambridge, April 2002
“An Introduction to Information Theory: Symbols, Signals, and
Noise”, John R. Pierce, Dover 1980
“Information Theory”, Robert Ash, Dover 1965
“An Introduction to Information Theory”, Fazlollah M. Reza, Dover
1991
“Mathematical Foundations of Information Theory”, A. Khinchin,
Dover 1957. (best brief summary of key results).

Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F11/67 (pg.11/114)
Logistics Review

Relevant Background Mathematical Texts

“Convex Optimization”, Boyd and Vandenberghe


“Probability & Measure” Billingsley,
“Probability with Martingales”, Williams
“Probability, Theory & Examples”, Durrett
“Probability and Random Processes”, Grimmett and Stirzaker (great
book on stochastic processes).

Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F12/67 (pg.12/114)
Logistics Review

On Our Lecture Slides

Slides will (mostly) be available by the early morning before lecture.


Slides will be posted to our canvas page
(https://canvas.uw.edu/courses/1663370).

Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F13/67 (pg.13/114)
Logistics Review

Goal Lecture Road Map/Syllabus - IT-I


L1 (W9/27): Overview, Communications, Information, Entropy
L2 (M10/2): Entropy, Mutual Information, KL-Divergence
L3 (W10/4): More KL, Jensen, more Venn, Log Sum, Data Proc. Inequality
L4 (M10/9): Thermo, Stats, Fano
L5 (W10/11): WLLN & Modes of Conv., AEP, Source Coding
L6 (M10/16): AEP, Source Coding, Types
L7 (W10/18): Types, Converse, Entropy rates
L8 (M10/23): Entropy rates, HMMs, Coding
L9 (W10/25): Coding, Kraft
L10 (M10/30): Kraft, Huffman
L11 (W11/1): Huffman, Shannon/Fano/Elias, Games
L– (M11/6): In-class midterm exam
L12 (W11/8): Arith. Coding, Background on Channel Capacity
L13 (M11/13): Channel Capacity, DMC
L14 (W11/15): Ex. DMC, Properties, Joint AEP, Shannon's 2nd Theorem
L15 (M11/20): Joint AEP, Shannon's 2nd Theorem
L16 (W11/22): Zero Error Codes, 2nd Thm Conv, Zero Error, R = C, Feedback, Joint Thm, Coding
L17 (M11/27): Coding, Hamming Codes, Differential Entropy
L18 (W11/29): Diff. Entropy, Diff vs. Discrete, Gaussian case
L19 (M12/4): Max Entropy, Gaussian Channel, GC Capacity
L20 (W12/6): Review, Buffer, Preview IT-II
L– (M12/11): Final exam, Monday, 8:30am
Finals Week: December 9th-15th.
Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F14/67 (pg.14/114)
Logistics Review

Review

We currently know nothing. Hence, nothing yet to review.

Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F16/67 (pg.16/114)
Logistics Review

Cumulative Outstanding Reading

Read chapters 1 and 2 in our book (Cover & Thomas, “Information


Theory”).

Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F17/67 (pg.17/114)
Logistics Review

Homework

No homework yet, but will be posted soon, watch for announcements


on our canvas page (https://canvas.uw.edu/courses/1663370)
(you can set it up to receive email whenever anything is posted).

Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F18/67 (pg.18/114)
Information Theory Information Entropy

Inspirational Quote
The moral life of man forms part of the subject-matter of the
artist, but the morality of art consists in the perfect use of an
imperfect medium. – Oscar Wilde

Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F19/67 (pg.19/114)
Information Theory Information Entropy

Information Theory and Coding Theory

Information Theory is concerned with the theoretical limitations of


and potential for systems that communicate. E.g., “What is the
best compression or communications rate we can achieve”? Can we
communicate perfectly through an imperfect medium?
What is information? Beyond the philosophical questions that this
raises, how can we mathematically quantify information in a way
that is useful?
Coding Theory, (e.g., ECC) is concerned with the creation of
practical encoding and decoding algorithms that can be used for
communication over real-world noisy channels.

Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F20/67 (pg.20/114)
Information Theory Information Entropy

Information Theory (IT)

In this course, we cover IT and its application not only to communication
theory but to other fields as well. IT involves or is related to many fields:
Communications theory
Cryptography
Computer science
Physics (statistical mechanics)
Mathematics (in particular, probability and statistics)
Philosophy of science
Linguistics and natural language processing
Speech recognition
Pattern recognition, machine learning, and artificial intelligence
Economics
Biology, genetics, neurobiology, neuronal synergy
Psychology
And many more . . .
Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F21/67 (pg.21/114)
Information Theory Information Entropy

Claude Shannon, 1916 — 2001

Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F22/67 (pg.22/114)
Information Theory Information Entropy

Communications Theory

In 1948, Claude E. Shannon of Bell Labs published the paper “A
Mathematical Theory of Communication” and single-handedly
created this field. (The paper can be found on the web.)
Shannon’s work grew out of solving problems of electrical
communication during WWII, but IT applies to many other fields as
well. Would IT exist if WWII didn’t happen?
Many of the results of this course were published back in that
original paper. But the field has become very large, with influence
on many other fields (e.g., IEEE trans. information theory, 6
times/year).
Key idea: Communication is perfectly sending information from one
place and/or time to another place and/or time over a medium that
might cause errors. Make “perfect use of an imperfect medium.”

Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F23/67 (pg.23/114)
Information Theory Information Entropy

Communication Theory

General model of communication:


noise

source encoder channel decoder receiver

Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F24/67 (pg.24/114)
Information Theory Information Entropy

Source Information Possibilities


noise

source encoder channel decoder receiver

Voice
Words
Pictures
Music, art
Galileo space probe orbiting Jupiter
Human cells about to reproduce
Human parents about to reproduce
Sensory input of biological organism
Or any signal at all (any binary data).

Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F25/67 (pg.25/114)
Information Theory Information Entropy

Channel Possibilities

noise

source encoder channel decoder receiver

Telephone line
High frequency radio link
Space communication link
Storage (disk, tape, internet, TCP/IP, social media), transmission
through time rather than space, could be degradation due to decay
Biological organism (send message from brain to foot, or from ear to
brain, or genetic message from parent to child)

Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F26/67 (pg.26/114)
Information Theory Information Entropy

Receiver Possibilities

noise

source encoder channel decoder receiver

The destination of the information transmitted


Person,
Computer
Disk
Analog Radio or TV
internet streaming audio system

Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F27/67 (pg.27/114)
Information Theory Information Entropy

Noise

noise

source encoder channel decoder receiver

Some signal with time-varying frequency response, cross-talk,


thermal noise, impulsive switch noise, random mutation, etc.
Represents our imperfect understanding of the universe. Thus, we
treat it as random, often, however, obeying some tendencies, such as
those of a probability distribution.

Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F28/67 (pg.28/114)
Information Theory Information Entropy

Encoder

noise

source encoder channel decoder receiver

processing done before placing info into channel


First stage: data reduction (keep only important bits or remove
source redundancy),
followed by redundancy insertion catered to channel.
A code = a mechanism for the representation of information of one
type of signal in another form.
An encoding = representation of information in another form using
a code.

Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F29/67 (pg.29/114)
Information Theory Information Entropy

Decoder

noise

source encoder channel decoder receiver

The “decoder” is the inverse system of the “encoder” and it


attempts to recover the original source signal or some “subpart” of
the original source signal.
Exploit and then remove redundancy
Remove and fix any transmission errors
Restore the information in original form

Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F30/67 (pg.30/114)
Information Theory Information Entropy

Ex: Transmitting an Image


s encoder t channel r decoder ŝ
f = 10%

From: David J.C. MacKay “Information Theory, Inference, and Learning


Algorithms”, 2003. Transmitting 10,000 source bits over a BSC with f = 10%
using a repetition code and the majority vote algorithm. The probability of
decoded bit error has fallen to about 3%; the rate has fallen to 1/3.
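A minimal Python sketch of the experiment in this caption, assuming a BSC with flip probability f = 0.1, the rate-1/3 repetition code, and majority-vote decoding (variable names are illustrative, not from the original figure):

import numpy as np

rng = np.random.default_rng(0)
f, n_bits = 0.1, 10_000                          # BSC flip probability, number of source bits

source = rng.integers(0, 2, n_bits)              # random source bits s
encoded = np.repeat(source, 3)                   # repetition code: send each bit 3 times (rate 1/3)
flips = (rng.random(encoded.shape) < f).astype(int)
received = encoded ^ flips                       # the BSC flips each transmitted bit with prob. f
votes = received.reshape(-1, 3).sum(axis=1)
decoded = (votes >= 2).astype(int)               # majority vote per 3-bit block

empirical = np.mean(decoded != source)
analytic = 3 * f**2 * (1 - f) + f**3             # P(at least 2 of the 3 copies flipped)
print(f"bit-error rate: empirical ~{empirical:.3f}, analytic ~{analytic:.3f}")   # both near 0.028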
Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F31/67 (pg.31/114)
Information Theory Information Entropy

Ex: DNA Code

DNA or the chromosomes within each cell encode all the info about
each body
Source = Two parents
Encoder = Your imagination
Channel = Biological combination, meiosis (creation of haploid
gametes), mutation, and so on.
Noise, random mutation.
Decoder = further mitosis creating the new child.

Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F32/67 (pg.32/114)
Information Theory Information Entropy

Ex: Morse Code

Morse code, series of dots and


dashes to represent letters
most frequent letter sent with
the shortest code, 1 dot
Note: codewords might be
prefixes of each other (e.g., “E”
and “F”).
uses only binary data (single
current telegraph, size two
“alphabet”), could use more
(three, double current
telegraph), but this is more
susceptible to noise (binary in
computer rather than ternary).

Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F33/67 (pg.33/114)
Information Theory Information Entropy

Ex: Human Speech

noise

source encoder channel decoder receiver

Source = human thought, speaker's brain


Encoder = Human Vocal Tract
Channel = air, sound pressure waves
Noise = background noise (cocktail party effect)
Decoder = human auditory system
Receiver = human thought, listener's brain

Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F34/67 (pg.34/114)
Information Theory Information Entropy

Communication Theory

When do we know the components and what do we know about them?


noise

source encoder channel decoder receiver

Sometimes we know the code (e.g., when it is designed by humans,


e.g., Morse code)
Other times we do not (e.g., when it is nature, speech, genetics)
Much of machine learning (e.g., object recognition in a sound
source, an image source, etc.) can be seen as the decoder aspects of
the model (we don’t know for certain the code).
This includes deep neural networks and LLMs.
Nor do we know how the brain does it.

Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F35/67 (pg.35/114)
Information Theory Information Entropy

Communication Theory: On Error

How do we decrease errors in a communications system?


Physical: use more reliable components in circuitry, broaden spectral
bandwidth, use more precise and expensive electronics, increase
signal power.
All of this is more expensive and resource consuming.
Question: Given a fixed imperfect analog channel and transmission
equipment, can we achieve perfect communication over an imperfect
communication line?
Yes: Key is to add redundancy to signal. Ex. Speech.
Encoder adds redundancy appropriate for channel. Decoder exploits
and then removes redundancy.

Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F36/67 (pg.36/114)
Information Theory Information Entropy

Communication Theory: On Error


Question: If you transmit information at a higher rate, does the
error necessarily go up?
Answer: Surprisingly, not always.
Surprisingly, for a given noisy channel (even one with an exceedingly
small probability of transmitting without error), one can achieve
perfect communication at a given rate.
If that rate does not exceed a critical value, then one can increase the
rate without increasing error.
Let R = rate of the code (bits per channel use), and Pe ∝ exp(−E(R))
be the probability of error. Then:

[Figure: the error exponent E(R) and log Pe plotted against the rate R;
E(R) decreases to 0 at R = C.]
Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F37/67 (pg.37/114)
Information Theory Information Entropy

Communication & Information

What is information?
OED says:
1 facts provided or learned about something or someone.
2 what is conveyed or represented by a particular arrangement or
sequence of things.

Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F38/67 (pg.38/114)
Information Theory Information Entropy

Communication & Information

What is information?
Wikipedia says:
1 Information in its most restricted technical sense is a message
(utterance or expression) or collection of messages in an ordered
sequence that consists of symbols, or it is the meaning that can be
interpreted from such a message or collection of messages.
Information can be recorded or transmitted. It can be recorded as
signs, or conveyed as signals. Information is any kind of event that
affects the state of a dynamic system. The concept has numerous
other meanings in different contexts. Moreover, the concept of
information is closely related to notions of constraint,
communication, control, data, form, instruction, knowledge,
meaning, mental stimulus, pattern, perception, representation, and
especially entropy.

Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F39/67 (pg.39/114)
Information Theory Information Entropy

Communication & Information


What is information?
Webster's says:
1 the communication or reception of knowledge or intelligence
2a knowledge obtained from investigation, study, or instruction.
2b the attribute inherent in and communicated by one of two or more
alternative sequences or arrangements of something that produce
specific effects
2c a signal or character representing data
2d something (as a message, experimental data, or a picture) which
justifies change in a construct (as a plan or theory) that represents
physical or mental experience or another construct
2e a quantitative measure of the content of information; specifically : a
numerical quantity that measures the uncertainty in the outcome of
an experiment to be performed

Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F40/67 (pg.40/114)
Information Theory Information Entropy

Information

Oranges are 99¢/pound.
It is cloudy in Seattle today.
You are taking an information theory course right now.
It is a balmy tropical climate in Seattle. As in other places in the
Pacific North-West, warm, sunny days are the norm.
Richard Dawkins will win the U.S. Presidential Election in
November, 2024.
Poetry: “I heard an echo in a hollow place. No sound of blowing
wind or drifting sand, some ancient voice was this, a captive trace
of gone-by speech, of argument, demand,” – Tiel Aisha Ansari
A Painting (– Dali)
Music
Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F41/67 (pg.48/114)
Information Theory Information Entropy

Information

Such information has semantic meaning, but how do we quantify it?


IT is a formal mathematical theory; it uses probability and statistics to
make things mathematically precise. We don’t mean semantics; we
want a “quantifiable” meaning.
Key is communication, relaying a message from someone to
someone else.
I’m communicating to you now. How much? Can we quantify it?

Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F42/67 (pg.49/114)
Information Theory Information Entropy

Need a mathematical model of a source

Assume a source conveys one of a number of messages.


Message source randomly chooses one among many possible
messages
Information conveyed by a message corresponds to how unlikely that
message is. Information should be inversely related to probability in
some way.
That which is predictable conveys little or no information.
The probability distribution of those messages determines the
inherent information contained in the source, on average.
Ex: uniform distribution, greatest choice, or uncertainty about
source ⇒ greatest information gained on average.
Ex: constant random variable, least choice, least uncertainty ⇒
least information about the source, on average.

Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F43/67 (pg.50/114)
Information Theory Information Entropy

Entropy

We’ll define entropy shortly, but intuitively:


Entropy H measures uncertainty or information.
Entropy is a measure of choice, the choice that the source exercises
in selecting the messages that are transmitted.
entropy = 0 ⇒ no choice.
entropy = log N ⇒ maximum choice
Entropy is the uncertainty of the receiver — how much uncertainty
does the receiver, when seeing a source, have about the source.
Entropy measures amount of information (complexity) in a source.
The more uniformly at random you are, the more choice you have.

Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F44/67 (pg.51/114)
Information Theory Information Entropy

Communication Theory
Original model of communication
noise

source encoder channel decoder receiver

General model of communication expanded:


noise

source → source coder → channel encoder → channel → channel decoder → source decoder → receiver
         (the first two blocks form the encoder; the last two form the decoder)

Can we do source coding and channel coding separately without


them knowing about each other and retain optimality?
Source Coding: shrinks source down to ultimate limit, data
compression, H, the entropy of the source
Channel coding: achieves ultimate transmission rate C, channel
capacity, pushes as many bits through channel as possible w/o error.
Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F45/67 (pg.56/114)
Information Theory Information Entropy

On Source Coding

What makes a good code?


Lossless codes (such as Huffman, Lempel-Ziv, bzip, bzip2, etc.),
compress to the theoretical limit of entropy, and do so without error.
Code length, we want a short code on average. What does
“average” mean? We’ll see.
Fidelity loss (e.g., JPEG, MPEG, CDMA, TDMA). Lossy
compression/communication. You can compress more if you accept
error.
Rate-distortion - underlying tradeoff between rate of code and the
underlying distortion, but there are limits of rate for any given
distortion.

Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F46/67 (pg.57/114)
Information Theory Information Entropy

5 basic questions in information theory for communication

1 How to measure information & define a unit of measure (entropy).


2 How to define an information source & measure rate of information
supplied by the source (probability and entropy, source coding
theorem).
3 How to define a channel & rate of information transmission through
a channel (via a conditional distribution)
4 How to study joint rate of transmission from source through channel
to receiver? How to maximize rate of transfer? (mutual information
and channel coding).
5 How to study noise, & how noise limits rate of information
transmission without limiting reliability (channel coding theorem).

Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F47/67 (pg.58/114)
Information Theory Information Entropy

What is entropy?
Events Ek each occur with probability pk . pk indicates the
likelihood of the event Ek happening.
Shannon/Hartley information of event Ek is I(Ek ) = log(1/pk ),
indicating:
1 a measure of surprise of finding out Ek . If pk = 1 ⇒no surprise in
finding out that Ek occurred, while pk = 0 ⇒ infinite surprise in
finding out Ek .
2 A measure of information gained in finding out Ek (information
gained is equal to surprise). pk = 1 ⇒No information is gained, while
pk = 0 ⇒ infinite information is gained.
3 A measure of the “uncertainty” of Ek , but really unexpectedness.
Unexpectedness is the thing that determines interest, or information
(see next slide).
4 I(Ek ) = − log p(Ek ) = the self information of that event, or that
message. Why is it called self-information? We’ll soon see.
All logs are base 2 (by default), so log ≡ log2 unless otherwise
stated. ln will be natural log.
Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F48/67 (pg.65/114)
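A quick numerical illustration of self-information I(Ek) = log(1/pk) in bits (a sketch only; the probabilities below are arbitrary):

import math

for p in (1.0, 0.5, 0.25, 0.01):
    # no surprise when p = 1; rare events carry many bits of surprise
    print(f"p = {p:<4}  I = {math.log2(1 / p):5.2f} bits")
# p = 0 would give infinite surprise, since log(1/p) diverges as p -> 0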
Information Theory Information Entropy

On event uncertainty vs. unexpectedness

The word “Uncertainty” doesn’t really apply to the individual event


or information thereof I(Ek ) (as it is described in some texts).
If p(x) = 0, we are as certain about it not happening as we are
certain about x happening when p(x) = 1.
That’s why “surprise” or “unexpectedness” are better words;
self-information is a measure of this. I.e., 1/pk is how surprised we
are that k happened when it does.
Unexpectedness is the thing that determines interest, or information.
We care about that which is unexpected.

Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F49/67 (pg.66/114)
Information Theory Information Entropy

Entropy - What’s in a name?


In Shannon’s 1948 paper, he used the term “entropy” which came
from the “disorder” in a thermodynamical system.
The term “Entropy” came from Rudolph Clausius, 1865:
Since I think it is better to take the names
of such quantities as these, which are im-
portant for science, from the ancient lan-
guages, so that they can be introduced with-
out change into all the modern languages,
I propose to name the magnitude S the
entropy of the body, from the Greek word
“trope” for “transformation.” I have inten-
tionally formed the word “entropy” so as to
be as similar as possible to the word “en-
ergy” since both these quantities which are
to be known by these names are so nearly
related to each other in their physical signifi-
cance that a certain similarity in their names
seemed to me advantageous.
Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F50/67 (pg.67/114)
Information Theory Information Entropy

Uses of entropy in IT

Entropy uses:
measure information in the communication theory model.
Surprise of an event {X = x} is measured as log(1/p(x)), and there are
reasons for using log. Entropy is the average surprise.
The lower bound on min number of guesses (on average) to guess
the value of a random variable.
The minimum number of bits to compress a source.
The optimal coding “length”, of a random source.
The minimum description length (MDL) of a random source that
can be achieved without probability of error.

Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F51/67 (pg.68/114)
Information Theory Information Entropy

(discrete) Entropy Definition

Notation: p(x) = P_X(X = x). The event is {X = x}.
Given random variable X, the expected value is E X = \sum_x x\, p(x).
Given a function g : X → R, the expected value of the random variable g(X)
is E g(X) = \sum_x g(x) p(x).
Now take g(x) = \log \frac{1}{p(x)}; thus g(x) is the unexpectedness of
finding out event X = x.
Then take the expected value of this g (which is self-referential but
well-defined), giving \sum_x p(x) \log \frac{1}{p(x)}. That is,

\sum_x p(x) g(x) = \sum_x p(x) \log \frac{1}{p(x)}    (1.1)

This is the average or expected surprise, or expected unexpectedness


in a random variable X and is the definition of entropy.

Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F52/67 (pg.69/114)
Information Theory Information Entropy

Entropy

Definition 1.5.1 (Entropy)


Given a discrete random variable X over a finite sized alphabet, the
entropy of the random variable is:
H(X) ≜ E \log \frac{1}{p(X)} = \sum_x p(x) \log \frac{1}{p(x)} = -\sum_x p(x) \log p(x)    (1.2)

Entropy is in units of “bits” since logs are base 2 (units of “nats” if


base e logs).
Measures the degree of uncertainty in a distribution.
Measures the disorder or spread of a distribution.
Measures the “choice” that a source has in choosing symbols
according to the density (higher entropy means more choice)

Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F53/67 (pg.70/114)
Information Theory Information Entropy

Entropy Of Distributions
[Figure: three example distributions p(x) over x, labeled “Low Entropy”,
“High Entropy”, and “In Between”.]
Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F54/67 (pg.76/114)
Information Theory Information Entropy

Entropy

A measure of the true average uncertainty, or average surprise, which
is a measure over the entire distribution.
Remember this: entropy measures the average or expected degree of
uncertainty of the outcome of a probability distribution.
A measure of disorder, or spread. High entropy distributions should
be flat, more uniform, while low entropy distributions should be
concentrated on a few modes.
A measure of choice that the source has in choosing elements of Ek.

Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F55/67 (pg.77/114)
Information Theory Information Entropy

Binary Entropy
Binary alphabet, X ∈ {0, 1} say.
p(X = 1) = p = 1 − p(X = 0).
H(X) = −p log p − (1 − p) log(1 − p) = H(p).
As a function of p, we get:
[Figure: H(p) plotted as a function of p ∈ [0, 1]: a concave curve that is 0 at
p = 0 and p = 1 and peaks at 1 when p = 0.5.]
Note, greatest uncertainty (value 1) when p = 0.5 and least
uncertainty (value 0) when p = 0 or p = 1.
Note also: concave in p.
Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F56/67 (pg.83/114)
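A brief sketch of the binary entropy function H(p) shown above (assuming a small helper of our own, not from the slides):

import numpy as np

def binary_entropy(p):
    """H(p) = -p log2 p - (1 - p) log2 (1 - p), with H(0) = H(1) = 0."""
    p = np.asarray(p, dtype=float)
    out = np.zeros_like(p)
    inside = (p > 0) & (p < 1)
    q = p[inside]
    out[inside] = -q * np.log2(q) - (1 - q) * np.log2(1 - q)
    return out

ps = np.array([0.0, 0.1, 0.5, 0.9, 1.0])
print(binary_entropy(ps).round(3))    # [0.  0.469  1.  0.469  0.] -- peak of 1 bit at p = 0.5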
Information Theory Information Entropy

Entropy: Are Humans Random?


IT and Entropy use randomness to measure information. But are
humans (and is the universe) random?
Humans utilize “semantics” (whatever that is), and may convey
“meaning” or “information” in a source beyond how improbable or
unpredictable it is (e.g., poetry, music, art).
Example: death. After a long bout with cancer, it is predictable, but
it has extraordinary meaning.
Information theory ignores such semantics.
On the other hand, can we model certain properties of a human
with a random process?
Yes. Humans (and natural organisms and signals in general) do
exhibit purposeful statistical regularity. Ex: AI and LLMs. What
does this say about creativity? Slightly irregular statistical
regularity?

Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F57/67 (pg.85/114)
Information Theory Information Entropy

Joint Entropy

Two random variables X and Y have joint entropy.

H(X, Y) = -\sum_{x,y} p(x, y) \log p(x, y) = E \log \frac{1}{p(X, Y)}    (1.3)

Immediate generalizations to vectors X_{1:N} = (X_1, X_2, \ldots, X_N):

H(X_1, \ldots, X_N) = \sum_{x_1, \ldots, x_N} p(x_1, \ldots, x_N) \log \frac{1}{p(x_1, \ldots, x_N)}    (1.4)
                    = E \log \frac{1}{p(X_1, \ldots, X_N)}    (1.5)

Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F58/67 (pg.87/114)
Information Theory Information Entropy

The non-negativity of discrete Entropy

H(X) ≜ E \log \frac{1}{p(X)} = \sum_x p(x) \log \frac{1}{p(x)} = -\sum_x p(x) \log p(x)
Discrete since X is a discrete random variable (i.e., x ∈ X where X
is countable).
Note limα→0 α log α = 0, hence if p(x) = 0, the entropy is
uninfluenced.
Also since p(x) ≥ 0 and log 1/p(x) ≥ 0 discrete entropy is always
non-negative H(X) ≥ 0.

Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F59/67 (pg.88/114)
Information Theory Information Entropy

Conditional Entropy
For two random variables X, Y related via p(x, y), knowing the
event X = x can change the entropy of Y .
Event conditional entropy H(Y |X = x)
H(Y|X = x) = E \log \frac{1}{p(Y|X = x)}    (1.6)
           = -\sum_y p(y|x) \log p(y|x)    (1.7)

Averaging over all x, we get the conditional entropy H(Y|X):

H(Y|X) = \sum_x p(x) H(Y|X = x)    (1.8)
       = -\sum_x p(x) \sum_y p(y|x) \log p(y|x)    (1.9)
       = -\sum_{x,y} p(x, y) \log p(y|x) = E \log \frac{1}{p(Y|X)}    (1.10)

Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F60/67 (pg.91/114)
Information Theory Information Entropy

Chain rule for Entropy

Proposition 1.5.2 (Chain Rule for Entropy)

H(X, Y ) = H(X) + H(Y |X) = H(Y ) + H(X|Y ) (1.11)

Proof.

− log p(x, y) = − log p(x) − log p(y|x) (1.12)

then take expected value of both sides to get result.

Corollary 1.5.3
If X ⊥ Y, then H(X, Y) = H(X) + H(Y).

Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F61/67 (pg.92/114)
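A numeric sanity check of the chain rule on an arbitrary 2x2 joint distribution (the values and names below are chosen only for illustration):

import numpy as np

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    nz = p[p > 0]
    return float(-np.sum(nz * np.log2(nz)))

pxy = np.array([[0.40, 0.10],           # joint p(x, y); rows index x, columns index y
                [0.20, 0.30]])
px = pxy.sum(axis=1)                    # marginal p(x)
H_cond = sum(px[i] * H(pxy[i] / px[i]) for i in range(len(px)))   # H(Y|X) = sum_x p(x) H(Y|X=x)
print(round(H(pxy), 6), round(H(px) + H_cond, 6))                 # both ~1.846: H(X,Y) = H(X) + H(Y|X)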
Information Theory Information Entropy

General Chain rule for Entropy


Proposition 1.5.4 (Chain Rule for Entropy)

H(X_1, X_2, \ldots, X_N) = \sum_{i=1}^{N} H(X_i | X_1, X_2, \ldots, X_{i-1})    (1.13)

Proof.
Use the chain rule of conditional probability, i.e., that

p(x_1, x_2, \ldots, x_N) = \prod_{i=1}^{N} p(x_i | x_1, \ldots, x_{i-1})    (1.14)

then

-\log p(x_1, x_2, \ldots, x_N) = -\sum_{i=1}^{N} \log p(x_i | x_1, x_2, \ldots, x_{i-1})    (1.15)

then take expected value of both sides to get result.


Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F62/67 (pg.95/114)
Information Theory Information Entropy

Aside: Variational Bound for Log


Convex analysis gives the variational representation

\ln x = \min_{\lambda} \{\lambda x - \ln \lambda - 1\}    (1.16)

so for any λ, we have

\ln x \le \lambda x - \ln \lambda - 1    (1.17)

and with λ = 1, we thus get

\ln x \le x - 1    (1.18)

[Figure: variational upper bounds on the natural log: ln(x) plotted for
x ∈ (0, 1] together with the lines λx − ln λ − 1 for
λ ∈ {0.5, 1, 2, 3, 5, 10, 20}, each lying above ln(x).]

Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F63/67 (pg.96/114)
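A quick numeric check of the bound ln x ≤ λx − ln λ − 1 (a sketch only; the grid of x and λ values is arbitrary):

import numpy as np

xs = np.linspace(0.05, 3.0, 60)
for lam in (0.5, 1.0, 2.0, 5.0):
    gap = lam * xs - np.log(lam) - 1 - np.log(xs)   # upper bound minus ln x
    assert np.all(gap >= -1e-12), "bound violated"  # holds for every lambda > 0
    # the gap is smallest near x = 1/lambda, where the bound is tight
print(np.log(0.7) <= 0.7 - 1)                       # True: the lambda = 1 case, ln x <= x - 1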
Information Theory Information Entropy

Max value of (discrete) Entropy


Proposition 1.5.5
Let X ∈ {x_1, x_2, \ldots, x_n}. Then H(X) ≤ log n, and equality is achieved
iff p(X = x_i) = 1/n for all i.

Proof.
Approach: show that H(X) − log n ≤ 0.

H(X) - \log n = -\sum_x p(x) \log p(x) - \sum_x p(x) \log n    (1.19)
              = \log_2 e \sum_x p(x) \ln \frac{1}{p(x) n}    (1.20)
              \le \log e \sum_x p(x) \left( \frac{1}{p(x) n} - 1 \right)    (1.21)
              = \log e \left[ \sum_x \frac{1}{n} - \sum_x p(x) \right] = 0    (1.22)
Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F64/67 (pg.104/114)
Information Theory Information Entropy

Max value of (discrete) Entropy

Since ln z = z − 1 when z = 1, the above becomes an equality at the
stationary point, i.e., when 1/(p(x) n) = 1, or p(x) = 1/n, the uniform
distribution.
Another way to see this: if p_i = 1/n, then
-\sum_i p_i \log p_i = -\sum_i \frac{1}{n} \log \frac{1}{n} = -\log \frac{1}{n} = \log n.
Implications: entropy increases when the distribution becomes more
uniform.
E.g., mixing: for λ p_1 + (1 − λ) p_2, we have
H(λ p_1 + (1 − λ) p_2) ≥ λ H(p_1) + (1 − λ) H(p_2); entropy is concave.

Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F65/67 (pg.108/114)
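A short numeric check of both facts above, the log n bound and concavity under mixing (the particular distributions below are arbitrary, chosen only for illustration):

import numpy as np

def H(p):
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-np.sum(nz * np.log2(nz)))

n = 4
uniform = np.full(n, 1 / n)
skewed = np.array([0.7, 0.1, 0.1, 0.1])
print(H(uniform), np.log2(n))      # 2.0 2.0  -- the uniform distribution attains log2(n)
print(H(skewed))                   # ~1.357   -- a non-uniform distribution falls below it

lam = 0.3
mix = lam * uniform + (1 - lam) * skewed
print(H(mix) >= lam * H(uniform) + (1 - lam) * H(skewed))   # True: mixing increases entropy (concavity)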
Information Theory Information Entropy

Permutations

What if we permute the probabilities themselves?


I.e., let p = (p_1, p_2, \ldots, p_n) be a discrete probability
distribution and σ = (σ_1, σ_2, \ldots, σ_n) be a random permutation.
Let p_σ = (p_{σ_1}, p_{σ_2}, \ldots, p_{σ_n}) be a permutation of the distribution.
How does H(p) = -\sum_i p_i \log p_i compare with H(p_σ)?
Same, since H(p) = H(p_σ) = -\sum_i p_{σ_i} \log p_{σ_i}.

Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F66/67 (pg.113/114)
Information Theory Information Entropy

Summary so far

H(X) = E I(X) = -\sum_x p(x) \log p(x)    (1.23)

H(X, Y) = -\sum_{x,y} p(x, y) \log p(x, y)    (1.24)

H(Y|X) = -\sum_{x,y} p(x, y) \log p(y|x)    (1.25)

H(X, Y ) = H(X) + H(Y |X) = H(Y ) + H(X|Y ) (1.26)

and

0 ≤ H(X) ≤ log n, where n is X’s alphabet size. (1.27)

Prof. Jeff Bilmes EE514a/Fall 2023/Info. Theory I – Lecture 1 - Sep 27th, 2023 L1 F67/67 (pg.114/114)
