
COSC1003/1903

Information Theory

Joseph Lizier

Lecture 15 | Monday, 14 September 2015



Guest lecturer

Dr. Joseph Lizier


Senior Lecturer
Complex Systems Research Group
School of Civil Engineering
Rm 338A Civil Eng Building
[email protected]


Reference texts

• T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley-Interscience, New York, 1991.
  • see ch. 2
• D. J. C. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, Cambridge, 2003.
  • see ch. 2 and 8
  • download at http://www.inference.phy.cam.ac.uk/itprnn/book.html


Outline

1 Introduction to information theory

2 Entropy: fundamental quantity of information theory

3 Other measures

4 Sample applications

5 Summary


What is information?

• You tell me ...


A game about information: Guess who?

1. Let’s talk about the rules


A game about information: Guess who?

2. Who wants to play?


A game about information: Guess who?

3. What did we learn from the game:


1 What are the best/worst questions to ask or strategies?

2 What types of information did we encounter?


What is Information Theory?

• An approach to quantitatively capture the notion of information.
• Traditionally, information theory provides answers to two fundamental questions (Cover and Thomas, 1991):
  1 What is the ultimate data compression?
    • How small can I zip up a file?
  2 What is the ultimate transmission rate of communication?
    • What is my max download speed at home?


What is Information Theory?

It’s also about far more than these traditional areas:

How do natural systems process information? → [Image from Cover and Thomas (1991)]


Defining information: first pass

JL: “Information is all about questions and answers”

Information is the amount by which
• one variable (an answer/signal/measurement)
• reduces our uncertainty or surprises us
• about another variable.

This was quantified by Claude Shannon (1948)


Quantifying information: preliminaries

• X is a random variable
  • A variable whose value is subject to chance.
  • i.e. an answer/signal/measurement
  • e.g. result of a coin flip, whether it rains today, etc.
• x is a sample or outcome or measurement of X
  • drawn from some discrete alphabet A_X = {x1, x2, ...}
  • For binary X, A_X = {0, 1}
  • For a coin toss, A_X = {heads, tails}
  • For hair colour in Guess who?: A_X = {?}
• We have PDF defined: p(x) = Pr{X = x}, x ∈ A_X
  • 0 ≤ p(x) ≤ 1, ∀ x ∈ A_X
  • Σ_{x∈A_X} p(x) = 1


Shannon information content

• The fundamental quantity of information theory
• Shannon information content¹ of a sample or outcome x:

  h(x) = log₂(1/p(x))

• Units are bits for log in base 2.
• Best thought of as a measure of surprise at the value of this sample or outcome x given p(x):
  • No surprise if there is only ever one outcome p(x) = 1;
  • There is always some level of surprise if there exists more than one outcome with p(x) > 0;
  • Our surprise increases as x becomes less likely.

¹ We’ll show later how this is a unique form ...


Shannon information content

• Shannon information content of a sample or outcome x:

  h(x) = log₂(1/p(x)) = − log₂(p(x))

  [Plot: h(x) versus p(x), for p(x) from 0 to 1]

• Examples (checked in the sketch below):
  • h(heads) for a fair coin? 1 bit
  • h(1) for a 6-sided die (p(1) = 1/6)? 2.58 bits
  • h(not 1) for a 6-sided die (p(not 1) = 5/6)? 0.26 bits
  • h(1) for a 20-sided die (p(1) = 1/20)? 4.32 bits
  • h(not 1) for a 20-sided die (p(not 1) = 19/20)? 0.07 bits
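A minimal Python sketch reproducing these example values from p(x) (standard library only; function and label names are illustrative):

  from math import log2

  def information_content(p):
      # Shannon information content h(x) = -log2(p(x)), in bits, of an outcome with probability p
      if not 0 < p <= 1:
          raise ValueError("p must lie in (0, 1]")
      return -log2(p)

  # The coin and dice examples above:
  for label, p in [("heads, fair coin", 1/2), ("1, 6-sided die", 1/6), ("not 1, 6-sided die", 5/6),
                   ("1, 20-sided die", 1/20), ("not 1, 20-sided die", 19/20)]:
      print(f"h({label}) = {information_content(p):.2f} bits")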


Shannon information content

• Shannon information content of a sample or outcome x:

  h(x) = log₂(1/p(x)) = − log₂(p(x))

• Examples – Guess Who?:
  • h(alex)? log₂(1/(1/24)) = 4.585 bits
  • h(female)? log₂(1/(5/24)) = 2.263 bits
  • h(male)? log₂(1/(19/24)) = 0.337 bits
• Is “female?” a good question to ask first?
• Is “alex?” a good question to ask first?


(Shannon) entropy

• Shannon entropy of a random variable X:

  H(X) = Σ_{x∈A_X} p(x) log₂(1/p(x))
       = − Σ_{x∈A_X} p(x) log₂(p(x))
       = ⟨h(x)⟩

• Expectation value of Shannon information content
• p log p = 0 in the limit as p → 0
• Examples:
  • If ∃x, p(x) = 1 → H(X) = 0.
  • For binary X, p(0) = p(1) = 0.5 → H(X) = 1 bit.
  • p(x) = 1/|A_X|, ∀x → H(X) = log₂(|A_X|) bits.


(Shannon) entropy

• Shannon entropy of a random variable X:

  H(X) = Σ_{x∈A_X} p(x) log₂(1/p(x))
       = − Σ_{x∈A_X} p(x) log₂(p(x))
       = ⟨h(x)⟩

• Examples – Guess Who?:
  • H(who)? 4.585 bits
  • H(sex)? (5/24) × 2.263 + (19/24) × 0.337 = 0.738 bits (checked in the sketch below)
  • Is “female?” a good question to ask first?
  • Is “alex?” a good question to ask first?
  • What is the best question to ask first, and why?
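These two values can be checked directly from the probabilities; a quick Python sketch (assuming the 24-character board with 5 females, as above):

  from math import log2

  # H(who): 24 equally likely characters; H(sex): 5 female vs 19 male
  H_who = -sum(1/24 * log2(1/24) for _ in range(24))
  H_sex = -(5/24) * log2(5/24) - (19/24) * log2(19/24)
  print(round(H_who, 3), round(H_sex, 3))   # 4.585 0.738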


(Shannon) entropy – coding it

• Shannon entropy of a random variable X:

  H(X) = − Σ_{x∈A_X} p(x) log₂(p(x))

• Let’s code it (a sketch follows below):
  • For a binary X with p1 = p(X = 1):
    • H(X) = −p1 log₂(p1) − (1 − p1) log₂(1 − p1)
    • What does H(X) look like as a function of p(X = 1)?
  • For a general discrete A_X:
    1 Input? Take p(x) as a vector
    2 How to sum over x?
    3 What are some possible error conditions here?
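One possible answer to these questions, as a minimal Python/NumPy sketch (the function name and checks are illustrative, not prescribed):

  import numpy as np

  def entropy(p):
      # Shannon entropy, in bits, of a discrete distribution given as a vector of probabilities
      p = np.asarray(p, dtype=float)
      # Possible error conditions: negative entries, or probabilities that do not sum to 1
      if np.any(p < 0) or not np.isclose(p.sum(), 1.0):
          raise ValueError("p must be non-negative and sum to 1")
      nz = p[p > 0]                      # convention: p*log2(p) -> 0 as p -> 0
      return float(-np.sum(nz * np.log2(nz)))

  # Binary entropy as a function of p1 = p(X = 1): peaks at 1 bit when p1 = 0.5
  for p1 in (0.1, 0.3, 0.5, 0.9):
      print(p1, entropy([1 - p1, p1]))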


What does it mean, and traditional usage

Using an optimal compression or encoding scheme given p(x):


• h(x) is the number of bits for a symbol to communicate x
• H(X) is the number of bits to communicate the x on average.

Or (in bits): how few yes/no questions could I need to ask (on average) to determine the value of x?

Think about Guess who? as a decoding task


What does it mean, and traditional usage

Using an optimal compression or encoding scheme given p(x):


• h(x) is the number of bits for a symbol to communicate x
• H(X) is the number of bits to communicate the x on average.

Or (in bits): how few yes/no questions could I need to ask (on average) to determine the value of x?

[Image from Shannon (1948)]

What has information theory ever done for me? zip files, mp3s, encoding mobile telecoms / ADSL etc.

What does it mean, and traditional usage

Using an optimal compression or encoding scheme given p(x):


• h(x) is the number of bits for a symbol to communicate x
• H(X) is the number of bits to communicate the x on average.

Example: say we want to communicate the result of a horse race with four horses {a, b, c, d}:
• How many bits to encode each outcome?
  • Assume p(x) = 0.25, ∀x, to give 2 bits – the max. entropy assumption
• If p(a) = 0.5, p(b) = 0.25, p(c) = p(d) = 0.125?
  • h(x) tells us to use 1 bit for a (say “0”), 2 bits for b (say “10”) and 3 bits for c and d (say “110” and “111”); H(X) = 1.75 bits (checked in the sketch below).
• Using the actual p(x) leads to more efficient coding
• Information is not about meaning
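A quick Python check of the 1.75-bit figure against the average length of the code given above (standard library only):

  from math import log2

  p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
  code = {"a": "0", "b": "10", "c": "110", "d": "111"}

  H = -sum(px * log2(px) for px in p.values())
  avg_len = sum(p[x] * len(code[x]) for x in p)
  print(H, avg_len)   # both are 1.75 bits: this code meets the entropy limit exactly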

Entropy of text and compression

• Think about coding letters in English language text
  • Can we get any insights into how many bits to use for each letter?²
  • Look at entropy of alphabet in MacKay (2003)
• Meaning of a non-integer number of bits:
  • Encoding one sample at a time can only be done with an integer number of bits
  • To reach the lower limits suggested by information theory, we would need to use block coding (i.e. encoding multiple samples together)

² How to determine the coding to use is a discussion for another time ...

Joint entropy

• We can consider joint entropy of a multivariate, e.g. {X, Y}:

  H(X, Y) = Σ_{x∈A_X} Σ_{y∈A_Y} p(x, y) log₂(1/p(x, y))

• Is H(X, Y) = H(X) + H(Y)?
  • Only for independent variables, where p(x, y) = p(x)p(y)!
• Can you think about how to code H(X, Y)? (a sketch follows below)
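One way to code it, sketched in Python/NumPy: take p(x, y) as a 2-D array and sum over both dimensions (the example arrays are illustrative):

  import numpy as np

  def joint_entropy(p_xy):
      # Joint Shannon entropy H(X, Y) in bits, from a 2-D array of joint probabilities p(x, y)
      p = np.asarray(p_xy, dtype=float)
      nz = p[p > 0]
      return float(-np.sum(nz * np.log2(nz)))

  # Independent case: p(x, y) = p(x)p(y), so H(X, Y) = H(X) + H(Y) = 2 bits
  print(joint_entropy([[0.25, 0.25], [0.25, 0.25]]))
  # Dependent case: H(X, Y) < H(X) + H(Y)
  print(joint_entropy([[0.4, 0.1], [0.1, 0.4]]))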


Aside: Shannon entropy – derivation

• Shannon entropy of a random variable X:

  H(X) = − Σ_{x∈A_X} p(x) log₂(p(x))

• Is a unique form that satisfies three axioms (Ash, 1965; Shannon, 1948):
  • Continuity w.r.t. p(x)
  • Monotony – H(X) ↑ as |A_X| ↑, for p(x) = 1/|A_X|
  • Grouping – For independent variables X and Y, H(X, Y) = H(X) + H(Y)


Conditional entropy

• What if we already know something about X – how does that change the surprise?
• Conditional entropy: (average) surprise remaining about X if we already know the value of Y:

  H(X | Y) = H(X, Y) − H(Y)
           = Σ_{x∈A_X} Σ_{y∈A_Y} p(x, y) log₂(1/p(x | y))

  [Venn diagram: circles H(X) and H(Y) within H(X, Y), showing the parts H(X|Y) and H(Y|X)]

• 0 ≤ H(X | Y) ≤ H(X)
• H(X | Y) = H(X) iff X and Y are independent
• H(X | Y) = 0 means there is no surprise left about X once we know Y (see the sketch below)
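A minimal NumPy sketch of H(X | Y) = H(X, Y) − H(Y), computed from a joint probability matrix (the numbers are illustrative):

  import numpy as np

  def entropy_bits(p):
      p = np.asarray(p, dtype=float)
      nz = p[p > 0]
      return float(-np.sum(nz * np.log2(nz)))

  def conditional_entropy(p_xy):
      # H(X | Y) = H(X, Y) - H(Y); rows of p_xy indexed by x, columns by y
      p_xy = np.asarray(p_xy, dtype=float)
      p_y = p_xy.sum(axis=0)             # marginalising over x gives p(y)
      return entropy_bits(p_xy) - entropy_bits(p_y)

  p_xy = np.array([[0.4, 0.1],
                   [0.1, 0.4]])
  print(conditional_entropy(p_xy))       # below H(X) = 1 bit, since Y carries information about X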

Conditional entropy

• Conditional entropy: (average) surprise remaining about X if we already know the value of Y:

  H(X | Y) = H(X, Y) − H(Y)

Example 1:
• Coding characters in English text – what variable Y would drop H(X) and therefore the code length for a conditional encoding of incoming character X?
• Context of previous character(s) Y changes the probability of the next character X – Markov chains (see the sketch below)
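A rough Python sketch of this idea: estimate the per-character entropy of a text sample, and the conditional entropy given the previous character, from empirical counts (the short sample string is just a stand-in for a real corpus):

  from collections import Counter
  from math import log2

  def entropy_of_counts(counts):
      n = sum(counts.values())
      return -sum(c / n * log2(c / n) for c in counts.values())

  text = "the cat sat on the mat and the dog sat on the log"     # stand-in corpus
  H_x = entropy_of_counts(Counter(text[1:]))                     # H(X): next character alone
  H_xy = entropy_of_counts(Counter(zip(text, text[1:])))         # H(Y, X): (previous, next) pairs
  H_y = entropy_of_counts(Counter(text[:-1]))                    # H(Y): previous character
  print(H_x, H_xy - H_y)   # H(X | Y) = H(X, Y) - H(Y) is lower: context reduces the surprise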


Conditional entropy

• Conditional entropy: (average) surprise remaining about X if we already know the value of Y:

  H(X | Y) = H(X, Y) − H(Y)

Example 2:
• Guess who? – how much surprise remains (on average) about X = who given Y = sex?
  1 First H(who, sex) = H(who) = 4.585 bits, because the character contains all information about the sex
  2 Next, H(sex) = 0.738 bits (computed earlier)
  3 So H(who | sex) = 3.847 bits.


Mutual information

• Mutual Information is the reduction in uncertainty or surprise about one variable that we obtain from another:

  I(X; Y) = H(X) + H(Y) − H(X, Y)
          = H(X) − H(X | Y)
          = Σ_{x∈A_X} Σ_{y∈A_Y} p(x, y) log₂( p(x, y) / (p(x)p(y)) )
          = Σ_{x∈A_X} Σ_{y∈A_Y} p(x, y) log₂( p(x | y) / p(x) )

Can anyone smell Bayes rule?


Mutual information

• Mutual Information is the reduction in uncertainty or surprise about one variable that we obtain from another:

  I(X; Y) = H(X) + H(Y) − H(X, Y)
          = H(X) − H(X | Y)
          = Σ_{x∈A_X} Σ_{y∈A_Y} p(x, y) log₂( p(x | y) / p(x) )

  [Venn diagram: I(X;Y) is the overlap of circles H(X) and H(Y), with H(X|Y) and H(Y|X) outside the overlap, all within H(X, Y)]

• 0 ≤ I(X; Y) ≤ min(H(X), H(Y))
• I(X; Y) = 0 iff X and Y are independent
• I(X; Y) = H(X) means there is no surprise left about X once we know Y – i.e. Y tells us all the information about X.


Mutual information

• Mutual Information is the reduction in uncertainty or surprise about one variable that we obtain from another:

  I(X; Y) = H(X) + H(Y) − H(X, Y)
          = H(X) − H(X | Y)
          = Σ_{x∈A_X} Σ_{y∈A_Y} p(x, y) log₂( p(x | y) / p(x) )

• This reflects our earlier definition of information.
• I(X; X) = H(X) is the self-information.
• Entropy and information are complementary quantities.
• MI is a non-linear correlation (a sketch for computing it follows below).
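A minimal NumPy sketch computing I(X; Y) = H(X) + H(Y) − H(X, Y) from a joint probability matrix (the example values are illustrative, not the Guess Who? board):

  import numpy as np

  def entropy_bits(p):
      p = np.asarray(p, dtype=float)
      nz = p[p > 0]
      return float(-np.sum(nz * np.log2(nz)))

  def mutual_information(p_xy):
      # I(X; Y) = H(X) + H(Y) - H(X, Y), from a 2-D array of joint probabilities p(x, y)
      p_xy = np.asarray(p_xy, dtype=float)
      return (entropy_bits(p_xy.sum(axis=1))     # H(X) from the marginal over y
              + entropy_bits(p_xy.sum(axis=0))   # H(Y) from the marginal over x
              - entropy_bits(p_xy))              # H(X, Y)

  print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))   # 0 bits: independent
  print(mutual_information([[0.4, 0.1], [0.1, 0.4]]))       # positive: the variables share information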


Mutual information

I(X; Y) = H(X) + H(Y) − H(X, Y)

[Venn diagram as above]

Example: Guess who?
• I(who; sex) = H(sex) trivially
• I(earrings; sex)?
  1 Construct p(earrings, sex)
  2 Plug in (result is 0.212 bits)
  3 Why is there MI here?


Mutual information

MI is a great model-free tool to:


• detect relationships between variables
• reveal patterns
• show how such relationships and patterns fluctuate in time.


Information theory – sample applications

Feature selection for machine learning


e.g. in Disease diagnosis from breath/urine analysis
• ∼10 000 features available
• use of MI to select which could be used in classification (a sketch follows below)

[Image: eNose from Sensigent (image used under CC-BY-SA license)]
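A toy Python sketch of MI-based feature scoring on discrete data (the data are made up for illustration; real pipelines would estimate MI more carefully, e.g. with bias correction):

  from collections import Counter
  from math import log2

  def entropy_of_counts(counts):
      n = sum(counts.values())
      return -sum(c / n * log2(c / n) for c in counts.values())

  def mutual_information(xs, ys):
      # Plug-in estimate of I(X; Y) in bits from paired discrete samples
      return (entropy_of_counts(Counter(xs)) + entropy_of_counts(Counter(ys))
              - entropy_of_counts(Counter(zip(xs, ys))))

  labels   = [0, 0, 1, 1, 0, 1, 0, 1]
  feature1 = [0, 0, 1, 1, 0, 1, 0, 1]    # tracks the label exactly
  feature2 = [0, 1, 0, 1, 0, 1, 1, 0]    # carries no information about the label
  scores = {name: mutual_information(f, labels)
            for name, f in [("feature1", feature1), ("feature2", feature2)]}
  print(sorted(scores.items(), key=lambda kv: -kv[1]))   # rank candidate features by MI with the label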


Information theory – sample applications

Space-time characterisation of information processing in distributed systems
• Highlight information processing hot-spots;
• Use information processing to explain dynamics.

[Figure: e.g. cellular automata (Lizier, 2014)]


Information theory – sample applications

Analysing information processing in the brain


• Localise response to a given stimulus;
• Revealing directed brain network structures;
• Inferring differences in information processing under cognitive task or condition.

[Figure: directed brain network structure; Lizier et al. (2011), Gómez et al. (2014)]


Information theory – sample applications

Analysing implicit communications


e.g. in robotic soccer matches (Robocup)

[Figure: Relative mid. AIS vs. Relative Transfer from mid. to mid.; Cliff et al. (2014, 2015)]


What you need to know

• Meaning of information, uncertainty and surprise and their relationship
• Meaning of entropy as −Σ p(x) log₂(p(x))
• That entropy tells us about minimal lengths to encode outcomes of a random variable
• That mutual information tells us information (reduction in entropy) conveyed by one variable about another
• How to calculate entropy
• There is lots more to information theory that we didn’t cover (e.g. conditional MI, measures of information processing, continuous variables, etc.)


References I

R. B. Ash. Information Theory. Dover Publications Inc., New York, 1965.


O. M. Cliff, J. T. Lizier, X. R. Wang, P. Wang, O. Obst, and M. Prokopenko.
Towards quantifying interaction networks in a football match. In S. Behnke,
M. Veloso, A. Visser, and R. Xiong, editors, RoboCup 2013: Robot World Cup
XVII, volume 8371 of Lecture Notes in Computer Science, pages 1–12. Springer,
Berlin/Heidelberg, 2014.
O. M. Cliff, J. T. Lizier, X. R. Wang, P. Wang, O. Obst, and M. Prokopenko.
Quantifying Long-Range Interactions and Coherent Structure in Multi-Agent
Dynamics. 2015. under submission.
T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley-Interscience,
New York, 1991.
C. Gómez, J. T. Lizier, M. Schaum, P. Wollstadt, C. Grützner, P. Uhlhaas, C. M.
Freitag, S. Schlitt, S. Bölte, R. Hornero, and M. Wibral. Reduced predictable
information in brain signals in autism spectrum disorder. Frontiers in
Neuroinformatics, 8:9+, 2014.
J. T. Lizier. Measuring the dynamics of information processing on a local scale in time
and space. In M. Wibral, R. Vicente, and J. T. Lizier, editors, Directed Information
Measures in Neuroscience, Understanding Complex Systems, pages 161–193.
Springer, Berlin/Heidelberg, 2014.


References II

J. T. Lizier, J. Heinzle, A. Horstmann, J.-D. Haynes, and M. Prokopenko. Multivariate


information-theoretic measures reveal directed information structure and task
relevant changes in fMRI connectivity. Journal of Computational Neuroscience, 30
(1):85–107, 2011.
D. J. C. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge
University Press, Cambridge, 2003.
C. E. Shannon. A mathematical theory of communication. Bell System Technical
Journal, 27(3–4):379–423, 623–656, 1948.
