
Entropy and perplexity

Herman Kamper

2024-02, CC BY-SA 4.0

Entropy

Example: The horse race

Perplexity

Entropy and perplexity examples

Cross entropy

Entropy rate

Entropy
If an outcome has a very low probability, then that outcome carries a
lot of information:

• dog bites man
• man bites dog
• it snowed in Chicago
• it snowed in Cape Town

The entropy of a random variable is the average level of information or
uncertainty over the variable's possible outcomes (Wikipedia).

One way to derive entropy is to list what we want from a definition of
information (Peebles, 2001, p. 80):

• Should be large for outcomes with low probability: 1/P(x=k)

• Information from two independent sources should add

• Decision: Information should be positive and should be 0 for a
  certain outcome

• Logarithm is the only function with these properties: log(1/P(x=k))

• Decision: Use base 2, since the smallest choice is between two options

• So the information from x = k is log_2(1/P(x=k)) = −log_2 P(x=k)

• Average information over outcomes: E[−log_2 P(x)]

The entropy of a discrete random variable x taking on possible outcomes
1, 2, ..., K is thus defined as

    H(x) ≜ −∑_{k=1}^{K} P(x=k) log_2 P(x=k)

With log_2, entropy is measured in bits (but can also use other bases).
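
To make the definition concrete, here is a minimal Python sketch (the helper name entropy is my own, not from these notes) that computes H(x) from a list of outcome probabilities:

```python
import math

def entropy(probs):
    """Entropy in bits: H(x) = -sum_k P(x=k) log2 P(x=k)."""
    # Skipping p == 0 implements the convention 0 log 0 = 0
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits (uniform over four outcomes)
print(entropy([0.97, 0.01, 0.01, 0.01]))  # about 0.24 bits (very peaked)
```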

Properties:

• Maximum with most uncertainty: Uniform distribution

• Minimum with least uncertainty: All mass on one outcome

• Entropy of a uniform distribution over K outcomes: log_2 K

Information-theoretic meaning:

The average length of the shortest description of a random variable.

Equivalent to:

• The minimum number of bits per outcome (on average) to encode a
  source.

• The minimum number of yes/no questions (on average) per outcome.
  Questions can be about more than one category, e.g. "Is the outcome
  one of the categories {4, 5, 6, 7}?" (see the uniform horse example
  below).

Example: The horse race
Example from (Cover and Thomas, 2006).

We are at a race track and want to send the winning horse of each
race over a binary channel. There are eight horses in a race.

Uniform distribution
The probability of winning is equal over the horses, i.e.

    (1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8)

One optimal encoding:

Horse Codeword
1 001
2 010
3 011
4 100
5 101
6 110
7 111
8 000

Average codeword length: 3 bits

Does this match the entropy?

    H(x) = −∑_{k=1}^{8} P(x=k) log_2 P(x=k)
         = −∑_{k=1}^{8} (1/8) log_2 (1/8)
         = 3 bits

Non-uniform distribution
Now the probabilities of winning are

    (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64)

What is the entropy?

    H(x) = −(1/2) log_2 (1/2) − (1/4) log_2 (1/4) − (1/8) log_2 (1/8) − (1/16) log_2 (1/16)
           − (1/64) log_2 (1/64) − (1/64) log_2 (1/64) − (1/64) log_2 (1/64) − (1/64) log_2 (1/64)
         = 2 bits

An encoding achieving this:

Horse Codeword
1 0
2 10
3 110
4 1110
5 111100
6 111101
7 111110
8 111111

Example of prefix code: No codeword is a prefix of any other
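
As a quick check (a sketch of my own, not from the notes), the expected codeword length under this prefix code matches the 2-bit entropy:

```python
import math

probs = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
codewords = ["0", "10", "110", "1110", "111100", "111101", "111110", "111111"]

entropy = -sum(p * math.log2(p) for p in probs)              # 2.0 bits
avg_len = sum(p * len(c) for p, c in zip(probs, codewords))  # 2.0 bits per winner on average

print(entropy, avg_len)
```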

Outcome trees
The above example also illustrates that entropy is the minimum number
of yes/no questions (on average) needed to transmit an outcome.

Yes/no questions for the uniform distribution:

[Figure: balanced binary tree of yes/no questions; the eight leaves
correspond to the codewords 000, 001, 010, 011, 100, 101, 110, 111, so
every outcome is reached after exactly three questions.]

Yes/no questions for the non-uniform distribution:

[Figure: unbalanced binary tree of yes/no questions; the leaves correspond
to the codewords 0, 10, 110, 1110, 111100, 111101, 111110, 111111, so the
more likely winners are reached with fewer questions.]

Perplexity
Perplexity can be seen as the weighted number of choices we have to
make for a random discrete variable x:

    PP(x) ≜ 2^{H(x)}

Example: The horse race


The perplexity for the two cases in the horse race:

• Uniform distribution: PP(x) = 2^3 = 8

• Non-uniform distribution: PP(x) = 2^2 = 4
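
A small Python sketch (helper names are my own) that reproduces these two perplexities:

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def perplexity(probs):
    # PP(x) = 2^H(x): the weighted number of choices
    return 2 ** entropy(probs)

print(perplexity([1/8] * 8))                            # 8.0 (uniform)
print(perplexity([1/2, 1/4, 1/8, 1/16] + [1/64] * 4))   # 4.0 (non-uniform)
```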

In the examples below, ask yourself how many outcomes you are really
deciding between.

Entropy and perplexity examples
Single outcome:

P(x = a) = 1        H(x) = −1 log_2 1 = 0 bits
                    PP(x) = 2^0 = 1

Two equally likely outcomes:

P(x = a) = 0.5      H(x) = −0.5 log_2 0.5 − 0.5 log_2 0.5 = 1 bit
P(x = b) = 0.5      PP(x) = 2^1 = 2

Four equally likely outcomes:

P(x = a) = 0.25     H(x) = 2 bits
P(x = b) = 0.25     PP(x) = 4
P(x = c) = 0.25
P(x = d) = 0.25

Four non-uniform outcomes:

P(x = a) = 0.7      H(x) = −0.7 log_2 0.7 − 3 · 0.1 log_2 0.1 = 1.35678 bits
P(x = b) = 0.1      PP(x) = 2^{1.35678} = 2.5611
P(x = c) = 0.1
P(x = d) = 0.1

Four non-uniform outcomes:

P(x = a) = 0.97     H(x) = 0.2419 bits
P(x = b) = 0.01     PP(x) = 1.1826
P(x = c) = 0.01
P(x = d) = 0.01

Four non-uniform outcomes:

P(x = a) = 0.49     H(x) = 1.1414 bits
P(x = b) = 0.49     PP(x) = 2.2060
P(x = c) = 0.01
P(x = d) = 0.01

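All of the H(x) and PP(x) values above are easy to reproduce in a few lines of Python (a sketch; helper names are my own):

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

examples = [[1.0], [0.5, 0.5], [0.25] * 4,
            [0.7, 0.1, 0.1, 0.1], [0.97, 0.01, 0.01, 0.01],
            [0.49, 0.49, 0.01, 0.01]]

for probs in examples:
    H = entropy(probs)
    print(probs, "H =", round(H, 4), "bits, PP =", round(2 ** H, 4))
```
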
The entropy of a uniform distribution over K outcomes is log_2 K:

[Figures: uniform distributions over K = 1, 2, 4, 8 and 6 outcomes,
with H(x) = 0, 1, 2, 3 and 2.5850 bits and PP(x) = 1, 2, 4, 8 and 6,
respectively.]

Any non-uniform distribution over K outcomes has lower entropy than
the corresponding uniform distribution:

[Figures: three distributions over four outcomes, from uniform to
increasingly peaked, with H(x) = 2, 1.35678 and 0.2419 bits and
PP(x) = 4, 2.5611 and 1.1826, respectively.]
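
A quick numerical illustration (my own sketch): random distributions over K = 4 outcomes never exceed the uniform entropy log_2 4 = 2 bits.

```python
import math, random

random.seed(0)

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

K = 4
for _ in range(5):
    w = [random.random() for _ in range(K)]
    probs = [x / sum(w) for x in w]          # a random distribution over K outcomes
    print(round(entropy(probs), 4), "<=", math.log2(K))
```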

Cross entropy
We have two discrete distributions, both over the possible outcomes
1, 2, ..., K. The probability masses of the one distribution are denoted
as p and those of the other as q. The cross entropy is then defined as

    H(p, q) ≜ −∑_{k=1}^{K} P_p(x=k) log_2 P_q(x=k)
            = −∑_{k=1}^{K} p_k log_2 q_k

The cross entropy is the minimum number of bits on average needed to
encode outcomes coming from source p when we use another model q to
construct the codebook.

The cross entropy is an upper bound on the entropy of the source:

H(p) ≤ H(p, q)

The closer model q is to source p, the closer the cross entropy will be
to the entropy. Stated differently, the better the model, the lower the
cross entropy.
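
A minimal Python sketch (my own, using the horse-race distribution as the source and a uniform model) illustrating both the definition and the bound H(p) ≤ H(p, q):

```python
import math

def cross_entropy(p, q):
    # H(p, q) = -sum_k p_k log2 q_k, in bits
    return -sum(pk * math.log2(qk) for pk, qk in zip(p, q) if pk > 0)

p = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]   # true source
q = [1/8] * 8                                        # uniform model

print(cross_entropy(p, p))   # 2.0 bits: H(p, p) = H(p)
print(cross_entropy(p, q))   # 3.0 bits: worse model, higher cross entropy
```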

Side note: Information theory and machine learning


• In information theory the goal is to get a good model of the unknown
  source P(x) so that we can encode x with the shortest code.

• In machine learning, the goal is to get a good model of the unknown
  real-world distribution P(x) so that we can use it to make predictions
  for new x.

Entropy rate
We've looked at the entropy of a single variable. What about sequences?

The entropy rate for a random process generating sequences X is
defined as

    H(X) = lim_{T→∞} (1/T) H(x_{1:T})
         = lim_{T→∞} −(1/T) ∑_{x_{1:T}} P(x_{1:T}) log_2 P(x_{1:T})

We normally don't know the real P(x_{1:T}). If we have a model θ then
we can calculate the cross entropy rate:

    H(p, θ) = lim_{T→∞} −(1/T) ∑_{x_{1:T}} P_p(x_{1:T}) log_2 P_θ(x_{1:T})

using p to explicitly denote the real-world (unknown) distribution.

We still can't calculate this since we don't have infinite sequences. So
we estimate the cross entropy:

    H(p, θ) ≈ −(1/T) log_2 P_θ(x_{1:T}) = H(x_{1:T}, θ)

where x_{1:T} is a long sample from P_p(x_{1:T}), i.e. x_{1:T} ∼ P_p(x_{1:T}).

H(x_{1:T}, θ) is the notation we use for the estimated cross entropy of
the model θ.¹

¹ I've gone a bit crazy in overloading H to mean different things: H(x) for
entropy, H(X) for entropy rate, H(p, q) for cross entropy, and H(x_{1:T}, θ) for
estimated cross entropy. Maybe I should have just written H for all of these and
hoped that the context is enough. Sometimes the term "entropy" is also used
interchangeably for all these things.

This estimated cross entropy is actually what we use to evaluate a
language model θ on some test data x_{1:T}. We normally report the
perplexity:

    PP = 2^{H(x_{1:T}, θ)}
       = 2^{−(1/T) log_2 P_θ(x_{1:T})}
       = P_θ(x_{1:T})^{−1/T}
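
As an illustration (a sketch with made-up numbers, not a real model), suppose a language model θ assigns the following per-token probabilities to a test sequence; the estimated cross entropy and perplexity follow directly:

```python
import math

# Hypothetical per-token probabilities P_theta(x_t | x_{1:t-1}) assigned by a
# model theta to a T-token test sequence
token_probs = [0.2, 0.05, 0.1, 0.4, 0.01, 0.3]
T = len(token_probs)

# H(x_{1:T}, theta) = -(1/T) log2 P_theta(x_{1:T}), with the joint probability
# factorised into the per-token probabilities
cross_entropy = -sum(math.log2(p) for p in token_probs) / T

perplexity = 2 ** cross_entropy   # equivalently P_theta(x_{1:T}) ** (-1 / T)
print(cross_entropy, perplexity)
```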

Side note: Cross-entropy estimate as a Monte Carlo sample


The jump between the equation for cross entropy H(p, θ) and its
estimate H(x_{1:T}, θ) is similar to how we approximate expected values
with Monte Carlo.

Expected values can be approximated (Resnik and Hardisty, 2010):

    E_{p(x)}[f(x)] ≈ (1/L) ∑_{l=1}^{L} f(x^{(l)})

where x^{(l)} ∼ p(x) are samples from p(x). With a single sample, L = 1:

    E_{p(x)}[f(x)] ≈ f(x^{(1)})
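
A small Monte Carlo sketch (my own example distribution and function, not from the notes):

```python
import random

random.seed(0)

# Estimate E_p[f(x)] for a simple discrete p and f by averaging over samples
outcomes = ["a", "b", "c", "d"]
p = [0.7, 0.1, 0.1, 0.1]
f = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0}

exact = sum(pk * f[x] for pk, x in zip(p, outcomes))     # 0.7*1 + 0.1*(2 + 3 + 4) = 1.6
samples = random.choices(outcomes, weights=p, k=10_000)
estimate = sum(f[x] for x in samples) / len(samples)     # close to 1.6

print(exact, estimate)
```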

The entropy rate of written English
By using human participants, Shannon (1951) estimated the per-letter
entropy rate of written English:

    0.6 ≤ H(x_{1:T}) ≤ 1.3 bits per letter

Experiment and bound (roughly):

• Subjects were presented with English text and asked to guess the
  next letter (out of 27)

• Used letters rather than words, since sometimes a subject had to do
  an exhaustive (27-character) search

• Recorded the number of guesses needed to get the correct letter

• Obtained a bound by proving how the number of guesses (a different
  random variable) relates to the entropy of English
The estimate is probably low because he used a single text.

But still: What do these estimates imply when thinking of entropy
as the shortest code in bits (yes/no questions), or perplexity as the
weighted branching factor?

Videos covered in this note
• What are perplexity and entropy? (14 min)

Further reading
For a formal derivation of why entropy is the average length of the
shortest description of a random variable, see Sec. 5.2 and Sec. 5.3
of (Cover and Thomas, 2006). This is a very accessible textbook.

Huffman codes (Cover and Thomas, 2006, Sec. 5.6) give a way to
construct optimal codewords for a given distribution.

Acknowledgements
This note uses content from Sharon Goldwater’s NLP course at the
University of Edinburgh.

References
T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed.,
2006.

P. Z. Peebles, Probability, Random Variables and Random Signal Principles,
4th ed., 2001.

P. Resnik and E. Hardisty, "Gibbs sampling for the uninitiated," University
of Maryland, 2010.

C. E. Shannon, "Prediction and entropy of printed English," Bell System
Technical Journal, 1951.
