04 Entropy Perplexity Notes
Herman Kamper
Entropy
Perplexity
Cross entropy
Entropy rate
Entropy
If an outcome has a very low probability, then that outcome carries a lot of information. Entropy is the average information content over the outcomes of a discrete random variable x:
H(x) \triangleq -\sum_{k=1}^{K} P(x = k) \log_2 P(x = k)
With \log_2, entropy is measured in bits (other bases can also be used).
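As a small sketch (not part of the original note), the definition can be computed directly; the function name is arbitrary:

```python
import numpy as np

def entropy(p):
    """Entropy in bits of a discrete distribution given as a list/array of masses."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # terms with zero mass contribute 0 by convention
    return -np.sum(p*np.log2(p))

print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75 bits
```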
Properties:
Information-theoretic meaning:
Equivalent to:
Example: The horse race
Example from (Cover and Thomas, 2006).
We are at a race track and want to send the identity of the winning horse of each race over a binary channel. There are eight horses in a race.
Uniform distribution
The probability of winning is equal over the horses, i.e.
\left( \tfrac{1}{8}, \tfrac{1}{8}, \tfrac{1}{8}, \tfrac{1}{8}, \tfrac{1}{8}, \tfrac{1}{8}, \tfrac{1}{8}, \tfrac{1}{8} \right)
Horse Codeword
1 001
2 010
3 011
4 100
5 101
6 110
7 111
8 000
Non-uniform distribution
Now the probabilities of winning are
\left( \tfrac{1}{2}, \tfrac{1}{4}, \tfrac{1}{8}, \tfrac{1}{16}, \tfrac{1}{64}, \tfrac{1}{64}, \tfrac{1}{64}, \tfrac{1}{64} \right)
Horse Codeword
1 0
2 10
3 110
4 1110
5 111100
6 111101
7 111110
8 111111
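As a rough check (a sketch added here, not part of the original note), we can compute the entropy of each distribution and the expected codeword length of the second code directly:

```python
import numpy as np

def entropy(p):
    """Entropy in bits of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p*np.log2(p))

uniform = [1/8]*8
nonuniform = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
code_lengths = [1, 2, 3, 4, 6, 6, 6, 6]  # lengths of 0, 10, 110, 1110, 111100, ...

print(entropy(uniform))                  # 3.0 bits; the uniform code uses 3 bits per horse
print(entropy(nonuniform))               # 2.0 bits
print(np.dot(nonuniform, code_lengths))  # 2.0 bits expected codeword length: matches the entropy
```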
Outcome trees
The above example also illustrates that entropy is the minimum number
of yes/no questions (on average) needed to transmit an outcome.
[Figure: two binary outcome trees, one for each code above, with leaves labelled by the codewords 000 to 111 and 0, 10, 110, 1110, 111100 to 111111.]
Perplexity
Perplexity can be seen as the weighted number of choices we have to make for a discrete random variable x:
\mathrm{PP}(x) \triangleq 2^{H(x)}
In the examples below, ask yourself how many outcomes you are really deciding between.
Entropy and perplexity examples
Single outcome:
P(x = a) = 1

H(x) = -1 \log_2 1 = 0 \text{ bits}

\mathrm{PP}(x) = 2^0 = 1

[Figure: a distribution over outcomes a, b, c with all probability mass on a.]
Four equally likely outcomes:
[Figure: a distribution over outcomes a, b, c, d, each with probability 1/4.]
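A worked calculation for this case (the slide leaves it to be filled in by hand):

H(x) = -4 \cdot \tfrac{1}{4} \log_2 \tfrac{1}{4} = 2 \text{ bits}, \qquad \mathrm{PP}(x) = 2^2 = 4

So we are genuinely deciding between four outcomes.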
Four non-uniform outcomes:
[Figure: a non-uniform distribution over outcomes a, b, c, d.]
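The slide's actual probabilities are worked out by hand and do not survive in the text; as an illustrative example, assume the masses are (1/2, 1/4, 1/8, 1/8):

H(x) = -\left( \tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{4}\log_2\tfrac{1}{4} + \tfrac{1}{8}\log_2\tfrac{1}{8} + \tfrac{1}{8}\log_2\tfrac{1}{8} \right) = 1.75 \text{ bits}, \qquad \mathrm{PP}(x) = 2^{1.75} \approx 3.36

So although there are four outcomes, we are effectively deciding between about 3.4 of them.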
The entropy of a uniform distribution over K outcomes is \log_2 K:
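A short derivation (the slide leaves this to be worked through by hand):

H(x) = -\sum_{k=1}^{K} \tfrac{1}{K} \log_2 \tfrac{1}{K} = -K \cdot \tfrac{1}{K} \log_2 \tfrac{1}{K} = \log_2 K

Correspondingly, \mathrm{PP}(x) = 2^{\log_2 K} = K: for a uniform distribution the perplexity is exactly the number of outcomes.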
Cross entropy
We have two discrete distributions, both over the possible outcomes 1, 2, . . . , K. The masses of the one distribution are denoted as p and those of the other as q. The cross entropy is then defined as
H(p, q) \triangleq -\sum_{k=1}^{K} P_p(x = k) \log_2 P_q(x = k) = -\sum_{k=1}^{K} p_k \log_2 q_k
H(p) ≤ H(p, q)
The closer model q is to source p, the closer the cross entropy will be
to the entropy. Stated differently, the better the model, the lower the
cross entropy.
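A small sketch (not from the note) illustrating this: the function and distribution values below are made up for illustration, with a "good" and a "bad" model of the same source p.

```python
import numpy as np

def cross_entropy(p, q):
    """Cross entropy H(p, q) in bits between two discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p*np.log2(q))

p = np.array([0.5, 0.25, 0.125, 0.125])    # source
q_good = np.array([0.4, 0.3, 0.15, 0.15])  # model close to p
q_bad = np.array([0.25, 0.25, 0.25, 0.25]) # model far from p

print(cross_entropy(p, p))       # 1.75 bits = H(p), the lower bound
print(cross_entropy(p, q_good))  # about 1.78 bits, slightly above H(p)
print(cross_entropy(p, q_bad))   # 2.0 bits, higher still
```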
Entropy rate
We’ve looked at the entropy of a single variable. What about se-
quences?
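The definition itself is worked through by hand in the slide; the standard definition (consistent with the footnote's H(X) notation, and with Cover and Thomas, 2006) is

H(X) \triangleq \lim_{T \to \infty} \frac{1}{T} H(x_1, x_2, \ldots, x_T)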
¹ I've gone a bit crazy in overloading H to mean different things: H(x) for entropy, H(X) for entropy rate, H(p, q) for cross entropy, and H(x_{1:T}, θ) for estimated cross entropy. Maybe I should have just written H for all of these and hoped that the context is enough. Sometimes the term "entropy" is also used interchangeably for all these things.
This estimated cross entropy is actually what we use to evaluate a language model θ on some test data x_{1:T}. We normally report the perplexity:

\mathrm{PP} = 2^{H(x_{1:T}, \theta)} = 2^{-\frac{1}{T} \log_2 P_\theta(x_{1:T})} = P_\theta(x_{1:T})^{-\frac{1}{T}}
Here x^{(l)} \sim p(x) are samples from p(x) used to approximate the cross entropy; with a single sample (L = 1), the test sequence x_{1:T} itself plays this role.
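As a minimal sketch (not from the note), here is this evaluation for a toy unigram "language model"; the vocabulary, probabilities, and test sequence are made up for illustration:

```python
import numpy as np

model = {"the": 0.5, "cat": 0.3, "sat": 0.2}   # P_theta(w), assumed for illustration
test = ["the", "cat", "sat", "the", "cat"]

log2_prob = sum(np.log2(model[w]) for w in test)  # log2 P_theta(x_{1:T}), unigram independence assumed
cross_entropy = -log2_prob/len(test)              # estimated cross entropy H(x_{1:T}, theta)
perplexity = 2**cross_entropy                     # PP = 2^H

print(cross_entropy, perplexity)
```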
The entropy rate of written English
By using human participants, Shannon (1951) estimated the per-letter
entropy rate of written English:
Videos covered in this note
• What are perplexity and entropy? (14 min)
Further reading
For a formal derivation of why entropy is the average length of the
shortest description of a random variable, see Sec. 5.2 and Sec. 5.3
of (Cover and Thomas, 2006). This is a very accessible textbook.
Huffman codes (Cover and Thomas, 2006, Sec. 5.6) give a way to
construct optimal codewords for a given distribution.
Acknowledgements
This note uses content from Sharon Goldwater’s NLP course at the
University of Edinburgh.
References
C. E. Shannon, "Prediction and entropy of printed English," Bell System Technical Journal, 1951.
T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed., 2006.