Language Models
http://courses.engr.illinois.edu/cs447
Lecture 3:
Language models
Julia Hockenmaier
[email protected]
3324 Siebel Center
Last lecture’s key concepts
Morphology (word structure): stems, affixes
Derivational vs. inflectional morphology
Compounding
Stem changes
Morphological analysis and generation
Finite-state automata
Finite-state transducers
Composing finite-state transducers
Sample space Ω:
The set of all possible outcomes
(all shapes; all words in Alice in Wonderland)
Event ω ⊆ Ω:
A set of outcomes (a subset of Ω)
(predicting ‘the’, picking a triangle)
0 ≤ P(ω) ≤ 1
P(∅) = 0 and P(Ω) = 1
∑_i P(ωi) = 1 if ωi ∩ ωj = ∅ for all j ≠ i, and ⋃_i ωi = Ω
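As a quick check (a minimal sketch; the tiny word list is made up, not actual Alice in Wonderland counts), relative frequencies estimated from a corpus form a distribution that satisfies these axioms:

from collections import Counter

# Toy "corpus"; the tokens are hypothetical stand-ins for corpus counts.
tokens = ["the", "rabbit", "ran", "down", "the", "hole", "the", "end"]
counts = Counter(tokens)
total = sum(counts.values())

# Relative frequencies define a probability distribution over the outcomes.
P = {w: c / total for w, c in counts.items()}

assert all(0.0 <= p <= 1.0 for p in P.values())    # 0 <= P(w) <= 1
assert abs(sum(P.values()) - 1.0) < 1e-12          # disjoint outcomes covering Omega sum to 1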
P(X | Y) = P(X, Y) / P(Y)
P(blue | ) = 2/5
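A minimal sketch of the definition above, with made-up joint counts over colors and shapes (the numbers are hypothetical, chosen so that P(blue | triangle) = 2/5):

from collections import Counter

# Hypothetical joint counts over (color, shape) outcomes.
joint = Counter({("blue", "triangle"): 2, ("red", "triangle"): 3,
                 ("blue", "circle"): 1, ("red", "circle"): 4})
total = sum(joint.values())

def p_joint(color, shape):
    return joint[(color, shape)] / total

def p_shape(shape):
    return sum(c for (_, s), c in joint.items() if s == shape) / total

# P(X | Y) = P(X, Y) / P(Y)
print(p_joint("blue", "triangle") / p_shape("triangle"))   # 0.4 = 2/5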
[Diagram: sampling an outcome: the interval [0,1] is split into segments of length p1, …, p5 with boundaries 0, p1, p1+p2, p1+p2+p3, p1+p2+p3+p4, 1; outcome xi corresponds to the i-th segment.]
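The diagram above corresponds to inverse-transform (roulette-wheel) sampling: draw a uniform number in [0,1) and return the outcome whose cumulative-probability segment it falls into. A minimal sketch, with placeholder outcomes and probabilities:

import random
from itertools import accumulate

outcomes = ["x1", "x2", "x3", "x4", "x5"]
probs    = [0.4, 0.3, 0.15, 0.1, 0.05]     # hypothetical p1..p5, summing to 1

def sample(outcomes, probs):
    """Return the outcome whose cumulative boundary first exceeds a uniform draw."""
    r = random.random()                    # uniform in [0, 1)
    for outcome, cum in zip(outcomes, accumulate(probs)):
        if r < cum:
            return outcome
    return outcomes[-1]                    # guard against floating-point round-off

print(sample(outcomes, probs))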
A LM with lower perplexity is better because it assigns a higher probability to the unseen test corpus.

PP(w1…wN) = ( ∏_{i=1}^{N} 1 / P(wi | w1…wi−1) )^(1/N)
Perplexity PP(w1…wN)
Given a test corpus with N tokens, w1…wN,
and an n-gram model P(wi | wi−1, …, wi−n+1)
we compute its perplexity PP(w1…wN) as follows:
PP(w1…wN) = P(w1…wN)^(−1/N)

          = ( 1 / P(w1…wN) )^(1/N)

          = ( ∏_{i=1}^{N} 1 / P(wi | w1…wi−1) )^(1/N)          (chain rule)

          =def ( ∏_{i=1}^{N} 1 / P(wi | wi−n+1…wi−1) )^(1/N)   (n-gram model)

with

PP(w1…wN) =def exp( −(1/N) ∑_{i=1}^{N} log P(wi | wi−1, …, wi−n+1) )
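A minimal sketch of this computation in code, assuming some function lm_prob(history, word) that returns P(wi | wi−n+1 … wi−1); that function and the toy bigram numbers below are placeholders, not part of the slides. Summing log probabilities avoids numerical underflow on long test corpora:

import math

def perplexity(test_tokens, lm_prob, n=2):
    """PP(w1...wN) = exp(-(1/N) * sum_i log P(wi | wi-n+1 ... wi-1))."""
    N = len(test_tokens)
    log_prob = 0.0
    for i, w in enumerate(test_tokens):
        history = tuple(test_tokens[max(0, i - n + 1):i])
        p = lm_prob(history, w)            # must be > 0: unseen events need smoothing
        log_prob += math.log(p)
    return math.exp(-log_prob / N)

# Usage with a toy (hypothetical) bigram table; unseen events get a tiny floor probability.
toy_bigram = {((), "the"): 0.2, (("the",), "cat"): 0.1, (("cat",), "sat"): 0.3}
print(perplexity(["the", "cat", "sat"], lambda h, w: toy_bigram.get((h, w), 1e-6)))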
Task-based evaluation:
- Train model A, plug it into your system for performing task T
- Evaluate performance of system A on task T.
- Train model B, plug it in, evaluate system B on same task T.
- Compare scores of system A and system B on task T (see the sketch below).
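A minimal sketch of that loop; train_lm, build_system, and evaluate_on_task are hypothetical placeholders for whatever training procedure, downstream system, and task metric you actually use:

def compare_extrinsically(train_data, task_data, train_lm, build_system, evaluate_on_task):
    """Train two LMs, plug each into the downstream system, and compare task scores."""
    scores = {}
    for name, config in [("A", {"n": 2}), ("B", {"n": 3})]:     # e.g. bigram vs. trigram
        lm = train_lm(train_data, **config)                     # train model A / B
        system = build_system(lm)                               # plug it into the system for task T
        scores[name] = evaluate_on_task(system, task_data)      # evaluate on task T
    return scores                                               # compare the two scores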
[Figure: English words sorted by frequency on log-log axes: frequency of the r-th most common word wr (y-axis, log scale) vs. rank (x-axis, log scale). A few words are very frequent; most words are very rare. w1 = the, w2 = to, …, w5346 = computer, …]
In natural language:
- A small number of events (e.g. words) occur with high frequency
- A large number of events occur with very low frequency (see the sketch below)
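A minimal sketch of how to see this in a corpus (the file name is a placeholder): count word frequencies and sort by rank; the head is a handful of very frequent words, the tail is a huge number of words seen only once.

from collections import Counter

# Hypothetical corpus file; any large plain-text file shows the same pattern.
with open("corpus.txt", encoding="utf-8") as f:
    words = f.read().lower().split()

ranked = Counter(words).most_common()          # word types sorted by frequency
print(ranked[:5])                              # a few very frequent words (the, to, of, ...)

singletons = sum(1 for _, c in ranked if c == 1)
print(f"{singletons} of {len(ranked)} word types occur exactly once")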
So…
… we can't actually evaluate our MLE models on unseen test data (or system output): any word or n-gram that never occurred in the training data gets probability zero under MLE, so the test corpus as a whole gets probability zero and its perplexity becomes infinite.
Today’s reading:
Jurafsky and Martin, Chapter 4, sections 1-4 (2008 edition)
Chapter 3 (3rd Edition)