Language Modeling
Introduction to N-grams
Dan Jurafsky
• More variables:
$P(A,B,C,D) = P(A)\,P(B \mid A)\,P(C \mid A,B)\,P(D \mid A,B,C)$
• The Chain Rule in general:
$P(x_1, x_2, x_3, \ldots, x_n) = P(x_1)\,P(x_2 \mid x_1)\,P(x_3 \mid x_1, x_2) \cdots P(x_n \mid x_1, \ldots, x_{n-1})$
The Chain Rule applied to compute the joint probability of words in a sentence:
$P(w_1 w_2 \ldots w_n) = \prod_i P(w_i \mid w_1 w_2 \ldots w_{i-1})$
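As a quick worked instance (using the example phrase that appears later in these slides), the chain rule expands a short sentence exactly; no approximation is involved yet:

$$P(\text{its water is so transparent}) = P(\text{its}) \times P(\text{water} \mid \text{its}) \times P(\text{is} \mid \text{its water}) \times P(\text{so} \mid \text{its water is}) \times P(\text{transparent} \mid \text{its water is so})$$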
Markov Assumption
• Simplifying assumption (Andrei Markov):
$P(\text{the} \mid \text{its water is so transparent that}) \approx P(\text{the} \mid \text{that})$
• Or maybe:
$P(\text{the} \mid \text{its water is so transparent that}) \approx P(\text{the} \mid \text{transparent that})$
Markov Assumption
In general, we approximate each component in the product with a shorter history:
$P(w_i \mid w_1 w_2 \ldots w_{i-1}) \approx P(w_i \mid w_{i-k} \ldots w_{i-1})$

Simplest case: the unigram model
$P(w_1 w_2 \ldots w_n) \approx \prod_i P(w_i)$

Some automatically generated sentences from a unigram model (examples shown on the original slide):
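Those generated examples are not reproduced in this extraction. As a rough, hedged illustration of how such sentences arise, here is a minimal sketch of sampling word-by-word from a unigram distribution; the vocabulary and probabilities below are invented for illustration only:

```python
import random

# Toy unigram model P(w) over an invented vocabulary (illustrative only).
unigram_probs = {
    "the": 0.30, "of": 0.15, "is": 0.15, "water": 0.15,
    "transparent": 0.10, "</s>": 0.15,
}

def sample_unigram_sentence(probs, max_len=20):
    """Sample words independently from P(w) until </s> or max_len."""
    vocab = list(probs.keys())
    weights = [probs[w] for w in vocab]
    words = []
    for _ in range(max_len):
        w = random.choices(vocab, weights=weights, k=1)[0]
        if w == "</s>":
            break
        words.append(w)
    return " ".join(words)

print(sample_unigram_sentence(unigram_probs))
```

Because each word is drawn independently, the output is word salad, which is the behavior the slide's examples illustrate.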
Bigram model
Condition on the previous word:
$P(w_i \mid w_1 w_2 \ldots w_{i-1}) \approx P(w_i \mid w_{i-1})$
N-gram models
• We can extend to trigrams, 4-grams, 5-grams (a small n-gram extraction sketch follows this list)
• In general this is an insufficient model of language
• because language has long-distance dependencies:
  “The computer(s) which I had just put into the machine room on the fifth floor is (are) crashing.”
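As a small, hedged illustration of what extending to higher orders means mechanically, here is a generic n-gram extraction sketch (the function name is mine, not from the lecture):

```python
def ngrams(tokens, n):
    """Return all n-grams (as tuples) of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "its water is so transparent".split()
print(ngrams(tokens, 2))  # bigrams:  [('its', 'water'), ('water', 'is'), ...]
print(ngrams(tokens, 3))  # trigrams: [('its', 'water', 'is'), ...]
```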
Estimating bigram probabilities (the maximum likelihood estimate):
$P(w_i \mid w_{i-1}) = \frac{\text{count}(w_{i-1}, w_i)}{\text{count}(w_{i-1})} = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}$

An example
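A minimal sketch of this estimate in code; the tiny corpus and function names below are illustrative, not taken from the lecture:

```python
from collections import Counter

# Tiny illustrative corpus with sentence-boundary markers.
corpus = [
    "<s> i am sam </s>",
    "<s> sam i am </s>",
    "<s> i do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def p_mle(word, prev):
    """Maximum likelihood estimate: P(word | prev) = c(prev, word) / c(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(p_mle("i", "<s>"))   # 2/3
print(p_mle("sam", "am"))  # 1/2
```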
More examples: Berkeley Restaurant Project sentences
• can you tell me about any good cantonese restaurants close by
• mid priced thai food is what i’m looking for
• tell me about chez panisse
• can you give me a listing of the kinds of food that are available
• i’m looking for a good place to eat breakfast
• when is caffe venezia open during the day
• Result: (bigram probability table for these sentences, shown as a figure in the slides)
Practical Issues
• We do everything in log space
  • Avoid underflow
  • (also adding is faster than multiplying)
$\log(p_1 \times p_2 \times p_3 \times p_4) = \log p_1 + \log p_2 + \log p_3 + \log p_4$
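A hedged sketch of the log-space computation (the probability values are invented for illustration):

```python
import math

# Multiplying many small probabilities underflows; summing logs does not.
bigram_probs = [0.25, 0.0011, 0.0065, 0.21, 0.0005]  # illustrative values

log_prob = sum(math.log(p) for p in bigram_probs)
print(log_prob)            # total log-probability of the "sentence"
print(math.exp(log_prob))  # back to a probability, if it is not too small
```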
Google N-gram release:
http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
Perplexity

The best language model is one that best predicts an unseen test set
• Gives the highest P(sentence)

Perplexity is the inverse probability of the test set, normalized by the number of words:

$PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}$

Chain rule:
$PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \ldots w_{i-1})}}$

For bigrams:
$PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}$
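A minimal sketch of computing bigram perplexity on a test sequence, in log space; the toy conditional model and names below are illustrative, and a real model would need smoothing so that no test bigram gets zero probability:

```python
import math

def bigram_perplexity(test_tokens, p):
    """PP(W) = (prod_i 1 / P(w_i | w_{i-1}))^(1/N), computed in log space."""
    log_sum, n = 0.0, 0
    for prev, word in zip(test_tokens, test_tokens[1:]):
        log_sum += math.log(p(word, prev))  # assumes p(word, prev) > 0
        n += 1
    return math.exp(-log_sum / n)

# Toy conditional model for illustration; in practice this would be a
# smoothed estimate trained on a corpus.
toy_probs = {("<s>", "i"): 0.5, ("i", "am"): 0.5,
             ("am", "sam"): 0.5, ("sam", "</s>"): 0.5}

def toy_p(word, prev):
    return toy_probs.get((prev, word), 0.1)

print(bigram_perplexity("<s> i am sam </s>".split(), toy_p))  # -> 2.0
```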
1gram:
– To him swallowed confess hear both. Which. Of save on trail for are ay device and rote life have
– Hill he late speaks; or! a more to leg less first you enter
2gram:
– Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry. Live king. Follow.
– What means, sir. I confess she? then all sorts, he is trim, captain.
3gram:
– Fly, and will rid me these news of price. Therefore the sadness of parting, as they say, ’tis done.
– This shall forbid it should be branded, if renown made it empty.
4gram:
– King Henry. What! I will go seek the traitor Gloucester. Exeunt some of the watch. A great banquet serv’d in;
– It cannot be but so.

Figure 4.3 Eight sentences randomly generated from four N-grams computed from Shakespeare’s works. All characters were mapped to lower-case and punctuation marks were treated as words. Output is hand-corrected for capitalization to improve readability.
1gram:
– Months the my and issue of year foreign new exchange’s september were recession exchange new endorsed a acquire to six executives
2gram:
– Last December through the way to preserve the Hudson corporation N. B. E. C. Taylor would seem to complete the major central planners one point five percent of U. S. E. has already old M. X. corporation of living on information such as more frequently fishing to keep her
3gram:
– They also point to ninety nine point six billion dollars from two hundred four oh six three percent of the rates of interest stores as Mexico and Brazil on market conditions

Figure 4.4 Three sentences randomly generated from three N-gram models computed from 40 million words of the Wall Street Journal, lower-casing all characters and treating punctuation as words.
Zeros
• Training set:
  … denied the allegations
  … denied the reports
  … denied the claims
  … denied the request
• Test set:
  … denied the offer
  … denied the loan
P(w | denied the) — counts in the training set:
  allegations: 3
  reports: 2
  claims: 1
  request: 1
  total: 7
  (other words such as outcome, attack, man have count 0)

• Steal probability mass to generalize better

P(w | denied the) — after smoothing:
  allegations: 2.5
  reports: 1.5
  claims: 0.5
  request: 0.5
  other: 2
  total: 7
Add-one estimation
• Add-1 estimate (V is the vocabulary size):
$P_{\text{Add-1}}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + 1}{c(w_{i-1}) + V}$
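A minimal sketch of the add-1 estimate on the same kind of toy counts as the earlier bigram sketch (illustrative only):

```python
from collections import Counter

# Toy counts, as in the earlier bigram sketch (illustrative only).
corpus = ["<s> i am sam </s>", "<s> sam i am </s>",
          "<s> i do not like green eggs and ham </s>"]
unigram_counts, bigram_counts = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

V = len(unigram_counts)  # vocabulary size

def p_add1(word, prev):
    """Add-1 (Laplace) estimate: (c(prev, word) + 1) / (c(prev) + V)."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

print(p_add1("i", "<s>"))      # seen bigram
print(p_add1("green", "<s>"))  # unseen bigram: no longer zero
```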
Laplace-smoothed bigrams (table shown as a figure in the slides)

Reconstituted counts (table shown as a figure in the slides)
$S(w_i) = \frac{\text{count}(w_i)}{N}$ (the unigram relative frequency)
Advanced: Kneser-Ney Smoothing
Absolute discounting: just subtract a fixed discount d from each count
$P_{\text{AbsoluteDiscounting}}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) - d}{c(w_{i-1})} + \lambda(w_{i-1})\,P(w)$
(the second term interpolates with the unigram distribution P(w); a small sketch follows the next two bullets)
• (Maybe keeping a couple extra values of d for counts 1 and 2)
• But should we really just use the regular unigram P(w)?
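A hedged sketch of absolute discounting interpolated with the unigram distribution; the corpus, names, and d = 0.75 are illustrative choices, not something the lecture fixes:

```python
from collections import Counter

# Rebuild the toy counts from the earlier bigram sketch (illustrative only).
sentences = [s.split() for s in ["<s> i am sam </s>", "<s> sam i am </s>",
                                 "<s> i do not like green eggs and ham </s>"]]
unigram_counts = Counter(t for toks in sentences for t in toks)
bigram_counts = Counter(b for toks in sentences for b in zip(toks, toks[1:]))

N = sum(unigram_counts.values())
D = 0.75  # discount

def p_abs_discount(word, prev):
    """max(c(prev, word) - d, 0) / c(prev) + lambda(prev) * P_unigram(word)."""
    # lambda(prev) redistributes exactly the mass removed by the discount.
    seen_types = sum(1 for (p, _) in bigram_counts if p == prev)
    lam = D * seen_types / unigram_counts[prev]
    return (max(bigram_counts[(prev, word)] - D, 0) / unigram_counts[prev]
            + lam * unigram_counts[word] / N)

print(p_abs_discount("am", "i"))   # seen bigram
print(p_abs_discount("ham", "i"))  # unseen bigram: falls back to the unigram
```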
Kneser-Ney Smoothing I
• Better estimate for probabilities of lower-order unigrams!
• Shannon game: I can’t see without my reading ___________? (candidate completions: “Francisco” vs. “glasses”)
• “Francisco” is more common than “glasses”
• … but “Francisco” always follows “San”
• The unigram is useful exactly when we haven’t seen this bigram!
• Instead of P(w): “How likely is w?”
• Pcontinuation(w): “How likely is w to appear as a novel continuation?”
  • For each word, count the number of bigram types it completes
  • Every bigram type was a novel continuation the first time it was seen
$P_{\text{CONTINUATION}}(w) \propto \left|\{w_{i-1} : c(w_{i-1}, w) > 0\}\right|$
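A hedged sketch of computing continuation probabilities from bigram types; the toy bigram list and names are illustrative:

```python
from collections import defaultdict

# Toy bigram tokens; in practice these come from a training corpus.
bigrams = [("san", "francisco"), ("san", "francisco"), ("san", "francisco"),
           ("reading", "glasses"), ("my", "glasses"), ("his", "glasses")]

# Distinct left contexts each word completes: |{w_prev : c(w_prev, w) > 0}|
left_contexts = defaultdict(set)
for prev, word in bigrams:
    left_contexts[word].add(prev)

total_bigram_types = len(set(bigrams))

def p_continuation(word):
    """Normalize the continuation count by the total number of bigram types."""
    return len(left_contexts[word]) / total_bigram_types

print(p_continuation("francisco"))  # 0.25: frequent, but only ever after "san"
print(p_continuation("glasses"))    # 0.75: completes several different bigram types
```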
• A frequent word (Francisco) occurring in only one context (San) will have a low continuation probability
$P_{KN}(w_i \mid w_{i-1}) = \frac{\max(c(w_{i-1}, w_i) - d,\, 0)}{c(w_{i-1})} + \lambda(w_{i-1})\,P_{\text{CONTINUATION}}(w_i)$

λ is a normalizing constant: the probability mass we’ve discounted

$\lambda(w_{i-1}) = \frac{d}{c(w_{i-1})}\,\left|\{w : c(w_{i-1}, w) > 0\}\right|$

Here $d / c(w_{i-1})$ is the normalized discount, and $\left|\{w : c(w_{i-1}, w) > 0\}\right|$ is the number of word types that can follow $w_{i-1}$ (= the number of word types we discounted = the number of times we applied the normalized discount).
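A minimal sketch of interpolated Kneser-Ney for bigrams, combining the discounting and continuation pieces above; the corpus, names, and d = 0.75 are illustrative assumptions:

```python
from collections import Counter, defaultdict

sentences = [s.split() for s in ["<s> i am sam </s>", "<s> sam i am </s>",
                                 "<s> i do not like green eggs and ham </s>"]]
unigram_counts = Counter(t for toks in sentences for t in toks)
bigram_counts = Counter(b for toks in sentences for b in zip(toks, toks[1:]))

D = 0.75  # discount

# Continuation counts: distinct left contexts per word, and followers per context.
left_contexts, followers = defaultdict(set), defaultdict(set)
for prev, word in bigram_counts:
    left_contexts[word].add(prev)
    followers[prev].add(word)
total_bigram_types = len(bigram_counts)

def p_kn(word, prev):
    """max(c(prev,word) - d, 0) / c(prev) + lambda(prev) * P_continuation(word)."""
    discounted = max(bigram_counts[(prev, word)] - D, 0) / unigram_counts[prev]
    lam = D * len(followers[prev]) / unigram_counts[prev]
    p_cont = len(left_contexts[word]) / total_bigram_types
    return discounted + lam * p_cont

print(p_kn("am", "i"))   # seen bigram
print(p_kn("ham", "i"))  # unseen bigram: weight shifts to the continuation probability
```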
In the general recursive formulation:

$P_{KN}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max\left(c_{KN}(w_{i-n+1}^{i}) - d,\ 0\right)}{c_{KN}(w_{i-n+1}^{i-1})} + \lambda(w_{i-n+1}^{i-1})\,P_{KN}(w_i \mid w_{i-n+2}^{i-1})$

where $c_{KN}(\cdot)$ is the ordinary count for the highest-order n-gram and the continuation count for lower-order n-grams.