18-IntroNLP II
▪ Spell correction
  ▪ The user typed “graffe”. Which is closest?
    ▪ graf
    ▪ graft
    ▪ grail
    ▪ giraffe
• Computational Biology
  • Align two sequences of nucleotides:
    AGGCTATCACCTGACCTCCAGGCCGATGCCC
    TAGCTATCACGACCGCGGTCGATTTGCCCGAC
  • Resulting alignment:
    -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
    TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
▪ Given two sequences, align each letter to a letter or gap
  ▪ An alignment:
    -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
    TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
▪ Evaluating Machine Translation and speech recognition
▪ But the space of all edit sequences is huge!
▪ We can’t afford to navigate naïvely
▪ Lots of distinct paths wind up at the same state.
▪ We don’t have to keep track of all of them
▪ Just the shortest path to each of those revisited states.
▪For two strings
▪ X of length n
▪ Y of length m
▪We define D(i,j)
▪ the edit distance between X[1..i] and Y[1..j]
▪ i.e., the first i characters of X and the first j characters of Y
▪ The edit distance between X and Y is thus D(n,m)
Definition of Minimum Edit Distance
Computing Minimum Edit Distance
▪ Dynamic programming: A tabular computation of D(n,m)
▪ Solving problems by combining solutions to subproblems.
▪ Bottom-up
▪ We compute D(i,j) for small i,j
▪ And compute larger D(i,j) based on previously computed smaller values
▪ i.e., compute D(i,j) for all i (0 < i < n) and j (0 < j < m)
▪ Initialization
D(i,0) = i
D(0,j) = j
▪ Recurrence Relation:
  For each i = 1…N
    For each j = 1…M
      D(i,j) = min of:
        D(i-1,j) + 1
        D(i,j-1) + 1
        D(i-1,j-1) + 2 if X(i) ≠ Y(j); + 0 if X(i) = Y(j)
▪ Termination:
  D(N,M) is the distance (N = |X|, M = |Y|)
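A minimal Python sketch of this recurrence (assuming the costs above: insertions and deletions cost 1, substitutions cost 2):

```python
def min_edit_distance(source, target, sub_cost=2):
    """D(i, j) = edit distance between source[:i] and target[:j]."""
    n, m = len(source), len(target)
    # Initialization: D(i,0) = i, D(0,j) = j
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i
    for j in range(1, m + 1):
        D[0][j] = j
    # Recurrence: fill the table bottom-up from previously computed smaller values
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j] + 1,                                                 # deletion
                D[i][j - 1] + 1,                                                 # insertion
                D[i - 1][j - 1] + (0 if source[i - 1] == target[j - 1] else sub_cost),
            )
    # Termination: D(n, m) is the distance
    return D[n][m]

print(min_edit_distance("intention", "execution"))  # 8 with substitution cost 2
```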
The Edit Distance Table

N  9
O  8
I  7
T  6
N  5
E  4
T  3
N  2
I  1
#  0  1  2  3  4  5  6  7  8  9
   #  E  X  E  C  U  T  I  O  N
N  9  8  9 10 11 12 11 10  9  8
O  8  7  8  9 10 11 10  9  8  9
I  7  6  7  8  9 10  9  8  9 10
T  6  5  6  7  8  9  8  9 10 11
N  5  4  5  6  7  8  9 10 11 10
E  4  3  4  5  6  7  8  9 10  9
T  3  4  5  6  7  8  7  8  9  8
N  2  3  4  5  6  7  8  7  8  7
I  1  2  3  4  5  6  7  6  7  8
#  0  1  2  3  4  5  6  7  8  9
   #  E  X  E  C  U  T  I  O  N
Computing Minimum Edit Distance
Backtrace for Computing Alignments
▪ Edit distance isn’t sufficient
▪ We often need to align each character of the two strings to each other
(Edit distance table for INTENTION vs. EXECUTION with backtrace pointers; the pointer arrows are not recoverable from the extracted slide.)
▪ Base conditions:
  D(i,0) = i
  D(0,j) = j
▪ Termination:
  D(N,M) is the distance
▪ Recurrence Relation:
  For each i = 1…N
    For each j = 1…M
      D(i,j) = min of:
        D(i-1,j) + 1                                (deletion)
        D(i,j-1) + 1                                (insertion)
        D(i-1,j-1) + 2 if X(i) ≠ Y(j); + 0 if X(i) = Y(j)   (substitution)
      ptr(i,j) = LEFT (insertion) / DOWN (deletion) / DIAG (substitution)
▪ A path through the table from (0,0) to (N,M) corresponds to an alignment of the two sequences
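A sketch of the same DP with a pointer table and a backtrace that reads off one alignment; the '*' gap symbol and the tie-breaking order are arbitrary choices, not from the slides:

```python
def align(x, y, sub_cost=2):
    """Return (distance, aligned_x, aligned_y) using pointers stored during the DP."""
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0], ptr[i][0] = i, "DOWN"   # deletions down the first column
    for j in range(1, m + 1):
        D[0][j], ptr[0][j] = j, "LEFT"   # insertions across the first row
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [
                (D[i - 1][j] + 1, "DOWN"),                                              # deletion
                (D[i][j - 1] + 1, "LEFT"),                                              # insertion
                (D[i - 1][j - 1] + (0 if x[i - 1] == y[j - 1] else sub_cost), "DIAG"),  # (mis)match
            ]
            D[i][j], ptr[i][j] = min(candidates)
    # Walk the pointers back from (n, m) to (0, 0), emitting one alignment column per step.
    ax, ay, i, j = [], [], n, m
    while i > 0 or j > 0:
        if ptr[i][j] == "DIAG":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i, j = i - 1, j - 1
        elif ptr[i][j] == "DOWN":
            ax.append(x[i - 1]); ay.append("*"); i -= 1   # gap in y
        else:
            ax.append("*"); ay.append(y[j - 1]); j -= 1   # gap in x
    return D[n][m], "".join(reversed(ax)), "".join(reversed(ay))

print(align("intention", "execution"))
```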
▪ Recurrence Relation:
  D(i,j) = min of:
    D(i-1,j) + del[x(i)]
    D(i,j-1) + ins[y(j)]
    D(i-1,j-1) + sub[x(i), y(j)]
▪ Termination:
  D(N,M) is the distance
“…The 1950s were not good years for mathematical research. [the] Secretary of Defense
…had a pathological fear and hatred of the word, research…
I wanted to get across the idea that this was dynamic, this was multistage… I thought, let’s
… take a word that has an absolutely precise meaning, namely dynamic… it’s impossible
to use the word, dynamic, in a pejorative sense. Try thinking of some combination that will
possibly give it a pejorative meaning. It’s impossible.
Thus, I thought dynamic programming was a good name. It was something not even a
Congressman could object to.”
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
▪ Comparing genes or regions from different species
  ▪ to find important regions
  ▪ determine function
  ▪ uncover evolutionary forces
▪ Recurrence Relation:
  D(i,j) = max of:
    D(i-1,j) - d
    D(i,j-1) - d
    D(i-1,j-1) + s[x(i), y(j)]
▪ Termination:
  D(N,M) is the optimal alignment score
----------CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC
GCGAGTTCATCTATCAC--GACCGC--GGTCG--------------
Example: Search for a mouse gene within a human chromosome
1. Initialization
For all i, j,
F(i, 0) = 0
F(0, j) = 0
2. Termination
   FOPT = max( maxi F(i, N), maxj F(M, j) )
x = aaaacccccggggtta
y = ttcccgggaaccaacc
Modifications to Needleman-Wunsch:
  Initialization: F(0, j) = 0
                  F(i, 0) = 0
  Iteration: F(i, j) = max of:
    0
    F(i - 1, j) - d
    F(i, j - 1) - d
    F(i - 1, j - 1) + s(xi, yj)
  Termination:
    1. If we want the best local alignment…
Slide from Serafim Batzoglou
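A minimal sketch of this local-alignment iteration in Python; the gap penalty d and the match/mismatch values used for s(xi, yj) below are illustrative, not taken from the slides:

```python
def smith_waterman_score(x, y, d=1, match=2, mismatch=-1):
    """Best local alignment score: F(i,j) = max(0, F(i-1,j)-d, F(i,j-1)-d, F(i-1,j-1)+s)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]   # initialization: first row/column are 0
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(0,                    # allow the alignment to restart anywhere
                          F[i - 1][j] - d,      # gap in y
                          F[i][j - 1] - d,      # gap in x
                          F[i - 1][j - 1] + s)  # match / mismatch
            best = max(best, F[i][j])           # the best local alignment can end anywhere
    return best

print(smith_waterman_score("aaaacccccggggtta", "ttcccgggaaccaacc"))
```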
▪More variables:
P(A,B,C,D) = P(A)P(B|A)P(C|A,B)P(D|A,B,C)
▪The Chain Rule in General
P(x1,x2,x3,…,xn) = P(x1)P(x2|x1)P(x3|x1,x2)…P(xn|x1,…,xn-1)
P(w1w2 … wn) = ∏i P(wi | w1w2 … wi-1)
▪Or maybe
▪The Maximum Likelihood Estimate for bigram probabilities:
  P(wi | wi-1) = count(wi-1, wi) / count(wi-1) = c(wi-1, wi) / c(wi-1)
▪An example:
  <s> I am Sam </s>
  <s> Sam I am </s>
  <s> I do not like green eggs and ham </s>
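A quick sketch of computing these MLE bigram estimates from the mini-corpus above:

```python
from collections import Counter

corpus = ["<s> I am Sam </s>",
          "<s> Sam I am </s>",
          "<s> I do not like green eggs and ham </s>"]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def p_mle(w, w_prev):
    # P(w | w_prev) = c(w_prev, w) / c(w_prev)
    return bigrams[(w_prev, w)] / unigrams[w_prev]

print(p_mle("I", "<s>"))    # 2/3
print(p_mle("Sam", "am"))   # 1/2
print(p_mle("do", "I"))     # 1/3
```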
▪ can you tell me about any good cantonese restaurants close by
▪ mid priced thai food is what i’m looking for
▪ tell me about chez panisse
▪ can you give me a listing of the kinds of food that are available
▪ i’m looking for a good place to eat breakfast
▪ when is caffe venezia open during the day
▪ Out of 9222 sentences: raw bigram counts
▪ Normalize by unigrams:
▪ Result: bigram probabilities (count and probability tables not reproduced here)
P(<s> I want english food </s>) =
P(I|<s>)
× P(want|I)
× P(english|want)
× P(food|english)
× P(</s>|food)
= .000031
▪ P(english|want) = .0011
▪ P(chinese|want) = .0065
▪ P(to|want) = .66
▪ P(eat | to) = .28
▪ P(food | to) = 0
▪ P(want | spend) = 0
▪ P (i | <s>) = .25
▪We do everything in log space
▪Avoid underflow
▪(also adding is faster than multiplying)
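For instance, summing logs of two of the bigram probabilities quoted above keeps the computation in a safe numeric range:

```python
import math

# two of the bigram probabilities quoted above; a full sentence would sum over all of its bigrams
log_p = math.log(0.25) + math.log(0.0011)   # log P(i|<s>) + log P(english|want)
print(log_p)                                # sums of logs avoid floating-point underflow
```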
http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
▪ http://ngrams.googlelabs.com/
Estimating N-gram Probabilities
Evaluating Perplexity
▪ Does our language model prefer good sentences to bad ones?
▪ Assign higher probability to “real” or “frequently observed” sentences
▪ Than “ungrammatical” or “rarely observed” sentences?
▪ Perplexity: the inverse probability of the test set, normalized by the number of words
  PP(W) = P(w1 w2 … wN)^(-1/N)
▪ Chain rule:
  PP(W) = ( ∏i 1 / P(wi | w1 … wi-1) )^(1/N)
▪ For bigrams:
  PP(W) = ( ∏i 1 / P(wi | wi-1) )^(1/N)
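A sketch of bigram perplexity over a test set, assuming some bigram_prob(w_prev, w) function (e.g., an MLE or smoothed estimate) is already available; how sentence-boundary tokens are counted is a convention:

```python
import math

def bigram_perplexity(test_sentences, bigram_prob):
    """PP(W) = exp( -(1/N) * sum_i log P(w_i | w_{i-1}) ) over the whole test set."""
    log_prob, n_tokens = 0.0, 0
    for sentence in test_sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for w_prev, w in zip(tokens, tokens[1:]):
            log_prob += math.log(bigram_prob(w_prev, w))
            n_tokens += 1
    return math.exp(-log_prob / n_tokens)
```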
▪ P(w | denied the):
  3 allegations
  2 reports
  1 claims
  1 request
  7 total
▪ Steal probability mass to generalize better:
▪ P(w | denied the):
  2.5 allegations
  1.5 reports
  0.5 claims
  0.5 request
  2 other
  7 total
  (unseen words such as outcome, attack, man now get a small share of the mass)
▪ Also called Laplace smoothing
▪ Pretend we saw each word one more time than we did
▪ Just add one to all the counts!
▪ MLE estimate:
  PMLE(wi | wi-1) = c(wi-1, wi) / c(wi-1)
▪ Add-1 estimate:
  PAdd-1(wi | wi-1) = (c(wi-1, wi) + 1) / (c(wi-1) + V)
▪ Suppose the word “bagel” occurs 400 times in a corpus of a million words
▪ What is the probability that a random word from some other text will be
“bagel”?
▪ MLE estimate is 400/1,000,000 = .0004
▪ This may be a bad estimate for some other corpus
▪ But it is the estimate that makes it most likely that “bagel” will occur 400 times in a
million word corpus.
▪ So add-1 isn’t used for N-grams:
▪ We’ll see better methods
▪ Backoff:
▪ use trigram if you have good evidence,
▪ otherwise bigram, otherwise unigram
▪ Interpolation:
▪ mix unigram, bigram, trigram
Training Data | Held-Out Data | Test Data
▪ Choose λs to maximize the probability of held-out data:
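A minimal sketch of linear interpolation, assuming unigram/bigram/trigram estimators p1, p2, p3 already exist; the λ values below are placeholders, in practice tuned on the held-out data:

```python
def interpolated_prob(w, w1, w2, p1, p2, p3, lambdas=(0.1, 0.3, 0.6)):
    """P_hat(w | w1 w2) = l1*P(w) + l2*P(w | w2) + l3*P(w | w1 w2), with the lambdas summing to 1."""
    l1, l2, l3 = lambdas
    return l1 * p1(w) + l2 * p2(w, w2) + l3 * p3(w, w1, w2)
```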
▪ Efficiency
▪ Efficient data structures like tries
▪ Bloom filters: approximate language models
▪ Store words as indexes, not strings
▪ Use Huffman coding to fit large numbers of words into two bytes
▪ Quantize probabilities (4-8 bits instead of 8-byte float)
▪“Stupid backoff” (Brants et al. 2007)
▪No discounting, just use relative frequencies
  S(wi | wi-k+1 … wi-1) = count(wi-k+1 … wi) / count(wi-k+1 … wi-1)   if count(wi-k+1 … wi) > 0
                        = 0.4 · S(wi | wi-k+2 … wi-1)                 otherwise

  S(wi) = count(wi) / N
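A sketch of this recursion, assuming counts maps n-gram tuples (of every order) to their corpus counts and N is the total number of tokens; note the scores S are not normalized probabilities:

```python
def stupid_backoff(word, context, counts, N, alpha=0.4):
    """S(w | context): relative frequency if the full n-gram was seen, else alpha * shorter context."""
    if not context:                                   # base case: unigram score count(w) / N
        return counts.get((word,), 0) / N
    ngram = tuple(context) + (word,)
    if counts.get(ngram, 0) > 0:
        return counts[ngram] / counts[tuple(context)]
    return alpha * stupid_backoff(word, context[1:], counts, N, alpha)
```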
▪Add-1 smoothing:
▪ OK for text categorization, not for language modeling
▪The most commonly used method:
▪ Extended Interpolated Kneser-Ney
▪For very large N-grams like the Web:
▪ Stupid backoff
▪ Discriminative models:
▪ choose n-gram weights to improve a task, not to fit the training set
▪ Parsing-based models
▪ Caching Models
▪ Recently used words are more likely to appear
  PCACHE(w | history) = λ P(wi | wi-2 wi-1) + (1 - λ) c(w ∈ history) / |history|
Interpolation, Backoff, and Web-Scale LM’s
<s> I am Sam </s>
<s> Sam I am </s>
<s> I am Sam </s>
<s> I do not like green eggs and Sam </s>
Using a bigram language model with add-one smoothing,
what is P(Sam | am)?
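A sketch of how the quiz quantity could be computed; the vocabulary convention (e.g., whether <s> and </s> count toward V) is a choice, so the resulting number depends on that assumption:

```python
from collections import Counter

corpus = ["<s> I am Sam </s>",
          "<s> Sam I am </s>",
          "<s> I am Sam </s>",
          "<s> I do not like green eggs and Sam </s>"]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

# Vocabulary convention is an assumption: here V excludes <s> but includes </s>.
V = len(set(unigrams) - {"<s>"})
p = (bigrams[("am", "Sam")] + 1) / (unigrams["am"] + V)
print(f"c(am,Sam)={bigrams[('am','Sam')]}, c(am)={unigrams['am']}, V={V}, P(Sam|am)={p:.3f}")
```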
The Task of Text Classification
▪ 1787-8: anonymous essays try to convince New York to ratify the U.S. Constitution: Jay, Madison, Hamilton.
▪ Authorship of 12 of the letters in dispute
S. Argamon, M. Koppel, J. Fine, A. R. Shimoni, 2003. “Gender, Genre, and Writing Style in Formal Written Texts,” Text, volume 23,
number 3, pp. 321–346
▪ unbelievably disappointing
▪ Full of zany characters and richly applied satire, and some great plot twists
▪ It was pathetic. The worst part about it was the boxing scenes.
MEDLINE Article MeSH Subject Category Hierarchy
▪ Antagonists and Inhibitors
▪ Blood Supply
▪ Chemistry
▪ Drug Therapy
▪ Embryology
▪ Epidemiology
▪…
▪Assigning subject categories, topics, or genres
▪Spam detection
▪Authorship identification
▪Age/gender identification
▪Language Identification
▪Sentiment analysis
▪…
▪Input:
▪ a document d
▪ a fixed set of classes C = {c1, c2,…, cJ}
▪Any kind of classifier
▪ Naïve Bayes
▪ Logistic regression
▪ Support-vector machines
▪ k-Nearest Neighbors
▪…
The Task of Text Classification
Naïve Bayes
▪Simple (“naïve”) classification method based on Bayes
rule
▪Relies on very simple representation of document
▪Bag of words
▪ A classifier γ maps a document d to a class: γ(d) = c

  I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun… It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet.

▪ The same review with most words masked out:

  x love x sweet x satirical x great x fun x whimsical x romantic x laughing x recommend x several x happy x again x

▪ The bag of words (word counts):

  great      2
  love       2
  recommend  1
  laugh      1
  happy      1
  ...        ...
▪ Bag of words for document classification:
  Test document (class ?):  parser, language, label, translation, …
  Machine Learning:         learning, training, algorithm, shrinkage, network, …
  NLP:                      parser, tag, training, translation, language, …
  Garbage Collection:       garbage, collection, memory, optimization, region, …
  Planning:                 planning, temporal, reasoning, plan, language, …
  GUI:                      …
• For a document d and a class c:
  P(c | d) = P(d | c) P(c) / P(d)

  cMAP = argmax c∈C P(c | d)                        MAP is “maximum a posteriori” = most likely class
       = argmax c∈C P(d | c) P(c) / P(d)            Bayes rule
       = argmax c∈C P(d | c) P(c)                   dropping the denominator
       = argmax c∈C P(x1, x2, …, xn | c) P(c)       document d represented as features x1..xn
  P̂(cj) = doccount(C = cj) / Ndoc

  P̂(wi | cj) = count(wi, cj) / Σw∈V count(w, cj)
             (the fraction of times word wi appears among all words in documents of topic cj)
▪ What if we have seen no training documents with the word fantastic classified in the topic positive (thumbs-up)?
  P̂("fantastic" | positive) = count("fantastic", positive) / Σw∈V count(w, positive) = 0
▪ Laplace (add-1) smoothing fixes this:
  P̂(wi | c) = (count(wi, c) + 1) / ( Σw∈V count(w, c) + |V| )
• From training corpus, extract Vocabulary
• Calculate P(cj) terms
  ▪ For each cj in C do
    docsj ← all docs with class = cj
    P(cj) ← |docsj| / |total # documents|
• Calculate P(wk | cj) terms
  ▪ Textj ← single doc containing all docsj
  ▪ For each word wk in Vocabulary
    nk ← # of occurrences of wk in Textj
    P(wk | cj) ← (nk + α) / (n + α |Vocabulary|)   (n = total number of word tokens in Textj)
Add one extra word to the vocabulary, the “unknown word” wu

  P̂(wu | c) = (count(wu, c) + 1) / ( Σw∈V count(w, c) + |V| + 1 )
            = 1 / ( Σw∈V count(w, c) + |V| + 1 )
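A compact sketch of this training procedure (document-frequency priors, add-α likelihoods over one concatenated “mega-document” per class); the function and variable names are illustrative:

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels, alpha=1.0):
    """docs: list of token lists; labels: parallel list of class labels."""
    vocab = {w for doc in docs for w in doc}
    classes = set(labels)
    log_prior, log_likelihood = {}, defaultdict(dict)
    for c in classes:
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))       # P(c) = doccount(c) / N_doc
        word_counts = Counter(w for d in class_docs for w in d)    # Text_c: all docs of class c
        total = sum(word_counts.values())
        for w in vocab:
            # P(w | c) = (count(w, c) + alpha) / (total + alpha * |V|)
            log_likelihood[c][w] = math.log((word_counts[w] + alpha) /
                                            (total + alpha * len(vocab)))
    return log_prior, log_likelihood, vocab
```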
Naïve Bayes: Relationship to Language Modeling
(Figure: generative model example with class c = China; not reproduced.)
▪ Naïve Bayes classifiers can use any sort of feature
  ▪ URL, email address, dictionaries, network features
▪ But if, as in the previous slides,
  ▪ we use only word features
  ▪ and we use all of the words in the text (not a subset)
▪ then
  ▪ Naïve Bayes has an important similarity to language modeling.
▪ Each class is a unigram language model. Class pos:
  P(I | pos)    = 0.1
  P(love | pos) = 0.1
  P(this | pos) = 0.01
  P(fun | pos)  = 0.05
  P(film | pos) = 0.1
▪ Scoring the sentence s = “I love this fun film”:
  P(s | pos) = 0.1 × 0.1 × 0.01 × 0.05 × 0.1 = 0.0000005
▪ F measure: the weighted harmonic mean of precision P and recall R
  F = 1 / ( α (1/P) + (1 - α) (1/R) ) = (β² + 1) P R / (β² P + R)
▪ The harmonic mean is a very conservative average; see IIR § 8.3
▪ People usually use balanced F1 measure
  ▪ i.e., with β = 1 (that is, α = ½): F = 2PR / (P + R)
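A small helper implementing the formula above:

```python
def f_measure(precision, recall, beta=1.0):
    """F = (beta^2 + 1) * P * R / (beta^2 * P + R); beta = 1 gives F1 = 2PR / (P + R)."""
    return (beta ** 2 + 1) * precision * recall / (beta ** 2 * precision + recall)

print(f_measure(0.8, 0.5))   # F1 = 2 * 0.8 * 0.5 / 1.3 ≈ 0.615
```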
Precision, Recall & F1 Score
Text Classification: Evaluation
▪ Reuters-21578: the most (over)used data set, 21,578 docs (each 90 types, 200 tokens)
▪ 9603 training, 3299 test articles (ModApte/Lewis split)
▪ 118 categories
▪ An article can be in more than one category
▪ Learn 118 binary category distinctions
Recall for class i: fraction of docs in class i classified correctly:
  cii / Σj cij
Precision for class i: fraction of docs assigned class i that are actually about class i:
  cii / Σj cji
Accuracy (1 - error rate): fraction of docs classified correctly:
  Σi cii / Σi Σj cij
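A sketch of these metrics from a confusion matrix c, following the convention implied above that c[i][j] counts documents whose true class is i and whose assigned class is j:

```python
def per_class_metrics(c):
    """Return (precision, recall) per class and overall accuracy for matrix c[i][j]."""
    n = len(c)
    precision = [c[i][i] / sum(c[j][i] for j in range(n)) for i in range(n)]  # column sums
    recall    = [c[i][i] / sum(c[i][j] for j in range(n)) for i in range(n)]  # row sums
    accuracy  = sum(c[i][i] for i in range(n)) / sum(sum(row) for row in c)
    return precision, recall, accuracy
```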
Text Classification: Evaluation
Text Classification: Practical Issues
Brill and Banko on spelling correction
▪ Automatic classification
▪ Manual review of uncertain/difficult/“new” cases
▪ Multiplying lots of probabilities can result in floating-point underflow.
▪ Since log(xy) = log(x) + log(y)
▪ Better to sum logs of probabilities instead of multiplying probabilities.
▪ Class with highest un-normalized log probability score is still most
probable.
  cNB = argmax cj∈C [ log P(cj) + Σ i∈positions log P(xi | cj) ]
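A sketch of classification in log space, assuming log-prior and log-likelihood tables like the ones built in the training sketch earlier; skipping out-of-vocabulary words is one possible convention:

```python
def classify_nb(doc_tokens, log_prior, log_likelihood, vocab):
    """Return argmax_c [ log P(c) + sum_i log P(x_i | c) ]."""
    best_class, best_score = None, float("-inf")
    for c in log_prior:
        score = log_prior[c] + sum(log_likelihood[c][w] for w in doc_tokens if w in vocab)
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```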
Text Classification: Practical Issues