18-IntroNLP II PDF

The document defines minimum edit distance and describes how it can be used for spell checking, machine translation, and computational biology applications. Minimum edit distance is the minimum number of edit operations (insertions, deletions, substitutions) needed to transform one string into another. It is computed using dynamic programming to find the optimal alignment between two strings or sequences, and weights can be added to account for different costs of edit operations, which matters in areas like biology where certain mutations are more common than others. The document then introduces N-gram language models, their estimation and evaluation by perplexity, smoothing, backoff, and interpolation, and closes with Naïve Bayes text classification and its evaluation.


Definition of Minimum Edit Distance

▪ Spell correction
  ▪ The user typed “graffe” — which is closest?
    ▪ graf
    ▪ graft
    ▪ grail
    ▪ giraffe

▪ Computational Biology
  ▪ Align two sequences of nucleotides

    AGGCTATCACCTGACCTCCAGGCCGATGCCC
    TAGCTATCACGACCGCGGTCGATTTGCCCGAC

  ▪ Resulting alignment:

    -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
    TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

• Also used for Machine Translation, Information Extraction, Speech Recognition
▪ The minimum edit distance between two strings
▪ Is the minimum number of editing operations
▪ Insertion
▪ Deletion
▪ Substitution

▪ Needed to transform one into the other


▪ Two strings (e.g., intention and execution, as in the table below) and their alignment:
  ▪ If each operation has a cost of 1
    ▪ Distance between these is 5
  ▪ If substitutions cost 2 (Levenshtein)
    ▪ Distance between them is 8
▪ Given a sequence of bases

AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
▪ An alignment:

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
▪ Given two sequences, align each letter to a letter or gap
▪ Evaluating Machine Translation and speech recognition (e.g., word error rate)

  R (reference):  Spokesman confirms        senior government adviser was shot
  H (hypothesis): Spokesman said      the   senior            adviser was shot dead
                            S         I                D                        I

  (S = substitution, I = insertion, D = deletion)
▪ Named Entity Extraction and Entity Coreference
▪ IBM Inc. announced today
▪ IBM profits
▪ Stanford President John Hennessy announced yesterday
▪ for Stanford University President John Hennessy
▪ Searching for a path (sequence of edits) from the start string to the final string:
▪ Initial state: the word we’re transforming
▪ Operators: insert, delete, substitute
▪ Goal state: the word we’re trying to get to
▪ Path cost: what we want to minimize: the number of edits

▪ But the space of all edit sequences is huge!
▪ We can’t afford to navigate naïvely
▪ Lots of distinct paths wind up at the same state.
▪ We don’t have to keep track of all of them
▪ Just the shortest path to each of those revisited states.
▪For two strings
▪ X of length n
▪ Y of length m
▪We define D(i,j)
▪ the edit distance between X[1..i] and Y[1..j]
▪ i.e., the first i characters of X and the first j characters of Y
▪ The edit distance between X and Y is thus D(n,m)
Definition of Minimum Edit Distance
Computing Minimum Edit Distance
▪ Dynamic programming: A tabular computation of D(n,m)
▪ Solving problems by combining solutions to subproblems.
▪ Bottom-up
▪ We compute D(i,j) for small i,j
▪ And compute larger D(i,j) based on previously computed smaller values
▪ i.e., compute D(i,j) for all i (0 ≤ i ≤ n) and j (0 ≤ j ≤ m)
▪ Initialization
D(i,0) = i
D(0,j) = j
▪ Recurrence Relation:
  For each i = 1…N   (N = length of X)
    For each j = 1…M   (M = length of Y)
      D(i,j) = min of:
        D(i-1,j) + 1
        D(i,j-1) + 1
        D(i-1,j-1) + 2   if X(i) ≠ Y(j)
        D(i-1,j-1) + 0   if X(i) = Y(j)
▪ Termination:
  D(N,M) is the distance
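A minimal Python sketch of this tabular computation, assuming the cost scheme above (insert/delete cost 1, substitution cost 2, match cost 0); the variable names D, N, M follow the slides:

```python
def min_edit_distance(x, y, sub_cost=2):
    """Minimum edit distance via dynamic programming (Levenshtein variant:
    insertions and deletions cost 1, substitutions cost `sub_cost`)."""
    n, m = len(x), len(y)
    # D[i][j] = edit distance between x[:i] and y[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):            # initialization: D(i,0) = i
        D[i][0] = i
    for j in range(1, m + 1):            # initialization: D(0,j) = j
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if x[i - 1] == y[j - 1] else sub_cost
            D[i][j] = min(D[i - 1][j] + 1,        # deletion
                          D[i][j - 1] + 1,        # insertion
                          D[i - 1][j - 1] + sub)  # substitution / match
    return D[n][m]

print(min_edit_distance("intention", "execution"))              # 8 (substitutions cost 2)
print(min_edit_distance("intention", "execution", sub_cost=1))  # 5 (every operation costs 1)
```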
The Edit Distance Table (initialization, for INTENTION vs. EXECUTION):

  N  9
  O  8
  I  7
  T  6
  N  5
  E  4
  T  3
  N  2
  I  1
  #  0  1  2  3  4  5  6  7  8  9
     #  E  X  E  C  U  T  I  O  N
The Edit Distance Table (filled in):

  N  9  8  9 10 11 12 11 10  9  8
  O  8  7  8  9 10 11 10  9  8  9
  I  7  6  7  8  9 10  9  8  9 10
  T  6  5  6  7  8  9  8  9 10 11
  N  5  4  5  6  7  8  9 10 11 10
  E  4  3  4  5  6  7  8  9 10  9
  T  3  4  5  6  7  8  7  8  9  8
  N  2  3  4  5  6  7  8  7  8  7
  I  1  2  3  4  5  6  7  6  7  8
  #  0  1  2  3  4  5  6  7  8  9
     #  E  X  E  C  U  T  I  O  N
Computing Minimum Edit Distance
Backtrace for Computing Alignments
▪ Edit distance isn’t sufficient
▪ We often need to align each character of the two strings to each other

▪ We do this by keeping a “backtrace”


▪ Every time we enter a cell, remember where we came from
▪ When we reach the end,
▪ Trace back the path from the upper right corner to read off the alignment
(The same edit distance table for INTENTION vs. EXECUTION, now with a backtrace pointer stored in every cell.)
▪ Base conditions:  D(i,0) = i,  D(0,j) = j
▪ Termination:  D(N,M) is the distance
▪ Recurrence Relation:
  For each i = 1…N
    For each j = 1…M
      D(i,j) = min of:
        D(i-1,j) + 1                               (deletion)
        D(i,j-1) + 1                               (insertion)
        D(i-1,j-1) + 2 if X(i) ≠ Y(j), else + 0    (substitution / match)
      ptr(i,j) = LEFT  (insertion)
                 DOWN  (deletion)
                 DIAG  (substitution)
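A minimal sketch of the backtrace idea in Python; the LEFT/DOWN/DIAG pointer labels follow the slide, and the helper name `align` is ours:

```python
def align(x, y, sub_cost=2):
    """Minimum edit distance with a backtrace, returning one optimal alignment."""
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0], ptr[i][0] = i, "DOWN"        # deletions down the first column
    for j in range(1, m + 1):
        D[0][j], ptr[0][j] = j, "LEFT"        # insertions along the first row
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if x[i - 1] == y[j - 1] else sub_cost
            choices = [(D[i - 1][j - 1] + sub, "DIAG"),   # substitution / match
                       (D[i - 1][j] + 1, "DOWN"),         # deletion
                       (D[i][j - 1] + 1, "LEFT")]         # insertion
            D[i][j], ptr[i][j] = min(choices)             # remember where we came from
    # Trace back from the final cell to read off the alignment.
    top, bottom, i, j = [], [], n, m
    while i > 0 or j > 0:
        if ptr[i][j] == "DIAG":
            top.append(x[i - 1]); bottom.append(y[j - 1]); i, j = i - 1, j - 1
        elif ptr[i][j] == "DOWN":
            top.append(x[i - 1]); bottom.append("-"); i -= 1
        else:  # LEFT
            top.append("-"); bottom.append(y[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), D[n][m]

print(align("intention", "execution"))   # two aligned strings and distance 8
```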
▪ Every non-decreasing path from (0,0) to (M,N) in the alignment table corresponds to an alignment of the two sequences
▪ An optimal alignment is composed of optimal subalignments

Slide adapted from Serafim Batzoglou
▪ Performance:
  ▪ Time: O(nm)
  ▪ Space: O(nm)
  ▪ Backtrace: O(n+m)
Backtrace for Computing Alignments
Weighted Minimum Edit Distance
▪ Why would we add weights to the computation?
▪ Spell Correction: some letters are more likely to be mistyped than others
▪ Biology: certain kinds of deletions or insertions are more likely than others
▪ Initialization:
  D(0,0) = 0
  D(i,0) = D(i-1,0) + del[x(i)];  1 ≤ i ≤ N
  D(0,j) = D(0,j-1) + ins[y(j)];  1 ≤ j ≤ M

▪ Recurrence Relation:
  D(i,j) = min of:
    D(i-1,j) + del[x(i)]
    D(i,j-1) + ins[y(j)]
    D(i-1,j-1) + sub[x(i),y(j)]
▪ Termination:
  D(N,M) is the distance
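A sketch of how the dynamic program changes once per-character cost tables are used; the `del_cost`/`ins_cost`/`sub_cost` functions and the vowel-confusion costs below are made-up illustrations, not real confusion-matrix data:

```python
def weighted_edit_distance(x, y, del_cost, ins_cost, sub_cost):
    """Same DP as before, but costs are looked up per character."""
    n, m = len(x), len(y)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + del_cost(x[i - 1])
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + ins_cost(y[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j] + del_cost(x[i - 1]),
                          D[i][j - 1] + ins_cost(y[j - 1]),
                          D[i - 1][j - 1] + sub_cost(x[i - 1], y[j - 1]))
    return D[n][m]

# Toy cost scheme: vowel-for-vowel typos are cheap, other substitutions cost 2.
vowels = set("aeiou")
dist = weighted_edit_distance(
    "graffe", "giraffe",
    del_cost=lambda a: 1.0,
    ins_cost=lambda b: 1.0,
    sub_cost=lambda a, b: 0.0 if a == b else (0.5 if a in vowels and b in vowels else 2.0))
print(dist)
```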
…The 1950s were not good years for mathematical research. [the] Secretary of Defense
…had a pathological fear and hatred of the word, research…

I decided therefore to use the word, “programming”.

I wanted to get across the idea that this was dynamic, this was multistage… I thought, let’s
… take a word that has an absolutely precise meaning, namely dynamic… it’s impossible
to use the word, dynamic, in a pejorative sense. Try thinking of some combination that will
possibly give it a pejorative meaning. It’s impossible.

Thus, I thought dynamic programming was a good name. It was something not even a
Congressman could object to.”

Richard Bellman, “Eye of the Hurricane: an autobiography” 1984.


Weighted Minimum Edit Distance
Minimum Edit Distance in Computational Biology
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
▪ Comparing genes or regions from different
species
▪ to find important regions
▪ determine function
▪ uncover evolutionary forces

▪ Assembling fragments to sequence DNA


▪ Compare individuals to look for mutations
▪In Natural Language Processing
▪We generally talk about distance
(minimized)
▪ And weights
▪In Computational Biology
▪We generally talk about similarity
(maximized)
▪ And scores
▪ Initialization:
  D(i,0) = -i × d
  D(0,j) = -j × d

▪ Recurrence Relation:
  D(i,j) = max of:
    D(i-1,j) - d
    D(i,j-1) - d
    D(i-1,j-1) + s[x(i),y(j)]

▪ Termination:
  D(N,M) is the optimal alignment score
(Alignment matrix for x1…xM against y1…yN; note that the origin is at the upper left.)

Slide adapted from Serafim Batzoglou


▪ Maybe it is OK to have an unlimited # of gaps in the
beginning and end:

----------CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC
GCGAGTTCATCTATCAC--GACCGC--GGTCG--------------

• If so, we don’t want to penalize gaps at the ends

Slide from Serafim Batzoglou


Example:
2 overlapping “reads” from a sequencing project

Example:
Search for a mouse gene within a human chromosome

Slide from Serafim Batzoglou


Changes for this end-gap-free (overlap) variant:

1. Initialization
   For all i, j:  F(i, 0) = 0,  F(0, j) = 0

2. Termination
   F_OPT = max( max_i F(i, N), max_j F(M, j) )

Slide from Serafim Batzoglou

Given two strings x = x1……xM,
y = y1……yN
Find substrings x’, y’ whose similarity
(optimal global alignment value)
is maximum

x = aaaacccccggggtta
y = ttcccgggaaccaacc

Slide from Serafim Batzoglou


Idea: Ignore badly aligning regions

Modifications to Needleman-Wunsch:

Initialization:  F(0, j) = 0
                 F(i, 0) = 0

Iteration:  F(i, j) = max of:
              0
              F(i-1, j) - d
              F(i, j-1) - d
              F(i-1, j-1) + s(xi, yj)

Slide from Serafim Batzoglou
Termination:
1. If we want the best local alignment…

F_OPT = max over all i, j of F(i, j)

Find FOPT and trace back

2. If we want all local alignments scoring > t

?? For all i, j find F(i, j) > t, and trace back?

Complicated by overlapping local alignments

Slide from Serafim Batzoglou


X = ATCAT,  Y = ATTATC
Let  m = 1  (+1 point for a match)
     d = 1  (−1 point for del/ins/sub)

Initialization:

        A  T  T  A  T  C
     0  0  0  0  0  0  0
  A  0
  T  0
  C  0
  A  0
  T  0

Filled in:

        A  T  T  A  T  C
     0  0  0  0  0  0  0
  A  0  1  0  0  1  0  0
  T  0  0  2  1  0  2  0
  C  0  0  1  1  0  1  3
  A  0  1  0  0  2  1  2
  T  0  0  2  0  1  3  2
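A small Python sketch of this local-alignment recurrence (the Smith-Waterman idea), using the same X, Y, m, and d as the example above; it reproduces the filled-in table:

```python
def local_alignment_table(x, y, m=1, d=1):
    """Local alignment scores: matches score +m, mismatches and gaps score -d,
    and no cell is allowed to drop below 0."""
    F = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    best = 0
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            s = m if x[i - 1] == y[j - 1] else -d
            F[i][j] = max(0,
                          F[i - 1][j] - d,       # gap in y
                          F[i][j - 1] - d,       # gap in x
                          F[i - 1][j - 1] + s)   # match / mismatch
            best = max(best, F[i][j])
    return F, best

F, best = local_alignment_table("ATCAT", "ATTATC")
for row in F:
    print(row)
print("best local alignment score:", best)   # 3, as in the table above
```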
Minimum Edit Distance in Computational Biology
Introduction to N-grams
▪ Today’s goal: assign a probability to a sentence
▪ Machine Translation:
▪ P(high winds tonite) > P(large winds tonite)
▪ Spell Correction
  ▪ The office is about fifteen minuets from my house
  ▪ P(about fifteen minutes from) > P(about fifteen minuets from)
▪ Speech Recognition
▪ P(I saw a van) >> P(eyes awe of an)
▪ + Summarization, question-answering, etc., etc.!!
▪Goal: compute the probability of a sentence or sequence
of words:
P(W) = P(w1,w2,w3,w4,w5…wn)

▪Related task: probability of an upcoming word:


P(w5|w1,w2,w3,w4)

▪ A model that computes either of these:

    P(W)    or    P(wn | w1, w2, …, wn-1)

  is called a language model.

▪ A better term would be “the grammar”, but “language model” or LM is standard


▪ How to compute this joint probability:

▪P(its, water, is, so, transparent, that)

▪ Intuition: let’s rely on the Chain Rule of Probability


▪ Recall the definition of conditional probabilities:
    P(B|A) = P(A,B) / P(A)
  Rewriting:
    P(A,B) = P(A) P(B|A)

▪ More variables:
    P(A,B,C,D) = P(A) P(B|A) P(C|A,B) P(D|A,B,C)
▪ The Chain Rule in general:
    P(x1,x2,x3,…,xn) = P(x1) P(x2|x1) P(x3|x1,x2) … P(xn|x1,…,xn-1)

▪ Applied to a word sequence:
    P(w1 w2 … wn) = ∏i P(wi | w1 w2 … wi-1)

P(“its water is so transparent”) =
    P(its) × P(water|its) × P(is|its water)
    × P(so|its water is) × P(transparent|its water is so)
▪ Could we just count and divide?

    P(the | its water is so transparent that) =
        Count(its water is so transparent that the) / Count(its water is so transparent that)

▪ No! Too many possible sentences!
▪ We’ll never see enough data for estimating these
▪ Simplifying assumption (the Markov assumption, due to Andrei Markov):

    P(the | its water is so transparent that) ≈ P(the | that)

▪ Or maybe

    P(the | its water is so transparent that) ≈ P(the | transparent that)

▪ More generally, we approximate each component in the product:

    P(wi | w1 w2 … wi-1) ≈ P(wi | wi-k … wi-1)

    P(w1 w2 … wn) ≈ ∏i P(wi | wi-k … wi-1)

▪ Simplest case — the unigram model:

    P(w1 w2 … wn) ≈ ∏i P(wi)
Some automatically generated sentences from a unigram model

fifth, an, of, futures, the, an, incorporated, a,


a, the, inflation, most, dollars, quarter, in, is,
mass

thrift, did, eighty, said, hard, 'm, july, bullish

that, or, limited, the


Bigram model: condition on the previous word:

    P(wi | w1 w2 … wi-1) ≈ P(wi | wi-1)


texaco, rose, one, in, this, issue, is, pursuing, growth, in,
a, boiler, house, said, mr., gurria, mexico, 's, motion,
control, proposal, without, permission, from, five, hundred,
fifty, five, yen

outside, new, car, parking, lot, of, the, agreement, reached

this, would, be, a, record, november


▪We can extend to trigrams, 4-grams, 5-grams
▪In general this is an insufficient model of language
▪ because language has long-distance dependencies:

“The computer which I had just put into the machine


room on the fifth floor crashed.”

▪But we can often get away with N-gram models


Introduction to N-grams
Estimating N-gram Probabilities
▪ The Maximum Likelihood Estimate:

    P(wi | wi-1) = count(wi-1, wi) / count(wi-1)
                 = c(wi-1, wi) / c(wi-1)

▪ Example training corpus:
    <s> I am Sam </s>
    <s> Sam I am </s>
    <s> I do not like green eggs and ham </s>
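A small sketch of bigram MLE estimation over this toy corpus (the helper name `p_mle` is ours):

```python
from collections import Counter

corpus = ["<s> I am Sam </s>",
          "<s> Sam I am </s>",
          "<s> I do not like green eggs and ham </s>"]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    tokens = sent.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))   # count adjacent word pairs

def p_mle(w, prev):
    """P(w | prev) = c(prev, w) / c(prev)"""
    return bigrams[(prev, w)] / unigrams[prev]

print(p_mle("I", "<s>"))    # 2/3
print(p_mle("Sam", "am"))   # 1/2
print(p_mle("do", "I"))     # 1/3
```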
▪ can you tell me about any good cantonese restaurants close by
▪ mid priced thai food is what i’m looking for
▪ tell me about chez panisse
▪ can you give me a listing of the kinds of food that are available
▪ i’m looking for a good place to eat breakfast
▪ when is caffe venezia open during the day
▪ Out of 9222 sentences
▪ Raw bigram counts are normalized by the unigram counts to give bigram probabilities (tables omitted here)
▪ Result:
P(<s> I want english food </s>) =
P(I|<s>)
× P(want|I)
× P(english|want)
× P(food|english)
× P(</s>|food)
= .000031
▪ P(english|want) = .0011
▪ P(chinese|want) = .0065
▪ P(to|want) = .66
▪ P(eat | to) = .28
▪ P(food | to) = 0
▪ P(want | spend) = 0
▪ P (i | <s>) = .25
▪We do everything in log space
▪Avoid underflow
▪(also adding is faster than multiplying)

log(p1 × p2 × p3 × p4) = log p1 + log p2 + log p3 + log p4


▪ SRILM
  ▪ http://www.speech.sri.com/projects/srilm/

▪ serve as the incoming 92
▪ serve as the incubator 99
▪ serve as the independent 794
▪ serve as the index 223
▪ serve as the indication 72
▪ serve as the indicator 120
▪ serve as the indicators 45
▪ serve as the indispensable 111
▪ serve as the indispensible 40
▪ serve as the individual 234

http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
▪ http://ngrams.googlelabs.com/
Estimating N-gram Probabilities
Evaluating Perplexity
▪ Does our language model prefer good sentences to bad ones?
▪ Assign higher probability to “real” or “frequently observed” sentences
▪ Than “ungrammatical” or “rarely observed” sentences?

▪ We train parameters of our model on a training set.


▪ We test the model’s performance on data we haven’t seen.
▪ A test set is an unseen dataset that is different from our training set, totally unused.
▪ An evaluation metric tells us how well our model does on the test set.
▪ Best evaluation for comparing models A and B
▪ Put each model in a task
▪ spelling corrector, speech recognizer, MT system
▪ Run the task, get an accuracy for A and for B
▪ How many misspelled words corrected properly
▪ How many words translated correctly
▪ Compare accuracy for A and B
▪Extrinsic evaluation
▪ Time-consuming; can take days or weeks
▪So
▪ Sometimes use intrinsic evaluation: perplexity
▪ Bad approximation
▪ unless the test data looks just like the training data
▪ So generally only useful in pilot experiments
▪ But is helpful to think about.
▪ The Shannon Game:
  ▪ How well can we predict the next word?

      I always order pizza with cheese and ____
      The 33rd President of the US was ____
      I saw a ____

    e.g., for the first blank:  mushrooms 0.1,  pepperoni 0.1,  anchovies 0.01,  …,
    fried rice 0.0001,  …,  and 1e-100

▪ Unigrams are terrible at this game. (Why?)
▪ A better model of a text
▪ is one which assigns a higher probability to the word that actually occurs
The best language model is one that best predicts an unseen test set
  • Gives the highest P(sentence)

Perplexity is the inverse probability of the test set, normalized by the number of words:

    PP(W) = P(w1 w2 … wN)^(-1/N)
          = ( 1 / P(w1 w2 … wN) )^(1/N)

By the chain rule:

    PP(W) = ( ∏i 1 / P(wi | w1 … wi-1) )^(1/N)

For bigrams:

    PP(W) = ( ∏i 1 / P(wi | wi-1) )^(1/N)

Minimizing perplexity is the same as maximizing probability
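A minimal sketch of bigram perplexity computed in log space; `bigram_prob` stands in for any smoothed estimator (with an unsmoothed MLE, a single unseen bigram would make the perplexity infinite):

```python
import math

def perplexity(tokens, bigram_prob):
    """Perplexity of a token sequence under a bigram model."""
    log_prob = sum(math.log(bigram_prob(w, prev))
                   for prev, w in zip(tokens, tokens[1:]))
    n = len(tokens) - 1                     # number of predicted words
    return math.exp(-log_prob / n)          # = P(w1..wN)^(-1/N)

# Sanity check from the slides: a model that assigns 1/10 to every digit
# gives perplexity 10 on a string of random digits.
uniform_digits = lambda w, prev: 1 / 10
print(perplexity(["<s>"] + list("35712"), uniform_digits))   # ≈ 10
```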


▪ From Josh Goodman
▪ How hard is the task of recognizing digits ‘0,1,2,3,4,5,6,7,8,9’
▪ Perplexity 10

▪ How hard is recognizing (30,000) names at Microsoft.


▪ Perplexity = 30,000

▪ If a system has to recognize


▪ Operator (1 in 4)
▪ Sales (1 in 4)
▪ Technical Support (1 in 4)
▪ 30,000 names (1 in 120,000 each)
▪ Perplexity is 53
▪ Perplexity is weighted equivalent branching factor
▪ Let’s suppose a sentence consisting of random digits
▪ What is the perplexity of this sentence according to a model that assign P=1/10 to each digit?
▪ Training: 38 million words, test: 1.5 million words, WSJ

    N-gram order:   Unigram   Bigram   Trigram
    Perplexity:     962       170      109
Evaluation and Perplexity
Generalization and Zeros
▪ Choose a random bigram (<s>, w) according to its probability
▪ Now choose a random bigram (w, x) according to its probability
▪ And so on until we choose </s>
▪ Then string the words together

    <s> I
        I want
          want to
               to eat
                  eat Chinese
                      Chinese food
                              food </s>

    I want to eat Chinese food
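A sketch of this generation procedure; the bigram distribution below is a made-up toy, only meant to show the sampling step:

```python
import random

def generate(bigram_probs, max_len=20):
    """bigram_probs: dict mapping a word to a dict {next_word: probability}."""
    word, out = "<s>", []
    while word != "</s>" and len(out) < max_len:
        nxt = bigram_probs[word]
        # Sample the next word from the distribution conditioned on the previous word.
        word = random.choices(list(nxt), weights=nxt.values())[0]
        if word != "</s>":
            out.append(word)
    return " ".join(out)

probs = {"<s>": {"I": 1.0},
         "I": {"want": 1.0},
         "want": {"to": 1.0},
         "to": {"eat": 1.0},
         "eat": {"Chinese": 1.0},
         "Chinese": {"food": 1.0},
         "food": {"</s>": 1.0}}
print(generate(probs))   # I want to eat Chinese food
```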
▪N=884,647 tokens, V=29,066
▪Shakespeare produced 300,000 bigram types out
of V2= 844 million possible bigrams.
▪So 99.96% of the possible bigrams were never seen
(have zero entries in the table)
▪Quadrigrams worse: What's coming out looks
like Shakespeare because it is Shakespeare
▪N-grams only work well for word prediction if the
test corpus looks like the training corpus
▪In real life, it often doesn’t
▪We need to train robust models that generalize!
▪One kind of generalization: Zeros!
▪Things that don’t ever occur in the training set
▪But occur in the test set
▪ Training set:                     ▪ Test set:
    … denied the allegations           … denied the offer
    … denied the reports               … denied the loan
    … denied the claims
    … denied the request

  P(“offer” | denied the) = 0
▪ Bigrams with zero probability
▪ mean that we will assign 0 probability to the test set!

▪ And hence we cannot compute perplexity (can’t divide by 0)!


Generalization and Zeros
Smoothing: Add-one (Laplace) smoothing
▪ When we have sparse statistics:

    P(w | denied the):
      3 allegations
      2 reports
      1 claims
      1 request
      (7 total)

▪ Steal probability mass to generalize better:

    P(w | denied the):
      2.5 allegations
      1.5 reports
      0.5 claims
      0.5 request
      2   other
      (7 total)
▪ Also called Laplace smoothing
▪ Pretend we saw each word one more time than we did
▪ Just add one to all the counts!

▪ MLE estimate:
    P_MLE(wi | wi-1) = c(wi-1, wi) / c(wi-1)

▪ Add-1 estimate:
    P_Add-1(wi | wi-1) = ( c(wi-1, wi) + 1 ) / ( c(wi-1) + V )
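A small sketch of the add-1 estimate on a toy corpus (for simplicity it also counts one bigram across the sentence boundary, which a real implementation would usually skip):

```python
from collections import Counter

def add1_bigram_prob(w, prev, bigrams, unigrams, vocab_size):
    """P_Add-1(w | prev) = (c(prev, w) + 1) / (c(prev) + V)"""
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + vocab_size)

tokens = "<s> I am Sam </s> <s> Sam I am </s>".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
V = len(set(tokens))                          # vocabulary size: {<s>, I, am, Sam, </s>} = 5

print(add1_bigram_prob("Sam", "am", bigrams, unigrams, V))   # (1 + 1) / (2 + 5) = 2/7
```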
▪ The maximum likelihood estimate
▪ of some parameter of a model M from a training set T
▪ maximizes the likelihood of the training set T given the model M

▪ Suppose the word “bagel” occurs 400 times in a corpus of a million words
▪ What is the probability that a random word from some other text will be
“bagel”?
▪ MLE estimate is 400/1,000,000 = .0004
▪ This may be a bad estimate for some other corpus
▪ But it is the estimate that makes it most likely that “bagel” will occur 400 times in a
million word corpus.
▪ So add-1 isn’t used for N-grams:
▪ We’ll see better methods

▪ But add-1 is used to smooth other NLP models


▪ For text classification
▪ In domains where the number of zeros isn’t so huge.
Smoothing: Add-one (Laplace) smoothing
Interpolation, Backoff, and Web-Scale LM’s
▪ Sometimes it helps to use less context
▪ Condition on less context for contexts you haven’t learned much about

▪ Backoff:
▪ use trigram if you have good evidence,
▪ otherwise bigram, otherwise unigram

▪ Interpolation:
▪ mix unigram, bigram, trigram

▪ Interpolation works better


▪ Simple interpolation:

    P̂(wn | wn-2 wn-1) = λ1 P(wn | wn-2 wn-1) + λ2 P(wn | wn-1) + λ3 P(wn),   with Σi λi = 1

▪ Lambdas conditional on context: each λi can itself depend on the preceding words, λi(wn-2, wn-1)

▪ Use a held-out corpus:

    Training Data  |  Held-Out Data  |  Test Data

▪ Choose λs to maximize the probability of held-out data:
  ▪ Fix the N-gram probabilities (on the training data)
  ▪ Then search for λs that give the largest probability to the held-out set:

    log P(w1 … wn | M(λ1 … λk)) = Σi log P_{M(λ1…λk)}(wi | wi-1)
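A minimal sketch of simple linear interpolation; the lambdas and the toy estimators below are made up, and in practice the λs would be tuned on held-out data as described above:

```python
def interpolated_prob(w, w1, w2, p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    """P_hat(w | w1 w2) = l1*P(w) + l2*P(w | w2) + l3*P(w | w1 w2)."""
    l1, l2, l3 = lambdas
    return l1 * p_uni(w) + l2 * p_bi(w, w2) + l3 * p_tri(w, w1, w2)

# Toy estimators, just to exercise the interface.
p_uni = lambda w: 0.01
p_bi  = lambda w, w2: 0.05
p_tri = lambda w, w1, w2: 0.20
print(interpolated_prob("food", "want", "english", p_uni, p_bi, p_tri))
# 0.1*0.01 + 0.3*0.05 + 0.6*0.20 = 0.136
```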
▪ If we know all the words in advance
▪ Vocabulary V is fixed
▪ Closed vocabulary task

▪ Often we don’t know this


▪ Out Of Vocabulary = OOV words
▪ Open vocabulary task

▪ Instead: create an unknown word token <UNK>


▪ Training of <UNK> probabilities
▪ Create a fixed lexicon L of size V
▪ At text normalization phase, any training word not in L changed to <UNK>
▪ Now we train its probabilities like a normal word
▪ At decoding time
▪ If text input: Use UNK probabilities for any word not in training
▪ How to deal with, e.g., Google N-gram corpus
▪ Pruning
▪ Only store N-grams with count > threshold.
▪ Remove singletons of higher-order n-grams
▪ Entropy-based pruning

▪ Efficiency
▪ Efficient data structures like tries
▪ Bloom filters: approximate language models
▪ Store words as indexes, not strings
▪ Use Huffman coding to fit large numbers of words into two bytes
▪ Quantize probabilities (4-8 bits instead of 8-byte float)
▪ “Stupid backoff” (Brants et al. 2007)
▪ No discounting, just use relative frequencies:

    S(wi | w(i-k+1..i-1)) = count(w(i-k+1..i)) / count(w(i-k+1..i-1))   if count(w(i-k+1..i)) > 0
                          = 0.4 · S(wi | w(i-k+2..i-1))                 otherwise

    S(wi) = count(wi) / N
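A sketch of stupid backoff over a toy corpus; note that the scores S are not probabilities (they need not sum to 1):

```python
from collections import Counter

tokens = "<s> i want to eat chinese food </s>".split()
counts = Counter()                       # n-gram counts of every order up to 3
for n in (1, 2, 3):
    counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
N = len(tokens)

def stupid_backoff(word, context, alpha=0.4):
    """S(word | context); context is a tuple of preceding words."""
    if not context:
        return counts[(word,)] / N                      # unigram base case
    if counts[context + (word,)] > 0:
        return counts[context + (word,)] / counts[context]
    return alpha * stupid_backoff(word, context[1:], alpha)   # back off, weighted by 0.4

print(stupid_backoff("food", ("eat", "chinese")))   # seen trigram: 1/1 = 1.0
print(stupid_backoff("food", ("want", "chinese")))  # unseen trigram: 0.4 * c(chinese food)/c(chinese) = 0.4
```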
▪Add-1 smoothing:
▪ OK for text categorization, not for language
modeling
▪The most commonly used method:
▪ Extended Interpolated Kneser-Ney
▪For very large N-grams like the Web:
▪ Stupid backoff
▪ Discriminative models:
▪ choose n-gram weights to improve a task, not to fit the training set

▪ Parsing-based models
▪ Caching Models
▪ Recently used words are more likely to appear

▪ These perform very poorly for speech recognition (why?)

    P_CACHE(w | history) = λ P(wi | wi-2 wi-1) + (1-λ) · c(w ∈ history) / |history|
Interpolation, Backoff, and Web-Scale LM’s
<s> I am Sam </s>
<s> Sam I am </s>
<s> I am Sam </s>
<s> I do not like green eggs and Sam </s>
Using a bigram language model with add-one smoothing,
what is P(Sam | am)?
The Task of Text Classification
▪ 1787-8: anonymous essays try to convince New York to ratify the U.S. Constitution: Jay,
Madison, Hamilton.
▪ Authorship of 12 of the letters in dispute

▪ 1963: solved by Mosteller and Wallace using Bayesian methods

James Madison Alexander Hamilton


1. By 1925 present-day Vietnam was divided into three parts under French
colonial rule. The southern region embracing Saigon and the Mekong delta
was the colony of Cochin-China; the central area with its imperial capital at
Hue was the protectorate of Annam…
2. Clara never failed to be astonished by the extraordinary felicity of her own
name. She found it hard to trust herself to the mercy of fate, which had
managed over the years to convert her greatest shame into one of her greatest
assets…

S. Argamon, M. Koppel, J. Fine, A. R. Shimoni, 2003. “Gender, Genre, and Writing Style in Formal Written Texts,” Text, volume 23,
number 3, pp. 321–346
▪ unbelievably disappointing

▪ Full of zany characters and richly applied satire, and some great plot twists

▪ this is the greatest screwball comedy ever filmed

▪ It was pathetic. The worst part about it was the boxing scenes.

MEDLINE Article MeSH Subject Category Hierarchy
▪ Antogonists and Inhibitors
▪ Blood Supply
▪ Chemistry
▪ Drug Therapy
▪ Embryology
▪ Epidemiology
▪ …
▪Assigning subject categories, topics, or genres
▪Spam detection
▪Authorship identification
▪Age/gender identification
▪Language Identification
▪Sentiment analysis
▪…
▪Input:
▪ a document d
▪ a fixed set of classes C = {c1, c2,…, cJ}

▪Output: a predicted class c  C


▪ Rules based on combinations of words or other features
▪ spam: black-list-address OR (“dollars” AND “have been selected”)

▪ Accuracy can be high


▪ If rules carefully refined by expert

▪ But building and maintaining these rules is expensive


▪Input:
▪ a document d
▪ a fixed set of classes C = {c1, c2,…, cJ}
▪ A training set of m hand-labeled documents
(d1,c1),....,(dm,cm)
▪Output:
▪ a learned classifier γ:d → c

▪Any kind of classifier
▪ Naïve Bayes
▪ Logistic regression
▪ Support-vector machines
▪ k-Nearest Neighbors

▪…
The Task of Text Classification
Naïve Bayes
▪Simple (“naïve”) classification method based on Bayes
rule
▪Relies on very simple representation of document
▪Bag of words
▪ Example document (a movie review):

    “I love this movie! It's sweet, but with satirical humor. The dialogue is great
    and the adventure scenes are fun… It manages to be whimsical and romantic while
    laughing at the conventions of the fairy tale genre. I would recommend it to just
    about anyone. I've seen it several times, and I'm always happy to see it again
    whenever I have a friend who hasn't seen it yet.”

▪ The bag-of-words representation used by γ(d) = c keeps only the words and their counts, e.g.:

    great      2
    love       2
    recommend  1
    laugh      1
    happy      1
    …          …
▪ More generally, a test document of unknown class (“?”) is assigned to one of several
  classes such as Machine Learning, NLP, Garbage Collection, Planning, or GUI, each
  characterized by its typical words (e.g., learning, training, algorithm, shrinkage,
  network…; parser, tag, translation, language…; garbage, collection, memory,
  optimization, region…; planning, temporal, reasoning, plan, language…).
• For a document d and a class c:

    P(c | d) = P(d | c) P(c) / P(d)

    c_MAP = argmax_{c∈C} P(c | d)                      MAP is “maximum a posteriori” = most likely class
          = argmax_{c∈C} P(d | c) P(c) / P(d)          (Bayes rule)
          = argmax_{c∈C} P(d | c) P(c)                 (dropping the denominator)
          = argmax_{c∈C} P(x1, x2, …, xn | c) P(c)     (document d represented as features x1..xn)

• P(x1, x2, …, xn | c) has O(|X|^n · |C|) parameters; it could only be estimated if a very,
  very large number of training examples was available.
• P(c): how often does this class occur? We can just count the relative frequencies in a corpus.
• How do we estimate P(x1, x2, …, xn | c)?
▪ Bag of Words assumption: assume position doesn’t matter
▪ Conditional Independence: assume the feature probabilities P(xi|cj) are independent given the class c:

    P(x1, …, xn | c) = P(x1 | c) · P(x2 | c) · P(x3 | c) · … · P(xn | c)

    c_MAP = argmax_{c∈C} P(x1, x2, …, xn | c) P(c)

    c_NB = argmax_{c∈C} P(cj) ∏_{x∈X} P(x | c)

▪ positions ← all word positions in the test document

    c_NB = argmax_{cj∈C} P(cj) ∏_{i∈positions} P(xi | cj)
Sec.13.3

▪ First attempt: maximum likelihood estimates
  ▪ simply use the frequencies in the data

    P̂(cj) = doccount(C = cj) / N_doc

    P̂(wi | cj) = count(wi, cj) / Σ_{w∈V} count(w, cj)

  i.e., the fraction of times word wi appears among all words in documents of topic cj

▪ Create a mega-document for topic j by concatenating all the docs in this topic
▪ Use the frequency of w in the mega-document

Sec.13.3

▪ What if we have seen no training documents with the word fantastic and classified in the topic
positive (thumbs-up)?

    P̂("fantastic" | positive) = count("fantastic", positive) / Σ_{w∈V} count(w, positive) = 0

▪ Zero probabilities cannot be conditioned away, no matter the other evidence!

    c_MAP = argmax_c P̂(c) ∏_i P̂(xi | c)

▪ Laplace (add-1) smoothing for Naïve Bayes:

    P̂(wi | c) = ( count(wi, c) + 1 ) / Σ_{w∈V} ( count(w, c) + 1 )
              = ( count(wi, c) + 1 ) / ( (Σ_{w∈V} count(w, c)) + |V| )
• From the training corpus, extract the Vocabulary
• Calculate the P(cj) terms:
    For each cj in C do
      docsj ← all docs with class = cj
      P(cj) ← |docsj| / |total # documents|
• Calculate the P(wk | cj) terms:
    Textj ← single document containing all docsj
    For each word wk in Vocabulary:
      nk ← # of occurrences of wk in Textj
      P(wk | cj) ← (nk + α) / (n + α |Vocabulary|)
Add one extra word to the vocabulary, the “unknown word” wu:

    P̂(wu | c) = ( count(wu, c) + 1 ) / ( (Σ_{w∈V} count(w, c)) + |V| + 1 )
              = 1 / ( (Σ_{w∈V} count(w, c)) + |V| + 1 )
Naïve Bayes: Relationship to
Language Modeling
c = China

X1 = Shanghai   X2 = and   X3 = Shenzhen   X4 = issue   X5 = bonds
▪ Naïve bayes classifiers can use any sort of
feature
▪ URL, email address, dictionaries, network features
▪ But if, as in the previous slides
▪ We use only word features
▪ we use all of the words in the text (not a subset)
▪ Then
▪ Naïve bayes has an important similarity to language
modeling.

Sec.13.2.1

▪ Assigning each word: P(word | c)
▪ Assigning each sentence: P(s | c) = ∏ P(word | c)

  Class pos:
    0.1   I
    0.1   love
    0.01  this
    0.05  fun
    0.1   film

  I     love  this  fun   film
  0.1   0.1   0.01  0.05  0.1        P(s | pos) = 0.0000005
Sec.13.2.1

▪ Which class assigns the higher probability to s?

  Model pos:            Model neg:
    0.1   I               0.2    I
    0.1   love            0.001  love
    0.01  this            0.01   this
    0.05  fun             0.005  fun
    0.1   film            0.1    film

  I     love   this  fun    film
  0.1   0.1    0.01  0.05   0.1       (pos)
  0.2   0.001  0.01  0.005  0.1       (neg)

  P(s | pos) > P(s | neg)
▪ Worked example, with P̂(c) = Nc / N and P̂(w | c) = (count(w, c) + 1) / (count(c) + |V|):

              Doc  Words                                  Class
  Training    1    Chinese Beijing Chinese                c
              2    Chinese Chinese Shanghai               c
              3    Chinese Macao                          c
              4    Tokyo Japan Chinese                    j
  Test        5    Chinese Chinese Chinese Tokyo Japan    ?

  Priors:
    P(c) = 3/4
    P(j) = 1/4

  Conditional probabilities:
    P(Chinese|c) = (5+1) / (8+6) = 6/14 = 3/7
    P(Tokyo|c)   = (0+1) / (8+6) = 1/14
    P(Japan|c)   = (0+1) / (8+6) = 1/14
    P(Chinese|j) = (1+1) / (3+6) = 2/9
    P(Tokyo|j)   = (1+1) / (3+6) = 2/9
    P(Japan|j)   = (1+1) / (3+6) = 2/9

  Choosing a class:
    P(c|d5) ∝ 3/4 × (3/7)³ × 1/14 × 1/14 ≈ 0.0003
    P(j|d5) ∝ 1/4 × (2/9)³ × 2/9 × 2/9   ≈ 0.0001
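A small sketch of multinomial Naïve Bayes with add-1 smoothing that reproduces the worked example above, computed in log space to avoid underflow:

```python
import math
from collections import Counter, defaultdict

train = [("Chinese Beijing Chinese", "c"),
         ("Chinese Chinese Shanghai", "c"),
         ("Chinese Macao", "c"),
         ("Tokyo Japan Chinese", "j")]
test = "Chinese Chinese Chinese Tokyo Japan"

doc_counts = Counter(c for _, c in train)          # documents per class (for the prior)
word_counts = defaultdict(Counter)                 # word counts per class
for text, c in train:
    word_counts[c].update(text.split())
vocab = {w for text, _ in train for w in text.split()}

def log_posterior(doc, c):
    logp = math.log(doc_counts[c] / len(train))                      # log prior
    total = sum(word_counts[c].values())
    for w in doc.split():                                            # add-1 likelihoods
        logp += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
    return logp

for c in doc_counts:
    print(c, math.exp(log_posterior(test, c)))
# c ≈ 0.0003, j ≈ 0.0001  →  choose class c
```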
▪ SpamAssassin Features:
▪ Mentions Generic Viagra
▪ Online Pharmacy
▪ Mentions millions of (dollar) ((dollar) NN,NNN,NNN.NN)
▪ Phrase: impress ... girl
▪ From: starts with many numbers
▪ Subject is all capitals
▪ HTML has a low ratio of text to image area
▪ One hundred percent guaranteed
▪ Claims you can be removed from the list
▪ 'Prestigious Non-Accredited Universities'
▪ http://spamassassin.apache.org/tests_3_3_x.html
▪ Very Fast, low storage requirements
▪ Robust to Irrelevant Features
Irrelevant Features cancel each other without affecting results
▪ Very good in domains with many equally important features
Decision Trees suffer from fragmentation in such cases – especially if little data
▪ Optimal if the independence assumptions hold: if the assumed independence is correct, then it is the Bayes Optimal Classifier for the problem
▪ A good dependable baseline for text classification

▪ But we will see other classifiers that give better accuracy


The Task of Text Classification
Precision, Recall & F1 Score
                 correct   not correct
  selected       tp        fp
  not selected   fn        tn

▪ Precision: % of selected items that are correct = tp / (tp + fp)
▪ Recall: % of correct items that are selected = tp / (tp + fn)
▪ A combined measure that assesses the P/R tradeoff is the F measure (weighted harmonic mean):

    F = 1 / ( α·(1/P) + (1-α)·(1/R) ) = (β² + 1) P R / (β² P + R)

▪ The harmonic mean is a very conservative average; see IIR § 8.3
▪ People usually use the balanced F1 measure
  ▪ i.e., with β = 1 (that is, α = ½):  F = 2PR / (P + R)
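A tiny sketch of these metrics computed from raw counts:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and balanced F1 from a contingency table."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of P and R
    return precision, recall, f1

print(precision_recall_f1(tp=10, fp=10, fn=10))   # (0.5, 0.5, 0.5)
```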
Precision, Recall & F1 Score
Text Classification: Evaluation
Sec.14.5

▪ Dealing with any-of or multivalue classification


▪ A document can belong to 0, 1, or >1 classes.

▪ For each class c∈C


▪ Build a classifier γc to distinguish c from all other classes c’ ∈C

▪ Given test doc d,


▪ Evaluate it for membership in each class using each γc
▪ d belongs to any class for which γc returns true

Sec.14.5

▪ One-of or multinomial classification


▪ Classes are mutually exclusive: each document in exactly one class

▪ For each class c∈C


▪ Build a classifier γc to distinguish c from all other classes c’ ∈C

▪ Given test doc d,


▪ Evaluate it for membership in each class using each γc
▪ d belongs to the one class with maximum score

Sec. 15.2.4

▪ Most (over)used data set: 21,578 docs (each ~90 types, 200 tokens)
▪ 9603 training, 3299 test articles (ModApte/Lewis split)
▪ 118 categories
▪ An article can be in more than one category
▪ Learn 118 binary category distinctions

▪ Average document (with at least one category) has 1.24 classes


▪ Only about 10 out of 118 categories are large

  Common categories (#train, #test):
    • Earn (2877, 1087)          • Trade (369, 119)
    • Acquisitions (1650, 179)   • Interest (347, 131)
    • Money-fx (538, 179)        • Ship (197, 89)
    • Grain (433, 149)           • Wheat (212, 71)
    • Crude (389, 189)           • Corn (182, 56)
Sec. 15.2.4

<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="12981"


NEWID="798">
<DATE> 2-MAR-1987 16:51:43.42</DATE>
<TOPICS><D>livestock</D><D>hog</D></TOPICS>
<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>
<DATELINE> CHICAGO, March 2 - </DATELINE><BODY>The American Pork Congress kicks off tomorrow,
March 3, in Indianapolis with 160 of the nations pork producers from 44 member states determining industry positions
on a number of issues, according to the National Pork Producers Council, NPPC.
Delegates to the three day Congress will be considering 26 resolutions concerning various issues, including the future
direction of farm policy and the tax law as it applies to the agriculture sector. The delegates will also debate whether to
endorse concepts of a national PRV (pseudorabies virus) control and eradication program, the NPPC said.
A large trade show, in conjunction with the congress, will feature the latest in technology in all areas of the industry,
the NPPC added. Reuter
</BODY></TEXT></REUTERS>
▪ For each pair of classes <c1,c2>, how many documents from c1 were incorrectly assigned to c2?
  ▪ e.g., c3,2: 90 wheat documents incorrectly assigned to poultry

  Docs in test set   Assigned  Assigned  Assigned  Assigned  Assigned  Assigned
                     UK        poultry   wheat     coffee    interest  trade
  True UK            95        1         13        0         1         0
  True poultry       0         1         0         0         0         0
  True wheat         10        90        0         1         0         0
  True coffee        0         0         0         34        3         7
  True interest      -         1         2         13        26        5
  True trade         0         0         2         14        5         10
Sec. 15.2.4

Recall for class i — fraction of docs in class i classified correctly:

    c_ii / Σ_j c_ij

Precision for class i — fraction of docs assigned class i that are actually about class i:

    c_ii / Σ_j c_ji

Accuracy (1 − error rate) — fraction of docs classified correctly:

    Σ_i c_ii / Σ_i Σ_j c_ij
Sec. 15.2.4

▪ If we have more than one class, how do we combine multiple performance measures into one quantity?
  ▪ Macroaveraging: compute performance for each class, then average.
  ▪ Microaveraging: collect decisions for all classes, compute one contingency table, evaluate.
Sec. 15.2.4

                   Class 1             Class 2             Micro Ave. Table
                   Truth:   Truth:     Truth:   Truth:     Truth:   Truth:
                   yes      no         yes      no         yes      no
  Classifier: yes  10       10         90       10         100      20
  Classifier: no   10       970        10       890        20       1860

• Macroaveraged precision: (0.5 + 0.9) / 2 = 0.7
• Microaveraged precision: 100 / 120 = .83
• Microaveraged score is dominated by the score on common classes
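A sketch of the two averaging schemes applied to the example above:

```python
# Per-class contingency tables from the example.
tables = [{"tp": 10, "fp": 10, "fn": 10, "tn": 970},    # class 1
          {"tp": 90, "fp": 10, "fn": 10, "tn": 890}]    # class 2

# Macroaveraging: compute precision per class, then average the scores.
macro = sum(t["tp"] / (t["tp"] + t["fp"]) for t in tables) / len(tables)

# Microaveraging: pool the raw counts into one table, then compute precision.
tp = sum(t["tp"] for t in tables)
fp = sum(t["fp"] for t in tables)
micro = tp / (tp + fp)

print(macro, micro)   # 0.7  0.8333...
```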
Split the data:  Training Set | Development Test Set | Test Set

▪ Metric: P/R/F1 or Accuracy
▪ Unseen test set
  ▪ avoid overfitting (“tuning to the test set”)
  ▪ more conservative estimate of performance
▪ Cross-validation over multiple splits (rotating which slice serves as the dev test set)
  ▪ Handle sampling errors from different datasets
  ▪ Pool results over each split
  ▪ Compute pooled dev set performance
Text Classification: Evaluation
Text Classification: Practical Issues
Sec. 15.3.1

▪ Gee, I’m building a text classifier for real, now!


▪ What should I do?

Sec. 15.3.1

If (wheat or grain) and not (whole or bread) then


Categorize as grain

▪Need careful crafting


▪ Human tuning on development data
▪ Time-consuming: 2 days per class

Sec. 15.3.1

▪Use Naïve Bayes


▪ Naïve Bayes is a “high-bias” algorithm (Ng and Jordan 2002 NIPS)
▪Get more labeled data
▪ Find clever ways to get humans to label data for you
▪Try semi-supervised training methods:
▪ Bootstrapping, EM over unlabeled documents, …

Sec. 15.3.1

▪Perfect for all the clever classifiers


▪ SVM
▪ Regularized Logistic Regression
▪You can even use user-interpretable decision
trees
▪ Users like to hack
▪ Management likes quick fixes

Sec. 15.3.1

▪Can achieve high accuracy!


▪At a cost:
▪ SVMs (train time) or kNN (test time) can be too slow
▪ Regularized logistic regression can be somewhat better
▪So Naïve Bayes can come back into its own again!

Sec. 15.3.1

▪With enough data


▪ Classifier may not matter

Brill and Banko on spelling correction
▪ Automatic classification
▪ Manual review of uncertain/difficult/“new” cases

▪ Multiplying lots of probabilities can result in floating-point underflow.
▪ Since log(xy) = log(x) + log(y)
▪ Better to sum logs of probabilities instead of multiplying probabilities.
▪ Class with highest un-normalized log probability score is still most
probable.
    c_NB = argmax_{cj∈C} [ log P(cj) + Σ_{i∈positions} log P(xi | cj) ]

▪ Model is now just max of sum of weights


Sec. 15.3.2

▪ Domain-specific features and weights: very important in real


performance
▪ Sometimes need to collapse terms:
▪ Part numbers, chemical formulas, …
▪ But stemming generally doesn’t help

▪ Upweighting: Counting a word as if it occurred twice:


▪ title words (Cohen & Singer 1996)
▪ first sentence of each paragraph (Murata, 1999)
▪ In sentences that contain title words (Ko et al, 2002)

Text Classification: Practical Issues
