
L3 Language Models

This document provides an introduction to probabilities in natural language processing (NLP). It discusses key probability concepts like sample spaces, events, joint probabilities, and conditional probabilities. It also explains how probabilities are used in NLP tasks like speech recognition, machine translation, and language modeling. Probabilities allow evidence from multiple sources to be combined systematically. N-gram language models predict word probabilities based on the previous n words.


NLP

Introduction to NLP

123.
Probabilities
Probabilities in NLP
• Very important for language processing
– “Let’s meet in the conference …”
• Example in speech recognition:
– “recognize speech” vs “wreck a nice beach”
• Example in machine translation:
– “l’avocat general”: “the attorney general” vs. “the general avocado”
• Example in information retrieval:
– If a document includes three occurrences of “stir” and one of “rice”, what is the
probability that it is a recipe?
• Probabilities make it possible to combine evidence from multiple
sources in a systematic way
Probabilities
• Probability theory
– predicting how likely it is that something will happen
• Experiment (trial)
– e.g., throwing a coin
• Possible outcomes
– heads or tails
• Sample spaces
– discrete (number of occurrences of “rice”) or continuous (e.g., temperature)
• Events
– Ω is the certain event
– ∅ is the impossible event
– event space - all possible events
Sample Space
• Random experiment: an experiment with uncertain outcome
– e.g., flipping a coin, picking a word from text
• Sample space: all possible outcomes, e.g.,
– Tossing 2 fair coins, Ω = {HH, HT, TH, TT}
Events
• Event: a subspace of the sample space
– E ⊆ Ω; E happens iff the outcome is in E, e.g.,
• E={HH} (all heads)
• E={HH,TT} (same face)
– Impossible event (∅)
– Certain event (Ω)
• Probability of an event E: 0 ≤ P(E) ≤ 1, s.t.
– P(Ω)=1 (the outcome is always in Ω)
– P(A ∪ B) = P(A) + P(B), if A ∩ B = ∅ (e.g., A=same face, B=different face)
Example: Tossing a Die
• Sample space: Ω = {1,2,3,4,5,6}
• Fair die:
– p(1) = p(2) = p(3) = p(4) = p(5) = p(6) = 1/6
• Unfair die example: p(1) = 0.3, p(2) = 0.2, ...
• N-dimensional die:
– Ω = {1, 2, 3, 4, …, N}
• Example in modeling text:
– Toss a die to decide which word to write in the next position
– Ω = {cat, dog, tiger, …}
Example: Flipping a Coin
• Ω: {Head, Tail}
• Fair coin:
– p(H) = 0.5, p(T) = 0.5
• Unfair coin, e.g.:
– p(H) = 0.3, p(T) = 0.7
• Flipping two fair coins:
– Sample space: {HH, HT, TH, TT}
• Example in modeling text:
– Flip a coin to decide whether or not to include a word in a document
– Sample space = {appear, absence}
Probabilities
• Probabilities
– numbers between 0 and 1
• Probability distribution
– distributes a probability mass of 1 throughout the sample space Ω.
• Example:
– A fair coin is tossed three times.
– What is the probability of 3 heads?
– What is the probability of 2 heads?
Meaning of probabilities

• Frequentist
– I threw the coin 10 times and it turned up heads 5 times
• Subjective
– I am willing to bet 50 cents on heads
Probabilities
• Joint probability: P(A ∩ B), also written as P(A, B)
• Conditional probability:
P(B|A) = P(A ∩ B)/P(A)
P(A ∩ B) = P(A)P(B|A) = P(B)P(A|B)
P(A|B) = P(B|A)P(A)/P(B) (Bayes’ Rule)
For independent events, P(A ∩ B) = P(A)P(B), so P(A|B) = P(A)

• Total probability:
If A1, …, An form a partition of S, then
P(B) = P(B ∩ S) = P(B, A1) + … + P(B, An) (why?)
So, P(Ai|B) = P(B|Ai)P(Ai)/P(B) = P(B|Ai)P(Ai)/[P(B|A1)P(A1) + … + P(B|An)P(An)]
This allows us to compute P(Ai|B) based on P(B|Ai)
Conditional Probability
• Prior and posterior probability
• Conditional probability of A given B:
P(A|B) = P(A ∩ B) / P(B)
Properties of Probabilities

• P(∅) = 0
• P(certain event) = 1
• P(X) ≤ P(Y), if X ⊆ Y
• P(X ∪ Y) = P(X) + P(Y), if X ∩ Y = ∅
Conditional Probability

• Six-sided fair die


P(D even)=?
P(D>=4)=?
P(D even|D>=4)=?
P(D odd|D>=4)=?
• Multiple conditions
P(D odd|D>=4, D<=5)=?
Answers

• Six-sided fair die


P(D even)=3/6=1/2
P(D>=4)=3/6=1/2
P(D even|D>=4)=2/3
P(D odd|D>=4)=1/3
• Multiple conditions
P(D odd|D>=4, D<=5)=1/2
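These values can be checked by brute-force enumeration over the six equally likely outcomes; a minimal sketch (function names are illustrative):

```python
from fractions import Fraction

outcomes = range(1, 7)  # the six equally likely faces of a fair die

def prob(event, given=lambda d: True):
    """P(event | given) by counting equally likely outcomes."""
    conditioned = [d for d in outcomes if given(d)]
    favorable = [d for d in conditioned if event(d)]
    return Fraction(len(favorable), len(conditioned))

print(prob(lambda d: d % 2 == 0))                              # P(D even) = 1/2
print(prob(lambda d: d >= 4))                                  # P(D >= 4) = 1/2
print(prob(lambda d: d % 2 == 0, lambda d: d >= 4))            # P(D even | D >= 4) = 2/3
print(prob(lambda d: d % 2 == 1, lambda d: 4 <= d <= 5))       # P(D odd | D >= 4, D <= 5) = 1/2
```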
The Chain Rule
• P(w1,w2,w3…wn) = ?
• Using the chain rule:
– P(w1,w2,w3…wn) =P(w1) P(w2|w1) P(w3|w1,w2)… P(wn|w1,w2…wn-1)
• This rule is used in many ways in statistical NLP,
more specifically in Markov Models
Probability Review

[slide from Brendan O’Connor]


Probability Review 2/2

[slide from Brendan O’Connor]


NLP
Introduction to NLP

125.
Bayes Theorem
Bayes Theorem
• Formula for joint probability
p(A,B) = p(B|A)p(A)
p(A,B) = p(A|B)p(B)
• Therefore
p(B|A) = p(A|B)p(B)/p(A)
• Bayes’ theorem is used to calculate P(A|B) given
P(B|A)
Example
• Diagnostic test
• Test accuracy
p(positive | disease) = 0.05 – false positive
p(negative | disease) = 0.05 – false negative
So: p(positive | disease) = 1-0.05 = 0.95
Same for p(negative | disease)
In general the rates of false positives and false negatives
can be different
Example

• Diagnostic test with errors

P(A|B)                A = TEST
                      Positive   Negative
B = DISEASE   Yes     0.95       0.05
              No      0.05       0.95
Example
• What is p(disease | positive)?
– P(disease|positive) = P(positive|disease) × P(disease) / P(positive)
– P(¬disease|positive) = P(positive|¬disease) × P(¬disease) / P(positive)
– P(disease|positive) / P(¬disease|positive) = ?

• We don’t really care about p(positive)
– as long as it is not zero, we can divide both sides by this quantity
Example
• P(disease|positive) / P(¬disease|positive) =
(P(positive|disease) × P(disease)) / (P(positive|¬disease) × P(¬disease))
• Suppose P(disease) = 0.001
– so P(¬disease) = 0.999
• P(disease|positive) / P(¬disease|positive) = (0.95 × 0.001)/(0.05 × 0.999) ≈ 0.019
• P(disease|positive) + P(¬disease|positive) = 1
• P(disease|positive) ≈ 0.02
• Notes
– P(disease) is called the prior probability
– P(disease|positive) is called the posterior probability
– In this example the posterior is 20 times larger than the prior
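The same arithmetic as a short sketch (variable names are illustrative):

```python
# Bayes' rule for the diagnostic-test example above.
p_disease = 0.001                  # prior P(disease)
p_pos_given_disease = 0.95         # P(positive | disease)
p_pos_given_no_disease = 0.05      # P(positive | no disease), the false-positive rate

# Odds form used on the slide: P(disease|positive) / P(no disease|positive)
odds = (p_pos_given_disease * p_disease) / (p_pos_given_no_disease * (1 - p_disease))
print(round(odds, 3))              # ~0.019

# Direct form, using the total probability of a positive test
p_positive = p_pos_given_disease * p_disease + p_pos_given_no_disease * (1 - p_disease)
p_disease_given_pos = p_pos_given_disease * p_disease / p_positive
print(round(p_disease_given_pos, 4))   # ~0.0187, i.e. roughly 0.02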
Example: An Unfair Die
• It’s more likely to get a 6 and less likely to get a 1
– p(6) > p(1)
– How likely?
• What if you toss the die 1000 times,
and observe “6” 501 times, “1” 108 times?
– p(6) = 501/1000 = 0.501
– p(1) = 108/1000 = 0.108
– As simple as counting, but principled – maximum likelihood estimate

[slide from Qiaozhu Mei]


What if the Die has More Faces?
• Suitable to represent documents
• Every face corresponds to a word in vocabulary
• The author tosses a die to write a word
• Apparently, an unfair die

[slide from Qiaozhu Mei]


NLP
Introduction to NLP

211.
Language Models
Probabilistic Language Models
• Assign a probability to a sentence
– P(S) = P(w1,w2,w3,...,wn)
• The sum of the probabilities of all possible sentences must be 1.
• Predicting the next word
– Let’s meet in Times …
– General Electric has lost some market …
• Formula
– P(wn|w1,w2,...,wn-1)
Predicting the Next Word
• What word follows “your”? http://norvig.com/ngrams/count_2w.txt

your abilities 160848


your active 140797
your ability 1116122
your activities 226183
your ablum 112926
your activity 156213
your academic 274761
your actual 302488
your acceptance 783544
your ad 1450485
your access 492555
your address 1611337
your accommodation 320408
your admin 117943
your account 8149940
your ads 264771
your accounting 128409
your advantage 242238
your accounts 257118
your adventure 109658
your action 121057
your advert 101178
your actions 492448
your advertisement 172783
your activation 459379
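As a sketch, a handful of the counts above can be turned into conditional probabilities P(w | "your"); the probabilities here are relative to this small subset only, for illustration:

```python
# A few of the "your <w>" counts listed above (subset only).
counts = {
    "account": 8149940,
    "address": 1611337,
    "ad": 1450485,
    "ability": 1116122,
    "acceptance": 783544,
}
total = sum(counts.values())
p_next = {w: c / total for w, c in counts.items()}   # P(w | "your") within this subset
print(max(p_next, key=p_next.get), round(p_next["account"], 3))  # "account" is the most likely continuation
```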
Uses of Language Models
• Speech recognition
– P(“recognize speech”) > P(“wreck a nice beach”)
• Text generation
– P(“three houses”) > P(“three house”)
• Spelling correction
– P(“my cat eats fish”) > P(“my xat eats fish”)
• Machine translation
– P(“the blue house”) > P(“the house blue”)
• Other uses
– OCR
– Summarization
– Document classification
• Usually coupled with a translation model (later)
Probability of a Sentence
• How to compute the probability of a sentence?
– What if the sentence is novel?
• What we need to estimate:
• P(S)=P(w1,w2,w3…wn)
• Using the chain rule:
• P(S)= P(w1) P(w2|w1) P(w3|w1,w2)… P(wn|w1,w2…wn-1)
• Example:
• P(“I would like the pepperoni and spinach pizza”)=?
N-gram Models
• Predict the probability of a word based on the words before:
– P(square|Let’s meet in Times)
• Markov assumption
– Only look at limited history
• N-gram models
– Unigram – no context: P(square)
– Bigram: P(square|Times)
– Trigram: P(square|in Times)
• It is possible to go to 3,4,5-grams
– Longer n-grams suffer from sparseness
• Used for predicting the next word and also for random text generation
Approximating Shakespeare
Approximating the Wall Street Journal
N-Grams
• Shakespeare unigrams
– 29,524 types, approx. 900K tokens
• Bigrams
– 346,097 types, approx. 900K tokens
– How many bigrams are never seen in the data?
• Notice!
– very sparse data!
Google 1-T Corpus
• 1 trillion word tokens
• Number of tokens
– 1,024,908,267,229
• Number of sentences
– 95,119,665,584
• Number of unigrams
– 13,588,391
• Number of bigrams
– 314,843,401
• Number of trigrams
– 977,069,902
• Number of fourgrams
– 1,313,818,354
• Number of fivegrams
– 1,176,470,663
https://catalog.ldc.upenn.edu/ldc2006t13
Parameter Estimation
• Can we compute the conditional probabilities directly?
– No, because the data is sparse
• Markov assumption
• P(“musical” | “I would like two tickets for the”) = P(“musical” | “the”)
or
• P(“musical” | “I would like two tickets for the”) = P(“musical” | “for the”)
Maximum Likelihood Estimates
• Use training data
– Count how many times a given context appears in it.

• Unigram example:
– The word “pizza” appears 700 times in a corpus of 10,000,000 words.
– Therefore the MLE for its probability is P’(“pizza”) = 700/10,000,000 = 0.00007

• Bigram example:
– The word “with” appears 1,000 times in the corpus.
– The phrase “with spinach” appears 6 times
– Therefore the MLE is P’(spinach|with) = 6/1,000 = 0.006

• These estimates may not be good for corpora from other genres
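A minimal sketch of these count-based estimates; the toy corpus and function names below are illustrative, not from the slides:

```python
from collections import Counter

def mle_bigram_model(tokens):
    """Maximum likelihood estimates: P(w) = c(w)/N and P(w2|w1) = c(w1, w2)/c(w1)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    p_unigram = lambda w: unigrams[w] / total
    p_bigram = lambda w1, w2: bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0
    return p_unigram, p_bigram

corpus = "pasta with spinach and pizza with spinach".split()
p_uni, p_bi = mle_bigram_model(corpus)
print(p_uni("spinach"))         # 2/7
print(p_bi("with", "spinach"))  # 2/2 = 1.0
```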
Probability of a Sentence
P(“<S> I will see you on Monday</S>”) =
P(I|<S>)
x P(will|I)
x P(see|will)
x P(you|see)
x P(on|you)
x P(Monday|on)
x P(</S>|Monday)
Probability of a Sentence
Example from Jane Austen
• P(“Elizabeth looked at Darcy”)
• Use maximum likelihood estimates for the n-gram probabilities
– unigram: P(wi) = c(wi)/N, where N is the number of tokens in the corpus
– bigram: P(wi|wi-1) = c(wi-1,wi)/c(wi-1)

• Values
– P(“Elizabeth”) = 474/617091 = .000768120
– P(“looked|Elizabeth”) = 5/474 = .010548523
– P(“at|looked”) = 74/337 = .219584569
– P(“Darcy|at”) = 3/4055 = .000739827

• Bigram probability
– P(“Elizabeth looked at Darcy”) = .000000001316 ≈ 1.3 × 10⁻⁹

• Unigram probability
– P(“Elizabeth looked at Darcy”) = 474/617091 × 337/617091 × 4055/617091 × 304/617091 = .000000000001357 ≈ 1.36 × 10⁻¹²

• P(“looked Darcy Elizabeth at”) = ?


Generative Models

• Unigram:
– generate a word, then generate the next one, until you generate </S>.

W1 W2 W3 … Wn </S>

• Bigram:
– generate <S>, generate a word, then generate the next one based on
the previous one, etc., until you generate </S>.
<S> W1 W2 W3 … Wn </S>
Engineering Trick

• The MLE values are often on the order of 10⁻⁶ or less
– Multiplying 20 such values gives a number on the order of 10⁻¹²⁰
– This leads to underflow
• Use logarithms instead
– 10⁻⁶ becomes −6 in log base 10
– Use sums instead of products
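A small sketch of the log-space trick; the probability values are invented:

```python
import math

# Bigram probabilities for one sentence (made-up values on the order of 1e-2 to 1e-6)
probs = [1e-3, 2e-4, 5e-2, 7e-6, 3e-3]

# The naive product can underflow for long sentences; the sum of logs does not.
log_prob = sum(math.log10(p) for p in probs)
print(log_prob)        # total log10 probability (about -15.7 for these values)
print(10 ** log_prob)  # the underlying probability, recoverable when needed
```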
Language Modeling Tools

• http://www.speech.cs.cmu.edu/SLM_info.html
• http://www.speech.sri.com/projects/srilm/
• https://kheafield.com/code/kenlm/
• http://htk.eng.cam.ac.uk/
NLP
Introduction to NLP

212.
Smoothing and Interpolation
Smoothing
• If the vocabulary size is |V|=1M
– Too many parameters to estimate even a unigram model
– MLE assigns values of 0 to unseen (yet not impossible) data
– Let alone bigram or trigram models
• Smoothing (regularization)
– Reassigning some probability mass to unseen data
Smoothing
• How to model novel words?
– Or novel bigrams?
– Distributing some of the probability mass to allow for novel events
• Add-one (Laplace) smoothing:
– Bigrams: P(wi|wi-1) = (c(wi-1,wi)+1)/(c(wi-1)+V)
– This method reassigns too much probability mass to unseen events
• Possible to use add-k instead of add-one
– Neither works well in practice
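A minimal sketch of add-k estimation (k = 1 gives add-one/Laplace smoothing); the counts and vocabulary size below are placeholders:

```python
def add_k_bigram_prob(w1, w2, bigram_counts, unigram_counts, vocab_size, k=1.0):
    """P(w2|w1) = (c(w1,w2) + k) / (c(w1) + k*|V|); unseen bigrams get a small non-zero mass."""
    return (bigram_counts.get((w1, w2), 0) + k) / (unigram_counts.get(w1, 0) + k * vocab_size)

unigram_counts = {"with": 1000}
bigram_counts = {("with", "spinach"): 6}
V = 10000
print(add_k_bigram_prob("with", "spinach", bigram_counts, unigram_counts, V))  # (6+1)/(1000+10000)
print(add_k_bigram_prob("with", "quasar", bigram_counts, unigram_counts, V))   # (0+1)/(1000+10000)
```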
Dealing with Sparse Data

• Two main techniques used


– Backoff
– Interpolation
Backoff
• Going back to the lower-order n-gram model if the
higher-order model is sparse (e.g., frequency <= 1)
• Learning the parameters
– From a development data set
Interpolation
• If P’(wi|wi-1,wi-2) is sparse:
– Use λ1P’(wi|wi-1,wi-2) +λ2P’(wi|wi-1)+λ3P’(wi)
– Ensure that λ1+λ2+λ3=1, λ1,λ2,λ3≤1, λ1,λ2,λ3≥0
– Better than backoff
– Estimate the hyper-parameters λ1,λ2,λ3 from held-out data (or using
EM), e.g., using 5-fold cross-validation
• See [Chen and Goodman 1998] for more details
• Software:
– http://www.speech.cs.cmu.edu/SLM/toolkit_documentation.html
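A minimal sketch of linear interpolation with fixed λ values; in practice the λs are tuned on held-out data as noted above, and all numbers here are placeholders:

```python
def interpolated_prob(p_trigram, p_bigram, p_unigram, lambdas=(0.6, 0.3, 0.1)):
    """Linear interpolation: lambda1*P(w|w-2,w-1) + lambda2*P(w|w-1) + lambda3*P(w)."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9
    return l1 * p_trigram + l2 * p_bigram + l3 * p_unigram

# A trigram that was never observed still gets some probability mass
print(interpolated_prob(p_trigram=0.0, p_bigram=0.004, p_unigram=0.0001))
```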
NLP
Introduction to NLP

213.
Evaluation of Language Models
Evaluation of LM
• Extrinsic
– Use in an application
• Intrinsic
– Cheaper
– Based on information theory
• Correlate the two for validation purposes
Information Theory

• It is concerned with data transmission, data compression, and measuring the amount of information.
• Applies to statistical physics, economics, linguistics.
Information and Uncertainty
• The decrease in uncertainty is called information
• Example
– we know that a certain event will happen next week
– then we learn that it is more likely to happen on a workday
– the new information reduces the uncertainty
– the more new information we get, the smaller the remaining
uncertainty
Entropy

• Entropy tells us how informative a random variable is.
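For reference, the standard definition for a discrete random variable X (in bits when the logarithm is base 2):

```latex
H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x)
```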


Examples
• One symbol (a)
– uncertainty is 0
• Two symbols (a,b)
– uncertainty is 1
– we can reduce it to 0 by using one bit of information (a=0,b=1)
• Four symbols (a,b,c,d)
– we need two bits of information (e.g., a=00,b=01,c=10,d=11)
• In general we need
– log2k bits, where k is the number of symbols
– note: this only holds if all symbols are equiprobable
Amount of Surprise
• Amount of surprise (given a general prob. distribution)
-log2 p(x) - for a specific outcome x
• If the distribution is uniform:
p(x) = 1/k
k = 1/p(x)
log2k = log2 (1/p(x)) = -log2 p(x)
• Average surprise

• Information need
H(X) = 0 means that we have all the information that we need
H(X) = 1 means that we need one bit of information, etc.
Entropy Example
• Sample 8-character language: A E I O U F G H

• Three bits per character if the characters are equiprobable
Simplified Polynesian
• Six characters: P T K A I U, not equiprobable

• This number (2.5) can lead one to believe that 3 bits per character are needed
– e.g. 000, 001, 010, 100, 101, 111
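The 2.5 figure comes from the distribution usually used in this textbook example (an assumption here, since the probability table is not reproduced above): p(a) = p(t) = 1/4 and p(p) = p(k) = p(i) = p(u) = 1/8.

```latex
H(X) = -\sum_x p(x)\log_2 p(x)
     = 2\cdot\tfrac{1}{4}\log_2 4 + 4\cdot\tfrac{1}{8}\log_2 8
     = 1 + 1.5 = 2.5 \text{ bits per character}
```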
Simplified Polynesian
• More efficient encoding

• Longer codes for less frequent characters


• This can lower the average number of bits per
character to the theoretical estimate of 2.5
• Under what assumption, though?
Joint Entropy
• Amount of information to specify both x and y.

• Measures the amount of surprise of seeing a specific tag bigram.
Conditional Entropy
• If we know x, how much additional information is
needed to know y.
Chain Rule for Entropy
• Chain rule for entropy
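For reference, the standard definitions behind the three preceding slides (joint entropy, conditional entropy, and the chain rule):

```latex
H(X, Y) = -\sum_{x}\sum_{y} p(x, y)\,\log_2 p(x, y)
H(Y \mid X) = -\sum_{x}\sum_{y} p(x, y)\,\log_2 p(y \mid x)
H(X, Y) = H(X) + H(Y \mid X)
```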
Probabilities of Syllables
• P(C,·) and P(·,V) - marginal probabilities

• P(C,V)
Surprise in Syllables

• Example
Polynesian Syllables (cont’d)
Polynesian Syllables (cont’d)
Pointwise Mutual Information
• Measured between two points (not two distributions)

• If p(x,y) = p(x)p(y), then I(x;y) = log 1 = 0 (independence)
Mutual Information

• Same, but for two distributions (not points)


• How much information does one of the distributions
contain about the other one.
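For reference, the standard formulas: pointwise mutual information between two outcomes, and mutual information between two random variables:

```latex
I(x; y) = \log_2 \frac{p(x, y)}{p(x)\,p(y)}
\qquad
I(X; Y) = \sum_{x}\sum_{y} p(x, y)\,\log_2 \frac{p(x, y)}{p(x)\,p(y)}
```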
Kullback-Leibler (KL) Divergence
• Measures how far two distributions are from one another

• It measures the extra bits needed to encode data drawn from p when using a code optimized for q.


• Always non-negative
• D(p||q) = 0, iff p=q
• D(p||q) = ∞, iff ∃x∈X such that p(x)>0 and q(x)=0
• Not symmetric
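For reference, the standard definition, consistent with the properties listed above:

```latex
D(p \,\|\, q) = \sum_{x \in \mathcal{X}} p(x)\,\log_2 \frac{p(x)}{q(x)}
```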
Divergence as Mutual Information
Perplexity
• Does the model fit the data?
– A good model will give a high probability to a real sentence
• Perplexity
– Average branching factor in predicting the next word
– Lower is better (lower perplexity -> higher probability)
– N = number of words

Per = P(w1 w2 … wN)^(−1/N), i.e., the N-th root of 1 / P(w1 w2 … wN)
Perplexity
• Example:
– A sentence consisting of N equiprobable words: p(wi) = 1/k

Per = P(w1 w2 … wN)^(−1/N)
– Per = ((1/k)^N)^(−1/N) = k
• Perplexity is like a (weighted) branching factor
• Logarithmic version
– the exponent is = #bits to encode each word

Per = 2^(−(1/N) Σ log2 P(wi))
The Shannon Game
• Consider the Shannon game:
– Connecticut governor Ned Lamont said ...
• What is the perplexity of guessing a digit if all digits are equally
likely? Do the math.
– 10
• How about a letter?
– 26

• How about guessing A (“operator”) with a probability of 1/4, B (“sales”) with a probability of 1/4, and 10,000 other cases with a total probability of 1/2?
– example modified from Joshua Goodman
Perplexity Across Distributions
• What if the actual distribution is very different from the
expected one?
• Example:
– All of the 10,000 other cases are equally likely but P(A) = P(B) = 0.
• Cross-entropy = log (perplexity), measured in bits
Sample Values for Perplexity
• Wall Street Journal (WSJ) corpus
– 38 M words (tokens)
– 20 K types
• Perplexity
– Evaluated on a separate 1.5M sample of WSJ documents
– Unigram 962
– Bigram 170
– Trigram 109
– More recent results – 47.7 (Yang et al. 2017, using AWD-LSTM)
Word Error Rate
• Another evaluation metric
– Number of insertions, deletions, and substitutions
– Normalized by sentence length
– Same as Levenshtein Edit Distance
• Example:
– governor Ned Lamont met with the mayor
– the governor met the senator
– 3 deletions + 1 insertion + 1 substitution = 5 edits (WER = 5/7 ≈ 0.71, normalizing by the 7-word sentence)
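A sketch of WER via word-level Levenshtein distance, normalizing by the reference length; the two sentences are the ones above:

```python
def word_error_rate(reference, hypothesis):
    """Minimum insertions + deletions + substitutions, divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("governor Ned Lamont met with the mayor",
                      "the governor met the senator"))   # 5 edits / 7 words ≈ 0.71
```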
Issues

• Out of vocabulary words (OOV)


– Split the training set into two parts
– Label all words in part 2 that were not in part 1 as <UNK>
• Clustering
– e.g., dates, monetary amounts, organizations, years
Long Distance Dependencies
• This is where n-gram language models fail by definition
• Missing syntactic information
– The students who participated in the game are tired
– The student who participated in the game is tired
• Missing semantic information
– The pizza that I had last night was tasty
– The class that I had last night was interesting
Other Ideas in LM
• Skip-gram models
– Ms. Jane Doe, Ms. Mary Doe
• Syntactic models
– Condition words on other words that appear in a specific
syntactic relation with them
• Caching models
– Take advantage of the fact that words appear in bursts
Limitations
• Still no general solution for long-distance dependencies
• Cannot handle linear combinations, e.g.,
– Cats eat mice
– People eat broccoli
– Cats eat broccoli
– People eat mice
• Possible solution – use phrases or n-grams as features
(combinatorial explosion)
External Resources
• Google n-gram corpus
– http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
• Google book n-grams
– http://ngrams.googlelabs.com/
N-gram External Links
• http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
• http://norvig.com/mayzner.html
• http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
• https://books.google.com/ngrams/
• http://www.elsewhere.org/pomo/
• http://pdos.csail.mit.edu/scigen/
• http://www.magliery.com/Band/
• http://www.magliery.com/Country/
• http://johno.jsmf.net/knowhow/ngrams/index.php
• http://www.decontextualize.com/teaching/rwet/n-grams-and-markov-chains/
• http://gregstevens.com/2012/08/16/simulating-h-p-lovecraft
• http://kingjamesprogramming.tumblr.com/
NLP
Introduction to NLP

214.
The Noisy Channel Model
The Noisy Channel Model
• Example:
– Input: Written English (X)
– Encoder: garbles the input (X->Y)
– Output: Spoken English (Y)
• More examples:
– Grammatical English to English with mistakes
– English to bitmaps (characters)
• P(X,Y) = P(X)P(Y|X)
Encoding and Decoding

• Given f, guess e

e → encoder (E→F) → f → decoder (F→E) → e’

e’ = argmax_e P(e|f) = argmax_e P(f|e) P(e)

where P(f|e) is the translation model and P(e) is the language model
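A toy sketch of the decoder’s argmax over candidate translations; the candidates and their P(f|e), P(e) scores are invented for illustration:

```python
# Candidate English outputs for the French input "la maison blanche",
# with invented P(f|e) (translation model) and P(e) (language model) scores.
candidates = {
    "the house white": (0.20, 0.001),
    "the white house": (0.20, 0.010),
    "the red house":   (0.001, 0.008),
}

# e' = argmax_e P(f|e) * P(e)
best = max(candidates, key=lambda e: candidates[e][0] * candidates[e][1])
print(best)   # "the white house"
```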


Example

• Translate “la maison blanche”

Candidate e        P(f|e)   P(e)
cat rat piano
house white the
the house white
the red house
the small cat
the white house
Example

• Translate “la maison blanche”

Candidate e        P(f|e)   P(e)
cat rat piano      -        -
house white the    +        -
the house white
the red house
the small cat
the white house
Example

• Translate “la maison blanche”

Candidate e        P(f|e)   P(e)
cat rat piano      -        -
house white the    +        -
the house white    +        -
the red house      -        +
the small cat      -        +
the white house    +        +
Uses of the Noisy Channel Model

• Handwriting recognition
• Text generation
• Text summarization
• Machine translation
• Spelling correction
– See separate lecture on text similarity and edit distance
Spelling Correction

From Peter Norvig: http://norvig.com/ngrams/ch14.pdf


Features

• For each “e”:


– P(e)
– P(f|e)
– what else?
• What about some other task, e.g., POS tagging?
NLP
Introduction to NLP

215.
Part of Speech Tagging
The POS task
• Example
– Bahrainis vote in second round of parliamentary election
• Jabberwocky (by Lewis Carroll, 1872)
`Twas brillig, and the slithy toves
Did gyre and gimble in the wabe:
All mimsy were the borogoves,
And the mome raths outgrabe.
Penn Treebank tagset (1/2)
Tag Description Example
CC coordinating conjunction and
CD cardinal number 1
DT determiner the
EX existential there there is
FW foreign word d‘oeuvre
IN preposition/subordinating conjunction in, of, like
JJ adjective green
JJR adjective, comparative greener
JJS adjective, superlative greenest
LS list marker 1)
MD modal could, will
NN noun, singular or mass table
NNS noun plural tables
NNP proper noun, singular John
NNPS proper noun, plural Vikings
PDT predeterminer both the boys
POS possessive ending friend's
Penn Treebank tagset (2/2)
Tag Description Example
PRP personal pronoun I, he, it
PRP$ possessive pronoun my, his
RB adverb however, usually, naturally, here, good
RBR adverb, comparative better
RBS adverb, superlative best
RP particle give up
TO to to go, to him
UH interjection uhhuhhuhh
VB verb, base form take
VBD verb, past tense took
VBG verb, gerund/present participle taking
VBN verb, past participle taken
VBP verb, non-3rd person singular present take
VBZ verb, 3rd person sing. present takes
WDT wh-determiner which
WP wh-pronoun who, what
WP$ possessive wh-pronoun whose
WRB wh-adverb where, when
Universal POS

http://universaldependencies.org/u/pos/
Universal Features

http://universaldependencies.org/u/feat/
Some Observations
• Ambiguity
– count (noun) vs. count (verb)
– 11% of all types but 40% of all tokens in the Brown
corpus are ambiguous.
– Examples
• like can be tagged as ADP VERB ADJ ADV NOUN
• present can be tagged as ADJ NOUN VERB ADV
POS Ambiguity

Example from J&M


Some Observations
• More examples:
– transport, object, discount, address
– content
• French pronunciation:
– est, président, fils
• Three main techniques:
– rule-based
– machine learning (e.g., conditional random fields, maximum entropy Markov models,
neural networks)
– transformation-based
• Useful for parsing, translation, text to speech, word sense
disambiguation, etc.
Example

• Bethlehem/NNP Steel/NNP Corp./NNP ,/, hammered/VBN by/IN higher/JJR costs/NNS


• Bethlehem/NNP Steel/NNP Corp./NNP ,/, hammered/VBN by/IN higher/JJR costs/VBZ
Rule-based POS tagging

• Use dictionary or finite-state transducers to find all possible parts of speech
• Use disambiguation rules
– e.g., ART+V
• Hundreds of constraints need to be designed manually
Example in French
<S> ^ beginning of sentence
La rf b nms u article
teneur nfs nms noun feminine singular
moyenne jfs nfs v1s v2s v3s adjective feminine singular
en p a b preposition
uranium nms noun masculine singular
des p r preposition
rivières nfp noun feminine plural
, x punctuation
bien_que cs subordinating conjunction
délicate jfs adjective feminine singular
à p preposition
calculer v verb
Sample Rules
• BS3 BI1
– A BS3 (3rd person subject personal pronoun) cannot be followed by a BI1 (1st person indirect personal
pronoun).
– In the example: “il nous faut” (= “we need”) – “il” has the tag BS3MS and “nous” has the tags [BD1P
BI1P BJ1P BR1P BS1P].
– The negative constraint “BS3 BI1” rules out “BI1P'', and thus leaves only 4 alternatives for the word
“nous”.
• NK
– The tag N (noun) cannot be followed by a tag K (interrogative pronoun); an example in the test corpus
would be: “... fleuve qui ...” (...river that...).
– Since “qui” can be tagged both as an “E” (relative pronoun) and a “K” (interrogative pronoun), the “E”
will be chosen by the tagger since an interrogative pronoun cannot follow a noun (“N”).
• RV
– A word tagged with R (article) cannot be followed by a word tagged with V (verb): for example “l'
appelle” (calls him/her).
– The word “appelle” can only be a verb, but “l'” can be either an article or a personal pronoun.
– Thus, the rule will eliminate the article tag, giving preference to the pronoun.
Classifier-based POS Tagging

• A baseline method would be to use a classifier to map each individual word into a likely POS tag
– Why is this method unlikely to work well?
Sources of Information

• Bethlehem/NNP Steel/NNP Corp./NNP ,/, hammered/VBN by/IN higher/JJR costs/NNS


• Bethlehem/NNP Steel/NNP Corp./NNP ,/, hammered/VBN by/IN higher/JJR costs/VBZ

• Knowledge about individual words


– lexical information
– spelling (-or)
– capitalization (IBM)
• Knowledge about neighboring words
Evaluation
• Baseline
– tag each word with its most likely tag
– tag each OOV word as a noun.
– around 90%
• Current accuracy
– around 97% for English
– compared to 98% human performance