Part-Of-Speech Tagging: A Simple But Useful Form of Linguistic Analysis
Christopher Manning
Parts of Speech
• Perhaps starting with Aristotle in the West (384–322 BCE), there
was the idea of having parts of speech
• a.k.a. lexical categories, word classes, “tags”, POS
• The idea, still with us, that there are 8 parts of speech comes from
Dionysius Thrax of Alexandria (c. 100 BCE)
• But actually his 8 aren’t exactly the ones we are taught today
• Thrax: noun, verb, article, adverb, preposition, conjunction, participle,
pronoun
• School grammar: noun, verb, adjective, adverb, preposition,
conjunction, pronoun, interjection
Open class (lexical) words
• Nouns
• Verbs
• Adjectives: old, older, oldest
POS Tagging
• Words often have more than one POS: back
• The back door = JJ
• On my back = NN
• Win the voters back = RB
• Promised to back the bill = VB
• The POS tagging problem is to determine the POS tag for a
particular instance of a word.
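The four uses of back above can be recorded as tagged instances. The following is a minimal sketch, not a tagger: it only collects the set of tags observed for each word type, which makes the ambiguity explicit. The function name `tags_by_word` and the data layout are assumptions made for illustration.

```python
from collections import defaultdict

# (phrase, target word, tag of that instance), taken from the examples above
instances = [
    ("The back door", "back", "JJ"),
    ("On my back", "back", "NN"),
    ("Win the voters back", "back", "RB"),
    ("Promised to back the bill", "back", "VB"),
]

def tags_by_word(instances):
    """Collect the set of tags observed for each word type (hypothetical helper)."""
    seen = defaultdict(set)
    for _phrase, word, tag in instances:
        seen[word].add(tag)
    return dict(seen)

ambiguity = tags_by_word(instances)
print(sorted(ambiguity["back"]))  # ['JJ', 'NN', 'RB', 'VB']
```

A tagger has to choose one of these tags for each instance, using the surrounding words.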
POS Tagging
• Input: Plays well with others
• Ambiguity: NNS/VBZ UH/JJ/NN/RB IN NNS (Penn Treebank POS tags)
• Output: Plays/VBZ well/RB with/IN others/NNS
• Uses:
• Text-to-speech (how do we pronounce “lead”?)
• Can write regexps like (Det) Adj* N+ over the output for phrases, etc.
• As input to or to speed up a full parser
• If you know the tag, you can back off to it in other tasks
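The regexp use can be sketched as follows. This is one possible encoding, assumed for illustration: each Penn Treebank tag is mapped to a single character so that an ordinary regular expression over the encoded string matches (Det) Adj* N+ spans, and match offsets line up with token positions. The function name `noun_phrases` and the tags for "opened" and "slowly" are assumptions, not part of the slides.

```python
import re

def noun_phrases(tagged):
    """Find (Det) Adj* N+ spans in a list of (word, tag) pairs."""
    def code(tag):
        # One character per token, so regex offsets equal token indices.
        if tag == "DT":
            return "D"
        if tag.startswith("JJ"):
            return "A"
        if tag.startswith("NN"):
            return "N"
        return "x"

    encoded = "".join(code(tag) for _, tag in tagged)
    return [
        " ".join(word for word, _ in tagged[m.start():m.end()])
        for m in re.finditer(r"D?A*N+", encoded)
    ]

tagged = [("The", "DT"), ("back", "JJ"), ("door", "NN"),
          ("opened", "VBD"), ("slowly", "RB")]
print(noun_phrases(tagged))  # ['The back door']
```

On the tagged output Plays/VBZ well/RB with/IN others/NNS the same function would return only "others", since that is the only N+ span.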
Part-of-speech tagging revisited
Sources of information
• What are the main sources of information for POS tagging?
• Knowledge of neighboring words
• Bill saw that man yesterday
• Possible tags: Bill NNP/VB, saw NN/VB(D), that DT/IN, man NN/VB, yesterday NN
• Knowledge of word probabilities
• man is rarely used as a verb…
• The latter proves the most useful, but the former also helps
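Knowledge of word probabilities alone already gives a simple baseline: tag each word with the tag it most often received in training data. The sketch below assumes a tiny invented corpus (the three training sentences are made up for illustration, not taken from any treebank) and a default tag of NN for unseen words; the function names are hypothetical. It does no case folding.

```python
from collections import Counter, defaultdict

def train_most_frequent_tag(tagged_sentences):
    """Map each word to the tag it was seen with most often."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def tag(words, model, default="NN"):
    """Tag each word independently; unseen words get the default tag."""
    return [(w, model.get(w, default)) for w in words]

# Invented toy training data (an assumption for illustration only).
corpus = [
    [("the", "DT"), ("man", "NN"), ("saw", "VBD"), ("Bill", "NNP")],
    [("that", "DT"), ("man", "NN"), ("left", "VBD")],
    [("they", "PRP"), ("man", "VB"), ("the", "DT"), ("boat", "NN")],
]
model = train_most_frequent_tag(corpus)
print(tag("Bill saw that man yesterday".split(), model))
# [('Bill', 'NNP'), ('saw', 'VBD'), ('that', 'DT'), ('man', 'NN'), ('yesterday', 'NN')]
```

Because "man" is a noun in two of its three training occurrences, the baseline picks NN without looking at neighboring words. It cannot use context, so it would also tag "that" as DT in every sentence.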
• Accuracy on all words / on unknown words:
• Trigram HMM: ~95% / ~55%
• Maxent P(t|w): 93.7% / 82.6%
• TnT (HMM++): 96.2% / 86.0%
• MEMM tagger: 96.9% / 86.9%
• Bidirectional dependencies: 97.2% / 90.0%
• Upper bound: ~98% (human agreement)
• Most errors are on unknown words
• We could fix this with a feature that looked at the next word
• Tagger output: Intrinsic/NNP flaws/NNS remained/VBD undetected/VBN ./.
• Correct tag for Intrinsic: JJ
• Word features: w-1, w0, w1
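A feature that looks at the next word can be sketched as a small extraction function over the positions w-1, w0, w1. This is an illustrative sketch only: the function name `word_features` and the boundary symbols `<s>` and `</s>` are assumptions, and a real feature-based tagger would combine many more features.

```python
def word_features(words, i):
    """Return previous, current, and next word features for position i."""
    return {
        "w-1": words[i - 1] if i > 0 else "<s>",          # assumed start symbol
        "w0": words[i],
        "w1": words[i + 1] if i + 1 < len(words) else "</s>",  # assumed end symbol
    }

words = ["Intrinsic", "flaws", "remained", "undetected", "."]
print(word_features(words, 0))
# {'w-1': '<s>', 'w0': 'Intrinsic', 'w1': 'flaws'}
```

A classifier such as a maxent tagger could then condition the tag at position i on these features rather than on the current word alone.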