POS Tagging: Introduction: Heng Ji
Heng Ji
Sept 13, 2010
Assignment 1
Questions?
Outline
Parts of speech (POS)
Tagsets
POS Tagging
Rule-based tagging
Markup Format
Open source Toolkits
What is a Part of Speech (POS)?
Parts of Speech
8 (ish) traditional parts of speech
Noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc.
Called: parts-of-speech, lexical categories, word classes, morphological classes, lexical tags...
Lots of debate within linguistics about the number, nature, and universality of these
We’ll completely ignore this debate.
7 Traditional POS Categories
POS Tagging
The process of assigning a part-of-speech or lexical class marker to each word in a collection.
WORD tag
the DET
koala N
put V
the DET
keys N
on P
the DET
table N
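The WORD/tag table above can be sketched as a tiny lookup tagger. This is only an illustration: the lexicon holds exactly the words from the example, each with a single tag, and the `UNK` fallback for unknown words is an assumption added here. Real taggers must also resolve words that allow several tags.

```python
# Toy lexicon built from the example table; one tag per word (no ambiguity).
LEXICON = {
    "the": "DET",
    "koala": "N",
    "put": "V",
    "keys": "N",
    "on": "P",
    "table": "N",
}


def tag(words):
    """Return one (word, tag) pair per word; unknown words get 'UNK' (assumed fallback)."""
    return [(w, LEXICON.get(w.lower(), "UNK")) for w in words]


tagged = tag("the koala put the keys on the table".split())
for word, pos in tagged:
    print(word, pos)
```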
Penn TreeBank POS Tag Set
Penn Treebank Tagset
Why is POS tagging useful?
Speech synthesis:
How to pronounce “lead”?
INsult inSULT
OBject obJECT
OVERflow overFLOW
DIScount disCOUNT
CONtent conTENT
Stemming for information retrieval
A search for “aardvarks” can also retrieve “aardvark”
Parsing, speech recognition, etc.
Possessive pronouns (my, your, her) are likely to be followed by nouns
Personal pronouns (I, you, he) are likely to be followed by verbs
Need to know if a word is an N or V before you can parse
Information extraction
Finding names, relations, etc.
Machine Translation
Equivalent Problem in Bioinformatics
Durbin et al., Biological Sequence Analysis, Cambridge University Press.
Several applications, e.g.
proteins
From primary structure
ATCPLELLLD
Infer secondary structure
HHHBBBBBC..
Why is POS Tagging Useful?
First step of a vast number of practical tasks
Speech synthesis
How to pronounce “lead”?
INsult inSULT
OBject obJECT
OVERflow overFLOW
DIScount disCOUNT
CONtent conTENT
Parsing
Need to know if a word is an N or V before you can parse
Information extraction
Finding names, relations, etc.
Machine Translation
Open and Closed Classes
Closed class: a small fixed membership
Prepositions: of, in, by, …
Auxiliaries: may, can, will, had, been, …
Pronouns: I, you, she, mine, his, them, …
Usually function words (short common words which
play a role in grammar)
Open class: new ones can be created all the time
English has 4: Nouns, Verbs, Adjectives, Adverbs
Many languages have these 4, but not all!
Open Class Words
Nouns
Proper nouns (Boulder, Granby, Eli Manning)
English capitalizes these.
Common nouns (the rest).
Count nouns and mass nouns
Count: have plurals, get counted: goat/goats, one goat, two goats
Mass: don’t get counted (snow, salt, communism) (*two snows)
Adverbs: tend to modify things
Unfortunately, John walked home extremely slowly yesterday
Directional/locative adverbs (here, home, downhill)
Degree adverbs (extremely, very, somewhat)
Manner adverbs (slowly, slinkily, delicately)
Verbs
In English, have morphological affixes (eat/eats/eaten)
Closed Class Words
Examples:
prepositions: on, under, over, …
particles: up, down, on, off, …
determiners: a, an, the, …
pronouns: she, who, I, …
conjunctions: and, but, or, …
auxiliary verbs: can, may, should, …
numerals: one, two, three, third, …
Prepositions from CELEX
English Particles
Conjunctions
POS Tagging
Choosing a Tagset
Using the Penn Tagset
The/DT grand/JJ jury/NN commented/VBD
on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
Prepositions and subordinating conjunctions
marked IN (“although/IN I/PRP..”)
Except the preposition/complementizer “to” is
just marked “TO”.
POS Tagging
Words often have more than one POS: back
The back door = JJ
On my back = NN
Win the voters back = RB
Promised to back the bill = VB
The POS tagging problem is to determine the
POS tag for a particular instance of a word.
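The ambiguity of "back" can be made concrete by collecting the set of tags each word receives across a few tagged phrases. In this sketch only the four tags for "back" come from the examples above; the Penn tags on the other words are assumptions added for illustration.

```python
from collections import defaultdict

# The tags on "back" follow the slide; tags on all other words are assumed.
TAGGED_PHRASES = [
    "The/DT back/JJ door/NN",
    "On/IN my/PRP$ back/NN",
    "Win/VB the/DT voters/NNS back/RB",
    "Promised/VBD to/TO back/VB the/DT bill/NN",
]


def tags_by_word(phrases):
    """Map each lowercased word to the set of tags it carries in the phrases."""
    seen = defaultdict(set)
    for phrase in phrases:
        for token in phrase.split():
            # Split on the last slash so the tag is always the final field.
            word, _, pos = token.rpartition("/")
            seen[word.lower()].add(pos)
    return seen


tags = tags_by_word(TAGGED_PHRASES)
print(sorted(tags["back"]))  # ['JJ', 'NN', 'RB', 'VB']
```

A tagger has to pick one of these four tags for each individual occurrence of "back", using the surrounding context.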
Current Performance
Quick Test: Agreement?
How to do it? History
Greene and Rubin: rule based, 70%
HMM tagging (CLAWS): 93%-95%
DeRose/Church: efficient HMM, sparse data, 95%+
Transformation-based tagging (Eric Brill): rule based, 95%+
Trigram tagger (Kempe): 96%+
Tree-based statistics (Helmut Schmid): rule based, 96%+
Neural network: 96%+
Combined methods: 98%+
Two Methods for POS Tagging
1. Rule-based tagging
(ENGTWOL)
2. Stochastic
1. Probabilistic sequence models
HMM (Hidden Markov Model) tagging
MEMMs (Maximum Entropy Markov Models)
Rule-Based Tagging
Start with a dictionary
Assign all possible tags to words from the
dictionary
Write rules by hand to selectively remove
tags
Leaving the correct tag for each word.
Rule-based taggers
Early POS taggers all hand-coded
Most of these (Harris, 1962; Greene and Rubin, 1971) and the best of the recent ones, ENGTWOL (Voutilainen, 1995), are based on a two-stage architecture
Stage 1: look up word in lexicon to give list of potential
POSs
Stage 2: Apply rules which certify or disallow tag
sequences
Rules originally handwritten; more recently Machine
Learning methods can be used
Start With a Dictionary
• she: PRP
• promised: VBN,VBD
• to: TO
• back: VB, JJ, RB, NN
• the: DT
• bill: NN, VB
Assign Every Possible Tag
                   NN
                   RB
     VBN           JJ         VB
PRP  VBD       TO  VB    DT   NN
She  promised  to  back  the  bill
Write Rules to Eliminate Tags
Eliminate VBN if VBD is an option when VBN|VBD follows “<start> PRP”
                   NN
                   RB
     VBN           JJ         VB
PRP  VBD       TO  VB    DT   NN
She  promised  to  back  the  bill
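The three steps (dictionary lookup, assigning every possible tag, eliminating tags by rule) can be sketched in a few lines. This is a minimal illustration and not ENGTWOL: the dictionary is the six-word one from the slides, the function names are hypothetical, and only the single VBN/VBD rule is implemented, so most words remain ambiguous afterwards.

```python
# Dictionary from the slides: every possible tag for each word.
LEXICON = {
    "she": ["PRP"],
    "promised": ["VBN", "VBD"],
    "to": ["TO"],
    "back": ["VB", "JJ", "RB", "NN"],
    "the": ["DT"],
    "bill": ["NN", "VB"],
}


def assign_all_tags(words):
    """Stage 1: look up each word and copy its full list of candidate tags."""
    return [list(LEXICON[w.lower()]) for w in words]


def drop_vbn_after_initial_prp(candidates):
    """Stage 2 rule: eliminate VBN if VBD is an option when the
    VBN|VBD word follows a sentence-initial PRP."""
    result = [list(tags) for tags in candidates]
    if len(result) >= 2 and result[0] == ["PRP"]:
        if "VBN" in result[1] and "VBD" in result[1]:
            result[1].remove("VBN")
    return result


words = "She promised to back the bill".split()
lattice = assign_all_tags(words)
pruned = drop_vbn_after_initial_prp(lattice)
for word, tags in zip(words, pruned):
    print(word, tags)
```

After this one rule, "promised" keeps only VBD, while "back" and "bill" still carry several candidates. A full rule-based tagger would need further hand-written rules to remove those. Words missing from the dictionary raise a `KeyError` in this sketch.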
Inline Mark-up
POS Tagging
http://nlp.cs.qc.cuny.edu/wsj_pos.zip
Input Format
Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
Output Format
Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/,
will/MD join/VB the/DT board/NN as/IN a/DT
nonexecutive/JJ director/NN Nov./NNP 29/CD ./.
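A reader for the inline `word/TAG` format could look like the sketch below. The function name `parse_tagged` is hypothetical, and the sketch assumes tokens are separated by whitespace and that the tag is everything after the last slash in a token.

```python
def parse_tagged(line):
    """Split a line of word/TAG tokens into a list of (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # rpartition splits on the LAST slash, so a word that itself
        # contains a slash keeps it and only the final field is the tag.
        word, sep, pos = token.rpartition("/")
        if not sep:
            raise ValueError(f"token has no tag: {token!r}")
        pairs.append((word, pos))
    return pairs


line = (
    "Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/, will/MD join/VB "
    "the/DT board/NN as/IN a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ./."
)
pairs = parse_tagged(line)
print(pairs[:3])
```

Punctuation is tagged as itself in this format, so `,/,` parses to the pair `(",", ",")`.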
POS Tagging Tools
Stanford tagger (Loglinear tagger)
http://nlp.stanford.edu/software/tagger.shtml
Brill tagger
http://www.tech.plym.ac.uk/soc/staff/guidbugm/software/RULE_BASED_TAGGER_V.1.14.tar.Z
tagger LEXICON test BIGRAMS LEXICALRULEFILE CONTEXTUALRULEFILE
YamCha (SVM)
http://chasen.org/~taku/software/yamcha/
MXPOST (Maximum Entropy)
ftp://ftp.cis.upenn.edu/pub/adwait/jmx/
More complete list at:
http://www-nlp.stanford.edu/links/statnlp.html#Taggers
NLP Toolkits
Uniform CL Annotation Platform
UIMA (IBM NLP platform): http://incubator.apache.org/uima/svn.html
Mallet (UMASS): http://mallet.cs.umass.edu/index.php/Main_Page
MinorThird (CMU): http://minorthird.sourceforge.net/
NLTK: http://nltk.sourceforge.net/
Natural language toolkit, with data sets Demo
Information Extraction
Jet (NYU IE toolkit) http://www.cs.nyu.edu/cs/faculty/grishman/jet/license.html
Gate: http://gate.ac.uk/download/index.html
University of Sheffield IE toolkit
Information Retrieval
INDRI: http://www.lemurproject.org/indri/
Information Retrieval toolkit
Machine Translation
Compara: http://adamastor.linguateca.pt/COMPARA/Welcome.html
ISI decoder: http://www.isi.edu/licensed-sw/rewrite-decoder/
MOSES: http://www.statmt.org/moses/
Looking Ahead: Next Class