POS Tagging: Introduction: Heng Ji

This document provides an introduction to part-of-speech (POS) tagging. It discusses the traditional parts of speech categories and examples of tagsets used for POS tagging, including the popular Penn Treebank tagset. It also describes how POS tagging works, including rule-based and statistical approaches. POS tagging is useful for applications like speech synthesis, parsing, information extraction and machine translation. The document reviews the history of POS tagging and current high performance of around 97-98%.

Copyright
© Attribution Non-Commercial (BY-NC)

POS Tagging: Introduction

Heng Ji
[email protected]
Sept 13, 2010

1
Assignment 1

• Questions?

2/39
Outline
• Parts of speech (POS)
• Tagsets
• POS Tagging
• Rule-based tagging
• Markup Format
• Open source Toolkits

3/39
What is Part-of-Speech (POS)?

• Generally speaking, word classes (= POS):
  • Verb, Noun, Adjective, Adverb, Article, …
• We can also include inflection:
  • Verbs: tense, number, …
  • Nouns: number, proper/common, …
  • Adjectives: comparative, superlative, …
  • …

4/39
Parts of Speech
• 8 (ish) traditional parts of speech
  • Noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc.
• Called: parts-of-speech, lexical categories, word classes, morphological classes, lexical tags...
• Lots of debate within linguistics about the number, nature, and universality of these
  • We’ll completely ignore this debate.

5/39
7 Traditional POS Categories

N noun chair, bandwidth, pacing


V verb study, debate, munch
 ADJ adj purple, tall, ridiculous
 ADV adverb unfortunately, slowly,
P preposition of, by, to
 PRO pronoun I, me, mine
 DET determiner the, a, that, those

6/39
POS Tagging
• The process of assigning a part-of-speech or lexical class marker to each word in a collection.

  WORD    TAG
  the     DET
  koala   N
  put     V
  the     DET
  keys    N
  on      P
  the     DET
  table   N

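Tagging by pure dictionary lookup, as in the table above, can be sketched in a few lines of Python (the LEXICON dict and lookup_tag function are illustrative inventions, not from the slides; they ignore the ambiguity that makes real tagging hard):

```python
# Minimal illustration: one tag per word, assigned by dictionary lookup.
# The lexicon is hypothetical and only covers this example sentence.
LEXICON = {"the": "DET", "koala": "N", "put": "V",
           "keys": "N", "on": "P", "table": "N"}

def lookup_tag(sentence):
    """Return (word, tag) pairs; words not in the lexicon default to N."""
    return [(w, LEXICON.get(w, "N")) for w in sentence.split()]

print(lookup_tag("the koala put the keys on the table"))
```

Real words, of course, often admit several tags, which is exactly what the rest of the lecture addresses.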
7/39
Penn TreeBank POS Tag Set

 Penn Treebank: hand-annotated corpus of


Wall Street Journal, 1M words
 46 tags
 Some particularities:
 to /TO not disambiguated
 Auxiliaries and verbs not distinguished

8/39
Penn Treebank Tagset

9/39
Why is POS Tagging Useful?
• Speech synthesis:
  • How to pronounce “lead”?
  • INsult vs. inSULT
  • OBject vs. obJECT
  • OVERflow vs. overFLOW
  • DIScount vs. disCOUNT
  • CONtent vs. conTENT
• Stemming for information retrieval
  • Can a search for “aardvarks” get “aardvark”?
• Parsing, speech recognition, etc.
  • Possessive pronouns (my, your, her) are followed by nouns
  • Personal pronouns (I, you, he) are likely to be followed by verbs
  • Need to know if a word is an N or V before you can parse
• Information extraction
  • Finding names, relations, etc.
• Machine translation
10/39
Equivalent Problem in Bioinformatics
• Durbin et al., Biological Sequence Analysis, Cambridge University Press
• Several applications, e.g. proteins:
  • From primary structure: ATCPLELLLD
  • Infer secondary structure: HHHBBBBBC...

11/39
Why is POS Tagging Useful?
• First step of a vast number of practical tasks
• Speech synthesis
  • How to pronounce “lead”?
  • INsult vs. inSULT
  • OBject vs. obJECT
  • OVERflow vs. overFLOW
  • DIScount vs. disCOUNT
  • CONtent vs. conTENT
• Parsing
  • Need to know if a word is an N or V before you can parse
• Information extraction
  • Finding names, relations, etc.
• Machine translation
12/39
Open and Closed Classes
• Closed class: a small, fixed membership
  • Prepositions: of, in, by, …
  • Auxiliaries: may, can, will, had, been, …
  • Pronouns: I, you, she, mine, his, them, …
  • Usually function words (short common words which play a role in grammar)
• Open class: new ones can be created all the time
  • English has 4: nouns, verbs, adjectives, adverbs
  • Many languages have these 4, but not all!

13/39
Open Class Words
• Nouns
  • Proper nouns (Boulder, Granby, Eli Manning)
    • English capitalizes these.
  • Common nouns (the rest).
  • Count nouns and mass nouns
    • Count: have plurals, get counted: goat/goats, one goat, two goats
    • Mass: don’t get counted (snow, salt, communism) (*two snows)
• Adverbs: tend to modify things
  • Unfortunately, John walked home extremely slowly yesterday
  • Directional/locative adverbs (here, home, downhill)
  • Degree adverbs (extremely, very, somewhat)
  • Manner adverbs (slowly, slinkily, delicately)
• Verbs
  • In English, have morphological affixes (eat/eats/eaten)

14/39
Closed Class Words
Examples:
• prepositions: on, under, over, …
• particles: up, down, on, off, …
• determiners: a, an, the, …
• pronouns: she, who, I, …
• conjunctions: and, but, or, …
• auxiliary verbs: can, may, should, …
• numerals: one, two, three, third, …

15/39
Prepositions from CELEX

16/39
English Particles

17/39
Conjunctions

18/39
POS Tagging
Choosing a Tagset

• There are so many parts of speech and potential distinctions we can draw
• To do POS tagging, we need to choose a standard set of tags to work with
• Could pick a very coarse tagset
  • N, V, Adj, Adv
• More commonly used set is finer grained, the “Penn Treebank tagset”, 45 tags
  • PRP$, WRB, WP$, VBG
• Even more fine-grained tagsets exist

19/39
Using the Penn Tagset
• The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
• Prepositions and subordinating conjunctions are marked IN (“although/IN I/PRP …”)
• Except the preposition/complementizer “to” is just marked “TO”.

20/39
POS Tagging
• Words often have more than one POS: back
  • The back door = JJ
  • On my back = NN
  • Win the voters back = RB
  • Promised to back the bill = VB
• The POS tagging problem is to determine the POS tag for a particular instance of a word.

(These examples are from Dekang Lin.)
21/39
How Hard is POS Tagging?
Measuring Ambiguity

22/39
Current Performance

• How many tags are correct?
  • About 97% currently
  • But the baseline is already 90%
• Baseline algorithm:
  • Tag every word with its most frequent tag
  • Tag unknown words as nouns
• How well do people do?

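The baseline algorithm above is easy to sketch (the tiny training list here is invented for illustration; in practice the counts would come from a tagged corpus such as the Penn Treebank):

```python
from collections import Counter

# Invented toy training data: (word, tag) pairs as they might come
# from a tagged corpus.
training = [("back", "RB"), ("back", "VB"), ("back", "VB"),
            ("the", "DT"), ("the", "DT"), ("bill", "NN")]

# Count tags per word, then keep each word's single most frequent tag.
counts = {}
for word, tag in training:
    counts.setdefault(word, Counter())[tag] += 1
most_frequent = {w: c.most_common(1)[0][0] for w, c in counts.items()}

def baseline_tag(words):
    """Most-frequent-tag baseline; unknown words are tagged NN (noun)."""
    return [(w, most_frequent.get(w, "NN")) for w in words]

print(baseline_tag(["the", "back", "door"]))  # "door" is unseen, so NN
```

Despite its simplicity, this kind of baseline already reaches about 90% on English newswire, which is why the 97% of modern taggers is less impressive than it first sounds.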
23/39
Quick Test: Agreement?

• the students went to class
• plays well with others
• fruit flies like a banana

Tag set for reference:
DT: the, this, that
NN: noun
VB: verb
P: preposition
ADV: adverb

24/39
How to do it? History
(Timeline figure, 1960–2000, reconstructed as lists.)

Tagging methods and reported accuracies:
• Rule-based (Greene and Rubin): 70%
• Rule-based: 95%+
• HMM tagging (CLAWS): 93%-95%
• Trigram tagger (DeRose/Church)
• Tree-based statistics (Helmut Schmid): 95%+
• Transformation-based tagging (Eric Brill): 96%+
• Neural network: 96%+
• Efficient HMM (Kempe), sparse data: 96%+
• Combined methods: 98%+

Corpora milestones:
• Brown Corpus created (EN-US), 1 million words; later tagged
• LOB Corpus created (EN-UK), 1 million words; tagged by CLAWS
• British National Corpus, tagged by CLAWS
• Penn Treebank (WSJ, 4.5M words)
• POS tagging became separated from other NLP tasks

25/39
Two Methods for POS Tagging
1. Rule-based tagging
   • e.g. ENGTWOL
2. Stochastic tagging: probabilistic sequence models
   • HMM (Hidden Markov Model) tagging
   • MEMMs (Maximum Entropy Markov Models)

26/39
Rule-Based Tagging
• Start with a dictionary
• Assign all possible tags to words from the dictionary
• Write rules by hand to selectively remove tags
  • Leaving the correct tag for each word.

27/39
Rule-based taggers
• Early POS taggers were all hand-coded
• Most of these (Harris, 1962; Greene and Rubin, 1971) and the best of the recent ones, ENGTWOL (Voutilainen, 1995), are based on a two-stage architecture
  • Stage 1: look up the word in a lexicon to get a list of potential POSs
  • Stage 2: apply rules which certify or disallow tag sequences
• Rules were originally handwritten; more recently, machine learning methods can be used

28/39
Start With a Dictionary
• she: PRP
• promised: VBN, VBD
• to: TO
• back: VB, JJ, RB, NN
• the: DT
• bill: NN, VB

• Etc., for the ~100,000 words of English with more than one tag

29/39
Assign Every Possible Tag

                   NN
                   RB
     VBN           JJ         VB
PRP  VBD       TO  VB    DT   NN
She  promised  to  back  the  bill

30/39
Write Rules to Eliminate Tags
Eliminate VBN if VBD is an option when VBN|VBD follows “<start> PRP”:

                   NN
                   RB
     VBN           JJ         VB
PRP  VBD       TO  VB    DT   NN
She  promised  to  back  the  bill

(The rule eliminates the VBN reading of “promised”.)

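The dictionary-lookup-plus-elimination process of the last three slides can be sketched as follows (the DICT entries and the single rule come from the slides; the tag_sentence helper and its control flow are simplified assumptions, and a real system like ENGTWOL has thousands of rules):

```python
# Stage 1: assign every possible tag from a dictionary
# (entries from the "Start With a Dictionary" slide).
DICT = {
    "she": {"PRP"}, "promised": {"VBN", "VBD"}, "to": {"TO"},
    "back": {"VB", "JJ", "RB", "NN"}, "the": {"DT"}, "bill": {"NN", "VB"},
}

def tag_sentence(words):
    """Look up candidate tag sets, then apply one elimination rule."""
    candidates = [set(DICT[w.lower()]) for w in words]
    # Stage 2: eliminate VBN if VBD is an option when VBN|VBD
    # follows "<start> PRP" (the rule from the slide above).
    if len(words) > 1 and "PRP" in candidates[0] and {"VBN", "VBD"} <= candidates[1]:
        candidates[1].discard("VBN")
    return candidates

sent = ["She", "promised", "to", "back", "the", "bill"]
for word, tags in zip(sent, tag_sentence(sent)):
    print(word, sorted(tags))  # "promised" keeps only VBD
```

Note that "back" still has four candidate tags after this single rule; in a full rule-based tagger, further rules would keep pruning until one tag remains per word.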
31/39
Inline Mark-up
• POS Tagging
  http://nlp.cs.qc.cuny.edu/wsj_pos.zip
• Input format:
  Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
• Output format:
  Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/, will/MD join/VB the/DT board/NN as/IN a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ./.

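The word/TAG output format above is straightforward to parse; a minimal sketch (parse_tagged is a hypothetical helper, splitting each token on its last slash so forms like Nov./NNP and ./. survive):

```python
def parse_tagged(line):
    """Split a line of word/TAG tokens into (word, tag) pairs.

    Splitting on the LAST '/' keeps words containing '.' or '/'
    intact, e.g. 'Nov./NNP' and './.'.
    """
    return [tuple(tok.rsplit("/", 1)) for tok in line.split()]

pairs = parse_tagged("Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS Nov./NNP ./.")
print(pairs)
```

Writing the inverse ("/".join of each pair, space-joined) reproduces the inline mark-up, which is why this simple convention is so common in tagger input/output files.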
32/39
POS Tagging Tools
• Stanford tagger (log-linear tagger)
  http://nlp.stanford.edu/software/tagger.shtml
• Brill tagger
  http://www.tech.plym.ac.uk/soc/staff/guidbugm/software/RULE_BASED_TAGGER_V.1.14.tar.Z
  Usage: tagger LEXICON test BIGRAMS LEXICALRULEFILE CONTEXTUALRULEFILE
• YamCha (SVM)
  http://chasen.org/~taku/software/yamcha/
• MXPOST (maximum entropy)
  ftp://ftp.cis.upenn.edu/pub/adwait/jmx/
• More complete list at:
  http://www-nlp.stanford.edu/links/statnlp.html#Taggers

33/39
NLP Toolkits
• Uniform CL annotation platforms
  • UIMA (IBM NLP platform): http://incubator.apache.org/uima/svn.html
  • Mallet (UMass): http://mallet.cs.umass.edu/index.php/Main_Page
  • MinorThird (CMU): http://minorthird.sourceforge.net/
  • NLTK: http://nltk.sourceforge.net/
    Natural language toolkit, with data sets
• Information extraction
  • Jet (NYU IE toolkit): http://www.cs.nyu.edu/cs/faculty/grishman/jet/license.html
  • GATE: http://gate.ac.uk/download/index.html
    University of Sheffield IE toolkit
• Information retrieval
  • INDRI: http://www.lemurproject.org/indri/
• Machine translation
  • Compara: http://adamastor.linguateca.pt/COMPARA/Welcome.html
  • ISI decoder: http://www.isi.edu/licensed-sw/rewrite-decoder/
  • MOSES: http://www.statmt.org/moses/

34/39
Looking Ahead: Next Class

• Machine Learning for POS Tagging: Hidden Markov Model

35/39
