POS Tagging: Introduction: Heng Ji

This document provides an introduction to part-of-speech (POS) tagging. It discusses the traditional parts of speech categories and examples of tagsets used for POS tagging, including the popular Penn Treebank tagset. It also describes how POS tagging works, including rule-based and statistical approaches. POS tagging is useful for applications like speech synthesis, parsing, information extraction and machine translation. The document reviews the history of POS tagging and current high performance of around 97-98%.

POS Tagging: Introduction

Heng Ji
[email protected]
Sept 13, 2010

1
Assignment 1

 Questions?

2/39
Outline
 Parts of speech (POS)
 Tagsets
 POS Tagging
 Rule-based tagging
 Markup Format
 Open source Toolkits

3/39
What is Part-of-Speech (POS)

 Generally speaking, Word Classes (=POS) :

 Verb, Noun, Adjective, Adverb, Article, …
 We can also include inflection:
 Verbs: Tense, number, …
 Nouns: Number, proper/common, …
 Adjectives: comparative, superlative, …
 …

4/39
Parts of Speech
 8 (ish) traditional parts of speech
 Noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc.
 Called: parts-of-speech, lexical categories, word classes, morphological classes, lexical tags...
 Lots of debate within linguistics about the number, nature, and universality of these
 We’ll completely ignore this debate.

5/39
7 Traditional POS Categories

N    noun         chair, bandwidth, pacing
V    verb         study, debate, munch
ADJ  adjective    purple, tall, ridiculous
ADV  adverb       unfortunately, slowly
P    preposition  of, by, to
PRO  pronoun      I, me, mine
DET  determiner   the, a, that, those

6/39
POS Tagging
 The process of assigning a part-of-speech or lexical class marker to each word in a collection.

WORD   tag
the    DET
koala  N
put    V
the    DET
keys   N
on     P
the    DET
table  N

7/39
Penn TreeBank POS Tag Set

 Penn Treebank: hand-annotated corpus of Wall Street Journal, 1M words
 46 tags
 Some particularities:
 to/TO not disambiguated
 Auxiliaries and verbs not distinguished

8/39
Penn Treebank Tagset

9/39
Why is POS tagging useful?
 Speech synthesis:
 How to pronounce “lead”?
 INsult inSULT
 OBject obJECT
 OVERflow overFLOW
 DIScount disCOUNT
 CONtent conTENT
 Stemming for information retrieval
 Can search for “aardvarks” and get “aardvark”
 Parsing, speech recognition, etc.
 Possessive pronouns (my, your, her) followed by nouns
 Personal pronouns (I, you, he) likely to be followed by verbs
 Need to know if a word is an N or V before you can parse
 Information extraction
 Finding names, relations, etc.
 Machine Translation
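
The speech synthesis point can be sketched as a lookup keyed on both the word and its tag. This is only an illustration: the `STRESS` table and the `pronounce` helper are hypothetical names, the coarse tags N, V and ADJ are assumed, and a real synthesizer would use a full pronunciation lexicon.

```python
# Hypothetical stress table: (word, coarse POS tag) -> stressed form.
# Entries follow the noun/verb (or noun/adjective) pairs listed on the slide.
STRESS = {
    ("insult", "N"): "INsult",
    ("insult", "V"): "inSULT",
    ("object", "N"): "OBject",
    ("object", "V"): "obJECT",
    ("discount", "N"): "DIScount",
    ("discount", "V"): "disCOUNT",
    ("content", "N"): "CONtent",
    ("content", "ADJ"): "conTENT",
}


def pronounce(word, pos):
    """Return the stressed form for a (word, tag) pair, or the word unchanged."""
    return STRESS.get((word.lower(), pos), word)


for word, pos in [("object", "N"), ("object", "V"), ("table", "N")]:
    print(word, pos, pronounce(word, pos))
```

Without the tag, the lookup has no way to choose between the two stress patterns, which is why tagging comes first.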
10/39
Equivalent Problem in Bioinformatics
 Durbin et al. Biological
Sequence Analysis, Cambridge
University Press.
 Several applications, e.g.
proteins
 From primary structure
ATCPLELLLD
 Infer secondary structure
HHHBBBBBC..

11/39
Why is POS Tagging Useful?
 First step of a vast number of practical tasks
 Speech synthesis
 How to pronounce “lead”?
 INsult inSULT
 OBject obJECT
 OVERflow overFLOW
 DIScount disCOUNT
 CONtent conTENT

 Parsing
 Need to know if a word is an N or V before you can parse
 Information extraction
 Finding names, relations, etc.
 Machine Translation
12/39
Open and Closed Classes
 Closed class: a small fixed membership
 Prepositions: of, in, by, …
 Auxiliaries: may, can, will, had, been, …
 Pronouns: I, you, she, mine, his, them, …
 Usually function words (short common words which
play a role in grammar)
 Open class: new ones can be created all the time
 English has 4: Nouns, Verbs, Adjectives, Adverbs
 Many languages have these 4, but not all!

13/39
Open Class Words
 Nouns
 Proper nouns (Boulder, Granby, Eli Manning)
 English capitalizes these.
 Common nouns (the rest).
 Count nouns and mass nouns
 Count: have plurals, get counted: goat/goats, one goat, two goats
 Mass: don’t get counted (snow, salt, communism) (*two snows)
 Adverbs: tend to modify things
 Unfortunately, John walked home extremely slowly yesterday
 Directional/locative adverbs (here, home, downhill)
 Degree adverbs (extremely, very, somewhat)
 Manner adverbs (slowly, slinkily, delicately)
 Verbs
 In English, have morphological affixes (eat/eats/eaten)

14/39
Closed Class Words
Examples:
 prepositions: on, under, over, …
 particles: up, down, on, off, …
 determiners: a, an, the, …
 pronouns: she, who, I, ..
 conjunctions: and, but, or, …
 auxiliary verbs: can, may, should, …
 numerals: one, two, three, third, …

15/39
Prepositions from CELEX

16/39
English Particles

17/39
Conjunctions

18/39
POS Tagging
Choosing a Tagset

 There are so many parts of speech, potential distinctions we can draw
 To do POS tagging, we need to choose a standard set of tags to work with
 Could pick very coarse tagsets
 N, V, Adj, Adv.
 More commonly used set is finer grained, the “Penn TreeBank tagset”, 45 tags
 PRP$, WRB, WP$, VBG
 Even more fine-grained tagsets exist
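
One way to see the relation between a coarse and a fine-grained tagset is to collapse Penn Treebank tags into the four coarse classes named above. The sketch below is an assumption for illustration, not part of the lecture: `coarse_tag` is a hypothetical helper, and it relies on the Penn convention that noun tags start with NN, verb tags with VB, adjective tags with JJ and adverb tags with RB. Everything else, including WRB and PRP$, falls into "Other".

```python
def coarse_tag(penn_tag):
    """Collapse a Penn Treebank tag into N, V, Adj, Adv, or Other."""
    if penn_tag.startswith("NN"):
        return "N"      # NN, NNS, NNP, NNPS
    if penn_tag.startswith("VB"):
        return "V"      # VB, VBD, VBG, VBN, VBP, VBZ
    if penn_tag.startswith("JJ"):
        return "Adj"    # JJ, JJR, JJS
    if penn_tag.startswith("RB"):
        return "Adv"    # RB, RBR, RBS
    return "Other"      # closed-class tags, punctuation, etc.


for tag in ["PRP$", "WRB", "WP$", "VBG", "NNS", "JJ"]:
    print(tag, "->", coarse_tag(tag))
```

The mapping only goes one way: a coarse tag cannot be expanded back into a fine-grained one without more information.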

19/39
Using the Penn Tagset
 The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
 Prepositions and subordinating conjunctions
marked IN (“although/IN I/PRP..”)
 Except the preposition/complementizer “to” is
just marked “TO”.

20/39
POS Tagging
 Words often have more than one POS: back
 The back door = JJ
 On my back = NN
 Win the voters back = RB
 Promised to back the bill = VB
 The POS tagging problem is to determine the
POS tag for a particular instance of a word.

These examples from Dekang Lin
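
A rough way to measure this kind of ambiguity is to collect, for every word type in a tagged sample, the set of tags it appears with. The sketch below uses the four "back" phrases as a toy sample. Only the tags for "back" come from the slide; the tags for the other words are assumed Penn-style tags, and the function names are hypothetical.

```python
from collections import defaultdict

# Toy tagged sample built from the "back" phrases.
# Tags for words other than "back" are assumptions for illustration.
TAGGED = [
    ("the", "DT"), ("back", "JJ"), ("door", "NN"),
    ("on", "IN"), ("my", "PRP$"), ("back", "NN"),
    ("win", "VB"), ("the", "DT"), ("voters", "NNS"), ("back", "RB"),
    ("promised", "VBD"), ("to", "TO"), ("back", "VB"),
    ("the", "DT"), ("bill", "NN"),
]


def tags_per_word(tagged_tokens):
    """Map each lowercased word type to the set of tags seen with it."""
    seen = defaultdict(set)
    for word, tag in tagged_tokens:
        seen[word.lower()].add(tag)
    return dict(seen)


def ambiguous_words(tagged_tokens):
    """Keep only the word types observed with more than one tag."""
    return {w: t for w, t in tags_per_word(tagged_tokens).items() if len(t) > 1}


print(ambiguous_words(TAGGED))
```

On a real corpus the same count gives the share of ambiguous word types, which is the usual starting point for judging how hard tagging is.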

21/39
How Hard is POS Tagging?
Measuring Ambiguity

22/39
Current Performance

 How many tags are correct?

 About 97% currently
 But baseline is already 90%
 Baseline algorithm:
 Tag every word with its most frequent tag
 Tag unknown words as nouns
 How well do people do?
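
The baseline algorithm above can be sketched in a few lines. This is a minimal illustration, not the lecture's implementation: the function names and the tiny training sample are assumptions, and NN stands in for "noun" as the unknown-word tag.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_tokens):
    """Learn the most frequent tag for each word type."""
    counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        counts[word.lower()][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def tag_baseline(words, model, unknown_tag="NN"):
    """Tag known words with their most frequent tag, unknown words as nouns."""
    return [(w, model.get(w.lower(), unknown_tag)) for w in words]


# Toy training data (assumed for illustration only).
TRAIN = [
    ("the", "DT"), ("koala", "NN"), ("put", "VBD"), ("the", "DT"),
    ("keys", "NNS"), ("on", "IN"), ("the", "DT"), ("table", "NN"),
    ("back", "NN"), ("back", "NN"), ("back", "VB"),
]

model = train_baseline(TRAIN)
print(tag_baseline(["the", "back", "door"], model))
```

The baseline ignores context entirely, so it always gives "back" the same tag regardless of the surrounding words.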

23/39
Quick Test: Agreement?

 the students went to class
 plays well with others
 fruit flies like a banana

DT: the, this, that
NN: noun
VB: verb
P: preposition
ADV: adverb

24/39
How to do it? History
[Figure: timeline of POS tagging, 1960 to 2000.
Methods and reported accuracy: Greene and Rubin, rule based (70%); HMM tagging, CLAWS (93%-95%); DeRose/Church, efficient HMM with sparse data (95%+); transformation-based tagging, Eric Brill, rule based (95%+); tree-based statistics, Helmut Schmid, rule based (96%+); trigram tagger, Kempe (96%+); neural network (96%+); combined methods (98%+).
Corpora: Brown Corpus created (EN-US, 1 million words), then tagged; LOB Corpus created (EN-UK, 1 million words), then tagged; British National Corpus (tagged by CLAWS); Penn Treebank Corpus (WSJ, 4.5M).
Also marked: POS tagging separated from other NLP.]

25/39
Two Methods for POS Tagging
1. Rule-based tagging
 (ENGTWOL)
2. Stochastic
1. Probabilistic sequence models
 HMM (Hidden Markov Model) tagging
 MEMMs (Maximum Entropy Markov Models)

26/39
Rule-Based Tagging
 Start with a dictionary
 Assign all possible tags to words from the
dictionary
 Write rules by hand to selectively remove
tags
 Leaving the correct tag for each word.

27/39
Rule-based taggers
 Early POS taggers all hand-coded
 Most of these (Harris, 1962; Greene and Rubin, 1971) and the best of the recent ones, ENGTWOL (Voutilainen, 1995), are based on a two-stage architecture
 Stage 1: look up word in lexicon to give list of potential
POSs
 Stage 2: Apply rules which certify or disallow tag
sequences
 Rules originally handwritten; more recently Machine
Learning methods can be used

28/39
Start With a Dictionary
• she: PRP
• promised: VBN, VBD
• to: TO
• back: VB, JJ, RB, NN
• the: DT
• bill: NN, VB
• Etc… for the ~100,000 words of English with more than 1 tag

29/39
Assign Every Possible Tag

                    NN
                    RB
     VBN            JJ          VB
PRP  VBD       TO   VB    DT    NN
She  promised  to   back  the   bill

30/39
Write Rules to Eliminate Tags
Eliminate VBN if VBD is an option when VBN|VBD follows “<start> PRP”

                    NN
                    RB
     VBN            JJ          VB
PRP  VBD       TO   VB    DT    NN
She  promised  to   back  the   bill
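
The two stages walked through on the last three slides can be sketched as follows. This is a minimal illustration under stated assumptions: the lexicon holds only the six words from the slide, the function names are hypothetical, and the single rule is a simplified reading of "eliminate VBN if VBD is an option when VBN|VBD follows <start> PRP". It is not the ENGTWOL rule formalism.

```python
# Stage 1 data: the dictionary entries from the slide.
LEXICON = {
    "she": ["PRP"],
    "promised": ["VBN", "VBD"],
    "to": ["TO"],
    "back": ["VB", "JJ", "RB", "NN"],
    "the": ["DT"],
    "bill": ["NN", "VB"],
}


def assign_all_tags(words):
    """Stage 1: look up every possible tag for each word (known words only)."""
    return [list(LEXICON[w.lower()]) for w in words]


def eliminate_vbn_after_start_prp(candidates):
    """Stage 2: drop VBN when VBD is also possible and the word follows
    a sentence-initial PRP."""
    result = [list(c) for c in candidates]
    if len(result) >= 2 and result[0] == ["PRP"]:
        if "VBN" in result[1] and "VBD" in result[1]:
            result[1].remove("VBN")
    return result


words = "She promised to back the bill".split()
candidates = assign_all_tags(words)
pruned = eliminate_vbn_after_start_prp(candidates)
for word, tags in zip(words, pruned):
    print(word, tags)
```

After this one rule, "promised" is resolved to VBD while "back" and "bill" are still ambiguous. A full rule-based tagger would keep applying further hand-written rules until each word is left with one tag.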

31/39
Inline Mark-up
 POS Tagging
http://nlp.cs.qc.cuny.edu/wsj_pos.zip
 Input Format
Pierre Vinken, 61/CD years/NNS old , will join the board as a nonexecutive director Nov. 29.
 Output Format
Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/, will/MD join/VB the/DT board/NN as/IN a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ./.
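
The output format above is easy to read and write programmatically. The sketch below is an illustration with hypothetical helper names. It assumes tokens are separated by whitespace and that every token contains at least one slash, with the tag after the last slash.

```python
def parse_tagged(line):
    """Split a 'word/TAG word/TAG ...' line into (word, tag) pairs.

    The split is on the last slash, so tokens such as './.' and ',/,' work.
    """
    pairs = []
    for token in line.split():
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


def format_tagged(pairs):
    """Join (word, tag) pairs back into the inline mark-up format."""
    return " ".join(f"{word}/{tag}" for word, tag in pairs)


LINE = ("Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/, will/MD "
        "join/VB the/DT board/NN as/IN a/DT nonexecutive/JJ director/NN "
        "Nov./NNP 29/CD ./.")

pairs = parse_tagged(LINE)
print(pairs[:3])
print(format_tagged(pairs) == LINE)
```

Splitting on the last slash rather than the first keeps words that themselves contain a slash intact.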

32/39
POS Tagging Tools
 Stanford tagger (Loglinear tagger)
http://nlp.stanford.edu/software/tagger.shtml
 Brill tagger
 http://www.tech.plym.ac.uk/soc/staff/guidbugm/software/RULE_BASED_TAGGER_V.1.14.tar.Z
 tagger LEXICON test BIGRAMS LEXICALRULEFILE CONTEXTUALRULEFILE
 YamCha (SVM)
http://chasen.org/~taku/software/yamcha/
 MXPOST (Maximum Entropy)
ftp://ftp.cis.upenn.edu/pub/adwait/jmx/
 More complete list at:
http://www-nlp.stanford.edu/links/statnlp.html#Taggers

33/39
NLP Toolkits
 Uniform CL Annotation Platform
 UIMA (IBM NLP platform): http://incubator.apache.org/uima/svn.html
 Mallet (UMASS): http://mallet.cs.umass.edu/index.php/Main_Page
 MinorThird (CMU): http://minorthird.sourceforge.net/
 NLTK: http://nltk.sourceforge.net/
Natural language toolkit, with data sets (demo)

 Information Extraction
 Jet (NYU IE toolkit): http://www.cs.nyu.edu/cs/faculty/grishman/jet/license.html
 Gate: http://gate.ac.uk/download/index.html
University of Sheffield IE toolkit
 Information Retrieval
 INDRI: http://www.lemurproject.org/indri/
Information Retrieval toolkit
 Machine Translation
 Compara: http://adamastor.linguateca.pt/COMPARA/Welcome.html
 ISI decoder: http://www.isi.edu/licensed-sw/rewrite-decoder/
 MOSES: http://www.statmt.org/moses/

34/39
Looking Ahead: Next Class

 Machine Learning for POS Tagging: Hidden Markov Model

35/39
