Part-of-Speech (POS) Tagging

POS tagging & terminologies

• Parts of speech - word classes or lexical categories.


• POS tagging - Word category disambiguation
• Collection of tags - tagset
• Classifying words into their parts of speech and labeling
them by considering the adjacent words is known
as part-of-speech tagging or POS-tagging
POS Tagging

The process of assigning a part-of-speech or lexical class marker to each word in a corpus.

Some examples:
N    noun         chair, printer
V    verb         study, chat
ADJ  adjective    yellow, tall
ADV  adverb       unfortunately, fast
P    preposition  of, by, to
PRO  pronoun      I, me, mine
DET  determiner   the, a, that, an
Applications for POS Tagging
Speech synthesis
• lead (verb) – leading a procession
• lead (noun) – the chemical element; the POS determines the pronunciation
Parsing: e.g., Time flies like an arrow
• Is flies an N or a V?
Word prediction in speech recognition / typing
• Possessive pronouns (my, your, her) are likely to be followed by nouns
• Personal pronouns (I, you, he) are likely to be followed by verbs
Machine Translation
Deriving the internal structure of a sentence, which
• finds application in IR, IE, and word sense disambiguation
Closed Classes in English
• Fixed membership: prepositions, determiners, pronouns, conjunctions, particles, auxiliaries
Open Classes
• New words are added freely: nouns, verbs, adjectives, adverbs
Choosing a POS Tagset
• Brown Corpus: 1M words, 87 tags – more informative but more difficult to tag
• Penn Treebank: hand-annotated corpus of Wall Street Journal text, 1M words, a 45-tag subset
• The C5 tagset used for the British National Corpus (BNC) has 61 tags
Penn Treebank Tagset
Using Penn Treebank Tags

The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.

• Prepositions are marked IN
• "to" is just marked TO
Tag Ambiguity

Words often have more than one POS, e.g. back:
• The back door = JJ
• On my back = NN
• Promised to back the bill = VB
• The POS tagging problem is to determine the POS tag for a particular instance of a word
Tagging Whole Sentences with POS

• Ambiguous POS contexts
• E.g., Time flies like an arrow.
• Possible POS assignments:
• Time/[V,N] flies/[V,N] like/[V,Prep] an/Det arrow/N
• Time/N flies/V like/Prep an/Det arrow/N
• Time/V flies/N like/Prep an/Det arrow/N
• Time/N flies/N like/V an/Det arrow/N
• …
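The combinatorics can be made concrete in a few lines of Python; the per-word tag lists below follow the example above (a toy sketch, enumerating every candidate assignment by brute force):

```python
from itertools import product

# Possible tags per word, taken from the example above.
lexicon = {
    "Time": ["N", "V"],
    "flies": ["N", "V"],
    "like": ["V", "Prep"],
    "an": ["Det"],
    "arrow": ["N"],
}
sentence = ["Time", "flies", "like", "an", "arrow"]

# Every combination of per-word tags is a candidate sentence-level assignment.
assignments = list(product(*(lexicon[w] for w in sentence)))
print(len(assignments))  # 2 * 2 * 2 * 1 * 1 = 8 candidate taggings
for tags in assignments[:3]:
    print(" ".join(f"{w}/{t}" for w, t in zip(sentence, tags)))
```

Even this short sentence yields eight candidate taggings; ambiguity grows multiplicatively with sentence length, which is why taggers need more than word-by-word lookup.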
How Big is this Ambiguity Problem?
• Many words have only one POS tag (e.g. is, Mary, very, smallest)
• Others have a single most likely tag
• Tags also tend to co-occur regularly with other tags (e.g. Det, N)

POS Tagging Approaches
• Rule-based: human-crafted rules based on lexical and other linguistic knowledge
• Learning-based: trained on human-annotated corpora like the Penn Treebank
• Statistical models: Hidden Markov Model (HMM), Maximum Entropy Markov Model (MEMM), Conditional Random Field (CRF)
• Rule learning: Transformation-Based Learning (TBL)
• Neural networks: recurrent networks like Long Short-Term Memory (LSTM)
• Learning-based approaches have been found to be more effective
Some Ways to do POS Tagging

Rule-based tagging
• E.g., the EngCG / ENGTWOL (English Two-Level) tagger
Transformation-based tagging
• Learned rules (statistical and linguistic)
• E.g., the Brill tagger
Stochastic (probabilistic) tagging
• HMM (Hidden Markov Model) tagging
Rule-Based Tagging
1. Start with a dictionary of words and possible tags
2. Assign all possible tags to words using the dictionary
3. Write rules by hand to selectively remove tags
4. Stop when each word has exactly one (probably correct) tag
Start with a POS Dictionary
she       PRP
promised  VBN, VBD
to        TO
back      VB, JJ, RB, NN
the       DT
bill      NN, VB
Assign All Possible POS to Each Word

She   promised   to   back             the   bill
PRP   VBN, VBD   TO   VB, JJ, RB, NN   DT    NN, VB
Apply Rules Eliminating Some POS (before applying the rule)

Eliminate VBN if VBD is an option when VBN|VBD follows "<start> PRP"

She   promised   to   back             the   bill
PRP   VBN, VBD   TO   VB, JJ, RB, NN   DT    NN, VB
Apply Rules Eliminating Some POS (after applying the rule)

Eliminate VBN if VBD is an option when VBN|VBD follows "<start> PRP"

She   promised   to   back             the   bill
PRP   VBD        TO   VB, JJ, RB, NN   DT    NN, VB
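The elimination step above can be sketched in Python; the dictionary and the single rule are the ones shown on these slides (a toy illustration, not the ENGTWOL rule engine):

```python
# Per-word candidate tag sets from the POS dictionary above.
tags = {
    "She": {"PRP"},
    "promised": {"VBN", "VBD"},
    "to": {"TO"},
    "back": {"VB", "JJ", "RB", "NN"},
    "the": {"DT"},
    "bill": {"NN", "VB"},
}
sentence = ["She", "promised", "to", "back", "the", "bill"]

# Rule: eliminate VBN if VBD is an option when the word follows "<start> PRP".
for i, word in enumerate(sentence):
    if i == 1 and "PRP" in tags[sentence[0]] and {"VBN", "VBD"} <= tags[word]:
        tags[word].discard("VBN")

print(tags["promised"])  # {'VBD'}
```

A full rule-based tagger applies hundreds of such constraints until each word keeps a single tag.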
EngCG: English Constraint Grammar

Grammar for morphological (e.g. part-of-speech) disambiguation:
• 1,200 "grammar-based" constraints
• 99.7–100% of all words retain the appropriate morphological reading
• 3–7% of all words remain (partly) ambiguous
• 200 "heuristic" constraints
• resolve some 50% of the remaining ambiguities
• after heuristic disambiguation, 99.5% or more retain the appropriate morphological reading
Grammar for determining syntactic functions:
• 830 syntactic constraints for syntactic ambiguity resolution
• some 85–90% of all words become syntactically unambiguous
ENGTWOL (English Two-Level) Tagger
✓ 1,100 constraints
✓ 93–97% of the words are correctly disambiguated
✓ Heuristic rules can be applied over the rest
Sample ENGTWOL Dictionary
Regex Tagger

✓ Define regular expressions
✓ Define a tag for each expression

patterns = [
    (r'^[0-9]+(\.[0-9]+)?$', 'CD'),   # cardinal numbers
    (r'^(The|the|A|a|An|an)$', 'AT'), # articles
    (r'.*able$', 'JJ'),               # adjectives
    (r'.*ness$', 'NN'),               # nouns formed from adjectives
    (r'.*ly$', 'RB'),                 # adverbs
    (r'.*s$', 'NNS'),                 # plural nouns
    (r'.*ing$', 'VBG'),               # gerunds
    (r'.*ed$', 'VBD'),                # past-tense verbs
    (r'.*', 'NN'),                    # nouns (default)
]
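A minimal, self-contained version of such a tagger, where the first matching pattern wins (as in NLTK's RegexpTagger, though this sketch uses only the standard library):

```python
import re

# Ordered (pattern, tag) pairs; earlier patterns take precedence.
patterns = [
    (r'^[0-9]+(\.[0-9]+)?$', 'CD'),   # cardinal numbers
    (r'^(The|the|A|a|An|an)$', 'AT'), # articles
    (r'.*able$', 'JJ'),               # adjectives
    (r'.*ness$', 'NN'),               # nouns formed from adjectives
    (r'.*ly$', 'RB'),                 # adverbs
    (r'.*s$', 'NNS'),                 # plural nouns
    (r'.*ing$', 'VBG'),               # gerunds
    (r'.*ed$', 'VBD'),                # past-tense verbs
    (r'.*', 'NN'),                    # default: noun
]

def regex_tag(words):
    """Tag each word with the tag of the first pattern it matches."""
    tagged = []
    for w in words:
        for pat, tag in patterns:
            if re.match(pat, w):
                tagged.append((w, tag))
                break
    return tagged

print(regex_tag(["The", "quickly", "running", "42", "dogs", "barked"]))
```

Note that such taggers are crude: pattern order matters (e.g. `.*s$` fires before `.*ing$` would for "songs"), and anything unmatched falls through to the NN default.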
Transformation-Based (Brill) Tagging

Combines rule-based and stochastic tagging:
• Like the rule-based approach, rules are used to specify tags
• Like the stochastic approach, a tagged corpus is used to find the best-performing rules

TBL Tagging Algorithm
Input:
• Tagged corpus
• Dictionary (with most frequent tags)
• Step 1: Label every word with its most likely tag (from the dictionary)
• Step 2: Check every possible transformation and select the one that most improves tag accuracy (against the gold standard)
• Step 3: Re-tag the corpus applying this rule, and add the rule to the end of the rule set
• Repeat steps 2–3 until some stopping criterion is reached
• E.g., X% correct with respect to the training corpus
Templates for TBL

Sample TBL Rule Application
Label every word with its most likely tag:
• P(NN|race) = .98, P(VB|race) = .02
• is/VBZ expected/VBN to/TO race/NN tomorrow/NN
Apply the rule that improves tag accuracy:
• "Change NN to VB when the previous tag is TO"
… is/VBZ expected/VBN to/TO race/NN tomorrow/NN
becomes
… is/VBZ expected/VBN to/TO race/VB tomorrow/NN
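That single transformation step can be sketched as follows (toy code; `apply_rule` is an illustrative helper, not part of the Brill tagger itself):

```python
# Most-likely-tag baseline from the example above.
tagged = [("is", "VBZ"), ("expected", "VBN"), ("to", "TO"),
          ("race", "NN"), ("tomorrow", "NN")]

def apply_rule(tagged, from_tag, to_tag, prev_tag):
    """Change from_tag to to_tag wherever the previous tag is prev_tag."""
    out = list(tagged)
    for i in range(1, len(out)):
        if out[i][1] == from_tag and out[i - 1][1] == prev_tag:
            out[i] = (out[i][0], to_tag)
    return out

# Learned transformation: "Change NN to VB when previous tag is TO".
retagged = apply_rule(tagged, "NN", "VB", "TO")
print(retagged)
```

Note that "tomorrow" keeps its NN tag: its previous tag after the rewrite is VB, not TO, so the rule does not fire there.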
TBL Issues
✓ Transformations could in principle be applied endlessly; a stopping criterion is needed
✓ Rules may interrelate
Evaluating Tagging Approaches

Possible gold standards:
• Annotated corpus
• Human performance (96–97%)
• How well do humans agree?
Methodology: error analysis
Confusion matrix:
• E.g., which tags did we most often confuse with which other tags?
• E.g., 8.7% of the total errors were caused by mistagging NN as JJ
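A confusion matrix over (gold, predicted) tag pairs can be built directly with a counter; the six-tag sample here is invented for illustration, and real error analysis would run over a full test corpus:

```python
from collections import Counter

# Toy gold-standard and predicted tag sequences.
gold = ["NN", "JJ", "VB", "NN", "NN", "DT"]
pred = ["NN", "JJ", "VB", "JJ", "NN", "DT"]

# confusion[(g, p)] counts how often gold tag g was predicted as p.
confusion = Counter(zip(gold, pred))
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)

print(confusion[("NN", "JJ")])  # one NN mistagged as JJ
print(round(accuracy, 3))
```

Off-diagonal entries like (NN, JJ) point to the systematic confusions worth targeting with better features or rules.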
More Complex Tagging Issues
Tag indeterminacy:
✓ The gold standard / ground truth is not clear
Tagging multipart words:
✓ wouldn't --> would/MD n't/RB
Unknown words:
✓ Assume all tags are equally likely
✓ Use morphology
Sequential Taggers

N-gram tagger: considers the context of the previous tokens to predict the POS tag for the given token
• Unigram tagger
• Bigram tagger
• Trigram tagger
Regex tagger

Comparing N-gram taggers
• Unigram: predicts the most frequent tag for every given token
• Bigram tagger:
• trains on (previous tag, word) contexts stored as tuples
• assigns the test word the tag seen most often in that context
• Trigram tagger:
• looks at the previous two tags with a similar process
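A unigram tagger with a default-tag backoff can be sketched in a few lines; the tiny training corpus below is invented for illustration (real taggers train on corpora like the Penn Treebank):

```python
from collections import Counter, defaultdict

# Toy tagged training corpus: a list of (word, tag) sentences.
train = [[("the", "DT"), ("dog", "NN"), ("runs", "VBZ")],
         [("the", "DT"), ("cat", "NN"), ("runs", "VBZ")],
         [("a", "DT"), ("run", "NN")]]

# Count how often each word carries each tag.
counts = defaultdict(Counter)
for sent in train:
    for word, tag in sent:
        counts[word][tag] += 1

def unigram_tag(words, default="NN"):
    """Tag each word with its most frequent training tag, backing off
    to a default tag for unseen words."""
    return [(w, counts[w].most_common(1)[0][0] if counts[w] else default)
            for w in words]

print(unigram_tag(["the", "dog", "sleeps"]))
```

A bigram tagger extends this by keying the counts on (previous tag, word) pairs, backing off to the unigram tagger when the bigram context was never seen in training.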
ML Classification Approaches
• Decision trees and rule learning
• Naïve Bayes and Bayesian networks
• Logistic regression / Maximum Entropy (MaxEnt)
• Perceptron and neural networks
• Support Vector Machines (SVMs)
• Nearest-neighbor / instance-based methods
Beyond ML Classification

• Standard classification assumes individual decisions are disconnected and independent
• Many NLP problems do not satisfy this assumption
• They involve making many connected decisions
• each resolving a different ambiguity
• mutually dependent
• More sophisticated learning and inference techniques are needed
Sequence Labeling Problem

• Many NLP problems can be viewed as sequence labeling
• Each token in a sequence is assigned a label
• Labels of tokens are dependent on the labels of other tokens in the sequence
Information Extraction
• Identify phrases in language that refer to specific types of entities and relations in text.
• Named entity recognition is the task of identifying names of people, places, organizations, etc. in text:
• Sundar Pichai [person] is the CEO of Google Corporation [organization] and lives in New York [place].
• Extract pieces of information relevant to a specific application, e.g. used-car ads (make, model, year, mileage, price):
• For sale, Benz [make], C3 [model], 2016 [year], 20,000 mi [mileage], $11K [price] or best offer. Available starting July 30, 2017.
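Slot filling of this kind is often first approximated with hand-written patterns; here is a toy sketch over the ad above (the `year` and `price` patterns and variable names are illustrative assumptions, not a general extractor):

```python
import re

ad = "For sale, Benz, C3, 2016, 20,000 mi, $11K or best offer."

# A four-digit year starting with 19 or 20, and a dollar amount like $11K.
year = re.search(r'\b(19|20)\d{2}\b', ad).group(0)
price = re.search(r'\$\d+K?', ad).group(0)
print(year, price)
```

Such patterns break quickly on varied text, which is exactly why information extraction is usually framed as a learned sequence labeling task instead.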
Semantic Role Labeling
For each clause, determine the semantic role played by each noun phrase that is an argument to the verb:
• John [agent] drove Mary [target] from Rome [source] to Greece [destination] in his Benz [instrument].
Also called "case role analysis," "thematic analysis," and "shallow semantic parsing."

Bioinformatics
Sequence labeling is also valuable for labeling genetic sequences (e.g. exon vs. intron regions) in genome analysis:
• AGCTAACGTTCGATACGGATTACAGCCT
Problems with Sequence Labeling as Classification
• Not easy to integrate information from the categories of tokens on both sides.
• Difficult to propagate uncertainty between decisions.
• Difficult to "collectively" determine the most likely joint assignment of categories.
Probabilistic Sequence Models

Probabilistic sequence models allow:
• Integrating uncertainty over multiple, interdependent classifications
• Collectively determining the most likely global assignment

Two standard models:
• Hidden Markov Model (HMM)
• Conditional Random Field (CRF)
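For an HMM, the most likely joint tag assignment is found with the Viterbi algorithm. A minimal sketch with hand-set toy probabilities (two tags, two words; the numbers are illustrative, not estimated from any corpus):

```python
# Toy HMM: start, transition, and emission probabilities.
states = ["N", "V"]
start = {"N": 0.6, "V": 0.4}
trans = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}}
emit = {"N": {"time": 0.7, "flies": 0.3},
        "V": {"time": 0.2, "flies": 0.8}}

def viterbi(words):
    """Return the most likely tag sequence for words under the toy HMM."""
    V = [{s: start[s] * emit[s].get(words[0], 1e-6) for s in states}]
    back = []
    for w in words[1:]:
        row, ptr = {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: V[-1][p] * trans[p][s])
            row[s] = V[-1][best_prev] * trans[best_prev][s] * emit[s].get(w, 1e-6)
            ptr[s] = best_prev
        V.append(row)
        back.append(ptr)
    # Trace back the best path from the most probable final state.
    best = max(states, key=lambda s: V[-1][s])
    path = [best]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi(["time", "flies"]))
```

Unlike tagging each word independently, Viterbi maximizes the probability of the whole sequence, which is the "collective" inference the slides call for.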
References
• Speech and Language Processing, Daniel Jurafsky and James H. Martin
