Lecture#11 (POS Tagging)
by Safdar Hussain
Topics
• Part of Speech Tagging
• The Penn Treebank Part of Speech tags, Problems in POS Tagging
• Techniques of POS Tagging
Parts of Speech (POS) Tagging
• A tag is a label attached to a word. It is usually the word's grammatical
category (its part of speech), but it can also carry other information, such
as semantic information.
• Part-of-Speech (POS) tagging is a preprocessing step in NLP that involves
assigning a grammatical category or part-of-speech label (such as noun, verb,
adjective, etc.) to each word in a sentence.
Parts of Speech (POS) Tagging-Example
The Penn Treebank POS tags
• Many part-of-speech tagsets exist, but most modern language processing on
English uses the 45-tag Penn Treebank tagset (Marcus et al., 1993).
• This tagset has been used to label a wide variety of corpora, including the
Brown corpus, the Wall Street Journal corpus, and the Switchboard corpus.
The Penn Treebank POS tags
Parts of speech are generally represented by placing the tag after each word,
delimited by a slash, as in the following examples:
• Sentences with POS tags:
– The grand jury commented on a number of other topics
– The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ
topics/NNS ./.
– There are 70 children there
– There/EX are/VBP 70/CD children/NNS there/RB
– Preliminary findings were reported in today’s New England Journal of Medicine
– Preliminary/JJ findings/NNS were/VBD reported/VBN in/IN today/NN ’s/POS
New/NNP England/NNP Journal/NNP of/IN Medicine/NNP ./.
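The word/TAG notation above is easy to read back into a program. The following is a minimal sketch, assuming tokens are separated by whitespace and each token ends in a slash followed by its tag; the function name `parse_tagged` is made up for this illustration.

```python
def parse_tagged(text):
    """Split a 'word/TAG word/TAG ...' string into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split on the LAST slash so that a token such as './.' still
        # yields the word '.' and the tag '.'.
        word, tag = token.rsplit("/", 1)
        pairs.append((word, tag))
    return pairs


tagged = parse_tagged("There/EX are/VBP 70/CD children/NNS there/RB")
print(tagged)
# [('There', 'EX'), ('are', 'VBP'), ('70', 'CD'), ('children', 'NNS'), ('there', 'RB')]
```

Note that the two occurrences of "there" receive different tags (EX and RB), which is why the tag has to be stored per token rather than per word type.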
The Penn Treebank POS tags
Tag    Description
DT     Determiner
JJ     Adjective
NN     Common noun
NNS    Plural noun
NNP    Proper noun
VBD    Verb, past tense
VBN    Verb, past participle
VBP    Verb, non-3rd person singular present
IN     Preposition
CD     Cardinal number
RB     Adverb
POS    Possessive ending ('s)
EX     Existential there
• Note that since New England Journal of Medicine is a proper noun, the
Treebank tagging marks each noun in it separately as NNP, including
Journal and Medicine, which might otherwise be labeled as common
nouns (NN).
Problems in POS Tagging
One of the main challenges in POS tagging is ambiguity: many English words can take
several possible parts of speech.
Ambiguity: A word, phrase, statement, or idea is ambiguous when it can be understood in
more than one way.
Example: "I saw the man on the hill with the telescope."
In this sentence, it is unclear whether the man who was seen was on the hill and had a
telescope, or whether the observer was on the hill and used a telescope to see the man.
The prepositional phrases "on the hill" and "with the telescope" introduce ambiguity about
the location of the man and of the telescope.
This is ambiguity at the level of the whole sentence. In POS tagging the same problem
appears at the level of a single word: "play", for example, can be a noun or a verb.
Example of Ambiguity in POS Tagging
Techniques
Rule-based POS Tagging
Stochastic POS Tagging
Word Frequency Measurements
In this approach, an ambiguous word (a word with two or more possible POS tags) is
assigned the tag that it carries most frequently in the training corpus.
Let's understand this approach using some example sentences:
Ambiguous word = "play"
Sentence 1: I play cricket every day. POS tag of "play" = VERB
Sentence 2: I want to perform a play. POS tag of "play" = NOUN
The word frequency method checks which POS tag is used most frequently for "play" in
the corpus. Suppose this most frequent tag is VERB; then every occurrence of "play" is
tagged VERB.
The main drawback of this approach is that each word is tagged without regard to its
context, so it can yield invalid sequences of tags. In Sentence 2, for instance, "a play"
would be tagged as a determiner followed by a verb.
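The word frequency method can be sketched in a few lines. This is only an illustration under stated assumptions: the tiny corpus is invented, the function names are hypothetical, and unknown words fall back to NOUN, which is a choice made here and not something the lecture specifies.

```python
from collections import Counter, defaultdict

# Toy labeled corpus, invented for illustration: (word, tag) pairs.
corpus = [
    ("I", "PRON"), ("play", "VERB"), ("cricket", "NOUN"),
    ("they", "PRON"), ("play", "VERB"), ("chess", "NOUN"),
    ("a", "DET"), ("play", "NOUN"),
]


def train_most_frequent(pairs):
    """Map each word to the tag it carries most often in the corpus."""
    counts = defaultdict(Counter)
    for word, tag in pairs:
        counts[word.lower()][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def tag_sentence(words, model, default="NOUN"):
    """Tag each word independently; unknown words get the default tag."""
    return [(w, model.get(w.lower(), default)) for w in words]


model = train_most_frequent(corpus)
result = tag_sentence(["I", "want", "to", "perform", "a", "play"], model)
print(result[-2:])  # [('a', 'DET'), ('play', 'VERB')]
```

Because "play" is a verb in two of its three corpus occurrences, it is tagged VERB even after the determiner "a", which reproduces the invalid tag sequence described above.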
Tag Sequence Probabilities
In this method (also called the n-gram approach), the best tag for a given word is the
one that is most probable given the tags of the n previous words.
Assume we have a new sequence of 4 words, w1, w2, w3, w4, and we need to identify the
POS tag of w4.
If n = 3, we consider the POS tags of the 3 words before w4 and look up that tag sequence
in the labeled corpus.
Let's say the POS tags are
w1 = NOUN, w2 = VERB, w3 = DETERMINER
In short, N, V, D: NVD
Then we search the labeled corpus for this NVD sequence.
Let's say we find 100 such NVD sequences. Of these, 10 are followed by a word tagged
NOUN and 90 are followed by a word tagged VERB.
Since VERB is the more probable next tag (90/100 = 0.9), the POS tag of w4 = VERB.
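The counting procedure above can be sketched as follows. The corpus here is synthetic, built only to reproduce the 10/90 split from the example, and the function names are hypothetical. A real tagger would also combine these counts with word-level evidence, which this sketch leaves out.

```python
from collections import Counter, defaultdict


def count_next_tags(tag_sequences, n=3):
    """For every run of n consecutive tags, count which tag follows it."""
    following = defaultdict(Counter)
    for tags in tag_sequences:
        for i in range(len(tags) - n):
            context = tuple(tags[i:i + n])
            following[context][tags[i + n]] += 1
    return following


def best_next_tag(following, context):
    """Return the most frequent next tag and its relative frequency."""
    counts = following.get(tuple(context))
    if not counts:
        return None, 0.0  # context never seen in the corpus
    tag, count = counts.most_common(1)[0]
    return tag, count / sum(counts.values())


# Synthetic corpus matching the example: 100 NVD contexts,
# 10 followed by NOUN and 90 followed by VERB.
corpus_tags = ([["NOUN", "VERB", "DET", "NOUN"]] * 10
               + [["NOUN", "VERB", "DET", "VERB"]] * 90)

following = count_next_tags(corpus_tags, n=3)
tag, prob = best_next_tag(following, ["NOUN", "VERB", "DET"])
print(tag, prob)  # VERB 0.9
```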
Transformation-based POS Tagging
Transformation-based tagging is also called Brill tagging. It is a rule-based algorithm for
automatically assigning POS tags to a given text. It starts from an initial tagging and
repeatedly applies transformation rules, each of which changes one tag to another in a
specified context.
It draws inspiration from both of the taggers described earlier: rule-based and
stochastic.
Like a rule-based tagger, it relies on rules that specify which tags should be assigned to
which words.
Like a stochastic tagger, it is a machine learning technique: its rules are induced
automatically from data.
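To make the idea concrete, here is a heavily simplified sketch of transformation-based learning. It assumes a single rule template ("change tag A to tag B when the previous tag is C"), a one-sentence training set, and an initial tagging in which "play" is always tagged NN; all of these are assumptions for illustration. Brill's actual tagger uses many more templates and a large corpus.

```python
def apply_rule(tags, rule):
    """Change from_tag to to_tag wherever the previous tag is prev_tag."""
    from_tag, to_tag, prev_tag = rule
    out = list(tags)
    for i in range(1, len(tags)):
        # Conditions are checked against the tags as they were before
        # this rule was applied.
        if tags[i] == from_tag and tags[i - 1] == prev_tag:
            out[i] = to_tag
    return out


def errors(tags, gold):
    """Number of positions where the current tags disagree with gold."""
    return sum(t != g for t, g in zip(tags, gold))


def learn_rules(initial, gold, tagset):
    """Greedily pick the rule that removes the most errors, then repeat."""
    rules, current = [], list(initial)
    while True:
        best_rule, best_err = None, errors(current, gold)
        for from_tag in tagset:
            for to_tag in tagset:
                if from_tag == to_tag:
                    continue
                for prev_tag in tagset:
                    rule = (from_tag, to_tag, prev_tag)
                    err = errors(apply_rule(current, rule), gold)
                    if err < best_err:
                        best_rule, best_err = rule, err
        if best_rule is None:  # no rule improves the tagging any further
            return rules, current
        rules.append(best_rule)
        current = apply_rule(current, best_rule)


words = ["We", "want", "to", "play", "the", "play"]
initial = ["PRP", "VBP", "TO", "NN", "DT", "NN"]  # "play" always NN
gold = ["PRP", "VBP", "TO", "VB", "DT", "NN"]     # correct tags

rules, final = learn_rules(initial, gold, sorted(set(initial + gold)))
print(rules)  # [('NN', 'VB', 'TO')]
```

The learned rule reads "change NN to VB when the previous tag is TO". It corrects "to play" while leaving "the play" as a noun, which shows both sides of the method: the output is a readable rule, and the rule was induced from tagged data.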
Limitations
Rule-Based POS Tagging Stochastic POS Tagging
Uses of POS Tagging
Named Entity Recognition (NER): Identifying specific entities like names, places, organizations in text.
Information Retrieval: Using POS tags for specific searches within text data.
Importance of POS Tagging
Defines Word Roles: Assigns grammatical categories (nouns, verbs, adjectives, etc.)
to words in sentences.
Aids Language Understanding: Facilitates accurate comprehension of sentence
structures.
Enhances NLP Tasks: Improves accuracy in translation, summarization, and other
language processing tasks.
Supports Information Extraction: Assists in identifying entities, grammar checking,
and refining search queries.
Enables Natural-Sounding Speech: Helps in creating more lifelike text-to-speech
systems.