Module 2
Parts of Speech Tagging
Prepared by
Dr. Venkata Rami Reddy Ch
SCOPE
Syllabus
• Parts of Speech Tagging and Named Entities
• Tagging in NLP
• Sequential tagger
• N-gram tagger
• Regex tagger
• Brill tagger
• NER tagger
• Machine learning taggers: MEC, HMM, CRF
Part-of-Speech (POS) Tagging
• Part-of-speech (POS) tagging is a process in NLP where each word in a text is
labeled with its corresponding part of speech.
• Parts of speech include nouns, verbs, adjectives, and other grammatical categories.
• It helps algorithms understand the grammatical structure and meaning of a
text.
• POS tagging is useful for a variety of NLP tasks, such as information extraction,
named entity recognition, and machine translation.
Part-of-Speech (POS) Tags
1. Noun: A noun is the name of a person, place, thing, or idea.
2. Pronoun: A pronoun is a word used in place of a noun.
3. Verb: A verb expresses action or being.
4. Adjective: An adjective modifies or describes a noun or pronoun.
5. Adverb: An adverb modifies or describes a verb, an adjective, or another adverb.
6. Preposition: A preposition is a word placed before a noun or pronoun to form a phrase modifying another word in the sentence.
7. Conjunction: A conjunction joins words, phrases, or clauses.
8. Interjection: An interjection is a word used to express emotion.
9. Determiner or Article: A grammatical marker of definiteness (the) or indefiniteness (a, an).

Let's take an example.
Text: "The cat sat on the mat."
POS tags:
• The: determiner
• cat: noun
• sat: verb
• on: preposition
• the: determiner
• mat: noun
POS Tagging architecture
POS Tagset
• A tagset is the collection of tags from which the tagger selects the appropriate tag and
attaches it to each word.
• Different POS tag sets are used depending on the language and the application.
Penn Treebank POS Tagset
• Common Penn Treebank tags include DT (determiner), NN (noun, singular), NNS (noun, plural), VB (verb, base form), VBD (verb, past tense), JJ (adjective), RB (adverb), and IN (preposition).
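NLTK can print the Penn Treebank definition of any tag; a minimal sketch (assumes the 'tagsets' resource has been downloaded):

import nltk
nltk.download('tagsets')

# Print the Penn Treebank definition and example uses of the JJ (adjective) tag
nltk.help.upenn_tagset('JJ')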
Lexicons
• Lexicons are structured collections of words or phrases that include additional information,
such as part of speech, meanings, synonyms, or domain-specific attributes.
• A dictionary that contains words and their possible tags.
• The WordNet Lexicon is a widely used lexical database in NLP that groups words into
nouns, verbs, adjectives, and adverbs.
• For example, the word "run" might have tags such as verb or noun.
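This can be checked directly with NLTK's WordNet interface; a minimal sketch (assumes the 'wordnet' resource):

import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet as wn

# "run" has both noun ('n') and verb ('v') synsets in WordNet
print({synset.pos() for synset in wn.synsets('run')})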
POS Tagger
• A part-of-speech tagger, or POS tagger, processes a sequence of words and attaches a
part-of-speech tag to each word.
• It is a program that carries out POS tagging.
• Taggers utilize various types of data for POS tagging: lexicons, dictionaries, rules, etc.
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

text = "The quick brown fox jumps over the lazy dog."
# Split the text into word tokens
tokens = nltk.word_tokenize(text)

# Perform POS tagging
pos_tags = nltk.pos_tag(tokens)
print(pos_tags)
# [('The', 'DT'), ('quick', 'JJ'), ...]
Regex Tagger
• A regex tagger assigns a tag to each word based on regular-expression patterns, for example (see the sketch below):
Determiners: \b(the|a|an)\b
Adjectives ending in 'able': \b\w+able\b
Past-tense verbs: .*ed$
Adverbs: .*ly$
Pronouns: \b(I|my|he|him|his|she|her|we|you|it|they)\b
Prepositions: \b(on|in|at|by|with|about|into|to)\b
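These patterns can be plugged into NLTK's RegexpTagger, which tries each (pattern, tag) pair in order and assigns the tag of the first pattern that matches; a minimal sketch (the tag names and the final NN fallback are illustrative choices, not from the slides):

from nltk.tag import RegexpTagger

patterns = [
    (r'^(the|a|an)$', 'DT'),     # determiners
    (r'.*able$', 'JJ'),          # adjectives ending in 'able'
    (r'.*ed$', 'VBD'),           # past-tense verbs
    (r'.*ly$', 'RB'),            # adverbs
    (r'^(I|my|he|him|his|she|her|we|you|it|they)$', 'PRP'),  # pronouns
    (r'^(on|in|at|by|with|about|into|to)$', 'IN'),           # prepositions
    (r'.*', 'NN'),               # fallback: tag everything else as noun
]

regex_tagger = RegexpTagger(patterns)
print(regex_tagger.tag("the dog walked slowly to a comfortable mat".split()))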
Unigram Tagger
How it Works:
Training:
• The Unigram Tagger is trained on a collection of tagged words (word-tag
pairs).
• e.g. the Brown Corpus for English. In such corpora each word is associated
with its PoS.
[('The', 'DET'), ('Fulton', 'NOUN'), ('County', 'NOUN'), ('Grand', 'ADJ'), ('Jury', 'NOUN'),
('said', 'VERB'), ('Friday', 'NOUN'), ('an', 'DET'), ('investigation', 'NOUN'), ('place',
'NOUN'), ('.', '.')]
• It learns the most frequent tag for each word in the training data.
• The result of training is a two-column table: the first column contains a word and the
second the most frequent PoS tag of that word.
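A minimal sketch of how this two-column table could be built (tagged_words is an assumed list of (word, tag) pairs from a tagged corpus):

from collections import Counter, defaultdict

# Count how often each tag occurs for each word
counts = defaultdict(Counter)
for word, tag in tagged_words:
    counts[word][tag] += 1

# Keep only the most frequent tag per word
lookup_table = {word: tags.most_common(1)[0][0] for word, tags in counts.items()}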
Unigram Tagger
Tagging:
• During tagging, for a given word in a sentence, the tagger looks up the most likely tag
associated with that word from the training data.
• If the word was seen during training, it assigns the most common tag associated with that
word.
• If the word wasn't seen during training, it assigns a default tag (usually NN).
Cons:
• It doesn’t consider the context or POS tags of previous words. Consequently, a word
is always tagged with the same POS, independent of its context.
Implementation steps:
• Import necessary modules.
• Load the brown corpus and divide the data into training and testing data
• Train the UnigramTagger on training data
• Tag a sentence using the tag() method of UnigramTagger.
Unigram Tagger
import nltk
from nltk.corpus import brown
from nltk.tag import UnigramTagger
nltk.download('brown')

# Train on tagged Brown-corpus sentences, then tag a new sentence
unigram_tagger = UnigramTagger(brown.tagged_sents()[:3000])
print(unigram_tagger.tag("The cat sat on the mat".split()))
N-gram Models:
•Unigram Tagger (1-gram): predicts the tag for the current word based on the word alone.
•Bigram Tagger (2-gram): predicts the tag for the current word based on the word and the
tag of the previous word.
•Trigram Tagger (3-gram): predicts the tag for the current word based on the word and the
tags of the previous two words.
N-gram (bigram) tagger: Training
N-gram tagger: Tagging
Bigram tagger: Tagging Example
Training Data:
"The/DT cat/NN sat/VBD on/IN the/DT mat/NN"

Tagging Phase:
P(cat|NN) × P(NN|DT) = 0.5 × 1.0 = 0.5
P(cat|VBD) × P(VBD|DT) = 0 × 0 = 0

The NN tag for "cat" has the highest probability, so the tag NN is assigned to "cat".
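The slide's numbers can be verified with a few lines of counting; a minimal sketch (variable names are illustrative):

from collections import Counter

# Toy training corpus from the slide, as (word, tag) pairs
tagged = [('The', 'DT'), ('cat', 'NN'), ('sat', 'VBD'),
          ('on', 'IN'), ('the', 'DT'), ('mat', 'NN')]

tag_counts = Counter(tag for _, tag in tagged)
emission = Counter((tag, word.lower()) for word, tag in tagged)
transition = Counter((t1, t2) for (_, t1), (_, t2) in zip(tagged, tagged[1:]))

p_cat_given_nn = emission[('NN', 'cat')] / tag_counts['NN']   # 1/2 = 0.5
p_nn_given_dt = transition[('DT', 'NN')] / tag_counts['DT']   # 2/2 = 1.0
print(p_cat_given_nn * p_nn_given_dt)                         # 0.5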
import nltk
from nltk.corpus import brown
from nltk.tag import UnigramTagger, BigramTagger, TrigramTagger
nltk.download('brown')
nltk.download('punkt')

# Split the Brown corpus into training and test sentences
train_data = brown.tagged_sents()[:3000]
test_data = brown.tagged_sents()[3000:]

# Train a backoff chain (trigram -> bigram -> unigram), as suggested by the imports
unigram_tagger = UnigramTagger(train_data)
bigram_tagger = BigramTagger(train_data, backoff=unigram_tagger)
trigram_tagger = TrigramTagger(train_data, backoff=bigram_tagger)

tagged_sentence = trigram_tagger.tag("The dog sat on the mat .".split())
print(tagged_sentence)
# Output: [('The', 'AT'), ('dog', 'NN'), ('sat', 'VBD'), ('on', 'IN'), ('the', 'AT'), ('mat', 'NN'), ('.', '.')]
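The slides do not show the Brill tagger's training step; a minimal sketch using NLTK's brill_trainer module, assuming the unigram tagger above as the initial annotator (max_rules=100 is an arbitrary choice):

from nltk.tag.brill import fntbl37
from nltk.tag.brill_trainer import BrillTaggerTrainer

# Brill tagging starts from an initial tagger and learns transformation
# rules that correct its mistakes on the training data
trainer = BrillTaggerTrainer(initial_tagger=unigram_tagger, templates=fntbl37())
brill_tagger = trainer.train(train_data, max_rules=100)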
# Evaluate the Brill tagger on the held-out test data
accuracy = brill_tagger.evaluate(test_data)
print(f"Accuracy of the Brill Tagger: {accuracy:.2f}")   # e.g. 0.79
Machine learning taggers
• HMM (Hidden Markov Model)
• MEC (Maximum Entropy Classifier)
• CRF (Conditional Random Fields)
Hidden Markov Model (HMM)
• The Hidden Markov Model (HMM) is a probabilistic sequence model used in POS tagging.
• It assigns the most likely sequence of POS tags to a sentence based on observed words.
• The components of HMMs are
States
•Represents the possible POS tags (e.g., noun, verb, adjective, etc.).
Observations
•Represents the words in the sentence
Transition Probability
• Represents the probability of one POS tag following another in a sentence
Emission Probability
• Represents the probability of a word being associated with a particular POS tag
Hidden Markov Model (HMM)
Q: Set of possible tags (hidden states)
Transition probability: P(t_i | t_{i-1}) = Count(t_{i-1}, t_i) / Count(t_{i-1})
Emission probability: P(w_i | t_i) = Count(t_i, w_i) / Count(t_i)
It must be noted that we get all these Count() values from the corpus used for training.
Example
Let us calculate the above two probabilities for the set of sentences
below
Note that Mary Jane, Spot, and Will are all names.
Calculate the emission probabilities.
(The emission and transition probability tables computed from these counts appear on the slides.)
<S>→N→M→N→N→<E> = 3/4 × 1/9 × 3/9 × 1/4 × 1/4 × 2/9 × 1/9 × 4/9 × 4/9 = 0.00000846754
<S>→N→M→V→N→<E> = 3/4 × 1/9 × 3/9 × 1/4 × 3/4 × 1/4 × 1 × 4/9 × 4/9 = 0.00025720164
The second tag sequence has the higher probability, so the HMM selects it.
a: Transition matrix
b: Emission matrix
Code of HMM tagger
import nltk
from nltk.corpus import treebank
from nltk.tag import hmm
nltk.download('treebank')
nltk.download('universal_tagset')

# Train a supervised HMM tagger on Penn Treebank sentences
train_data = treebank.tagged_sents(tagset='universal')[:3000]
hmm_tagger = hmm.HiddenMarkovModelTrainer().train_supervised(train_data)

print(hmm_tagger.tag("The cat sat on the mat".split()))
Conditional Random Fields (CRF)
• Conditional Random Fields (CRF) for Part-of-Speech (POS) tagging is a sequence labeling
technique that assigns a POS tag to each word in a sentence while considering contextual
dependencies.
• It is particularly useful because it captures contextual dependencies between words, unlike
traditional classifiers that treat each word independently.
1. Training Phase
• The training phase involves learning the parameters (weights) of the CRF model from
labeled data.
Step 1: Preparing the Training Data
•The dataset consists of sentences where each word is labeled with its corresponding POS tag.
Example training sentence:
Sentence: ["The", "cat", "sits"]
POS Tags: ["DT", "NN", "VBZ"]
Step 2: Feature Extraction
• For each word in a sentence, extract hand-crafted features such as:
Word identity: w_i = "cat"
Previous word: w_{i-1} = "The"
Next word: w_{i+1} = "sits"
Word shape: "Xx" (for capitalized words)
Prefixes/Suffixes: ca-, -at
Step 4: Learning Parameters
•The goal is to find the best weights w_k that maximize the likelihood of the correct POS sequence.
•We use an optimization algorithm like Gradient Descent or L-BFGS to adjust the weights based on
the training data.
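As a reference point (the notation here is assumed, not taken from the slides), a linear-chain CRF defines the probability of a tag sequence y for a sentence x as

P(y | x) = (1 / Z(x)) · exp( Σ_t Σ_k w_k · f_k(y_{t-1}, y_t, x, t) )

where the f_k are the feature functions from Step 2, the w_k are the weights learned in Step 4, and Z(x) normalizes over all possible tag sequences.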
Conditional Random Fields (CRF) POS tagging
import sklearn_crfsuite

# Extract features for each word
def word_features(sentence, index):
    word = sentence[index][0]
    features = {
        'word': word,
        'is_first': index == 0,
        'is_last': index == len(sentence) - 1,
        'is_title': word.istitle(),
        'is_upper': word.isupper(),
        'is_digit': word.isdigit(),
        'prev_word': '' if index == 0 else sentence[index - 1][0],
        'next_word': '' if index == len(sentence) - 1 else sentence[index + 1][0],
        'prefix-1': word[:1],
        'prefix-2': word[:2],
        'suffix-1': word[-1:],
        'suffix-2': word[-2:],
    }
    return features

# Convert a tagged sentence to a list of feature dicts
def sentence_features(sentence):
    return [word_features(sentence, i) for i in range(len(sentence))]

# Extract the label sequence from a tagged sentence
def sentence_labels(sentence):
    return [label for (word, label) in sentence]

# Sample training dataset
train_data = [
    [('The', 'DET'), ('cat', 'NOUN'), ('sat', 'VERB'),
     ('on', 'PREP'), ('the', 'DET'), ('mat', 'NOUN')],
    [('A', 'DET'), ('dog', 'NOUN'), ('barked', 'VERB')],
]

X_train = [sentence_features(sentence) for sentence in train_data]
y_train = [sentence_labels(sentence) for sentence in train_data]

# Train CRF model
crf = sklearn_crfsuite.CRF(algorithm='lbfgs')
crf.fit(X_train, y_train)

# Tag an unseen sentence (one-element tuples so word_features works unchanged)
test_sentence = [('The',), ('man',), ('running',)]
X_test = [sentence_features(test_sentence)]
y_pred = crf.predict(X_test)
print(y_pred[0])  # Predicted POS tags, e.g. ['DET', 'NOUN', 'VERB']
NER Tagger
• Named Entity Recognition (NER) is a NLP technique to find and classify entities from textual
data into predefined categories called named entities.
Types of Named Entities:
Person: Names of individuals (e.g., Elon Musk, Marie Curie).
Organization: Names of organizations (e.g., Google, United Nations).
Location: Geographical entities (e.g., Paris, Mount Everest).
Date/Time: Temporal expressions (e.g., January 1, 2025, 10:30 AM).
Monetary values: Financial amounts (e.g., $10,000).
Percentages: Percentage figures (e.g., 50%).
Miscellaneous: Other domain-specific categories (e.g., product names, scientific terms).
• A Named Entity Recognition (NER) tagger is a tool used in NLP to identify and classify entities
within a text.
• Named entity recognition is important because it enables organizations to extract valuable
information from unstructured text data.
• For example, an NER system could be used to extract the names of all the companies
mentioned in a set of news articles, along with their stock prices, market capitalization, and
other related information.
NER using the spaCy library
import spacy
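The slide's code is cut off after the import; a minimal sketch of spaCy NER, assuming the en_core_web_sm model has been installed (python -m spacy download en_core_web_sm):

import spacy

# Load a small English pipeline that includes an NER component
nlp = spacy.load("en_core_web_sm")

doc = nlp("Elon Musk founded SpaceX in 2002.")
for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. "Elon Musk PERSON", "2002 DATE"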