
Natural Language Processing

Course code: CSE3015

Module 2
Parts of Speech Tagging

Prepared by
Dr. Venkata Rami Reddy Ch
SCOPE
Syllabus
• Parts of Speech Tagging and Named Entities
• Tagging in NLP,
• Sequential tagger,
• N-gram tagger,
• Regex tagger,
• Brill tagger,
• NER tagger;
• Machine learning taggers: MEC, HMM, CRF
Part-of-Speech (POS) Tagging
• Part-of-speech (POS) tagging is the process in NLP of labeling each word in a text
with its corresponding part of speech.
• Parts of speech include nouns, verbs, adjectives, and other grammatical categories.
• POS tagging helps algorithms understand the grammatical structure and meaning of a
text.
• It is useful for a variety of NLP tasks, such as information extraction,
named entity recognition, and machine translation.
Part-of-Speech (POS) Tags
1. Noun: A noun is the name of a person, place, thing, or idea.
2. Pronoun: A pronoun is a word used in place of a noun.
3. Verb: A verb expresses action or being.
4. Adjective: An adjective modifies or describes a noun or pronoun.
5. Adverb: An adverb modifies or describes a verb, an adjective, or another adverb.
6. Preposition: A preposition is a word placed before a noun or pronoun to form a phrase modifying another word in the sentence.
7. Conjunction: A conjunction joins words, phrases, or clauses.
8. Interjection: An interjection is a word used to express emotion.
9. Determiner or Article: A grammatical marker of definiteness (the) or indefiniteness (a, an).

Let's take an example,
Text: "The cat sat on the mat."
POS tags:
• The: determiner
• cat: noun
• sat: verb
• on: preposition
• the: determiner
• mat: noun
POS Tagging architecture
POS Tagset
• A tagset is the collection of tags from which the tagger selects an appropriate tag
and attaches it to each word.
• Different POS tagsets are used depending on the language and the application.
Penn Treebank POS Tagset
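The full Penn Treebank tagset is easiest to explore interactively. As a small sketch, NLTK bundles the tag definitions and can print them (assuming the 'tagsets' resource is available in your environment):

import nltk

nltk.download('tagsets')

# Print the definition and examples for a single Penn Treebank tag
nltk.help.upenn_tagset('NN')

# Regular expressions also work: list all verb tags (VB, VBD, VBG, ...)
nltk.help.upenn_tagset('VB.*')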
Lexicons
• Lexicons are structured collections of words or phrases that include additional information,
such as part of speech, meanings, synonyms, or domain-specific attributes.
• A lexicon is essentially a dictionary that contains words and their possible tags.
• The WordNet lexicon is a widely used lexical database in NLP that groups words into
nouns, verbs, adjectives, and adverbs.

• For example, the word "run" might have tags such as verb or noun.
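As a quick illustration, the WordNet lexicon can be queried through NLTK, and each synset records a part of speech. A minimal sketch (assuming the 'wordnet' resource is downloaded):

import nltk
from nltk.corpus import wordnet as wn

nltk.download('wordnet')

# Each synset of "run" carries a POS code: 'n' for noun, 'v' for verb
for synset in wn.synsets('run')[:5]:
    print(synset.name(), synset.pos())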
POS Tagger
• A part-of-speech tagger, or POS-tagger, processes a sequence of words, and attaches a
part of speech tag to each word.
• It is a program that carries out POS Tagging
• Taggers utilize various types of data for POS tagging: lexicons, dictionaries, rules, etc.
import nltk

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

text = "The quick brown fox jumps over the lazy dog."
tokens = nltk.word_tokenize(text)

# Perform POS tagging
pos_tags = nltk.pos_tag(tokens)

print("Tokenized Words with POS Tags:")
for word, tag in pos_tags:
    print(f"{word}: {tag}")

• The Averaged Perceptron Tagger is the default tagger used in pos_tag().
• It assigns tags based on features learned from a large annotated corpus, such as the
Penn Treebank.
Approaches to POS Tagging
• Rule-based Approach
– Uses a set of rules to tag input sentences
e.g. RegExp Tagger
• Statistical Approaches (Machine Learning Based)
– Use a training corpus to assign a tag to every token in the given text.
e.g. N-gram tagger, HMM (Hidden Markov Model),
CRF (Conditional Random Field)
• Transformation-Based (Hybrid)
– Rules + machine learning (e.g., an initial unigram tagger)
e.g. Brill Tagger
Regex tagger
• The Regex tagger assigns tags to words based on matching patterns specified using
regular expressions.
• You specify a list of (regular expression, tag) pairs; each word is assigned the tag of
the first pattern that matches it.
• For instance, we might guess that any word ending in "ed" is the past participle of a verb,
and any word ending with "'s" is a possessive noun. We can express these as a list of regular
expressions:

Determiners: \b(the|a|an)\b
Adjectives ending in 'able': \b\w+able\b
Past tense verbs: .*ed$
Adverbs: .*ly$
Pronouns: \b(I|my|he|him|his|she|her|we|you|it|they)\b
Prepositions: \b(on|in|at|by|with|about|into|to)\b

\b: Word boundary to ensure you're matching whole words.


Write a Python script to tag parts of speech in the given sentence. Define patterns for pronouns, conjunctions,
prepositions, determiners, adjectives, verbs, adverbs, and nouns. Then, print the tagged words.
import nltk
from nltk.tag import RegexpTagger

nltk.download('punkt')

# Define patterns for the regular expression tagger
patterns = [
    (r'^\d+$', 'CD'),  # Cardinal numbers
    (r'\b(I|me|my|he|him|his|she|her|we|you|it|they)\b', 'PRP'),  # Pronouns
    (r'\b(on|in|at|by|with|about|into|to)\b', 'IN'),  # Prepositions
    (r'\b(and|or|but|also)\b', 'CC'),  # Conjunctions
    (r'\b(The|the|A|a|An|an)\b', 'DT'),  # Determiners
    (r'\b\w+able\b', 'JJ'),  # Adjectives ending in 'able'
    (r'.*ing$', 'VBG'),  # Gerunds
    (r'.*ed$', 'VBD'),  # Past tense verbs
    (r'.*es$', 'VBZ'),  # 3rd person singular verbs
    (r'.*ly$', 'RB'),  # Adverbs
    (r'.*', 'NN')  # Default: Noun
]

# Create a RegexpTagger using the defined patterns
regexp_tagger = RegexpTagger(patterns)

sentence = "John is running quickly and he catch the 9am train."
words = nltk.word_tokenize(sentence)

# Tag the words using the regular expression tagger
tagged_words = regexp_tagger.tag(words)
print(tagged_words)

Output:
[('John', 'NN'), ('is', 'NN'), ('running', 'VBG'), ('quickly', 'RB'), ('and', 'CC'),
('he', 'PRP'), ('catch', 'NN'), ('the', 'DT'), ('9am', 'NN'), ('train', 'NN'), ('.', 'NN')]
Unigram Tagger
• A Unigram Tagger is a type of POS tagger that assigns tags based on individual words.
• In a unigram model, the tag for a word is determined independently, without
considering the POS tags of the previous words.

How it Works:
Training:
• The Unigram Tagger is trained on a collection of tagged words (word-tag
pairs).
• e.g. the Brown Corpus for English. In such corpora each word is associated
with its PoS.
[('The', 'DET'), ('Fulton', 'NOUN'), ('County', 'NOUN'), ('Grand', 'ADJ'), ('Jury', 'NOUN'),
('said', 'VERB'), ('Friday', 'NOUN'), ('an', 'DET'), ('investigation', 'NOUN'), ('place',
'NOUN'), ('.', '.')]
• It learns the most frequent tag for each word in the training data.
• The result of training is a two-column table: the first column is a word and the
second is the most frequent PoS tag of that word (see the sketch below).
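A minimal sketch of this word-to-most-frequent-tag table, built with NLTK's ConditionalFreqDist over the Brown corpus (the corpus choice here is just for illustration):

import nltk
from nltk.corpus import brown

nltk.download('brown')

# Count, for every word, how often each tag occurs with it
cfd = nltk.ConditionalFreqDist(brown.tagged_words())

# The most frequent tag is what a unigram tagger would assign
print(cfd['the'].max())   # 'AT' (Brown's article/determiner tag)
print(cfd['said'].max())  # most frequent tag for "said"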
Unigram Tagger
Tagging:
• During tagging, for a given word in a sentence, the tagger looks up the most likely tag
associated with that word from the training data.
• If the word was seen during training, it assigns the most common tag associated with that
word.
• If the word wasn't seen during training, it assigns a default tag (in NLTK, None unless a backoff tagger is provided).
Cons:
• It doesn’t consider the context or POS tags of previous words. Consequently, a word
is always tagged with the same POS, independent of its context.

Implementation steps:
• Import necessary modules.
• Load the brown corpus and divide the data into training and testing data
• Train the UnigramTagger on training data
• Tag a sentence using the tag() method of UnigramTagger.
Unigram Tagger
import nltk
from nltk.corpus import brown
from nltk.tag import UnigramTagger

# Download the required NLTK resources
nltk.download('brown')
nltk.download('punkt')

# Load the Brown dataset for training and testing
train_data = brown.tagged_sents()[:3000]  # Use first 3000 sentences for training
test_data = brown.tagged_sents()[3000:]   # Use remaining sentences for testing

# Create a UnigramTagger (1-gram model)
unigram_tagger = UnigramTagger(train_data)

# Tag a sentence using the trained UnigramTagger
sentence = "The dog sat on the mat".split()
tagged_sentence = unigram_tagger.tag(sentence)
print(tagged_sentence)

# Evaluate on the held-out data
print(unigram_tagger.evaluate(test_data))

Output:
[('The', 'AT'), ('dog', 'NN'), ('sat', 'VBD'), ('on', 'IN'), ('the', 'AT'), ('mat', 'NN')]
0.7641347348147013
N-gram tagger
• An N-gram tagger assigns the PoS tag of the current word by taking into account the
current word itself and the PoS tags of the N-1 preceding words.

N-gram Models:
•Unigram Tagger (1-gram): predicts the tag for the current word based on the word alone.
•Bigram Tagger (2-gram): predicts the tag for the current word based on the word and the
tag of the previous word.
•Trigram Tagger (3-gram): predicts the tag for the current word based on the word and the
tags of the previous two words.
N-gram (bigram) tagger training
N-gram tagger Tagging
bi-gram tagger Tagging Example
Training Data:
"The/DT cat/NN sat/VBD on/IN the/DT mat/NN"

Tagging Phase:
Ex: "the cat is running"

For word "cat":
P(cat|NN) × P(NN|DT) = 0.5 × 1.0 = 0.5
P(cat|VBD) × P(VBD|DT) = 0 × 0 = 0

The NN tag has the highest probability for "cat", so assign tag NN to "cat".
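The same calculation can be reproduced in a few lines. A minimal sketch over the one-sentence training corpus above (plain relative-frequency counts, no smoothing):

from collections import Counter

# Toy training data from the example above
tagged = [('The', 'DT'), ('cat', 'NN'), ('sat', 'VBD'),
          ('on', 'IN'), ('the', 'DT'), ('mat', 'NN')]

tag_counts = Counter(tag for _, tag in tagged)
emission = Counter((tag, word.lower()) for word, tag in tagged)
transition = Counter((t1, t2) for (_, t1), (_, t2) in zip(tagged, tagged[1:]))

# P(cat|NN) * P(NN|DT), as in the hand calculation
p_emit = emission[('NN', 'cat')] / tag_counts['NN']    # 1/2 = 0.5
p_trans = transition[('DT', 'NN')] / tag_counts['DT']  # 2/2 = 1.0
print(p_emit * p_trans)                                # 0.5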
import nltk
from nltk.corpus import brown
from nltk.tag import UnigramTagger, BigramTagger, TrigramTagger

nltk.download('brown')
nltk.download('punkt')

train_data = brown.tagged_sents()[:3000]  # First 3000 sentences for training
test_data = brown.tagged_sents()[3000:]   # Remaining sentences for testing

# Create and train a Unigram Tagger
unigram_tagger = UnigramTagger(train_data)

# Create and train a Bigram Tagger, backing off to the unigram tagger
bigram_tagger = BigramTagger(train_data, backoff=unigram_tagger)

# Create and train a Trigram Tagger, backing off to the bigram tagger
trigram_tagger = TrigramTagger(train_data, backoff=bigram_tagger)

# Test the trigram tagger on a new sentence
sentence = "The dog sat on the mat".split()
tagged_sentence = trigram_tagger.tag(sentence)
print(tagged_sentence)

# Print performances
print(unigram_tagger.evaluate(test_data))
print(bigram_tagger.evaluate(test_data))
print(trigram_tagger.evaluate(test_data))

Output:
[('The', 'AT'), ('dog', 'NN'), ('sat', 'VBD'), ('on', 'IN'), ('the', 'AT'), ('mat', 'NN')]
0.7641347348147013
0.7734657846307614
0.772435283624883
Examples
<S> I like to play with it </S>
<S> You like to play </S>

Total count of unigrams (without <S> and </S> tags): 10

Total count of bigrams (including <S> and </S> tags): 12

Unigram probability: P(like) = count(like) / total unigrams = 2/10 = 0.2

Bigram probability: P(you like) = count(you like) / total bigrams = 1/12 ≈ 0.083

Conditional probability: P(like | it) = count(it like) / count(it) = 0/1 = 0

Trigram probability: P(you like it) = 0, since the sequence "you like it" never occurs in the corpus.
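These counts are mechanical, so a short sketch can verify them (sentence markers written as plain tokens; this mirrors the counting convention above rather than any particular library):

from collections import Counter

sents = [['<S>', 'i', 'like', 'to', 'play', 'with', 'it', '</S>'],
         ['<S>', 'you', 'like', 'to', 'play', '</S>']]

# Unigram counts exclude the sentence markers
unigrams = Counter(w for s in sents for w in s if w not in ('<S>', '</S>'))
print(sum(unigrams.values()))                             # 10 unigrams in total

# Bigram counts include the sentence markers
bigrams = Counter(b for s in sents for b in zip(s, s[1:]))
print(sum(bigrams.values()))                              # 12 bigrams in total

print(unigrams['like'] / sum(unigrams.values()))          # P(like) = 2/10
print(bigrams[('you', 'like')] / sum(bigrams.values()))   # P(you like) = 1/12
print(bigrams[('it', 'like')] / unigrams['it'])           # P(like | it) = 0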


Pros of N-gram Tagger:
• Captures Local Context
• Improves Predictions
• Easy to Implement
• Flexibility in N-gram Size

Cons of N-gram Tagger:
Requires Large Amount of Training Data:
•N-gram models, especially for higher values of N, require large amounts of labeled data to
accurately estimate the probabilities of sequences.
Memory Intensive:
•Higher-order N-grams (trigrams, 4-grams, etc.) can become memory-intensive because they
require storing a large number of tag combinations
Lack of Flexibility:
•N-gram taggers are not able to capture global syntactic dependencies or hierarchical
structures in sentences.
Brill tagger
• The Brill Tagger is a transformation-based tagger introduced by Eric Brill.
• It uses an initial tagger (e.g., a unigram tagger) and then applies transformation rules
iteratively to correct errors made by the initial tagger.
• The rules are score-based: each rule's score equals the number of errors it corrects
minus the number of new errors it introduces.
Training the Brill Tagger
a. Initial Tagging
• Apply an initial tagger, which can be a simple default tagger or a statistical tagger
(e.g., unigram or bigram tagger), to assign tags initially.
b. Learning Transformation Rules
1. Define Templates:
   • Specify templates for rules, such as:
     – Change the tag of a word if the previous word has a specific tag.
     – Change the tag of a word if the next word has a specific tag.
   Ex: Change NN to VB if the previous word is a noun
2. Error Analysis:
   • Compare the initial tagging results with the ground truth from the training data to
   identify tagging errors.
3. Rule Generation, Evaluation & Selection:
   • Generate potential rules from the templates to correct the errors.
   • Score each rule by the number of errors it corrects minus the number of new errors it produces.
   • Select the best rule based on its score and add it to the list of transformation rules.
4. Iterative Process:
   • Apply the selected rule to the training data and repeat the process until a stopping
   condition is met (e.g., no significant improvement, or the maximum number of rules is reached).
Testing the Brill Tagger
a. Apply Initial Tagging
• Use the same initial tagger as in the training phase to assign initial tags to the testing set.
b. Apply Learned Rules
• Apply the learned transformation rules, in the same order as learned, to refine the initial
tags.
c. Evaluate Performance
• Compare the resulting tags with the ground truth annotations in the test set.
import nltk
from nltk.tag import brill, brill_trainer, UnigramTagger
from nltk.corpus import brown

nltk.download('brown')
nltk.download('punkt')

train_data = brown.tagged_sents()[:3000]
test_data = brown.tagged_sents()[3000:]

# Step 2: Define the Initial Tagger
initial_tagger = UnigramTagger(train_data)

# Step 3: Define Brill Tagger Templates
templates = brill.fntbl37()

# Step 4: Train the Brill Tagger
trainer = brill_trainer.BrillTaggerTrainer(initial_tagger, templates)
brill_tagger = trainer.train(train_data)

sentence = "The dog sat on the mat."
tokenized_sentence = nltk.word_tokenize(sentence)
tagged_sentence = brill_tagger.tag(tokenized_sentence)
print(tagged_sentence)

# Evaluate Brill Tagger
accuracy = brill_tagger.evaluate(test_data)
print(f"Accuracy of the Brill Tagger: {accuracy:.2f}")

Output:
[('The', 'AT'), ('dog', 'NN'), ('sat', 'VBD'), ('on', 'IN'), ('the', 'AT'), ('mat', 'NN'), ('.', '.')]
Accuracy of the Brill Tagger: 0.79
Machine learning taggers

• HMM
• MEC
• CRF
Hidden Markov Model (HMM)
• The Hidden Markov Model (HMM) is a probabilistic sequence model used in POS tagging.
• It assigns the most likely sequence of POS tags to a sentence based on observed words.
• The components of HMMs are
States
•Represents the possible POS tags (e.g., noun, verb, adjective, etc.).
Observations
•Represents the words in the sentence
Transition Probability
• Represents the probability of one POS tag following another in a sentence
Emission Probability
• Represents the probability of a word being associated with a particular POS tag
Hidden Markov Model (HMM)
Q: Set of possible tags (hidden states)

A: The A matrix contains the tag transition probabilities P(t_i | t_{i-1}), which
represent the probability of a tag occurring given the previous tag.
Example: A[Verb][Noun] = P(Noun | Verb) = Count(Verb followed by Noun) / Count(Verb)

O: Sequence of observations (the words in the sentence)

B: The B matrix contains the emission probabilities P(w_i | t_i), which represent the
probability, given a tag (say Verb), that it will be associated with a given word
(say Playing). The emission probability B[Verb][Playing] is calculated as:

P(Playing | Verb) = Count(Playing & Verb) / Count(Verb)

It must be noted that we get all these Count() values from the corpus used for training.
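A minimal sketch of estimating both matrices from a tagged corpus with NLTK's ConditionalFreqDist (the corpus, tagset, and slice here are chosen only for illustration):

import nltk
from nltk.corpus import brown

nltk.download('brown')
nltk.download('universal_tagset')

tagged = brown.tagged_sents(tagset='universal')[:3000]

# B matrix: emission counts, tag -> word
emission = nltk.ConditionalFreqDist(
    (tag, word.lower()) for sent in tagged for word, tag in sent)

# A matrix: transition counts, previous tag -> next tag
transition = nltk.ConditionalFreqDist(
    (t1, t2) for sent in tagged
    for (_, t1), (_, t2) in nltk.bigrams(sent))

print(emission['VERB'].freq('playing'))  # P(playing | VERB)
print(transition['VERB'].freq('NOUN'))   # P(NOUN | VERB)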
Example
Let us calculate the above two probabilities for the set of sentences
below

Mary Jane can see Will


Spot will see Mary
Will Jane spot Mary?
Mary will pat Spot

Note that Mary Jane, Spot, and Will are all names.
Calculate the emission probabilities
(counting table and resulting emission probabilities were shown as a figure)

Calculate the transition probabilities
(counting table and resulting transition probabilities were shown as a figure)
POS tagging for new sentence

Sentence: Will can spot Mary

All possible combinations with three tags (N, M, and V)


POS tagging for new sentence
The next step is to delete all the vertices and edges with probability zero; the vertices
which do not lead to the endpoint are also removed.
POS tagging for new sentence

<S>→N→M→N→N→<E> = 3/4 × 1/9 × 3/9 × 1/4 × 1/4 × 2/9 × 1/9 × 4/9 × 4/9 = 0.00000846754
<S>→N→M→V→N→<E> = 3/4 × 1/9 × 3/9 × 1/4 × 3/4 × 1/4 × 1 × 4/9 × 4/9 = 0.00025720164

Clearly, the probability of the second sequence is much higher, so the HMM tags each
word in the sentence according to this sequence:
• Will as a Noun
• Can as a Modal
• Spot as a Verb
• Mary as a Noun
Working of HMM tagger
1. Training Phase
• The transition (A) and emission (B) probabilities are estimated by counting over a tagged
training corpus, as described above.
2. Testing (POS Tagging Phase)
• Given a new sentence, the trained HMM is used to predict the sequence of POS tags
using the Viterbi algorithm:

v_t(j) = max_i [ v_{t-1}(i) · a_{ij} · b_j(o_t) ]

where:
• v_{t-1}(i): Viterbi path probability from the previous time step
• a_{ij}: transition probability from tag i to tag j (transition matrix A)
• b_j(o_t): emission probability of the observed word o_t given tag j (emission matrix B)
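A minimal, self-contained Viterbi sketch implementing this recursion (the start_p, trans_p, and emit_p dictionaries are hypothetical inputs for illustration, not NLTK's internal representation):

def viterbi(obs, states, start_p, trans_p, emit_p):
    # v_1(j) = start_p(j) * b_j(o_1)
    V = [{s: start_p[s] * emit_p[s].get(obs[0], 0.0) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for j in states:
            # v_t(j) = max_i v_{t-1}(i) * a_ij * b_j(o_t)
            best_i = max(states, key=lambda i: V[t - 1][i] * trans_p[i][j])
            V[t][j] = V[t - 1][best_i] * trans_p[best_i][j] * emit_p[j].get(obs[t], 0.0)
            back[t][j] = best_i
    # Trace the best path backwards from the most probable final state
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return path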
Code of HMM tagger
import nltk
from nltk.corpus import treebank
from nltk.tag import hmm

nltk.download('treebank')
nltk.download('universal_tagset')

# Prepare the data
train_sents = treebank.tagged_sents(tagset='universal')[:3000]
test_sents = treebank.tagged_sents(tagset='universal')[3000:]

# Train the HMM tagger
trainer = hmm.HiddenMarkovModelTrainer()
hmm_tagger = trainer.train(train_sents)

sentence = "This is a test sentence".split()

# Tag the sentence
tagged_sentence = hmm_tagger.tag(sentence)
print(tagged_sentence)
Disadvantages of HMM tagger
Inability to Handle Unknown Words (OOV - Out of Vocabulary Words)
• If a word is not present in the training corpus, the model struggles to assign a correct POS
tag.
Limited Context Awareness
• HMM can only use previous tags for prediction.
• Cannot consider future words while tagging the current word

Data Sparsity Problem


• If a word-tag combination is rare or absent, the model performs poorly.
Computational Inefficiency for Large Tagsets
Conditional Random Fields (CRF) POS tagging

• Conditional Random Fields (CRF) for Part-of-Speech (POS) tagging is a sequence labeling
technique that assigns a POS tag to each word in a sentence while considering contextual
dependencies.
• It is particularly useful because it captures contextual dependencies between words, unlike
traditional classifiers that treat each word independently.

Why Use CRF for POS Tagging?


•Sequence Dependencies: Unlike Hidden Markov Models (HMMs), CRF does not assume
independence between features.
•Feature Flexibility: Can incorporate various linguistic features such as word suffixes,
capitalization, and surrounding words.
•Handles Ambiguity Well: Unlike simple classifiers (e.g., SVM, Decision Trees), CRF considers
the entire sentence while assigning POS tags.
Conditional Random Fields (CRF) POS tagging

1. Training Phase
• The training phase involves learning the parameters (weights) of the CRF model from
labeled data.
Step 1: Preparing the Training Data
•The dataset consists of sentences where each word is labeled with its corresponding POS tag.
Example training sentence:
Sentence: ["The", "cat", "sits"]
POS Tags: ["DT", "NN", "VBZ"]
Step 2: Feature Extraction
• For each word in a sentence, extract hand-crafted features such as:
  Word identity: w_i = "cat"
  Previous word: w_{i-1} = "The"
  Next word: w_{i+1} = "sits"
  Word shape: "Xx" (for capitalized words)
  Prefixes/Suffixes: "ca-", "-at"
Step 4: Learning Parameters
• The goal is to find the best weights w_k that maximize the likelihood of the correct POS tag sequence.
• An optimization algorithm such as Gradient Descent or L-BFGS adjusts the weights based on
the training data.
Conditional Random Fields (CRF) POS tagging
import sklearn_crfsuite

# Extract features for each word
def word_features(sentence, index):
    word = sentence[index][0]
    features = {
        'word': word,
        'is_first': index == 0,
        'is_last': index == len(sentence) - 1,
        'is_title': word.istitle(),
        'is_upper': word.isupper(),
        'is_digit': word.isdigit(),
        'prev_word': '' if index == 0 else sentence[index - 1][0],
        'next_word': '' if index == len(sentence) - 1 else sentence[index + 1][0],
        'prefix-1': word[:1],
        'prefix-2': word[:2],
        'suffix-1': word[-1:],
        'suffix-2': word[-2:],
    }
    return features

# Convert a tagged sentence to a list of feature dictionaries
def sentence_features(sentence):
    return [word_features(sentence, i) for i in range(len(sentence))]

# Extract the label (POS tag) sequence from a tagged sentence
def sentence_labels(sentence):
    return [tag for (word, tag) in sentence]

# Sample training dataset
train_data = [
    [('The', 'DET'), ('cat', 'NOUN'), ('sat', 'VERB'),
     ('on', 'PREP'), ('the', 'DET'), ('mat', 'NOUN')],
    [('A', 'DET'), ('dog', 'NOUN'), ('barked', 'VERB')],
]

X_train = [sentence_features(sentence) for sentence in train_data]
y_train = [sentence_labels(sentence) for sentence in train_data]

# Train CRF model
crf = sklearn_crfsuite.CRF(algorithm='lbfgs')
crf.fit(X_train, y_train)

# Tag a new sentence (dummy tags; only the words are used for features)
test_sentence = [('The', ''), ('man', ''), ('running', '')]
X_test = [sentence_features(test_sentence)]
y_pred = crf.predict(X_test)
print(y_pred[0])  # Predicted POS tags

Output:
['DET', 'NOUN', 'VERB']
NER Tagger
• Named Entity Recognition (NER) is a NLP technique to find and classify entities from textual
data into predefined categories called named entities.
Types of Named Entities:
Person: Names of individuals (e.g., Elon Musk, Marie Curie).
Organization: Names of organizations (e.g., Google, United Nations).
Location: Geographical entities (e.g., Paris, Mount Everest).
Date/Time: Temporal expressions (e.g., January 1, 2025, 10:30 AM).
Monetary values: Financial amounts (e.g., $10,000).
Percentages: Percentage figures (e.g., 50%).
Miscellaneous: Other domain-specific categories (e.g., product names, scientific terms).
• A Named Entity Recognition (NER) tagger is a tool used in NLP to identify and classify entities
within a text.
• Named entity recognition is important because it enables organizations to extract valuable
information from unstructured text data.
• For example, an NER system could be used to extract the names of all the companies
mentioned in a set of news articles, along with their stock prices, market capitalization, and
other reported figures.
NER using Spacy library
import spacy

# Load the pre-trained spaCy model
nlp = spacy.load("en_core_web_sm")

# Define the text
text = """
Elon Musk, the CEO of SpaceX, announced that the company will launch a new mission to
Mars in 2025. The announcement was made in Paris on January 20, 2025, and the mission
will cost around $10,000,000. The United Nations has also shown interest in the project,
especially due to its potential to address climate change. Approximately 50% of the
project's funding will come from private investors.
"""

# Process the text
doc = nlp(text)

# Print entities
print(doc.ents)

# Iterate over the entities detected in the text
for ent in doc.ents:
    print(f"{ent.text}: {ent.label_}")

Output:
(Elon Musk, SpaceX, Mars, 2025, Paris, January 20, 2025, around $10,000,000, The United
Nations, Approximately 50%)
Elon Musk: PERSON
SpaceX: GPE
Mars: LOC
2025: DATE
Paris: GPE
January 20, 2025: DATE
around $10,000,000: MONEY
The United Nations: ORG
Approximately 50%: PERCENT
Applications of NER
Information Extraction:
• Extracting structured information from unstructured text (e.g., extracting dates, names,
and locations from news articles).
Search Engine Optimization:
• Enhancing search relevance by identifying key entities in user queries.
Document Categorization:
• Automatically categorizing documents based on detected entities.
Question Answering Systems:
• Understanding user queries and identifying key entities.
Customer Support:
• Recognizing entities like product names or customer details in support tickets.
