
UNIT 5

Part B
LEXICAL RESOURCES
Contents
Lexical Resources:
• Porter Stemmer
• Lemmatizer
• Penn Treebank
• Brill’s Tagger
• WordNet
• PropBank
• FrameNet
• Brown Corpus
• British National Corpus (BNC)
LEXICAL RESOURCES

• In NLP, a lexical resource refers to a structured collection of words, phrases, and their associated information, which is used to support language understanding, processing, and generation tasks.
• These resources contain information such as word meanings, synonyms, antonyms, part-of-speech (POS) tags, morphological variations, and semantic relationships.
Porter Stemmer
• The Porter Stemmer is one of the most widely used stemming algorithms in Natural Language Processing.
• Stemming is a text preprocessing technique used to reduce words to their root or base form, known as the "stem."
• The Porter Stemmer uses a set of heuristic rules to remove suffixes from words.
• It was developed by Martin Porter in 1980 and is based on the idea that certain suffixes can be stripped off systematically.
• It is a rule-based approach; not always perfect, but effective for many applications.
Ex:
1) "running" → "run"
2) "better" → "better" (no change, as it doesn't follow the rules for suffix removal)
3) "happiness" → "happi"
Advantages:
1) Simple and fast to implement.
2) Widely used due to its effectiveness and availability in many NLP libraries (e.g., NLTK in Python).
3) Good for reducing dimensionality in text data.
4) Helps in normalizing text data, making it easier to analyze and process.
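A minimal sketch of the stemmer in Python using NLTK (assuming NLTK is installed):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["running", "better", "happiness"]:
    # stem() applies the Porter suffix-stripping rules
    print(word, "->", stemmer.stem(word))

Output:
running -> run
better -> better
happiness -> happi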
Lemmatizer
• A lemmatizer is a tool or algorithm in NLP that reduces words to their base or dictionary form, known as the lemma.
• Unlike stemming (e.g., the Porter Stemmer), which simply chops off suffixes based on rules and may produce non-valid words, lemmatization considers the context and part of speech of a word to ensure the output is a valid word.
• The goal is to return the morphological root of a word, which is linguistically correct.
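A minimal sketch using NLTK's WordNetLemmatizer (assuming the WordNet data has been downloaded with nltk.download('wordnet')):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# The pos argument supplies the part of speech: 'v' = verb, 'a' = adjective, 'n' = noun (the default)
print(lemmatizer.lemmatize("running", pos="v"))
print(lemmatizer.lemmatize("better", pos="a"))
print(lemmatizer.lemmatize("mice"))

Output:
run
good
mouse

Note how "better" is mapped to its dictionary form "good", which a rule-based stemmer cannot do.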
The Penn Treebank (PTB)
1) The Penn Treebank (PTB) is a widely used resource in natural
language processing (NLP) and computational linguistics.
2) It is a corpus of text that has been annotated with syntactic
structures.
3) It is used for training and evaluating NLP models, particularly those
involved in tasks like part-of-speech (POS) tagging, parsing, and
grammar induction.
4) Developed at the University of Pennsylvania in the early 1990s.
5) Contains a large collection of text from various sources, such as the
Wall Street Journal, Brown Corpus, and other domains.
6) The most popular version, PTB-3, includes about 4.5 million words of
American English text.
7) The corpus is annotated with detailed syntactic structures, including phrase structures and dependency trees.
8) Each sentence is tagged with part-of-speech labels (e.g., noun, verb,
adjective) and parsed into a tree structure that represents its
grammatical structure.
9) Uses a specific set of grammatical tags and phrase labels, known as the Penn Treebank Tag Set: word-level tags such as NN (noun, singular) and VB (verb, base form), and phrase labels such as NP (noun phrase) and PP (prepositional phrase).
10) The data is typically represented in bracketed notation or as
constituency trees, which show how words in a sentence are grouped
and related hierarchically.
11) For example, a simple sentence like "The cat sleeps" might be
represented as:
(S (NP (DT The) (NN cat)) (VP (VBZ sleeps)))
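NLTK ships a roughly 10% sample of the PTB Wall Street Journal section, which can be used to inspect the tags and trees described above (assuming nltk.download('treebank') has been run):

from nltk.corpus import treebank

# POS-tagged tokens from the corpus sample
print(treebank.tagged_words()[:5])
# Constituency tree of the first sentence, printed in bracketed notation
print(treebank.parsed_sents()[0])

Output (abridged):
[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS')]
(S (NP-SBJ (NP (NNP Pierre) (NNP Vinken)) ...) (VP (MD will) (VP (VB join) ...)) (. .))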
Brill’s Tagger
• The Brill Tagger is a rule-based, error-driven, transformation-based
part-of-speech (POS) tagging method, invented by Eric Brill (1993).
• It is a supervised learning algorithm that iteratively corrects errors in
initial POS tagging using predefined transformation rules.
How the Brill Tagger Works
1. Initialization Phase:
• For known words: assigns the most frequent POS tag from a lexicon.
• For unknown words: assigns a default tag (e.g., noun) based on linguistic assumptions.
2. Rule-Based Corrections:
• Rules iteratively correct errors based on context (e.g., the previous/following words).
• Example rule: IN → NN if the previous tag is DT. This changes "while" from IN (preposition) to NN (noun) when preceded by a determiner (e.g., "a while").
3. Iterative Transformation Process:
• The tagger applies correction rules repeatedly until no more improvements can be made.
• The rules can be learned from a pre-tagged corpus using machine learning.
Implementation of Brill Tagger in Python using NLTK
Output:
Tagged Sentence (Using Brill Tagger): [('The', 'DT'), ('dog', None), ('barked', None), ('at', 'IN'), ('the', 'DT'), ('cat', None)]
Brill Tagger Accuracy: 87.03 %
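The exact code and data split behind the output above are not shown; a minimal sketch of training a Brill tagger with NLTK (the corpus split and rule count here are assumptions) might look like this:

from nltk.corpus import treebank
from nltk.tag import UnigramTagger
from nltk.tag.brill import brill24
from nltk.tag.brill_trainer import BrillTaggerTrainer

# Assumed split of the PTB sample into training and test sentences
train_sents = treebank.tagged_sents()[:3000]
test_sents = treebank.tagged_sents()[3000:]

# Initialization phase: a unigram tagger assigns each known word its most frequent tag
baseline = UnigramTagger(train_sents)

# Learn transformation rules that correct the baseline tagger's errors
trainer = BrillTaggerTrainer(baseline, brill24(), trace=0)
brill_tagger = trainer.train(train_sents, max_rules=50)

print(brill_tagger.tag("The dog barked at the cat".split()))
# Use brill_tagger.evaluate(test_sents) on older NLTK versions
print("Brill Tagger Accuracy:", round(brill_tagger.accuracy(test_sents) * 100, 2), "%")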

Why Use the Brill Tagger?
• More accurate than simple unigram/bigram taggers.
• Interpretable rules for correcting errors.
• Useful for small NLP datasets where deep learning is unnecessary.
WordNet
• WordNet is a large lexical database of English words, developed at
Princeton University.
• It is widely used in Natural Language Processing (NLP) for understanding
word meanings and relationships.
Applications of WordNet in NLP
✅ Word Sense Disambiguation (WSD) – helps determine the correct meaning of a word in context.
✅ Text Similarity & Semantic Analysis – measures similarity between words.
✅ Chatbots & AI Assistants – enhances understanding of user queries.
✅ Search Engines – expands search terms using synonyms.
Features of WordNet
1. Synsets (Synonym Sets)
Groups of words with similar meanings.
Ex: {"happy", "joyful", "cheerful"}
2. Hypernyms & Hyponyms (Hierarchy)
Hypernym (more general term): "animal" → "dog"
Hyponym (more specific term): "dog" → "bulldog"
3. Antonyms (Opposites)
Ex: good × bad
4. Meronyms & Holonyms (Part-Whole Relationship)
Meronym (part of a whole): e.g., a wheel as a part of a car
Holonym (whole of a part): e.g., a car has a wheel
5. Sense Definitions & Examples
Ex: "bank" – a financial institution or a river bank
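A minimal sketch of these features with NLTK (assuming nltk.download('wordnet') has been run):

from nltk.corpus import wordnet as wn

# Synsets: groups of words with similar meanings
print(wn.synsets("happy"))

dog = wn.synset("dog.n.01")
print(dog.definition())    # sense definition
print(dog.hypernyms())     # more general terms (hypernyms)
print(dog.hyponyms()[:3])  # more specific terms (hyponyms)

# Antonyms are stored on lemmas rather than on synsets
good = wn.synset("good.a.01").lemmas()[0]
print(good.antonyms())

Output (abridged):
[Synset('happy.a.01'), ...]
a member of the genus Canis ...
[Synset('canine.n.02'), Synset('domestic_animal.n.01')]
...
[Lemma('bad.a.01.bad')]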
PropBank (Proposition Bank)
• PropBank (Proposition Bank) is a lexical resource that provides semantic role labeling (SRL) annotations for verbs in sentences.
• It extends the Penn Treebank (PTB) by adding annotations for predicate-argument structures, making it useful for semantic analysis and NLP tasks.
Features of PropBank
1. Predicate-Argument Structure
• Each verb is annotated with rolesets defining its possible arguments.
• Example: "give" has roles for who gives, what is given, and to whom.
2. Numbered Arguments (Arg0, Arg1, etc.)
• Arg0 → Agent/Doer
• Arg1 → Patient/Theme (Object)
• Arg2 → Indirect Object (Recipient)
• Arg3, Arg4, ... → Additional roles
PropBank Example
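For the sentence "John gave a book to Mary", PropBank's roleset give.01 labels the arguments as: [Arg0 John] gave [Arg1 a book] [Arg2 to Mary]. A minimal sketch of querying this roleset with NLTK (assuming nltk.download('propbank') has been run):

from nltk.corpus import propbank

# Look up the roleset for the first sense of the verb "give"
roleset = propbank.roleset("give.01")
for role in roleset.findall("roles/role"):
    print(role.attrib["n"], role.attrib["descr"])

Output:
0 giver
1 thing given
2 entity given to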
FrameNet
• FrameNet is a lexical database for semantic roles and frame semantics
in natural language, developed by the International Computer Science
Institute (ICSI).
• It is designed to capture semantic structures of language and how
different words evoke specific conceptual frames.
• In FrameNet, a frame represents a conceptual structure or a mental
model that helps us understand the world. For example, a buying
frame includes the buyer, seller, product, and money as participants in
the action of buying.
Key Features of FrameNet
1. Frames:
Frames represent conceptual structures or scenarios.
Example: A "Buy" frame includes roles like Buyer, Seller, Product, Money.

2. Frame Elements:
Frame elements (FE) are the core roles or participants in a frame.
Example: In the Buy frame, Buyer (agent), Seller (agent), and Product (theme) are
frame elements.

3. Lexical Units (LU):
Lexical units are words or phrases that evoke a specific frame.
Example: The verb "buy" evokes the Buy frame.

4. Frame-to-Frame Relations:
Frames can be related to one another (e.g., CAUSE, RESULT).
Example: "Buy" and "Sell" are related as opposites or counterparts in many contexts.
FrameNet Concept
Let's take the "Buying" frame:
• Frame Elements: Buyer, Seller, Product, Money
• Lexical Units: buy, purchase, sell
• Frame Relation: Opposite frame "Sell"

In the sentence "John bought a book from Mary," the elements would be:
• Buyer: John
• Seller: Mary
• Product: book
• Money: (if mentioned, e.g., "for $10")
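In the actual FrameNet data, the buying frame is named "Commerce_buy". A minimal sketch of looking it up with NLTK (assuming nltk.download('framenet_v17') has been run):

from nltk.corpus import framenet as fn

frame = fn.frame("Commerce_buy")
print(frame.name)                # the frame's name
print("Buyer" in frame.FE)       # Buyer is one of its frame elements
print("buy.v" in frame.lexUnit)  # the verb "buy" is one of its lexical units

Output:
Commerce_buy
True
True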
Brown Corpus
• The Brown Corpus is one of the first and most well-known corpora in
Natural Language Processing (NLP) and computational linguistics.
• It was created in 1961 at Brown University and has played a crucial role
in the development of language modeling and syntactic analysis.
Features of the Brown Corpus
1. Text Classification
• The Brown Corpus contains texts from a variety of genres and domains.
• It is tagged with part-of-speech (POS) labels, making it an excellent
resource for POS tagging and syntactic parsing.
2. Size and Composition
• 1 million words of American English text.
• The corpus is divided into 15 categories, including fiction, news, academic
writing, and more.
Categories include: Press (News), Fiction (Novels), Science Fiction, Poetry, Religion, Hobbies, etc.
3. POS Tagging
• The corpus is annotated with POS tags, which can be used for training POS
taggers and evaluating models.
• Tag set: Uses a relatively simple set of lexical categories (nouns, verbs,
adjectives, etc.).
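A minimal sketch of accessing the corpus with NLTK (assuming nltk.download('brown') has been run):

from nltk.corpus import brown

print(brown.categories())                         # the 15 genre categories
print(brown.words(categories="news")[:7])         # raw tokens from the news category
print(brown.tagged_words(categories="news")[:3])  # (word, POS tag) pairs

Output (abridged):
['adventure', 'belles_lettres', 'editorial', 'fiction', ..., 'science_fiction']
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday']
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL')]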
British National Corpus (BNC)
• The British National Corpus (BNC) is a large-scale, balanced collection of
written and spoken British English, widely used in computational
linguistics and natural language processing (NLP) tasks.
• It contains diverse text samples across different genres and domains,
representing the language used in everyday life.
Key Features of the British National Corpus
1. Size and Composition
• The BNC contains 100 million words of British English, collected from both
written and spoken texts.
• Genres: It covers various genres, including literature, academic articles,
newspapers, fiction, conversations, and broadcasts.
2. Written and Spoken Texts
• The corpus is divided into two main parts:
• Written texts (90% of the corpus): Includes books, newspapers, journals, and
more.
• Spoken texts (10% of the corpus): Covers transcriptions of conversations,
radio programs, and interviews.
3. POS Tagging
• The BNC is annotated with part-of-speech (POS) tags, similar to the
Penn Treebank and Brown Corpus. It allows for tasks like POS tagging,
syntax parsing, and semantic analysis.
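The BNC data itself must be obtained separately (it is not bundled with NLTK), but NLTK provides a reader for its XML format. A minimal sketch, where the local path and file pattern are assumptions:

from nltk.corpus.reader.bnc import BNCCorpusReader

# Hypothetical path to a local copy of the BNC XML texts
bnc = BNCCorpusReader(root="corpora/bnc/Texts", fileids=r"[A-K]/\w*/\w*\.xml")

print(bnc.words()[:10])               # raw tokens
print(bnc.tagged_words(c5=True)[:5])  # tokens with their fine-grained C5 POS tags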
