
UNIT 5

Part B
LEXICAL RESOURCES
Contents
Lexical Resources:
• Porter Stemmer
• Lemmatizer
• Penn Treebank
• Brill’s Tagger
• WordNet
• PropBank
• FrameNet
• Brown Corpus
• British National Corpus (BNC)
LEXICAL RESOURCES

• In NLP, a lexical resource refers to a structured collection of words, phrases, and their associated information, which is used to support language understanding, processing, and generation tasks.
• These resources contain information such as word meanings, synonyms, antonyms, part-of-speech (POS) tags, morphological variations, and semantic relationships.
Porter Stemmer
• The Porter Stemmer is one of the most widely used stemming algorithms in Natural Language Processing.
• Stemming is a text preprocessing technique used to reduce words to their root or base form, known as the "stem."
• The Porter Stemmer uses a set of heuristic rules to remove suffixes from words.
• It was developed by Martin Porter in 1980 and is based on the idea that certain suffixes can be stripped off systematically.
• It is a rule-based approach; not always perfect, but effective for many applications.
Ex:
1) "running" → "run"
2) "better" → "better" (no change, as it doesn't follow the rules for suffix removal)
3) "happiness" → "happi"
Advantages:
1) Simple and fast to implement.
2) Widely used due to its effectiveness and availability in many NLP libraries (e.g., NLTK in Python).
3) Good for reducing dimensionality in text data.
4) Helps in normalizing text data, making it easier to analyze and process.
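A minimal sketch of the stemmer in Python using NLTK (assuming NLTK is installed):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["running", "better", "happiness"]:
    # stem() applies the Porter suffix-stripping rules
    print(word, "->", stemmer.stem(word))

Output:
running -> run
better -> better
happiness -> happi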
Lemmatizer
• A lemmatizer is a tool or algorithm in NLP that reduces words to their base or dictionary form, known as the lemma.
• Unlike stemming (e.g., the Porter Stemmer), which simply chops off suffixes based on rules and may produce non-valid words, lemmatization considers the context and part of speech of a word to ensure the output is a valid word.
• The goal is to return the morphological root of a word, which is linguistically correct.
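A minimal sketch using NLTK's WordNetLemmatizer (assuming the WordNet data has been downloaded with nltk.download('wordnet')):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# The pos argument supplies the part of speech: 'v' = verb, 'a' = adjective, 'n' = noun (the default)
print(lemmatizer.lemmatize("running", pos="v"))
print(lemmatizer.lemmatize("better", pos="a"))
print(lemmatizer.lemmatize("mice"))

Output:
run
good
mouse

Note how "better" is mapped to its dictionary form "good", which a rule-based stemmer cannot do.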
The Penn Treebank (PTB)
1) The Penn Treebank (PTB) is a widely used resource in natural
language processing (NLP) and computational linguistics.
2) It is a corpus of text that has been annotated with syntactic
structures.
3) It is used for training and evaluating NLP models, particularly those
involved in tasks like part-of-speech (POS) tagging, parsing, and
grammar induction.
4) Developed at the University of Pennsylvania in the early 1990s.
5) Contains a large collection of text from various sources, such as the
Wall Street Journal, Brown Corpus, and other domains.
6) The most popular version, PTB-3, includes about 4.5 million words of
American English text.
7) The corpus is annotated with detailed syntactic structures, including phrase structures and dependency trees.
8) Each sentence is tagged with part-of-speech labels (e.g., noun, verb,
adjective) and parsed into a tree structure that represents its
grammatical structure.
9) Uses a specific set of grammatical tags and phrase labels, known as the Penn Treebank Tag Set: word-level tags such as NN (noun, singular) and VB (verb, base form), and phrase labels such as NP (noun phrase) and PP (prepositional phrase).
10) The data is typically represented in bracketed notation or as
constituency trees, which show how words in a sentence are grouped
and related hierarchically.
11) For example, a simple sentence like "The cat sleeps" might be
represented as:
(S (NP (DT The) (NN cat)) (VP (VBZ sleeps)))
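NLTK ships a roughly 10% sample of the PTB Wall Street Journal section, which can be used to inspect the tags and trees described above (assuming nltk.download('treebank') has been run):

from nltk.corpus import treebank

# POS-tagged tokens from the corpus sample
print(treebank.tagged_words()[:5])
# Constituency tree of the first sentence, printed in bracketed notation
print(treebank.parsed_sents()[0])

Output (abridged):
[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS')]
(S (NP-SBJ (NP (NNP Pierre) (NNP Vinken)) ...) (VP (MD will) (VP (VB join) ...)) (. .))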
Brill’s Tagger
• The Brill Tagger is a rule-based, error-driven, transformation-based
part-of-speech (POS) tagging method, invented by Eric Brill (1993).
• It is a supervised learning algorithm that iteratively corrects errors in
initial POS tagging using predefined transformation rules.
How the Brill Tagger Works
1. Initialization Phase:
• For known words: assigns the most frequent POS tag from a lexicon.
• For unknown words: assigns a default tag (e.g., noun) based on linguistic assumptions.
2. Rule-Based Corrections:
• Rules iteratively correct errors based on context (e.g., the previous/following words).
• Example rule: IN → NN if the previous tag is DT. This changes "while" from IN (preposition) to NN (noun) when preceded by a determiner (e.g., "a while").
3. Iterative Transformation Process:
• The tagger applies correction rules repeatedly until no more improvements can be made.
• The rules can be learned from a pre-tagged corpus using machine learning.
Implementation of Brill Tagger in Python using NLTK
Output:
Tagged Sentence (Using Brill Tagger): [('The', 'DT'), ('dog', None), ('barked', None), ('at', 'IN'), ('the', 'DT'), ('cat', None)]
Brill Tagger Accuracy: 87.03 %
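The exact code and data split behind the output above are not shown; a minimal sketch of training a Brill tagger with NLTK (the corpus split and rule count here are assumptions) might look like this:

from nltk.corpus import treebank
from nltk.tag import UnigramTagger
from nltk.tag.brill import brill24
from nltk.tag.brill_trainer import BrillTaggerTrainer

# Assumed split of the PTB sample into training and test sentences
train_sents = treebank.tagged_sents()[:3000]
test_sents = treebank.tagged_sents()[3000:]

# Initialization phase: a unigram tagger assigns each known word its most frequent tag
baseline = UnigramTagger(train_sents)

# Learn transformation rules that correct the baseline tagger's errors
trainer = BrillTaggerTrainer(baseline, brill24(), trace=0)
brill_tagger = trainer.train(train_sents, max_rules=50)

print(brill_tagger.tag("The dog barked at the cat".split()))
# Use brill_tagger.evaluate(test_sents) on older NLTK versions
print("Brill Tagger Accuracy:", round(brill_tagger.accuracy(test_sents) * 100, 2), "%")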

Why Use the Brill Tagger?
• More accurate than simple unigram/bigram taggers.
• Interpretable rules for correcting errors.
• Useful for small NLP datasets where deep learning is unnecessary.
WordNet
• WordNet is a large lexical database of English words, developed at
Princeton University.
• It is widely used in Natural Language Processing (NLP) for understanding
word meanings and relationships.
Applications of WordNet in NLP
✅ Word Sense Disambiguation (WSD) – helps determine the correct meaning of a word in context.
✅ Text Similarity & Semantic Analysis – measures similarity between words.
✅ Chatbots & AI Assistants – enhances understanding of user queries.
✅ Search Engines – expands search terms using synonyms.
Features of WordNet
1. Synsets (Synonym Sets)
Groups of words with similar meanings.
Ex: {"happy", "joyful", "cheerful"}
2. Hypernyms & Hyponyms (Hierarchy)
Hypernym (more general term): "animal" → "dog"
Hyponym (more specific term): "dog" → "bulldog"
3. Antonyms (Opposites)
Ex: good × bad
4. Meronyms & Holonyms (Part-Whole Relationship)
Meronym (part of a whole): e.g., a wheel as a part of a car
Holonym (whole of a part): e.g., a car has a wheel
5. Sense Definitions & Examples
Ex: "bank" – a financial institution or a river bank
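A minimal sketch of these features with NLTK (assuming nltk.download('wordnet') has been run):

from nltk.corpus import wordnet as wn

# Synsets: groups of words with similar meanings
print(wn.synsets("happy"))

dog = wn.synset("dog.n.01")
print(dog.definition())    # sense definition
print(dog.hypernyms())     # more general terms (hypernyms)
print(dog.hyponyms()[:3])  # more specific terms (hyponyms)

# Antonyms are stored on lemmas rather than on synsets
good = wn.synset("good.a.01").lemmas()[0]
print(good.antonyms())

Output (abridged):
[Synset('happy.a.01'), ...]
a member of the genus Canis ...
[Synset('canine.n.02'), Synset('domestic_animal.n.01')]
...
[Lemma('bad.a.01.bad')]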
PropBank (Proposition Bank)
• PropBank (Proposition Bank) is a lexical resource that provides semantic role labeling (SRL) annotations for verbs in sentences.
• It extends the Penn Treebank (PTB) by adding annotations for predicate-argument structures, making it useful for semantic analysis and NLP tasks.
Features of PropBank
1. Predicate-Argument Structure
• Each verb is annotated with rolesets defining its possible arguments.
• Example: "give" has roles for who gives, what is given, and to whom.
2. Numbered Arguments (Arg0, Arg1, etc.)
• Arg0 → Agent/Doer
• Arg1 → Patient/Theme (Object)
• Arg2 → Indirect Object (Recipient)
• Arg3, Arg4, ... → Additional roles
PropBank Example
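For the sentence "John gave a book to Mary", PropBank's roleset give.01 labels the arguments as: [Arg0 John] gave [Arg1 a book] [Arg2 to Mary]. A minimal sketch of querying this roleset with NLTK (assuming nltk.download('propbank') has been run):

from nltk.corpus import propbank

# Look up the roleset for the first sense of the verb "give"
roleset = propbank.roleset("give.01")
for role in roleset.findall("roles/role"):
    print(role.attrib["n"], role.attrib["descr"])

Output:
0 giver
1 thing given
2 entity given to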
FrameNet
• FrameNet is a lexical database for semantic roles and frame semantics
in natural language, developed by the International Computer Science
Institute (ICSI).
• It is designed to capture semantic structures of language and how
different words evoke specific conceptual frames.
• In FrameNet, a frame represents a conceptual structure or a mental
model that helps us understand the world. For example, a buying
frame includes the buyer, seller, product, and money as participants in
the action of buying.
Key Features of FrameNet
1. Frames:
Frames represent conceptual structures or scenarios.
Example: A "Buy" frame includes roles like Buyer, Seller, Product, Money.

2. Frame Elements:
Frame elements (FE) are the core roles or participants in a frame.
Example: In the Buy frame, Buyer (agent), Seller (agent), and Product (theme) are
frame elements.

3. Lexical Units (LU):
Lexical units are words or phrases that evoke a specific frame.
Example: The verb "buy" evokes the Buy frame.

4. Frame-to-Frame Relations:
Frames can be related to one another (e.g., CAUSE, RESULT).
Example: "Buy" and "Sell" are related as opposites or counterparts in many contexts.
FrameNet Concept
Let's take the "Buying" frame:
• Frame Elements: Buyer, Seller, Product, Money
• Lexical Units: buy, purchase, sell
• Frame Relation: Opposite frame "Sell"

In the sentence "John bought a book from Mary," the elements would be:
• Buyer: John
• Seller: Mary
• Product: book
• Money: (if mentioned, e.g., "for $10")
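In the actual FrameNet data, the buying frame is named "Commerce_buy". A minimal sketch of looking it up with NLTK (assuming nltk.download('framenet_v17') has been run):

from nltk.corpus import framenet as fn

frame = fn.frame("Commerce_buy")
print(frame.name)                # the frame's name
print("Buyer" in frame.FE)       # Buyer is one of its frame elements
print("buy.v" in frame.lexUnit)  # the verb "buy" is one of its lexical units

Output:
Commerce_buy
True
True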
Brown Corpus
• The Brown Corpus is one of the first and most well-known corpora in
Natural Language Processing (NLP) and computational linguistics.
• It was created in 1961 at Brown University and has played a crucial role
in the development of language modeling and syntactic analysis.
Features of the Brown Corpus
1. Text Classification
• The Brown Corpus contains texts from a variety of genres and domains.
• It is tagged with part-of-speech (POS) labels, making it an excellent
resource for POS tagging and syntactic parsing.
2. Size and Composition
• 1 million words of American English text.
• The corpus is divided into 15 categories, including fiction, news, academic
writing, and more.
Categories include: Press (News), Fiction (Novels), Science Fiction, Poetry, Religion, Hobbies, etc.
3. POS Tagging
• The corpus is annotated with POS tags, which can be used for training POS
taggers and evaluating models.
• Tag set: Uses a relatively simple set of lexical categories (nouns, verbs,
adjectives, etc.).
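A minimal sketch of accessing the corpus with NLTK (assuming nltk.download('brown') has been run):

from nltk.corpus import brown

print(brown.categories())                         # the 15 genre categories
print(brown.words(categories="news")[:7])         # raw tokens from the news category
print(brown.tagged_words(categories="news")[:3])  # (word, POS tag) pairs

Output (abridged):
['adventure', 'belles_lettres', 'editorial', 'fiction', ..., 'science_fiction']
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday']
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL')]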
British National Corpus (BNC)
• The British National Corpus (BNC) is a large-scale, balanced collection of
written and spoken British English, widely used in computational
linguistics and natural language processing (NLP) tasks.
• It contains diverse text samples across different genres and domains,
representing the language used in everyday life.
Key Features of the British National Corpus
1. Size and Composition
• The BNC contains 100 million words of British English, collected from both
written and spoken texts.
• Genres: It covers various genres, including literature, academic articles,
newspapers, fiction, conversations, and broadcasts.
2. Written and Spoken Texts
• The corpus is divided into two main parts:
• Written texts (90% of the corpus): Includes books, newspapers, journals, and
more.
• Spoken texts (10% of the corpus): Covers transcriptions of conversations,
radio programs, and interviews.
3. POS Tagging
• The BNC is annotated with part-of-speech (POS) tags, similar to the
Penn Treebank and Brown Corpus. It allows for tasks like POS tagging,
syntax parsing, and semantic analysis.
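The BNC data itself must be obtained separately (it is not bundled with NLTK), but NLTK provides a reader for its XML format. A minimal sketch, where the local path and file pattern are assumptions:

from nltk.corpus.reader.bnc import BNCCorpusReader

# Hypothetical path to a local copy of the BNC XML texts
bnc = BNCCorpusReader(root="corpora/bnc/Texts", fileids=r"[A-K]/\w*/\w*\.xml")

print(bnc.words()[:10])               # raw tokens
print(bnc.tagged_words(c5=True)[:5])  # tokens with their fine-grained C5 POS tags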
