NLP UNIT 5 Part B
LEXICAL RESOURCES
Contents
Lexical Resources:
• Porter Stemmer
• Lemmatizer
• Penn Treebank
• Brill’s Tagger
• WordNet
• PropBank
• FrameNet
• Brown Corpus
• British National Corpus (BNC)
Porter Stemmer
• The Porter Stemmer uses a set of heuristic rules to remove suffixes from words.
• It was developed by Martin Porter in 1980 and is based on the idea that certain
suffixes can be stripped off systematically.
• It’s a rule-based approach, not always perfect, but effective for many
applications.
Ex:
1) "running" → "run"
2) "better" → "better" (no change, as it doesn't meet the conditions of the
suffix-removal rules)
3) "happiness" → "happi"
Advantages:
1) Simple and fast to implement.
2) Widely used due to its effectiveness and availability in many NLP
libraries (e.g., NLTK in Python).
3) Good for reducing dimensionality in text data.
4) Normalizes text data, making it easier to analyze and process.
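The three examples above can be reproduced with a few rules written in the spirit of the Porter algorithm. The sketch below is a simplification for illustration only: it implements just three suffix rules ("ing", "ness", "er") with their conditions on the stem, and the function names (is_consonant, measure, has_vowel, toy_stem) are made up for this example. The full algorithm has many more rules and steps.

```python
VOWELS = set("aeiou")

def is_consonant(word, i):
    """Porter's definition: 'y' counts as a vowel only after a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Count vowel-to-consonant transitions (the measure m used in rule conditions)."""
    m = 0
    prev_is_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if prev_is_vowel and cons:
            m += 1
        prev_is_vowel = not cons
    return m

def has_vowel(stem):
    return any(not is_consonant(stem, i) for i in range(len(stem)))

def toy_stem(word):
    """Apply three simplified suffix rules; NOT the complete Porter stemmer."""
    word = word.lower()
    # "ing" is stripped only if the remaining stem contains a vowel.
    if word.endswith("ing") and has_vowel(word[:-3]):
        stem = word[:-3]
        # Undouble a final double consonant (except l, s, z): "runn" -> "run".
        if (len(stem) >= 2 and stem[-1] == stem[-2]
                and is_consonant(stem, len(stem) - 1) and stem[-1] not in "lsz"):
            stem = stem[:-1]
        return stem
    # "ness" is stripped when the stem has measure m > 0.
    if word.endswith("ness") and measure(word[:-4]) > 0:
        return word[:-4]
    # "er" is stripped only when the stem has measure m > 1.
    if word.endswith("er") and measure(word[:-2]) > 1:
        return word[:-2]
    return word

for w in ["running", "better", "happiness"]:
    print(w, "->", toy_stem(w))
# running -> run
# better -> better
# happiness -> happi
```

"better" is left unchanged because its stem "bett" has measure 1, and the "er" rule requires a measure greater than 1.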
Lemmatizer
• A lemmatizer is a tool or algorithm in NLP that reduces words to their
base or dictionary form, known as the lemma.
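As a minimal sketch of the idea, a lemmatizer can be pictured as a dictionary lookup keyed by the word and its part of speech. The table below is a tiny hand-made stand-in (LEMMA_TABLE and toy_lemmatize are hypothetical names, not part of any library); a real lemmatizer relies on a full lexicon and morphological rules.

```python
# Hypothetical miniature lemma dictionary keyed by (word, part of speech).
LEMMA_TABLE = {
    ("better", "adj"): "good",
    ("running", "verb"): "run",
    ("ran", "verb"): "run",
    ("mice", "noun"): "mouse",
    ("was", "verb"): "be",
}

def toy_lemmatize(word, pos):
    """Return the dictionary form for (word, pos), or the word itself if unknown."""
    word = word.lower()
    return LEMMA_TABLE.get((word, pos), word)

print(toy_lemmatize("better", "adj"))    # good
print(toy_lemmatize("ran", "verb"))      # run
print(toy_lemmatize("Mice", "noun"))     # mouse
```

Unlike a stemmer, the result is always a real dictionary word, and it depends on the part of speech supplied.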
PropBank Example
FrameNet
• FrameNet is a lexical database for semantic roles and frame semantics
in natural language, developed by the International Computer Science
Institute (ICSI).
• It is designed to capture semantic structures of language and how
different words evoke specific conceptual frames.
• In FrameNet, a frame represents a conceptual structure or a mental
model that helps us understand the world. For example, a buying
frame includes the buyer, seller, product, and money as participants in
the action of buying.
Key Features of FrameNet
1. Frames:
Frames represent conceptual structures or scenarios.
Example: A "Buy" frame includes roles like Buyer, Seller, Product, Money.
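2. Frame Elements:
Frame elements (FEs) are the roles or participants in a frame.
Example: In the Buy frame, Buyer (agent), Seller (agent), and Product (theme) are
frame elements.
3. Lexical Units:
Lexical units (LUs) are the words that evoke a frame.
Example: "buy" and "purchase" evoke the Buy frame.
4. Frame-to-Frame Relations:
Frames can be related to one another (e.g., Inheritance, Using, Causative_of).
Example: "Buy" and "Sell" are counterparts that describe the same transaction from
the buyer's and the seller's point of view.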
FrameNet Concept
Let's take the "Buying" frame:
• Frame Elements: Buyer, Seller, Product, Money
• Lexical Units: buy, purchase
• Frame Relation: counterpart frame "Sell" (evoked by "sell")
In the sentence "John bought a book from Mary," the elements would be:
• Buyer: John
• Seller: Mary
• Product: book
• Money: (if mentioned, e.g., "for $10")
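The annotation above can be sketched as a small data structure. This is an illustration only: the Frame and Annotation classes are hypothetical and do not reflect the format of the actual FrameNet database.

```python
from dataclasses import dataclass, field

@dataclass
class Frame:
    name: str
    frame_elements: list   # roles that can take part in the frame
    lexical_units: list    # words that evoke the frame

@dataclass
class Annotation:
    frame: Frame
    target: str                                  # word in the sentence that evokes the frame
    roles: dict = field(default_factory=dict)    # frame element -> text span

    def missing_elements(self):
        """Frame elements that the sentence leaves unexpressed."""
        return [fe for fe in self.frame.frame_elements if fe not in self.roles]

BUYING = Frame(
    name="Buying",
    frame_elements=["Buyer", "Seller", "Product", "Money"],
    lexical_units=["buy", "purchase"],
)

# "John bought a book from Mary."
annotation = Annotation(
    frame=BUYING,
    target="bought",
    roles={"Buyer": "John", "Seller": "Mary", "Product": "book"},
)

print(annotation.roles["Buyer"])        # John
print(annotation.missing_elements())    # ['Money']
```

Money shows up as missing because the sentence never states a price.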
A FrameNet Model
Brown Corpus
• The Brown Corpus is one of the first and best-known corpora in
Natural Language Processing (NLP) and computational linguistics.
• It was compiled at Brown University in the 1960s from American English
texts published in 1961, and it has played a crucial role in the development
of language modeling and syntactic analysis.
Features of the Brown Corpus
1. Text Classification
• The Brown Corpus contains texts from a variety of genres and domains.
• It is tagged with part-of-speech (POS) labels, making it an excellent
resource for POS tagging and syntactic parsing.
2. Size and Composition
• 1 million words of American English text.
• The corpus is divided into 15 categories, including fiction, news, academic
writing, and more.
Categories include: Press (News), Fiction (Novels), Science Fiction, Religion,
Hobbies, Humor, etc.
3. POS Tagging
• The corpus is annotated with POS tags, which can be used for training POS
taggers and evaluating models.
• Tag set: Uses the Brown tag set, which has more than 80 tags built on the
basic lexical categories (nouns, verbs, adjectives, etc.).
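To show how POS-annotated text can train and evaluate a tagger, here is a minimal sketch of a most-frequent-tag baseline. The sentences are a tiny made-up sample written with Brown-style tags (AT = article, NN = noun, VBD = past-tense verb); they are not taken from the corpus itself, and the function names are assumptions for this example.

```python
from collections import Counter, defaultdict

# Made-up training and test sentences in (word, tag) form.
TRAIN = [
    [("the", "AT"), ("dog", "NN"), ("barked", "VBD"), (".", ".")],
    [("the", "AT"), ("cat", "NN"), ("slept", "VBD"), (".", ".")],
    [("a", "AT"), ("dog", "NN"), ("slept", "VBD"), (".", ".")],
]
TEST = [
    [("the", "AT"), ("cat", "NN"), ("barked", "VBD"), (".", ".")],
    [("a", "AT"), ("bird", "NN"), ("sang", "VBD"), (".", ".")],
]

def train_unigram_tagger(tagged_sentences):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag(model, words, default="NN"):
    """Tag known words from the model; unknown words get the default tag."""
    return [(w, model.get(w.lower(), default)) for w in words]

def accuracy(model, tagged_sentences):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = total = 0
    for sentence in tagged_sentences:
        predicted = tag(model, [w for w, _ in sentence])
        for (_, gold), (_, guess) in zip(sentence, predicted):
            correct += gold == guess
            total += 1
    return correct / total

model = train_unigram_tagger(TRAIN)
print(tag(model, ["the", "dog", "slept", "."]))
# [('the', 'AT'), ('dog', 'NN'), ('slept', 'VBD'), ('.', '.')]
print(accuracy(model, TEST))  # 0.875
```

The only error on the test sample is the unseen verb "sang", which falls back to the default tag NN. Handling unknown words is where larger annotated corpora and better models pay off.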
British National Corpus (BNC)
• The British National Corpus (BNC) is a large-scale, balanced collection of
written and spoken British English, widely used in computational
linguistics and natural language processing (NLP) tasks.
• It contains diverse text samples across different genres and domains,
representing the language used in everyday life.
Key Features of the British National Corpus
1. Size and Composition
• The BNC contains 100 million words of British English, collected from both
written and spoken texts.
• Genres: It covers various genres, including literature, academic articles,
newspapers, fiction, conversations, and broadcasts.
2. Written and Spoken Texts
• The corpus is divided into two main parts:
• Written texts (90% of the corpus): Includes books, newspapers, journals, and
more.
• Spoken texts (10% of the corpus): Covers transcriptions of conversations,
radio programs, and interviews.
3. POS Tagging
• The BNC is annotated with part-of-speech (POS) tags, assigned automatically
with the CLAWS tagger, as are the Penn Treebank and the Brown Corpus. This
supports tasks such as POS tagging, syntactic parsing, and semantic analysis.