Irs Unit-2 Modified
Overview
The inverted file structure is a widely used data structure for information retrieval.
It stores an "inversion" of documents, listing the documents where each word
appears.
Components
1. Document File: Contains the actual documents.
2. Inversion List (Posting List):
Lists document IDs where a word appears.
Example: For the word "bit" in positions 10, 12, and 18 of document #1:
bit: 1(10), 1(12), 1(18).
3. Dictionary:
A sorted list of words with pointers to their inversion lists.
Stores additional data like list lengths for query optimization.
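The three components above can be sketched in a few lines. This is a minimal illustration, not a production index: the function name build_inverted_index, the sample documents, and the choice of word positions (rather than character positions) are all assumptions made for the example.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Build a dictionary of words, each pointing to its inversion list.

    Each posting is a (doc_id, word_position) pair, mirroring the
    "bit: 1(10), 1(12), 1(18)" notation used in the notes.
    """
    postings = defaultdict(list)
    for doc_id, text in documents.items():
        # Positions are 1-based word offsets (an assumption for this sketch).
        for position, word in enumerate(text.lower().split(), start=1):
            postings[word].append((doc_id, position))
    # The dictionary is kept in sorted word order and also stores the
    # list length, which a query optimizer can use.
    return {
        word: {"length": len(plist), "postings": plist}
        for word, plist in sorted(postings.items())
    }

# Hypothetical document file.
documents = {
    1: "bit and byte a bit is small",
    2: "a byte holds eight bit values",
}
index = build_inverted_index(documents)
print(index["bit"])
```

Storing the list length next to each word lets a search start with the shortest list, which keeps intermediate results small.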
Features
Zoning: Improves precision by focusing on specific document sections.
Weights: Assigns importance to words in the inversion list.
Special Dictionaries: Handles unique data like dates or numbers efficiently.
Ranking: Reorganizes results in ranked order for better relevance.
Search Process
1. Retrieve inversion lists for query terms.
2. Apply logical operations to find matching documents.
3. Refine results using weights and ranking algorithms.
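The three search steps can be illustrated with a small sketch. The inversion lists, the function name search_and, and the scoring rule (sum of term occurrences per document) are assumptions chosen for the example; real systems use more elaborate weights.

```python
from collections import Counter

# Hypothetical inversion lists: term -> document IDs, one entry per occurrence.
inversion_lists = {
    "computer": [1, 1, 2, 4],
    "science": [1, 2, 2, 2, 3],
    "graduate": [2, 4],
}

def search_and(query_terms, lists):
    """Return documents containing every query term, best match first."""
    if not query_terms:
        return []
    # Step 1: retrieve the inversion list for each query term.
    retrieved = []
    for term in query_terms:
        if term not in lists:
            return []  # a missing term means an AND query cannot match
        retrieved.append(Counter(lists[term]))
    # Step 2: apply the logical AND, starting with the shortest list.
    retrieved.sort(key=len)
    matching = set(retrieved[0])
    for counts in retrieved[1:]:
        matching &= set(counts)
    # Step 3: rank by a simple weight (total occurrences of the query terms).
    scores = {doc: sum(counts[doc] for counts in retrieved) for doc in matching}
    return sorted(matching, key=lambda doc: (-scores[doc], doc))

print(search_and(["computer", "science"], inversion_lists))
```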
Using B-Trees
A B-Tree can be used in place of the sorted dictionary to index the inversion lists.
Properties (for a B-Tree of order m):
Root node: 2 to 2m keys.
Internal nodes: m to 2m keys, kept in sorted order.
Leaf nodes: All at the same level, or differing by at most one level.
Inversion lists are stored at leaf levels or accessed via pointers.
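A lookup in such a tree can be sketched as follows. This is only a search over a small hand-built tree with m = 2; insertion and node splitting are omitted, and the Node class, the terms, and the document IDs are all hypothetical.

```python
from bisect import bisect_left

class Node:
    """One B-Tree node: sorted keys, one inversion list per key, child pointers."""
    def __init__(self, keys, postings, children=None):
        self.keys = keys                # sorted terms
        self.postings = postings        # postings[i] belongs to keys[i]
        self.children = children or []  # empty for a leaf

def search(node, term):
    """Walk from the root toward a leaf; return the inversion list or None."""
    while node is not None:
        i = bisect_left(node.keys, term)
        if i < len(node.keys) and node.keys[i] == term:
            return node.postings[i]
        # Descend into the child between keys[i-1] and keys[i], if any.
        node = node.children[i] if node.children else None
    return None

# Hand-built tree with m = 2 (every node here holds 2 keys).
leaf1 = Node(["bit", "byte"], [[1], [1, 2]])
leaf2 = Node(["data", "graduate"], [[2, 3], [4]])
leaf3 = Node(["students", "study"], [[4], [3, 4]])
root = Node(["computer", "science"], [[1, 2, 4], [1, 2, 3]], [leaf1, leaf2, leaf3])

print(search(root, "data"))
```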
Benefits
Efficient retrieval of relevant documents.
Optimized search with ranking and special data handling.
Information Extraction
Information extraction is the process of identifying and extracting relevant facts, data, or text from
documents to organize them into structured formats, such as fields in a database or summaries for
easy access and use.
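Information extraction involves two main processes: determining the facts that go into structured fields in a database, and extracting text that can be used to summarize a document.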
Key Difference
Hypertext (HTML) is for linking and displaying information on websites.
XML is for structuring and exchanging data between systems.
Signature File Structure
Key Points:
1. Partitioning Words:
Words are grouped into blocks (e.g., 5 words per block).
Each word gets a 16-bit signature with 5 bits set to "1".
2. Searching:
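Searches are done by matching patterns (templates) on the bit positions: a block is a candidate if every bit set in the query word's signature is also set in the block signature. Because signatures are combined, a candidate block may turn out not to contain the word (a false hit).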
3. Example:
For the text "Computer Science graduate students study", and block size = 5:
Each word is converted into a 16-bit signature.
The signatures are combined (OR operation) to form a block signature.
Example word signature for "Computer": 0001 0110 0000 0110
Final block signature: 1001 0111 1110 0110.
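The OR step and the template match can be sketched as below. Only the signature for "Computer" is given in the notes; the signatures for the other four words are illustrative assumptions, chosen so that each has 5 bits set and their OR equals the stated block signature. The function names are also hypothetical.

```python
# 16-bit word signatures, each with exactly 5 bits set.
WORD_SIGNATURES = {
    "computer": 0b0001_0110_0000_0110,  # given in the notes
    "science":  0b1001_0000_1110_0000,  # assumed for illustration
    "graduate": 0b1000_0101_0100_0010,  # assumed for illustration
    "students": 0b0000_0111_1000_0100,  # assumed for illustration
    "study":    0b0000_0110_0110_0100,  # assumed for illustration
}

def block_signature(words, signatures):
    """Superimpose (bitwise OR) the signatures of all words in a block."""
    sig = 0
    for word in words:
        sig |= signatures[word.lower()]
    return sig

def may_contain(block_sig, word_sig):
    """True if every bit of word_sig is set in block_sig (false hits possible)."""
    return (block_sig & word_sig) == word_sig

block = "Computer Science graduate students study".split()
sig = block_signature(block, WORD_SIGNATURES)
print(format(sig, "016b"))
```

A positive result from may_contain only marks the block as a candidate; the text of the block still has to be checked to rule out a false hit.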
Porter Stemming Algorithm
Key Concepts:
1. Consonants and Vowels:
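A consonant is any letter other than A, E, I, O, U, and other than Y when it follows a consonant (so Y is a vowel in "happy" but a consonant in "yes").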
The algorithm uses patterns of vowels (V) and consonants (C) to define word stems.
2. Measure (m):
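Any stem can be written as [C](VC)^m[V], where C and V are runs of consonants and vowels. The measure (m) is the number of VC sequences in the stem.
Example: In "running" (stem = run), the pattern is CVC, which contains one VC sequence, so m = 1.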
3. Special Conditions:
*X: Stem ends with a letter X.
*v*: Stem contains at least one vowel.
*d: Stem ends with a double consonant (e.g., TT, SS).
*o: Stem ends with a consonant-vowel-consonant sequence, but the last consonant is
not W, X, or Y (e.g., "hop", "wil").
4. Rules:
Each rule checks for specific suffix patterns and replaces them to get the root word.
The rules are grouped into steps to define their application order.
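The measure and the special conditions above can be sketched as small helper functions. The function names are hypothetical, and stems are assumed to be lowercase.

```python
def is_consonant(stem, i):
    """A letter is a consonant unless it is a, e, i, o, u,
    or a y that follows a consonant."""
    ch = stem[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(stem, i - 1)
    return True

def measure(stem):
    """Count VC sequences, i.e. m in the form [C](VC)^m[V]."""
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and prev_vowel:
            m += 1  # a vowel run has just been closed by a consonant
        prev_vowel = not consonant
    return m

def contains_vowel(stem):
    """Condition *v*."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))

def ends_double_consonant(stem):
    """Condition *d."""
    n = len(stem)
    return n >= 2 and stem[-1] == stem[-2] and is_consonant(stem, n - 1)

def ends_cvc(stem):
    """Condition *o: ends consonant-vowel-consonant, last letter not w, x, y."""
    n = len(stem)
    return (n >= 3
            and is_consonant(stem, n - 3)
            and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1)
            and stem[-1] not in "wxy")

for word in ["tree", "run", "trouble", "private"]:
    print(word, measure(word))
```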
Examples of Rules:
Step   Condition             Suffix   Replacement     Example
1a     None                  sses     ss              stresses → stress
1b     *v*                   ing      NULL            making → mak
1b1    None                  at       ate             inflat(ed) → inflate
1c     *v*                   y        i               happy → happi
2      m > 0                 aliti    al              formaliti → formal
3      m > 0                 icate    ic              duplicate → duplic
4      m > 1                 able     NULL            adjustable → adjust
5a     m > 1                 e        NULL            inflate → inflat
5b     m > 1 and *d and *L   None     single letter   controll → control
How It Works:
1. The algorithm identifies suffixes and evaluates their conditions.
2. Based on these, it transforms the word step by step.
3. For example:
"stresses" → replace "sses" with "ss" → "stress".
"making" → remove "ing" (the stem "mak" contains a vowel) → "mak".
Applied step by step, this process reduces words like "formalities", "happiness", and "duplicate" to their stems ("formal", "happi", and "duplic"). A stem need not be a dictionary word.
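A few of the rules listed above can be written as data and applied one step at a time. This is a partial sketch, not the full Porter algorithm: it covers only seven of the example rules, and the names RULES and apply_step are hypothetical.

```python
import re

def cv_pattern(stem):
    """Map each letter to C or V; y counts as a vowel after a consonant."""
    pattern = []
    for i, ch in enumerate(stem):
        if ch in "aeiou" or (ch == "y" and i > 0 and pattern[i - 1] == "C"):
            pattern.append("V")
        else:
            pattern.append("C")
    return "".join(pattern)

def measure(stem):
    """Number of VC sequences in the stem."""
    return len(re.findall(r"V+C+", cv_pattern(stem)))

def has_vowel(stem):
    return "V" in cv_pattern(stem)

# step -> (condition on the stem, suffix, replacement)
RULES = {
    "1a": (lambda s: True, "sses", "ss"),
    "1b": (has_vowel, "ing", ""),
    "1c": (has_vowel, "y", "i"),
    "2": (lambda s: measure(s) > 0, "aliti", "al"),
    "3": (lambda s: measure(s) > 0, "icate", "ic"),
    "4": (lambda s: measure(s) > 1, "able", ""),
    "5a": (lambda s: measure(s) > 1, "e", ""),
}

def apply_step(word, step):
    """Apply one rule; return the word unchanged if the rule does not fire."""
    condition, suffix, replacement = RULES[step]
    if word.endswith(suffix):
        stem = word[: len(word) - len(suffix)]
        if condition(stem):
            return stem + replacement
    return word

print(apply_step("stresses", "1a"), apply_step("making", "1b"))
```

Note that the condition is always tested on the stem that remains after the suffix is removed, not on the whole word.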
Successor Stemmers:
1. How They Work:
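These stemmers split a word into segments and select one segment as the stem.
The split points are based on successor variety: the number of distinct letters that follow a given prefix of the word in a body of text.
They use different methods to decide where to break the word:
Cutoff: Break wherever the successor variety reaches a chosen threshold value.
Peak and Plateau: Break after a letter whose successor variety is higher than that of the letters before and after it.
Complete Word: Break where a segment is itself a complete word in the text.
Entropy: Use the distribution of successor letters, not just their count, to decide where to break.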
In short, dictionary lookup stemmers replace words with their root forms based on a dictionary,
while successor stemmers break words into parts and choose the best one.
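Successor stemming can be sketched with the peak method. The successor variety of a prefix is the number of distinct letters that follow it among the words of a corpus; the word is broken after a prefix whose variety is higher than that of its neighbors. The small corpus, the function names, and the decision not to count end-of-word as a successor are assumptions made for this example.

```python
# Hypothetical corpus used to measure successor varieties.
CORPUS = ["able", "ape", "beatable", "fixable", "read", "readable",
          "reading", "reads", "red", "rope", "ripe"]

def successor_variety(prefix, corpus):
    """Number of distinct letters that follow prefix in the corpus."""
    n = len(prefix)
    return len({w[n] for w in corpus if w.startswith(prefix) and len(w) > n})

def varieties(word, corpus):
    """Successor variety of every prefix of word, shortest prefix first."""
    return [successor_variety(word[:i], corpus) for i in range(1, len(word) + 1)]

def peak_segments(word, corpus):
    """Break after each prefix whose variety exceeds both neighbors."""
    sv = varieties(word, corpus)
    cuts = [i + 1 for i in range(1, len(sv) - 1)
            if sv[i] > sv[i - 1] and sv[i] > sv[i + 1]]
    segments, start = [], 0
    for cut in cuts:
        segments.append(word[start:cut])
        start = cut
    segments.append(word[start:])
    return segments

print(varieties("readable", CORPUS))
print(peak_segments("readable", CORPUS))
```

Once the word is segmented, one of the segments (here "read") would be chosen as the stem.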