Irs Unit-2 Modified

The document provides an overview of various data structures used in information retrieval, including inverted file structures, XML, and hypertext data structures. It discusses the components, features, and benefits of these structures, as well as the processes of information extraction and stemming algorithms. Additionally, it explains the workings of signature file structures and different types of stemmers, highlighting their applications and advantages.

Uploaded by

Balle Manasa
Inverted File Structure

 Overview
 The inverted file structure is a widely used data structure for information retrieval.
 It stores an "inversion" of documents, listing the documents where each word
appears.

Components
1. Document File: Contains the actual documents.
2. Inversion List (Posting List):
 Lists document IDs where a word appears.
 Example: For the word "bit" in positions 10, 12, and 18 of document #1:
bit: 1(10), 1(12), 1(18).
3. Dictionary:
 A sorted list of words with pointers to their inversion lists.
 Stores additional data like list lengths for query optimization.

Features
 Zoning: Improves precision by focusing on specific document sections.
 Weights: Assigns importance to words in the inversion list.
 Special Dictionaries: Handles unique data like dates or numbers efficiently.
 Ranking: Reorganizes results in ranked order for better relevance.

Search Process
1. Retrieve inversion lists for query terms.
2. Apply logical operations to find matching documents.
3. Refine results using weights and ranking algorithms.
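
The components and search steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production index: the function names `build_inverted_index` and `search_and`, the whitespace tokenizer, and the three sample documents are all assumptions made for the example. Positions are counted in words, starting at 1.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each word to a posting list of (doc_id, word_position) pairs."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split(), start=1):
            index[word].append((doc_id, position))
    return index

def search_and(index, terms):
    """Return the sorted IDs of documents that contain every query term."""
    doc_sets = [{doc_id for doc_id, _ in index.get(t.lower(), [])} for t in terms]
    return sorted(set.intersection(*doc_sets)) if doc_sets else []

# Hypothetical document file: document ID -> text.
docs = {1: "computer bit byte", 2: "bit memory", 3: "computer memory"}
index = build_inverted_index(docs)

print(index["bit"])                              # posting list for "bit"
print(search_and(index, ["computer", "memory"]))  # logical AND of two lists
```

Weights and ranking would be layered on top of the posting lists; they are left out here to keep the sketch short.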

Using B-Trees
 B-Trees can manage inversion lists instead of a dictionary.
 Properties (for a B-Tree of order m):
 Root node: 2 to 2m keys.
 Internal nodes: m to 2m keys.
 Keys within a node are kept in sorted order.
 Leaf nodes: All at the same level, or differing by at most one level.
 Inversion lists are stored at leaf levels or accessed via pointers.

Benefits
 Efficient retrieval of relevant documents.
 Optimized search with ranking and special data handling.

Information Extraction
Information extraction is the process of identifying and extracting relevant facts, data, or text from
documents to organize them into structured formats, such as fields in a database or summaries for
easy access and use.
Information extraction involves two main processes:

1. Fact Extraction for Databases


 Identifies and extracts structured facts to store in database fields.
 Process Name: Automatic File Build.
 Evaluation Metrics:
 Recall: How much of the information that should have been extracted from the item was actually extracted.
 Precision: How much of the extracted information is correct, relative to everything that was extracted.
 Overgeneration: How much irrelevant information was extracted.
 Fallout: How often the system assigns incorrect slot fillers (wrong data in extracted fields) as the number of potential incorrect slot fillers grows.
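
A small worked example may make the first three metrics concrete. The sketch below assumes that extracted and relevant facts can be compared as simple sets, which is a simplification. The function name `extraction_metrics` and the sample slot fillers are hypothetical. Fallout is omitted because it needs the number of potential incorrect slot fillers, which this toy setup does not model.

```python
def extraction_metrics(extracted, relevant):
    """Compute recall, precision, and overgeneration for set-valued facts."""
    extracted, relevant = set(extracted), set(relevant)
    correct = extracted & relevant
    recall = len(correct) / len(relevant) if relevant else 0.0
    precision = len(correct) / len(extracted) if extracted else 0.0
    # Under the set simplification, everything extracted but not relevant
    # counts as overgenerated.
    overgeneration = len(extracted - relevant) / len(extracted) if extracted else 0.0
    return {"recall": recall, "precision": precision, "overgeneration": overgeneration}

# Hypothetical slot fillers: 3 of 5 relevant facts found, plus 1 irrelevant one.
metrics = extraction_metrics(
    extracted={"name", "date", "city", "noise"},
    relevant={"name", "date", "city", "price", "owner"},
)
print(metrics)
```

With these inputs, recall is 3/5, precision is 3/4, and overgeneration is 1/4.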

2. Text Extraction for Summarization


 Extracts key text to create a summary of an item.
 Goal: Capture the most important ideas while reducing the size of the content.
 Examples:
 Titles, table of contents, and abstracts.
 Abstracts: Serve as a summary for quick understanding or search indexing.

XML and Hypertext Data Structures
Hypertext Data Structure
 Definition: Hypertext links one piece of information to another using embedded pointers,
making navigation between items easy.
 Storage: Stored in HTML (HyperText Markup Language) format.
 Features:
 Used for displaying information on websites.
 Allows users to click links and move between pages.
XML Data Structure
 Definition: XML (eXtensible Markup Language) organizes and stores data in a structured
format.
 Structure:
 A DTD (Document Type Definition) defines which elements and attributes a document may contain, and the DOM (Document Object Model) exposes a parsed document as a tree that programs can navigate.
 Focuses on data meaning and organization, not just display.
 Advantages:
 Helps in sharing data between systems.
 Easy to read and process due to its hierarchical format.
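
To show what "hierarchical and easy to process" means in practice, here is a minimal sketch that parses a small XML document with Python's standard `xml.etree.ElementTree` module. The `catalog`, `item`, `title`, and `pages` names and the sample values are assumptions invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog; element and attribute names are illustrative only.
XML_TEXT = """
<catalog>
  <item id="d1">
    <title>Inverted files</title>
    <pages>12</pages>
  </item>
  <item id="d2">
    <title>Signature files</title>
    <pages>8</pages>
  </item>
</catalog>
"""

root = ET.fromstring(XML_TEXT)

# Walk the tree and turn each <item> into a plain record.
records = [
    {
        "id": item.get("id"),
        "title": item.findtext("title"),
        "pages": int(item.findtext("pages")),
    }
    for item in root.findall("item")
]
print(records)
```

The tags describe what each value means, so another system can read the same file without knowing how it will be displayed.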

Key Difference
 Hypertext (HTML) is for linking and displaying information on websites.
 XML is for structuring and exchanging data between systems.

Signature File Structure


 Signature file structures are based on coding the words of an item into word signatures.
 A word signature is a fixed-length binary code in which specific bits are set to "1" by a hash function. Different words can share bit positions, so signatures are not guaranteed to be unique.
 The word signatures for all words in a block are combined using a logical OR operation to create the block signature.

Key Points:
1. Partitioning Words:
 Words are grouped into blocks (e.g., 5 words per block).
 Each word gets a 16-bit signature with 5 bits set to "1".
2. Searching:
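 Searches are done by template matching on the bit positions: a block is a candidate if every bit set in the query word's signature is also set in the block signature. Because signatures overlap, a candidate can be a false hit and must be checked against the text.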
3. Example:
For the text "Computer Science graduate students study", and block size = 5:
 Each word is converted into a 16-bit signature.
 The signatures are combined (OR operation) to form a block signature.
Example word signature for "Computer": 0001 0110 0000 0110
Final block signature: 1001 0111 1110 0110.
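
The construction above can be sketched in Python. The notes do not say which hash function produces the bit positions, so this sketch assumes SHA-256 purely for illustration. As a result, the bit patterns it prints will differ from the example signatures shown above. The names `word_signature`, `block_signature`, and `may_contain` are hypothetical.

```python
import hashlib

CODE_BITS = 16      # signature length, as in the example above
BITS_PER_WORD = 5   # bits set to 1 for each word

def word_signature(word):
    """Set BITS_PER_WORD distinct bit positions chosen by hashing the word."""
    sig, counter = 0, 0
    while bin(sig).count("1") < BITS_PER_WORD:
        digest = hashlib.sha256(f"{word.lower()}:{counter}".encode()).digest()
        sig |= 1 << (digest[0] % CODE_BITS)
        counter += 1
    return sig

def block_signature(words):
    """OR the word signatures of one block together."""
    sig = 0
    for w in words:
        sig |= word_signature(w)
    return sig

def may_contain(block_sig, word):
    """True if every bit of the word's signature is set in the block signature."""
    ws = word_signature(word)
    return (block_sig & ws) == ws

block = ["Computer", "Science", "graduate", "students", "study"]
sig = block_signature(block)
print(format(sig, "016b"))
print(may_contain(sig, "science"))
```

A `True` result only marks the block as a candidate. A word that is absent can still match if other words happened to set the same bits, which is why limiting the number of words per block matters.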

Applications and Advantages:


 Useful for medium-sized databases, low-frequency term databases, WORM devices, and
distributed systems.
 Works well in parallel processing environments.
 Prevents excessive "1s" in signatures by limiting words per block.

Stemming Algorithms:


1. What is Stemming?
 Stemming is used to change words to their base form (the stem).
 For example, connect, connected, connecting, and connection all become connect.
2. Why is Stemming Used?
 It helps improve search results by treating similar words as one.
 Reduces the total number of words, making the system faster and simpler.
3. How Does it Help?
 Better Recall: More relevant results when searching.
 Simplified Data: Less data to process, making the system work better.
4. What to Watch Out For:
 Don’t apply stemming to proper names (like IARE) or acronyms.
 Stemming can sometimes remove important information needed for understanding.
In short, stemming groups related word forms together, which shrinks the index and improves recall, though it can reduce precision when unrelated words end up sharing a stem.

Porter Stemming Algorithm (Simplified Explanation):


The Porter Stemming Algorithm is used to reduce words to their root or stem form by applying a
set of rules and conditions. This helps in standardizing words for tasks like text search and indexing.

Key Concepts:
1. Consonants and Vowels:
 A consonant is any letter other than A, E, I, O, U, and other than Y when it follows a consonant (so the Y in "happy" counts as a vowel).
 The algorithm uses patterns of vowels (V) and consonants (C) to define word stems.
2. Measure (m):
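 The measure (m) is the number of vowel-consonant (VC) sequences in a stem. Every stem has the form [C](VC)^m[V], where C and V stand for runs of one or more consonants or vowels.
 Example: In "running" (stem = run), the pattern is C-V-C, which contains one VC sequence, so m = 1.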
3. Special Conditions:
 *X: Stem ends with the letter X, where X stands for any given letter (e.g., *S means the stem ends with S).
 *v*: Stem contains at least one vowel.
 *d: Stem ends with a double consonant (e.g., TT, SS).
 *o: Stem ends with a consonant-vowel-consonant sequence, but the last consonant is
not W, X, or Y (e.g., "hop", "wil").
4. Rules:
 Each rule checks for specific suffix patterns and replaces them to get the root word.
 The rules are grouped into steps to define their application order.
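
The measure and the special conditions can be sketched as small Python helpers. The function names are assumptions chosen for this example, and the code follows Porter's convention that Y counts as a vowel when it comes after a consonant.

```python
import re

def cv_pattern(stem):
    """Return a string of 'c'/'v' marks, treating Y after a consonant as a vowel."""
    marks = []
    for i, ch in enumerate(stem.lower()):
        if ch in "aeiou" or (ch == "y" and i > 0 and marks[i - 1] == "c"):
            marks.append("v")
        else:
            marks.append("c")
    return "".join(marks)

def measure(stem):
    """m in [C](VC)^m[V]: count each vowel run followed by a consonant run."""
    return len(re.findall(r"v+c+", cv_pattern(stem)))

def contains_vowel(stem):          # condition *v*
    return "v" in cv_pattern(stem)

def ends_double_consonant(stem):   # condition *d
    return len(stem) >= 2 and stem[-1] == stem[-2] and cv_pattern(stem)[-1] == "c"

def ends_cvc(stem):                # condition *o
    return cv_pattern(stem).endswith("cvc") and stem[-1].lower() not in "wxy"

for word in ("tree", "run", "trouble", "private"):
    print(word, cv_pattern(word), measure(word))
```

For these four words the measures are 0, 1, 1, and 2.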

Examples of Rules:
Step  Condition            Suffix  Replacement    Example
1a    None                 SSES    SS             stresses → stress
1b    *v*                  ING     Null           making → mak
1b1   None                 AT      ATE            inflat(ed) → inflate
1c    *v*                  Y       I              happy → happi
2     m > 0                ALITI   AL             formaliti → formal
3     m > 0                ICATE   IC             duplicate → duplic
4     m > 1                ABLE    Null           adjustable → adjust
5a    m > 1                E       Null           inflate → inflat
5b    m > 1 and *d and *L  None    Single letter  controll → control
Note: Step 1b1 rules are applied only after a step 1b rule has removed a suffix such as ED or ING.

How It Works:
1. The algorithm identifies suffixes and evaluates their conditions.
2. Based on these, it transforms the word step by step.
3. For example:
 "Stresses" → Replace "SSES" with "SS" → "Stress".
 "Making" → Remove "Ing" → "Mak".
This systematic process helps reduce words like "formalities", "happiness", and "duplicate" to their
roots ("formal", "happi", and "duplic").
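
The step-by-step process can be sketched with a small rule table. The sketch below is not the full Porter algorithm. It applies only seven of the rules from the table above (1a, 1b for ING, 1c, 2, 3, 4, and 5a), tries each at most once in order, and leaves out steps 1b1 and 5b. The names `RULES` and `stem_subset` are hypothetical.

```python
import re

def cv_pattern(stem):
    """Mark each letter 'c' or 'v'; Y after a consonant counts as a vowel."""
    marks = []
    for i, ch in enumerate(stem):
        if ch in "aeiou" or (ch == "y" and i > 0 and marks[i - 1] == "c"):
            marks.append("v")
        else:
            marks.append("c")
    return "".join(marks)

def measure(stem):
    """Number of VC sequences in the stem (Porter's m)."""
    return len(re.findall(r"v+c+", cv_pattern(stem)))

def has_vowel(stem):
    return "v" in cv_pattern(stem)

# (step, condition on the stem left after removing the suffix, suffix, replacement)
RULES = [
    ("1a", lambda s: True, "sses", "ss"),
    ("1b", has_vowel, "ing", ""),
    ("1c", has_vowel, "y", "i"),
    ("2", lambda s: measure(s) > 0, "aliti", "al"),
    ("3", lambda s: measure(s) > 0, "icate", "ic"),
    ("4", lambda s: measure(s) > 1, "able", ""),
    ("5a", lambda s: measure(s) > 1, "e", ""),
]

def stem_subset(word):
    """Apply the rule subset above in step order; each rule fires at most once."""
    word = word.lower()
    for _step, condition, suffix, replacement in RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)]
            if condition(stem):
                word = stem + replacement
    return word

for w in ("stresses", "making", "happy", "formaliti", "duplicate", "adjustable"):
    print(w, "->", stem_subset(w))
```

Keeping the rules in a table makes the order of the steps explicit and lets each condition be checked against the stem that would remain after the suffix is removed.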

Dictionary Lookup Stemmers:


1. How They Work:
 These stemmers check a word in a dictionary to find its root form (stem).
 If the word is in the dictionary, it's replaced with its root form.
 If the word is not found, it may remove suffixes based on rules.
2. What It Uses:
 A dictionary of words.
 Exception lists for words that need special handling, such as words that must keep a final "e" (e.g., “suites” becomes "suite", not "suit").
 Special rules for proper nouns (like names) to prevent stemming.
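
A dictionary lookup stemmer along these lines can be sketched as follows. The tiny `DICTIONARY`, the `EXCEPTIONS` list, and the suffix list are hypothetical stand-ins for the much larger resources a real system would use. The function name `lookup_stem` is also an assumption.

```python
# Hypothetical dictionary: known word -> root form.
DICTIONARY = {
    "suite": "suite",
    "suit": "suit",
    "connect": "connect",
    "connection": "connect",
}

# Hypothetical exception list: forms whose stem must keep a final "e".
EXCEPTIONS = {"suites": "suite"}

SUFFIXES = ("ing", "ed", "es", "s")

def lookup_stem(word):
    """Return the root form of a word, preferring exceptions, then the dictionary."""
    word = word.lower()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    if word in DICTIONARY:
        return DICTIONARY[word]
    # Fallback: strip a suffix and accept the result only if it is a known word.
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)]
            if candidate in DICTIONARY:
                return DICTIONARY[candidate]
    return word

for w in ("suites", "suited", "connecting", "connection"):
    print(w, "->", lookup_stem(w))
```

Without the exception entry, "suites" would lose "es" and become "suit", which is the kind of error the exception list is there to prevent.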

Successor Stemmers:
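1. How They Work:
 These stemmers split a word into segments based on its successor variety: the number of distinct letters that follow a given prefix of the word in a body of text. One segment is then chosen as the stem.
 They use different methods to decide where to break the word:
 Cutoff: Break wherever the successor variety reaches a chosen threshold value.
 Peak and Plateau: Break after a letter whose successor variety is higher than that of the letters just before and after it.
 Complete Word: Break where a prefix is itself a complete word in the text.
 Entropy: Use the distribution of the successor letters, not just their count, to decide where to break.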
In short, dictionary lookup stemmers replace words with their root forms based on a dictionary,
while successor stemmers break words into parts and choose the best one.
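
Successor variety and the peak method can be sketched as below. The small word list is a hypothetical corpus chosen for the example, and the function names are assumptions. This sketch counts only letters that actually follow a prefix and does not count an end-of-word marker as a successor.

```python
def successor_variety(prefix, corpus):
    """Number of distinct letters that follow `prefix` in the corpus words."""
    return len({w[len(prefix)] for w in corpus
                if w.startswith(prefix) and len(w) > len(prefix)})

def peak_segments(word, corpus):
    """Split `word` after every prefix whose successor variety is a local peak."""
    sv = [successor_variety(word[:i], corpus) for i in range(1, len(word) + 1)]
    # sv[i] belongs to the prefix of length i + 1.
    cuts = [i + 1 for i in range(1, len(sv) - 1)
            if sv[i] > sv[i - 1] and sv[i] > sv[i + 1]]
    bounds = [0] + cuts + [len(word)]
    return [word[a:b] for a, b in zip(bounds, bounds[1:])]

# Hypothetical corpus.
corpus = ["able", "ape", "beatable", "fixable", "read", "readable",
          "reading", "reads", "red", "rope", "ripe"]

print([successor_variety("readable"[:i], corpus) for i in range(1, 9)])
print(peak_segments("readable", corpus))
```

For "readable" the variety rises sharply after "read", because "a", "i", and "s" can all follow it, so the word is split into "read" and "able". A stemmer would then pick one of those segments as the stem.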
