Irs Unit-2 Modified

The document provides an overview of various data structures used in information retrieval, including inverted file structures, XML, and hypertext data structures. It discusses the components, features, and benefits of these structures, as well as the processes of information extraction and stemming algorithms. Additionally, it explains the workings of signature file structures and different types of stemmers, highlighting their applications and advantages.

Uploaded by

Balle Manasa

Inverted File Structure

Overview
• The inverted file structure is a widely used data structure for information retrieval.
• It stores an "inversion" of the documents: for each word, it lists the documents in which that word appears.

Components
1. Document File: Contains the actual documents.
2. Inversion List (Posting List):
• Lists the IDs of the documents in which a word appears, optionally with the word's positions.
• Example: For the word "bit" at positions 10, 12, and 18 of document #1:
bit: 1(10), 1(12), 1(18).
3. Dictionary:
• A sorted list of words with pointers to their inversion lists.
• Stores additional data, such as list lengths, for query optimization.
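
The three components above can be sketched in a few lines of Python. This is a minimal illustration, not a production index: the function name `build_index`, the whitespace tokenizer, and the sample documents are all assumptions made for the example.

```python
from collections import defaultdict

def build_index(documents):
    """Map each word to a posting list of (doc_id, position) pairs."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        # Positions are counted from 1, as in the "bit" example.
        for position, word in enumerate(text.lower().split(), start=1):
            index[word].append((doc_id, position))
    # The dictionary is the sorted vocabulary plus each list's length.
    dictionary = {word: len(index[word]) for word in sorted(index)}
    return dict(index), dictionary

# Hypothetical document file: doc_id -> text
docs = {1: "bit by bit the bit stream grows", 2: "a byte is eight bits"}
index, dictionary = build_index(docs)
print(index["bit"])       # posting list for "bit"
print(dictionary["bit"])  # its length, usable for query optimization
```

Storing the list length in the dictionary lets a query processor start with the shortest list when combining terms.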

Features
• Zoning: Improves precision by restricting the search to specific document sections.
• Weights: Assigns importance to words in the inversion list.
• Special Dictionaries: Handles unique data like dates or numbers efficiently.
• Ranking: Reorganizes results in ranked order for better relevance.

Search Process
1. Retrieve inversion lists for query terms.
2. Apply logical operations to find matching documents.
3. Refine results using weights and ranking algorithms.
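
The three search steps can be sketched as follows. This is an illustrative sketch only: the index layout (word mapped to a dictionary of document ID and term frequency), the function name `search_and`, and the use of summed term frequency as the weight are assumptions, standing in for whatever weighting scheme a real system uses.

```python
def search_and(index, terms):
    """Return doc IDs containing every term, best-scoring first."""
    # Step 1: retrieve the inversion list for each query term.
    postings = [index.get(term, {}) for term in terms]
    if not postings:
        return []
    # Step 2: logical AND is an intersection of the document IDs.
    matches = set(postings[0])
    for plist in postings[1:]:
        matches &= set(plist)

    # Step 3: rank by summed term frequency (an assumed, simple weight).
    def score(doc_id):
        return sum(plist[doc_id] for plist in postings)

    return sorted(matches, key=lambda doc_id: (-score(doc_id), doc_id))

# Hypothetical index: word -> {doc_id: term frequency}
index = {
    "inverted": {1: 2, 2: 1, 3: 1},
    "file": {1: 1, 3: 4},
    "signature": {2: 3},
}
results = search_and(index, ["inverted", "file"])
print(results)
```

An OR query would use a set union in step 2 instead of an intersection.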

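Using B-Trees
• A B-Tree can be used instead of a flat sorted dictionary to locate inversion lists.
• Properties of a B-Tree of order m:
  • Root node: between 2 and 2m keys.
  • All other internal nodes: between m and 2m keys.
  • Keys within a node are kept in sorted order.
  • Leaf nodes: all at the same or nearly the same level.
• Inversion lists are stored at the leaf level or reached through pointers from the leaves.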

Benefits
• Efficient retrieval of relevant documents.
• Optimized search through ranking and special handling of data such as dates and numbers.

Information Extraction
Information extraction is the process of identifying and extracting relevant facts, data, or text from
documents and organizing them into structured forms, such as database fields or summaries, for
easy access and use.
Information extraction involves two main processes:

1. Fact Extraction for Databases

• Identifies and extracts structured facts to store in database fields.
• Process Name: Automatic File Build.
• Evaluation Metrics:
  • Recall: How much of the relevant information in the item was correctly extracted.
  • Precision: How much of the extracted information is correct.
  • Overgeneration: How much irrelevant information was extracted.
  • Fallout: How often the system assigns incorrect slot fillers (wrong data in extracted fields).
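
As a rough illustration, the first three metrics can be computed over sets of extracted and reference facts. This sketch assumes a simplified model in which every extracted fact that is not in the reference set counts as irrelevant, so overgeneration comes out as 1 minus precision; the function name and sample facts are hypothetical. Fallout is left out because it also needs the number of possible incorrect slot fillers, which the notes do not define.

```python
def extraction_metrics(extracted, relevant):
    """Return (recall, precision, overgeneration) for two sets of facts."""
    extracted, relevant = set(extracted), set(relevant)
    correct = extracted & relevant
    # Recall: share of the relevant facts that were extracted.
    recall = len(correct) / len(relevant) if relevant else 0.0
    # Precision: share of the extracted facts that are correct.
    precision = len(correct) / len(extracted) if extracted else 0.0
    # Overgeneration: share of the extracted facts that are irrelevant.
    spurious = extracted - relevant
    overgeneration = len(spurious) / len(extracted) if extracted else 0.0
    return recall, precision, overgeneration

# Hypothetical slot fills for one item
relevant = {"name", "date", "place", "amount"}
extracted = {"name", "date", "color"}
recall, precision, overgeneration = extraction_metrics(extracted, relevant)
print(recall, precision, overgeneration)
```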

2. Text Extraction for Summarization

• Extracts key text to create a summary of an item.
• Goal: Capture the most important ideas while reducing the size of the content.
• Examples: Titles, tables of contents, and abstracts. An abstract serves as a summary for quick understanding or for search indexing.

XML and Hypertext Data Structures

Hypertext Data Structure
• Definition: Hypertext links one piece of information to another through embedded pointers (hyperlinks), making navigation between items easy.
• Storage: On the web, hypertext is typically stored in HTML (HyperText Markup Language) format.
• Features:
  • Used for displaying information on websites.
  • Allows users to click links and move between pages.

XML Data Structure
• Definition: XML (eXtensible Markup Language) organizes and stores data in a structured format.
• Structure:
  • A DTD (Document Type Definition) declares which elements and attributes a document may contain; the DOM (Document Object Model) represents a parsed document as a tree that programs can traverse.
  • Focuses on data meaning and organization, not just display.
• Advantages:
  • Helps in sharing data between systems.
  • Easy to read and process due to its hierarchical format.
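
To show what the hierarchical format looks like to a program, here is a small sketch using Python's standard `xml.etree.ElementTree` module. The `<library>` document, its element names, and its values are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document: tags describe what the data means, not how it looks.
xml_text = """
<library>
  <book id="b1"><title>Information Retrieval</title><year>1997</year></book>
  <book id="b2"><title>Text Indexing</title><year>2004</year></book>
</library>
"""

root = ET.fromstring(xml_text)  # parse the text into a tree of elements
titles = [book.findtext("title") for book in root.findall("book")]
first_id = root.find("book").get("id")  # attributes are read with get()
print(titles, first_id)
```

Because the tags name the data, another system can pull out the titles without knowing anything about how the records were displayed.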

Key Difference
• Hypertext (HTML) is for linking and displaying information on websites.
• XML is for structuring and exchanging data between systems.

Signature File Structure

• Signature file structures are based on coding words into fixed-length word signatures.
• A word signature is a fixed-length binary code in which specific bits are set to "1" by a hash function. Different words can set overlapping bits, so a signature match only indicates a possible hit that must be verified against the text.
• The word signatures for all words in a block are combined using a logical OR operation to create the block signature.

Key Points:
1. Partitioning Words:
• Words are grouped into blocks (e.g., 5 words per block).
• Each word gets a 16-bit signature with 5 bits set to "1".
2. Searching:
• Searches are done by matching bit patterns (templates): a block is a candidate when every "1" bit of the query word's signature is also set in the block signature.
3. Example:
For the text "Computer Science graduate students study" and block size = 5:
• Each word is converted into a 16-bit signature.
• The signatures are combined (OR operation) to form a block signature.
Example word signature for "Computer": 0001 0110 0000 0110
Final block signature: 1001 0111 1110 0110
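
The encoding and search steps can be sketched as below. The hash scheme here (SHA-256 of the word plus a counter) is an assumption chosen for the sketch, so the bit patterns it produces will not match the example signatures above; the function names are also hypothetical.

```python
import hashlib

SIG_BITS = 16  # signature length used in the notes
BITS_SET = 5   # number of bits set to 1 per word

def word_signature(word):
    """Hash a word to a 16-bit integer with exactly 5 bits set."""
    positions = set()
    counter = 0
    while len(positions) < BITS_SET:
        data = f"{word.lower()}:{counter}".encode()
        positions.add(hashlib.sha256(data).digest()[0] % SIG_BITS)
        counter += 1
    signature = 0
    for position in positions:
        signature |= 1 << position
    return signature

def block_signature(words):
    """OR the word signatures of one block together."""
    signature = 0
    for word in words:
        signature |= word_signature(word)
    return signature

def may_contain(block_sig, word):
    """True if every 1 bit of the word's signature is set in the block."""
    sig = word_signature(word)
    return block_sig & sig == sig

block = "Computer Science graduate students study".split()
sig = block_signature(block)
print(format(sig, "016b"), may_contain(sig, "Science"))
```

A `True` result only means the word might be in the block; a `False` result guarantees it is not.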

Applications and Advantages:

• Useful for medium-sized databases, databases with low-frequency terms, WORM (write once, read many) devices, and distributed systems.
• Works well in parallel processing environments.
• Limiting the number of words per block keeps signatures from filling up with "1s", which would cause false matches.

Stemming Algorithms

1. What is Stemming?
• Stemming reduces words to their base form (the stem).
• For example, connect, connected, connecting, and connection all become connect.
2. Why is Stemming Used?
• It improves search results by treating variants of a word as one term.
• It reduces the number of distinct terms in the index, making the system smaller and faster.
3. How Does it Help?
• Better Recall: A query matches documents that use any variant of the word.
• Simplified Data: Fewer distinct terms to store and process.
4. What to Watch Out For:
• Don't apply stemming to proper names (like IARE) or acronyms.
• Stemming can remove information needed to tell words apart, which can lower precision.
In short, stemming groups related words together, which makes searches faster and improves recall.

Porter Stemming Algorithm (Simplified Explanation):

The Porter Stemming Algorithm is used to reduce words to their root or stem form by applying a
set of rules and conditions. This helps in standardizing words for tasks like text search and indexing.

Key Concepts:
1. Consonants and Vowels:
• A consonant is any letter other than A, E, I, O, U, and other than Y preceded by a consonant. Any letter that is not a consonant is a vowel.
• The algorithm uses patterns of vowels (V) and consonants (C) to define word stems.
2. Measure (m):
• Every stem has the form [C](VC)^m[V], where C and V are runs of consonants and vowels; the measure m is the number of VC sequences.
• Example: In "running" (stem = run), the pattern is C followed by one VC, so m = 1.
3. Special Conditions:
• *X: Stem ends with the given letter X (for example, *S means the stem ends with S).
• *v*: Stem contains at least one vowel.
• *d: Stem ends with a double consonant (e.g., TT, SS).
• *o: Stem ends with a consonant-vowel-consonant sequence whose last consonant is not W, X, or Y (e.g., "hop", "wil").
4. Rules:
• Each rule checks for a specific suffix and, if its condition holds, replaces the suffix.
• The rules are grouped into steps that fix the order in which they are applied.
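
The measure can be computed by counting how often a vowel is followed by a consonant. The sketch below is an illustration with assumed function names, not Porter's reference code; it uses Porter's consonant definition, under which Y counts as a vowel only when it follows a consonant.

```python
def is_consonant(word, i):
    """Porter's definition: Y is a vowel only when it follows a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Count the VC sequences in the form [C](VC)^m[V]."""
    stem = stem.lower()
    m = 0
    prev_is_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        # Each vowel-to-consonant transition closes one VC sequence.
        if consonant and prev_is_vowel:
            m += 1
        prev_is_vowel = not consonant
    return m

for stem in ["tree", "run", "trouble", "private"]:
    print(stem, measure(stem))
```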

Examples of Rules:
Step  Condition            Suffix  Replacement    Example
1a    None                 sses    ss             stresses → stress
1b    *v*                  ing     Null           making → mak
1b1   None                 at      ate            inflat(ed) → inflate
1c    *v*                  y       i              happy → happi
2     m > 0                aliti   al             formaliti → formal
3     m > 0                icate   ic             duplicate → duplic
4     m > 1                able    Null           adjustable → adjust
5a    m > 1                e       Null           inflate → inflat
5b    m > 1 and *d and *L  Null    Single letter  controll → control
Step 1b1 is a cleanup that runs only after step 1b has removed a suffix. It also restores the final "e" when the stem has m = 1 and satisfies *o, so the Porter stemmer's output for "making" is "make".
How It Works:
1. The algorithm identifies suffixes and evaluates their conditions.
2. Based on these, it transforms the word step by step.
3. For example:
• "stresses" → replace "sses" with "ss" → "stress".
• "making" → remove "ing" → "mak" → step 1b1 cleanup adds "e" → "make".
This systematic process reduces words like "formalities", "happiness", and "duplicate" to their
stems ("formal", "happi", and "duplic").
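
As a concrete instance of one step, here is a sketch of step 1a alone, which has no conditions and simply tries its suffixes from longest to shortest. The names `STEP_1A` and `step_1a` are assumptions for the example; the remaining steps, which depend on m and the special conditions, are not implemented here.

```python
# Step 1a rules in order: the first matching suffix wins.
STEP_1A = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def step_1a(word):
    """Apply Porter step 1a (plural removal) to a single word."""
    word = word.lower()
    for suffix, replacement in STEP_1A:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

for word in ["Stresses", "formalities", "caress", "cats"]:
    print(word, "->", step_1a(word))
```

The "ss" rule maps the suffix to itself so that words like "caress" stop there instead of falling through to the final "s" rule.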

Dictionary Lookup Stemmers:

1. How They Work:
• These stemmers look a word up in a dictionary to find its root form (stem).
• If the dictionary lists the word, the word is replaced with the root form recorded there.
• If the word is not found, suffixes may be removed according to rules.
2. What It Uses:
• A dictionary of words.
• Exception lists for words that the suffix rules would mishandle, such as words that must keep a final "e" (e.g., "suites" becomes "suite", not "suit").
• Special rules for proper nouns (like names) to prevent stemming.
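
The lookup order described above might be sketched like this. Everything in the example is an assumption made for illustration: the function name `lookup_stem`, the tiny lexicon, the exception list, and the short suffix list bear no relation to the data files of any real stemmer.

```python
def lookup_stem(word, lexicon, exceptions, proper_nouns):
    """Return a stem using dictionary lookup with a rule-based fallback."""
    if word in proper_nouns:
        return word                    # never stem proper nouns
    lower = word.lower()
    if lower in exceptions:
        return exceptions[lower]       # exception list overrides the rules
    if lower in lexicon:
        return lower                   # already a known root form
    for suffix in ("ing", "ed", "es", "s"):
        candidate = lower[: -len(suffix)]
        # Accept a suffix removal only if the result is a dictionary word.
        if lower.endswith(suffix) and candidate in lexicon:
            return candidate
    return lower                       # unknown word: leave unchanged

# Hypothetical data
lexicon = {"suit", "suite", "connect"}
exceptions = {"suites": "suite"}
proper_nouns = {"IARE"}

for w in ["connected", "suites", "suited", "IARE", "zebras"]:
    print(w, "->", lookup_stem(w, lexicon, exceptions, proper_nouns))
```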

Successor Stemmers:
1. How They Work:
• These stemmers split a word into segments and choose one segment as the stem.
• The split points come from the successor variety of each prefix: the number of distinct letters that follow that prefix among the words of a corpus.
• Methods for choosing where to break the word:
  • Cutoff: Break wherever the successor variety reaches a chosen threshold.
  • Peak and Plateau: Break after a letter whose successor variety is higher than that of the letters before and after it.
  • Complete Word: Break where the prefix is itself a complete word in the corpus.
  • Entropy: Break according to the entropy of the distribution of successor letters.
In short, dictionary lookup stemmers replace words with their root forms based on a dictionary,
while successor stemmers break words into parts and choose the best one.
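
A successor stemmer that breaks at a peak can be sketched as follows. The sketch assumes successor variety means the number of distinct letters that follow a prefix among the corpus words, ignoring end-of-word; the function names and the small corpus are illustrative only.

```python
def successor_varieties(word, corpus):
    """For each prefix of word, count the distinct letters that follow it."""
    varieties = []
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        next_letters = {
            w[i] for w in corpus if w.startswith(prefix) and len(w) > i
        }
        varieties.append(len(next_letters))
    return varieties

def peak_split(word, corpus):
    """Break after the first prefix whose variety exceeds both neighbours."""
    sv = successor_varieties(word, corpus)
    for i in range(1, len(sv) - 1):
        if sv[i] > sv[i - 1] and sv[i] > sv[i + 1]:
            return word[: i + 1], word[i + 1:]
    return word, ""

# Small illustrative corpus
corpus = ["able", "ape", "beatable", "fixable", "read", "readable",
          "reading", "reads", "red", "rope", "ripe"]
print(successor_varieties("readable", corpus))
print(peak_split("readable", corpus))
```

The variety jumps after "read" because three different letters (a, i, s) can follow it, which marks the boundary between stem and suffix.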
