33 Vector Space Model For XML Retrieval

Uploaded by

Notes

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PPTX, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

12 views29 pages

33 Vector Space Model For XML Retrieval

Uploaded by

Notes

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PPTX, PDF, TXT or read online on Scribd

You are on page 1/ 29

Information Retrieval

Topic- XML Retrieval

(Vector space model for XML IR)
Lecture-33

Prepared By

Dr. Rasmita Rautray & Dr. Rasmita Dash

Associate Professor
Dept. of CSE
XML Retrieval
Content
• XML Retrieval
• XML Retrieval : Vector space model
• XML Retrieval : Assessment
IR and relational databases
• IR systems are often contrasted with relational databases (RDB).
• Traditionally, IR systems retrieve information from unstructured
text (“raw” text without markup).
• RDB systems are used for querying relational data: sets of
records that have values for predefined attributes such as
employee number, title and salary.
RDB search unstructured IR
objects records unstructured docs
main data structure table inverted index
model relational model vector space & others
queries SQL free text queries
• Some structured data sources containing text are best modeled as
structured documents rather than relational data ( Structured
retrieval).
Structured retrieval
• Basic setting: queries are structured or unstructured;
documents are structured.
Applications of structured retrieval
• Digital libraries, patent databases, blogs, tagged text
with entities like persons and locations (named entity
tagging).
Why RDB is not suitable in this case
• Three main problems :
• An unranked system (DB) would return a potentially large number
of articles that mention the Vatican, the Coliseum and sightseeing
tours without ranking them by relevance to the query.
• Difficult for users to precisely state structural constraints -
may not know which structured elements are supported by the
system.
tours AND(COUNTRY : Vatican OR
LANDMARK : Coliseum) ?
tours AND (STATE :Vatican OR BUILDING : Coliseum) ?
• Users may be completely unfamiliar with structured search and
advanced search interfaces or unwilling to use them.
• Solution: adapt ranked retrieval to structured documents to address
these problems.
Structured Retrieval

• Standard for encoding structured documents: Extensible

Markup Language ( XML)
• structured IR → XML IR
• also applicable to other types of markup (HTML, SGML, ...)
First challenge: document parts to retrieve
• Structured or XML retrieval: users want us to return parts of
documents (i.e., XML elements), not entire documents as IR
systems usually do in unstructured retrieval.
Example
• If we query Shakespeare’s plays for Macbeth’s castle, should
we return the scene, the act or the entire play?
– In this case, the user is probably looking for the scene.
– However, an otherwise unspecified search for Macbeth
should return the play of this name, not a subunit.
Solution: structured document retrieval principle
Second challenge: document parts to index

• Central notion for indexing and ranking in IR: document unit

or indexing unit.
• In unstructured retrieval, usually straightforward: files on your
desktop, email messages, web pages on the web etc.
• In structured retrieval, there are four main different
approaches to defining the indexing unit.
– non-overlapping pseudodocuments
– top down
– bottom up
– all
Third challenge: nested elements
• Because of the redundancy caused by nested elements it is
common to restrict the set of elements eligible for retrieval.
Restriction strategies include:
– discard all small elements
– discard all element types that users do not look at (working
XML retrieval system logs)
– discard all element types that assessors generally do not
judge to be relevant (if relevance assessments are
available)
– only keep element types that a system designer or librarian
has deemed to be useful search results
• In most of these approaches, result sets will still contain nested
elements.
Third challenge: nested elements

Further techniques:
• remove nested elements in a post processing step to reduce
redundancy.
• collapse several nested elements in the results list and use
highlighting of query terms to draw the user’s attention to
the relevant passages.
Nested elements and term statistics
• Further challenge related to nesting: we may need to distinguish
different contexts of a term when we compute term statistics for
ranking, in particular inverse document frequency (idf).
Example
• The term Gates under the node author is unrelated to an
occurrence under a content node like section if used to refer to
the plural of gate. It makes little sense to compute a single
document frequency for Gates in this example.
Solution: compute idf for XML-context term pairs.
• sparse data problems (many XML-context pairs occur too rarely
to reliably estimate df)
compromise: consider the parent node x of the term and not the
rest of the path from the root to x to distinguish contexts.
Vector space model for XML IR
Main idea: lexicalised subtrees
• Aim: to have each dimension of the vector space encode a
word together with its position within the XML tree.
• How: Map XML documents to lexicalised subtrees.
Main idea: lexicalised subtrees
• Take each text node (leaf) and break it into multiple nodes, one
for each word. E.g. split Bill Gates into Bill and Gates.
• Define the dimensions of the vector space to be lexicalized
subtrees of documents – subtrees that contain at least one
vocabulary term.
Lexicalised subtrees
• We can now represent queries and documents as vectors in
this space of lexicalized subtrees and compute matches
between them,
e.g. using the vector space formalism.
Vector space formalism in unstructured VS. structured IR
• The main difference is that the dimensions of vector space
in unstructured retrieval are vocabulary terms whereas
they are lexicalized subtrees in XML retrieval.
Structural term
• There is a tradeoff between the dimensionality of the space and
accuracy of query results.
– If we restrict dimensions to vocabulary terms, then we have
a standard vector space retrieval system that will retrieve
many documents that do not match the structure of the query
(e.g., Gates in the title as opposed to the author element).
– If we create a separate dimension for each lexicalized
subtree occurring in the collection, the dimensionality of the
space becomes too large.
Compromise: index all paths that end in a single vocabulary term,
in other words, all XML-context term pairs. We call such an
XML-context term pair a structural term and denote it by <c, t>: a
pair of XML-context c and vocabulary term t.
Context resemblance
• A simple measure of the similarity of a path cq in a query and a
path cd in a document is the following context resemblance
function CR:
1  c q
 if cq matches cd
C R (cq , cd ) 1  cd (1)

0
if cq does not match cd

• |cq| and |cd are the number of nodes in the query path and
document path, resp.
cq matches cd iff we can transform cq into cd by inserting
additional nodes.
Context resemblance example
Context resemblance exercise
Document similarity measure
Evaluation of XML Retrieval
Initiative for the Evaluation of XML Retrieval
(INEX)
• INEX: standard benchmark evaluation (yearly) that has produced
test collections (documents, sets of queries, and relevance
judgments).
• Based on IEEE journal collection (since 2006 INEX uses the
much larger English Wikipedia as a test collection).
• The relevance of documents is judged by human assessors.
INEX 2002 collection statistics
INEX topics
• Two types:
– content-only or CO topics: regular keyword queries as
in unstructured information retrieval
– content-and-structure or CAS topics: have structural
constraints in addition to keywords
• Since CAS queries have both structural and content criteria,
relevance assessments are more complicated than in
unstructured retrieval.
INEX relevance assessments
• INEX 2002 defined component coverage and topical relevance
as orthogonal dimensions of relevance.
Component coverage
• Evaluates whether the element retrieved is “structurally”
correct, i.e., neither too low nor too high in the tree.
We distinguish four cases:
– Exact coverage (E): The information sought is the main topic of the
component and the component is a meaningful unit of information.
– Too small (S): The information sought is the main topic of the
component, but the component is not a meaningful (self-contained)
unit of information.
– Too large (L): The information sought is present in the component, but
is not the main topic.
– No coverage (N): The information sought is not a topic of the
Component.
INEX relevance assessments
The topical relevance dimension also has four levels: highly
relevant (3), fairly relevant (2), marginally relevant (1) and
nonrelevant (0).

Combining the relevance dimensions

Components are judged on both dimensions and the judgments
are then combined into a digit-letter code, e.g. 2S is a fairly
relevant component that is too small. In theory, there are 16
combinations of coverage and relevance, but many cannot
occur. For example, a nonrelevant component cannot have
exact coverage, so the combination 3N is not possible.
INEX relevance assessments
• The relevance-coverage combinations are quantized as follows:

• This evaluation scheme takes account of the fact that binary

relevance judgments, which are standard in unstructured IR, are
not appropriate for XML retrieval. The quantization function Q
does not impose a binary choice relevant / non-relevant and
instead allows us to grade the component as partially relevant.
The number of relevant components in a retrieved set A of
components can then be computed as:
INEX evaluation measures

• Recent INEX focus: develop algorithms and evaluation

measures that return non-redundant results lists and
evaluate them properly.
Thank You

XML Notes
No ratings yet
XML Notes
49 pages
Information Retrieval and Web Search
No ratings yet
Information Retrieval and Web Search
69 pages
10 XML
No ratings yet
10 XML
42 pages
Chap 4 Text IR PDF
No ratings yet
Chap 4 Text IR PDF
19 pages
International Journal of Computational Engineering Research (IJCER)
No ratings yet
International Journal of Computational Engineering Research (IJCER)
7 pages
Jurnal XML 2
No ratings yet
Jurnal XML 2
7 pages
cs6200 f12 XML Retrieval
No ratings yet
cs6200 f12 XML Retrieval
45 pages
Text Search For Fine-Grained Semi-Structured Data: Soumen Chakrabarti Indian Institute of Technology, Bombay
No ratings yet
Text Search For Fine-Grained Semi-Structured Data: Soumen Chakrabarti Indian Institute of Technology, Bombay
22 pages
Information Retrieval: Adt-V Unit
No ratings yet
Information Retrieval: Adt-V Unit
106 pages
Emutye
No ratings yet
Emutye
20 pages
Visualizing Data Structures
From Everand
Visualizing Data Structures
Rhonda Hoenigman
No ratings yet
A Survey On Various Architectures, Models and Methodologies For Information Retrieval
No ratings yet
A Survey On Various Architectures, Models and Methodologies For Information Retrieval
13 pages
Efficient Searching On Data Using Forward Search
No ratings yet
Efficient Searching On Data Using Forward Search
8 pages
Information Retrieval Detailed Lecture Nov 2023
No ratings yet
Information Retrieval Detailed Lecture Nov 2023
39 pages
Question Answering
No ratings yet
Question Answering
68 pages
Efficient Data Mining For XML Queries - Answering Support: G. Seshadri Sekhar, Dr.S. Murali Krishna
No ratings yet
Efficient Data Mining For XML Queries - Answering Support: G. Seshadri Sekhar, Dr.S. Murali Krishna
10 pages
E-Commerce Data: Topic-7: Text Mining/Analytics
No ratings yet
E-Commerce Data: Topic-7: Text Mining/Analytics
37 pages
Language Independent Document
No ratings yet
Language Independent Document
10 pages
IRS Assignment
No ratings yet
IRS Assignment
8 pages
Unit V Easy To Learn
No ratings yet
Unit V Easy To Learn
21 pages
Relational Databases For Querying XML Documents
No ratings yet
Relational Databases For Querying XML Documents
14 pages
Chapter One IR
No ratings yet
Chapter One IR
18 pages
The Extensible Markup Language
No ratings yet
The Extensible Markup Language
6 pages
Baeza Yates 2003
No ratings yet
Baeza Yates 2003
8 pages
IR-Lec1 - Ch1-2023
No ratings yet
IR-Lec1 - Ch1-2023
41 pages
Ccs369-Unit 3
No ratings yet
Ccs369-Unit 3
28 pages
Lec2 2
No ratings yet
Lec2 2
17 pages
Apweb 2014
No ratings yet
Apweb 2014
8 pages
Ranking Clustered Keyword Search On Semi Structured Data
No ratings yet
Ranking Clustered Keyword Search On Semi Structured Data
7 pages
Bulu
No ratings yet
Bulu
47 pages
Fqas 2009
No ratings yet
Fqas 2009
12 pages
8 Info-Retrieval PDF
No ratings yet
8 Info-Retrieval PDF
60 pages
A New Method For Applicant of Explicit Semantic Analysis and Word Sense Disambiguation in Concept-Based Information Retrieval
No ratings yet
A New Method For Applicant of Explicit Semantic Analysis and Word Sense Disambiguation in Concept-Based Information Retrieval
10 pages
IRS 2nd Chap
No ratings yet
IRS 2nd Chap
42 pages
The Sesame Lucenesail: RDF Queries With Full-Text Search: Nepomuk Technical Report 2008-1
No ratings yet
The Sesame Lucenesail: RDF Queries With Full-Text Search: Nepomuk Technical Report 2008-1
14 pages
Web Information Retrieval
No ratings yet
Web Information Retrieval
10 pages
2008d Sigirforum Murdock
No ratings yet
2008d Sigirforum Murdock
4 pages
Unit - I - IR
No ratings yet
Unit - I - IR
39 pages
Informaiton Retrieval and Web Search
No ratings yet
Informaiton Retrieval and Web Search
44 pages
L02-IR Models MMN
No ratings yet
L02-IR Models MMN
27 pages
Relational Databases For Querying XML Documents Limitations and Opportunities
No ratings yet
Relational Databases For Querying XML Documents Limitations and Opportunities
13 pages
1 introIR
No ratings yet
1 introIR
15 pages
10 1 1 68 8932
No ratings yet
10 1 1 68 8932
426 pages
Acm Iconiaac 2014
No ratings yet
Acm Iconiaac 2014
8 pages
Unit3 QueryLanguages Berlin
No ratings yet
Unit3 QueryLanguages Berlin
29 pages
Unit 1: Introduction and Data Pre-Processing
No ratings yet
Unit 1: Introduction and Data Pre-Processing
71 pages
Semantic News Finder: A Semantic Retrieval From News Items: M.Thangaraj G.Sujatha
No ratings yet
Semantic News Finder: A Semantic Retrieval From News Items: M.Thangaraj G.Sujatha
9 pages
UNIT 1 Notes
No ratings yet
UNIT 1 Notes
16 pages
Ch2 - IR and LT
No ratings yet
Ch2 - IR and LT
45 pages
First Frequent Pattern-Tree Based XML Pattern Fragment Growth Method For Web Contents
No ratings yet
First Frequent Pattern-Tree Based XML Pattern Fragment Growth Method For Web Contents
6 pages
Unit 5 - Query Operations and Languages
No ratings yet
Unit 5 - Query Operations and Languages
11 pages
Introduction To: Information Retrieval
No ratings yet
Introduction To: Information Retrieval
69 pages
IRS Unit 4 by Krishna
No ratings yet
IRS Unit 4 by Krishna
23 pages
Applications of NLP
No ratings yet
Applications of NLP
48 pages
2 Introduction To Information Retrieval
No ratings yet
2 Introduction To Information Retrieval
38 pages
Ch1 IR
No ratings yet
Ch1 IR
39 pages
ISE Information Retrieval Mod-V
No ratings yet
ISE Information Retrieval Mod-V
48 pages
Unit 2
No ratings yet
Unit 2
58 pages
NLP - Module 5
No ratings yet
NLP - Module 5
58 pages
Unit 4 Information Retrieval
No ratings yet
Unit 4 Information Retrieval
20 pages
29 Relevance Feedback and Rocchio Algo
No ratings yet
29 Relevance Feedback and Rocchio Algo
23 pages
Chapter 3 Basics of Cloud Computing
No ratings yet
Chapter 3 Basics of Cloud Computing
29 pages
PIP Minor Assignment - 5 (Saswat Mohanty - 1941012407 - CSE-D)
No ratings yet
PIP Minor Assignment - 5 (Saswat Mohanty - 1941012407 - CSE-D)
8 pages
USP Minor Assignment - 2 (Saswat Mohanty - 1941012407 - CSE-D)
No ratings yet
USP Minor Assignment - 2 (Saswat Mohanty - 1941012407 - CSE-D)
37 pages
PIP Minor Assignment - 10 (Saswat Mohanty - 1941012407 - CSE-D)
No ratings yet
PIP Minor Assignment - 10 (Saswat Mohanty - 1941012407 - CSE-D)
11 pages
Sathi A Das 2003
No ratings yet
Sathi A Das 2003
10 pages
The Jolly Writer
No ratings yet
The Jolly Writer
272 pages
Eid444: E-Commerce: Faculty # A.S.Venkata Praneel
No ratings yet
Eid444: E-Commerce: Faculty # A.S.Venkata Praneel
25 pages
Open XML Deep Dive
No ratings yet
Open XML Deep Dive
81 pages
What Is XML
No ratings yet
What Is XML
4 pages
Document Management Techniques and Techn
No ratings yet
Document Management Techniques and Techn
9 pages
BCOM COMPUTER APPLICATIONS 5th SEM (New) E COMMERCE NOTES All Units by Bhanu
No ratings yet
BCOM COMPUTER APPLICATIONS 5th SEM (New) E COMMERCE NOTES All Units by Bhanu
75 pages

33 Vector Space Model For XML Retrieval

Uploaded by

33 Vector Space Model For XML Retrieval

Uploaded by

Information Retrieval

Topic- XML Retrieval

Dr. Rasmita Rautray & Dr. Rasmita Dash

• Standard for encoding structured documents: Extensible

• Central notion for indexing and ranking in IR: document unit

Combining the relevance dimensions

• This evaluation scheme takes account of the fact that binary

• Recent INEX focus: develop algorithms and evaluation

You might also like