
Information Retrieval

Topic- XML Retrieval


(Vector space model for XML IR)
Lecture-33

Prepared By

Dr. Rasmita Rautray & Dr. Rasmita Dash


Associate Professor
Dept. of CSE
XML Retrieval
Content
• XML Retrieval
• XML Retrieval : Vector space model
• XML Retrieval : Assessment
IR and relational databases
• IR systems are often contrasted with relational databases (RDB).
• Traditionally, IR systems retrieve information from unstructured
text (“raw” text without markup).
• RDB systems are used for querying relational data: sets of
records that have values for predefined attributes such as
employee number, title and salary.
                      RDB search           unstructured IR
objects               records              unstructured docs
main data structure   table                inverted index
model                 relational model     vector space & others
queries               SQL                  free text queries
• Some structured data sources containing text are best modeled as structured documents rather than relational data (structured retrieval).
Structured retrieval
• Basic setting: queries are structured or unstructured;
documents are structured.
Applications of structured retrieval
• Digital libraries, patent databases, blogs, tagged text
with entities like persons and locations (named entity
tagging).
Why RDB is not suitable in this case
• Consider the query: articles about sightseeing tours of the Vatican and the Coliseum.
• Three main problems:
• An unranked system (DB) would return a potentially large number of articles that mention the Vatican, the Coliseum and sightseeing tours without ranking them by relevance to the query.
• Difficult for users to precisely state structural constraints: they may not know which structured elements are supported by the system.
tours AND (COUNTRY: Vatican OR LANDMARK: Coliseum)?
tours AND (STATE: Vatican OR BUILDING: Coliseum)?
• Users may be completely unfamiliar with structured search and
advanced search interfaces or unwilling to use them.
• Solution: adapt ranked retrieval to structured documents to address
these problems.
Structured Retrieval
• Standard for encoding structured documents: Extensible Markup Language (XML)
• structured IR → XML IR
• also applicable to other types of markup (HTML, SGML, ...)
First challenge: document parts to retrieve
• Structured or XML retrieval: users want us to return parts of
documents (i.e., XML elements), not entire documents as IR
systems usually do in unstructured retrieval.
Example
• If we query Shakespeare’s plays for Macbeth’s castle, should
we return the scene, the act or the entire play?
– In this case, the user is probably looking for the scene.
– However, an otherwise unspecified search for Macbeth
should return the play of this name, not a subunit.
Solution: apply the structured document retrieval principle, which says a system should always retrieve the most specific part of a document answering the query.
Second challenge: document parts to index
• Central notion for indexing and ranking in IR: the document unit or indexing unit.
• In unstructured retrieval, this is usually straightforward: files on your desktop, email messages, web pages, etc.
• In structured retrieval, there are four main approaches to defining the indexing unit:
– non-overlapping pseudodocuments
– top down
– bottom up
– all
Third challenge: nested elements
• Because of the redundancy caused by nested elements, it is common to restrict the set of elements eligible for retrieval.
Restriction strategies include:
– discard all small elements
– discard all element types that users do not look at (working
XML retrieval system logs)
– discard all element types that assessors generally do not
judge to be relevant (if relevance assessments are
available)
– only keep element types that a system designer or librarian
has deemed to be useful search results
• In most of these approaches, result sets will still contain nested
elements.
Third challenge: nested elements
Further techniques:
• remove nested elements in a post-processing step to reduce redundancy (a sketch follows below).
• collapse several nested elements in the results list and use highlighting of query terms to draw the user’s attention to the relevant passages.
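For concreteness, a minimal Python sketch of the first technique (function and data layout are illustrative, not from the lecture): keep a result only if no already-kept result is its ancestor or descendant.

```python
# Sketch: prune nested elements from a ranked result list. Each result is
# the path of element names from the root, e.g. ("play", "act", "scene").
def remove_nested(ranked_results):
    kept = []
    for path in ranked_results:
        # k is an ancestor of path if k is a prefix of path;
        # k is a descendant of path if path is a prefix of k.
        nested = any(k == path[:len(k)] or path == k[:len(path)]
                     for k in kept)
        if not nested:
            kept.append(path)
    return kept

results = [("play", "act", "scene"), ("play",), ("play", "act2")]
print(remove_nested(results))  # [('play', 'act', 'scene'), ('play', 'act2')]
```

Because the list is processed in rank order, the highest-ranked of any nested pair survives.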
Nested elements and term statistics
• Further challenge related to nesting: we may need to distinguish
different contexts of a term when we compute term statistics for
ranking, in particular inverse document frequency (idf).
Example
• The term Gates under the node author is unrelated to an
occurrence under a content node like section if used to refer to
the plural of gate. It makes little sense to compute a single
document frequency for Gates in this example.
Solution: compute idf for XML-context term pairs.
• Problem: sparse data; many XML-context pairs occur too rarely to reliably estimate df.
• Compromise: consider only the parent node x of the term, not the rest of the path from the root to x, to distinguish contexts (sketched below).
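A minimal Python sketch of this compromise (names are illustrative): document frequency is keyed by (parent tag, term) pairs rather than by the term alone.

```python
import math
import xml.etree.ElementTree as ET
from collections import defaultdict

def context_term_df(xml_docs):
    """df over (parent tag, term) pairs; context = the term's parent node only."""
    df = defaultdict(int)
    for xml in xml_docs:
        seen = set()
        for node in ET.fromstring(xml).iter():
            for term in (node.text or "").lower().split():
                seen.add((node.tag, term))
        for pair in seen:          # count each pair once per document
            df[pair] += 1
    return df

docs = ["<book><author>Bill Gates</author></book>",
        "<book><section>the gates of the city</section></book>"]
df = context_term_df(docs)
N = len(docs)
idf = {pair: math.log(N / df_c) for pair, df_c in df.items()}
# ('author', 'gates') and ('section', 'gates') now get separate idf values.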
Vector space model for XML IR
Main idea: lexicalised subtrees
• Aim: to have each dimension of the vector space encode a
word together with its position within the XML tree.
• How: Map XML documents to lexicalised subtrees.
Main idea: lexicalised subtrees
• Take each text node (leaf) and break it into multiple nodes, one
for each word. E.g. split Bill Gates into Bill and Gates.
• Define the dimensions of the vector space to be lexicalized
subtrees of documents – subtrees that contain at least one
vocabulary term.
Lexicalised subtrees
• We can now represent queries and documents as vectors in this space of lexicalized subtrees and compute matches between them, e.g. using the vector space formalism.
Vector space formalism in unstructured vs. structured IR
• The main difference is that the dimensions of vector space
in unstructured retrieval are vocabulary terms whereas
they are lexicalized subtrees in XML retrieval.
Structural term
• There is a tradeoff between the dimensionality of the space and
accuracy of query results.
– If we restrict dimensions to vocabulary terms, then we have
a standard vector space retrieval system that will retrieve
many documents that do not match the structure of the query
(e.g., Gates in the title as opposed to the author element).
– If we create a separate dimension for each lexicalized
subtree occurring in the collection, the dimensionality of the
space becomes too large.
Compromise: index all paths that end in a single vocabulary term,
in other words, all XML-context term pairs. We call such an
XML-context term pair a structural term and denote it by <c, t>: a
pair of XML-context c and vocabulary term t.
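A Python sketch of structural-term extraction (the tuple encoding of contexts is an illustrative choice): every term is paired with the path of element names from the root to its enclosing node.

```python
import xml.etree.ElementTree as ET

def structural_terms(xml_string):
    """Return all <c, t> pairs: c = path from root to the term's parent."""
    pairs = []
    def walk(node, path):
        path = path + (node.tag,)
        for term in (node.text or "").lower().split():
            pairs.append((path, term))
        for child in node:
            walk(child, path)
    walk(ET.fromstring(xml_string), ())
    return pairs

print(structural_terms("<book><author>Bill Gates</author></book>"))
# [(('book', 'author'), 'bill'), (('book', 'author'), 'gates')]
```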
Context resemblance
• A simple measure of the similarity of a path cq in a query and a
path cd in a document is the following context resemblance
function CR:
1  c q
 if cq matches cd
C R (cq , cd ) 1  cd (1)

0
if cq does not match cd

• |cq| and |cd| are the number of nodes in the query path and document path, respectively.
• cq matches cd iff we can transform cq into cd by inserting additional nodes.
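Equation (1) translates directly into code; a small Python sketch, where "cq matches cd" is tested as an order-preserving subsequence check (cd can be obtained from cq by inserting nodes):

```python
def matches(cq, cd):
    """True iff cq can be turned into cd by inserting additional nodes,
    i.e. cq is an order-preserving subsequence of cd."""
    it = iter(cd)
    return all(node in it for node in cq)

def cr(cq, cd):
    """Context resemblance, equation (1)."""
    return (1 + len(cq)) / (1 + len(cd)) if matches(cq, cd) else 0.0

print(cr(("book", "title"), ("book", "title")))        # 1.0 (exact match)
print(cr(("title",), ("book", "chapter", "title")))    # 2/4 = 0.5
```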
Context resemblance example
Context resemblance exercise
Document similarity measure
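The body of this slide did not survive conversion. The document similarity measure in the IIR textbook, which this lecture otherwise follows, is SimNoMerge; below is a minimal Python sketch under that assumption. It reuses cr from above; queries and documents are dicts mapping structural terms (context, term) to weights.

```python
import math

def sim_no_merge(query_weights, doc_weights):
    """Score every query structural term against every document context
    containing the same term, discounted by context resemblance CR."""
    norm = math.sqrt(sum(w * w for w in doc_weights.values()))
    if norm == 0:
        return 0.0
    score = 0.0
    for (c_q, t_q), w_q in query_weights.items():
        for (c_d, t_d), w_d in doc_weights.items():
            if t_q == t_d:
                score += cr(c_q, c_d) * w_q * (w_d / norm)
    return score
```

Note that this is not a true cosine similarity: only the document vector is length-normalized, and contexts that merely resemble each other (CR strictly between 0 and 1) still contribute partial credit.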
Evaluation of XML Retrieval
Initiative for the Evaluation of XML Retrieval
(INEX)
• INEX: standard benchmark evaluation (yearly) that has produced
test collections (documents, sets of queries, and relevance
judgments).
• Based on IEEE journal collection (since 2006 INEX uses the
much larger English Wikipedia as a test collection).
• The relevance of documents is judged by human assessors.
INEX 2002 collection statistics
INEX topics
• Two types:
– content-only or CO topics: regular keyword queries as
in unstructured information retrieval
– content-and-structure or CAS topics: have structural
constraints in addition to keywords
• Since CAS queries have both structural and content criteria,
relevance assessments are more complicated than in
unstructured retrieval.
INEX relevance assessments
• INEX 2002 defined component coverage and topical relevance
as orthogonal dimensions of relevance.
Component coverage
• Evaluates whether the element retrieved is “structurally”
correct, i.e., neither too low nor too high in the tree.
We distinguish four cases:
– Exact coverage (E): The information sought is the main topic of the
component and the component is a meaningful unit of information.
– Too small (S): The information sought is the main topic of the
component, but the component is not a meaningful (self-contained)
unit of information.
– Too large (L): The information sought is present in the component, but
is not the main topic.
– No coverage (N): The information sought is not a topic of the component.
INEX relevance assessments
The topical relevance dimension also has four levels: highly
relevant (3), fairly relevant (2), marginally relevant (1) and
nonrelevant (0).

Combining the relevance dimensions
Components are judged on both dimensions and the judgments are then combined into a digit-letter code, e.g. 2S is a fairly relevant component that is too small. In theory, there are 16 combinations of coverage and relevance, but many cannot occur. For example, a nonrelevant component cannot have exact coverage, so the combination 0E is not possible.
INEX relevance assessments
• The relevance-coverage combinations are quantized as follows:
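The quantization table itself did not survive conversion; reconstructed from the IIR textbook that these slides follow, it reads:

$$\mathrm{Q}(rel, cov) = \begin{cases} 1.00 & \text{if } (rel, cov) = 3E \\ 0.75 & \text{if } (rel, cov) \in \{2E, 3L\} \\ 0.50 & \text{if } (rel, cov) \in \{1E, 2L, 2S\} \\ 0.25 & \text{if } (rel, cov) \in \{1S, 1L\} \\ 0.00 & \text{if } (rel, cov) = 0N \end{cases}$$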

• This evaluation scheme takes account of the fact that binary relevance judgments, which are standard in unstructured IR, are not appropriate for XML retrieval. The quantization function Q does not impose a binary choice relevant/nonrelevant and instead allows us to grade the component as partially relevant.
• The number of relevant components in a retrieved set A of components can then be computed as:
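The formula that belongs here (reconstructed from the same source) simply sums the quantized scores over the retrieved set:

$$\#(\text{relevant items retrieved}) = \sum_{c \in A} \mathrm{Q}(rel(c), cov(c))$$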
INEX evaluation measures
• Recent INEX focus: develop algorithms and evaluation measures that return non-redundant results lists and evaluate them properly.
Thank You
