
Chapter 4: Query Languages: Baeza-Yates, 1999 Modern Information Retrieval

This document discusses query languages for information retrieval. It covers keyword-based queries including single-word, context, Boolean, and natural language queries. It also discusses pattern matching queries and structural queries for fixed, hypertext, and hierarchical structures. Finally, it examines query protocols like Z39.50 and trends/issues in query language research.

Uploaded by

Tamizharasi A

Chapter 4 : Query Languages

Baeza-Yates, 1999
Modern Information Retrieval
Outline
• Keyword-Based Querying
• Pattern Matching
• Structural Queries
• Query Protocols
• Trends and Research Issues
Keyword-Based Querying
A query is the formulation of a user information need
Keyword-based queries are popular

The query types below range from data retrieval (1) toward information retrieval (4):
1. Single-Word Queries
2. Context Queries
3. Boolean Queries
4. Natural Language
Single-Word Queries
• A query is formulated by a word
• A document is formed by long sequences of words
• A word is a sequence of letters surrounded by separators
• What are letters and separators? e.g., ‘on-line’
The division of the text into words is not arbitrary
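
The effect of choosing letters and separators can be sketched as follows. This is a minimal illustration, not a method from the slides: the function name tokenize and the default letter set a-z are assumptions made for the example.

```python
import re

def tokenize(text, letters="a-z"):
    """Split text into words: maximal runs of characters from the 'letters' set.

    Every character outside the set acts as a separator.
    The letter set is a hypothetical parameter chosen for this sketch.
    """
    return re.findall(f"[{letters}]+", text.lower())

# Hyphen treated as a separator: 'on-line' becomes two words.
plain = tokenize("On-line retrieval")

# Hyphen treated as a letter: 'on-line' stays one word.
hyphen = tokenize("On-line retrieval", letters="a-z\\-")
```

Whether ‘on-line’ is indexed as one word or two decides which single-word queries can find it, which is why the division is not arbitrary.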
Context Queries
• Definition
  - Search words in a given context
• Types
  • Phrase
    > a sequence of single-word queries
    > e.g., enhance retrieval
  • Proximity
    > a sequence of single words or phrases, together with a maximum
      allowed distance between them
    > e.g., within distance (enhance, retrieval, 4) will match
      ‘…enhance the power of retrieval…’
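
A possible sketch of both context query types over a list of words. The function names are hypothetical, and two choices are assumptions of this example rather than statements from the slides: distance is counted in word positions, and the second term must come after the first.

```python
def positions(words, term):
    """Return every index at which term occurs in words."""
    return [i for i, w in enumerate(words) if w == term]

def phrase_match(words, phrase):
    """True if the words of phrase occur consecutively in words."""
    n = len(phrase)
    return any(words[i:i + n] == phrase for i in range(len(words) - n + 1))

def within_distance(words, first, second, max_dist):
    """True if second occurs at most max_dist word positions after first."""
    return any(0 < j - i <= max_dist
               for i in positions(words, first)
               for j in positions(words, second))

text = "to enhance the power of retrieval".split()
near = within_distance(text, "enhance", "retrieval", 4)
too_far = within_distance(text, "enhance", "retrieval", 3)
as_phrase = phrase_match(text, ["enhance", "retrieval"])
```

In the sample text the two words are four positions apart, so the proximity query with distance 4 matches while the phrase query does not.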
Boolean Queries
• Definition
  • A syntax composed of atoms that retrieve documents, and of
    Boolean operators that work on their operands
  • e.g., translation AND syntax OR syntactic

• Fuzzy Boolean
  • Retrieves documents that match only some of the operands (AND
    may require a document to appear in more operands than OR)
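
One way to picture Boolean evaluation is with set operations over a toy inverted index. Everything here is assumed for illustration: the index contents, the helper names atom and fuzzy, the min_match threshold, and the grouping translation AND (syntax OR syntactic), which the slide does not spell out.

```python
# Toy inverted index: term -> set of document ids (made-up data).
index = {
    "translation": {1, 2, 4},
    "syntax": {2, 3},
    "syntactic": {4, 5},
}

def atom(term):
    """An atom retrieves the set of documents containing the term."""
    return index.get(term, set())

# Boolean operators work on the document sets returned by their operands.
# Assumed grouping: translation AND (syntax OR syntactic).
result = atom("translation") & (atom("syntax") | atom("syntactic"))

def fuzzy(terms, min_match):
    """Fuzzy Boolean sketch: keep documents appearing in at least min_match operands."""
    counts = {}
    for term in terms:
        for doc in atom(term):
            counts[doc] = counts.get(doc, 0) + 1
    return {doc for doc, count in counts.items() if count >= min_match}

strict = fuzzy(["translation", "syntax", "syntactic"], 2)
loose = fuzzy(["translation", "syntax", "syntactic"], 1)
```

A higher min_match behaves more like AND and a lower one more like OR, which is one reading of the remark that AND may require more operands than OR.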
Natural Language
• Generalization of “fuzzy Boolean”
• A query is an enumeration of words and context queries
• All the documents matching a portion of the user query are retrieved
Pattern Matching
• Data retrieval
• A pattern is a set of syntactic features that must occur in a text segment
• Types
  • Words
  • Prefixes
    e.g., ‘comput’ -> ‘computer’, ‘computation’, ‘computing’, etc.
  • Suffixes
    e.g., ‘ters’ -> ‘computers’, ‘testers’, ‘painters’, etc.
  • Substrings
    e.g., ‘tal’ -> ‘coastal’, ‘talk’, ‘metallic’, etc.
  • Ranges
    e.g., between ‘held’ and ‘hold’ -> ‘hoax’ and ‘hissing’
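
The pattern types above can be tried on a small vocabulary. The word list reuses the examples from the slide; the function names are hypothetical, and the range query is assumed to mean lexicographic order.

```python
vocabulary = ["coastal", "computation", "computer", "computers", "computing",
              "hissing", "hoax", "metallic", "painters", "talk", "testers"]

def by_prefix(prefix):
    """Words that start with the given string."""
    return [w for w in vocabulary if w.startswith(prefix)]

def by_suffix(suffix):
    """Words that end with the given string."""
    return [w for w in vocabulary if w.endswith(suffix)]

def by_substring(sub):
    """Words that contain the given string anywhere."""
    return [w for w in vocabulary if sub in w]

def by_range(low, high):
    """Words lying between low and high in lexicographic order (assumed meaning)."""
    return [w for w in vocabulary if low <= w <= high]
```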
Allowing errors
• Retrieve all text words that are ‘similar’ to the given word
• edit distance:
  the minimum number of character insertions, deletions, and
  replacements needed to make two strings equal,
  e.g., ‘flower’ and ‘flo wer’ (edit distance 1)
• maximum allowed edit distance:
  the query specifies the maximum number of allowed errors for a
  word to match the pattern
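
The edit distance defined above can be computed with the standard dynamic-programming recurrence. This is a minimal sketch with hypothetical function names; the slides do not prescribe an algorithm.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions, and replacements turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # replace ca by cb, or match
        prev = curr
    return prev[-1]

def matches(word, pattern, max_errors):
    """True if word is within the maximum allowed edit distance of pattern."""
    return edit_distance(word, pattern) <= max_errors
```

For example, matches('flo wer', 'flower', 1) holds because a single deletion of the space makes the strings equal.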
Regular expressions
• union: if e1 and e2 are regular expressions, then (e1|e2)
  matches what e1 or e2 matches
• concatenation: if e1 and e2 are regular expressions, the
  occurrences of (e1e2) are formed by the occurrences of e1
  immediately followed by those of e2
• repetition: if e is a regular expression, then (e*)
  matches a sequence of zero or more contiguous
  occurrences of e
• e.g., ‘pro(blem|tein)(s|ε)(0|1|2)*’ -> ‘problem2’ and
  ‘proteins’ (ε denotes the empty string)
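
The example expression can be checked with Python's re module. In this sketch the empty string ε is written as an empty alternative, (s|), and the helper name matches_pattern is hypothetical.

```python
import re

# Union (|), concatenation, and repetition (*) as in the slide's example.
# The empty alternative in (s|) plays the role of the empty string.
PATTERN = re.compile(r"pro(blem|tein)(s|)(0|1|2)*")

def matches_pattern(word):
    """True if the whole word is an occurrence of the expression."""
    return PATTERN.fullmatch(word) is not None
```

fullmatch is used so that the entire word has to match; re.search would also accept words that merely contain an occurrence.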
Structural Queries
• Mixing contents and structure in queries
  - contents: words, phrases, or patterns
  - structural constraints: containment, proximity,
    or other restrictions on structural elements
• Three main structures
  - Fixed structure
  - Hypertext structure
  - Hierarchical structure
Fixed Structure
Document: a fixed set of fields
e.g., a mail has a sender, a receiver, a date, a subject, and a body field
Query example: search for the mails sent to a given person with “football” in the
subject field
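
The mail query can be sketched with a record type that has exactly the five fields named above. The class name Mail, the function search, and the sample addresses and dates are all made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class Mail:
    sender: str
    receiver: str
    date: str
    subject: str
    body: str

def search(mails, receiver, subject_word):
    """Mails sent to receiver whose subject field contains subject_word."""
    word = subject_word.lower()
    return [m for m in mails
            if m.receiver == receiver and word in m.subject.lower().split()]

# Hypothetical sample data.
mails = [
    Mail("ann@example.org", "bob@example.org", "1999-05-01",
         "Football on Sunday", "See you there"),
    Mail("carl@example.org", "bob@example.org", "1999-05-02",
         "Meeting notes", "football was mentioned"),
    Mail("ann@example.org", "dora@example.org", "1999-05-03",
         "football tickets", "Two left"),
]
found = search(mails, "bob@example.org", "football")
```

Only the first mail qualifies: the second mentions football in the body rather than the subject, and the third went to a different receiver.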
Hypertext
A hypertext is a directed graph in which the nodes hold some
text (text contents) and
the links represent connections between nodes or
between positions inside nodes (structural connectivity)
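
A hypertext of this kind can be modelled with two dictionaries, one for node text and one for links. The sketch below searches only nodes reachable within a few links of a starting node; this neighbourhood search, the node names, and the function name are assumptions of the example, not a description of any particular system.

```python
from collections import deque

# Text contents: node -> text (made-up data).
nodes = {
    "home": "query languages overview",
    "kw": "keyword based querying",
    "pm": "pattern matching and regular expressions",
    "far": "pattern queries on trees",
}
# Structural connectivity: node -> nodes it links to.
links = {"home": ["kw", "pm"], "kw": ["home"], "pm": ["far"], "far": []}

def search_neighborhood(start, word, max_hops):
    """Nodes within max_hops links of start whose text contains word."""
    seen = {start}
    queue = deque([(start, 0)])
    hits = []
    while queue:
        node, hops = queue.popleft()
        if word in nodes[node].split():
            hits.append(node)
        if hops < max_hops:
            for nxt in links.get(node, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, hops + 1))
    return hits
```

The query combines a content condition (the word) with a structural one (the link distance from the starting node).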
Hypertext : WebGlimpse

WebGlimpse: combines browsing and searching on the Web
Hierarchical Structure
• PAT Expressions
• Overlapped Lists
• Lists of References
• Proximal Nodes
• Tree Matching
Query Protocols
• Z39.50
• WAIS (Wide Area Information Service)
Z39.50
• American National Standard Information
  Retrieval Application Service Definition
• Can be implemented on any platform
• Queries bibliographical information using a
  standard interface between the client and the
  host database manager
• The Z39.50 protocol is part of WAIS
Z39.50 Brief history
• Z39.50-1988 (version 1)
• Z39.50-1992 (version 2)
• Z39.50-1995 (version 3)
• Version 4: development began in autumn 1995
Using Z39.50 over the WWW

[Figure: components shown are WWW Client, WWW Server, Z39.50 Client, Z39.50 Server, Repository, and Digital library]
WAIS (Wide Area Information Service)

• Beginning in the 1990s
• Query databases through the Internet
Trends and Research Issues

Model          Queries allowed
-------------  --------------------
Boolean        word, set operations
Vector         words
Probabilistic  words
BBN            words

Table: Relationship between types of queries and models

[Figure: Query Language Taxonomy. The types of queries covered and how they are structured]


PAT Expressions
• The model allows for the areas of a region to
  overlap or nest
Overlapped Lists
• The model allows for the areas of a region to
  overlap, but not to nest
• It is not clear whether overlapping is good or
  not for capturing the structural properties
Lists of References
• Overlapping and nesting are not allowed
• All elements must be of the same type, e.g., only
  sections, or only paragraphs
• A reference is a pointer to a region of the
  database
Proximal Nodes
• This model tries to find a good compromise
  between expressiveness and efficiency
• It does not define a specific language, but a
  model in which it is shown that a number of
  useful operators can be included while achieving good
  efficiency
Tree Matching
• The leaves of the query can be not only
  structural elements but also text patterns,
  meaning that the ancestor of the leaf must
  contain that pattern
