Unit3 QueryLanguages Berlin
• Data retrieval
– Pattern-based querying
– Retrieve docs that contain (or exactly match) the
objects that satisfy the conditions clearly specified in
the query
– A single erroneous object implies failure!
• Information retrieval
– Keyword-based querying
– Retrieve relevant docs in response to the query (the
formulation of a user information need)
– Allow the answer to be ranked
The Kinds of Queries
Keyword-based Querying
• Keywords
– Words that can be used for retrieval by a query
– A small set of words extracted from the docs
• Preprocessing is needed
Keyword-based Querying
• Context queries
– Complement single-word queries with the ability to
search words in a given context, i.e., near other
words
Keyword-based Querying
• Proximity
– A relaxed version of the phrase query
– A sequence of single words (or phrases) is
given together with a maximum allowed
distance between them
– May not consider word ordering
» E.g., two keywords occur within four words
D: “…enhance the power of retrieval…”
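The proximity check can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization and measuring distance as the difference between word positions; the function name `within_distance` and the `ordered` flag are hypothetical, not taken from the slides.

```python
def within_distance(text, first, second, max_distance, ordered=False):
    """Return True if `first` and `second` occur at most `max_distance`
    word positions apart. With ordered=True, `first` must come before `second`."""
    # Assumption: words are split on whitespace and stripped of punctuation.
    words = [w.strip('.,;:!?"“”…').lower() for w in text.split()]
    pos_first = [i for i, w in enumerate(words) if w == first.lower()]
    pos_second = [i for i, w in enumerate(words) if w == second.lower()]
    for i in pos_first:
        for j in pos_second:
            gap = j - i
            if ordered and gap <= 0:
                continue  # word order is enforced only when requested
            if 0 < abs(gap) <= max_distance:
                return True
    return False


doc = "enhance the power of retrieval"
print(within_distance(doc, "enhance", "retrieval", 4))
print(within_distance(doc, "retrieval", "enhance", 4, ordered=True))
```

With `ordered=False` the check ignores word ordering, which corresponds to the relaxed behaviour noted on the slide.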
Keyword-based Querying
• Boolean Queries
– Have a syntax composed of atoms (basic queries)
that retrieve docs, and of Boolean operators which
work on their operands
– Example query syntax tree: translation AND (syntax OR syntactic)
• Leaves: basic queries
• Internal nodes: operators
• AND, e.g. (e1 AND e2)
– Select all docs which satisfy both e1 and e2
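A Boolean query tree can be evaluated bottom-up with set operations. The sketch below is illustrative only: it assumes a toy inverted index (`INDEX`, with made-up doc ids) and represents the query as nested tuples, with leaves as basic queries and internal nodes as operators.

```python
# Hypothetical inverted index: term -> set of doc ids containing it.
INDEX = {
    "translation": {1, 2, 3},
    "syntax": {2, 4},
    "syntactic": {3, 5},
}


def evaluate(node, index):
    """Evaluate a query tree: a leaf is a term (str),
    an internal node is a tuple (operator, left, right)."""
    if isinstance(node, str):
        return index.get(node, set())  # basic query: docs containing the term
    op, left, right = node
    a, b = evaluate(left, index), evaluate(right, index)
    if op == "AND":
        return a & b  # docs satisfying both operands
    if op == "OR":
        return a | b  # docs satisfying either operand
    raise ValueError(f"unknown operator: {op}")


query = ("AND", "translation", ("OR", "syntax", "syntactic"))
print(sorted(evaluate(query, INDEX)))
```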
Keyword-based Querying
• Natural language
– Push the fuzzy Boolean model even further
• The distinction between AND and OR is
completely blurred
– A query is an enumeration of words and context
queries
– All the documents matching a portion of the user
query are retrieved
• Docs matching more parts of the query assigned a
higher ranking
– Negation can also be handled by penalizing the
ranking score
• E.g. some words are not desired
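One simple way to realize this ranking is to count matched query words and subtract a penalty for undesired words. The scoring formula, the `penalty` parameter, and the sample documents below are assumptions made for illustration; the slides do not prescribe a specific scoring scheme.

```python
def rank(docs, wanted, unwanted=(), penalty=1.0):
    """Rank docs that match at least one wanted word.
    Score = number of wanted words matched - penalty * number of unwanted words present."""
    scored = []
    for doc_id, text in docs.items():
        words = set(text.lower().split())
        matched = sum(1 for w in wanted if w in words)
        if matched == 0:
            continue  # only docs matching some portion of the query are retrieved
        score = matched - penalty * sum(1 for w in unwanted if w in words)
        scored.append((score, doc_id))
    # Highest score first; ties broken by doc id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


docs = {
    "d1": "query languages for text retrieval",
    "d2": "retrieval of images",
    "d3": "text query processing and text retrieval of images",
}
print(rank(docs, ["text", "query", "retrieval"], unwanted=["images"]))
```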
Pattern Matching
• Types of patterns
– Words
– Prefixes: a string from the beginning of a text word
• E.g. ‘comput’: ‘computer’, ‘computation’,…
– Suffixes: a string from the termination of a text word
• E.g. ‘ters’: ‘computers’, ‘testers’, ‘painters’,…
– Substrings: A string within a text word
• E.g. ‘tal’: ‘coastal’, ‘talk’, ‘metallic’, …
– Ranges: a pair of strings matching any words lying
between them in lexicographic order
• E.g. between ‘held’ and ‘hold’: ‘hoax’ and ‘hissing’,…
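The pattern types above can be sketched over a small vocabulary. The helper names and the vocabulary are illustrative; treating range bounds as inclusive and comparing lowercase strings directly are assumptions.

```python
def match_prefix(words, prefix):
    return [w for w in words if w.startswith(prefix)]


def match_suffix(words, suffix):
    return [w for w in words if w.endswith(suffix)]


def match_substring(words, sub):
    return [w for w in words if sub in w]


def match_range(words, low, high):
    # Assumption: bounds are inclusive and compared in plain string order.
    return [w for w in words if low <= w <= high]


vocab = ["coastal", "computation", "computer", "computers", "held", "hissing",
         "hoax", "hold", "metallic", "painters", "talk", "testers"]

print(match_prefix(vocab, "comput"))
print(match_suffix(vocab, "ters"))
print(match_substring(vocab, "tal"))
print(match_range(vocab, "held", "hold"))
```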
Pattern Matching
– Allowing errors: a word together with an error
threshold
• Useful when the query or doc contains typos or
misspellings
• Retrieve all text words which are ‘similar’ to the
given word
• edit (or Levenshtein) distance: the minimum
number of character insertions, deletions, and
replacements needed to make two strings equal
– E.g. ‘flower’ and ‘flo wer’
• maximum allowed edit distance: query specifies
the maximum number of allowed errors for a word
to match the pattern
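The edit distance can be computed with the standard dynamic-programming recurrence. The sketch below is one common formulation, not necessarily the algorithm a given system uses; `matches_with_errors` is a hypothetical helper that applies the maximum allowed edit distance to a vocabulary.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of character insertions,
    deletions, and replacements needed to turn `a` into `b`."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # replace (or keep) ca
        prev = curr
    return prev[-1]


def matches_with_errors(word, vocab, max_errors):
    """Return the vocabulary words within `max_errors` edits of `word`."""
    return [w for w in vocab if edit_distance(word, w) <= max_errors]


print(edit_distance("flower", "flo wer"))
print(matches_with_errors("flower", ["flo wer", "flowers", "floor", "tower"], 1))
```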
Pattern Matching
– Regular Expressions
• General patterns are built up by simple strings and
several operations
• union: if e1 and e2 are regular expressions, then (e1 |
e2) matches what e1 or e2 matches
• concatenation: if e1 and e2 are regular expressions,
the occurrences of (e1 e2) are formed by the
occurrences of e1 immediately followed by those of e2
• repetition (Kleene closure): if e is a regular
expression, then (e*) matches a sequence of zero or
more contiguous occurrences of e
• Example:
– ‘pro (blem | tein) (s | ε) (0 | 1 | 2)*’ matches words
‘problem2’, ‘proteins’, etc.
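The example pattern can be tried with Python's `re` module. This is a sketch under the assumption that the pattern must match the whole word; the empty alternative in `(s|)` stands in for ε.

```python
import re

# Union: (blem|tein), (s|) with an empty alternative for ε;
# concatenation: the parts written side by side;
# repetition: (0|1|2)* for zero or more digits from {0, 1, 2}.
PATTERN = re.compile(r"pro(blem|tein)(s|)(0|1|2)*")


def matches(word):
    """Return True if the whole word matches the pattern."""
    return PATTERN.fullmatch(word) is not None


for word in ["problem2", "proteins", "protein", "problem3", "pro"]:
    print(word, matches(word))
```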
Pattern Matching
– Extended Patterns
• Subsets of the regular expressions expressed with a
simpler syntax
• System can convert extended patterns into regular
expressions, or search them with specific algorithms
• E.g.: classes of characters:
Structural Queries
Form-like Fixed Structure
• Docs have a fixed set of fields, much like a filled
form
– Each field has some text inside
Issues of Hierarchical Structure
Trends and Research Issues