Unit3 QueryLanguages Berlin

The document discusses various types of query languages used in data and information retrieval, including pattern-based and keyword-based querying. It outlines the characteristics of different query types, such as Boolean and natural language queries, and emphasizes the importance of context and structural constraints in retrieving relevant documents. Additionally, it highlights the significance of query protocols for effective communication between information systems.


Query Languages

Berlin Chen 2003

Reference:
1. Modern Information Retrieval, chapter 4
The Kinds of Queries

• Data retrieval
– Pattern-based querying
– Retrieve docs that contain (or exactly match) the objects satisfying the conditions clearly specified in the query
– A single erroneous object implies failure!

• Information retrieval
– Keyword-based querying
– Retrieve relevant docs in response to the query (the
formulation of a user information need)
– Allow the answer to be ranked

The Kinds of Queries

• On-line databases or CD-ROM archives

– High level software packages should be viewed as
query languages
– Named “protocols”

Different query languages are formulated and then used in different situations, by considering
- The underlying retrieval models
- The content (semantics) and structure (syntax) of the text

Models: Boolean, vector-space, HMM ….

Formulations/word-treating machineries: Stop-word list,
stemming, query-expansion, ….
The Retrieval Units

• The retrieval unit: the basic element that can be retrieved as an answer to a query
– The answer is a set of such basic elements with ranking information

• The retrieval unit can be a file, a doc, a Web page, a paragraph, a passage, or some other structural unit

• Simply referred to as “docs”

Keyword-based Querying

• Keywords
– Words that can be used for retrieval by a query
– A small set of words extracted from the docs
• Preprocessing is needed

• Characteristics of keyword-based queries

– A query is composed of keywords, and the docs containing those keywords are searched for
– Intuitive, easy to express, and allowing for fast
ranking
– A query can be a single keyword or a more complex combination of operations involving several keywords

Keyword-based Querying
• Single-word queries
– Query: The elementary query is a word
– Docs: The docs are long sequences of words
– What is a word in English?
• A word is a sequence of letters surrounded by
separators
• Some characters are not letters but do not split a
word, e.g. the hyphen in ‘on-line’
• Words possess semantic/conceptual information
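
The word definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: letters are restricted to ASCII, a hyphen between two letter runs does not split the word, and the names `WORD_RE` and `tokenize` are hypothetical.

```python
import re

# Assumption: a word is a run of ASCII letters; a hyphen between two letter
# runs (as in 'on-line') stays inside the word. Everything else is a separator.
WORD_RE = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*")

def tokenize(text):
    """Return the lowercased words of `text`, in order of appearance."""
    return [w.lower() for w in WORD_RE.findall(text)]

words = tokenize("On-line databases, or CD-ROM archives!")
print(words)
```

A real system would also have to decide how to treat digits, apostrophes, and non-English alphabets; those choices are left out here.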

Keyword-based Querying

• Single-word queries (cont.)

– The use of word statistics for IR ranking
• Word occurrences inside texts
– Term frequency: number of times a word appears in a doc
– Inverse document frequency: inversely related to the number of docs in which a word appears

– Word positions in the docs

• May be required, e.g., by an interface highlighting each occurrence of a specific word
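
As a rough illustration of these word statistics, the sketch below counts term frequencies and computes an inverse document frequency. The slides give no formula, so the common form log(N / n), with N docs in the collection and n docs containing the word, is an assumption, as are the function names and the toy collection.

```python
import math
from collections import Counter

def term_frequencies(doc_words):
    """Map each word to the number of times it appears in one doc."""
    return Counter(doc_words)

def inverse_document_frequency(docs, word):
    """Assumed form: log(N / n), where n is the number of docs containing `word`."""
    n_containing = sum(1 for d in docs if word in d)
    if n_containing == 0:
        return 0.0  # convention chosen here for words absent from the collection
    return math.log(len(docs) / n_containing)

# Toy collection: each doc is a list of already tokenized words.
docs = [
    ["query", "languages", "for", "text", "retrieval"],
    ["text", "retrieval", "and", "text", "ranking"],
    ["boolean", "query", "syntax"],
]

tf = term_frequencies(docs[1])
print(tf["text"])                                   # occurrences of 'text' in the second doc
print(inverse_document_frequency(docs, "boolean"))  # rare word, larger value
print(inverse_document_frequency(docs, "text"))     # more common word, smaller value
```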

Keyword-based Querying
• Context queries
– Complement single-word queries with the ability to search for words in a given context, i.e., near other words

– Words appearing near each other may signal a higher likelihood of relevance than if they appear apart
• Phrases of words, or words that are proximal in the text

Keyword-based Querying

• Context queries (cont.)

– Two types of queries
• Phrase
– A sequence of single-word queries
– Separators in the text or query may not be the same
Q: “enhance” and “retrieval”
D: “…enhance the retrieval…”

• Proximity
– A relaxed version of the phrase query
– A sequence of single words (or phrases) is given together with a maximum allowed distance between them
– May not consider word ordering
» E.g., two keywords occur within four words
D: “…enhance the power of retrieval…”

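
The two context queries can be sketched over word positions as follows. This is a minimal sketch, not a definitive implementation: distance is assumed to be the difference between word positions, the stop-word set is supplied by the caller, and all function names are hypothetical.

```python
from collections import defaultdict

def build_positions(words):
    """Map each word to the list of positions at which it occurs."""
    positions = defaultdict(list)
    for i, w in enumerate(words):
        positions[w].append(i)
    return positions

def phrase_match(words, phrase, stopwords=frozenset()):
    """True if the phrase words occur consecutively once stop words are dropped."""
    kept = [w for w in words if w not in stopwords]
    target = [w for w in phrase if w not in stopwords]
    n = len(target)
    return n > 0 and any(kept[i:i + n] == target for i in range(len(kept) - n + 1))

def proximity_match(words, first, second, max_distance, ordered=False):
    """True if `first` and `second` occur at most `max_distance` positions apart."""
    pos = build_positions(words)
    for i in pos.get(first, []):
        for j in pos.get(second, []):
            gap = j - i if ordered else abs(j - i)
            if 0 < gap <= max_distance:
                return True
    return False

d1 = "enhance the retrieval".split()
d2 = "enhance the power of retrieval".split()

print(phrase_match(d1, ["enhance", "retrieval"], stopwords={"the"}))
print(phrase_match(d1, ["enhance", "retrieval"]))
print(proximity_match(d2, "enhance", "retrieval", 4))
print(proximity_match(d2, "enhance", "retrieval", 3))
```

Used this way, proximity acts as a yes/no filter; a ranking function could instead use the smallest gap as a score.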
Keyword-based Querying

• Context queries (cont.)

– Ranking
• Phrases: analogous to single words
• Proximity queries: ranked the same way if physical proximity is not used as a parameter in ranking
– Proximity then acts just as a hard limiter
– But physical proximity has semantic value!

Keyword-based Querying

• Boolean Queries
– Have a syntax composed of atoms (basic queries)
that retrieve docs, and of Boolean operators which
work on their operands

Figure: a query syntax tree for the query “translation AND (syntax OR syntactic)”.
Leaves: basic queries
Internal nodes: operators


Keyword-based Querying

• Boolean Queries (cont.)

– Commonly used operators (e1 and e2 are basic queries)
• OR, e.g. (e1 OR e2)
– Select all docs which satisfy e1 or e2. Duplicates are eliminated
• AND, e.g. (e1 AND e2)
– Select all docs which satisfy both e1 and e2

• BUT, e.g. (e1 BUT e2)

– Select all docs which satisfy e1 but not e2

No partial matching between a doc and a query

No ranking of retrieved docs is provided!
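
A query syntax tree with the OR, AND, and BUT operators can be evaluated recursively with set operations. The sketch below assumes a tiny in-memory inverted index (word to set of doc ids) with made-up doc ids; the tuple encoding of the tree and the name `evaluate` are illustrative choices.

```python
def evaluate(node, index):
    """Evaluate a query syntax tree against an inverted index.

    A leaf is a word (basic query); an internal node is a tuple
    (operator, left, right) with operator in {"AND", "OR", "BUT"}.
    """
    if isinstance(node, str):
        return set(index.get(node, set()))
    op, left, right = node
    a, b = evaluate(left, index), evaluate(right, index)
    if op == "AND":
        return a & b  # docs satisfying both operands
    if op == "OR":
        return a | b  # docs satisfying either operand, duplicates eliminated
    if op == "BUT":
        return a - b  # docs satisfying the left operand but not the right
    raise ValueError(f"unknown operator: {op}")

# Hypothetical index: word -> ids of the docs containing it.
index = {
    "translation": {1, 2, 4},
    "syntax": {2, 3},
    "syntactic": {4, 5},
}

query = ("AND", "translation", ("OR", "syntax", "syntactic"))
print(sorted(evaluate(query, index)))
print(sorted(evaluate(("BUT", "translation", "syntax"), index)))
```

The result is a plain set: a doc either matches or it does not, which is why no ranking comes out of this model.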
Keyword-based Querying

• Boolean Queries (cont.)

– A relaxed version: a “fuzzy Boolean” set of operators
• The meaning of AND and OR can be relaxed
– all : the AND operator
– one: the OR operator (at least one)
– some: retrieve elements appearing in more operands than the OR requires
• Docs are ranked higher when having a larger
number of elements in common with the query
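
One possible reading of the relaxed operators is sketched below: each operand contributes a set of matching doc ids, the operator fixes how many operands a doc must appear in, and docs are ranked by how many operands they match. The threshold parameter `minimum` for "some" is an assumption, since the slides do not fix a value.

```python
def fuzzy_boolean(operand_results, mode, minimum=2):
    """Rank doc ids by the number of operands they match.

    operand_results: list of sets of doc ids, one set per operand.
    mode: "all" (AND), "one" (OR), or "some" (at least `minimum` operands).
    """
    counts = {}
    for docs in operand_results:
        for d in docs:
            counts[d] = counts.get(d, 0) + 1
    required = {"all": len(operand_results), "one": 1, "some": minimum}[mode]
    selected = [d for d, c in counts.items() if c >= required]
    # More operands in common with the query means a higher rank.
    return sorted(selected, key=lambda d: (-counts[d], d))

operands = [{1, 2, 3}, {2, 3, 4}, {3, 5}]
print(fuzzy_boolean(operands, "all"))
print(fuzzy_boolean(operands, "some"))
print(fuzzy_boolean(operands, "one"))
```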

Keyword-based Querying
• Natural language
– Push the fuzzy Boolean model even further
• The distinction between AND and OR is completely blurred
– A query is an enumeration of words and context
queries
– All the documents matching a portion of the user
query are retrieved
• Docs matching more parts of the query are assigned a higher ranking
– Negation can also be handled by penalizing the ranking score
• E.g. some words are not desired
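
A minimal sketch of this kind of ranking follows, assuming a simple score: the number of wanted words a doc contains, minus a penalty for each unwanted word. The scoring rule, the `penalty` parameter, and the toy docs are assumptions made for illustration.

```python
def rank(docs, wanted, unwanted=(), penalty=1.0):
    """Rank docs that match at least one wanted word.

    docs: mapping doc id -> list of words.
    Returns (doc id, score) pairs, best first.
    """
    scores = {}
    for doc_id, words in docs.items():
        present = set(words)
        matched = sum(1 for w in wanted if w in present)
        if matched == 0:
            continue  # a doc must match some portion of the query
        negatives = sum(1 for w in unwanted if w in present)
        scores[doc_id] = matched - penalty * negatives
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

docs = {
    "d1": "boolean query syntax".split(),
    "d2": "query languages for text retrieval".split(),
    "d3": "ranking text".split(),
    "d4": "ship stability".split(),
}

ranking = rank(docs, wanted=["query", "text", "retrieval"], unwanted=["boolean"])
print(ranking)
```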
Pattern Matching

• Pattern matching: allows the retrieval of docs based on some patterns
– A pattern is a set of syntactic features that must occur in a text segment
• Segments satisfying the pattern specifications are
said to “match the pattern”
• E.g. the prefix of a word
– A kind of data retrieval
• Pattern matching (data retrieval) can be viewed
as an enhanced tool for information retrieval
– Require more sophisticated data structures and
algorithms to retrieve efficiently
Pattern Matching

• Types of patterns
– Words
– Prefixes: a string from the beginning of a text word
• E.g. ‘comput’: ‘computer’, ‘computation’,…
– Suffixes: a string from the termination of a text word
• E.g. ‘ters’: ‘computers’, ‘testers’, ‘painters’,…
– Substrings: A string within a text word
• E.g. ‘tal’: ‘coastal’, ‘talk’, ‘metallic’, …
– Ranges: a pair of strings matching any words lying
between them in lexicographic order
• E.g. between ‘held’ and ‘hold’: ‘hoax’ and ‘hissing’,…
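
These pattern types map directly onto simple string tests over a vocabulary. The sketch below uses the example words from the list above; treating the range endpoints as inclusive and keeping the vocabulary in a sorted Python list are assumptions.

```python
from bisect import bisect_left, bisect_right

# Vocabulary of text words, kept sorted so that ranges can use binary search.
vocabulary = sorted([
    "coastal", "computation", "computer", "computers", "held", "hissing",
    "hoax", "hold", "metallic", "painters", "talk", "testers",
])

def prefix_matches(vocab, prefix):
    return [w for w in vocab if w.startswith(prefix)]

def suffix_matches(vocab, suffix):
    return [w for w in vocab if w.endswith(suffix)]

def substring_matches(vocab, substring):
    return [w for w in vocab if substring in w]

def range_matches(vocab, low, high):
    """Words w with low <= w <= high in lexicographic order (vocab must be sorted)."""
    return vocab[bisect_left(vocab, low):bisect_right(vocab, high)]

print(prefix_matches(vocabulary, "comput"))
print(suffix_matches(vocabulary, "ters"))
print(substring_matches(vocabulary, "tal"))
print(range_matches(vocabulary, "held", "hold"))
```

The linear scans are only for clarity; efficient prefix, suffix, and substring search typically relies on dedicated index structures.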

Pattern Matching
– Allowing errors: a word together with an error
threshold
• Useful when the query or the doc contains typos or misspellings
• Retrieve all text words which are ‘similar’ to the
given word
• edit (or Levenshtein) distance: the minimum
number of character insertions, deletions, and
replacements needed to make two strings equal
– E.g. ‘flower’ and ‘flo wer’ are at edit distance 1 (one inserted space)
• maximum allowed edit distance: query specifies
the maximum number of allowed errors for a word
to match the pattern
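
The edit distance defined above can be computed with the standard dynamic-programming recurrence. The sketch below keeps only two rows of the table; the helper `matches_with_errors` and the sample vocabulary are hypothetical.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions, and replacements turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # replace ca by cb, or keep it if equal
            ))
        prev = curr
    return prev[-1]

def matches_with_errors(vocab, word, max_errors):
    """Vocabulary words within `max_errors` edits of `word`."""
    return [w for w in vocab if edit_distance(w, word) <= max_errors]

print(edit_distance("flower", "flo wer"))
print(matches_with_errors(["flower", "flowers", "floor", "tower"], "flower", 1))
```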
Pattern Matching
– Regular Expressions
• General patterns are built up from simple strings and several operations
• union: if e1 and e2 are regular expressions, then (e1 |
e2) matches what e1 or e2 matches
• concatenation: if e1 and e2 are regular expressions,
the occurrences of (e1 e2) are formed by the
occurrences of e1 immediately followed by those of e2
• repetition (Kleene closure): if e is a regular expression, then (e*) matches a sequence of zero or more contiguous occurrences of e
• Example:
– ‘pro (blem | tein) (s | ε) (0 | 1 | 2)*’ matches words
‘problem2’, ‘proteins’, etc.
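
The example pattern can be tried with Python's `re` module. In the sketch below, `s?` stands for (s | ε) and `[012]*` for (0 | 1 | 2)*; the spaces of the slide notation are dropped because they only separate the parts. The helper name `is_match` is an illustrative choice.

```python
import re

# 'pro (blem | tein) (s | ε) (0 | 1 | 2)*' written in Python regex syntax.
PATTERN = re.compile(r"pro(?:blem|tein)s?[012]*")

def is_match(word):
    """True if the whole word matches the pattern."""
    return PATTERN.fullmatch(word) is not None

for word in ["problem2", "proteins", "protein", "problems01", "program", "problem3"]:
    print(word, is_match(word))
```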
Pattern Matching

– Extended Patterns
• Subsets of the regular expressions expressed with a
simpler syntax
• System can convert extended patterns into regular
expressions, or search them with specific algorithms
• E.g.: classes of characters:

Structural Queries

• Docs are allowed to be queried with respect to both their text content and structural constraints
– Text content: words, phrases, or patterns
– Structural constraints: containment, proximity, or
other restrictions on the structural elements (e.g.,
chapters, sections, etc.)
Mixing contents and structures in queries:
Query on text content → text retrieval model → a set of retrieved documents → Boolean model (applying the structural query) → the final set of retrieved documents

Structural Queries

• Three main structures discussed here

– Form-like fixed structure (simple)
– Hierarchical structure
– Hypertext structure (complex)

What structure may a text have?

What can be queried about that structure? (the query model)
How to rank docs?

Form-like Fixed Structure
• Docs have a fixed set of fields, much like a filled form
– Each field has some text inside
– Some fields are not present in all docs
– Text has to be classified into a field
– Fields are not allowed to nest or overlap
– A given pattern can only be associated with a specified field
– E.g., a mail archive (sender, receiver, date, subject, body, …)
• Search for the mail sent to a given person with “football” in the subject field
• Compared with the relational database systems
– Different fields with different data types: more rigid!
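
The mail archive example can be sketched with docs stored as flat dictionaries, one key per field. The field names follow the example above; the sample mails, the `search` helper, and the choice of case-insensitive substring matching are assumptions.

```python
def search(docs, **field_patterns):
    """Return the docs in which every named field contains its pattern.

    Matching is a case-insensitive substring test; a doc that lacks
    one of the named fields does not match.
    """
    hits = []
    for doc in docs:
        if all(pattern.lower() in doc.get(field, "").lower()
               for field, pattern in field_patterns.items()):
            hits.append(doc)
    return hits

# Hypothetical mail archive; fields do not nest, and some may be missing.
mails = [
    {"sender": "alice", "receiver": "bob", "subject": "Football on Sunday", "body": "Bring the ball."},
    {"sender": "carol", "receiver": "bob", "subject": "Meeting notes", "body": "We talked about football."},
    {"sender": "bob", "receiver": "carol", "subject": "football tickets", "body": "Two left."},
    {"sender": "dave", "receiver": "bob", "body": "This mail has no subject field."},
]

hits = search(mails, receiver="bob", subject="football")
print([m["sender"] for m in hits])
```

Because each pattern is tied to one field, "football" in the body of the second mail does not produce a hit.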
Hypertext Structure
• A hypertext is a directed graph where
– Nodes hold some text (content)
– Links represent connections (structural connectivity) between nodes or between positions inside the nodes
• Retrieval from a hypertext began as a merely navigational activity
– Manually traverse the hypertext nodes following links to search for what one wanted
– It is still not possible to query the hypertext based on its structure
• An interesting proposal to combine browsing and searching on the Web: WebGlimpse

Hierarchical Structure

• Represents a recursive decomposition of the text and is a natural model for many text collections
– E.g., books, articles, legal documents,…

Issues of Hierarchical Structure

• Static or dynamic structure

– Static: one or more explicit hierarchies can be
queried, e.g., by ancestry
– Dynamic: not really a hierarchy, the required
elements are built on the fly
• Implemented over a normal text index
• Restrictions on the structure
– The text or the answers may have restrictions about
nesting and/or overlapping for efficiency reasons
– In other cases, the query language is restricted to
avoid restricting the structure

Issues of Hierarchical Structure

• Integration with text
– Effective integration of queries on text content with queries on text structure
– From the perspectives of classical IR models and structural models, respectively
• Classical model: primary -> text, secondary -> structure
• Structural model: primary -> structure, secondary -> text
• Query language
– Some features for queries on structure, including selection of areas that
• Contain (or not) other areas
• Are contained (or not) in other areas
• Follow (or are followed by) other areas
• Are close to other areas
– Also including set manipulation
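
Some of these structural selections can be sketched by modelling each area as an interval of word offsets. The `Area` type, the offsets, and the helper names below are assumptions made for illustration; only containment and set manipulation are shown.

```python
from typing import NamedTuple

class Area(NamedTuple):
    kind: str
    start: int  # offset of the first word in the area
    end: int    # offset just past the last word

def contains(outer, inner):
    """True if `inner` lies inside `outer` (an area does not contain itself)."""
    return outer != inner and outer.start <= inner.start and inner.end <= outer.end

def select_containing(areas, others):
    """Areas that contain at least one of `others`."""
    return [a for a in areas if any(contains(a, o) for o in others)]

def select_contained_in(areas, others):
    """Areas that are contained in at least one of `others`."""
    return [a for a in areas if any(contains(o, a) for o in others)]

chapters = [Area("chapter", 0, 100), Area("chapter", 100, 200)]
sections = [Area("section", 10, 40), Area("section", 120, 150)]
figures = [Area("figure", 125, 130)]

with_figure = select_containing(sections, figures)
# Set manipulation: chapters that do not contain any figure.
without_figure = set(chapters) - set(select_containing(chapters, figures))
print(with_figure)
print(without_figure)
```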
Query Protocols

• The query languages used automatically by software applications to query text databases
– Standards for querying CD-ROMs
– Or, intermediate languages to query library systems
• Important query protocols
– Z39.50
• For bibliographical information systems
• Protocols for not only the query language but also
the client-server connection
– WAIS (Wide Area Information Service)
• A network publishing protocol
• For querying databases through the Internet
Query Protocols

• CD-ROM publishing protocols

– Provide “disk interchangeability”: flexibility in data
communication between primary information
providers and end users
– Some example protocols
• CCL (Common Command Language)
• CD-RDx (Compact Disk Read only Data exchange)
• SFQL (Structured Full-text Query Language)

Trends and Research Issues

• Types of queries and how they are structured
