7 Query Languages Operations

Uploaded by

redmonter John

Query Languages

Keyword-based querying
• Queries are combinations of words.
• The document collection is searched for documents
that contain these words.
• Word queries are intuitive, easy to express and
provide fast ranking.
• The concept of a word must be defined.
– A word is a sequence of letters terminated by a separator
(period, comma, space, etc.).
– The definition of letter and separator is flexible; e.g., a hyphen
could be defined as a letter or as a separator.
– Usually, common words (such as “a”, “the”, “of”, …) are
ignored.
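
The word definition above can be sketched in a few lines of Python. This is only an illustration: the stop-word list and the `hyphen_is_letter` switch are assumptions made for this example, not part of the slides.

```python
import re

# Assumed stop-word list for illustration; real systems use longer lists.
STOP_WORDS = {"a", "the", "of"}

def tokenize(text, hyphen_is_letter=False):
    """Split text into lowercase words and drop common (stop) words."""
    # The set of "letters" is configurable: the hyphen may count as a letter.
    letters = r"[A-Za-z\-]+" if hyphen_is_letter else r"[A-Za-z]+"
    words = re.findall(letters, text.lower())
    return [w for w in words if w not in STOP_WORDS]

print(tokenize("The state-of-the-art of retrieval."))
print(tokenize("The state-of-the-art of retrieval.", hyphen_is_letter=True))
```

With the hyphen treated as a separator, "state-of-the-art" breaks into several words (and its stop words are dropped); treated as a letter, it survives as one word.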
Single-word queries
• A query is a single word.
– Usually used for searching in document images.
• Simplest form of query.
• Which documents are retrieved as relevant?
– All documents that include this word are retrieved.
• On what basis are documents ranked?
– Documents may be ranked by the frequency of the
query word in the document.
– Documents containing more occurrences of the query
word are given higher priority.
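
A minimal sketch of frequency-based ranking for a single-word query, assuming whitespace tokenization and a small made-up collection (`docs` and `rank_single_word` are hypothetical names used only here):

```python
from collections import Counter

def rank_single_word(query, docs):
    """Return (doc_id, frequency) pairs for documents containing the
    query word, most frequent first (ties broken by doc_id)."""
    q = query.lower()
    scores = {}
    for doc_id, text in docs.items():
        freq = Counter(text.lower().split())[q]
        if freq > 0:  # only documents that include the word are retrieved
            scores[doc_id] = freq
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

docs = {
    "d1": "query languages and query processing",
    "d2": "boolean query",
    "d3": "pattern matching",
}
print(rank_single_word("query", docs))
```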
Phrase queries
• A query is a sequence of words treated as a single
unit. Also called a “literal string” or “exact phrase” query.
– The phrase is usually surrounded by quotation marks.
• Usually, separators (commas, colons, ...) and common
words (“a”, “the”, “of”, “for”, …) in the phrase are
ignored.
• In effect, this query is for a set of words that must
appear in sequence.
– Allows users to specify a context and thus gain precision.
– Ex.: “Information Processing for Document Retrieval”.
• Which documents are retrieved as relevant?
– All documents that include the query phrase are retrieved.
• On what basis are documents ranked?
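
One way to picture phrase matching is to drop separators and common words from both the phrase and the document, then look for the remaining words in sequence. The sketch below assumes a tiny stop-word list and a linear scan; it is not how an indexed system would do it.

```python
import re

# Assumed stop-word list for illustration only.
STOP_WORDS = {"a", "the", "of", "for"}

def content_words(text):
    """Lowercase words with separators and common words removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def contains_phrase(phrase, text):
    """True if the phrase's content words appear consecutively in the text."""
    p, t = content_words(phrase), content_words(text)
    n = len(p)
    return n > 0 and any(t[i:i + n] == p for i in range(len(t) - n + 1))

doc = "Notes on Information Processing, for the Document Retrieval task"
print(contains_phrase("Information Processing for Document Retrieval", doc))
print(contains_phrase("document retrieval", "retrieval of a document"))
```

The second call fails because the words are present but not in the required order.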
Multiple-word queries
• A query is a set of words (or phrases).
– Ex.: what is the result for the query “Data Mining and Intelligent
Database Design”?
• Which documents are retrieved as relevant?
– Two options: a document is retrieved if it includes
• any of the query words, or
• each of the query words.
• On what basis are documents ranked under the
best-match principle?
– Documents are ranked by the number of query words they
contain. A document containing n query words is ranked
higher than a document containing m < n query words.
– Documents are ranked in decreasing order:
• those containing all the query words are ranked at the top, those
containing only one query word at the bottom.
– Frequency counts may be used to break ties among
documents that contain the same query words.
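
The ranking rule above can be sketched as follows. The collection, the function name `rank_multi_word`, and the `require_all` switch (for the "each of the query words" option) are assumptions made for this example.

```python
def rank_multi_word(query_words, docs, require_all=False):
    """Rank documents by number of distinct query words matched;
    total frequency of matched words breaks ties."""
    q = {w.lower() for w in query_words}
    ranked = []
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        matched = q & set(tokens)
        if not matched or (require_all and matched != q):
            continue  # document is not retrieved
        total_freq = sum(tokens.count(w) for w in matched)
        ranked.append((doc_id, len(matched), total_freq))
    ranked.sort(key=lambda r: (-r[1], -r[2], r[0]))
    return ranked

docs = {
    "d1": "data mining and database design",
    "d2": "data data mining",
    "d3": "intelligent database design for data mining",
    "d4": "mining mining data mining",
}
query = ["data", "mining", "intelligent", "database", "design"]
print(rank_multi_word(query, docs))
print(rank_multi_word(query, docs, require_all=True))
```

Here d4 and d2 both match two query words, so the frequency count decides their order.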
Boolean queries
• Queries are formulated based on concepts from logic:
AND, OR, NOT.
– A Boolean query describes the information need by relating
multiple words with Boolean operators.
• Semantics: For each query word w, a corresponding
set Dw is constructed that includes the documents that
contain w.
• The Boolean expression is then interpreted as an
expression on the corresponding document sets with
corresponding set operators:
– AND: Finds only documents containing all of the specified
words or phrases.
– OR: Finds documents containing at least one of the specified
words or phrases.
– NOT: Excludes documents containing the specified word or
phrase.
Examples: Boolean queries
1. computer OR server
– Finds documents containing either computer, server or
both.
2. (computer OR server) NOT mainframe
– Select all documents that discuss computers or servers;
do not select any documents that discuss mainframes.
3. computer NOT (server OR mainframe)
– Select all documents that discuss computers and do not
discuss either servers or mainframes.
4. computer OR server NOT mainframe
– Select all documents that discuss computers, or
documents that discuss servers but do not discuss
mainframes.
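
The four examples can be checked with Python sets, where each Dw is the set of documents containing word w. The five-document collection is made up for illustration, and "A NOT B" is read as set difference (A AND NOT B), matching the explanations above.

```python
docs = {
    "d1": "computer networks",
    "d2": "server and mainframe",
    "d3": "computer and server",
    "d4": "computer and mainframe",
    "d5": "server rack",
}

def D(word):
    """Set of ids of the documents that contain the given word."""
    return {doc_id for doc_id, text in docs.items() if word in text.split()}

computer, server, mainframe = D("computer"), D("server"), D("mainframe")

q1 = computer | server                 # computer OR server
q2 = (computer | server) - mainframe   # (computer OR server) NOT mainframe
q3 = computer - (server | mainframe)   # computer NOT (server OR mainframe)
q4 = computer | (server - mainframe)   # computer OR server NOT mainframe

for name, result in [("q1", q1), ("q2", q2), ("q3", q3), ("q4", q4)]:
    print(name, sorted(result))
```

Note how q2 and q4 differ only in grouping: d4 (computer and mainframe) is excluded by q2 but retrieved by q4.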
Pattern queries
• What is a pattern?
– An expression that defines a set of objects. A pattern shows
the internal representation of an object.
– What is the pattern of a word?
• Pattern matching: A word matches a pattern if it is equal
to one of the words defined by the pattern. In other words,
– the semantics are those of disjunction: a pattern P that defines
the words c1, c2, …, cn is interpreted as c1 ∨ c2 ∨ … ∨ cn.
• Similarity pattern: specifies a string and a radius.
– Defines all the words whose distance from the string is within
the radius.
– Assume the distance between two strings is measured by
the number of one-character changes (insertions, deletions,
replacements) required to transform one string into the other.
• The similarity pattern (king, 2) defines kin, kong, knig, kings, cling, …
• Useful to compensate for typing or scanning (OCR) errors.
– One of the techniques used for pattern matching is string editing.
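
The distance described above (insertions, deletions, replacements) is commonly computed with dynamic programming. The sketch below is one standard way to do it; the function names and the small vocabulary are chosen for this example only.

```python
def edit_distance(s, t):
    """Number of one-character insertions, deletions and replacements
    needed to transform s into t (row-by-row dynamic programming)."""
    prev = list(range(len(t) + 1))  # distances from "" to prefixes of t
    for i, cs in enumerate(s, start=1):
        curr = [i]                  # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cs
                curr[j - 1] + 1,           # insert ct
                prev[j - 1] + (cs != ct),  # replace (free if equal)
            ))
        prev = curr
    return prev[-1]

def similarity_match(pattern, radius, vocabulary):
    """Words of the vocabulary within the given radius of the pattern."""
    return [w for w in vocabulary if edit_distance(pattern, w) <= radius]

vocabulary = ["kin", "kong", "knig", "kings", "cling", "queen"]
print(similarity_match("king", 2, vocabulary))
```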
Natural language
• Using natural language for querying is very
attractive.
• Example: Find all the documents that discuss
“campaign finance reforms, including documents
that discuss violations of campaign financing
regulations. Do not include documents that discuss
campaign contributions by the gun and the tobacco
industries”.
• Natural language queries are converted to a formal
language for processing against a set of
documents.
• Such translation requires intelligence and is still a
challenge.
Problems with Keywords
May not retrieve relevant documents that
include synonymous terms.
◦ “restaurant” vs. “café”
◦ “Abyssinia” vs. “Ethiopia”
May retrieve irrelevant documents that
include polysemous terms.
◦ “Apple” (company vs. fruit vs. sport club)
◦ “bit” (unit of data vs. act of eating)
◦ “bat” (baseball vs. mammal)
Research area: Intelligent IR
• Take into account the meaning of the words used
• Take into account the order of words in the query
• Adapt to the user need based on automatic or
semi-automatic feedback
• Extend search with related terms
• Perform automatic spell checking

Programming Assignment (due date: ______)
Construct an index file
• Given a text document collection, generate index terms and
organize them using inverted file indexing. Include TF (term
frequency), DF (document frequency) and CF (collection frequency)
for each index term, along with the position/location
information of terms in each document.
Hint:
• Text corpus size = 20 (written in Notepad, saved in .txt
format)
• Two types of files: vocabulary and posting files
• All text operations must be performed (such as stop word
detection, stemming, …)
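
As a rough starting point, the structure of a vocabulary file and a posting file could look like the in-memory sketch below. It is not a complete solution: stemming is omitted, the stop-word list is an assumption, documents are inline strings rather than .txt files, and positions are word offsets counted before stop-word removal.

```python
from collections import defaultdict

# Assumed stop-word list; stemming is deliberately left out of this sketch.
STOP_WORDS = {"a", "the", "of", "and"}

def build_index(docs):
    """Return (vocabulary, postings).
    postings:   term -> {doc_id: [word positions]}; TF is len(positions)
    vocabulary: term -> {"df": documents containing term,
                         "cf": occurrences in the whole collection}"""
    postings = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            if word not in STOP_WORDS:
                postings[word].setdefault(doc_id, []).append(pos)
    vocabulary = {
        term: {"df": len(by_doc), "cf": sum(len(p) for p in by_doc.values())}
        for term, by_doc in postings.items()
    }
    return vocabulary, dict(postings)

docs = {
    "d1": "the query and the query language",
    "d2": "query of a pattern",
}
vocabulary, postings = build_index(docs)
print(vocabulary["query"])
print(postings["query"])
```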
