ISR Chap. 7
Query Languages
Keyword-based querying
• Queries are combinations of words.
• The document collection is searched for documents
that contain these words.
• Word queries
• are intuitive,
• are easy to express, and
• provide fast ranking.
• The concept of word must be defined.
• A word is a sequence of letters terminated by a
separator (period, comma, space, etc.).
• Definition of letter and separator is flexible; e.g.,
hyphen could be defined as a letter or as a
separator.
• Usually, common words (such as “a”, “the”, “of”, …)
are ignored.
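The word definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop list, the function name tokenize, and the hyphen_is_letter flag are hypothetical and not part of the slides.

```python
import re

# Assumed stop list, taken from the examples on the slide.
STOPWORDS = {"a", "the", "of"}

def tokenize(text, hyphen_is_letter=False):
    """Split text into lowercase words and drop common words."""
    if hyphen_is_letter:
        # A hyphen between letters is treated as part of the word.
        pattern = r"[a-z]+(?:-[a-z]+)*"
    else:
        # A hyphen acts as a separator, like a space or a comma.
        pattern = r"[a-z]+"
    words = re.findall(pattern, text.lower())
    return [w for w in words if w not in STOPWORDS]

sentence = "The state-of-the-art of retrieval."
print(tokenize(sentence))                         # hyphen as separator
print(tokenize(sentence, hyphen_is_letter=True))  # hyphen as letter
```

With the hyphen as a separator the sentence yields three words; with the hyphen as a letter, "state-of-the-art" survives as one word.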
Single-word queries
• A query is a single word; usually used for
searching in document images.
• Simplest form of query.
• All documents that include this word are
retrieved.
• Documents may be ranked by the frequency of
this word in the document.
• Disadvantages:
• Ambiguity: a single word may have several meanings.
• Lack of specificity: a single word rarely narrows the
information need.
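A minimal sketch of a single-word query with frequency ranking, assuming a tiny in-memory collection. The documents, their identifiers, and the function name single_word_query are made up for illustration; words are split on whitespace only.

```python
def single_word_query(word, docs):
    """Return (doc_id, frequency) pairs for documents containing word,
    highest frequency first."""
    word = word.lower()
    hits = []
    for doc_id, text in docs.items():
        count = text.lower().split().count(word)
        if count > 0:
            hits.append((doc_id, count))
    # Sort by descending frequency, then by id for a stable order.
    hits.sort(key=lambda pair: (-pair[1], pair[0]))
    return hits

# Hypothetical collection.
docs = {
    "d1": "bank of the river bank",
    "d2": "the bank approved the loan",
    "d3": "river delta",
}
print(single_word_query("bank", docs))
```

Both d1 and d2 are retrieved for "bank" although they use the word in different senses, which is the ambiguity problem in miniature.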
Phrase queries
• A query is a sequence of words treated as a single unit.
• Phrase is usually surrounded by quotation marks.
• All documents that include this phrase are retrieved.
• Usually, separators (commas, colons, etc.) and common
words (e.g., “a”, “the”, “of”, “for”…) in the phrase are
ignored.
• In effect, this query is for a set of words that must
appear in sequence.
• Allows users to specify a context and thus gain precision.
• Example: “Information Processing for Document
Retrieval”.
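Phrase matching as described above can be sketched as follows: separators and common words are dropped from both the phrase and the document, and the remaining words must appear in sequence. The stop list and the helper names content_words and contains_phrase are assumptions for illustration.

```python
import re

# Assumed stop list, taken from the examples on the slide.
STOPWORDS = {"a", "the", "of", "for"}

def content_words(text):
    """Lowercase words with separators and common words removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower())
            if w not in STOPWORDS]

def contains_phrase(doc_text, phrase):
    """True if the phrase's content words occur consecutively in the document."""
    doc = content_words(doc_text)
    target = content_words(phrase)
    n = len(target)
    if n == 0:
        return False
    return any(doc[i:i + n] == target for i in range(len(doc) - n + 1))

phrase = "Information Processing for Document Retrieval"
print(contains_phrase("Notes on information processing, for the document retrieval task", phrase))
print(contains_phrase("Document retrieval and information processing", phrase))
```

The first document matches despite the extra comma and "the"; the second contains the same words but not in sequence, so it does not match.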
Multiple-word queries
• A query is a set of words (or phrases).
• Two options: A document is retrieved if it includes
• any of the query words, or
• each of the query words.
• Documents are ranked by the number of query words
they contain:
• A document containing n query words is ranked higher
than a document containing m < n query words.
• Documents are ranked in decreasing order:
• those containing all the query words are ranked at the
top, those containing only one query word at the bottom.
• Frequency counts may be used to break ties among
documents that contain the same query words.
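The ranking rule above can be sketched as below. The collection and the function name multi_word_query are hypothetical. As a simplifying assumption, ties are broken among documents that match the same number of query words, using the total frequency of the matched words.

```python
def multi_word_query(query_words, docs, require_all=False):
    """Rank documents by how many distinct query words they contain.

    require_all=False retrieves documents containing any query word;
    require_all=True retrieves only documents containing each query word.
    """
    query = {w.lower() for w in query_words}
    results = []
    for doc_id, text in docs.items():
        words = text.lower().split()
        matched = query & set(words)
        if not matched:
            continue
        if require_all and matched != query:
            continue
        frequency = sum(words.count(w) for w in matched)
        results.append((doc_id, len(matched), frequency))
    # More matched words first; frequency breaks ties; id keeps order stable.
    results.sort(key=lambda r: (-r[1], -r[2], r[0]))
    return [r[0] for r in results]

# Hypothetical collection.
docs = {
    "d1": "query languages for text retrieval",
    "d2": "text retrieval text indexing",
    "d3": "retrieval of images",
    "d4": "database systems",
    "d5": "text retrieval",
}
query = ["text", "retrieval", "languages"]
print(multi_word_query(query, docs))
print(multi_word_query(query, docs, require_all=True))
```

Here d2 and d5 both contain two query words; d2 is placed first because "text" occurs twice in it.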
Boolean queries
• Based on concepts from logic: AND, OR, NOT.
• A Boolean query describes the information need by relating
multiple words with Boolean operators.
• Semantics: for each query word w, a corresponding set
D_w is constructed that includes the documents that
contain w; the operators then combine these sets.
• AND: Finds only documents containing all of the
specified words or phrases.
• OR: Finds documents containing at least one of the
specified words or phrases.
• NOT: Excludes documents containing the specified word
or phrase.
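The set semantics can be sketched with Python sets: one set of document ids per word, with AND as intersection, OR as union, and NOT as removal from the collection. The four-document collection and the name build_sets are assumptions for illustration.

```python
def build_sets(docs):
    """Map each word w to the set of ids of documents containing w."""
    sets = {}
    for doc_id, text in docs.items():
        for word in set(text.lower().split()):
            sets.setdefault(word, set()).add(doc_id)
    return sets

# Hypothetical collection.
docs = {
    1: "computer server",
    2: "computer mainframe",
    3: "server",
    4: "printer",
}
D = build_sets(docs)
all_docs = set(docs)

and_result = D["computer"] & D["server"]   # documents containing both words
or_result = D["computer"] | D["server"]    # documents containing at least one
not_result = all_docs - D["mainframe"]     # documents without "mainframe"

print(sorted(and_result), sorted(or_result), sorted(not_result))
```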
Examples: Boolean queries
1. computer OR server
• Finds documents containing either "computer", "server", or
both.
2. (computer OR server) NOT mainframe
• Select all documents that discuss computers or servers,
but do not select any documents that discuss mainframes.
3. computer NOT (server OR mainframe)
• Select all documents that discuss computers, and do
not discuss either servers or mainframes.
4. computer OR server NOT mainframe
• Select all documents that discuss computers, or
documents that discuss servers but do not discuss
mainframes; i.e., the query is read as
computer OR (server NOT mainframe).
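The four example queries can be checked against hand-made document sets. The document ids below are invented for illustration; "X NOT Y" is modelled as set difference, and example 4 is grouped as described in its explanation.

```python
# Hypothetical sets of document ids for each word.
computer = {1, 2, 3, 5}
server = {2, 4, 5}
mainframe = {3, 4, 5, 6}

q1 = computer | server                 # 1. computer OR server
q2 = (computer | server) - mainframe   # 2. (computer OR server) NOT mainframe
q3 = computer - (server | mainframe)   # 3. computer NOT (server OR mainframe)
q4 = computer | (server - mainframe)   # 4. computer OR server NOT mainframe

for name, result in [("q1", q1), ("q2", q2), ("q3", q3), ("q4", q4)]:
    print(name, sorted(result))
```

Queries 2 and 4 use the same words and operators but return different sets, which shows why the grouping of operators matters.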
Weighted queries
• Each query word is assigned a weight expressing its
relative importance within the query.
• A query is then a set of word-weight pairs:
(k1, w1), …, (kn, wn).
• The rank score of a document is the sum of the weights of
the query words that it contains.
• Example: given the query (A, 0.8), (B, 0.9), (C, 0.3) and
• Document 1: (A, B, D) and Document 2: (A, C, D), which document
is ranked first?
• Rank of Document 1: 0.8+0.9 = 1.7
• Rank of Document 2: 0.8+0.3 = 1.1
• Each document includes two words from the query, but
Document 1 is ranked higher because it includes more
important words.
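The worked example can be reproduced with a short sketch. The function name weighted_score is an assumption; the query weights and documents are the ones from the example.

```python
def weighted_score(query, doc_words):
    """Sum the weights of the query words that the document contains."""
    return sum(weight for word, weight in query.items() if word in doc_words)

query = {"A": 0.8, "B": 0.9, "C": 0.3}
documents = {
    "Document 1": {"A", "B", "D"},
    "Document 2": {"A", "C", "D"},
}

scores = {name: weighted_score(query, words) for name, words in documents.items()}
# Highest score first.
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(name, round(scores[name], 2))
```

Scores are rounded only for display, since floating-point sums may carry tiny rounding errors.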
Assignment
Explore and summarize the following concepts related to
probabilistic models in information retrieval:
a. Probabilistic Indexing
b. Information Retrieval as Probabilistic Inference
c. Binary Independence Model (BIM)
d. Bayesian Networks for Text Retrieval
e. Language Model Approach to Information Retrieval
For each concept, provide a brief overview highlighting
its significance and application in the field of
information retrieval.