UNIT 5 - User Search Techniques
1. At the first level, the user attempts to specify the information needed, using his/her vocabulary and past experience.
3. At the final level, the system reconsiders the query based upon the specific database; for example, by assigning weights to the terms based upon the document frequency of each term.
Example: impact (.308), oil (.606), petroleum (.65), spills (.12), accidents (.23), Alaska (.45), price (.16), cost (.25), value (.10)
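A minimal sketch of deriving such term weights from document frequencies, assuming a standard inverse-document-frequency style weighting (the slides do not give the exact formula, and the collection statistics below are made up):

```python
import math

def idf_weights(query_terms, num_docs, doc_freq):
    """Weight each query term by a normalized inverse document frequency:
    terms that are rarer in the collection receive higher weights."""
    weights = {}
    for term in query_terms:
        df = doc_freq.get(term, 0)
        weights[term] = math.log(1 + num_docs / (1 + df)) / math.log(1 + num_docs)
    return weights

# Hypothetical document frequencies in a 1,000-document collection.
print(idf_weights(["oil", "spills", "Alaska"], 1000,
                  {"oil": 120, "spills": 40, "Alaska": 300}))
```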
Boolean Queries
• Boolean queries are natural in systems where weights are binary: a term either applies to a document or it does not. Each term T is associated with the set D_T of documents to which it applies (see the set-based sketch below).
– A AND B : Retrieve all documents for which both A and B are relevant (D_A ∩ D_B)
– A OR B : Retrieve all documents for which either A or B is relevant (D_A ∪ D_B)
– A NOT B : Retrieve all documents for which A is relevant and B is not relevant (D_A − D_B)
• Consider two “unnatural” situations:
– Boolean queries in systems that index documents with weighted terms.
– Weighted Boolean queries in systems that use non-weighted (binary)
terms.
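A minimal sketch of the set-based operators above for the binary case (the term-to-document sets are hypothetical):

```python
# Each term is associated with the set D_T of documents it applies to.
D = {
    "A": {1, 2, 3, 5},
    "B": {2, 4, 5},
}

print(D["A"] & D["B"])   # A AND B -> intersection: {2, 5}
print(D["A"] | D["B"])   # A OR B  -> union: {1, 2, 3, 4, 5}
print(D["A"] - D["B"])   # A NOT B -> set difference: {1, 3}
```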
Boolean Queries in Weighted Systems
• Environment:
– A weighted system, where the relevance of a term to a document is
expressed with a weight.
– Boolean queries, involving AND and OR.
• Possible approach: use a threshold to convert all weights to binary representations.
• Possible approach:
– Transform the query to disjunctive normal form:
• conjunctions of the form T1 ∧ T2 ∧ T3 ...
• connected by ∨ (OR) operators.
– Given a document D
• First, its relevance to each conjunct is computed as the minimum weight
of any document term that appears in the conjunct.
• Then, the document relevance for the complete query is the maximum of
the conjunct weights.
Boolean Queries in Weighted Systems (cont.)
• Example: two documents indexed by 3 terms:
– Doc1 = Term1 / 0.2, Term2 / 0.5, Term3 / 0.6
– Doc2 = Term1 / 0.7, Term2 / 0.2, Term3 / 0.1
– Query: (Term1 AND Term2) OR Term3
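A worked evaluation of this example under the min/max rule above (a small Python sketch; the document and term names follow the example):

```python
# Query in disjunctive normal form: (Term1 AND Term2) OR Term3
conjuncts = [("Term1", "Term2"), ("Term3",)]

docs = {
    "Doc1": {"Term1": 0.2, "Term2": 0.5, "Term3": 0.6},
    "Doc2": {"Term1": 0.7, "Term2": 0.2, "Term3": 0.1},
}

for name, weights in docs.items():
    # Relevance to each conjunct = minimum weight of its terms;
    # relevance to the whole query = maximum over the conjuncts.
    score = max(min(weights[t] for t in conj) for conj in conjuncts)
    print(name, score)
# Doc1: max(min(0.2, 0.5), 0.6) = 0.6
# Doc2: max(min(0.7, 0.2), 0.1) = 0.2
```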
Similarity Measures
• The similarity between two documents indexed by t weighted terms can be measured with the inner product:
SIM(Di, Dj) = Σ_{k=1..t} W_ik · W_jk
• The cosine similarity between a query Q = (q_1, ..., q_t) and a document D = (d_1, ..., d_t):
SIM(Q, D) = Σ_{i=1..t} q_i · d_i / ( √(Σ_{i=1..t} q_i²) · √(Σ_{k=1..t} d_k²) )
• The formula measures the cosine of the angle between the two vectors.
• As the cosine approaches 1, the two vectors become coincident (the document and the query represent closely related concepts); as it approaches 0, the vectors are orthogonal (they represent unrelated concepts).
• Problem: Does not take into account the length of the vectors
• Consider
– Query = (4,8,0)
– Doc1 = (1,2,0)
– Doc2 = (3,6,0)
• SIM(Query, Doc1) and SIM(Query, Doc2) are identical (both equal 1), even though Doc2 has significantly higher weights in the terms in common.
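A quick check of this example with a small Python sketch of the cosine formula above:

```python
import math

def cosine(q, d):
    """Cosine of the angle between a query vector and a document vector."""
    dot = sum(qi * di for qi, di in zip(q, d))
    return dot / (math.sqrt(sum(qi * qi for qi in q)) *
                  math.sqrt(sum(di * di for di in d)))

query, doc1, doc2 = (4, 8, 0), (1, 2, 0), (3, 6, 0)
print(cosine(query, doc1))  # 1.0 -- the vectors point in the same direction
print(cosine(query, doc2))  # 1.0 -- identical, despite Doc2's larger weights
```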
Similarity Measures: Summary
• Four well-known measures of vector similarity, sim(X, Y), each evaluated for binary term vectors and for weighted term vectors (x_i, y_i are the weights of term i, i = 1..t):
• Inner product:
– Binary: |X ∩ Y|
– Weighted: Σ x_i·y_i
• Dice coefficient:
– Binary: 2·|X ∩ Y| / (|X| + |Y|)
– Weighted: 2·Σ x_i·y_i / (Σ x_i² + Σ y_i²)
• Cosine coefficient:
– Binary: |X ∩ Y| / (|X|^(1/2) · |Y|^(1/2))
– Weighted: Σ x_i·y_i / ( √(Σ x_i²) · √(Σ y_i²) )
• Jaccard coefficient:
– Binary: |X ∩ Y| / |X ∪ Y|
– Weighted: Σ x_i·y_i / (Σ x_i² + Σ y_i² − Σ x_i·y_i)
Similarity Measures: Summary (cont.)
• Observations:
• All four measures use the same inner product as the numerator.
• The denominators of the last three may be viewed as normalizations of the inner product.
• The definitions for binary term vectors are more intuitive.
• The three normalized measures are 1 when X = Y; all four are 0 when X and Y are disjoint.
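A small Python sketch of the four weighted-vector forms summarized above (for binary vectors such as the ones below, the weighted and binary definitions coincide):

```python
import math

def inner(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

def dice(x, y):
    return 2 * inner(x, y) / (inner(x, x) + inner(y, y))

def cosine(x, y):
    return inner(x, y) / (math.sqrt(inner(x, x)) * math.sqrt(inner(y, y)))

def jaccard(x, y):
    return inner(x, y) / (inner(x, x) + inner(y, y) - inner(x, y))

x = (0, 1, 0, 1, 0, 1, 1, 0)
y = (0, 0, 0, 1, 0, 1, 1, 0)
for name, f in [("inner", inner), ("dice", dice), ("cosine", cosine), ("jaccard", jaccard)]:
    print(name, round(f(x, y), 3))
```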
Thresholds
• Use of similarity measures alone may return the entire database as a search result, because the similarity measure might yield small but non-zero values for most, if not all, of the documents.
• Similarity measures must be used with thresholds :
– Threshold : a value that the similarity measure must exceed
– It might also be a limit on the size of the answer.
• Example :
Terms:
American, geography, lake, Mexico, painter, oil, reserve, subject.
Doc1 : geography of Mexico suggests oil reserves are available.
Doc1 = ( 0, 1, 0, 1, 0, 1, 1, 0)
Doc2 : American geography has lakes available everywhere.
Doc2 = (1, 1, 1, 0, 0, 0, 0, 0)
Doc3 : painter suggests Mexico lakes as subjects.
Doc3 = (0, 0, 1, 1, 1, 0, 0, 1)
Query : oil reserves in Mexico
Query = (0, 0, 0, 1, 0, 1, 1, 0)
Thresholds (cont.)
• Example (cont.)
• Using the inner product measure:
• SIM(Query, Doc1) = 3
• SIM(Query, Doc2) = 0
• SIM(Query, Doc3) = 1
• If a threshold of 2 is selected, then only Doc1 is retrieved.
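A small Python sketch of this threshold filtering, using the vectors defined in the example above:

```python
def inner(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

docs = {
    "Doc1": (0, 1, 0, 1, 0, 1, 1, 0),
    "Doc2": (1, 1, 1, 0, 0, 0, 0, 0),
    "Doc3": (0, 0, 1, 1, 1, 0, 0, 1),
}
query = (0, 0, 0, 1, 0, 1, 1, 0)
threshold = 2

scores = {name: inner(query, vec) for name, vec in docs.items()}
print(scores)                                            # {'Doc1': 3, 'Doc2': 0, 'Doc3': 1}
print([n for n, s in scores.items() if s > threshold])   # ['Doc1']
```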
• Use of thresholds may decrease recall when documents are clustered, and
search compares queries to centroids.
• There may be documents in a cluster that are not retrieved, even though they
are similar enough to the query, because their cluster centroid is not close
enough to the query.
• The risk increases as the deviation within the cluster increases (the documents are not tightly clustered around the centroid -- a bad cluster).
Ranking
• Similarity measures provide a means for ranking
the set of retrieved documents:
– Ordering the documents from the most likely to satisfy
the query to the least likely.
– Ranking reduces the user overhead.
– Because similarity measures are not accurate, precise ranking may be misleading; documents may be grouped into sets, and the document sets are ranked in order of relevance.
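A minimal sketch of ranking retrieved documents by similarity and, alternatively, grouping them into ranked sets (the scores are illustrative; any of the measures above could supply them):

```python
# Hypothetical (document, similarity) pairs returned by a search.
results = [("Doc1", 3), ("Doc2", 0), ("Doc3", 1), ("Doc4", 3)]

# Precise ranking: most likely to satisfy the query first.
print(sorted(results, key=lambda pair: pair[1], reverse=True))

# Coarser alternative: group documents with the same score into sets,
# then rank the sets in order of relevance.
groups = {}
for name, score in results:
    groups.setdefault(score, []).append(name)
print(sorted(groups.items(), reverse=True))
```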
Relevance Feedback
• An initial query might not provide an accurate description of the
user’s needs:
– User’s lack of knowledge of the domain.
– User’s vocabulary does not match authors’ vocabulary.
• After examining the result of his query, a user can often
improve the description of his needs:
– Querying is an iterative process.
– Further iterations are generated either manually, or automatically.
• Relevance feedback: Knowledge of which returned documents
are relevant and which are not, is used to generate the next
query.
– Assumption: the documents relevant to a query resemble each other (have similar vectors).
– Hence, if a document is known to be relevant, the query can be
improved by increasing its similarity to that document.
– Similarly, if a document is known to be non-relevant, the query
can be improved by decreasing its similarity to that document.
Relevance Feedback (cont.)
• Given a query (a vector) we
– add to it the average (centroid) of the relevant documents in the
result, and
– subtract from it the average (centroid) of the non-relevant
documents in the result.
• A vector algebra expression:
Qi+1 = Qi + (1/r)·Σ_{D ∈ R} D − (1/nr)·Σ_{D ∈ NR} D
where
– Qi = the present query.
– Qi+1 = the revised query.
– D = a document in the result.
– R = the relevant documents in the result (r = cardinality of R).
– NR = the non-relevant documents in the result (nr = cardinality of NR).
Relevance Feedback (cont.)
• A revised formula, giving more control over the various components:
Qi+1 = α·Qi + β·Σ_{D ∈ R} D − γ·Σ_{D ∈ NR} D
where
– α, β, γ = tuning constants; for example, 1.0, 0.5, 0.25.
– β·Σ_{D ∈ R} D = positive feedback factor. Uses the user's judgments on relevant documents to increase the values of terms, moving the query toward the relevant documents retrieved (in the direction of more relevant documents).
– γ·Σ_{D ∈ NR} D = negative feedback factor. Uses the user's judgments on non-relevant documents to decrease the values of terms, moving the query away from non-relevant documents.
– Positive feedback is often weighted significantly more than negative feedback; often, only positive feedback is used.
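A minimal sketch of this feedback update, assuming term-weight vectors stored as plain lists; the α, β, γ defaults are the example constants above, and choosing α = 1, β = 1/r, γ = 1/nr recovers the earlier centroid-based form:

```python
def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.5, gamma=0.25):
    """Revise a query vector using relevance feedback:
    Q_next = alpha*Q + beta*sum(relevant) - gamma*sum(non_relevant)."""
    revised = []
    for i, q in enumerate(query):
        pos = sum(d[i] for d in relevant)        # positive feedback component
        neg = sum(d[i] for d in non_relevant)    # negative feedback component
        # Negative term weights are commonly clamped to zero.
        revised.append(max(0.0, alpha * q + beta * pos - gamma * neg))
    return revised

query = [0, 0, 0, 1, 0, 1, 1, 0]
relevant = [[0, 1, 0, 1, 0, 1, 1, 0]]       # documents judged relevant
non_relevant = [[1, 1, 1, 0, 0, 0, 0, 0]]   # documents judged non-relevant
print(rocchio(query, relevant, non_relevant))
```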
Relevance Feedback (cont.)
• Illustration: impact of relevance feedback, showing the effect of positive feedback only and of negative feedback only.