UNIT5-User Search Techniques

The document discusses various methods of searching documents and evaluating similarity between queries and documents. It covers: 1) Binding of search statements to translate user queries into the system's query language and database terms. 2) Boolean queries in both weighted and non-weighted systems, including how to apply thresholds or calculate similarity to conjuncts and disjuncts. 3) Common similarity measures like inner product, cosine, and Dice coefficient to evaluate similarity between query and document vectors based on term weights and frequencies. It also discusses normalization techniques.


Searching

• Binding of search statements


• Boolean queries
– Boolean queries in weighted systems
– Weighted Boolean queries in non-weighted systems
• Similarity measures
– Well-known measures
– Thresholds
– Ranking
• Relevance feedback
• Selective dissemination of information search
• Searching the internet and hypertext
Binding of Search Statements
• Search statements are generated by users to describe their
information needs.
• Typically, a search statement uses Boolean logic or natural language.
• Three levels of binding may be observed. At each level the query statement becomes more specific.

1. At the first level, the user attempts to specify the information needed, using his/her vocabulary and past experience.

   Example: “Find me information on the impact of oil spills in Alaska on the price of oil”
Binding of Search Statements(cont.)
2. At the next level, the system translates the query into its own internal language. This process is similar to that of processing (indexing) a new document.

   Example: “impact, oil (petroleum), spills (accidents), Alaska, price (cost, value)”

3. At the final level, the system reconsiders the query based upon the specific database; for example, assigning weights to the terms based upon the document frequency of each term.

   Example: impact(.308), oil(.606), petroleum(.65), spills(.12), accidents(.23), Alaska(.45), price(.16), cost(.25), value(.10)
Boolean Queries
• Boolean queries are natural in systems where weights are binary: a term either applies or does not apply. Each term T is associated with the set of documents D_T.
  – A AND B: retrieve all documents for which both A and B are relevant (D_A ∩ D_B)
  – A OR B: retrieve all documents for which either A or B is relevant (D_A ∪ D_B)
  – A NOT B: retrieve all documents for which A is relevant and B is not relevant (D_A - D_B)
• Consider two “unnatural” situations:
– Boolean queries in systems that index documents with weighted terms.
– Weighted Boolean queries in systems that use non-weighted (binary)
terms.
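
• In a binary system these operators map directly onto set operations; a minimal sketch in Python, with hypothetical toy posting lists:

    # Minimal sketch of binary Boolean retrieval: each term maps to the set of
    # documents it indexes (hypothetical toy posting lists).
    postings = {
        "oil":    {1, 2, 5},
        "alaska": {2, 3},
        "price":  {1, 2, 4},
    }

    d_oil, d_alaska, d_price = postings["oil"], postings["alaska"], postings["price"]

    print(d_oil & d_alaska)   # oil AND alaska -> D_oil ∩ D_alaska = {2}
    print(d_oil | d_price)    # oil OR price   -> D_oil ∪ D_price  = {1, 2, 4, 5}
    print(d_oil - d_alaska)   # oil NOT alaska -> D_oil - D_alaska = {1, 5}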
Boolean Queries in Weighted Systems
• Environment:
  – A weighted system, where the relevance of a term to a document is expressed with a weight.
  – Boolean queries, involving AND and OR.
• Possible approach: use a threshold to convert all weights to binary representations.
• Possible approach:
  – Transform the query to disjunctive normal form:
    • sets of conjuncts of the form T1 ∧ T2 ∧ T3 ...
    • connected by OR operators.
  – Given a document D:
    • First, its relevance to each conjunct is computed as the minimum weight of any document term that appears in the conjunct.
    • Then, the document's relevance to the complete query is the maximum of the conjunct weights.
Boolean Queries in Weighted Systems(cont.)
• Example: Two documents indexed by 3 terms:
  – Doc1 = Term1 / 0.2, Term2 / 0.5, Term3 / 0.6
  – Doc2 = Term1 / 0.7, Term2 / 0.2, Term3 / 0.1
  – Query: (Term1 AND Term2) OR Term3

      Document    Term1 ∧ Term2    Term3    Overall
      Doc1        0.2              0.6      0.6
      Doc2        0.2              0.1      0.2

• Relevance of Doc1 to the query is 0.6
• Relevance of Doc2 to the query is 0.2
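
• A minimal sketch of this min/max evaluation in Python (the query is assumed to be given already in disjunctive normal form):

    # Evaluate a Boolean query in a weighted system: minimum over each conjunct,
    # maximum over the disjuncts (query given in disjunctive normal form).
    def dnf_relevance(doc_weights, dnf_query):
        # doc_weights: dict term -> weight; missing terms have weight 0.0
        # dnf_query: list of conjuncts, each a list of terms, e.g.
        #            [["Term1", "Term2"], ["Term3"]]  ==  (Term1 AND Term2) OR Term3
        conjunct_scores = [
            min(doc_weights.get(term, 0.0) for term in conjunct)
            for conjunct in dnf_query
        ]
        return max(conjunct_scores)

    doc1 = {"Term1": 0.2, "Term2": 0.5, "Term3": 0.6}
    doc2 = {"Term1": 0.7, "Term2": 0.2, "Term3": 0.1}
    query = [["Term1", "Term2"], ["Term3"]]

    print(dnf_relevance(doc1, query))   # 0.6
    print(dnf_relevance(doc2, query))   # 0.2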
Weighted Boolean Queries in Non-weighted
Systems
• Environment:
– A conventional system, where a term is either relevant or non-relevant to a
document.
– Boolean queries, in which users associate a weight (importance) with each
term.
• Possible approach:
  – OR
    • A/1 OR B/1 includes all the documents in D_A ∪ D_B.
    • A/1 OR B/0 includes all the documents in D_A.
    • As the weight of B changes from 0 to 1, documents from D_B - D_A are added to D_A.
  – AND
    • A/1 AND B/1 includes all the documents in D_A ∩ D_B.
    • A/1 AND B/0 includes all the documents in D_A.
    • As the weight of B changes from 1 to 0, documents from D_A - D_B are added to D_A ∩ D_B.
Weighted Boolean Queries in Non-weighted
Systems (cont.)
• NOT
  – A/1 NOT B/1 includes all the documents in D_A - D_B.
  – A/1 NOT B/0 includes all the documents in D_A.
  – As the weight of B changes from 1 to 0, documents from D_A ∩ D_B are added to D_A - D_B.
• Algorithm:
  – Determine the documents that satisfy each of the two “extreme” interpretations; the smaller result is the inner set.
  – Determine the centroid of the inner set.
  – Calculate the similarity between the centroid and each document outside of the inner set.
  – Determine the number of documents to be added, by multiplying the actual weight of B (a value between 0 and 1) by the number of documents outside of the inner set.
  – Select the documents to be added as those most similar to the centroid.
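
• A rough sketch of this algorithm in Python, assuming documents are binary term vectors and similarity to the centroid is the inner product (any of the measures discussed later could be used instead):

    # Sketch: interpreting a weighted term B in a binary (non-weighted) system.
    # inner_set / outer_set are the two "extreme" interpretations of the query.
    def expand_inner_set(inner_set, outer_set, doc_vectors, fraction):
        # fraction = share of the "outside" documents to add; per the bullets above
        # this is the weight of B for OR, and (1 - weight of B) for AND / NOT.
        candidates = outer_set - inner_set
        if not inner_set or not candidates:
            return set(inner_set)
        dims = len(next(iter(doc_vectors.values())))
        # Centroid (average vector) of the inner set.
        centroid = [sum(doc_vectors[d][i] for d in inner_set) / len(inner_set)
                    for i in range(dims)]
        sim = lambda x, y: sum(a * b for a, b in zip(x, y))   # inner product
        n_add = round(fraction * len(candidates))
        ranked = sorted(candidates, key=lambda d: sim(centroid, doc_vectors[d]),
                        reverse=True)
        return set(inner_set) | set(ranked[:n_add])

    docs = {1: [1, 1, 0], 2: [1, 0, 1], 3: [0, 1, 1], 4: [1, 1, 1]}
    # A/1 AND B/0.4 where D_A = {1, 2, 4} and D_B = {1, 4}:
    print(expand_inner_set({1, 4}, {1, 2, 4}, docs, 1 - 0.4))   # {1, 2, 4}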
Similarity Measures
• Typically, similarity measures are used when both queries and documents are described by vectors.
• A similarity measure gauges the similarity between two items (for the purpose of search we compare a query and a document rather than two documents, but the considerations are identical).
• The measure increases as similarity grows (0 reflects total dissimilarity).
• A variety of similarity measures have been proposed and experimented with.
• As queries are analogous to documents, the same similarity measures can be used to measure
  – document-document similarity (used in document clustering)
  – document-query similarity (used in searching)
  – query-query similarity (?)
Similarity Measures: Inner Product
• Consider again

      SIM(D_i, D_j) = Σ_{k=1..n} w_ik · w_jk

  where the weights w_ik are simple frequency counts.
• The problem with this simple measure is that it is not normalized to account for variances in the length of documents.
  – This might be corrected by dividing each frequency count by the length of the document.
  – It may also be corrected by dividing each frequency count by the maximum frequency count for the document.
• Additional normalization is often performed to force all similarity values into the range between 0 and 1.
Similarity Measures: Inner Product (cont.)
• This is a refinement of the previous measure (alternatively, the measure remains the inner product, but the representations are different):

      SIM(Q, D) = Σ_{k=1..n} q_k · d_k

  where
  • m is the number of documents in the collection
  • n is the number of indexing terms
  • Each document is a sequence of n weights: D = (d_1, ..., d_n)
  • A query is also a sequence of n weights: Q = (q_1, ..., q_n)
  • Each weight q_k or d_k = IDF_k * TF_k / MF
  • IDF_k = the inverse document frequency for term T_k: that is, a value that decreases as the frequency of the term in the collection increases; for example, log2(m / DF_k) + 1, where DF_k counts the number of documents in which term T_k appears
  • TF_k / MF = the frequency of term T_k in this document, divided by the maximal frequency of any term in this document
• There are other constants for fine-tuning the formula's performance
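
• A small sketch of this weighting scheme in Python (maximum-frequency-normalized TF times a log2-based IDF, applied to a hypothetical toy collection), followed by the inner-product similarity:

    import math

    # Toy collection: each document is a dict term -> raw frequency count.
    docs = {
        "D1": {"oil": 4, "spill": 2, "alaska": 1},
        "D2": {"oil": 1, "price": 3},
        "D3": {"geography": 2, "alaska": 2},
    }
    m = len(docs)                                   # number of documents
    df = {}                                         # document frequency per term
    for counts in docs.values():
        for term in counts:
            df[term] = df.get(term, 0) + 1

    def weights(counts):
        # weight = IDF_k * TF_k / MF, with IDF_k = log2(m / DF_k) + 1
        mf = max(counts.values())
        return {t: (math.log2(m / df[t]) + 1) * f / mf for t, f in counts.items()}

    def inner_product(wq, wd):
        return sum(w * wd.get(t, 0.0) for t, w in wq.items())

    query = weights({"oil": 1, "alaska": 1})        # a query is weighted like a document
    for name, counts in docs.items():
        print(name, round(inner_product(query, weights(counts)), 3))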
Similarity Measures: Cosine
• A document or a query is treated as an n-dimensional vector:

      SIM(Q, D) = Σ_{i=1..n} q_i · d_i / ( sqrt(Σ q_i^2) · sqrt(Σ d_i^2) )

• The formula measures the cosine of the angle between the two vectors.
• As the cosine approaches 1, the two vectors become coincident (the document and the query represent related concepts); as it approaches 0, they are orthogonal (unrelated concepts).
• Problem: does not take into account the lengths of the vectors.
• Consider
  – Query = (4, 8, 0)
  – Doc1 = (1, 2, 0)
  – Doc2 = (3, 6, 0)
• SIM(Query, Doc1) and SIM(Query, Doc2) are identical, even though Doc2 has significantly higher weights on the terms in common
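
• A quick check of this example (a minimal sketch in Python):

    import math

    def cosine(x, y):
        dot = sum(a * b for a, b in zip(x, y))
        return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

    query, doc1, doc2 = (4, 8, 0), (1, 2, 0), (3, 6, 0)
    print(round(cosine(query, doc1), 6))   # 1.0 -- same angle, despite lower weights
    print(round(cosine(query, doc2), 6))   # 1.0 -- cosine ignores vector length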
Similarity Measures: Summary
• Four well-known measures of vector similarity sim(X, Y), each with an evaluation for binary term vectors and for weighted term vectors:

  Inner product
      Binary:   |X ∩ Y|
      Weighted: Σ_{i=1..n} x_i · y_i

  Dice coefficient
      Binary:   2 · |X ∩ Y| / (|X| + |Y|)
      Weighted: 2 · Σ x_i · y_i / (Σ x_i^2 + Σ y_i^2)

  Cosine coefficient
      Binary:   |X ∩ Y| / (|X|^(1/2) · |Y|^(1/2))
      Weighted: Σ x_i · y_i / ( sqrt(Σ x_i^2) · sqrt(Σ y_i^2) )

  Jaccard coefficient
      Binary:   |X ∩ Y| / (|X| + |Y| - |X ∩ Y|)
      Weighted: Σ x_i · y_i / (Σ x_i^2 + Σ y_i^2 - Σ x_i · y_i)
Similarity Measures: Summary (cont.)
• Observations:
  • All four measures use the same inner product as the numerator.
  • The denominators of the last three may be viewed as normalizations of the inner product.
  • The definitions for binary term vectors are more intuitive.
  • The three normalized measures are 1 when X = Y; all four are 0 when X and Y are disjoint.
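
• A sketch of the four weighted-vector measures in Python (the binary versions follow by using 0/1 weights; the two vectors below are arbitrary examples):

    import math

    def inner(x, y):
        return sum(a * b for a, b in zip(x, y))

    def dice(x, y):
        return 2 * inner(x, y) / (inner(x, x) + inner(y, y))

    def cosine(x, y):
        return inner(x, y) / (math.sqrt(inner(x, x)) * math.sqrt(inner(y, y)))

    def jaccard(x, y):
        return inner(x, y) / (inner(x, x) + inner(y, y) - inner(x, y))

    x, y = (0, 1, 0, 1, 0, 1, 1, 0), (0, 0, 0, 1, 0, 1, 1, 0)
    for measure in (inner, dice, cosine, jaccard):
        print(measure.__name__, round(measure(x, y), 3))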
Thresholds
• Use of similarity measures alone may return the entire database as a search result, because the similarity measure may yield close-to-zero, but non-zero, values for most, if not all, of the documents.
• Similarity measures must therefore be used with thresholds:
  – Threshold: a value that the similarity measure must exceed.
  – It might also be a limit on the size of the answer.
• Example:
  – Terms: American, geography, lake, Mexico, painter, oil, reserve, subject.
  – Doc1: geography of Mexico suggests oil reserves are available.
      Doc1 = (0, 1, 0, 1, 0, 1, 1, 0)
  – Doc2: American geography has lakes available everywhere.
      Doc2 = (1, 1, 1, 0, 0, 0, 0, 0)
  – Doc3: painters suggest Mexico lakes as subjects.
      Doc3 = (0, 0, 1, 1, 1, 0, 0, 1)
  – Query: oil reserves in Mexico
      Query = (0, 0, 0, 1, 0, 1, 1, 0)
Thresholds(cont.)
• Example (cont.):
  – Using the inner product measure:
      SIM(Query, Doc1) = 3
      SIM(Query, Doc2) = 0
      SIM(Query, Doc3) = 1
  – If a threshold of 2 is selected, then only Doc1 is retrieved.
• Use of thresholds may decrease recall when documents are clustered and search compares queries to cluster centroids:
  – There may be documents in a cluster that are not retrieved, even though they are similar enough to the query, because their cluster centroid is not close enough to the query.
  – The risk increases as the deviation within the cluster increases (the documents are not tightly clustered around the centroid -- a bad cluster).
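
• A quick check of the threshold example (a minimal sketch, retrieving documents whose similarity exceeds the threshold):

    def inner_product(x, y):
        return sum(a * b for a, b in zip(x, y))

    query = (0, 0, 0, 1, 0, 1, 1, 0)
    docs = {
        "Doc1": (0, 1, 0, 1, 0, 1, 1, 0),
        "Doc2": (1, 1, 1, 0, 0, 0, 0, 0),
        "Doc3": (0, 0, 1, 1, 1, 0, 0, 1),
    }
    threshold = 2
    scores = {name: inner_product(query, d) for name, d in docs.items()}
    print(scores)                                            # {'Doc1': 3, 'Doc2': 0, 'Doc3': 1}
    print([n for n, s in scores.items() if s > threshold])   # ['Doc1']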
Ranking
• Similarity measures provide a means for ranking
the set of retrieved documents:
– Ordering the documents from the most likely to satisfy
the query to the least likely.
– Ranking reduces the user overhead.
  – Because similarity measures are not accurate, precise ranking may be misleading; documents may be grouped into sets, and the document sets are ranked in order of relevance.
Relevance Feedback
• An initial query might not provide an accurate description of the
user’s needs:
– User’s lack of knowledge of the domain.
– User’s vocabulary does not match authors’ vocabulary.
• After examining the result of his query, a user can often
improve the description of his needs:
– Querying is an iterative process.
– Further iterations are generated either manually, or automatically.
• Relevance feedback: Knowledge of which returned documents
are relevant and which are not, is used to generate the next
query.
– Assumption: the documents relevant to a query resemble each
other(similar vectors).
– Hence, if a document is known to be relevant, the query can be
improved by increasing its similarity to that document.
– Similarly, if a document is known to be non-relevant, the query
can be improved by decreasing its similarity to that document.
Relevance Feedback(cont.)
• Given a query (a vector) we
– add to it the average (centroid) of the relevant documents in the
result, and
– subtract from it the average (centroid) of the non-relevant
documents in the result.
• A vector algebra expression:

      Q_{i+1} = Q_i + (1/r) · Σ_{D ∈ R} D - (1/nr) · Σ_{D ∈ NR} D

  where
  – Q_i = the present query.
  – Q_{i+1} = the revised query.
  – D = a document in the result.
  – R = the relevant documents in the result (r = cardinality of R).
  – NR = the non-relevant documents in the result (nr = cardinality of NR).
Relevance Feedback (cont.)
• A revised formula, giving more control over the various components:

      Q_{i+1} = α · Q_i + β · Σ_{D ∈ R} D - γ · Σ_{D ∈ NR} D

  where
  – α, β, γ = tuning constants; for example, 1.0, 0.5, 0.25.
  – β · Σ_{D ∈ R} D = positive feedback factor. Uses the user's judgments on relevant documents to increase the values of terms. Moves the query to retrieve documents similar to the relevant documents retrieved (in the direction of more relevant documents).
  – γ · Σ_{D ∈ NR} D = negative feedback factor. Uses the user's judgments on non-relevant documents to decrease the values of terms. Moves the query away from non-relevant documents.
  – Positive feedback is often weighted significantly more than negative feedback; often, only positive feedback is used.
Relevance Feedback(Cont.)
• Illustration: impact of relevance feedback, showing the effect of positive feedback only and of negative feedback only:
  – Boxes: filled = present query; hollow = modified query.
  – Oval: set of documents retrieved by the present query.
  – Circles: filled = non-relevant documents; hollow = relevant documents.

  [Figure: two panels, "Positive feedback" and "Negative feedback"]
Relevance Feedback(Cont.)
• Example:
  – Assume query Q = (3, 0, 0, 2, 0) retrieved three documents Doc1, Doc2, Doc3.
  – Assume Doc1 and Doc2 are judged relevant and Doc3 is judged non-relevant.
  – Assume the constants used are α = 1.0, β = 0.5, γ = 0.25.

              Term1    Term2    Term3    Term4    Term5
      Q       3        0        0        2        0
      Doc1    2        4        0        0        2
      Doc2    1        3        0        0        0
      Doc3    0        0        4        3        2

  – The revised query is:
      Q' = (3, 0, 0, 2, 0)
           + 0.5 * ((2+1)/2, (4+3)/2, (0+0)/2, (0+0)/2, (2+0)/2)
           - 0.25 * (0, 0, 4, 3, 2)
         = (3.75, 1.75, -1, 1.25, 0), i.e. (3.75, 1.75, 0, 1.25, 0) after negative weights are set to 0.
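
• A short sketch reproducing this computation in Python (following the worked example, the relevant and non-relevant documents are averaged, and negative weights in the revised query are clamped to 0):

    # Rocchio-style relevance feedback:
    #   Q' = a*Q + b*centroid(relevant) - c*centroid(non-relevant)
    def revise_query(q, relevant, non_relevant, a=1.0, b=0.5, c=0.25):
        def centroid(docs):
            return [sum(col) / len(docs) for col in zip(*docs)]
        r_cent, nr_cent = centroid(relevant), centroid(non_relevant)
        revised = [a * qk + b * rk - c * nk for qk, rk, nk in zip(q, r_cent, nr_cent)]
        return [max(0.0, w) for w in revised]   # negative weights set to 0

    Q    = [3, 0, 0, 2, 0]
    Doc1 = [2, 4, 0, 0, 2]
    Doc2 = [1, 3, 0, 0, 0]
    Doc3 = [0, 0, 4, 3, 2]

    print(revise_query(Q, [Doc1, Doc2], [Doc3]))   # [3.75, 1.75, 0.0, 1.25, 0.0]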
Relevance Feedback(Cont.)
• Example (cont.):
  – Using the similarity formula

        SIM(Q_i, D_j) = Σ_{k=1..n} w_ik · w_jk

    we can compare the similarity of Q and Q' to the three documents:

              Doc1     Doc2     Doc3
      Q       6        3        6
      Q'      14.5     9        3.75

  – Compared to the original query, the new query is more similar to Doc1 and Doc2 (judged relevant), and less similar to Doc3 (judged non-relevant).
  – Notice how the new query added Term2, which was not in the original query.
• For example, a user may be searching for “word processor” to be used
on a “PC”, and the revised query may introduce the term “Mac”.
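
• A minimal check of the comparison above, again using the inner product:

    def inner_product(x, y):
        return sum(a * b for a, b in zip(x, y))

    Q     = [3, 0, 0, 2, 0]
    Q_rev = [3.75, 1.75, 0, 1.25, 0]
    docs  = {"Doc1": [2, 4, 0, 0, 2], "Doc2": [1, 3, 0, 0, 0], "Doc3": [0, 0, 4, 3, 2]}

    for name, d in docs.items():
        print(name, inner_product(Q, d), inner_product(Q_rev, d))
    # Doc1: 6 -> 14.5, Doc2: 3 -> 9.0, Doc3: 6 -> 3.75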
Relevance Feedback(Cont.)
• Problem: relevance feedback may not operate satisfactorily if the identified relevant documents do not form a tight cluster.
  – Possible solution: cluster the identified relevant documents, then split the original query into several, by constructing a new query for each cluster.
• Problem: some of the query terms might not be found in any of the retrieved documents. This will lead to a reduction of their relative weight in the modified query (or even to their elimination). Undesirable, because these terms might still be found in future iterations.
  – Possible solutions: ensure the original terms are kept; or present all modified queries to the user for review.
• “Fully automatic” relevance feedback: the rank values for the documents in the first answer are used as relevance feedback to automatically generate the second query (no human judgment).
  – The highest-ranking documents are assumed to be relevant (positive feedback only).
