Boolean and Vector Space Retrieval Models
Many slides in this section are adapted from Prof. Joydeep Ghosh (UT ECE) who in turn adapted them from Prof. Dik Lee (Univ. of Science and Tech, Hong Kong)
Retrieval Models
A retrieval model specifies the details of:
Document representation
Query representation
Retrieval function
Determines a notion of relevance. Notion of relevance can be binary or continuous (i.e. ranked retrieval).
Probabilistic models
User Task
Retrieval Browsing
Remove common stopwords (e.g. a, the, it).
Detect common phrases (possibly using a domain-specific dictionary).
Build an inverted index (keyword -> list of docs containing it).
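The preprocessing steps above can be sketched in a few lines of Python. This is only an illustration: the stopword list, the whitespace tokenizer, and the three sample documents are assumptions made for the example, and phrase detection is left out.

```python
from collections import defaultdict

# Hypothetical stopword list; a real system would use a much larger one.
STOPWORDS = {"a", "the", "it", "in", "of"}

def build_inverted_index(docs):
    """Map each non-stopword keyword to the sorted list of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: lowercase and split on whitespace as a stand-in tokenizer.
        for token in text.lower().split():
            if token not in STOPWORDS:
                index[token].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Made-up sample collection.
docs = {
    1: "the hotel in Rio",
    2: "a hotel in Hilo",
    3: "it is the Hilton",
}
index = build_inverted_index(docs)
print(index["hotel"])
```

Looking up a keyword then costs one dictionary access instead of a scan over every document.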
Boolean Model
A document is represented as a set of keywords.
Queries are Boolean expressions of keywords, connected by AND, OR, and NOT, including the use of brackets to indicate scope.
[[Rio & Brazil] | [Hilo & Hawaii]] & hotel & !Hilton
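Because each document is a set of keywords, a Boolean query maps directly onto set operations over an inverted index: AND is intersection, OR is union, and NOT is the complement with respect to the whole collection. The sketch below evaluates the example query; the index contents and document ids are invented for illustration.

```python
# Toy inverted index (hypothetical data): keyword -> set of document ids.
index = {
    "rio": {1, 2},
    "brazil": {1, 2, 5},
    "hilo": {3, 4},
    "hawaii": {3, 4},
    "hotel": {1, 3, 4, 5},
    "hilton": {3},
}
all_docs = {1, 2, 3, 4, 5}

def docs_with(term):
    """Return the set of documents containing the keyword (empty if unknown)."""
    return index.get(term, set())

# [[Rio & Brazil] | [Hilo & Hawaii]] & hotel & !Hilton
result = (
    ((docs_with("rio") & docs_with("brazil"))
     | (docs_with("hilo") & docs_with("hawaii")))
    & docs_with("hotel")
    & (all_docs - docs_with("hilton"))  # NOT = complement within the collection
)
print(sorted(result))
```

The result is an unranked set: a document either satisfies the expression or it does not.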
Statistical Models
A document is typically represented by a bag of words (unordered words with frequencies). Bag = set that allows multiple occurrences of the same element. User specifies a set of desired terms with optional weights:
Weighted query terms: Q = < database 0.5; text 0.8; information 0.2 >
Unweighted query terms: Q = < database; text; information >
No Boolean conditions specified in the query.
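A bag of words and the two query forms above can be represented with ordinary Python mappings. This is a minimal sketch: the sample text and the whitespace tokenizer are assumptions, and giving every unweighted term a weight of 1.0 is one possible convention rather than something the slides prescribe.

```python
from collections import Counter

def bag_of_words(text):
    """Count term occurrences; word order is discarded."""
    return Counter(text.lower().split())

# Made-up document text.
doc = bag_of_words("text database text information text")

# Weighted query, as in Q = < database 0.5; text 0.8; information 0.2 >.
weighted_query = {"database": 0.5, "text": 0.8, "information": 0.2}

# Unweighted query; assumption: treat each term as having weight 1.0.
unweighted_query = {term: 1.0 for term in ("database", "text", "information")}

print(doc)
```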
Statistical Retrieval
Retrieval based on similarity between query and documents. Output documents are ranked according to similarity to query. Similarity based on occurrence frequencies of keywords in query and document. Automatic relevance feedback can be supported:
Relevant documents added to query. Irrelevant documents subtracted from query.
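One simple reading of the feedback rule above is term-wise addition and subtraction of weights. The sketch below assumes queries and documents are dictionaries of term weights and applies every document with unit weight; practical systems usually scale the contributions, which is not shown here. All names and numbers are hypothetical.

```python
def apply_feedback(query, relevant_docs, irrelevant_docs):
    """Return query + relevant documents - irrelevant documents, term by term."""
    updated = dict(query)
    for doc in relevant_docs:
        for term, weight in doc.items():
            updated[term] = updated.get(term, 0.0) + weight
    for doc in irrelevant_docs:
        for term, weight in doc.items():
            updated[term] = updated.get(term, 0.0) - weight
    return updated

# Made-up query and judged documents.
query = {"database": 1.0, "text": 1.0}
relevant = [{"text": 2.0, "retrieval": 1.0}]
irrelevant = [{"database": 1.0, "hardware": 3.0}]

new_query = apply_feedback(query, relevant, irrelevant)
print(new_query)
```

Terms that appear only in irrelevant documents end up with negative weights, which pushes documents containing them down the ranking.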
How to determine the degree of importance of a term within a document and within the entire collection? How to determine the degree of similarity between a document and the query? In the case of the web, what is a collection and what are the effects of links, formatting information, etc.?
Each term, i, in a document or query, j, is given a real-valued weight, wij. Both documents and queries are expressed as t-dimensional vectors:
$d_j = (w_{1j}, w_{2j}, \ldots, w_{tj})$
Graphic Representation
Example:
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + T3
Q = 0T1 + 0T2 + 2T3
[Figure: D1, D2, and Q drawn as vectors in the three-dimensional term space with axes T1, T2, and T3.]
Is D1 or D2 more similar to Q? How to measure the degree of similarity? Distance? Angle? Projection?
Document Collection
A collection of n documents can be represented in the vector space model by a term-document matrix. An entry in the matrix corresponds to the weight of a term in the document; zero means the term has no significance in the document or simply does not occur in it.

      T1    T2    ...   Tt
D1    w11   w21   ...   wt1
D2    w12   w22   ...   wt2
:     :     :           :
Dn    w1n   w2n   ...   wtn
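A small term-document matrix can be built directly from tokenized documents. In this sketch the weights are raw term counts, which is an assumption made to keep the example short; the document names and tokens are invented.

```python
def term_document_matrix(docs):
    """Return (vocabulary, rows), with one row of term weights per document."""
    vocab = sorted({token for tokens in docs.values() for token in tokens})
    # Assumption: the weight of a term is its raw count in the document.
    rows = {name: [tokens.count(term) for term in vocab]
            for name, tokens in docs.items()}
    return vocab, rows

# Made-up tokenized documents.
docs = {
    "D1": ["text", "database", "text"],
    "D2": ["information", "text"],
}
vocab, rows = term_document_matrix(docs)
print(vocab)
for name, row in rows.items():
    print(name, row)
```

Most entries in a realistic matrix are zero, so practical systems store only the non-zero weights.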
TF-IDF Weighting
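A typical combined term importance indicator is tf-idf weighting:
$w_{ij} = tf_{ij} \cdot idf_i = tf_{ij} \cdot \log_2(N / df_i)$
where $tf_{ij}$ is the term frequency of term i in document j, N is the total number of documents in the collection, and $df_i$ is the number of documents containing term i.
A term occurring frequently in the document but rarely in the rest of the collection is given high weight.
Many other ways of determining term weights have been proposed.
Experimentally, tf-idf has been found to work well.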
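The weighting formula can be checked numerically. The sketch below assumes tf is passed in as a plain number (raw or normalized, the function does not care) and uses invented counts for a collection of 8 documents.

```python
import math

def tf_idf(tf, df, n_docs):
    """w_ij = tf_ij * log2(N / df_i)."""
    return tf * math.log2(n_docs / df)

# Hypothetical numbers: same term frequency, different document frequencies.
w_rare = tf_idf(tf=3, df=1, n_docs=8)    # term appears in 1 of 8 documents
w_common = tf_idf(tf=3, df=8, n_docs=8)  # term appears in every document

print(w_rare, w_common)
```

A term that occurs in every document gets idf = log2(1) = 0, so it contributes nothing to the weight regardless of how often it appears.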
Query Vector
The query vector is typically treated as a document and is also tf-idf weighted. An alternative is for the user to supply weights for the given query terms.
Similarity Measure
A similarity measure is a function that computes the degree of similarity between two vectors. Using a similarity measure between the query and each document:
It is possible to rank the retrieved documents in the order of presumed relevance. It is possible to enforce a certain threshold so that the size of the retrieved set can be controlled.
$\mathrm{sim}(d_j, q) = d_j \cdot q = \sum_{i=1}^{t} w_{ij} \, w_{iq}$
where wij is the weight of term i in document j and wiq is the weight of term i in the query
For binary vectors, the inner product is the number of matched query terms in the document (size of intersection). For weighted term vectors, it is the sum of the products of the weights of the matched terms.
Binary:
Q = 1, 0, 1, 0, 0, ...
Size of vector = size of vocabulary = 7
0 means the corresponding term is not found in the document or query
sim(D, Q) = 3
Weighted:
D1 = 2T1 + 3T2 + 5T3    sim(D1, Q) = 2*0 + 3*0 + 5*2 = 10
D2 = 3T1 + 7T2 + 1T3    sim(D2, Q) = 3*0 + 7*0 + 1*2 = 2
Q = 0T1 + 0T2 + 2T3
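The weighted example can be verified with a short inner-product function. The vectors are the ones from the example, written as lists of weights for T1, T2, and T3; the function name is chosen for illustration.

```python
def inner_product(d, q):
    """Sum of the products of matching term weights."""
    return sum(w_d * w_q for w_d, w_q in zip(d, q))

# Weights for T1, T2, T3 taken from the weighted example.
D1 = [2, 3, 5]
D2 = [3, 7, 1]
Q = [0, 0, 2]

sim_d1 = inner_product(D1, Q)
sim_d2 = inner_product(D2, Q)
print(sim_d1, sim_d2)
```

Only T3 has a non-zero query weight, so only the T3 weights of the documents contribute to the score.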
[Figure: D1, D2, and Q drawn as vectors on axes t1, t2, and t3.]
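Cosine similarity normalizes the inner product by the lengths of the two vectors:
$\mathrm{CosSim}(d_j, q) = \frac{d_j \cdot q}{|d_j| \, |q|} = \frac{\sum_{i=1}^{t} w_{ij} \, w_{iq}}{\sqrt{\sum_{i=1}^{t} w_{ij}^2} \, \sqrt{\sum_{i=1}^{t} w_{iq}^2}}$
D1 = 2T1 + 3T2 + 5T3    CosSim(D1, Q) = 10 / sqrt((4+9+25)(0+0+4)) = 0.81
D2 = 3T1 + 7T2 + 1T3    CosSim(D2, Q) = 2 / sqrt((9+49+1)(0+0+4)) = 0.13
Q = 0T1 + 0T2 + 2T3
D1 is about 6 times better than D2 using cosine similarity but only 5 times better using the inner product.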
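The two cosine values can be reproduced with a few lines of Python. This sketch uses the example vectors as plain lists; returning 0.0 for an all-zero vector is an assumption made to avoid division by zero.

```python
import math

def cos_sim(d, q):
    """Inner product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(d, q))
    norm = math.sqrt(sum(a * a for a in d)) * math.sqrt(sum(b * b for b in q))
    # Assumption: define the similarity as 0.0 when either vector is all zeros.
    return dot / norm if norm else 0.0

# Weights for T1, T2, T3 taken from the example.
D1 = [2, 3, 5]
D2 = [3, 7, 1]
Q = [0, 0, 2]

cos_d1 = cos_sim(D1, Q)
cos_d2 = cos_sim(D2, Q)
print(round(cos_d1, 2), round(cos_d2, 2))
```

Normalizing by length penalizes D2 for its large T2 weight, which the query does not ask for; that is why the gap between the documents is wider than under the inner product.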
Naive Implementation
Convert all documents in collection D to tf-idf weighted vectors, dj, for keyword vocabulary V.
Convert the query to a tf-idf weighted vector q.
For each dj in D do
    Compute score sj = cosSim(dj, q)
Sort documents by decreasing score.
Present top-ranked documents to the user.
Time complexity: O(|V||D|). Bad for large V and D!
|V| = 10,000; |D| = 100,000; |V||D| = 1,000,000,000
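The naive procedure can be sketched as a loop over all document vectors followed by a sort. The sketch assumes the vectors are already weighted (tf-idf weighting happens upstream) and stored as dense lists of length |V|; `rank` and `top_k` are hypothetical names.

```python
import math

def cos_sim(d, q):
    """Cosine similarity of two dense weight vectors."""
    dot = sum(a * b for a, b in zip(d, q))
    norm = math.sqrt(sum(a * a for a in d)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def rank(doc_vectors, query_vector, top_k=10):
    """Score every document against the query and return the best top_k."""
    # One pass over all |D| documents, each costing O(|V|): O(|V||D|) overall.
    scored = [(cos_sim(vec, query_vector), name)
              for name, vec in doc_vectors.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# Example vectors reused from the cosine similarity example.
doc_vectors = {"D2": [3, 7, 1], "D1": [2, 3, 5]}
ranking = rank(doc_vectors, [0, 0, 2])
for score, name in ranking:
    print(name, round(score, 2))
```

Every document is touched on every query, even those sharing no terms with it, which is the cost the O(|V||D|) estimate refers to.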