Lecture 6: Scoring, Term Weighting and the Vector Space Model
Introduction to Information Retrieval
Topic: Scoring, Term Weighting and the Vector Space Model
Ranked retrieval
▪ Thus far, our queries have all been Boolean: documents either match or don’t.
▪ In ranked retrieval, the system instead returns an ordering over the documents in the collection for a query, so that the best matches come first.
Bag of words model
▪ The vector representation does not consider the ordering of words: John is quicker than Mary and Mary is quicker than John have the same vectors.
Term frequency tf
▪ A document or zone that mentions a query term more often has more to do with that query and therefore should receive a higher score.
▪ The term frequency tf_{t,d} of term t in document d is the number of times that t occurs in d.
▪ We want to use tf when computing query-document match scores. But how? Raw term frequency is not what we want: relevance does not increase proportionally with the number of occurrences.
Log-frequency weighting
▪ The log frequency weight of term t in d is
  w_{t,d} = \begin{cases} 1 + \log_{10} \mathrm{tf}_{t,d}, & \text{if } \mathrm{tf}_{t,d} > 0 \\ 0, & \text{otherwise} \end{cases}
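A minimal Python sketch of this weighting (function name ours):

    import math

    def log_tf_weight(tf: int) -> float:
        # Log-frequency weight: 1 + log10(tf) for tf > 0, else 0.
        return 1 + math.log10(tf) if tf > 0 else 0.0

    # tf = 0, 1, 2, 10, 1000  ->  weights 0, 1, 1.3, 2, 4
    print([round(log_tf_weight(tf), 2) for tf in (0, 1, 2, 10, 1000)])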
idf weight
▪ df_t is the document frequency of t: the number of documents that contain t.
▪ df_t is an inverse measure of the informativeness of t: the more documents a term appears in, the less it discriminates between them.
▪ df_t ≤ N, where N is the total number of documents in the collection.
▪ We define the idf (inverse document frequency) of t by
  \mathrm{idf}_t = \log_{10}\left(\frac{N}{\mathrm{df}_t}\right)
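A quick sketch of the idf computation (names ours); note how rarer terms get higher weights:

    import math

    def idf(N: int, df_t: int) -> float:
        # Inverse document frequency: log10(N / df_t).
        return math.log10(N / df_t)

    # With N = 1,000,000 documents:
    print(idf(1_000_000, 1))          # 6.0  (term occurs in a single document)
    print(idf(1_000_000, 1_000))      # 3.0
    print(idf(1_000_000, 1_000_000))  # 0.0  (term occurs in every document)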
tf-idf weighting
▪ The tf-idf weighting scheme assigns to term t a weight in document d given by
  \mathrm{tf\text{-}idf}_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t \qquad (1)
tf-idf weighting
▪ One may view each document as a vector with one component
corresponding to each term in the dictionary, together with a
weight for each component that is given by eq. (1).
▪ The score of a document d for a query q is the sum of the tf-idf weights of the terms of q that occur in d:
  \mathrm{Score}(q,d) = \sum_{t \in q} \mathrm{tf\text{-}idf}_{t,d} \qquad (2)
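A short sketch of eq. (2) in Python, assuming tokenized documents and a precomputed document-frequency map (all names ours; raw tf is used, and log-frequency weighting could be substituted):

    import math
    from collections import Counter

    def score(query_terms, doc_tokens, df, N):
        # Score(q, d) per eq. (2): sum the tf-idf weights of the
        # query terms that occur in the document.
        tf = Counter(doc_tokens)
        return sum(tf[t] * math.log10(N / df[t])
                   for t in set(query_terms) if tf[t] > 0)

    # Tiny demo with two hypothetical documents:
    docs = ["new home sales rise".split(), "home prices fall".split()]
    df = Counter(t for d in docs for t in set(d))
    print([round(score(["home", "sales"], d, df, N=len(docs)), 3) for d in docs])
    # [0.301, 0.0] -- only the first document contains "sales";
    # "home" occurs in every document, so its idf (and contribution) is 0.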
Example:
Table 1 below gives the idf values of terms with various document frequencies in the Reuters collection of 806,791 documents.
Consider the table of term frequencies for three documents, denoted Doc1, Doc2, and Doc3, in Table 2. Compute the tf-idf weight of each of the terms car, auto, insurance, and best in each document, using the idf values from Table 1.
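A sketch of the required computation. Caution: the idf values and tf counts below are the ones used in the textbook version of this exercise (Introduction to Information Retrieval, Manning et al.); if your Tables 1 and 2 differ, substitute your own numbers:

    # idf values for the Reuters collection (Table 1, per the textbook).
    idf = {"car": 1.65, "auto": 2.08, "insurance": 1.62, "best": 1.50}

    # Term frequencies per document (Table 2, per the textbook).
    tf = {
        "Doc1": {"car": 27, "auto": 3,  "insurance": 0,  "best": 14},
        "Doc2": {"car": 4,  "auto": 33, "insurance": 33, "best": 0},
        "Doc3": {"car": 24, "auto": 0,  "insurance": 29, "best": 17},
    }

    # tf-idf weight = tf x idf, per eq. (1).
    for doc, counts in tf.items():
        print(doc, {t: round(counts[t] * idf[t], 2) for t in idf})
    # Doc1: car 44.55, auto 6.24, insurance 0.0, best 21.0  (and so on)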
Euclidean normalization
▪ To length-normalize a document vector, divide each of its components by the vector’s Euclidean length:
  \vec{v}(d) = \frac{\vec{V}(d)}{\sqrt{\sum_{i=1}^{M} V_i^2(d)}} \qquad (3)
Example:
Consider the documents given in the table below. Apply Euclidean normalization to the tf values from the table for each of the three documents.
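A minimal sketch of Euclidean (L2) normalization, eq. (3), with hypothetical tf values:

    import math

    def euclidean_normalize(tf_vector):
        # Divide each component by the vector's Euclidean length,
        # so the resulting vector has length 1.
        length = math.sqrt(sum(v * v for v in tf_vector.values()))
        return {t: v / length for t, v in tf_vector.items()}

    # Hypothetical tf values for one document:
    print(euclidean_normalize({"car": 27, "auto": 3, "best": 14}))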
Example:
▪ The table below shows the number of occurrences of three terms (affection, jealous, and gossip) in each of three novels: Jane Austen’s Sense and Sensibility (SaS), Pride and Prejudice (PaP), and Emily Brontë’s Wuthering Heights (WH).
Table 1: Term frequencies in the three novels. Table 2: Term vectors for the three novels of Table 1.
▪ Now consider the cosine similarities between pairs of the resulting three-dimensional vectors. A simple computation shows that sim(v(SaS), v(PaP)) is 0.999, whereas sim(v(SaS), v(WH)) is 0.888; thus the two books authored by Austen (SaS and PaP) are considerably closer to each other than either is to Brontë’s Wuthering Heights. In fact, the similarity between the first two is almost perfect (when restricted to the three terms we consider). Here we have used tf weights, but we could of course use other term weighting functions.
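A sketch that reproduces these similarities; the raw tf counts below are the ones used in the textbook version of this example (Introduction to Information Retrieval, Manning et al.):

    import math

    def cosine(u, v):
        # Cosine similarity: dot product divided by the product of lengths.
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) *
                      math.sqrt(sum(b * b for b in v)))

    # Raw tf counts for (affection, jealous, gossip), per the textbook.
    sas, pap, wh = (115, 10, 2), (58, 7, 0), (20, 11, 6)
    print(round(cosine(sas, pap), 3))  # 0.999
    print(round(cosine(sas, wh), 3))   # 0.889 (0.888 in the slide, which rounds intermediate values)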
Queries as vectors
▪ We can also view a query as a vector in the same space as the documents.
▪ The key idea now: assign to each document d a score equal to the dot product of the length-normalized query and document vectors, i.e. their cosine similarity:
  \mathrm{Score}(q,d) = \vec{v}(q) \cdot \vec{v}(d)
Example:
Suppose we issue the query "gold silver truck" to an IR system whose collection consists of the three documents (D = 3) shown below.
Query, Q: “gold silver truck”.
D1: “Shipment of gold damaged in a fire”
D2: “Delivery of silver arrived in a silver truck”
D3: “Shipment of gold arrived in a truck”
Similarity Analysis
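The detailed computation is omitted in the slides; below is a sketch of the standard worked solution, using idf = log10(D/df), raw tf × idf weights, and a simple inner-product (dot-product) similarity, as in the common version of this example. Cosine normalization would change the absolute scores but leaves this ranking unchanged:

    import math
    from collections import Counter

    docs = {
        "D1": "shipment of gold damaged in a fire",
        "D2": "delivery of silver arrived in a silver truck",
        "D3": "shipment of gold arrived in a truck",
    }
    query = "gold silver truck"

    D = len(docs)
    tf = {d: Counter(text.split()) for d, text in docs.items()}
    df = Counter(t for counts in tf.values() for t in counts)
    idf = {t: math.log10(D / df[t]) for t in df}

    # Similarity = sum over query terms of (query weight) x (document weight),
    # with weight = tf x idf in both cases.
    q_tf = Counter(query.split())
    for d in docs:
        sim = sum(q_tf[t] * idf[t] * tf[d][t] * idf[t] for t in q_tf)
        print(d, round(sim, 3))
    # D2 (~0.486) ranks first, then D3 (~0.062), then D1 (~0.031).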