Ranked Retrieval: Thus Far, Our Queries Have All Been Boolean
Ranked retrieval
Thus far, our queries have all been Boolean.
Documents either match or don’t.
Good for expert users with precise understanding of
their needs and the collection.
Also good for applications: Applications can easily
consume 1000s of results.
Not good for the majority of users.
Most users are incapable of writing Boolean queries (or
are capable, but find it too much work).
Most users don’t want to wade through 1000s of results.
This is particularly true of web search.
Introduction to Information Retrieval Ch. 6
Term-document incidence matrix (1 if the play contains the word, 0 otherwise):

           Antony and  Julius  The      Hamlet  Othello  Macbeth
           Cleopatra   Caesar  Tempest
Antony     1           1       0        0       0        1
Brutus     1           1       0        1       0        0
Caesar     1           1       0        1       1        1
Calpurnia  0           1       0        0       0        0
Cleopatra  1           0       0        0       0        0
mercy      1           0       1        1       1        1
worser     1           0       1        1       1        0

Term-document count matrix (number of occurrences of the word in the play):

           Antony and  Julius  The      Hamlet  Othello  Macbeth
           Cleopatra   Caesar  Tempest
Antony     157         73      0        0       0        0
Brutus     4           157     0        1       0        0
Caesar     232         227     0        2       1        1
Calpurnia  0           10      0        0       0        0
Cleopatra  57          0       0        0       0        0
mercy      2           0       3        5       5        1
worser     2           0       1        1       1        0
Term frequency tf
The term frequency tft,d of term t in document d is
defined as the number of times that t occurs in d.
Note: Frequency means count in IR
We want to use tf when computing query-document
match scores. But how?
Raw term frequency is not what we want:
A document with 10 occurrences of the term is more
relevant than a document with 1 occurrence of the term.
But not 10 times more relevant.
Relevance does not increase proportionally with
term frequency.
Log-frequency weighting
The log frequency weight of term t in d is
w_{t,d} = 1 + log10(tf_{t,d}),  if tf_{t,d} > 0
w_{t,d} = 0,                    otherwise
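As a minimal sketch (the function name is illustrative), the log-frequency weight can be computed directly from a raw count:

```python
import math

def log_freq_weight(tf: int) -> float:
    """Log-frequency weight: 1 + log10(tf) if tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

# The weight grows slowly with the count:
# tf = 1 -> 1, tf = 10 -> 2, tf = 1000 -> 4
```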
idf weight
dft is the document frequency of t: the number of
documents that contain t
dft is an inverse measure of the informativeness of t
df_t ≤ N
We define the idf (inverse document frequency) of t by
idf_t = log10(N / df_t)
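A direct translation of this definition (a sketch, with illustrative names):

```python
import math

def idf(N: int, df: int) -> float:
    """Inverse document frequency: log10(N / df_t)."""
    return math.log10(N / df)

# With N = 1,000,000 documents: a term occurring in every document
# gets idf 0; a term occurring in only one document gets idf 6.
```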
tf-idf weighting
The tf-idf weight of a term is the product of its tf
weight and its idf weight.
w_{t,d} = log(1 + tf_{t,d}) × log10(N / df_t)

Score(q, d) = Σ_{t ∈ q∩d} tf-idf_{t,d}   (sum over terms t occurring in both q and d)
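Combining the two weights, a query-document score can be sketched as follows (function and variable names are illustrative):

```python
import math

def tf_idf(tf: int, df: int, N: int) -> float:
    """tf-idf weight: log(1 + tf) * log10(N / df), as defined above."""
    return math.log(1 + tf) * math.log10(N / df)

def score(query_terms, doc_tf, df, N):
    """Sum tf-idf over terms occurring in both the query and the document."""
    return sum(tf_idf(doc_tf[t], df[t], N)
               for t in query_terms if doc_tf.get(t, 0) > 0)
```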
Documents as vectors
So we have a |V|-dimensional vector space
Terms are axes of the space
Documents are points or vectors in this space
Very high-dimensional: tens of millions of dimensions
when you apply this to a web search engine
These are very sparse vectors - most entries are zero.
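Because most entries are zero, a document vector is usually stored sparsely, e.g. as a map from term to weight (the values below are made up for illustration):

```python
# Sparse representation: store only the nonzero coordinates.
doc_vector = {"antony": 2.19, "brutus": 1.60, "caesar": 2.37}  # term -> weight

# Any term not stored implicitly has weight 0:
weight = doc_vector.get("calpurnia", 0.0)
```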
Queries as vectors
Key idea 1: Do the same for queries: represent them
as vectors in the space
Key idea 2: Rank documents according to their
proximity to the query in this space
proximity = similarity of vectors
proximity ≈ inverse of distance
Length normalization
A vector can be (length-) normalized by dividing each
of its components by its length – for this we use the
L2 norm:
||x||_2 = sqrt( Σ_i x_i² )
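A minimal sketch of L2 normalization (illustrative function name):

```python
import math

def l2_normalize(vec):
    """Divide each component by the vector's L2 norm (Euclidean length)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec

unit = l2_normalize([3.0, 4.0])  # norm is 5, so the result is [0.6, 0.8]
```

After normalization the vector has length 1, so documents of very different lengths become comparable.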
cosine(query, document)

cos(q, d) = (q · d) / (||q|| ||d||)
          = Σ_{i=1}^{|V|} q_i d_i / ( sqrt(Σ_{i=1}^{|V|} q_i²) · sqrt(Σ_{i=1}^{|V|} d_i²) )

q_i is the tf-idf weight of term i in the query; d_i is the tf-idf weight of
term i in the document. The numerator is the dot product; dividing by the
norms makes q and d unit vectors.

For length-normalized q, d, cosine similarity is just the dot product:

cos(q, d) = q · d = Σ_{i=1}^{|V|} q_i d_i
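The formula above translates directly into code (a sketch over plain lists, assuming both vectors share the same term axes):

```python
import math

def cosine(q, d):
    """Cosine similarity: dot product divided by the product of L2 norms."""
    dot = sum(qi * di for qi, di in zip(q, d))
    nq = math.sqrt(sum(qi * qi for qi in q))
    nd = math.sqrt(sum(di * di for di in d))
    return dot / (nq * nd)
```

For length-normalized vectors the denominator is 1, so only the dot product remains.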
How similar are the novels SaS: Sense and Sensibility, PaP: Pride and
Prejudice, and WH: Wuthering Heights?

Term frequencies (counts):

term       SaS  PaP  WH
jealous    10   7    11
gossip     2    0    6
wuthering  0    0    38
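As a sketch, we can apply log-frequency weighting and cosine similarity to the counts above, using only the three terms shown (the chapter's full example uses more terms, so exact values may differ):

```python
import math

def log_w(tf):
    """Log-frequency weight: 1 + log10(tf) if tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# Counts for (jealous, gossip, wuthering) from the table above.
sas = [log_w(t) for t in (10, 2, 0)]   # Sense and Sensibility
pap = [log_w(t) for t in (7, 0, 0)]    # Pride and Prejudice
wh  = [log_w(t) for t in (11, 6, 38)]  # Wuthering Heights

# The two Austen novels come out more similar to each other than to WH:
print(cosine(sas, pap), cosine(sas, wh))
```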