Ranked Retrieval

Ranked retrieval aims to return documents in order of their relevance to a query, rather than simply returning documents that either match or don't, as in Boolean retrieval. It assigns a score, typically between 0 and 1, to each document based on how well the document matches the query. The score is based on factors such as term frequency (how many times a term appears in a document) and inverse document frequency (how common or rare a term is across all documents). A popular weighting scheme is tf-idf, which multiplies term frequency by inverse document frequency, giving higher weight to rare terms that appear many times in a document. Documents and queries can then be represented as vectors in a high-dimensional space, with a weight along each term dimension.


Ranked retrieval

Nisheeth
Ch. 6

Ranked retrieval
• Thus far, our queries have all been Boolean.
– Documents either match or don’t.
• Good for expert users with precise
understanding of their needs and the
collection.
– Also good for applications: Applications can easily
consume 1000s of results.
• Not good for the majority of users.
– Writing Boolean queries is hard

Problem with Boolean search: feast or famine
• Boolean queries often result in either too few
(=0) or too many (1000s) results.
• Query 1: “standard user dlink 650” → 200,000
hits
• Query 2: “standard user dlink 650 no card
found”: 0 hits
• It takes a lot of skill to come up with a query
that produces a manageable number of hits.
– AND gives too few; OR gives too many
Ranked retrieval models
• Rather than a set of documents satisfying a query expression,
in ranked retrieval, the system returns an ordering over the
(top) documents in the collection for a query
• Free text queries: Rather than a query language of operators
and expressions, the user’s query is just one or more words in
a human language
• Ranked list of results: No more feast or famine


Scoring as the basis of ranked retrieval
• We wish to return in order the documents
most likely to be useful to the searcher
• How can we rank-order the documents in the
collection with respect to a query?
• Assign a score – say in [0, 1] – to each
document
• This score measures how well document and
query “match”.

Query-document matching scores

• We need a way of assigning a score to a query/document pair
• If the query term does not occur in the
document: score should be 0
• The more frequent the query term in the
document, the higher the score (should be)

Jaccard coefficient

• jaccard(A,B) = |A ∩ B| / |A ∪ B|
• jaccard(A,A) = 1
• jaccard(A,B) = 0 if A ∩ B = ∅
• Always assigns a number between 0 and 1.
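
As a small illustration of the definition above, here is a minimal Python sketch of Jaccard scoring; the whitespace tokenization and the toy query/documents are assumptions for illustration only.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0  # convention: two empty sets are identical
    return len(a & b) / len(a | b)

# Toy example: score a query against two documents by term overlap alone.
query = set("ides of march".split())
doc1 = set("caesar died in march".split())
doc2 = set("the long march".split())
print(jaccard(query, doc1))  # 1/6 ≈ 0.17
print(jaccard(query, doc2))  # 1/5 = 0.20
```

Note that the longer document scores lower purely because its union with the query is larger, which is exactly the length issue raised on the next slide.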

Issues with Jaccard for scoring

• Privileges shorter documents
– we need a more sophisticated way of normalizing for length, e.g. |A ∩ B| / √|A ∪ B|
• It doesn’t consider term frequency
– how many times a term occurs in a document
• Does not account for term informativeness
– how important the term is in the collection (rare terms are more informative than frequent ones)

Accounting for term frequency

(columns: Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth)
Antony      1  1  0  0  0  1
Brutus      1  1  0  1  0  0
Caesar      1  1  0  1  1  1
Calpurnia   0  1  0  0  0  0
Cleopatra   1  0  0  0  0  0
mercy       1  0  1  1  1  1
worser      1  0  1  1  1  0

Each document is represented by a binary vector ∈ {0,1}^|V|


Term frequency tf

• The term frequency tf_t,d of term t in document d is defined as the number of times that t occurs in d.
• We want to use tf when computing query-
document match scores. But how?
• Raw term frequency is not what we want:
– A document with 10 occurrences of the term is
more relevant than a document with 1 occurrence
of the term.
– But not 10 times more relevant.

Log-frequency weighting
• The log-frequency weight of term t in d is:
    w_t,d = 1 + log10( tf_t,d )   if tf_t,d > 0
    w_t,d = 0                     otherwise
• 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
• Score for a document-query pair: sum over terms t in both q and d:
    score(q,d) = Σ_{t ∈ q∩d} ( 1 + log10 tf_t,d )
• The score is 0 if none of the query terms is present in the document.
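
A minimal Python sketch of this overlap score; representing a document as a dictionary of raw term frequencies is an assumption made for illustration.

```python
import math

def log_tf_score(query_terms, doc_tf):
    """score(q, d) = sum over t in q ∩ d of (1 + log10 tf_{t,d})."""
    score = 0.0
    for t in query_terms:
        tf = doc_tf.get(t, 0)
        if tf > 0:  # terms absent from the document contribute nothing
            score += 1 + math.log10(tf)
    return score

doc = {"antony": 2, "brutus": 1, "caesar": 10}
print(log_tf_score(["brutus", "caesar"], doc))  # 1 + (1 + log10 10) = 3.0
```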

Document frequency

• Rare terms are more informative than frequent terms
– Recall stop words
• Consider a term in the query that is rare in the
collection (e.g., arachnocentric)
• A document containing this term is very likely to
be relevant to the query arachnocentric
• → We want a high weight for rare terms like
arachnocentric.

Document frequency, continued

• Frequent terms are less informative than rare terms
• Consider a query term that is frequent in the
collection (e.g., high, increase, line)
• A document containing such a term is more
likely to be relevant than a document that
doesn’t
• But it’s not a sure indicator of relevance.
– How/when will it break?
idf weight

• df_t is the document frequency of t: the number of documents that contain t
– df_t is an inverse measure of the informativeness of t
– df_t ≤ N
• We define the idf (inverse document frequency) of t by
    idf_t = log10( N/df_t )
– We use log10( N/df_t ) instead of N/df_t to “dampen” the effect of idf.
idf example, suppose N = 1 million

term        df_t         idf_t
calpurnia   1            6
animal      100          4
sunday      1,000        3
fly         10,000       2
under       100,000      1
the         1,000,000    0

idf_t = log10( N/df_t )


There is one idf value for each term t in a collection.
tf.idf weighting
• The tf.idf weight of a term is the product of its tf
weight and its idf weight.

    w_t,d = log( 1 + tf_t,d ) × log10( N/df_t )
• Best known weighting scheme in information retrieval
• Increases with the number of occurrences within a
document
• Increases with the rarity of the term in the collection
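
The sketch below computes the tf.idf weight from this slide. The base of the first log is not specified on the slide, so taking it as base 10 is an assumption, and the toy tf/df/N values are chosen to echo the idf example above.

```python
import math

def tfidf_weight(tf, df, N):
    """w_{t,d} = log10(1 + tf_{t,d}) * log10(N / df_t).

    The earlier slide's variant uses (1 + log10 tf) as the tf factor instead;
    either way the weight grows with tf and with the rarity of the term.
    """
    if tf == 0 or df == 0:
        return 0.0
    return math.log10(1 + tf) * math.log10(N / df)

N = 1_000_000  # assumed collection size, as in the idf example
print(tfidf_weight(tf=3, df=100, N=N))      # rare term (idf = 4): higher weight
print(tfidf_weight(tf=3, df=100_000, N=N))  # common term (idf = 1): lower weight
```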
Effect of idf on ranking

• Does idf have an effect on ranking for one-term queries, like
– iPhone
• idf has no effect on ranking one term queries
– idf affects the ranking of documents for queries
with at least two terms
– For the query capricious person, idf weighting
makes occurrences of capricious count for much
more in the final document ranking than
occurrences of person.
Score for a document given a query

    Score(q,d) = Σ_{t ∈ q∩d} tf.idf_t,d

• There are many variants


– How “tf” is computed (with/without logs)
– Whether the terms in the query are also weighted
–…
tf-idf weighting has many variants

[The SMART table of tf-idf weighting variants is not reproduced here; columns headed ‘n’, etc. are acronyms for the weight schemes.]


Weighting may differ in queries vs
documents
• Many search engines allow for different
weightings for queries vs. documents
• SMART Notation: denotes the combination in
use in an engine, with the notation ddd.qqq,
using the acronyms from the previous table
– A very standard weighting scheme is: lnc.ltc
• Document: logarithmic tf (l as first character), no idf
and cosine normalization
• Query: logarithmic tf (l in leftmost column), idf (t in
second column), no normalization …
tf-idf example: lnc.ltc
Document: car insurance auto insurance
Query: best car insurance
Query (ltc weighting):
term        tf-raw   tf-wt   df      idf   wt    n’lize
auto        0        0       5000    2.3   0     0
best        1        1       50000   1.3   1.3   0.34
car         1        1       10000   2.0   2.0   0.52
insurance   1        1       1000    3.0   3.0   0.78

Document (lnc weighting):
term        tf-raw   tf-wt   wt    n’lize
auto        1        1       1     0.52
best        0        0       0     0
car         1        1       1     0.52
insurance   2        1.3     1.3   0.68

Products (query n’lize × document n’lize): auto 0, best 0, car 0.27, insurance 0.53

Exercise: what is N, the number of docs?

Doc length = √(1² + 0² + 1² + 1.3²) ≈ 1.92
Score = 0 + 0 + 0.27 + 0.53 = 0.8
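
The worked example above can be reproduced with a short script. This is a minimal sketch; it assumes N = 1,000,000 (which is consistent with the idf column and answers the exercise) and cosine-normalizes both sides, as ltc/lnc prescribe.

```python
import math

def log_tf(tf):
    return 1 + math.log10(tf) if tf > 0 else 0.0

N = 1_000_000  # assumed; it reproduces the idf values above
df = {"auto": 5000, "best": 50000, "car": 10000, "insurance": 1000}
query_tf = {"best": 1, "car": 1, "insurance": 1}
doc_tf = {"car": 1, "auto": 1, "insurance": 2}

# ltc query: logarithmic tf, idf, cosine normalization
q_wt = {t: log_tf(tf) * math.log10(N / df[t]) for t, tf in query_tf.items()}
q_len = math.sqrt(sum(w * w for w in q_wt.values()))

# lnc document: logarithmic tf, no idf, cosine normalization
d_wt = {t: log_tf(tf) for t, tf in doc_tf.items()}
d_len = math.sqrt(sum(w * w for w in d_wt.values()))

score = sum((q_wt[t] / q_len) * (d_wt[t] / d_len) for t in q_wt if t in d_wt)
print(round(d_len, 2), round(score, 2))  # 1.92 0.8, matching the slide
```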
Binary → count → weight matrix

(columns: Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth)
Antony      5.25   3.18   0      0      0      0.35
Brutus      1.21   6.1    0      1      0      0
Caesar      8.59   2.54   0      1.51   0.25   0
Calpurnia   0      1.54   0      0      0      0
Cleopatra   2.85   0      0      0      0      0
mercy       1.51   0      1.9    0.12   5.25   0.88
worser      1.37   0      0.11   4.15   0.25   1.95

Each document is now represented by a real-valued vector of tf-idf weights ∈ R^|V|

Documents as vectors

• So we have a |V|-dimensional vector space


• Terms are axes of the space
• Documents are points or vectors in this space
• Very high-dimensional: tens of millions of
dimensions when you apply this to a web
search engine
• These are very sparse vectors - most entries
are zero.

Queries as vectors
• Key idea 1: Do the same for queries: represent them
as vectors in the space
• Key idea 2: Rank documents according to their
proximity to the query in this space
• proximity = similarity of vectors
• proximity ≈ inverse of distance
• We do this because we want to get away from the
you’re-either-in-or-out Boolean model.
• Instead: rank more relevant documents higher than
less relevant documents
Euclidean distance is a bad idea

• The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.

cosine(query,document)

    cos(q, d) = (q · d) / (|q| |d|) = (q/|q|) · (d/|d|)
              = ( Σ_{i=1}^{|V|} q_i d_i ) / ( √(Σ_{i=1}^{|V|} q_i²) · √(Σ_{i=1}^{|V|} d_i²) )

The numerator is the dot product of q and d; dividing by the lengths |q| and |d| turns q and d into unit vectors.

q_i is the tf-idf weight of term i in the query
d_i is the tf-idf weight of term i in the document

cos(q,d) is the cosine similarity of q and d … or, equivalently, the cosine of the angle between q and d.
Length normalization
• A vector can be (length-) normalized by dividing each
of its components by its length – for this we use the
L2 norm:
    ||x||_2 = √( Σ_i x_i² )
• Dividing a vector by its L2 norm makes it a unit
(length) vector (on surface of unit hypersphere)
• Effect on the two documents d and d′ (d appended
to itself) from earlier slide: they have identical
vectors after length-normalization.
– Long and short documents now have comparable
weights
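
A minimal sketch of L2 length-normalization; the dictionary-of-weights representation and the toy vector are assumptions. It also checks the claim above that a document appended to itself ends up with the same unit vector.

```python
import math

def l2_normalize(vec):
    """Divide every component by the vector's L2 norm, yielding a unit vector."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)

d = {"jealous": 10, "gossip": 2}
dd = {t: 2 * w for t, w in d.items()}  # d appended to itself: every count doubles

unit_d = {t: round(w, 6) for t, w in l2_normalize(d).items()}
unit_dd = {t: round(w, 6) for t, w in l2_normalize(dd).items()}
print(unit_d == unit_dd)  # True: identical vectors after length-normalization
```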
Cosine for length-normalized
vectors
• For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):
    cos(q, d) = q · d = Σ_{i=1}^{|V|} q_i d_i
  for q, d length-normalized.

Cosine similarity amongst 3 documents

• How similar are these novels?
– SaS: Sense and Sensibility
– PaP: Pride and Prejudice
– WH: Wuthering Heights

Term frequencies (counts):
term        SaS   PaP   WH
affection   115   58    20
jealous     10    7     11
gossip      2     0     6
wuthering   0     0     38

Note: To simplify this example, we don’t do idf weighting.


3 documents example contd.
Log frequency weighting:
term        SaS    PaP    WH
affection   3.06   2.76   2.30
jealous     2.00   1.85   2.04
gossip      1.30   0      1.78
wuthering   0      0      2.58

After length normalization:
term        SaS     PaP     WH
affection   0.789   0.832   0.524
jealous     0.515   0.555   0.465
gossip      0.335   0       0.405
wuthering   0       0       0.588

cos(SaS,PaP) ≈ 0.789×0.832 + 0.515×0.555 + 0.335×0.0 + 0.0×0.0 ≈ 0.94
cos(SaS,WH) ≈ 0.79
cos(PaP,WH) ≈ 0.69

Why do we have cos(SaS,PaP) > cos(SaS,WH)?
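
The numbers above can be checked with a few lines of Python; this is a minimal sketch (log tf weighting, length normalization, no idf, as the example specifies).

```python
import math

def log_tf_vector(counts):
    return {t: 1 + math.log10(c) for t, c in counts.items() if c > 0}

def normalize(vec):
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()}

def cosine(u, v):
    return sum(u[t] * v[t] for t in u if t in v)  # dot product of unit vectors

counts = {
    "SaS": {"affection": 115, "jealous": 10, "gossip": 2},
    "PaP": {"affection": 58, "jealous": 7},
    "WH": {"affection": 20, "jealous": 11, "gossip": 6, "wuthering": 38},
}
vecs = {name: normalize(log_tf_vector(c)) for name, c in counts.items()}
print(round(cosine(vecs["SaS"], vecs["PaP"]), 2))  # ≈ 0.94
print(round(cosine(vecs["SaS"], vecs["WH"]), 2))   # ≈ 0.79
print(round(cosine(vecs["PaP"], vecs["WH"]), 2))   # ≈ 0.69
```

SaS and PaP concentrate their weight on the same terms (affection, jealous), while WH spends much of its weight on wuthering and gossip, which is why cos(SaS,PaP) comes out highest.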
Computing cosine scores
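
The algorithm figure that accompanied this slide is not reproduced here; below is a minimal Python sketch of term-at-a-time cosine scoring over an inverted index. The index layout (postings as lists of (docID, weight) pairs) and the doc_lengths mapping are illustrative assumptions, not a specific system's API.

```python
import heapq

def cosine_score(query_weights, postings, doc_lengths, k=10):
    """Term-at-a-time cosine scoring.

    query_weights: {term: weight of the term in the query}
    postings:      {term: [(doc_id, weight of the term in that doc), ...]}
    doc_lengths:   {doc_id: Euclidean length of the document vector}
    Returns the top-k (score, doc_id) pairs.
    """
    scores = {}
    for term, w_q in query_weights.items():
        for doc_id, w_d in postings.get(term, []):
            scores[doc_id] = scores.get(doc_id, 0.0) + w_q * w_d
    for doc_id in scores:
        scores[doc_id] /= doc_lengths[doc_id]  # length-normalize each document
    return heapq.nlargest(k, ((s, d) for d, s in scores.items()))
```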
Summary – vector space models

• Represent the query as a weighted tf-idf vector


• Represent each document as a weighted tf-idf
vector
• Compute the cosine similarity score for the
query vector and each document vector
• Rank documents with respect to the query by
score
• Return the top K (e.g., K = 10) to the user
Ranked retrieval

LANGUAGE MODELS
Trouble with frequency-based models
• Too literal
• Can’t deal with misspellings, synonyms etc.
• Natural language queries are hard to deal with
if you don’t address these difficulties
Language Model
• Unigram language model
– probability distribution over the words in a
language
– generation of text consists of pulling words out of
a “bucket” according to the probability
distribution and replacing them
• N-gram language model
– some applications use bigram and trigram
language models where probabilities depend on
previous words
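
A tiny sketch of the “bucket” picture for a unigram model: words are drawn independently, with replacement, according to a fixed distribution. The toy probabilities below are made-up numbers for illustration.

```python
import random

# Toy unigram language model: P(word) over a four-word vocabulary (assumed values).
unigram = {"the": 0.4, "said": 0.3, "president": 0.2, "lincoln": 0.1}

def generate(model, n):
    """Draw n words with replacement according to the unigram distribution."""
    words, probs = zip(*model.items())
    return random.choices(words, weights=probs, k=n)

print(" ".join(generate(unigram, 8)))
```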
Semantic distance
Sample topic
Language Model
• A topic in a document or query can be
represented as a language model
– i.e., words that tend to occur often when discussing a
topic will have high probabilities in the corresponding
language model
– The basic assumption is that words cluster in semantic
space
• Multinomial distribution over words
– text is modeled as a finite sequence of words, where
there are t possible words at each point in the
sequence
– commonly used, but not only possibility
– doesn’t model burstiness
Has interesting applications
LMs for Retrieval
• 3 possibilities:
– probability of generating the query text from a
document language model
– probability of generating the document text from
a query language model
– comparing the language models representing the
query and document topics
• Models of topical relevance
Query-Likelihood Model
• Rank documents by the probability that the
query could be generated by the document
model (i.e. same topic)
• Given a query, start with P(D|Q)
• Using Bayes’ Rule (P(Q) is the same for every document, so it can be ignored for ranking):
    P(D|Q) ∝ P(Q|D) P(D)
• Assuming the prior P(D) is uniform, and a unigram model:
    P(Q|D) = Π_{i=1..n} P(q_i|D)    for query words q_1 … q_n
Other query constructions
Estimating Probabilities
• Obvious estimate for unigram probabilities is
    P(q_i|D) = f_{q_i,D} / |D|
– f_{q_i,D} is the number of times q_i occurs in D; |D| is the number of words in D
• Maximum likelihood estimate
– makes the observed value of f_{q_i,D} most likely
• If query words are missing from the document, the score will be zero
– Missing 1 out of 4 query words is the same as missing 3 out of 4
Smoothing
• Document texts are a sample from the
language model
– Missing words should not have zero probability of
occurring
• Smoothing is a technique for estimating
probabilities for missing (or unseen) words
– lower (or discount) the probability estimates for
words that are seen in the document text
– assign that “left-over” probability to the estimates
for the words that are not seen in the text
Estimating Probabilities
• Estimate for unseen words is α_D P(q_i|C)
– P(q_i|C) is the probability for query word i in the collection language model for collection C (background probability)
– α_D is a parameter
• Estimate for words that occur is
    (1 − α_D) P(q_i|D) + α_D P(q_i|C)
• Different forms of estimation come from different α_D
Jelinek-Mercer Smoothing
• α_D is a constant, λ
• Gives estimate of
    p(q_i|D) = (1 − λ) f_{q_i,D} / |D| + λ c_{q_i} / |C|
– c_{q_i} is the number of times q_i occurs in the collection; |C| is the total number of word occurrences in the collection
• Ranking score
    P(Q|D) = Π_i [ (1 − λ) f_{q_i,D} / |D| + λ c_{q_i} / |C| ]
• Use logs for convenience
    log P(Q|D) = Σ_i log[ (1 − λ) f_{q_i,D} / |D| + λ c_{q_i} / |C| ]
– avoids accuracy problems from multiplying many small numbers
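
A minimal sketch of the Jelinek-Mercer score above; the document and collection statistics passed in are assumed toy values, and λ = 0.5 is just a placeholder setting.

```python
import math

def jm_score(query, doc_tf, doc_len, coll_tf, coll_len, lam=0.5):
    """log P(Q|D) = sum_i log[(1 - lam) * f_{q_i,D}/|D| + lam * c_{q_i}/|C|]."""
    score = 0.0
    for q in query:
        p_doc = doc_tf.get(q, 0) / doc_len
        p_coll = coll_tf.get(q, 0) / coll_len
        p = (1 - lam) * p_doc + lam * p_coll
        if p == 0:
            return float("-inf")  # word unseen even in the collection
        score += math.log(p)
    return score

# Toy usage: "lincoln" is missing from the document but smoothing keeps p > 0.
doc = {"president": 15}
coll = {"president": 160_000, "lincoln": 2_400}
print(jm_score(["president", "lincoln"], doc, 1_800, coll, 1_000_000_000))
```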
Compare with tf.idf

– the term’s contribution is proportional to its frequency in the document and inversely proportional to its frequency in the collection
Dirichlet Smoothing
• α_D depends on document length:
    α_D = μ / (|D| + μ)
– μ is a parameter
• Gives probability estimation of
    p(q_i|D) = ( f_{q_i,D} + μ c_{q_i} / |C| ) / ( |D| + μ )
• and document score
    log P(Q|D) = Σ_i log[ ( f_{q_i,D} + μ c_{q_i} / |C| ) / ( |D| + μ ) ]
Query Likelihood Example
• For the term “president”
– f_{q_i,D} = 15, c_{q_i} = 160,000
• For the term “lincoln”
– f_{q_i,D} = 25, c_{q_i} = 2,400
• Document length |D| is assumed to be 1,800 words
• Collection length |C| is assumed to be 10⁹ words
– 500,000 documents times an average of 2,000 words
• μ = 2,000
Query Likelihood Example (contd.)
• The resulting scores are negative numbers because we are summing logs of small probabilities.
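
Plugging the numbers above into the Dirichlet-smoothed score gives the (negative) log-likelihood. This sketch uses natural logarithms, which is an assumption; the choice of base only rescales the scores and leaves the ranking unchanged.

```python
import math

mu, doc_len, coll_len = 2_000, 1_800, 1_000_000_000

def dirichlet_p(f_qD, c_q):
    """p(q_i|D) = (f_{q_i,D} + mu * c_{q_i}/|C|) / (|D| + mu)."""
    return (f_qD + mu * c_q / coll_len) / (doc_len + mu)

p_president = dirichlet_p(15, 160_000)  # ≈ 15.32 / 3800
p_lincoln = dirichlet_p(25, 2_400)      # ≈ 25.005 / 3800
score = math.log(p_president) + math.log(p_lincoln)
print(round(score, 2))  # ≈ -10.54: a negative number, as noted above
```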
Going beyond tf.idf
