IR Presentation 2

Uploaded by Jawad Abid

Introduction to Information Retrieval
Faster postings merges:
Skip pointers/Skip lists
Limitations of Boolean Retrieval Model
• Thus far, our queries have all been Boolean: documents either match or don't.
• Good for expert users with a precise understanding of their needs and the collection.
• Also good for applications: applications can easily consume 1000s of results.
• Not good for the majority of users: writing Boolean queries is hard.
Limitations of Boolean Retrieval Model
• Exact matching may retrieve too few (∼ 0) or too many (∼ 1000) documents.
• Query 1: "standard user dlink 650" → 200,000 hits
• Query 2: "standard user dlink 650 no card found" → 0 hits
• It takes a lot of skill to come up with a query that produces a manageable number of hits: AND gives too few, OR gives too many.
Ranked Retrieval

• Ranks documents by relevance to the query.
• Returns the top k documents by relevance.
• Allows free text queries: rather than a query language of operators and expressions, queries are just words of a human language.
Ranked Retrieval
Very large/small: No issue in Ranked Retrieval

• When the system produces a ranked result set, the size of the result set is not an issue.
• Just show the top k (k ∼ 10, 20, 100) results.
• Don't overwhelm the user.
• Premise: the ranking algorithm works.


Scoring as Basis of Ranked Retrieval

• We wish to return, in order, the documents most likely to be useful to the searcher.
• How do we rank-order the documents in a collection with respect to a query?
• Assign a score – say in [0, 1] – to each document.
• This score measures how well the document and query "match".
Scoring as Basis of Ranked Retrieval

• We need a way of assigning a score to a query/document pair.
• If the query term does not occur in the document, the score should be 0.
• The more frequent the query term is in the document, the higher the score should be.
Jaccard coefficient

• jaccard(A,B) = |A ∩ B| / |A ∪ B|
• jaccard(A,A) = 1
• jaccard(A,B) = 0 if A ∩ B = ∅
• Always assigns a number between 0 and 1.
Jaccard coefficient: Scoring Example

• What is the query–document match score that the Jaccard coefficient computes for each of the two documents below?

• Query: "ides of march" (3 terms)
• D1: "Caesar died in March" (4 terms)
• D2: "the long March" (3 terms)

• J(q, d1) = 1/6 ≈ 0.17
• J(q, d2) = 1/5 = 0.2
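The two scores above can be reproduced in a few lines of Python (a sketch; treating query and documents as sets of lowercased, whitespace-separated words is an assumed tokenization):

```python
# Jaccard coefficient between two token sets:
# |A ∩ B| / |A ∪ B|, always in [0, 1].
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

q  = "ides of march".lower().split()         # {ides, of, march}
d1 = "Caesar died in March".lower().split()  # {caesar, died, in, march}
d2 = "the long March".lower().split()        # {the, long, march}

print(jaccard(q, d1))  # 1/6: only "march" is shared, union has 6 terms
print(jaccard(q, d2))  # 0.2: one shared term out of a 5-term union
```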
Issues with Jaccard for scoring

• It privileges shorter documents; we need a more sophisticated way of normalizing for length, e.g. |A ∩ B| / √|A ∪ B|.
• It doesn't consider term frequency – how many times a term occurs in a document.
• Rare terms in a collection are more informative than frequent terms; Jaccard doesn't consider this information.


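The length-normalization variant mentioned above can be sketched as follows (dividing the overlap by the square root of the union size, so large unions from long documents are penalized less harshly than under plain Jaccard):

```python
import math

# Length-normalized overlap from the slide: |A ∩ B| / sqrt(|A ∪ B|).
def jaccard_sqrt(a, b):
    a, b = set(a), set(b)
    return len(a & b) / math.sqrt(len(a | b))

# One shared term out of a 4-term union:
print(jaccard_sqrt(["x"], ["x", "y", "z", "w"]))  # 0.5 (plain Jaccard: 0.25)
```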
Binary Term Incidence Matrix

• Each document is represented by a binary vector ∈ {0,1}^|V|.

Count Matrix

• Each document can likewise be represented by a count vector, recording how often each term occurs in it.
Bag of Words
• We do not consider the order of words in a document.
• Documents containing the same words in a different order are represented the same way.
• This is called the bag of words model.
• In a sense, this is a step back: the positional index was able to distinguish such documents.
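The bag-of-words representation can be sketched with Python's `collections.Counter`; the two example sentences below are illustrative (the slide's own example is not preserved in this text):

```python
from collections import Counter

# Bag of words: word order is discarded, only per-term counts remain,
# so two documents with the same words in different order are identical.
def bag_of_words(text):
    return Counter(text.lower().split())

d1 = "John is quicker than Mary"
d2 = "Mary is quicker than John"
print(bag_of_words(d1) == bag_of_words(d2))  # True
```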
Term frequency tf
• The term frequency tf_{t,d} of term t in document d is defined as the number of times that t occurs in d.
• We want to use tf when computing query–document match scores. But how?
• A document with 10 occurrences of the term is more relevant than a document with 1 occurrence of the term.
• But it is not 10 times more relevant: relevance does not increase proportionally with term frequency.
• So raw term frequency is not what we want.
Instead of raw frequency: Log frequency weighting
Desired weight for rare terms
Document frequency
idf weight
idf weight: Example
Effect of idf on ranking
Collection frequency vs. Document frequency
tf-idf weighting

• Combine the term frequency and


document frequency to produce a
composite weight for each term in
each document
Binary → Count → Weight Matrix
tf-idf weighting

• Increases with the number of occurrences in the document.

• Increases with the rarity of the term in the collection.

• Lowest when the term occurs in virtually all docs


Score of a Document

• Score(q, d) = Σ_{t ∈ q ∩ d} tf-idf_{t,d}
• The score of document d is the sum, over all query terms that occur in d, of the tf-idf weight of each of those terms in d.
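A minimal sketch of this score, assuming the log-tf × idf weighting scheme from the earlier slides; the collection statistics below (N, df, term counts) are hypothetical, for illustration only:

```python
import math

def tf_idf(tf, df, N):
    # tf-idf weight = (1 + log10(tf)) × log10(N / df); 0 if the term is absent.
    w_tf = 1 + math.log10(tf) if tf > 0 else 0.0
    return w_tf * math.log10(N / df)

def score(query_terms, doc_tf, df, N):
    # Score(q, d): sum of tf-idf weights over query terms that occur in d.
    return sum(tf_idf(doc_tf[t], df[t], N)
               for t in query_terms if t in doc_tf)

N  = 1000                           # documents in the (toy) collection
df = {"caesar": 10, "march": 100}   # document frequencies
doc_tf = {"caesar": 2, "march": 1}  # term counts in document d
print(score(["caesar", "march"], doc_tf, df, N))
```

Terms absent from the document contribute 0, so summing over q ∩ d and summing over all of q with a zero weight for missing terms give the same result.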
