Lecture 5-6
Indexes
• Indexes are data structures designed to make
search faster
• Text search has unique requirements, which
leads to unique data structures
• Most common data structure is inverted index
– general name for a class of structures
– “inverted” because documents are associated
with words, rather than words with documents
• similar to a concordance
Indexes and Ranking
• Indexes are designed to support search
– faster response time, supports updates
• Text search engines use a particular form of
search: ranking
– documents are retrieved in sorted order according to
a score computed using the document
representation, the query, and a ranking algorithm
• What is a reasonable abstract model for ranking?
– enables discussion of indexes without details of
retrieval model
Abstract Model of Ranking
Inverted Index
• Each index term is associated with an inverted
list
– Contains lists of documents, or lists of word
occurrences in documents, and other information
– Each entry is called a posting
– The part of the posting that refers to a specific
document or location is called a pointer
– Each document in the collection is given a unique
number
– Lists are usually document-ordered (sorted by
document number)
Example “Collection”
Proximity Matches
• Matching phrases or words within a window
– e.g., “tropical fish”, or “find tropical within
5 words of fish”
• Word positions in inverted lists make these
types of query features efficient
– e.g., using the word positions stored in each posting (see the sketch below)
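A hedged sketch of a proximity operator built on the stored positions, reusing the toy `index` from the earlier sketch (the function name is illustrative):

```python
def within_window(index, term1, term2, k):
    """Doc numbers where term1 and term2 occur within k word positions of each other."""
    post1 = dict(index.get(term1, []))                # doc -> positions of term1
    post2 = dict(index.get(term2, []))                # doc -> positions of term2
    hits = []
    for doc in post1.keys() & post2.keys():           # documents containing both terms
        if any(abs(p1 - p2) <= k for p1 in post1[doc] for p2 in post2[doc]):
            hits.append(doc)
    return sorted(hits)

# "find tropical within 5 words of fish"
within_window(index, "tropical", "fish", 5)           # -> [1]
```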
Fields and Extents
• Document structure is useful in search
– field restrictions
• e.g., date, from:, etc.
– some fields more important
• e.g., title
• Options:
– separate inverted lists for each field type
– add information about fields to postings
– use extent lists
Extent Lists
• An extent is a contiguous region of a
document
– represent extents using word positions
– inverted list records all extents for a given field
type
– e.g., an extent list for the “title” field (a sketch follows below)
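As an illustrative sketch only (the field name, positions, and helper are hypothetical), an extent list can restrict matches from the earlier toy index to a field:

```python
# Hypothetical extent list for a "title" field: per document, (start, end) word
# positions, end exclusive.
title_extents = {1: [(0, 2)], 2: [(0, 1)]}

def in_field(doc, position, extents):
    """True if a word occurrence at `position` falls inside any extent of the field."""
    return any(start <= position < end for start, end in extents.get(doc, []))

# Restrict the postings of "fish" (from the earlier toy index) to title-field matches.
title_hits = [(doc, [p for p in pos_list if in_field(doc, p, title_extents)])
              for doc, pos_list in index["fish"]]
# -> [(1, [1]), (2, [0])]
```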
Other Issues
• Precomputed scores in inverted list
– e.g., list for “fish” [(1:3.6), (3:2.2)], where 3.6 is
total feature value for document 1
– improves speed but reduces flexibility
• Score-ordered lists
– query processing engine can focus only on the top
part of each inverted list, where the highest-
scoring documents are recorded
– very efficient for single-word queries
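A minimal sketch of both ideas, using the slide’s example scores for “fish” (the variable and function names are ours):

```python
# Precomputed-score list for "fish", following the slide's example:
# (document number, total feature value) pairs.
fish_scores = [(1, 3.6), (3, 2.2)]

# Score-ordered variant: sort once at index time so a single-word query only
# needs to read the head of the list.
fish_by_score = sorted(fish_scores, key=lambda posting: posting[1], reverse=True)

def top_k(score_ordered_list, k):
    return score_ordered_list[:k]     # list already starts with the highest scores

top_k(fish_by_score, 1)               # -> [(1, 3.6)]
```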
Compression
• Inverted lists are very large
– e.g., 25-50% of the collection size for TREC collections
using the Indri search engine
– Much higher if n-grams are indexed
• Compression of indexes saves disk and/or memory
space
– Typically have to decompress lists to use them
– Best compression techniques have good compression
ratios and are easy to decompress
• Lossless compression – no information lost
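As an illustrative sketch of two common lossless techniques, delta (gap) encoding of document numbers combined with a variable-byte code, not tied to any particular engine:

```python
def delta_encode(doc_nums):
    """Replace absolute document numbers with gaps, which are smaller and compress better."""
    return [doc_nums[0]] + [b - a for a, b in zip(doc_nums, doc_nums[1:])]

def vbyte_encode(n):
    """Variable-byte code: 7 data bits per byte; the high bit marks the final byte."""
    out = []
    while n >= 128:
        out.append(n % 128)
        n //= 128
    out.append(n + 128)
    return out                         # least-significant 7-bit group first

def vbyte_decode(byte_stream):
    numbers, n, shift = [], 0, 0
    for b in byte_stream:
        if b < 128:                    # continuation byte
            n += b << shift
            shift += 7
        else:                          # final byte of this number
            numbers.append(n + ((b - 128) << shift))
            n, shift = 0, 0
    return numbers

doc_nums = [33, 37, 254, 1024]
gaps = delta_encode(doc_nums)          # [33, 4, 217, 770]
encoded = [byte for g in gaps for byte in vbyte_encode(g)]
decoded_gaps = vbyte_decode(encoded)   # [33, 4, 217, 770]
```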
Distributed Indexing
• Distributed processing driven by need to index
and analyze huge amounts of data (i.e., the
Web)
• Large numbers of inexpensive servers used
rather than larger, more expensive machines
• MapReduce is a distributed programming tool
designed for indexing and analysis tasks
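A toy map/reduce pair for building an inverted list per term; the driver simulates the shuffle phase in memory and is not a real MapReduce framework (all names here are illustrative):

```python
from collections import defaultdict

def map_fn(doc_num, text):
    """Emit (term, doc_num) pairs for one document."""
    for term in text.split():
        yield term, doc_num

def reduce_fn(term, doc_nums):
    """Merge all document numbers for a term into one sorted inverted list."""
    return term, sorted(set(doc_nums))

def run_indexing_job(docs):
    grouped = defaultdict(list)                       # simulated shuffle/group-by-key
    for doc_num, text in docs.items():
        for term, d in map_fn(doc_num, text):
            grouped[term].append(d)
    return dict(reduce_fn(term, ds) for term, ds in grouped.items())

run_indexing_job({1: "tropical fish", 2: "fish tank"})
# -> {'tropical': [1], 'fish': [1, 2], 'tank': [2]}
```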
Caching
• Query distributions similar to Zipf
– About half of the queries each day are unique, but some
are very popular
• Caching can significantly improve efficiency
– Cache popular query results
– Cache common inverted lists
• Inverted list caching can help with unique
queries
• Cache must be refreshed to prevent stale data
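A minimal sketch of a result cache, assuming a hypothetical `evaluate_query` backend (the cache size and names are illustrative, not a prescribed design):

```python
from functools import lru_cache

def evaluate_query(query):
    """Stand-in for the real query-processing engine (hypothetical)."""
    return ("doc:" + query,)                # placeholder result list

@lru_cache(maxsize=100_000)                 # popular (head-of-the-distribution) queries stay cached
def cached_search(query):
    return evaluate_query(query)

# Unique (tail) queries miss this result cache; a separate cache of frequently
# used inverted lists can still speed them up. Both caches need periodic
# refreshing so results do not go stale as the index is updated.
cached_search("tropical fish")              # computed once
cached_search("tropical fish")              # served from the cache
```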
Performance Evaluation
of Information Retrieval Systems
Why System Evaluation?
• There are many retrieval models, algorithms, and
systems; which one is the best?
• What is the best component for:
– Ranking function (dot-product, cosine, …)
– Term selection (stopword removal, stemming…)
– Term weighting
• How far down the ranked list will a user need
to look to find some/all relevant documents?
Difficulties in Evaluating IR Systems
• Effectiveness is related to the relevancy of retrieved
items.
• Relevancy is not typically binary but continuous.
• Even if relevancy is binary, it can be a difficult
judgment to make.
• Relevancy, from a human standpoint, is:
– Subjective: Depends upon a specific user’s judgment.
– Situational: Relates to user’s current needs.
– Cognitive: Depends on human perception and behavior.
– Dynamic: Changes over time.
Precision and Recall
[Diagram: the entire document collection divided into retrieved vs. not retrieved and relevant vs. irrelevant documents; the retrieved set contains both relevant and irrelevant documents, and some relevant documents are not retrieved.]
Precision and Recall
• Precision
– The ability to retrieve top-ranked documents that are mostly relevant.
• Recall
– The ability of the search to find all of the relevant items in the corpus.
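In set terms, precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved. A short sketch (doc id 999 stands in for a relevant document that is never retrieved and is hypothetical):

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for a single query."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Top 4 documents of Example 1 below; 999 is a hypothetical id for the sixth
# relevant document, which never appears in the ranking.
retrieved = {588, 589, 576, 590}
relevant = {588, 589, 590, 592, 772, 999}
precision_recall(retrieved, relevant)   # -> (0.75, 0.5)
```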
Determining Recall is Difficult
• Total number of relevant items is sometimes
not available:
– Sample across the database and perform
relevance judgment on these items.
– Apply different retrieval algorithms to the same
database for the same query. The aggregate of
relevant items is taken as the total relevant set.
Trade-off between Recall and Precision
[Plot: precision (y-axis, 0 to 1) versus recall (x-axis, 0 to 1). The ideal is the upper-right corner. One extreme returns only relevant documents but misses many useful ones (high precision, low recall); the other returns most relevant documents but includes lots of junk (high recall, low precision).]
Computing Recall/Precision Points
• For a given query, produce the ranked list of
retrievals.
• Adjusting a threshold on this ranked list produces
different sets of retrieved documents, and therefore
different recall/precision measures.
• Mark each document in the ranked list that is
relevant according to the gold standard.
• Compute a recall/precision pair for each position in
the ranked list that contains a relevant document.
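A short sketch of this procedure (the function name is ours), applied to the ranked list and relevant documents of Example 1 below:

```python
def recall_precision_points(ranked, relevant, total_relevant):
    """Recall/precision pair at each rank where a relevant document appears."""
    points, hits = [], 0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / total_relevant, hits / rank))
    return points

# Example 1's ranking; the sixth relevant document never appears in it, so
# recall never reaches 1.0.
ranked_1 = [588, 589, 576, 590, 986, 592, 984, 988, 578, 985, 103, 591, 772, 990]
relevant_1 = {588, 589, 590, 592, 772}
recall_precision_points(ranked_1, relevant_1, total_relevant=6)
# -> approximately [(0.167, 1.0), (0.333, 1.0), (0.5, 0.75), (0.667, 0.667), (0.833, 0.385)]
```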
Computing Recall/Precision Points:
Example 1
Let total # of relevant docs = 6. Check each new recall point:

 n   doc #   relevant   recall/precision at this rank
 1   588     x          R = 1/6 = 0.167; P = 1/1 = 1
 2   589     x          R = 2/6 = 0.333; P = 2/2 = 1
 3   576
 4   590     x          R = 3/6 = 0.5;   P = 3/4 = 0.75
 5   986
 6   592     x          R = 4/6 = 0.667; P = 4/6 = 0.667
 7   984
 8   988
 9   578
10   985
11   103
12   591
13   772     x          R = 5/6 = 0.833; P = 5/13 = 0.38
14   990

One relevant document is missing from the ranking, so 100% recall is never reached.
Computing Recall/Precision Points:
Example 2
Let total # of relevant docs = 6. Check each new recall point:

 n   doc #   relevant   recall/precision at this rank
 1   588     x          R = 1/6 = 0.167; P = 1/1 = 1
 2   576
 3   589     x          R = 2/6 = 0.333; P = 2/3 = 0.667
 4   342
 5   590     x          R = 3/6 = 0.5;   P = 3/5 = 0.6
 6   717
 7   984
 8   772     x          R = 4/6 = 0.667; P = 4/8 = 0.5
 9   321     x          R = 5/6 = 0.833; P = 5/9 = 0.556
10   498
11   113
12   628
13   772
14   592     x          R = 6/6 = 1.0;   P = 6/14 = 0.429
Average Recall/Precision Curve
• Typically average performance over a large set
of queries.
• Compute average precision at each standard
recall level across all queries.
• Plot average precision/recall curves to
evaluate overall system performance on a
document/query corpus.
Compare Two or More Systems
• The curve closest to the upper right-hand
corner of the graph indicates the best
performance
[Plot: average precision-recall curves for two systems, “NoStem” and “Stem”; precision on the y-axis (0 to 1), recall on the x-axis (0.1 to 1).]
F-Measure
• One measure of performance that takes into
account both recall and precision.
• Harmonic mean of recall and precision:
F = 2PR / (P + R) = 2 / (1/R + 1/P)
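For example, at the final point of Example 2 above, P = 6/14 ≈ 0.429 and R = 1.0, so F = 2(0.429)(1.0) / (0.429 + 1.0) ≈ 0.60.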
Mean Average Precision
(MAP)
• Average Precision: Average of the precision
values at the points at which each relevant
document is retrieved.
– Ex1: (1 + 1 + 0.75 + 0.667 + 0.38 + 0)/6 = 0.633
– Ex2: (1 + 0.667 + 0.6 + 0.5 + 0.556 + 0.429)/6 = 0.625
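A short sketch (the function name and the choice to count each relevant document only the first time it appears are ours) that reproduces both averages from the ranked lists in Examples 1 and 2:

```python
def average_precision(ranked, relevant, total_relevant):
    """Average of the precision values at the ranks where relevant documents are
    retrieved; relevant documents that are never retrieved contribute 0."""
    hits, precisions, seen = 0, [], set()
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant and doc not in seen:       # count each relevant doc once
            seen.add(doc)
            hits += 1
            precisions.append(hits / rank)
    precisions += [0.0] * (total_relevant - hits)     # missed relevant documents
    return sum(precisions) / total_relevant

ranked_1 = [588, 589, 576, 590, 986, 592, 984, 988, 578, 985, 103, 591, 772, 990]
ranked_2 = [588, 576, 589, 342, 590, 717, 984, 772, 321, 498, 113, 628, 772, 592]
average_precision(ranked_1, {588, 589, 590, 592, 772}, 6)        # -> ~0.633 (Ex1)
average_precision(ranked_2, {588, 589, 590, 772, 321, 592}, 6)   # -> ~0.625 (Ex2)
```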
Other Factors to Consider
• User effort: Work required from the user in
formulating queries, conducting the search, and
screening the output.
• Response time: Time interval between receipt of a
user query and the presentation of system responses.
• Form of presentation: Influence of search output
format on the user’s ability to utilize the retrieved
materials.
• Collection coverage: Extent to which any/all relevant
items are included in the document corpus.
Experimental Setup for Benchmarking
• Analytical performance evaluation is difficult for
document retrieval systems because many
characteristics such as relevance, distribution of
words, etc., are difficult to describe with
mathematical precision.
• Performance is measured by benchmarking. That is,
the retrieval effectiveness of a system is evaluated on
a given set of documents, queries, and relevance
judgments.
• Performance data is valid only for the environment
under which the system is evaluated.