
Search Engine Indexes

Lecture 5-6
Indexes
• Indexes are data structures designed to make
search faster
• Text search has unique requirements, which
lead to unique data structures
• Most common data structure is inverted index
– general name for a class of structures
– “inverted” because documents are associated
with words, rather than words with documents
• similar to a concordance
Indexes and Ranking
• Indexes are designed to support search
– faster response time, supports updates
• Text search engines use a particular form of
search: ranking
– documents are retrieved in sorted order according to
a score computed using the document
representation, the query, and a ranking algorithm
• What is a reasonable abstract model for ranking?
– enables discussion of indexes without details of
retrieval model
Abstract Model of Ranking
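The slide's figure is not reproduced here. As a rough sketch of one common abstract model (an illustration, not the lecture's own figure), a ranking function combines document feature values with query-dependent weights into a single score, and documents are returned in decreasing score order:

```python
def score(query_features, doc_features):
    """Score = sum over features of query weight g_i(Q) times document value f_i(D)."""
    return sum(weight * doc_features.get(name, 0.0)
               for name, weight in query_features.items())

# Hypothetical feature values for the query "tropical fish" and two documents.
query_features = {"tropical": 1.0, "fish": 1.2}
docs = {
    "D1": {"tropical": 3.0, "fish": 2.0},
    "D2": {"fish": 1.0},
}

# Documents are returned in decreasing score order.
ranking = sorted(docs, key=lambda d: score(query_features, docs[d]), reverse=True)
print(ranking)  # ['D1', 'D2']
```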
Inverted Index
• Each index term is associated with an inverted
list
– Contains lists of documents, or lists of word
occurrences in documents, and other information
– Each entry is called a posting
– The part of the posting that refers to a specific
document or location is called a pointer
– Each document in the collection is given a unique
number
– Lists are usually document-ordered (sorted by
document number)
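A minimal sketch of such a structure (illustrative Python, not code from the lecture): each term maps to a document-ordered list of postings, where each posting holds a document number and a within-document count:

```python
from collections import defaultdict

def build_index(docs):
    """Inverted index: term -> document-ordered list of postings (doc_number, term_count)."""
    index = defaultdict(list)
    for doc_number, text in enumerate(docs, start=1):   # each document gets a unique number
        counts = defaultdict(int)
        for word in text.lower().split():
            counts[word] += 1
        for word, count in counts.items():
            index[word].append((doc_number, count))      # appended in document order
    return index

docs = ["tropical fish include fish found in tropical environments",
        "fish tanks for tropical fish"]
index = build_index(docs)
print(index["fish"])      # [(1, 2), (2, 2)]
print(index["tropical"])  # [(1, 2), (2, 1)]
```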
Example “Collection”
Proximity Matches
• Matching phrases or words within a window
– e.g., "tropical fish", or “find tropical within
5 words of fish”
• Word positions in inverted lists make these
types of query features efficient
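A sketch of how stored positions make these features efficient (illustrative only; the window semantics here are an assumption): postings keep word positions, and a phrase or proximity match just compares the positions of the two terms within each document:

```python
from collections import defaultdict

def positional_index(docs):
    """term -> {doc_number: [word positions]}"""
    index = defaultdict(lambda: defaultdict(list))
    for doc_number, text in enumerate(docs, start=1):
        for position, word in enumerate(text.lower().split()):
            index[word][doc_number].append(position)
    return index

def within_window(index, w1, w2, k):
    """Documents where w2 appears at most k words after w1 (k=1 is the phrase "w1 w2")."""
    hits = set()
    for doc in index[w1].keys() & index[w2].keys():
        if any(0 < p2 - p1 <= k for p1 in index[w1][doc] for p2 in index[w2][doc]):
            hits.add(doc)
    return hits

docs = ["tropical fish are found in tropical waters",
        "tropical aquariums often contain fish"]
idx = positional_index(docs)
print(within_window(idx, "tropical", "fish", 1))  # {1}: the phrase "tropical fish"
print(within_window(idx, "tropical", "fish", 5))  # {1, 2}: "fish" within 5 words of "tropical"
```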
Fields and Extents
• Document structure is useful in search
– field restrictions
• e.g., date, from:, etc.
– some fields more important
• e.g., title
• Options:
– separate inverted lists for each field type
– add information about fields to postings
– use extent lists
Extent Lists
• An extent is a contiguous region of a
document
– represent extents using word positions
– inverted list records all extents for a given field
type
– e.g., an extent list for the title field, recording the
(start, end) word positions of each title
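As a small illustration (assumed representation, not the lecture's exact layout), an extent can be stored as a (start, end) pair of word positions, and a term occurrence matches the field if its position falls inside some extent:

```python
# Illustrative data: word positions of "fish" in one document, and the extents
# (start, end word positions, end exclusive) of that document's title field.
fish_positions = [2, 7, 15]
title_extents = [(0, 4)]   # the title is words 0..3

def positions_in_field(positions, extents):
    """Term positions that fall inside any extent of the field."""
    return [p for p in positions if any(start <= p < end for start, end in extents)]

print(positions_in_field(fish_positions, title_extents))  # [2] -> "fish" occurs in the title
```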
Other Issues
• Precomputed scores in inverted list
– e.g., list for “fish” [(1:3.6), (3:2.2)], where 3.6 is
total feature value for document 1
– improves speed but reduces flexibility
• Score-ordered lists
– query processing engine can focus only on the top
part of each inverted list, where the highest-
scoring documents are recorded
– very efficient for single-word queries
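A toy example of a score-ordered list, extending the "fish" list from the first bullet with two made-up postings and sorting by score: for a single-word query, the best answers are simply a prefix of the list.

```python
# Score-ordered list for "fish": each posting is (doc_number, precomputed_score),
# sorted by score rather than by document number.
fish_list = [(1, 3.6), (3, 2.2), (7, 1.1), (4, 0.3)]

def top_k(score_ordered_list, k):
    """For a single-word query, the k best documents are just the list prefix."""
    return score_ordered_list[:k]

print(top_k(fish_list, 2))  # [(1, 3.6), (3, 2.2)]
```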
Compression
• Inverted lists are very large
– e.g., 25-50% of the collection size for TREC collections
using the Indri search engine
– Much higher if n-grams are indexed
• Compression of indexes saves disk and/or memory
space
– Typically have to decompress lists to use them
– Best compression techniques have good compression
ratios and are easy to decompress
• Lossless compression – no information lost
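One widely used lossless scheme, shown as a sketch (the slides do not commit to a particular method): store document numbers as differences (d-gaps) and compress the gaps with variable-byte coding; decoding recovers the original list exactly.

```python
def vbyte_encode(numbers):
    """Variable-byte encode a list of non-negative integers."""
    out = bytearray()
    for n in numbers:
        chunk = []
        while True:
            chunk.append(n & 0x7F)   # low 7 bits
            n >>= 7
            if n == 0:
                break
        chunk[0] |= 0x80             # high bit marks the final byte of a number
        out.extend(reversed(chunk))
    return bytes(out)

def vbyte_decode(data):
    """Decode back to the original integers (lossless)."""
    numbers, n = [], 0
    for b in data:
        if b & 0x80:
            numbers.append((n << 7) | (b & 0x7F))
            n = 0
        else:
            n = (n << 7) | b
    return numbers

doc_numbers = [33, 37, 180, 182, 2000]
gaps = [doc_numbers[0]] + [b - a for a, b in zip(doc_numbers, doc_numbers[1:])]
decoded = vbyte_decode(vbyte_encode(gaps))
restored = [sum(decoded[:i + 1]) for i in range(len(decoded))]   # undo the d-gaps
print(len(vbyte_encode(gaps)), restored)  # 7 bytes, [33, 37, 180, 182, 2000]
```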
Distributed Indexing
• Distributed processing driven by need to index
and analyze huge amounts of data (i.e., the
Web)
• Large numbers of inexpensive servers used
rather than larger, more expensive machines
• MapReduce is a distributed programming tool
designed for indexing and analysis tasks
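A toy, single-process sketch of how MapReduce-style indexing works (not real MapReduce code): mappers emit (term, document) pairs, the pairs are grouped by term, and reducers build each term's sorted inverted list:

```python
from collections import defaultdict

def map_phase(doc_number, text):
    """Mapper: emit (term, doc_number) pairs for one document."""
    return [(word, doc_number) for word in set(text.lower().split())]

def reduce_phase(term, doc_numbers):
    """Reducer: build the document-ordered inverted list for one term."""
    return term, sorted(doc_numbers)

docs = {1: "tropical fish and tropical plants", 2: "fish tanks"}

# The shuffle step groups mapper output by key; a real framework does this
# across many machines, here it is a simple dictionary.
grouped = defaultdict(list)
for doc_number, text in docs.items():
    for term, d in map_phase(doc_number, text):
        grouped[term].append(d)

index = dict(reduce_phase(term, ds) for term, ds in grouped.items())
print(index["fish"])  # [1, 2]
```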
Caching
• Query distributions similar to Zipf
– About half of each day's queries are unique, but
some are very popular
• Caching can significantly improve efficiency
– Cache popular query results
– Cache common inverted lists
• Inverted list caching can help with unique
queries
• Cache must be refreshed to prevent stale data
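A minimal sketch of a query-result cache (hypothetical class and names; a production cache would also refresh stale entries and manage memory more carefully):

```python
from collections import OrderedDict

class QueryCache:
    """Tiny LRU cache for query results (illustrative; a real cache would also
    be refreshed periodically so stale results are not returned)."""

    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, query, run_query):
        if query in self.entries:
            self.entries.move_to_end(query)      # mark as recently used
            return self.entries[query]
        results = run_query(query)               # cache miss: process the query
        self.entries[query] = results
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)     # evict least recently used entry
        return results

cache = QueryCache(capacity=2)
run = lambda q: f"results for {q!r}"             # stand-in for real query processing
print(cache.get("tropical fish", run))           # miss: computed
print(cache.get("tropical fish", run))           # hit: served from the cache
```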
Performance Evaluation
of Information Retrieval Systems

Why System Evaluation?
• There are many retrieval models, algorithms, and
systems; which one is the best?
• What is the best component for:
– Ranking function (dot-product, cosine, …)
– Term selection (stopword removal, stemming…)
– Term weighting
• How far down the ranked list will a user need
to look to find some/all relevant documents?

Difficulties in Evaluating IR Systems
• Effectiveness is related to the relevancy of retrieved
items.
• Relevancy is not typically binary but continuous.
• Even if relevancy is binary, it can be a difficult
judgment to make.
• Relevancy, from a human standpoint, is:
– Subjective: Depends upon a specific user’s judgment.
– Situational: Relates to user’s current needs.
– Cognitive: Depends on human perception and behavior.
– Dynamic: Changes over time.

Precision and Recall

[Diagram: the entire document collection divided into retrieved vs. not retrieved and relevant vs. irrelevant documents; the retrieved-and-relevant overlap is what both measures count.]

recall = Number of relevant documents retrieved / Total number of relevant documents

precision = Number of relevant documents retrieved / Total number of documents retrieved

Precision and Recall
• Precision
– The ability to retrieve top-ranked documents that are mostly relevant.
• Recall
– The ability of the search to find all of the relevant items in the corpus.

recall = Number of relevant documents retrieved / Total number of relevant documents

precision = Number of relevant documents retrieved / Total number of documents retrieved
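The same definitions as a small computation, using made-up document ids:

```python
# Hypothetical document ids: the system retrieved 6 documents, 4 of which are
# among the 10 relevant documents in the collection.
retrieved = {1, 2, 3, 4, 5, 6}
relevant = {1, 2, 3, 4, 11, 12, 13, 14, 15, 16}

relevant_retrieved = retrieved & relevant
recall = len(relevant_retrieved) / len(relevant)      # 4 / 10 = 0.4
precision = len(relevant_retrieved) / len(retrieved)  # 4 / 6  = 0.667
print(round(recall, 3), round(precision, 3))
```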

Determining Recall is Difficult
• Total number of relevant items is sometimes
not available:
– Sample across the database and perform
relevance judgment on these items.
– Apply different retrieval algorithms to the same
database for the same query. The aggregate of
relevant items is taken as the total relevant set.

Trade-off between Recall and Precision
[Plot: precision (y-axis, 0 to 1) vs. recall (x-axis, 0 to 1). The ideal system sits at the top-right corner. High precision with low recall returns relevant documents but misses many useful ones; high recall with low precision returns most relevant documents but includes lots of junk.]
Computing Recall/Precision Points
• For a given query, produce the ranked list of
retrievals.
• Adjusting a threshold on this ranked list produces
different sets of retrieved documents, and therefore
different recall/precision measures.
• Mark each document in the ranked list that is
relevant according to the gold standard.
• Compute a recall/precision pair for each position in
the ranked list that contains a relevant document.

Computing Recall/Precision Points:
Example 1

Let total # of relevant docs = 6. Check each new recall point:

 n   doc #   relevant   recall/precision at this rank
 1   588     x          R=1/6=0.167; P=1/1=1.0
 2   589     x          R=2/6=0.333; P=2/2=1.0
 3   576
 4   590     x          R=3/6=0.5;   P=3/4=0.75
 5   986
 6   592     x          R=4/6=0.667; P=4/6=0.667
 7   984
 8   988
 9   578
10   985
11   103
12   591
13   772     x          R=5/6=0.833; P=5/13=0.38
14   990

One relevant document is never retrieved, so 100% recall is never reached.
Computing Recall/Precision Points:
Example 2

Let total # of relevant docs = 6. Check each new recall point:

 n   doc #   relevant   recall/precision at this rank
 1   588     x          R=1/6=0.167; P=1/1=1.0
 2   576
 3   589     x          R=2/6=0.333; P=2/3=0.667
 4   342
 5   590     x          R=3/6=0.5;   P=3/5=0.6
 6   717
 7   984
 8   772     x          R=4/6=0.667; P=4/8=0.5
 9   321     x          R=5/6=0.833; P=5/9=0.556
10   498
11   113
12   628
13   772
14   592     x          R=6/6=1.0;   P=6/14=0.429

All six relevant documents are retrieved, so 100% recall is reached at rank 14.
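Both examples can be reproduced with a short function; the relevance flags below are taken from Example 2 (a minimal sketch, not official evaluation code):

```python
def recall_precision_points(relevance_flags, total_relevant):
    """Recall/precision pair at each rank where a relevant document appears."""
    points, hits = [], 0
    for rank, is_relevant in enumerate(relevance_flags, start=1):
        if is_relevant:
            hits += 1
            points.append((hits / total_relevant, hits / rank))
    return points

# Example 2 ranking: 1 marks a relevant document (6 relevant documents in total).
example2 = [1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1]
for r, p in recall_precision_points(example2, total_relevant=6):
    print(f"R={r:.3f}  P={p:.3f}")
# R=0.167 P=1.000, R=0.333 P=0.667, R=0.500 P=0.600,
# R=0.667 P=0.500, R=0.833 P=0.556, R=1.000 P=0.429
```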
Average Recall/Precision Curve
• Typically average performance over a large set
of queries.
• Compute average precision at each standard
recall level across all queries.
• Plot average precision/recall curves to
evaluate overall system performance on a
document/query corpus.

Compare Two or More Systems
• The curve closest to the upper right-hand
corner of the graph indicates the best
performance
[Plot: average precision-recall curves for two systems, Stem and NoStem; precision on the y-axis (0 to 1), recall on the x-axis (0.1 to 1.0).]

F-Measure
• One measure of performance that takes into
account both recall and precision.
• Harmonic mean of recall and precision:
F = 2PR / (P + R) = 2 / (1/R + 1/P)

• Compared to the arithmetic mean, both precision and
recall need to be high for the harmonic mean to be high.
E Measure (parameterized F Measure)
• A variant of the F measure that allows the relative
emphasis on precision versus recall to be weighted:

E = (1 + β²)PR / (β²P + R) = (1 + β²) / (β²/R + 1/P)

• Value of β controls the trade-off:
– β = 1: Equally weight precision and recall (E = F).
– β > 1: Weight recall more.
– β < 1: Weight precision more.
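Both measures, as defined on these slides, written directly as code (a small sketch with made-up precision and recall values):

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall: F = 2PR / (P + R)."""
    return 2 * precision * recall / (precision + recall)

def e_measure(precision, recall, beta):
    """Parameterized form: E = (1 + beta^2)PR / (beta^2 * P + R); beta = 1 gives F."""
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

p, r = 0.5, 0.75
print(round(f_measure(p, r), 3))           # 0.6
print(round(e_measure(p, r, beta=1), 3))   # 0.6   (same as F)
print(round(e_measure(p, r, beta=2), 3))   # 0.682 (recall weighted more)
```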

Mean Average Precision
(MAP)
• Average Precision: Average of the precision
values at the points at which each relevant
document is retrieved.
– Ex1: (1 + 1 + 0.75 + 0.667 + 0.38 + 0)/6 = 0.633
– Ex2: (1 + 0.667 + 0.6 + 0.5 + 0.556 + 0.429)/6 = 0.625

• Mean Average Precision: Average of the average
precision values over a set of queries.
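The averages quoted above can be checked with a short function (relevance flags taken from Examples 1 and 2; the relevant document that Example 1 never retrieves contributes a precision of 0):

```python
def average_precision(relevance_flags, total_relevant):
    """Mean of the precision values at each relevant document; relevant
    documents that are never retrieved contribute a precision of 0."""
    hits, precisions = 0, []
    for rank, is_relevant in enumerate(relevance_flags, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / total_relevant

example1 = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]   # 5 of 6 relevant docs retrieved
example2 = [1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1]   # all 6 retrieved
ap1 = average_precision(example1, 6)   # 0.634 (the slide's 0.633 rounds 5/13 to 0.38 first)
ap2 = average_precision(example2, 6)   # 0.625
print(round(ap1, 3), round(ap2, 3))
print(round((ap1 + ap2) / 2, 3))       # MAP over these two queries: 0.629
```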

Other Factors to Consider
• User effort: Work required from the user in
formulating queries, conducting the search, and
screening the output.
• Response time: Time interval between receipt of a
user query and the presentation of system responses.
• Form of presentation: Influence of search output
format on the user’s ability to utilize the retrieved
materials.
• Collection coverage: Extent to which any/all relevant
items are included in the document corpus.

Experimental Setup for Benchmarking
• Analytical performance evaluation is difficult for
document retrieval systems because many
characteristics such as relevance, distribution of
words, etc., are difficult to describe with
mathematical precision.
• Performance is measured by benchmarking. That is,
the retrieval effectiveness of a system is evaluated on
a given set of documents, queries, and relevance
judgments.
• Performance data is valid only for the environment
under which the system is evaluated.

