01 Intro

The document discusses Boolean retrieval for information retrieval systems, including an inverted index data structure and processing Boolean queries. It provides definitions of information retrieval and the Boolean model. The Boolean model treats queries as Boolean expressions and returns all documents that satisfy the expression.

Uploaded by

Rajput Singh
Copyright
© © All Rights Reserved
Information Retrieval Systems

MCAE 0303

Lecture 1: Boolean Retrieval

Take Away

• Boolean Retrieval: Design and data structures of a simple information retrieval system

• What topics will be covered in this class?

Outline

❶ Introduction

❷ Inverted index

❸ Processing Boolean queries

❹ Query optimization

5
Definition of Information Retrieval

Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

Boolean Retrieval

• The Boolean model is arguably the simplest model to base an information retrieval system on.

• Queries are Boolean expressions, e.g., CAESAR AND BRUTUS

• The search engine returns all documents that satisfy the Boolean expression.

Does Google use the Boolean model?

Outline

❶ Introduction

❷ Inverted index

❸ Processing Boolean queries

❹ Query optimization

10
Unstructured data in 1650: Shakespeare

The term “unstructured data” refers to data which does not have clear, semantically overt, easy-for-a-computer structure.

Unstructured data in 1650

• Which plays of Shakespeare contain the words BRUTUS AND CAESAR, but not CALPURNIA?
• One could grep all of Shakespeare’s plays for BRUTUS and CAESAR, then strip out lines containing CALPURNIA.
• Why is grep not the solution?
  • Slow (for large collections)
  • grep is line-oriented, IR is document-oriented
  • “NOT CALPURNIA” is non-trivial
  • Other operations (e.g., find the word ROMANS near COUNTRYMAN) not feasible

Term-Document Incidence Matrix

            Anthony and   Julius   The       Hamlet   Othello   Macbeth
            Cleopatra     Caesar   Tempest
ANTHONY     1             1        0         0        0         1
BRUTUS      1             1        0         1        0         0
CAESAR      1             1        0         1        1         1
CALPURNIA   0             1        0         0        0         0
CLEOPATRA   1             0        0         0        0         0
MERCY       1             0        1         1        1         1
WORSER      1             0        1         1        1         0
...

Entry is 1 if term occurs. Example: CALPURNIA occurs in Julius Caesar.
Entry is 0 if term doesn’t occur. Example: CALPURNIA doesn’t occur in The Tempest.

Incidence Vectors

• So we have a 0/1 vector for each term.
• To answer the query BRUTUS AND CAESAR AND NOT CALPURNIA:
  • Take the vectors for BRUTUS, CAESAR, and CALPURNIA
  • Complement the vector of CALPURNIA
  • Do a (bitwise) AND on the three vectors
  • 110100 AND 110111 AND 101111 = 100100

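The bitwise computation on the incidence vectors can be sketched in a few lines of Python. This is only an illustration: the names `PLAYS`, `vectors`, and `mask` are chosen here, and each vector is stored as an integer whose bits follow the column order of the incidence matrix.

```python
# Column order of the incidence matrix (leftmost bit = first play).
PLAYS = [
    "Anthony and Cleopatra", "Julius Caesar", "The Tempest",
    "Hamlet", "Othello", "Macbeth",
]

# 0/1 incidence vectors for the three query terms, written as binary literals.
vectors = {
    "BRUTUS": 0b110100,
    "CAESAR": 0b110111,
    "CALPURNIA": 0b010000,
}

# Python integers are unbounded, so the complement is masked to 6 bits.
mask = (1 << len(PLAYS)) - 1
not_calpurnia = ~vectors["CALPURNIA"] & mask

# BRUTUS AND CAESAR AND NOT CALPURNIA
result = vectors["BRUTUS"] & vectors["CAESAR"] & not_calpurnia
bits = format(result, "06b")

# Map the 1 bits back to play titles.
matches = [play for play, bit in zip(PLAYS, bits) if bit == "1"]
print(bits, matches)
```

The result vector has 1s only in the columns for Anthony and Cleopatra and Hamlet.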
0/1 vector for BRUTUS

            Anthony and   Julius   The       Hamlet   Othello   Macbeth
            Cleopatra     Caesar   Tempest
ANTHONY     1             1        0         0        0         1
BRUTUS      1             1        0         1        0         0
CAESAR      1             1        0         1        1         1
CALPURNIA   0             1        0         0        0         0
CLEOPATRA   1             0        0         0        0         0
MERCY       1             0        1         1        1         1
WORSER      1             0        1         1        1         0
...
result:     1             0        0         1        0         0

Answers to query

Anthony and Cleopatra, Act III, Scene ii
Agrippa [Aside to Domitius Enobarbus]: Why, Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring; and he wept
When at Philippi he found Brutus slain.

Hamlet, Act III, Scene ii
Lord Polonius: I did enact Julius Caesar: I was killed i’ the Capitol; Brutus killed me.

Sec. 6.2

How do we know whether a search engine is good?


• How do we know if our results are any good?
• Evaluating a search engine
• Benchmarks
• Precision and recall
• Results summaries:
• Making our good results usable to a user

19
Sec. 8.3

Unranked retrieval evaluation: Precision and Recall

• Precision: fraction of retrieved docs that are relevant (P).
• Recall: fraction of relevant docs that are retrieved (R).

                Relevant   Nonrelevant
Retrieved       tp         fp
Not retrieved   fn         tn

• Precision P = tp/(tp + fp)
• Recall R = tp/(tp + fn)

• Recall = Number of pages that were retrieved and relevant / Total number of relevant pages.

• Precision = Number of pages that were retrieved and relevant / Total number of retrieved pages.

• Example: Let us say there exist a total of 5 pages labelled P1, P2, P3, P4 and P5. Let us assume that for the query “weather in Los Angeles”, the pages that are relevant are P3, P4 and P5 (the green pages shown below). So the total number of relevant pages is 3. Let us assume that a search engine returns the pages P2 and P3. So the number of retrieved pages is 2.

• The search engine returns the pages P2 and P3 but only P3 is relevant. So the number of pages that are retrieved and relevant is 1 (only P3).

• So based on the formula,

• Recall = 1 / 3 ≈ 0.33

• Precision = 1 / 2 = 0.5

• Higher values of precision and recall (closer to 1) are better.

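The precision and recall formulas can be checked on the five-page example with a short Python sketch. The function name `precision_recall` and the use of sets are choices made here for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given two sets of page IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)  # retrieved and relevant
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall


# Pages from the "weather in Los Angeles" example.
relevant = {"P3", "P4", "P5"}
retrieved = {"P2", "P3"}

precision, recall = precision_recall(retrieved, relevant)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

Only P3 is both retrieved and relevant, so tp = 1, which gives precision 1/2 and recall 1/3.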
• Now let us think about why we need both precision and recall.

• Suppose we are trying to build our own search engine. In one case, say we design our search engine to return only one page for any query. If that one page is relevant,

• the precision will be Number of Retrieved and Relevant Pages / Number of Retrieved Pages = 1 / 1, which is 100%.

• If there are actually 1000 relevant pages, the recall will be 1 / 1000, which is 0.1%.

• Clearly, this system is not performing well with such a poor recall.

• If we had only precision as an evaluation metric, and not recall, this system would be incorrectly assumed to be performing very well, whereas in reality it isn’t.

Bigger Collections

• Consider N = 10^6 documents, each with about 1000 tokens ⇒ total of 10^9 tokens
• On average 6 bytes per token, including spaces and punctuation ⇒ size of document collection is about 6 × 10^9 bytes = 6 GB
• Assume there are M = 500,000 distinct terms in the collection
• (Notice that we are making a term/token distinction.)

Can’t build the incidence matrix

• M × N = 500,000 × 10^6 = half a trillion 0s and 1s.
• But the matrix has no more than one billion 1s.
  • Matrix is extremely sparse.
• What is a better representation?
  • We only record the 1s.

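The size estimates can be reproduced with a few lines of Python arithmetic. The variable names are chosen here; the figures are the ones stated on the slides.

```python
N = 10**6               # documents in the collection
tokens_per_doc = 1000   # average tokens per document
bytes_per_token = 6     # including spaces and punctuation
M = 500_000             # distinct terms

total_tokens = N * tokens_per_doc                  # 10^9 tokens
collection_bytes = total_tokens * bytes_per_token  # about 6 GB

matrix_cells = M * N      # one 0/1 entry per (term, document) pair
max_ones = total_tokens   # each token sets at most one entry to 1

# Upper bound on the fraction of matrix entries that can be 1.
density = max_ones / matrix_cells
print(matrix_cells, max_ones, density)
```

At most 0.2% of the entries can be 1, which is why recording only the 1s is so much cheaper than storing the full matrix.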
Inverted Index

For each term t, we store a list of all documents that contain t.

dictionary postings
28 28
Inverted index construction

❶ Collect the documents to be indexed.
❷ Tokenize the text, turning each document into a list of tokens.
❸ Do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms.
❹ Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings.

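The four construction steps can be sketched in Python as follows. This is a minimal sketch under simplifying assumptions: tokenization is whitespace splitting, normalization is lowercasing plus stripping surrounding punctuation, and the three-document collection is made up for illustration.

```python
from collections import defaultdict


def tokenize(text):
    # Step 2 (simplified): split on whitespace.
    return text.split()


def normalize(token):
    # Step 3 (simplified): lowercase and strip surrounding punctuation.
    return token.lower().strip(".,;:!?\"'")


def build_inverted_index(docs):
    # Generate (term, docID) pairs for every token occurrence.
    pairs = []
    for doc_id, text in docs.items():
        for token in tokenize(text):
            term = normalize(token)
            if term:
                pairs.append((term, doc_id))

    # Sort by term, then by docID.
    pairs.sort()

    # Step 4: merge duplicates into postings lists sorted by docID.
    postings = defaultdict(list)
    for term, doc_id in pairs:
        if not postings[term] or postings[term][-1] != doc_id:
            postings[term].append(doc_id)

    # The dictionary stores each term with its document frequency.
    dictionary = {term: len(plist) for term, plist in postings.items()}
    return dictionary, dict(postings)


# Step 1: a hypothetical toy collection, keyed by docID.
docs = {
    1: "Brutus killed Caesar.",
    2: "Caesar praised Calpurnia.",
    3: "Brutus and Caesar",
}
dictionary, postings = build_inverted_index(docs)
print(postings["caesar"], dictionary["caesar"])
```

Sorting the pairs before merging is what leaves every postings list ordered by docID, which the merge-style query algorithms rely on.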
Tokenizing and preprocessing

32 32
Generate posting

33 33
Sort postings

34 34
Create postings lists, determine document frequency

35 35
Split the result into dictionary and postings file

dictionary postings
36 36
Questions
• The posting list in an inverted index is sorted by
• A. Term frequency
• B. Document frequency
• C. Term ID
• D. Document ID

37
Questions
• Stemming is a technique used for
• A. Tokenization
• B. Normalization
• C. Document ranking
• D. Case folding

38
Questions
• Why is the dictionary in an inverted index sorted?
• A. because it looks good
• B. it is easy to apply linear search
• C. it is easy to apply binary search

39
Questions
• Normalization helps in:
• A. reducing dictionary size
• B. making search fast
• C. reducing postings list size

40
Draw inverted index
• Draw the inverted index that would be built for the following document collection.

Doc 1 new home sales top forecasts
Doc 2 home sales rise in july
Doc 3 increase in home sales in july
Doc 4 july new home sales rise

Later in this course

• Index construction: how can we create inverted indexes for large collections?
• How much space do we need for dictionary and index?
• Index compression: how can we efficiently store and process indexes for large collections?
• Ranked retrieval: what does the inverted index look like when we want the “best” answer?

Outline

❶ Introduction

❷ Inverted index

❸ Processing Boolean queries

❹ Query optimization

43
Simple conjunctive query (two terms)

• Consider the query: BRUTUS AND CALPURNIA
• To find all matching documents using inverted index:
  ❶ Locate BRUTUS in the dictionary
  ❷ Retrieve its postings list from the postings file
  ❸ Locate CALPURNIA in the dictionary
  ❹ Retrieve its postings list from the postings file
  ❺ Intersect the two postings lists
  ❻ Return intersection to user

Intersecting two posting lists

• This is linear in the length of the postings lists.
• Note: This only works if postings lists are sorted.

answer = <2, 31>
docID(p1) = 1 2 4 11 31 45 173
docID(p2) = 2 31 54 101 NIL
(docIDs visited by each pointer; the walk stops when p2 reaches NIL)

Intersecting two posting lists

Brutus         2 4 8 16 32 64 128
Caesar         1 2 3 5 8 13 21 34
Intersection   2 8

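The two-pointer intersection can be sketched in Python as below. Postings lists are modelled as sorted Python lists of docIDs and the pointers as list indices, which is an assumption of this sketch, not the linked-list form used on the slides.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by docID."""
    answer = []
    i, j = 0, 0
    # Stop as soon as either list is exhausted: no further matches are possible.
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer with the smaller docID
        else:
            j += 1
    return answer


brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))
```

Each step advances at least one pointer, so the running time is linear in the combined length of the two lists.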
Query processing: Exercise

Compute hit list for ((paris AND NOT france) OR lear)

110 110
Boolean queries
• The Boolean retrieval model can answer any query that is a Boolean expression.
  • Boolean queries are queries that use AND, OR and NOT to join query terms.
  • Views each document as a set of terms.
  • Is precise: Document matches condition or not.
• Primary commercial retrieval tool for 3 decades
• Many professional searchers (e.g., lawyers) still like Boolean queries.
  • You know exactly what you are getting.
• Many search systems you use are also Boolean: Spotlight, email, intranet etc.
Commercially successful Boolean retrieval: Westlaw
• Largest commercial legal search service in terms of the number of paying subscribers
• Over half a million subscribers performing millions of searches a day over tens of terabytes of text data
• The service was started in 1975.
• In 2005, Boolean search (called “Terms and Connectors” by Westlaw) was still the default, and used by a large percentage of users . . .
• . . . although ranked retrieval has been available since 1992.

OR two posting lists

Brutus             2 4 8 16 32 64 128
Caesar             1 2 3 5 8 13 21 34
Brutus OR Caesar   1 2 3 4 5 8 13 16 21 32 34 64 128

OR(p1, p2)
answer ← <>
while p1 ≠ NIL and p2 ≠ NIL
do if docID(p1) = docID(p2)
     then ADD(answer, docID(p1))
          p1 ← next(p1)
          p2 ← next(p2)
   else if docID(p1) < docID(p2)
     then ADD(answer, docID(p1))
          p1 ← next(p1)
   else ADD(answer, docID(p2))
        p2 ← next(p2)

The while loop breaks as soon as p2 becomes NIL, but 64 and 128 should also be part of the answer!
OR two posting lists

Brutus   2 4 8 16 32 64 128
Caesar   1 2 3 5 8 13 21 34

Second attempt: change the loop condition from “and” to “or”.

OR(p1, p2)
answer ← <>
while p1 ≠ NIL or p2 ≠ NIL
do if docID(p1) = docID(p2)
     then ADD(answer, docID(p1))
          p1 ← next(p1)
          p2 ← next(p2)
   else if docID(p1) < docID(p2)
     then ADD(answer, docID(p1))
          p1 ← next(p1)
   else ADD(answer, docID(p2))
        p2 ← next(p2)
return answer

Now dry run to check whether this change is enough. It is not: once one pointer is NIL, the comparisons still read its docID, which is undefined.

Brutus   2 4 8 16 32 64 128
Caesar   1 2 3 5 8 13 21 34

OR(p1, p2)
answer ← <>
while p1 ≠ NIL or p2 ≠ NIL
do if p1 = NIL
     then ADD(answer, docID(p2))
          p2 ← next(p2)
   else if p2 = NIL
     then ADD(answer, docID(p1))
          p1 ← next(p1)
   else if docID(p1) = docID(p2)
     then ADD(answer, docID(p1))
          p1 ← next(p1)
          p2 ← next(p2)
   else if docID(p1) < docID(p2)
     then ADD(answer, docID(p1))
          p1 ← next(p1)
   else ADD(answer, docID(p2))
        p2 ← next(p2)
return answer

Now dry run and verify! The NIL checks come first, so docID is only ever read from a pointer that is not NIL.

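The OR merge of two sorted postings lists can also be written in Python. This sketch is a variant rather than a line-by-line translation of the slide pseudocode: it keeps the simple "both lists non-empty" loop and appends the leftover tail of whichever list remains. Postings lists are modelled as sorted Python lists.

```python
def union(p1, p2):
    """Merge two postings lists sorted by docID, keeping each docID once."""
    answer = []
    i, j = 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])  # docID in both lists: add once
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            answer.append(p1[i])
            i += 1
        else:
            answer.append(p2[j])
            j += 1
    # At most one of these tails is non-empty; it is already sorted.
    answer.extend(p1[i:])
    answer.extend(p2[j:])
    return answer


brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(union(brutus, caesar))
```

Handling the tails after the loop means docIDs 64 and 128 are not lost when the Caesar list runs out first.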
What is the disadvantage of Boolean retrieval
model?

• a) Easy to implement

• b) Difficult to rank output

• c) Difficult to process a query

• d) It is one of the complex retrieval models

120
An inverted index is a database index that ____.

• a) stores, for each term t, the list of all documents that contain term t

• b) stores mapping from documents to words

• c) orders the terms in a different order which is not a sequential order.

• d) All of the above

Boolean queries often result in:
• A. Too many or too few results
• B. None of the above.
• C. Too few results
• D. Too many results.

124
Term-document incidence matrix is:
• A. Sparse
• B. Depends upon the data
• C. Dense
• D. Cannot predict

126
Postings list should be sorted by:
• A. Document Frequency
• B. DocID
• C. TermID
• D. Term frequency

127
A large repository of documents in IR is called a:
• A. Corpus
• B. Database
• C. Dictionary
• D. Collection

129
Basic Terminologies

131
NOT Brutus

[Figure: NOT Brutus is answered against the whole document collection (docIDs 1, 2, ..., 10).]

A better idea to build a term-document matrix
is ______ where we record only the things that
do occur and their links
• A. Incidence matrix.
• B. Adjacency matrix.
• C. index
• D. Inverted index

133
Calculate the posting list for PARIS AND LEAR

• A. 15
• B. 12
• C. 6
• D. 10

Calculate the posting list for PARIS OR LEAR

• A. 2, 6, 10, 12, 14, 15
• B. 12, 15
• C. 2, 6, 10, 12, 14
• D. 12

Calculate the posting list for PARIS AND NOT LEAR

• A. 2, 6, 10, 12, 14, 15
• B. 12, 15
• C. 2, 6, 10, 12, 14
• D. 2, 6, 10, 14

And Not two posting lists

Brutus   2 4 8 16 32 64 128
Caesar   1 2 3 5 8 13 21 34

AndNot(p1, p2)
answer ← <>
while p1 ≠ NIL
do if p2 = NIL or docID(p1) < docID(p2)
     then ADD(answer, docID(p1))
          p1 ← next(p1)
   else if docID(p1) = docID(p2)
     then p1 ← next(p1)
          p2 ← next(p2)
   else p2 ← next(p2)
return answer

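An AND NOT merge can be sketched in Python in the same style. As an assumption of this sketch, postings lists are sorted Python lists, and the "p2 exhausted" case is tested before any docID of p2 is read.

```python
def and_not(p1, p2):
    """Return docIDs that occur in p1 but not in p2 (both sorted by docID)."""
    answer = []
    i, j = 0, 0
    while i < len(p1):
        # Check for exhaustion of p2 first so p2[j] is never read out of range.
        if j == len(p2) or p1[i] < p2[j]:
            answer.append(p1[i])  # p1[i] cannot occur later in p2
            i += 1
        elif p1[i] == p2[j]:
            i += 1  # docID is in both lists: exclude it
            j += 1
        else:
            j += 1  # p2 is behind: catch up
    return answer


brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(and_not(brutus, caesar))
```

The loop runs until p1 is exhausted, because every remaining docID in p1 belongs in the answer once p2 has run out.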
Westlaw: Example queries

Information need: Information on the legal theories involved in preventing the disclosure of trade secrets by employees formerly employed by a competing company
Query: “trade secret” /s disclos! /s prevent /s employe!

Information need: Requirements for disabled people to be able to access a workplace
Query: disab! /p access! /s work-site work-place (employment /3 place)

Information need: Cases about a host’s responsibility for drunk guests
Query: host! /p (responsib! liab!) /p (intoxicat! drunk!) /p guest

Westlaw: Comments
• Proximity operators: /3 = within 3 words, /s = within a sentence, /p = within a paragraph
• Space is disjunction, not conjunction! (This was the default in search pre-Google.)
• Long, precise queries: incrementally developed, not like web search
• Why professional searchers often like Boolean search: precision, transparency, control
• When are Boolean queries the best way of searching? Depends on: information need, searcher, document collection, . . .

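To make the idea of a proximity operator such as /3 concrete, here is a small Python sketch. It is not Westlaw's implementation: the helper `within_k_words` is hypothetical, and it assumes "within k words" means the token positions differ by at most k, in either order.

```python
def within_k_words(tokens, first, second, k):
    """True if some occurrence of `first` is at most k positions from `second`."""
    positions_first = [i for i, tok in enumerate(tokens) if tok == first]
    positions_second = [i for i, tok in enumerate(tokens) if tok == second]
    return any(
        abs(i - j) <= k
        for i in positions_first
        for j in positions_second
    )


# A made-up sentence, tokenized on whitespace.
tokens = "the place of employment was accessible".split()

near = within_k_words(tokens, "employment", "place", 3)  # positions 3 and 1
far = within_k_words(tokens, "the", "accessible", 3)     # positions 0 and 5
print(near, far)
```

Answering such queries from an index needs the position of each term occurrence, which a postings list of docIDs alone does not record.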
Outline

❶ Introduction

❷ Inverted index

❸ Processing Boolean queries

❹ Query optimization

141
Query optimization

• Consider a query that is an AND of n terms, n > 2
• For each of the terms, get its postings list, then AND them together
• Example query: BRUTUS AND CALPURNIA AND CAESAR
• What is the best order for processing this query?

Query optimization

• Example query: BRUTUS AND CALPURNIA AND CAESAR
• Simple and effective optimization: Process in order of increasing frequency
• Start with the shortest postings list, then keep cutting further
• In this example, first CAESAR, then CALPURNIA, then BRUTUS

Optimized intersection algorithm for
conjunctive queries
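Processing a conjunctive query in order of increasing frequency can be sketched in Python as below. The postings in `index` are illustrative values assumed for this sketch (chosen so that CAESAR has the shortest list), and `intersect_terms` is a name introduced here.

```python
def intersect(p1, p2):
    """Two-pointer intersection of postings lists sorted by docID."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


def intersect_terms(terms, index):
    """AND together the postings of all terms, shortest list first."""
    if not terms or any(term not in index for term in terms):
        return []
    # Sort terms by increasing document frequency (postings list length).
    ordered = sorted(terms, key=lambda term: len(index[term]))
    result = index[ordered[0]]
    for term in ordered[1:]:
        if not result:
            break  # the intermediate result is empty, so the answer is empty
        result = intersect(result, index[term])
    return result


# Hypothetical postings lists, for illustration only.
index = {
    "BRUTUS": [1, 2, 4, 11, 31, 45, 173, 174],
    "CALPURNIA": [2, 31, 54, 101],
    "CAESAR": [5, 31],
}
print(intersect_terms(["BRUTUS", "CALPURNIA", "CAESAR"], index))
```

The intermediate result can never be longer than the shortest list processed so far, so starting with the rarest term keeps every later intersection small.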

144 144
More general optimization

• Example query: (MADDING OR CROWD) AND (IGNOBLE OR STRIFE)
• Get frequencies for all terms
• Estimate the size of each OR by the sum of its frequencies (conservative)
• Process in increasing order of OR sizes

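The ordering heuristic for a query that is an AND of ORs can be sketched in Python. The document frequencies in `doc_freq` are made-up numbers for illustration, and `or_size_estimate` is a name introduced here.

```python
def or_size_estimate(terms, doc_freq):
    """Upper bound on the size of an OR: the sum of its term frequencies."""
    return sum(doc_freq[term] for term in terms)


# Hypothetical document frequencies, for illustration only.
doc_freq = {"MADDING": 10, "CROWD": 500, "IGNOBLE": 5, "STRIFE": 40}

# (MADDING OR CROWD) AND (IGNOBLE OR STRIFE)
clauses = [("MADDING", "CROWD"), ("IGNOBLE", "STRIFE")]

# Process the OR clauses in increasing order of estimated size.
ordered = sorted(clauses, key=lambda clause: or_size_estimate(clause, doc_freq))
for clause in ordered:
    print(clause, or_size_estimate(clause, doc_freq))
```

The sum is conservative because a document containing both terms of an OR is counted twice, so the true result can only be smaller than the estimate.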
