01 Intro

The document discusses Boolean retrieval for information retrieval systems, including an inverted index data structure and processing Boolean queries. It provides definitions of information retrieval and the Boolean model. The Boolean model treats queries as Boolean expressions and returns all documents that satisfy the expression.

Uploaded by

Rajput Singh

Information Retrieval Systems

MCAE 0303

Lecture 1: Boolean Retrieval

Take Away

• Boolean retrieval: design and data structures of a simple information retrieval system

• What topics will be covered in this class?

Outline

❶ Introduction

❷ Inverted index

❸ Processing Boolean queries

❹ Query optimization

Definition of Information Retrieval

Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

Boolean Retrieval

• The Boolean model is arguably the simplest model on which to base an information retrieval system.

• Queries are Boolean expressions, e.g., CAESAR AND BRUTUS.

• The search engine returns all documents that satisfy the Boolean expression.

Does Google use the Boolean model?

Outline

❶ Introduction

❷ Inverted index

❸ Processing Boolean queries

❹ Query optimization

Unstructured data in 1650: Shakespeare

The term “unstructured data” refers to data that does not have a clear, semantically overt, easy-for-a-computer structure.

Unstructured data in 1650

• Which plays of Shakespeare contain the words BRUTUS AND CAESAR, but not CALPURNIA?
• One could grep all of Shakespeare’s plays for BRUTUS and CAESAR, then strip out lines containing CALPURNIA.
• Why is grep not the solution?
  • Slow (for large collections)
  • grep is line-oriented, IR is document-oriented
  • “NOT CALPURNIA” is non-trivial
  • Other operations (e.g., find the word ROMANS near COUNTRYMAN) not feasible

Term-Document Incidence Matrix
            Antony and  Julius  The      Hamlet  Othello  Macbeth
            Cleopatra   Caesar  Tempest
ANTONY      1           1       0        0       0        1
BRUTUS      1           1       0        1       0        0
CAESAR      1           1       0        1       1        1
CALPURNIA   0           1       0        0       0        0
CLEOPATRA   1           0       0        0       0        0
MERCY       1           0       1        1       1        1
WORSER      1           0       1        1       1        0
...

Entry is 1 if term occurs. Example: CALPURNIA occurs in Julius Caesar.

Entry is 0 if term doesn’t occur. Example: CALPURNIA doesn’t occur in The Tempest.

Incidence Vectors

• So we have a 0/1 vector for each term.

• To answer the query BRUTUS AND CAESAR AND NOT CALPURNIA:
  • Take the vectors for BRUTUS, CAESAR, and CALPURNIA
  • Complement the vector of CALPURNIA
  • Do a (bitwise) AND on the three vectors
  • 110100 AND 110111 AND 101111 = 100100
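
The bitwise computation can be sketched in Python by storing each incidence vector as an integer. This is only an illustration, assuming six documents in the column order of the matrix; the names `N_DOCS`, `vectors`, `mask`, and `answer` are invented for the sketch.

```python
# Incidence vectors from the matrix, one bit per play (leftmost bit = first column).
N_DOCS = 6
vectors = {
    "BRUTUS": 0b110100,
    "CAESAR": 0b110111,
    "CALPURNIA": 0b010000,
}

# Python integers are unbounded, so mask the complement back to N_DOCS bits.
mask = (1 << N_DOCS) - 1
not_calpurnia = ~vectors["CALPURNIA"] & mask

answer = vectors["BRUTUS"] & vectors["CAESAR"] & not_calpurnia
print(format(not_calpurnia, "06b"))  # 101111
print(format(answer, "06b"))         # 100100
```

The two 1 bits in the answer are the first and fourth columns of the matrix.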

0/1 vector for BRUTUS
            Antony and  Julius  The      Hamlet  Othello  Macbeth
            Cleopatra   Caesar  Tempest
ANTONY      1           1       0        0       0        1
BRUTUS      1           1       0        1       0        0
CAESAR      1           1       0        1       1        1
CALPURNIA   0           1       0        0       0        0
CLEOPATRA   1           0       0        0       0        0
MERCY       1           0       1        1       1        1
WORSER      1           0       1        1       1        0
...
result:     1           0       0        1       0        0

Answers to query

Antony and Cleopatra, Act III, Scene ii

Agrippa [Aside to Domitius Enobarbus]: Why, Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring; and he wept
When at Philippi he found Brutus slain.

Hamlet, Act III, Scene ii

Lord Polonius: I did enact Julius Caesar: I was killed i’ the Capitol; Brutus killed me.

Sec. 6.2

How to tell whether a search engine is good

• How do we know if our results are any good?
  • Evaluating a search engine
  • Benchmarks
  • Precision and recall
• Results summaries:
  • Making our good results usable to a user

Sec. 8.3

Unranked retrieval evaluation: Precision and Recall

• Precision: fraction of retrieved docs that are relevant (P).
• Recall: fraction of relevant docs that are retrieved (R).

                Relevant   Nonrelevant
Retrieved       tp         fp
Not retrieved   fn         tn

• Precision P = tp/(tp + fp)

• Recall R = tp/(tp + fn)

• Recall = Number of pages that were retrieved and relevant / Total number of relevant pages.

• Precision = Number of pages that were retrieved and relevant / Total number of retrieved pages.

• Example: Let us say there exist a total of 5 pages labelled P1, P2, P3,
P4 and P5. Let us assume that for the query “weather in Los Angeles”,
the pages that are relevant are P3, P4 and P5 (the green pages shown
below). So the total number of relevant pages is 3. Let us assume that
a search engine returns the pages P2 and P3. So the number of
retrieved pages is 2.

• The search engine returns the pages P2 and P3 but only P3 is
relevant. So the number of pages that are retrieved and relevant is 1
(only P3).

• So based on the formula,

• Recall = 1 / 3 ≈ 0.33

• Precision = 1 / 2 = 0.5

• Higher values of precision and recall (closer to 1) are better.
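
The same calculation can be written as a short Python sketch over sets of page labels. The function name `precision_recall` is illustrative, and returning 0.0 when a set is empty is an assumption made here to avoid division by zero, not something the slides specify.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two sets of page identifiers."""
    tp = len(retrieved & relevant)  # pages that are both retrieved and relevant
    # Assumption: report 0.0 rather than dividing by zero for an empty set.
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall

relevant = {"P3", "P4", "P5"}
retrieved = {"P2", "P3"}
p, r = precision_recall(retrieved, relevant)
print(f"precision = {p:.2f}, recall = {r:.2f}")  # precision = 0.50, recall = 0.33
```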

• Now let us think about why we need both precision and recall.

• Suppose we are trying to build our own search engine. In one case, say we design our search engine to return only one page for any query. If that one page is relevant, the precision will be Number of Retrieved and Relevant Pages / Number of Retrieved Pages = 1 / 1, which is 100%.

• If there are actually 1000 relevant pages that exist, the recall will be 1
/ 1000 which is 0.1%.

• Clearly, this system is not performing well with such a poor recall.

• If we didn’t have recall but only had precision as an evaluation metric, this system would be incorrectly assumed to be performing very well, whereas in reality it isn’t.

Bigger Collections

• Consider N = 10^6 documents, each with about 1000 tokens ⇒ total of 10^9 tokens
• On average 6 bytes per token, including spaces and punctuation ⇒ size of document collection is about 6 · 10^9 bytes = 6 GB
• Assume there are M = 500,000 distinct terms in the collection
• (Notice that we are making a term/token distinction.)

Can’t build the incidence matrix

• M × N = 500,000 × 10^6 = half a trillion 0s and 1s.
• But the matrix has no more than one billion 1s, since the collection has only 10^9 tokens.
  • Matrix is extremely sparse.
• What is a better representation?
  • We only record the 1s.
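
The size estimates can be checked with a few lines of Python. This is a back-of-the-envelope sketch using the figures given for the collection (one million documents, about 1000 tokens each, 6 bytes per token, 500,000 distinct terms); the variable names are invented for illustration.

```python
N = 10**6                 # documents
TOKENS_PER_DOC = 1000
BYTES_PER_TOKEN = 6
M = 500_000               # distinct terms

total_tokens = N * TOKENS_PER_DOC                   # 10**9 tokens
collection_bytes = total_tokens * BYTES_PER_TOKEN   # 6 * 10**9 bytes, about 6 GB
matrix_cells = M * N                                # 5 * 10**11 cells
max_ones = total_tokens                             # each token sets at most one cell to 1

print(f"{matrix_cells:.1e} cells, at most {max_ones / matrix_cells:.1%} of them are 1")
# 5.0e+11 cells, at most 0.2% of them are 1
```

At most one cell in 500 holds a 1, which is why storing only the 1s is so much cheaper than storing the full matrix.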

Inverted Index

For each term t, we store a list of all documents that contain t.

[Figure: inverted index, labelled “dictionary” and “postings”]

Inverted index construction
❶ Collect the documents to be indexed.

❷ Tokenize the text, turning each document into a list of tokens.

❸ Do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms.

❹ Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings.

Tokenizing and preprocessing
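
A minimal sketch of steps ❷ and ❸, assuming that a token is a run of letters or apostrophes and that normalization is nothing more than lowercasing. Real systems use richer preprocessing; the function names and the sample sentence are illustrative only and do not come from the slides.

```python
import re

def tokenize(text):
    """Split text into tokens; here a token is a maximal run of letters or apostrophes."""
    return re.findall(r"[A-Za-z']+", text)

def normalize(tokens):
    """Map tokens to indexing terms; this sketch only lowercases them."""
    return [tok.lower() for tok in tokens]

text = "Friends, Romans, countrymen."  # sample input chosen for illustration
tokens = tokenize(text)
terms = normalize(tokens)
print(tokens)  # ['Friends', 'Romans', 'countrymen']
print(terms)   # ['friends', 'romans', 'countrymen']
```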

Generate postings

Sort postings

Create postings lists, determine document frequency

Split the result into dictionary and postings file

[Figure: the result split into “dictionary” and “postings”]

Questions
• The posting list in an inverted index is sorted by
• A. Term frequency
• B. Document frequency
• C. Term ID
• D. Document ID

Questions
• Stemming is a technique used for
• A. Tokenization
• B. Normalization
• C. Document ranking
• D. Case folding

Questions
• Why is the dictionary in an inverted index sorted?
• A. Because it looks good
• B. Because it makes linear search easy
• C. Because it makes binary search easy

Questions
• Normalization helps in:
• A. reducing dictionary size
• B. making search fast
• C. reducing postings list size

Draw inverted index
• Draw the inverted index that would be built for the following
document collection.

Doc 1 new home sales top forecasts
Doc 2 home sales rise in july
Doc 3 increase in home sales in july
Doc 4 july new home sales rise
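
One way to check a hand-drawn answer is to build the index in code. The sketch below follows the construction steps (generate postings, sort, group into postings lists, count document frequency) on this collection. It assumes whitespace tokenization and no further normalization, and the variable names `docs`, `pairs`, and `index` are invented for the example.

```python
from itertools import groupby

docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
    4: "july new home sales rise",
}

# Generate (term, docID) pairs; whitespace splitting stands in for tokenization.
pairs = [(term, doc_id) for doc_id, text in docs.items() for term in text.split()]

# Sort by term, then docID, dropping duplicates (e.g. "in" occurs twice in Doc 3).
pairs = sorted(set(pairs))

# Group the sorted pairs by term; each group becomes one postings list.
index = {term: [doc_id for _, doc_id in group]
         for term, group in groupby(pairs, key=lambda pair: pair[0])}

# Document frequency is the length of a term's postings list.
for term, postings in index.items():
    print(f"{term} (df={len(postings)}) -> {postings}")
```

Running it prints one line per term in dictionary order, which can be compared against the drawing.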

Later in this course

• Index construction: how can we create inverted indexes for large collections?
• How much space do we need for dictionary and index?
• Index compression: how can we efficiently store and process indexes for large collections?
• Ranked retrieval: what does the inverted index look like when we want the “best” answer?

Outline

❶ Introduction

❷ Inverted index

❸ Processing Boolean queries

❹ Query optimization

Simple conjunctive query (two terms)

• Consider the query: BRUTUS AND CALPURNIA

• To find all matching documents using the inverted index:
❶ Locate BRUTUS in the dictionary
❷ Retrieve its postings list from the postings file
❸ Locate CALPURNIA in the dictionary
❹ Retrieve its postings list from the postings file
❺ Intersect the two postings lists
❻ Return intersection to user
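
The six steps can be sketched as follows, assuming the index is an in-memory dictionary and every postings list is sorted by docID so that a linear merge works. The function name `intersect` and the docIDs in the toy index are invented for illustration.

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by docID using a linear merge."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # docID occurs in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the list with the smaller docID
        else:
            j += 1
    return answer

# Toy index: the docIDs are made up for this example.
index = {
    "brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "calpurnia": [2, 31, 54, 101],
}

# Steps 1-4: look up both terms; steps 5-6: intersect and return.
result = intersect(index["brutus"], index["calpurnia"])
print(result)  # [2, 31]
```

Each pointer only moves forward, so the merge takes time proportional to the combined length of the two lists.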

Intersecting two posting lists