lecture02 - IR
Boolean model
Lecture 2
Recap: IR System & Tasks Involved
INDEX
SEARCH
PERFORMANCE EVALUATION
Unstructured data in 1650: Shakespeare
One could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia.
Grep: a linear scan through the documents.
Why is grep not the solution?
Slow (for large corpora)
NOT Calpurnia is non-trivial
Other operations (e.g., find the word Romans near countrymen) are not feasible
Ranked retrieval (finding the best documents to return): later lectures
Indexing
The way to avoid linearly scanning the texts for each query is to index the documents in advance.
Boolean retrieval
Term-document incidence

           Antony and  Julius  The      Hamlet  Othello  Macbeth
           Cleopatra   Caesar  Tempest
Antony         1          1       0        0       0        1
Brutus         1          1       0        1       0        0
Caesar         1          1       0        1       1        1
Calpurnia      0          1       0        0       0        0
Cleopatra      1          0       0        0       0        0
mercy          1          0       1        1       1        1
worser         1          0       1        1       1        0

An entry is 1 if the play contains the word, 0 otherwise.
0/1 vector for Brutus
To answer Brutus AND Caesar AND NOT Calpurnia, take the 0/1 row vectors for Brutus, Caesar and Calpurnia (complemented), then bitwise AND them:
110100 AND 110111 AND 101111 = 100100
The 1’s identify Antony and Cleopatra and Hamlet.
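A minimal Python sketch of this bit-vector query, using the rows of the incidence matrix above:

```python
# Rows of the term-document incidence matrix (columns: Antony and
# Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth).
vectors = {
    "Brutus":    [1, 1, 0, 1, 0, 0],
    "Caesar":    [1, 1, 0, 1, 1, 1],
    "Calpurnia": [0, 1, 0, 0, 0, 0],
}

# Brutus AND Caesar AND NOT Calpurnia: bitwise ops column by column.
answer = [b & c & (1 - p)
          for b, c, p in zip(vectors["Brutus"],
                             vectors["Caesar"],
                             vectors["Calpurnia"])]
print(answer)  # [1, 0, 0, 1, 0, 0] -> Antony and Cleopatra, Hamlet
```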
Answers to query
Antony and Cleopatra, Act III, Scene ii
Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring; and he wept
When at Philippi he found Brutus slain.
Bigger corpora
Consider N = 1M documents, each with about 1K terms.
Avg 6 bytes/term (incl. spaces/punctuation), so 1M x 1K x 6 bytes = 6GB of data in the documents.
Say there are m = 500K distinct terms among these.
Can’t build the matrix
A 500K x 1M matrix has half a trillion 0’s and 1’s.
But it has no more than one billion 1’s: 1M documents with ~1K terms each give at most 10^9 term occurrences.
The matrix is extremely sparse.
What’s a better representation?
We only record the 1 positions.
Ad-hoc retrieval
To retrieve documents that are relevant to the information need of the user.

For each term, keep a sorted list of the docIDs (postings) containing it:
Brutus    -> 2 4 8 16 32 64 128
Calpurnia -> 1 2 3 5 8 13 21 34
Caesar    -> 13 16
Documents -> Tokenizer
Token stream: Friends Romans Countrymen
-> Linguistic modules (more on these later)
Modified tokens: friend roman countryman
-> Indexer
Inverted index:
friend     -> 2 4
roman      -> 1 2
countryman -> 13 16
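The pipeline can be sketched end to end in Python. This is a minimal sketch with simplifying assumptions: whitespace tokenization and lowercasing as the only linguistic normalization (real linguistic modules would also stem, e.g. Friends -> friend).

```python
def tokenize(text):
    # Assumption: whitespace tokenization only.
    return text.split()

def normalize(token):
    # Assumption: lowercasing stands in for the linguistic modules;
    # no stemming or lemmatization is done here.
    return token.lower()

def build_index(docs):
    """docs: dict mapping docID -> text. Returns term -> sorted docID list."""
    index = {}
    for doc_id in sorted(docs):           # process docs in docID order
        for tok in tokenize(docs[doc_id]):
            term = normalize(tok)
            postings = index.setdefault(term, [])
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)   # record each docID once
    return index

docs = {1: "Friends Romans Countrymen", 2: "Romans and Friends"}
print(build_index(docs))
```

Because docIDs are processed in increasing order, each postings list comes out sorted without an extra sort.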
Indexer steps (1)
Sequence of (Modified token, Document ID) pairs.

Doc 1: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.

Term      Doc #
I         1
did       1
enact     1
julius    1
caesar    1
I         1
was       1
killed    1
i'        1
the       1
capitol   1
brutus    1
killed    1
me        1
so        2
let       2
it        2
be        2
with      2
caesar    2
the       2
noble     2
brutus    2
hath      2
told      2
you       2
caesar    2
was       2
ambitious 2

Indexer steps (2)
Sort by terms. Core indexing step.

Term      Doc #
ambitious 2
be        2
brutus    1
brutus    2
capitol   1
caesar    1
caesar    2
caesar    2
did       1
enact     1
hath      2
I         1
I         1
i'        1
it        2
julius    1
killed    1
killed    1
let       2
me        1
noble     2
so        2
the       1
the       2
told      2
you       2
was       1
was       2
with      2
Indexer steps (3)
Multiple term entries in a single document are merged.
Frequency information is added.
Why frequency? Will discuss later.

Term      Doc #  Term freq
ambitious 2      1
be        2      1
brutus    1      1
brutus    2      1
capitol   1      1
caesar    1      1
caesar    2      2
did       1      1
enact     1      1
hath      2      1
I         1      2
i'        1      1
it        2      1
julius    1      1
killed    1      2
let       2      1
me        1      1
noble     2      1
so        2      1
the       1      1
the       2      1
told      2      1
you       2      1
was       1      1
was       2      1
with      2      1
The result is split into a Dictionary file and a
Postings file.
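The three indexer steps can be sketched end to end in Python. This is a minimal sketch, assuming whitespace tokenization and lowercasing; the function name `index_pairs` is illustrative.

```python
from collections import Counter

def index_pairs(docs):
    # Step 1: sequence of (modified token, docID) pairs.
    pairs = [(tok.lower(), doc_id)
             for doc_id, text in docs.items()
             for tok in text.split()]
    # Step 2: sort by term (then docID). Core indexing step.
    pairs.sort()
    # Step 3: merge duplicate entries, recording term frequencies.
    freqs = Counter(pairs)
    dictionary = {}  # term -> postings list of (docID, freq) pairs
    for (term, doc_id) in sorted(freqs):
        dictionary.setdefault(term, []).append((doc_id, freqs[(term, doc_id)]))
    return dictionary

docs = {1: "I did enact Julius Caesar I was killed i' the Capitol Brutus killed me",
        2: "So let it be with Caesar The noble Brutus hath told you Caesar was ambitious"}
idx = index_pairs(docs)
print(idx["caesar"])   # [(1, 1), (2, 2)]
print(idx["killed"])   # [(1, 2)]
```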
Where do we pay in storage?

Dictionary: the terms, each with its document frequency (N docs) and collection frequency, plus a pointer to its postings list.
Postings: lists of (Doc #, Freq) pairs.

Term      N docs  Coll freq
ambitious 1       1
be        1       1
brutus    2       2
capitol   1       1
caesar    2       3
did       1       1
enact     1       1
hath      1       1
I         1       2
i'        1       1
it        1       1
julius    1       1
killed    1       2
let       1       1
me        1       1
noble     1       1
so        1       1
the       2       2
told      1       1
you       1       1
was       2       2
with      1       1

Will quantify the storage later.
The index we just built
Today’s focus: how do we process a query?
Later: what kinds of queries can we process?
Query processing: AND
Consider processing the query:
Brutus AND Caesar
Locate Brutus in the Dictionary; retrieve its postings.
Locate Caesar in the Dictionary; retrieve its postings.
"Merge" the two postings:
Brutus -> 2 4 8 16 32 64 128
Caesar -> 1 2 3 5 8 13 21 34
The merge
Walk through the two postings simultaneously, in time linear in the total number of postings entries.
Brutus -> 2 4 8 16 32 64 128
Caesar -> 1 2 3 5 8 13 21 34
Result -> 2 8
Note: this works only because the postings are sorted by docID.
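The walk-through can be sketched as a two-pointer intersection. A minimal Python sketch, using lists rather than the linked lists of the pseudocode:

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists in O(len(p1) + len(p2)) time."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:      # docID in both lists: keep it
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:     # advance the pointer on the smaller docID
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))  # [2, 8]
```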
Query optimization
In which order should we process an AND of several terms?
Brutus    -> 2 4 8 16 32 64 128
Calpurnia -> 1 2 3 5 8 16 21 34
Caesar    -> 13 16
Query optimization example
Process in order of increasing freq:
start with the smallest set, then keep cutting further.
Brutus    -> 2 4 8 16 32 64 128
Calpurnia -> 1 2 3 5 8 16 21 34
Caesar    -> 13 16
Here: execute the query as (Caesar AND Brutus) AND Calpurnia.

MERGE(p, q)
1 answer ← ⟨⟩
2 while p ≠ NIL and q ≠ NIL
3   do if docID(p) = docID(q)
4        then ADD(answer, docID(p))
5             p ← next(p), q ← next(q)
6      else if docID(p) < docID(q)
7        then p ← next(p)
8        else q ← next(q)
9 return answer
More general optimization
e.g., (madding OR crowd) AND (ignoble OR strife)
Get freq’s for all terms.
Estimate the size of each OR by the sum of its freq’s (conservative).
Process in increasing order of OR sizes.
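This estimate can be sketched in a few lines of Python. The frequencies below are invented purely for illustration:

```python
# Hypothetical term frequencies (made-up numbers for illustration).
freqs = {"madding": 10, "crowd": 500, "ignoble": 3, "strife": 40}

# Query: (madding OR crowd) AND (ignoble OR strife)
query = [("madding", "crowd"), ("ignoble", "strife")]

# Conservative estimate: |A OR B| <= freq(A) + freq(B).
order = sorted(query, key=lambda group: sum(freqs[t] for t in group))
print(order)  # [('ignoble', 'strife'), ('madding', 'crowd')]
```

Processing the smaller OR group first keeps the intermediate result as small as possible.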
Exercise
Recommend a query processing order for INTERSECT(⟨t1, …, tn⟩):

INTERSECT(⟨t1, …, tn⟩)
1 terms ← SORTBYINCREASINGFREQUENCY(⟨t1, …, tn⟩)
2 result ← postings(first(terms))
3 terms ← rest(terms)
4 while terms ≠ NIL and result ≠ NIL
5   do result ← INTERSECT(result, postings(first(terms)))
6      terms ← rest(terms)
7 return result
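A runnable Python sketch of INTERSECT over n terms, using list-based postings; sorting by postings-list length stands in for SORTBYINCREASINGFREQUENCY:

```python
def intersect_two(p1, p2):
    # Standard two-pointer intersection of sorted docID lists.
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

def intersect_terms(terms, postings):
    """AND together the postings of all terms, cheapest list first."""
    ordered = sorted(terms, key=lambda t: len(postings[t]))  # increasing freq
    result = postings[ordered[0]]
    for t in ordered[1:]:
        if not result:
            break  # intermediate result is empty: the answer is empty
        result = intersect_two(result, postings[t])
    return result

postings = {"brutus":    [2, 4, 8, 16, 32, 64, 128],
            "calpurnia": [1, 2, 3, 5, 8, 16, 21, 34],
            "caesar":    [13, 16]}
print(intersect_terms(["brutus", "calpurnia", "caesar"], postings))  # [16]
```

Starting from Caesar's two-entry list means every later merge is bounded by an intermediate result of at most two docIDs.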
Thank you for your attention!