Introduction to Information Retrieval
Boolean Retrieval
Sec. 1.1
Term-document incidence matrices

            Antony and  Julius  The      Hamlet  Othello  Macbeth
            Cleopatra   Caesar  Tempest
Antony          1          1       0        0       0        1
Brutus          1          1       0        1       0        0
Caesar          1          1       0        1       1        1
Calpurnia       0          1       0        0       0        0
Cleopatra       1          0       0        0       0        0
mercy           1          0       1        1       1        1
worser          1          0       1        1       1        0
Incidence vectors
• So we have a 0/1 vector for each term.
• To answer the query Brutus AND Caesar AND NOT Calpurnia: take the
  vectors for Brutus, Caesar and Calpurnia (complemented) and bitwise AND:
  – 110100 AND
  – 110111 AND
  – 101111 =
  – 100100
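The bitwise AND above can be checked directly. A minimal sketch, using Python integer bit operations to stand in for the 0/1 incidence vectors (column order as in the matrix above):

```python
# Answering "Brutus AND Caesar AND NOT Calpurnia" with 0/1 incidence
# vectors. Bit i corresponds to the i-th play in the matrix above.
brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000

# Complement Calpurnia's vector (mask to 6 bits), then AND all three.
result = brutus & caesar & (~calpurnia & 0b111111)
print(format(result, "06b"))  # 100100
```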
Sec. 1.1
Answers to query
• 100100 has 1s in the first and fourth positions, so the answer is
  Antony and Cleopatra and Hamlet.
Sec. 1.1
Bigger collections
• Consider N = 1 million documents, each with about 1000 words.
• Avg 6 bytes/word including spaces/punctuation
  – 6 GB of data in the documents.
• Say there are M = 500K distinct terms among these.
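The slide's numbers are easy to verify, and they also show why the incidence matrix of the earlier slides cannot scale: with M terms and N documents it would need M × N cells. A quick check:

```python
# Checking the slide's figures: N documents, ~1000 words each,
# ~6 bytes per word on average.
N = 1_000_000          # documents
words_per_doc = 1000
bytes_per_word = 6     # average, including spaces/punctuation
M = 500_000            # distinct terms

data_bytes = N * words_per_doc * bytes_per_word
print(data_bytes / 10**9)   # 6.0 -> ~6 GB of raw text

# A term-document incidence matrix would need M x N 0/1 cells:
matrix_cells = M * N
print(matrix_cells)         # 500,000,000,000 -> half a trillion entries
```

The matrix is also extremely sparse (at most N × 1000 one-entries), which is what motivates the inverted index introduced next.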
Introduction to Information Retrieval
The Inverted Index
The key data structure underlying modern IR
Sec. 1.2
Inverted index
• For each term t, we must store a list of all documents that contain t.
  – Identify each doc by a docID, a document serial number
• Can we use fixed-size arrays for this?

Brutus    1 2 4 11 31 45 173 174
Caesar    1 2 4 5 6 16 57 132
Calpurnia 2 31 54 101
Inverted index
• We need variable-size postings lists
  – On disk, a contiguous run of postings is normal and best
  – In memory, can use linked lists or variable-length arrays
• Some tradeoffs in size/ease of insertion

Brutus    1 2 4 11 31 45 173 174
Caesar    1 2 4 5 6 16 57 132
Calpurnia 2 31 54 101

Dictionary (terms) on the left; Postings on the right — each docID in a
list is a posting. Sorted by docID (more later on why).
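An in-memory version of this structure is straightforward. A minimal sketch (not the book's implementation): a dict mapping each term to a sorted list of docIDs.

```python
from collections import defaultdict

def build_index(docs):
    """Build a toy inverted index. docs: dict mapping docID -> token list."""
    postings = defaultdict(set)
    for doc_id, tokens in docs.items():
        for t in tokens:
            postings[t].add(doc_id)
    # Sort each postings list by docID (needed for the merge, later).
    return {t: sorted(ids) for t, ids in postings.items()}

# Hypothetical toy collection, echoing the docIDs on the slide.
docs = {
    1: ["brutus", "caesar"],
    2: ["brutus", "caesar", "calpurnia"],
    4: ["brutus", "caesar"],
}
index = build_index(docs)
print(index["brutus"])  # [1, 2, 4]
```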
Sec. 1.2
[Figure: indexer pipeline — documents → Tokenizer → Linguistic modules →
Indexer → Inverted index. Resulting postings: friend → 2, 4;
roman → 1, 2; countryman → 13, 16.]
Initial stages of text processing
• Tokenization
  – Cut character sequence into word tokens
  • Deal with “John’s”, “a state-of-the-art solution”
• Normalization
  – Map text and query term to same form
  • You want U.S.A. and USA to match
• Stemming
  – We may wish different forms of a root to match
  • authorize, authorization
• Stop words
  – We may omit very common words (or not)
  • the, a, to, of
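The stages above can be sketched crudely in a few lines. This is an illustrative toy only — real tokenizers and normalizers use far more careful rules (e.g. for “John’s” or “U.S.A.”), and the stop-word list here is an assumption:

```python
import re

# Hypothetical minimal stop-word list for illustration.
STOP_WORDS = {"the", "a", "to", "of", "and"}

def tokenize(text):
    # Crude tokenization + case normalization: lowercase alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())

def preprocess(text):
    # Drop stop words after tokenizing/normalizing.
    return [t for t in tokenize(text) if t not in STOP_WORDS]

print(preprocess("Friends, Romans, countrymen: to the Capitol!"))
# ['friends', 'romans', 'countrymen', 'capitol']
```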
Sec. 1.2
[Figure: the index built from Doc 1 and Doc 2 — the Dictionary holds
terms and counts (why frequency? will discuss later), with pointers to
the Postings: lists of docIDs.]
• IR system implementation questions:
  – How do we index efficiently?
  – How much storage do we need?
Introduction to
Information Retrieval
Query processing with an inverted index
Sec. 1.3
Query processing: AND
• Consider processing the query Brutus AND Caesar:
  – Locate Brutus in the dictionary; retrieve its postings.
  – Locate Caesar in the dictionary; retrieve its postings.
  – “Merge” (intersect) the two postings lists.
Sec. 1.3
The merge
• Walk through the two postings simultaneously, in time linear in the
  total number of postings entries

Brutus 2 4 8 16 32 64 128
Caesar 1 2 3 5 8 13 21 34
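The merge the slide describes is the classic two-pointer intersection. A minimal sketch:

```python
def intersect(p1, p2):
    """Intersect two docID-sorted postings lists in linear time."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # docID in both lists: a match
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the pointer at the smaller docID
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))  # [2, 8]
```

This is why postings lists are kept sorted by docID: without the sort order, the linear-time walk would not work.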
Sec. 1.3
Boolean queries: More general merges
• Exercise: Adapt the merge for the query:
  Brutus AND NOT Caesar
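One possible adaptation (a sketch, not the official exercise solution): keep a docID from the first list only when it is absent from the second, still in time linear in the total number of postings entries.

```python
def and_not(p1, p2):
    """Return docIDs in sorted list p1 that are NOT in sorted list p2."""
    answer = []
    i = j = 0
    while i < len(p1):
        if j < len(p2) and p1[i] == p2[j]:
            i += 1                 # in both lists: excluded by NOT
            j += 1
        elif j < len(p2) and p2[j] < p1[i]:
            j += 1                 # skip past smaller docIDs in p2
        else:
            answer.append(p1[i])   # p1[i] cannot appear in p2
            i += 1
    return answer

print(and_not([2, 4, 8, 16, 32, 64, 128], [1, 2, 3, 5, 8, 13, 21, 34]))
# [4, 16, 32, 64, 128]
```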
Sec. 1.3
Query optimization
• What is the best order for query processing?
• Consider a query that is an AND of n terms.
• For each of the n terms, get its postings, then AND them together.

Brutus    2 4 8 16 32 64 128
Caesar    1 2 3 5 8 16 21 34
Calpurnia 13 16

• Process in order of increasing document frequency: start with the
  smallest postings list (here Calpurnia) and keep cutting the
  intermediate result down.
Exercise
• Recommend a query processing order for
Does Google use the Boolean model?
• On Google, the default interpretation of a query [w1 w2 . . . wn] is
  w1 AND w2 AND . . . AND wn
• Cases where you get hits that do not contain one of the wi:
  – anchor text
  – page contains variant of wi (morphology, spelling correction, synonym)
  – long queries (n large)
  – Boolean expression generates very few hits
• Simple Boolean vs. ranking of the result set:
  – Simple Boolean retrieval returns matching documents in no
    particular order.
  – Google (and most well-designed Boolean engines) rank the result
    set: they rank good hits (according to some estimator of relevance)
    higher than bad hits.