Unit 1 Intro to IR
Information Retrieval
Retrieval
Note:
Many images, graphs, texts, slides, definitions, etc. are adapted from
various books and from various sources on the World Wide Web. This is
simply a presentation of concepts based on the original work of many
contributors to the field as well as the WWW.
Information Retrieval
▪ Information Retrieval (IR) is finding material (usually
documents) of an unstructured nature (usually text) that
satisfies an information need from within large collections
(usually stored on computers).
▪ These days we frequently think first of web search, but there
are many other cases:
▪ E-mail search
▪ Searching your laptop
▪ Corporate knowledge bases
▪ Legal information retrieval
Term-document incidence matrices
Incidence vectors
▪ So we have a 0/1 vector for each term.
▪ To answer the query, take the vectors for Brutus, Caesar and
Calpurnia (complemented) ➔ bitwise AND.
▪ 110100 AND 110111 AND 101111 = 100100
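The bitwise AND above can be sketched in a few lines of Python. This is only an illustration: the 6-bit width and the uncomplemented Calpurnia vector (010000) are inferred from the complemented vector on the slide, and the helper name `complement` is made up for this example.

```python
# Minimal sketch: answer the query with 0/1 incidence vectors.
# Each vector has one bit per document; the leftmost bit is the first document.
N_DOCS = 6                      # assumed width, taken from the 6-bit vectors on the slide
MASK = (1 << N_DOCS) - 1        # keeps complements within N_DOCS bits


def complement(vector):
    """Flip every bit of an incidence vector (hypothetical helper)."""
    return ~vector & MASK


brutus = 0b110100
caesar = 0b110111
calpurnia = 0b010000            # assumed: the slide only shows its complement, 101111

result = brutus & caesar & complement(calpurnia)
bits = format(result, f"0{N_DOCS}b")

# Zero-based positions of the documents that satisfy the query.
matching = [position for position, bit in enumerate(bits) if bit == "1"]
print(bits, matching)
```

The set bits of the result mark the matching documents, here the first and the fourth in the collection.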
Answers to query
▪ Antony and Cleopatra, Act III, Scene ii
Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring; and he wept
When at Philippi he found Brutus slain.
▪ Hamlet, Act III, Scene ii
Lord Polonius: I did enact Julius Caesar: I was killed i’ the
Capitol; Brutus killed me.
Bigger collections
Inverted index
▪ We need variable-size postings lists
▪ On disk, a continuous run of postings is normal and best
▪ In memory, can use linked lists or variable length arrays
▪ Some tradeoffs in size/ease of insertion
Dictionary ➔ Postings (each docID entry is a posting)
Brutus ➔ 1 2 4 11 31 45 173 174
Caesar ➔ 1 2 4 5 6 16 57 132
Calpurnia ➔ 2 31 54 101
Sorted by docID (more later on why).
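One way to picture the dictionary-plus-postings structure is a Python dict whose values are variable-length lists. The sketch below is only an illustration: the function name `build_inverted_index`, the whitespace tokenization, and the three tiny documents are assumptions made for this example, not part of the slides.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a list of docIDs, kept sorted by docID.

    docs: dict of docID (int) -> text. Tokenization here is a plain
    lowercase whitespace split, a simplification for illustration.
    """
    index = defaultdict(list)
    # Visiting documents in increasing docID order keeps every postings list sorted.
    for doc_id in sorted(docs):
        for term in set(docs[doc_id].lower().split()):
            index[term].append(doc_id)
    return dict(index)


# Hypothetical toy collection.
docs = {
    1: "Brutus killed Caesar",
    2: "Caesar and Calpurnia",
    4: "Brutus and Caesar",
}
index = build_inverted_index(docs)
print(index["brutus"], index["caesar"], index["calpurnia"])
```

Python lists play the role of the variable-length arrays mentioned above: appending is cheap, while a linked list would make insertion in the middle easier at the cost of extra pointers.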
[Figure: the Indexer produces the Inverted index; example terms friend, roman, countryman, each with its postings list.]
Why frequency?
Where do we pay in storage?
▪ Terms and counts
▪ Pointers
▪ Lists of docIDs
IR system implementation
The index we just built
▪ How do we process a query? (Our focus)
▪ Later: what kinds of queries can we process?
Query processing: AND
▪ Consider processing the query:
Brutus AND Caesar
✦Locate Brutus in the Dictionary;
▪ Retrieve its postings.
✦Locate Caesar in the Dictionary;
▪ Retrieve its postings.
✦“Merge” the two postings (intersect the document sets):
Brutus ➔ 2 4 8 16 32 64 128
Caesar ➔ 1 2 3 5 8 13 21 34
The merge
Brutus ➔ 2 4 8 16 32 64 128
Caesar ➔ 1 2 3 5 8 13 21 34
If the list lengths are x and y, the merge takes O(x+y)
operations.
Crucial: postings sorted by docID.
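The merge can be sketched as a two-pointer walk over the sorted lists. This is a minimal illustration rather than a reference implementation; the function name `intersect` is chosen for this example.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by docID.

    Each step advances at least one pointer, so the loop runs at most
    len(p1) + len(p2) times.
    """
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # docID is in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # p1[i] cannot appear later in p2
        else:
            j += 1                 # p2[j] cannot appear later in p1
    return answer


brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))
```

The "skip the smaller docID" step is only safe because both lists are sorted; with unsorted lists every pair would have to be compared.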
Boolean queries:
More general merges
▪ Exercise: Adapt the merge for the queries:
Brutus AND NOT Caesar
Brutus OR NOT Caesar
✦ Can we still run through the merge in time O(x+y)? What can
we achieve?
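As one possible approach to the first part of the exercise, the same two-pointer walk can be adapted to Brutus AND NOT Caesar. The function name `and_not` is made up for this sketch. The second query, Brutus OR NOT Caesar, is different in kind: its answer contains every document that lacks Caesar, so its size depends on the size of the collection rather than on x+y.

```python
def and_not(p1, p2):
    """Return docIDs that are in p1 but not in p2 (both sorted by docID)."""
    answer = []
    i = j = 0
    while i < len(p1):
        if j == len(p2) or p1[i] < p2[j]:
            answer.append(p1[i])   # p1[i] cannot occur in the rest of p2
            i += 1
        elif p1[i] == p2[j]:
            i += 1                 # excluded: docID is also in p2
            j += 1
        else:
            j += 1                 # catch p2 up to p1[i]
    return answer


brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(and_not(brutus, caesar))
```

Every iteration advances at least one pointer, so this variant still takes O(x+y) steps.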
Merging
What about an arbitrary Boolean formula?
(Brutus OR Caesar) AND NOT
(Antony OR Cleopatra)
✦ Can we always merge in “linear” time?
▪ Linear in what?
✦ Can we do better?
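One way to explore these questions is to evaluate the formula by composing pairwise merges. The sketch below is an illustration under stated assumptions: the helper names `union` and `and_not` are chosen for this example, and the postings for Antony and Cleopatra are hypothetical values invented to make it runnable.

```python
def union(p1, p2):
    """Merge two sorted postings lists, keeping each docID once."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            answer.append(p1[i])
            i += 1
        else:
            answer.append(p2[j])
            j += 1
    answer.extend(p1[i:])          # at most one of these tails is non-empty
    answer.extend(p2[j:])
    return answer


def and_not(p1, p2):
    """Return docIDs that are in p1 but not in p2 (both sorted)."""
    answer = []
    i = j = 0
    while i < len(p1):
        if j == len(p2) or p1[i] < p2[j]:
            answer.append(p1[i])
            i += 1
        elif p1[i] == p2[j]:
            i += 1
            j += 1
        else:
            j += 1
    return answer


brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
antony = [1, 2, 16]                # hypothetical postings
cleopatra = [2, 21]                # hypothetical postings

# (Brutus OR Caesar) AND NOT (Antony OR Cleopatra)
result = and_not(union(brutus, caesar), union(antony, cleopatra))
print(result)
```

For this particular formula each intermediate list is no longer than the sum of its inputs, so the work is linear in the total number of postings of the four terms. That need not hold for formulas that negate a term on its own, as in OR NOT.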
Query optimization