Unit 1 Intro to IR

Information Retrieval (IR) involves finding unstructured documents that meet user information needs from large collections. Key concepts include precision and recall, as well as the use of inverted indexes for efficient document retrieval. The document also discusses query processing, Boolean queries, and optimization techniques for handling complex queries.


Information Retrieval

Note:
Many images, graphs, texts, slides, definitions etc. are adapted from
various books as well as various sources on the World Wide Web. This is
simply a presentation of concepts based on the original work of many
contributors to the field as well as the WWW.

Dr. Sunita Jahirabadkar



Information Retrieval
▪ Information Retrieval (IR) is finding material (usually
documents) of an unstructured nature (usually text) that
satisfies an information need from within large collections
(usually stored on computers).
▪ These days we frequently think first of web search, but there
are many other cases:
▪ E-mail search
▪ Searching your laptop
▪ Corporate knowledge bases
▪ Legal information retrieval

Basic assumptions of Information Retrieval

▪ Collection: A set of documents
▪ Assume it is a static collection for the moment

▪ Goal: Retrieve documents with information that is relevant to the user’s information need and helps the user complete a task

How good are the retrieved docs?

▪ Precision: fraction of retrieved docs that are relevant to the user’s information need
▪ Recall: fraction of relevant docs in the collection that are retrieved

▪ More precise definitions and measurements to follow later
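As a quick illustration, both measures can be computed from two sets of document IDs. A minimal Python sketch (the doc IDs below are made up for illustration):

```python
# Precision and recall for a single query, given the set of retrieved
# doc IDs and the set of doc IDs judged relevant.

def precision_recall(retrieved, relevant):
    """Return (precision, recall) for sets of doc IDs."""
    hits = len(retrieved & relevant)                      # relevant docs actually returned
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = {1, 2, 3, 4}         # docs the system returned (hypothetical)
relevant = {2, 4, 5, 6, 7, 8}    # docs the user would judge relevant (hypothetical)

p, r = precision_recall(retrieved, relevant)
print(p, r)   # 2 of 4 retrieved are relevant; 2 of 6 relevant were retrieved
```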

Unstructured data in 1620

▪ Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
▪ One could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia
▪ Why is that not the answer?
▪ Slow (for large corpora)
▪ NOT Calpurnia is non-trivial
▪ Other operations (e.g., find the word Romans near countrymen) not feasible
▪ Ranked retrieval (best documents to return)
▪ Later lectures

Term-document incidence matrices

▪ Brutus AND Caesar BUT NOT Calpurnia
▪ Matrix entry is 1 if the play contains the word, 0 otherwise
(Term-document incidence matrix figure: terms as rows, Shakespeare plays as columns.)

Incidence vectors
▪ So we have a 0/1 vector for each term.
▪ To answer the query: take the vectors for Brutus, Caesar and Calpurnia (complemented) ➔ bitwise AND.
▪ 110100 AND
▪ 110111 AND
▪ 101111 =
▪ 100100
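The bitwise AND above can be reproduced directly with integer bit operations. A minimal Python sketch using the slide’s 6-bit vectors (one bit per play):

```python
# 0/1 incidence vectors as 6-bit integers, one bit per play.
brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000   # Calpurnia's vector; its complement is 101111

mask = 0b111111                                  # 6 plays -> 6-bit mask
result = brutus & caesar & (~calpurnia & mask)   # Brutus AND Caesar AND NOT Calpurnia
print(format(result, "06b"))                     # -> 100100
```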

Answers to query
▪ Antony and Cleopatra, Act III, Scene ii
Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring; and he wept
When at Philippi he found Brutus slain.
✦ Hamlet, Act III, Scene ii
Lord Polonius: I did enact Julius Caesar: I was killed i’ the
Capitol; Brutus killed me.

Bigger collections

▪ Consider N = 1 million documents, each with about 1000 words.
▪ Avg 6 bytes/word including spaces/punctuation
▪ 6GB of data in the documents.
▪ Say there are M = 500K distinct terms among these.

Can’t build the matrix

▪ 500K x 1M matrix has half-a-trillion 0’s and 1’s.
▪ But it has no more than one billion 1’s. Why?
▪ The matrix is extremely sparse.
▪ What’s a better representation?
▪ We only record the 1 positions.

Inverted index

▪ For each term t, we must store a list of all documents that contain t.
▪ Identify each doc by a docID, a document serial number
▪ Can we use fixed-size arrays for this?

Brutus ➔ 1 2 4 11 31 45 173 174
Caesar ➔ 1 2 4 5 6 16 57 132
Calpurnia ➔ 2 31 54 101

What happens if the word Caesar is added to a new document?

Inverted index
▪ We need variable-size postings lists
▪ On disk, a continuous run of postings is normal and best
▪ In memory, can use linked lists or variable length arrays
▪ Some tradeoffs in size/ease of insertion

Brutus ➔ 1 2 4 11 31 45 173 174
Caesar ➔ 1 2 4 5 6 16 57 132
Calpurnia ➔ 2 31 54 101

Dictionary | Postings
Sorted by docID (more later on why).

Inverted index construction

Documents to be indexed: Friends, Romans, countrymen.
➔ Tokenizer
Token stream: Friends Romans Countrymen
➔ Linguistic modules
Modified tokens: friend roman countryman
➔ Indexer
Inverted index:
friend ➔ 2 4
roman ➔ 1 2
countryman ➔ 13 16
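The indexing pipeline can be sketched in Python. This is a toy version: the tokenizer is crude, lowercasing stands in for the full linguistic modules (no stemming, so "Friends" indexes as "friends", not "friend"), and the two documents and their IDs are illustrative:

```python
from collections import defaultdict

docs = {
    1: "Friends, Romans, countrymen.",
    2: "So let it be with Caesar.",
}

def tokenize(text):
    # crude tokenizer: strip punctuation, split on whitespace
    return text.replace(",", " ").replace(".", " ").split()

index = defaultdict(list)            # term -> sorted list of docIDs
for doc_id in sorted(docs):
    seen = set()
    for token in tokenize(docs[doc_id]):
        term = token.lower()         # stand-in normalization step
        if term not in seen:         # merge duplicate (term, docID) pairs
            index[term].append(doc_id)
            seen.add(term)

print(index["caesar"])   # [2]
print(index["friends"])  # [1]
```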

Initial stages of text processing


▪ Tokenization
▪ Cut character sequence into word tokens
▪ Deal with “John’s”, a state-of-the-art solution
▪ Normalization
▪ Map text and query term to same form
▪ You want U.S.A. and USA to match
▪ Stemming
▪ We may wish different forms of a root to match
▪ authorize, authorization
▪ Stop words
▪ We may omit very common words (or not)
▪ the, a, to, of

Indexer steps: Token sequence
▪ Sequence of (Modified token, Document ID) pairs.

Doc 1: I did enact Julius Caesar I was killed i’ the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.

Indexer steps: Sort

▪ Sort by terms
▪ At least conceptually
▪ And then docID
Core indexing step
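The sort step falls out of Python’s built-in tuple ordering, which compares by term first and docID second (the sample pairs are illustrative):

```python
# (term, docID) pairs collected during the scan, then sorted:
pairs = [("caesar", 2), ("brutus", 1), ("caesar", 1), ("brutus", 2)]
pairs.sort()   # tuples compare element-wise: by term, then by docID
print(pairs)   # [('brutus', 1), ('brutus', 2), ('caesar', 1), ('caesar', 2)]
```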

Indexer steps: Dictionary & Postings

▪ Multiple term entries in a single document are merged.
▪ Split into Dictionary and Postings
▪ Doc. frequency information is added.

Why frequency?
Where do we pay in storage?

▪ Terms and counts
▪ Pointers
▪ Lists of docIDs
(Details of IR system implementation later.)
The index we just built

▪ Our focus: how do we process a query?
▪ Later – what kinds of queries can we process?
Query processing: AND
▪ Consider processing the query:
Brutus AND Caesar
✦ Locate Brutus in the Dictionary;
▪ Retrieve its postings.
✦ Locate Caesar in the Dictionary;
▪ Retrieve its postings.
✦ “Merge” the two postings (intersect the document sets):
Brutus ➔ 2 4 8 16 32 64 128
Caesar ➔ 1 2 3 5 8 13 21 34
The merge

▪ Walk through the two postings simultaneously, in time linear in the total number of postings entries
Brutus ➔ 2 4 8 16 32 64 128
Caesar ➔ 1 2 3 5 8 13 21 34
If the list lengths are x and y, the merge takes O(x+y) operations.
Crucial: postings sorted by docID.

Intersecting two postings lists (a “merge” algorithm)
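The merge is a two-pointer walk over the sorted lists; a Python sketch using the postings from the slides:

```python
# Intersect two postings lists sorted by docID in O(x + y) time.
def intersect(p1, p2):
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:           # docID in both lists -> keep it
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:          # advance the pointer at the smaller docID
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))     # [2, 8]
```

Each step advances at least one pointer, which is why the sort by docID is crucial: without it, neither pointer could safely skip ahead.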

Boolean queries: Exact match

▪ The Boolean retrieval model is able to answer any query that is a Boolean expression:
▪ Boolean queries are queries using AND, OR and NOT to join query terms
▪ Views each document as a set of words
▪ Is precise: document matches condition or not.
▪ Perhaps the simplest model to build an IR system on
▪ Primary commercial retrieval tool for 3 decades.
▪ Many search systems you still use are Boolean:
▪ Email, library catalog, macOS Spotlight

Boolean queries:
More general merges
▪ Exercise: Adapt the merge for the queries:
Brutus AND NOT Caesar
Brutus OR NOT Caesar

✦ Can we still run through the merge in time O(x+y)? What can
we achieve?
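One answer to the exercise, sketched in Python: for Brutus AND NOT Caesar the same two-pointer walk works, emitting docs present in the first list but absent from the second, still in O(x+y). (Brutus OR NOT Caesar is harder: its result can include almost every docID in the collection, so a merge of just these two lists cannot produce it.)

```python
# "p1 AND NOT p2" over postings lists sorted by docID, in O(x + y) time.
def and_not(p1, p2):
    answer = []
    i = j = 0
    while i < len(p1):
        if j >= len(p2) or p1[i] < p2[j]:
            answer.append(p1[i])     # docID in p1 only -> keep it
            i += 1
        elif p1[i] == p2[j]:         # docID in both -> exclude it
            i += 1
            j += 1
        else:                        # p2's docID is smaller -> skip it
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(and_not(brutus, caesar))       # [4, 16, 32, 64, 128]
```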

Merging
What about an arbitrary Boolean formula?
(Brutus OR Caesar) AND NOT
(Antony OR Cleopatra)
✦ Can we always merge in “linear” time?
▪ Linear in what?
✦ Can we do better?

Query optimization

▪ What is the best order for query processing?
▪ Consider a query that is an AND of n terms.
▪ For each of the n terms, get its postings, then AND them together.

Brutus ➔ 2 4 8 16 32 64 128
Caesar ➔ 1 2 3 5 8 16 21 34
Calpurnia ➔ 13 16

Query: Brutus AND Calpurnia AND Caesar

Query optimization example

▪ Process in order of increasing freq:
▪ start with smallest set, then keep cutting further.
(This is why we kept document freq. in the dictionary.)

Brutus ➔ 2 4 8 16 32 64 128
Caesar ➔ 1 2 3 5 8 16 21 34
Calpurnia ➔ 13 16

Execute the query as (Calpurnia AND Brutus) AND Caesar
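The frequency-ordered strategy can be sketched in Python, using postings-list length as the document frequency (in a real system the frequency would come from the dictionary, so no postings need to be read to pick the order):

```python
# Intersect two postings lists sorted by docID (two-pointer merge).
def intersect(p1, p2):
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

postings = {
    "brutus":    [2, 4, 8, 16, 32, 64, 128],
    "caesar":    [1, 2, 3, 5, 8, 16, 21, 34],
    "calpurnia": [13, 16],
}

def and_query(terms, postings):
    # Process in order of increasing document frequency, so every
    # intermediate result is no larger than the smallest list seen so far.
    ordered = sorted(terms, key=lambda t: len(postings[t]))
    result = postings[ordered[0]]
    for term in ordered[1:]:
        result = intersect(result, postings[term])
    return result

print(and_query(["brutus", "calpurnia", "caesar"], postings))   # [16]
```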
