Unit 1 Intro to IR

Information Retrieval (IR) involves finding unstructured documents that meet user information needs from large collections. Key concepts include precision and recall, as well as the use of inverted indexes for efficient document retrieval. The document also discusses query processing, Boolean queries, and optimization techniques for handling complex queries.


Information Retrieval

Note:
Many images, graphs, texts, slides, definitions etc. are adapted from
various books as well as various sources on the World Wide Web. This is
simply a presentation of concepts based on the original work of many
contributors to the field as well as the WWW.

Dr. Sunita Jahirabadkar



Information Retrieval
▪ Information Retrieval (IR) is finding material (usually
documents) of an unstructured nature (usually text) that
satisfies an information need from within large collections
(usually stored on computers).
▪ These days we frequently think first of web search, but there
are many other cases:
▪ E-mail search
▪ Searching your laptop
▪ Corporate knowledge bases
▪ Legal information retrieval

Basic assumptions of Information Retrieval

▪ Collection: A set of documents
▪ Assume it is a static collection for the moment

▪ Goal: Retrieve documents with information that is relevant to the user’s information need and helps the user complete a task

How good are the retrieved docs?

▪ Precision: fraction of retrieved docs that are relevant to the user’s information need
▪ Recall: fraction of relevant docs in the collection that are retrieved

▪ More precise definitions and measurements to follow later
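As a quick illustration, both measures can be computed from two sets of document IDs. A minimal Python sketch (the doc IDs below are made up for illustration):

```python
# Precision and recall for a single query, given the set of retrieved
# doc IDs and the set of doc IDs judged relevant.

def precision_recall(retrieved, relevant):
    """Return (precision, recall) for sets of doc IDs."""
    hits = len(retrieved & relevant)                      # relevant docs actually returned
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = {1, 2, 3, 4}         # docs the system returned (hypothetical)
relevant = {2, 4, 5, 6, 7, 8}    # docs the user would judge relevant (hypothetical)

p, r = precision_recall(retrieved, relevant)
print(p, r)   # 2 of 4 retrieved are relevant; 2 of 6 relevant were retrieved
```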

Unstructured data in 1620

▪ Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
▪ One could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia
▪ Why is that not the answer?
▪ Slow (for large corpora)
▪ NOT Calpurnia is non-trivial
▪ Other operations (e.g., find the word Romans near countrymen) not feasible
▪ Ranked retrieval (best documents to return)
▪ Later lectures

Term-document incidence matrices

▪ Brutus AND Caesar BUT NOT Calpurnia
▪ Matrix entry is 1 if the play contains the word, 0 otherwise
(Term-document incidence matrix figure: terms as rows, Shakespeare plays as columns.)

Incidence vectors
▪ So we have a 0/1 vector for each term.
▪ To answer the query: take the vectors for Brutus, Caesar and Calpurnia (complemented) ➔ bitwise AND.
▪ 110100 AND
▪ 110111 AND
▪ 101111 =
▪ 100100
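The bitwise AND above can be reproduced directly with integer bit operations. A minimal Python sketch using the slide’s 6-bit vectors (one bit per play):

```python
# 0/1 incidence vectors as 6-bit integers, one bit per play.
brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000   # Calpurnia's vector; its complement is 101111

mask = 0b111111                                  # 6 plays -> 6-bit mask
result = brutus & caesar & (~calpurnia & mask)   # Brutus AND Caesar AND NOT Calpurnia
print(format(result, "06b"))                     # -> 100100
```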

Answers to query
▪ Antony and Cleopatra, Act III, Scene ii
Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring; and he wept
When at Philippi he found Brutus slain.
✦ Hamlet, Act III, Scene ii
Lord Polonius: I did enact Julius Caesar: I was killed i’ the
Capitol; Brutus killed me.

Bigger collections

▪ Consider N = 1 million documents, each with about 1000 words.
▪ Avg 6 bytes/word including spaces/punctuation
▪ 6GB of data in the documents.
▪ Say there are M = 500K distinct terms among these.

Can’t build the matrix

▪ 500K x 1M matrix has half-a-trillion 0’s and 1’s.
▪ But it has no more than one billion 1’s. Why?
▪ The matrix is extremely sparse.
▪ What’s a better representation?
▪ We only record the 1 positions.

Inverted index

▪ For each term t, we must store a list of all documents that contain t.
▪ Identify each doc by a docID, a document serial number
▪ Can we use fixed-size arrays for this?

Brutus ➔ 1 2 4 11 31 45 173 174
Caesar ➔ 1 2 4 5 6 16 57 132
Calpurnia ➔ 2 31 54 101

What happens if the word Caesar is added to a new document?

Inverted index
▪ We need variable-size postings lists
▪ On disk, a continuous run of postings is normal and best
▪ In memory, can use linked lists or variable length arrays
▪ Some tradeoffs in size/ease of insertion

Brutus ➔ 1 2 4 11 31 45 173 174
Caesar ➔ 1 2 4 5 6 16 57 132
Calpurnia ➔ 2 31 54 101

Dictionary | Postings
Sorted by docID (more later on why).

Inverted index construction

Documents to be indexed: Friends, Romans, countrymen.
➔ Tokenizer
Token stream: Friends Romans Countrymen
➔ Linguistic modules
Modified tokens: friend roman countryman
➔ Indexer
Inverted index:
friend ➔ 2 4
roman ➔ 1 2
countryman ➔ 13 16
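The indexing pipeline can be sketched in Python. This is a toy version: the tokenizer is crude, lowercasing stands in for the full linguistic modules (no stemming, so "Friends" indexes as "friends", not "friend"), and the two documents and their IDs are illustrative:

```python
from collections import defaultdict

docs = {
    1: "Friends, Romans, countrymen.",
    2: "So let it be with Caesar.",
}

def tokenize(text):
    # crude tokenizer: strip punctuation, split on whitespace
    return text.replace(",", " ").replace(".", " ").split()

index = defaultdict(list)            # term -> sorted list of docIDs
for doc_id in sorted(docs):
    seen = set()
    for token in tokenize(docs[doc_id]):
        term = token.lower()         # stand-in normalization step
        if term not in seen:         # merge duplicate (term, docID) pairs
            index[term].append(doc_id)
            seen.add(term)

print(index["caesar"])   # [2]
print(index["friends"])  # [1]
```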

Initial stages of text processing


▪ Tokenization
▪ Cut character sequence into word tokens
▪ Deal with “John’s”, a state-of-the-art solution
▪ Normalization
▪ Map text and query term to same form
▪ You want U.S.A. and USA to match
▪ Stemming
▪ We may wish different forms of a root to match
▪ authorize, authorization
▪ Stop words
▪ We may omit very common words (or not)
▪ the, a, to, of

Indexer steps: Token sequence
▪ Sequence of (Modified token, Document ID) pairs.

Doc 1: I did enact Julius Caesar I was killed i’ the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.

Indexer steps: Sort

▪ Sort by terms
▪ At least conceptually
▪ And then docID
Core indexing step
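The sort step falls out of Python’s built-in tuple ordering, which compares by term first and docID second (the sample pairs are illustrative):

```python
# (term, docID) pairs collected during the scan, then sorted:
pairs = [("caesar", 2), ("brutus", 1), ("caesar", 1), ("brutus", 2)]
pairs.sort()   # tuples compare element-wise: by term, then by docID
print(pairs)   # [('brutus', 1), ('brutus', 2), ('caesar', 1), ('caesar', 2)]
```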

Indexer steps: Dictionary & Postings

▪ Multiple term entries in a single document are merged.
▪ Split into Dictionary and Postings
▪ Doc. frequency information is added.

Why frequency?
Where do we pay in storage?

▪ Terms and counts
▪ Pointers
▪ Lists of docIDs
(Details of IR system implementation later.)
The index we just built

▪ Our focus: how do we process a query?
▪ Later – what kinds of queries can we process?
Query processing: AND
▪ Consider processing the query:
Brutus AND Caesar
✦ Locate Brutus in the Dictionary;
▪ Retrieve its postings.
✦ Locate Caesar in the Dictionary;
▪ Retrieve its postings.
✦ “Merge” the two postings (intersect the document sets):
Brutus ➔ 2 4 8 16 32 64 128
Caesar ➔ 1 2 3 5 8 13 21 34
The merge

▪ Walk through the two postings simultaneously, in time linear in the total number of postings entries
Brutus ➔ 2 4 8 16 32 64 128
Caesar ➔ 1 2 3 5 8 13 21 34
If the list lengths are x and y, the merge takes O(x+y) operations.
Crucial: postings sorted by docID.

Intersecting two postings lists (a “merge” algorithm)
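The merge is a two-pointer walk over the sorted lists; a Python sketch using the postings from the slides:

```python
# Intersect two postings lists sorted by docID in O(x + y) time.
def intersect(p1, p2):
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:           # docID in both lists -> keep it
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:          # advance the pointer at the smaller docID
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))     # [2, 8]
```

Each step advances at least one pointer, which is why the sort by docID is crucial: without it, neither pointer could safely skip ahead.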

Boolean queries: Exact match

▪ The Boolean retrieval model is able to answer any query that is a Boolean expression:
▪ Boolean queries are queries using AND, OR and NOT to join query terms
▪ Views each document as a set of words
▪ Is precise: document matches condition or not.
▪ Perhaps the simplest model to build an IR system on
▪ Primary commercial retrieval tool for 3 decades.
▪ Many search systems you still use are Boolean:
▪ Email, library catalog, macOS Spotlight

Boolean queries:
More general merges
▪ Exercise: Adapt the merge for the queries:
Brutus AND NOT Caesar
Brutus OR NOT Caesar

✦ Can we still run through the merge in time O(x+y)? What can
we achieve?
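One answer to the exercise, sketched in Python: for Brutus AND NOT Caesar the same two-pointer walk works, emitting docs present in the first list but absent from the second, still in O(x+y). (Brutus OR NOT Caesar is harder: its result can include almost every docID in the collection, so a merge of just these two lists cannot produce it.)

```python
# "p1 AND NOT p2" over postings lists sorted by docID, in O(x + y) time.
def and_not(p1, p2):
    answer = []
    i = j = 0
    while i < len(p1):
        if j >= len(p2) or p1[i] < p2[j]:
            answer.append(p1[i])     # docID in p1 only -> keep it
            i += 1
        elif p1[i] == p2[j]:         # docID in both -> exclude it
            i += 1
            j += 1
        else:                        # p2's docID is smaller -> skip it
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(and_not(brutus, caesar))       # [4, 16, 32, 64, 128]
```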

Merging
What about an arbitrary Boolean formula?
(Brutus OR Caesar) AND NOT
(Antony OR Cleopatra)
✦ Can we always merge in “linear” time?
▪ Linear in what?
✦ Can we do better?

Query optimization

▪ What is the best order for query processing?
▪ Consider a query that is an AND of n terms.
▪ For each of the n terms, get its postings, then AND them together.

Brutus ➔ 2 4 8 16 32 64 128
Caesar ➔ 1 2 3 5 8 16 21 34
Calpurnia ➔ 13 16

Query: Brutus AND Calpurnia AND Caesar

Query optimization example

▪ Process in order of increasing freq:
▪ start with smallest set, then keep cutting further.
(This is why we kept document freq. in the dictionary.)

Brutus ➔ 2 4 8 16 32 64 128
Caesar ➔ 1 2 3 5 8 16 21 34
Calpurnia ➔ 13 16

Execute the query as (Calpurnia AND Brutus) AND Caesar
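The frequency-ordered strategy can be sketched in Python, using postings-list length as the document frequency (in a real system the frequency would come from the dictionary, so no postings need to be read to pick the order):

```python
# Intersect two postings lists sorted by docID (two-pointer merge).
def intersect(p1, p2):
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

postings = {
    "brutus":    [2, 4, 8, 16, 32, 64, 128],
    "caesar":    [1, 2, 3, 5, 8, 16, 21, 34],
    "calpurnia": [13, 16],
}

def and_query(terms, postings):
    # Process in order of increasing document frequency, so every
    # intermediate result is no larger than the smallest list seen so far.
    ordered = sorted(terms, key=lambda t: len(postings[t]))
    result = postings[ordered[0]]
    for term in ordered[1:]:
        result = intersect(result, postings[term])
    return result

print(and_query(["brutus", "calpurnia", "caesar"], postings))   # [16]
```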
