Lec 1 IR
Introducing Information Retrieval and Web Search

Information Retrieval
• Information Retrieval (IR) is finding material
(usually documents) of an unstructured nature
(usually text) that satisfies an information need
from within large collections (usually stored on
computers).
Unstructured (text) vs. structured (database) data in the mid-nineties

Unstructured (text) vs. structured (database) data today
The classic search model

User task: get rid of mice in a politically correct way
  (misconception? the user may misunderstand their own task)
Info need: info about removing mice without killing them
  (misformulation? the need may be turned into a poor query)
Query: "how trap mice alive"

The query is sent to the search engine, which returns results from the
collection; the user may then do query refinement and search again.
Term-document incidence matrices
Incidence vectors
• So we have a 0/1 vector for each term.
• To answer the query Brutus AND Caesar AND NOT Calpurnia: take the vectors
  for Brutus, Caesar, and Calpurnia (complemented) and bitwise AND them:
  – 110100 AND
  – 110111 AND
  – 101111 =
  – 100100
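In Python, this query amounts to three integer operations; a minimal sketch using the slide's vectors (the variable names are mine):

# Incidence vectors from the slide, one bit per play (leftmost bit = doc 1).
brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000

# Brutus AND Caesar AND NOT Calpurnia; complement within a 6-bit word.
mask = (1 << 6) - 1
result = brutus & caesar & (~calpurnia & mask)

print(format(result, "06b"))   # -> 100100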
Answers to query

• 100100 picks out documents 1 and 4: the plays Antony and Cleopatra and Hamlet.
Bigger collections

• Consider, say, 1 million documents of about 1,000 words each
  (1000 * 1 million term-document cells).
  – The matrix is extremely sparse: most entries are 0 (about 99.8%).
• What's a better representation?
  – We only record the 1 positions.
The Inverted Index

The key data structure underlying modern IR
Inverted index
• For each term t, we must store a list of all
documents that contain t.
– Identify each doc by a docID, a document serial
number
• Can we use fixed-size arrays for this?
Brutus → 1, 2, 4, 11, 31, 45, 173, 174
Caesar → 1, 2, 4, 5, 6, 16, 57, 132
Calpurnia → 2, 31, 54, 101

What happens if the word Caesar is added to document 14?
Inverted index
• We need variable-size postings lists
  – On disk, a contiguous run of postings is normal and best
  – In memory, can use linked lists or variable length arrays
    (each docID entry in a list is called a posting)
• Some tradeoffs in size/ease of insertion

Brutus → 1, 2, 4, 11, 31, 45, 173, 174
Caesar → 1, 2, 4, 5, 6, 16, 57, 132
Calpurnia → 2, 31, 54, 101

The terms on the left form the dictionary; the docID lists are the postings.
Sorted by docID (more later on why).
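A minimal in-memory sketch of this in Python, with variable-length lists as the postings (an illustration of the idea, not the lecture's code):

from collections import defaultdict

# Dictionary -> variable-length postings lists (Python lists grow as needed).
# Processing documents in increasing docID order keeps each list sorted.
index = defaultdict(list)

def add_posting(term: str, doc_id: int) -> None:
    postings = index[term]
    if not postings or postings[-1] != doc_id:   # skip duplicates within a doc
        postings.append(doc_id)

# A stream of (docID, term) pairs, as produced while scanning the collection.
# Adding Caesar to document 14 is just another append, nothing to outgrow.
for doc_id, term in [(1, "Brutus"), (2, "Brutus"), (2, "Calpurnia"),
                     (4, "Brutus"), (14, "Caesar"), (31, "Calpurnia")]:
    add_posting(term, doc_id)

print(index["Calpurnia"])   # -> [2, 31]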
Linguistic modules

Modified tokens: friend, roman, countryman

The indexer then builds the inverted index:
friend → 2, 4
roman → 1, 2
countryman → 13, 16
Initial stages of text processing

• Tokenization
  – Cut character sequence into word tokens
  • Deal with cases like "John's", "a state-of-the-art solution"
• Normalization
  – Map text and query terms to the same form
  • You want U.S.A. and USA to match
• Stemming
  – We may wish different forms of a root to match
  • authorize, authorization
• Stop words
  – We may omit very common words (or not)
  • the, a, to, of
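A toy Python version of these stages (the token regex, stop list, and suffix stripping below are crude stand-ins for real linguistic modules, chosen purely for illustration):

import re

STOP_WORDS = {"the", "a", "to", "of"}          # tiny illustrative stop list

def tokenize(text: str) -> list[str]:
    # Cut the character sequence into word tokens.
    return re.findall(r"[A-Za-z']+", text)

def normalize(token: str) -> str:
    # Map variant forms to one canonical form.
    return token.lower().strip("'")

def stem(token: str) -> str:
    # Extremely crude stand-in for a real stemmer (e.g. Porter).
    for suffix in ("ization", "ation", "ize", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text: str) -> list[str]:
    tokens = (normalize(t) for t in tokenize(text))
    return [stem(t) for t in tokens if t not in STOP_WORDS]

# -> ['friend', 'roman', 'and', 'countrymen']
# (a real stemmer would also map countrymen -> countryman)
print(preprocess("Friends, Romans and countrymen"))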
Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.

Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.
Result of indexing: the dictionary stores the terms and their counts;
pointers connect each term to its postings, a list of docIDs.

IR system implementation
• How do we index efficiently?
• How much storage do we need?
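As a rough illustration (not the lecture's implementation), a sort-based indexer in Python for the two documents above; it produces the dictionary terms with document frequencies and the postings lists of docIDs:

from collections import defaultdict

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.",
}

# Step 1: emit (term, docID) pairs; Step 2: sort them;
# Step 3: collapse duplicates into postings lists, recording document frequency.
pairs = sorted(
    (term.strip(".,;:'").lower(), doc_id)
    for doc_id, text in docs.items()
    for term in text.split()
)

postings = defaultdict(list)
for term, doc_id in pairs:
    if term and (not postings[term] or postings[term][-1] != doc_id):
        postings[term].append(doc_id)

for term in sorted(postings):
    print(f"{term}  df={len(postings[term])}  ->  {postings[term]}")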
Query processing with an inverted index
The merge

• Walk through the two postings lists simultaneously, in time linear in
  the total number of postings entries

Brutus → 2, 4, 8, 16, 32, 64, 128
Caesar → 1, 2, 3, 5, 8, 13, 21, 34

If the list lengths are x and y, the merge takes O(x+y) operations.
Crucial: postings must be sorted by docID.
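A direct Python rendering of this merge, the textbook's INTERSECT algorithm (variable names are mine):

def intersect(p1: list[int], p2: list[int]) -> list[int]:
    """Merge two postings lists sorted by docID in O(len(p1) + len(p2)) time."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:          # docID in both lists: emit it
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:         # advance the pointer with the smaller docID
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))   # -> [2, 8]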
The Boolean Retrieval Model & Extended Boolean Models
Phrase queries
• We want to be able to answer queries such as
“stanford university” – as a phrase
• Thus the sentence “I went to university at
Stanford” is not a match.
– The concept of phrase queries has proven easily
understood by users; one of the few “advanced
search” ideas that works
• For this, it no longer suffices to store only
<term : docs> entries
Positional index example:

<be: 993427;
  1: 7, 18, 33, 72, 86, 231;
  2: 3, 149;
  4: 17, 191, 291, 430, 434;
  5: 363, 367, …>

Which of docs 1, 2, 4, 5 could contain "to be or not to be"?
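A rough sketch of a two-word phrase check over such entries; the "be" positions below come from the slide, while the "to" positions are invented purely for illustration:

# term -> {docID: positions}; a tiny positional index.
pos_index = {
    "be": {1: [7, 18, 33, 72, 86, 231], 2: [3, 149], 4: [17, 191, 291, 430, 434]},
    "to": {1: [4, 30], 2: [2, 148], 4: [100, 400]},   # invented data
}

def phrase_docs(w1: str, w2: str) -> list[int]:
    """Docs in which w2 occurs at the position immediately after w1."""
    hits = []
    for doc in pos_index[w1].keys() & pos_index[w2].keys():
        after = set(pos_index[w2][doc])
        if any(p + 1 in after for p in pos_index[w1][doc]):
            hits.append(doc)
    return sorted(hits)

# -> [2] with this toy data: "to" at position 2 immediately precedes "be" at 3
print(phrase_docs("to", "be"))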
Rules of thumb

• A positional index is 2–4 times as large as a non-positional index
Combination schemes

• The two approaches, biword indexes and positional indexes, can be
  profitably combined
  – For particular phrases ("Michael Jackson", "Britney Spears") it is
    inefficient to keep merging positional postings lists
  • Even more so for phrases like "The Who"
• Williams et al. (2004) evaluate a more sophisticated mixed indexing scheme
  – A typical web query mixture was executed in ¼ of the time needed with
    just a positional index
  – It required 26% more space than a positional index alone
Structured vs. Unstructured Data
IR vs. databases: structured vs. unstructured data

• Structured data tends to refer to information in "tables"

  Employee   Manager   Salary
  Smith      Jones     50000
  Chang      Smith     60000
  Ivy        Smith     50000
Semi-structured data

• In fact almost no data is truly "unstructured"
• E.g., this slide has distinctly identified zones such as the Title and Bullets
• … to say nothing of linguistic structure
• Facilitates "semi-structured" search such as
  – Title contains data AND Bullets contain search
• Or even
  – Title is about Object Oriented Programming AND Author something like stro*rup
  – where * is the wild-card operator
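A rough sketch of how such a wild-card could be evaluated, translating * into a regular expression in Python (the author strings are invented):

import re

def wildcard_match(pattern: str, value: str) -> bool:
    # Turn a wild-card pattern like "stro*rup" into an anchored regex.
    regex = "^" + ".*".join(re.escape(part) for part in pattern.split("*")) + "$"
    return re.match(regex, value) is not None

# Invented author values, for illustration only.
for author in ("Stroustrup", "Stallman"):
    print(author, wildcard_match("stro*rup", author.lower()))
    # -> Stroustrup True, Stallman False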