Introduction To: Information Retrieval
Introducing Information Retrieval
and Web Search
Information Retrieval
Information Retrieval (IR) is finding material
(usually documents) of an unstructured
nature (usually text) that satisfies an
information need from within large
collections (usually stored on computers).
These days we frequently think first of web
search, but there are many other cases:
E-mail search
Searching your laptop
Corporate knowledge bases
Legal information retrieval
Sec. 1.1
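[Figure: the search process. User task -> Info need -> Query -> Search engine (over the Collection) -> Results, with Query refinement looping back to the Query. "Misconception?" marks the step from user task to info need; "Misformulation?" marks the step from info need to query.]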
Sec. 1.1
Term-document incidence
matrices
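[Table: term-document incidence matrix; an entry is 1 if the play contains the word, 0 otherwise]
Columns (plays): Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth
Rows (terms): Antony, Brutus, Caesar, Calpurnia, Cleopatra, mercy, worser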
Sec. 1.1
Incidence vectors
So we have a 0/1 vector for each term.
To answer the query Brutus AND Caesar AND NOT Calpurnia:
take the vectors for Brutus, Caesar and Calpurnia
(complemented), then bitwise AND them.
110100 AND
110111 AND
101111 =
100100
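The bitwise AND above can be reproduced with plain integers. This is a minimal sketch, assuming the leftmost bit corresponds to the first play in the column list; the names `PLAYS`, `MASK` and `plays_for` are illustrative only.

```python
# Column order follows the play list on the slide (leftmost bit = first play).
PLAYS = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]
WIDTH = len(PLAYS)
MASK = (1 << WIDTH) - 1  # six 1-bits, keeps the complement within six columns

brutus = 0b110100
caesar = 0b110111
calpurnia = 0b010000  # its complement is the 101111 used in the slide


def plays_for(vector):
    """Return the plays whose bit is set, reading bits left to right."""
    return [play for i, play in enumerate(PLAYS)
            if vector & (1 << (WIDTH - 1 - i))]


result = brutus & caesar & (~calpurnia & MASK)
print(format(result, "06b"))  # 100100
print(plays_for(result))      # ['Antony and Cleopatra', 'Hamlet']
```

The mask matters because Python integers are unbounded, so a bare `~calpurnia` would be negative rather than a six-bit complement.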
Sec. 1.1
Answers to query
Antony and Cleopatra, Act III, Scene ii
Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring; and he wept
When at Philippi he found Brutus slain.
Sec. 1.1
Bigger collections
Consider N = 1 million documents,
each with about 1000 words.
Avg 6 bytes/word including
spaces/punctuation
6GB of data in the documents.
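The 6GB figure follows directly from the three numbers above. A quick check, assuming decimal gigabytes (1 GB = 10^9 bytes):

```python
num_docs = 1_000_000    # N = 1 million documents
words_per_doc = 1_000   # about 1000 words each
bytes_per_word = 6      # average, including spaces/punctuation

total_bytes = num_docs * words_per_doc * bytes_per_word
total_gb = total_bytes / 10**9  # assumption: decimal gigabytes
print(total_gb)  # 6.0
```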
Sec. 1.1
The Inverted Index
The key data structure
underlying modern IR
Sec. 1.2
Inverted index
For each term t, we must store a list
of all documents that contain t.
Identify each doc by a docID, a
document serial number
Sec. 1.2
Inverted index
We need variable-size postings lists
On disk, a continuous run of postings is
normal and best
In memory, can use linked lists or
variable length arrays
Some tradeoffs in size/ease of insertion
[Figure: Dictionary (Brutus, Caesar, Calpurnia), each term pointing to its Postings list; one entry in a list is a Posting. Surviving docIDs: Brutus -> ... 11 31 45 173 174; Caesar -> ... 16 57 132; Calpurnia -> ... 54 101]
Sorted by docID (more later on why).
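One way to picture the in-memory option is a dictionary that maps each term to a variable-length array kept sorted by docID. The sketch below is only an illustration: `add_posting` is a hypothetical helper, and the docIDs are example values.

```python
from bisect import bisect_left


def add_posting(postings, doc_id):
    """Insert doc_id into a sorted postings list, skipping duplicates."""
    i = bisect_left(postings, doc_id)
    if i == len(postings) or postings[i] != doc_id:
        postings.insert(i, doc_id)  # O(n) shift: the cost of a contiguous array


index = {}  # dictionary: term -> postings list sorted by docID
for term, doc_id in [("calpurnia", 54), ("calpurnia", 2), ("calpurnia", 101),
                     ("calpurnia", 31), ("calpurnia", 2)]:
    add_posting(index.setdefault(term, []), doc_id)

print(index)  # {'calpurnia': [2, 31, 54, 101]}
```

An array makes lookups and sequential scans cheap but pays for insertions in the middle; a linked list reverses that tradeoff at the cost of extra pointers.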
Sec. 1.2
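[Figure: inverted index construction pipeline]
Tokenizer -> Token stream: Friends Romans Countrymen
Linguistic modules -> Modified tokens: friend roman countryman
Indexer -> Inverted index: postings lists for friend, roman, countryman (e.g. countryman -> 13 16)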
Normalization
Map text and query term to same form
You want U.S.A. and USA to match
Stemming
We may wish different forms of a root to match
authorize, authorization
Stop words
We may omit very common words (or not)
the, a, to, of
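The normalization and stop-word steps can be sketched in a few lines. This is a simplified illustration, not a production tokenizer: the regular expression, the rule of deleting periods, and the function names are all assumptions, and stemming is left out because it needs a real stemmer.

```python
import re

STOP_WORDS = {"the", "a", "to", "of"}  # the examples listed above


def normalize(token):
    """Lowercase and drop periods so that U.S.A. and USA share one form."""
    return token.lower().replace(".", "")


def index_terms(text, drop_stop_words=True):
    """Tokenize on letters/periods, normalize, optionally drop stop words."""
    tokens = re.findall(r"[A-Za-z][A-Za-z.]*", text)
    terms = [normalize(t) for t in tokens]
    if drop_stop_words:
        terms = [t for t in terms if t not in STOP_WORDS]
    return terms


print(index_terms("Friends, Romans, countrymen"))
print(index_terms("The U.S.A. and the USA"))
```

The same `index_terms` function would need to be applied to queries as well as documents, so that both sides map to the same form.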
Doc 1
I did enact Julius
Caesar I was killed
i' the Capitol;
Brutus killed me.
Doc 2
So let it be with
Caesar. The noble
Brutus hath told you
Caesar was ambitious
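A minimal sketch of how an indexer might turn these two documents into an inverted index: emit (term, docID) pairs, sort them, then merge duplicates into postings lists and record each term's document frequency. Lowercasing and the letters-only tokenizer are assumptions made for this illustration.

```python
import re
from itertools import groupby

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

# Step 1: sequence of (term, docID) pairs; lowercasing is an assumed normalization.
pairs = [(tok.lower(), doc_id)
         for doc_id, text in docs.items()
         for tok in re.findall(r"[A-Za-z]+", text)]

# Step 2: sort by term, then by docID.
pairs.sort()

# Step 3: merge duplicate pairs into one sorted postings list per term.
index = {}
for term, group in groupby(pairs, key=lambda pair: pair[0]):
    index[term] = sorted({doc_id for _, doc_id in group})

# Document frequency = length of the postings list.
doc_freq = {term: len(postings) for term, postings in index.items()}

print(index["brutus"], index["killed"], index["ambitious"])  # [1, 2] [1] [2]
print(doc_freq["caesar"])  # 2
```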
Sec. 1.2
Why frequency?
Will discuss later.
Sec. 1.2
Where do we pay in storage?
Terms and counts
Pointers
Lists of docIDs
IR system implementation
How do we index efficiently?
How much storage do we need?
Query processing with an
inverted index
Sec. 1.3
Our focus
Sec. 1.3
The merge
Walk through the two postings
simultaneously, in time linear in the
total number of postings entries
[Figure: the postings lists for Brutus and Caesar, walked in parallel to produce their intersection]
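The merge described above can be sketched as a two-pointer walk. It relies on both lists being sorted by docID, and each step advances at least one pointer, which gives the linear bound. The example docIDs are illustrative values, not a transcription of the figure.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by docID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])  # docID present in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                # advance the list with the smaller docID
        else:
            j += 1
    return answer


brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))  # [2, 8]
```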
The Boolean Retrieval Model
& Extended Boolean Models
Sec. 1.3
Sec. 1.4
Example: WestLaw
http://www.westlaw.com/
Sec. 1.3
Boolean queries:
More general merges
Exercise: Adapt the merge for the
queries:
Brutus AND NOT Caesar
Brutus OR NOT Caesar
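One possible adaptation for Brutus AND NOT Caesar is sketched below; it keeps the two-pointer walk but emits docIDs found only in the first list. The function name and the example lists are assumptions for illustration. Brutus OR NOT Caesar is left aside here, since its answer depends on the full set of docIDs in the collection and not only on the two lists.

```python
def and_not(p1, p2):
    """Return docIDs in p1 that are absent from p2 (both sorted by docID)."""
    answer = []
    i = j = 0
    while i < len(p1):
        if j == len(p2) or p1[i] < p2[j]:
            answer.append(p1[i])  # p1[i] cannot appear later in p2
            i += 1
        elif p1[i] == p2[j]:
            i += 1                # excluded: present in both lists
            j += 1
        else:
            j += 1                # advance p2 until it catches up
    return answer


brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 16, 21, 34]
print(and_not(brutus, caesar))  # [4, 32, 64, 128]
```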
Sec. 1.3
Merging
What about an arbitrary Boolean
formula?
(Brutus OR Caesar) AND NOT
(Antony OR Cleopatra)
Can we always merge in linear
time?
Linear in what?
Can we do better?
Sec. 1.3
Query optimization
What is the best order for query
processing?
Consider a query that is an AND of
n terms.
For each of the n terms, get its
postings, then AND them together.
Brutus     2 4 8 16 32 64 128
Caesar     1 2 3 5 8 16 21 34
Calpurnia  13 16
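A common heuristic, assumed here since the slide text does not spell it out, is to process the terms in order of increasing postings-list length so that intermediate results stay small. The sketch below applies it to the three lists above; `intersect_many` is a hypothetical helper name.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by docID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


def intersect_many(index, terms):
    """AND together the postings of all terms, shortest list first."""
    ordered = sorted(terms, key=lambda term: len(index[term]))
    result = index[ordered[0]]
    for term in ordered[1:]:
        if not result:
            break  # nothing left to intersect
        result = intersect(result, index[term])
    return ordered, result


index = {
    "Brutus": [2, 4, 8, 16, 32, 64, 128],
    "Caesar": [1, 2, 3, 5, 8, 16, 21, 34],
    "Calpurnia": [13, 16],
}
order, answer = intersect_many(index, ["Brutus", "Caesar", "Calpurnia"])
print(order)   # ['Calpurnia', 'Brutus', 'Caesar']
print(answer)  # [16]
```

Starting with Calpurnia means the first intermediate result has at most two entries, so the remaining merges do very little work.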
Sec. 1.3
Exercise
Recommend a query
processing order for
Term          Freq
eyes          213312
kaleidoscope  87009
marmalade     107913
skies         271658
tangerine     46653
trees         316812
Exercise
Try the search feature at
http://www.rhymezone.com/shakespeare/
Write down five search features you
think it could do better
Phrase queries and positional
indexes
Sec. 2.4
Phrase queries
We want to be able to answer queries such
as "stanford university" as a phrase
Thus the sentence "I went to university at
Stanford" is not a match.
The concept of phrase queries has proven
easily understood by users; one of the few
"advanced search" ideas that works
Many more queries are implicit phrase queries
Sec. 2.4.1
Solution 2: Positional
indexes
Sec. 2.4.2
be:
1:17,19; 4:17,191,291,430,434; 5:14,19,101; ...
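A positional posting of this form (docID, then the positions of the term in that document) is enough to check a two-word phrase. In the sketch below the postings for "be" are the ones shown above, while the postings for "to" are assumed for illustration. For brevity it uses a set lookup instead of a linear merge of the position lists.

```python
def phrase_match(first, second):
    """Find docs where `second` occurs exactly one position after `first`.

    Both arguments map docID -> sorted list of positions.
    Returns docID -> positions of the first word for each phrase occurrence.
    """
    matches = {}
    for doc_id in sorted(first.keys() & second.keys()):
        following = set(second[doc_id])
        starts = [pos for pos in first[doc_id] if pos + 1 in following]
        if starts:
            matches[doc_id] = starts
    return matches


# Postings for "be" come from the slide; those for "to" are assumed.
to = {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433], 7: [13, 23, 191]}
be = {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]}
print(phrase_match(to, be))  # {4: [16, 190, 429, 433]}
```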
Sec. 2.4.2
Proximity queries
LIMIT! /3 STATUTE /3 FEDERAL /2 TORT
Again, here, /k means "within k words of".
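A /k operator can be evaluated on the position lists of two terms within one document. The sketch below assumes "within k words of" means the positions differ by at most k in either direction; the function name and the positions are hypothetical.

```python
def within_k(positions1, positions2, k):
    """Return (p1, p2) pairs from two sorted position lists with |p1 - p2| <= k."""
    pairs = []
    start = 0
    for p1 in positions1:
        # Positions too far to the left of p1 can never match later p1 values.
        while start < len(positions2) and positions2[start] < p1 - k:
            start += 1
        j = start
        while j < len(positions2) and positions2[j] <= p1 + k:
            pairs.append((p1, positions2[j]))
            j += 1
    return pairs


# Hypothetical positions of two query terms inside a single document.
statute = [4, 20, 57]
federal = [6, 30, 55, 60]
print(within_k(statute, federal, 3))  # [(4, 6), (57, 55), (57, 60)]
```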
Sec. 2.4.2
Why?
[Table: Postings vs. Positional postings by document size; only the values 100,000 and 100 survive]
Sec. 2.4.2
Rules of thumb
A positional index is 2–4 times as large as a
non-positional index
Positional index size is 35–50% of the
volume of original text
Caveat: all of this holds for English-like
languages
Sec. 2.4.3
Combination schemes
These two approaches can be profitably
combined
For particular phrases ("Michael Jackson",
"Britney Spears") it is inefficient to keep
on merging positional postings lists
Even more so for phrases like "The Who"
Structured vs. Unstructured Data
IR vs. databases:
Structured vs unstructured data
Structured data tends to refer to
information in tables
Employee  Manager  Salary
Smith     Jones    50000
Chang     Smith    60000
Ivy       Smith    50000
Unstructured data
Typically refers to free text
Allows
Keyword queries including operators
More sophisticated "concept" queries, e.g.,
find all web pages dealing with drug abuse
Semi-structured data
In fact almost no data is unstructured
E.g., this slide has distinctly identified zones
such as the Title and Bullets
to say nothing of linguistic structure
Or even
Title is about Object Oriented Programming AND
Author something like stro*rup
where * is the wild-card operator