Lecture 1: Introduction
Information Retrieval and Web Search
The literature searching process “is not an exact science but an art.”
Samuel Butler
Agenda
• Course schedule
• Introduction
• Some tricks to be more effective with Google search
• Structured vs. Unstructured Data
• Boolean retrieval
• Search Evaluation (Precision, Recall, F-Measure)
• Introduction to ELK (Elasticsearch, Logstash, Kibana)
• Reading: Chapter 1 - Boolean retrieval
Tricks to be more effective with Google Search
Structured vs. Unstructured Data
Unstructured (text) vs. structured (database) data
(Figure: comparison of the two in the mid-nineties vs. today.)
Basic assumptions of Information Retrieval
The classic search model
• User task: get rid of mice in a politically correct way
  (misconception?)
• Info need: info about removing mice without killing them
  (misformulation?)
• Query: "how trap mice alive"
• The search engine runs the query over the collection and returns results.
• Query refinement: based on the results, the user reformulates the query and searches again.
How good are the retrieved docs?
▪ Precision: fraction of retrieved docs that are relevant to the user's information need
▪ Recall: fraction of relevant docs in the collection that are retrieved

(Diagram: the entire document collection split into four regions: retrieved & relevant, retrieved & irrelevant, not retrieved & relevant, not retrieved & irrelevant.)
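These two fractions as a minimal Python sketch; the set-based representation and the function names are illustrative, not from the lecture:

```python
def precision(retrieved: set, relevant: set) -> float:
    """Fraction of retrieved docs that are relevant."""
    if not retrieved:
        return 0.0  # convention here; strictly 0/0 is undefined
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved: set, relevant: set) -> float:
    """Fraction of relevant docs that are retrieved."""
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)

# Toy example: 3 of the 4 retrieved docs are relevant (P = 0.75),
# and 3 of the 5 relevant docs were found (R = 0.6).
retrieved = {1, 2, 3, 4}
relevant = {2, 3, 4, 7, 9}
print(precision(retrieved, relevant), recall(retrieved, relevant))
```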
            Antony &   Julius   The       Hamlet   Othello   Macbeth
            Cleopatra  Caesar   Tempest
Antony          1         1        0         0        0         1
Brutus          1         1        0         1        0         0
Caesar          1         1        0         1        1         1
Calpurnia       0         1        0         0        0         0
Cleopatra       1         0        0         0        0         0
mercy           1         0        1         1        1         1
worser          1         0        1         1        1         0

1 if the play contains the word, 0 otherwise.

Query: Brutus AND Caesar BUT NOT Calpurnia
Sec. 1.1
Incidence vectors
• So we have a 0/1 vector for each term (one bit per play).
• To answer the query Brutus AND Caesar BUT NOT Calpurnia, take the vectors for
  Brutus (110100) and Caesar (110111), complement the vector for Calpurnia
  (010000 → 101111), and bitwise AND them:
  110100 AND 110111 AND 101111 = 100100
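The same query as bit twiddling, sketched in Python; the 6-bit encoding mirrors the matrix above, and the names are illustrative:

```python
# Term incidence vectors over the six plays, as 6-bit integers
# (leftmost bit = Antony & Cleopatra, rightmost = Macbeth).
vectors = {
    "Brutus":    0b110100,
    "Caesar":    0b110111,
    "Calpurnia": 0b010000,
}

MASK = 0b111111  # six plays

# Brutus AND Caesar BUT NOT Calpurnia
answer = vectors["Brutus"] & vectors["Caesar"] & (~vectors["Calpurnia"] & MASK)
print(format(answer, "06b"))  # -> 100100
```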
Answers to query
• 100100 → the first and fourth plays: Antony and Cleopatra, and Hamlet.
Bigger collections
Can’t build the matrix
Example
• D1: Mehdi habite à Rabat ("Mehdi lives in Rabat")
• D2: Kenza habite à Fes ("Kenza lives in Fes")
• D3: Mehdi et Kenza se sont rencontrés à Fes ("Mehdi and Kenza met in Fes")

Term-document incidence matrix:

Term          D1   D2   D3
Mehdi          1    0    1
habite         1    1    0
à              1    1    1
Rabat          1    0    0
Kenza          0    1    1
Fes            0    1    1
et             0    0    1
se sont        0    0    1
rencontrés     0    0    1

Queries:
• Mehdi AND Kenza: 101 AND 011 = 001 → D3
• Mehdi OR Kenza:  101 OR 011 = 111 → D1, D2, D3
Why Boolean retrieval is not suitable
Inverted index
• For each term, the dictionary points to a postings list of docIDs.
• Postings are sorted by docID (more later on why).

Inverted index construction
Documents to be indexed ("Friends, Romans, countrymen.")
  → Tokenizer → Linguistic modules → Indexer → Inverted index:
     friend     → 2 → 4
     roman      → 1 → 2
     countryman → 13 → 16
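A toy indexer in Python, assuming documents arrive as already-normalized text (the pipeline stages below); the function name and data layout are illustrative:

```python
from collections import defaultdict

def build_inverted_index(docs: dict) -> dict:
    """Map each term to a postings list of docIDs, sorted by docID."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():  # toy tokenizer
            index[token].add(doc_id)
    # Sorted postings make the AND merge (below) a linear walk.
    return {term: sorted(ids) for term, ids in index.items()}

docs = {2: "friends romans countrymen", 4: "friends lend me your ears"}
print(build_inverted_index(docs)["friends"])  # -> [2, 4]
```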
Initial stages of text processing
• Tokenization
  – Cut character sequence into word tokens
    • Deal with "John's", "a state-of-the-art solution"
• Normalization
  – Map text and query terms to the same form
    • You want U.S.A. and USA to match
• Stemming
  – We may wish different forms of a root to match
    • authorize, authorization
• Stop words
  – We may omit very common words (or not)
    • the, a, to, of
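A toy version of these four stages in Python; the regex and the suffix-stripping rule are illustrative stand-ins, not a real tokenizer or stemmer:

```python
import re

STOP_WORDS = {"the", "a", "to", "of"}

def process(text: str) -> list:
    # Normalization: lowercase and drop periods, so "U.S.A." matches "usa"
    text = text.lower().replace(".", "")
    # Tokenization: keep alphanumeric runs ("john's" -> "john", "s")
    tokens = re.findall(r"[a-z0-9]+", text)
    # Stop-word removal (or not: phrase queries may need these words)
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Toy "stemming": conflate authorize/authorization -> "author"
    # (a real system would use e.g. the Porter stemmer)
    return [re.sub(r"(ization|ize)$", "", t) for t in tokens]

print(process("They authorize the authorization for U.S.A."))
# -> ['they', 'author', 'author', 'for', 'usa']
```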
Indexer steps: token sequence
(Figure: the sequence of (term, docID) pairs generated from Doc 1 and Doc 2.)
• Sort by terms, and then by docID.
• Dictionary entries also record term counts. Why frequency? Will discuss later.

Where do we pay in storage?
• Terms and counts (the dictionary)
• Lists of docIDs (the postings)
• Pointers from dictionary entries to their postings lists

IR system implementation
• How do we index efficiently?
• How much storage do we need?
Query processing with an inverted index
(Using the index we just built.)
Query processing: AND
Sec. 1.3
The merge
• Walk through the two postings lists simultaneously:
  Brutus: 2 → 4 → 8 → 16 → 32 → 64 → 128
  Caesar: 1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
• Intersection: 2, 8.
• If the list lengths are x and y, the merge takes O(x + y) operations; this is why postings are kept sorted by docID.
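The two-pointer merge, sketched in Python over the Brutus and Caesar lists above (the function name is illustrative):

```python
def intersect(p1: list, p2: list) -> list:
    """Intersect two postings lists sorted by docID in O(x + y) time."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer at the smaller docID
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))  # -> [2, 8]
```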
The Boolean Retrieval Model & Extended Boolean Models
Boolean queries: Exact match
Example: WestLaw http://www.westlaw.com/
Exercise
Phrase queries and positional indexes

Phrase queries
• In a positional index, each posting records the positions at which the term occurs. Entry for "be":
  <be: 993427;
    1: 7, 18, 33, 72, 86, 231;
    2: 3, 149;
    4: 17, 191, 291, 430, 434;
    5: 363, 367, …>
• Which of docs 1, 2, 4, 5 could contain "to be or not to be"?
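A sketch of the check for the bigram "to be" in Python; the positional lists for "be" are from the slide, while those for "to" are invented here purely for illustration:

```python
def phrase_docs(pos_a: dict, pos_b: dict) -> list:
    """Docs where some occurrence of term B is right after term A."""
    hits = []
    for doc in pos_a.keys() & pos_b.keys():
        b_positions = set(pos_b[doc])
        if any(p + 1 in b_positions for p in pos_a[doc]):
            hits.append(doc)
    return sorted(hits)

be = {1: [7, 18, 33, 72, 86, 231], 2: [3, 149],
      4: [17, 191, 291, 430, 434], 5: [363, 367]}
to = {1: [4, 17, 32], 2: [10], 4: [190, 429], 5: [362, 366]}  # invented

print(phrase_docs(to, be))  # -> [1, 4, 5]
```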
Semi-structured data
Performance Measures

                 P (Predicted)    N (Predicted)
P (Actual)       True Positive    False Negative
N (Actual)       False Positive   True Negative

• Example: out of 10,000 documents, only 10 are relevant. A system that retrieves nothing is still correct on the 9,990 irrelevant ones:
  Accuracy = 9,990 / 10,000 = 99.9%
• This accuracy makes no sense and is misleading.
Performance Measures

Precision = TP / (TP + FP)        Recall = TP / (TP + FN)

                 P (Predicted)    N (Predicted)
P (Actual)       TP = 0           FN = 10
N (Actual)       FP = 0           TN = 9,990

• Precision = 0 / (0 + 0) (undefined)
• Recall = 0 / (0 + 10) = 0
• However, Accuracy = 9,990 / 10,000 = 99.9% !!!
Performance Measures: Another Example

Precision = TP / (TP + FP)        Recall = TP / (TP + FN)

                 P (Predicted)    N (Predicted)
P (Actual)       TP = 10          FN = 0
N (Actual)       FP = 9,990       TN = 0

• Precision = 10 / (10 + 9,990) = 0.001
• Recall = 10 / (10 + 0) = 1.0
• However, Accuracy = (TP + TN) / (TP + TN + FP + FN) = 10 / 10,000 = 0.1% !!!
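Both worked examples, recomputed with a small Python helper (illustrative, not course code):

```python
def metrics(tp, fn, fp, tn):
    """Precision, recall, accuracy; None where the ratio is undefined."""
    return {
        "precision": tp / (tp + fp) if tp + fp else None,
        "recall":    tp / (tp + fn) if tp + fn else None,
        "accuracy":  (tp + tn) / (tp + tn + fp + fn),
    }

# Retrieves nothing: accuracy 99.9%, but recall 0 and precision undefined
print(metrics(tp=0, fn=10, fp=0, tn=9_990))
# Retrieves everything: recall 1.0, but precision 0.001 and accuracy 0.1%
print(metrics(tp=10, fn=0, fp=9_990, tn=0))
```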
Comparing Performance of two Systems
• System 1: Precision = 70%, Recall = 60%
• System 2: Precision = 80%, Recall = 50%
• Which one is better? Combine precision and recall into a single score, the weighted harmonic mean Fβ = 1 / (β/P + (1 - β)/R), with weight β on precision. With β = 0.95:
  – System 1: Fβ = 0.6942
  – System 2: Fβ = 0.7766
• System 1 < System 2: with β = 0.95 the comparison heavily favors precision, so System 2 wins.
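The computation itself in Python; this weighted-harmonic-mean form reproduces the numbers on the slide:

```python
def f_beta(precision: float, recall: float, beta: float = 0.95) -> float:
    """Weighted harmonic mean with weight `beta` on precision."""
    return 1 / (beta / precision + (1 - beta) / recall)

print(f"{f_beta(0.70, 0.60):.4f}")  # System 1 -> 0.6942
print(f"{f_beta(0.80, 0.50):.4f}")  # System 2 -> 0.7767 (0.7766 on the slide, truncated)
```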