Lec 1
Information Retrieval
Introducing Information Retrieval
and Web Search
Introduction to
Information Retrieval
Information Retrieval
IR vs. databases:
Structured vs unstructured data
• Structured data tends to refer to information
in “tables”
Employee  Manager  Salary
Smith     Jones    50000
Chang     Smith    60000
Ivy       Smith    50000
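A minimal sketch of why tabular data is easy to query: with named columns, exact-match and numeric-range conditions are direct to express. The rows are taken from the table above; the query itself (salary below 60000 and manager Smith) and the helper name `select` are assumptions chosen for illustration.

```python
# Rows copied from the Employee/Manager/Salary table.
employees = [
    {"employee": "Smith", "manager": "Jones", "salary": 50000},
    {"employee": "Chang", "manager": "Smith", "salary": 60000},
    {"employee": "Ivy", "manager": "Smith", "salary": 50000},
]


def select(rows, predicate):
    """Return the rows for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]


# Hypothetical query: a numeric range plus an exact match on a text column.
matches = select(
    employees,
    lambda r: r["salary"] < 60000 and r["manager"] == "Smith",
)
names = [r["employee"] for r in matches]
print(names)
```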
Semi-structured data
• In fact almost no data is “unstructured”
• E.g., this slide has distinctly identified zones
such as the Title and Bullets
• … to say nothing of linguistic structure
• Facilitates “semi-structured” search such as
– Title contains data AND Bullets contain search
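The zone query above can be sketched as follows, assuming each slide has already been split into a title zone and a bullets zone. The sample slides and the helper names `zone_contains` and `search` are hypothetical.

```python
# Hypothetical slides, each split into the zones named on the slide.
slides = [
    {"title": "Semi-structured data",
     "bullets": ["Facilitates semi-structured search"]},
    {"title": "Information Retrieval",
     "bullets": ["Finding material that satisfies an information need"]},
    {"title": "Structured data",
     "bullets": ["Information in tables"]},
]


def zone_contains(text, term):
    """Case-insensitive whole-word match inside one zone."""
    return term.lower() in text.lower().split()


def search(slides, title_term, bullet_term):
    """Title contains title_term AND some bullet contains bullet_term."""
    return [
        s["title"]
        for s in slides
        if zone_contains(s["title"], title_term)
        and any(zone_contains(b, bullet_term) for b in s["bullets"])
    ]


hits = search(slides, "data", "search")
print(hits)
```

Only the first slide satisfies both conditions: the third has "data" in its title but no bullet containing "search".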
Information Retrieval
• Information Retrieval (IR) is finding material
(usually documents) of an unstructured nature
(usually text) that satisfies an information need
from within large collections (usually stored on
computers).
Unstructured (text) vs. structured (database)
data in the mid-nineties
Unstructured (text) vs. structured (database)
data today
Sec. 1.1
The classic search model
User task: get rid of mice in a politically correct way
  ↓ Misconception?
Info need: info about removing mice without killing them
  ↓ Misformulation?
Query: how trap mice alive
  ↓
Search engine (runs the query against the Collection)
  ↓
Results
  ↺ Query refinement (results feed back into the query)
Right or wrong/retrieved
or not
Search
Word: “ford”
Term-document incidence matrices
• One could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia?
• Why is that not the answer?
– Slow (for large corpora)
– “Romans near countrymen” is not trivial (it needs the positions of terms)
– The linear scan must be repeated for each query (takes too long)
– No ranked retrieval (returning the best documents first, e.g., by the number of times each word is repeated in a document)
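The cost of the grep approach can be seen in a small sketch. The plays below are invented stand-ins, the function names are hypothetical, and matching is done per play rather than per line to keep the example short.

```python
import re

# Toy stand-ins for the plays; the text is invented for illustration.
plays = {
    "Play A": "Brutus spoke to Caesar\nCalpurnia warned Caesar",
    "Play B": "Brutus met Caesar at dawn",
    "Play C": "Caesar crossed the river",
}


def words(text):
    """Lowercased set of alphabetic tokens in text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def linear_scan(plays, required, excluded):
    """Scan every play in full for each query; cost grows with corpus size."""
    hits = []
    for title, text in plays.items():
        vocab = words(text)  # re-reads the whole text on every query
        has_all = all(t in vocab for t in required)
        has_excluded = any(t in vocab for t in excluded)
        if has_all and not has_excluded:
            hits.append(title)
    return hits


scan_hits = linear_scan(plays, ["brutus", "caesar"], ["calpurnia"])
print(scan_hits)
```

Every call to `linear_scan` tokenizes the full collection again, which is the repeated work an index built once in advance avoids.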
            Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony               1                  1             0          0       0        1
Brutus               1                  1             0          1       0        0
Caesar               1                  1             0          1       1        1
Calpurnia            0                  1             0          0       0        0
Cleopatra            1                  0             0          0       0        0
mercy                1                  0             1          1       1        1
worser               1                  0             1          1       1        0
Incidence vectors
• So we have a 0/1 vector for each term.
• To answer the query Brutus AND Caesar AND NOT Calpurnia: take the vectors for Brutus, Caesar and Calpurnia (complemented), then bitwise AND them.
– 110100 AND
– 110111 AND
– 101111 =
– 100100
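The bitwise AND above can be reproduced in a few lines. This is a sketch that stores each incidence vector as a Python integer; the variable names are illustrative, and the vectors are the Brutus, Caesar and Calpurnia rows of the matrix.

```python
# Column order of the incidence matrix (one bit per play, left to right).
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# 0/1 incidence vectors from the matrix, written as bit strings.
incidence = {
    "Brutus": "110100",
    "Caesar": "110111",
    "Calpurnia": "010000",
}

n = len(plays)
mask = (1 << n) - 1  # restricts the complement to n bits

brutus = int(incidence["Brutus"], 2)
caesar = int(incidence["Caesar"], 2)
not_calpurnia = ~int(incidence["Calpurnia"], 2) & mask  # 101111

result = brutus & caesar & not_calpurnia
bits = format(result, f"0{n}b")
answers = [play for play, bit in zip(plays, bits) if bit == "1"]

print(bits)
print(answers)
```

The mask matters because Python integers are unbounded: without it, `~` would yield a negative number instead of a 6-bit complement.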
Answers to query
• Result vector 100100: Antony and Cleopatra, and Hamlet
The Inverted Index
The key data structure underlying
modern IR
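As a rough sketch of the idea, an inverted index maps each term to the sorted list of documents that contain it, so an AND query becomes a merge of two lists. The slide only names the structure; the toy collection, the docIDs and the function names `build_index` and `intersect` below are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical toy collection: docID -> text.
docs = {
    1: "brutus killed caesar",
    2: "caesar and calpurnia",
    3: "brutus praised caesar",
}


def build_index(docs):
    """Map each term to a sorted list of the docIDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def intersect(p1, p2):
    """Merge two sorted postings lists, keeping docIDs present in both."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out


index = build_index(docs)
both = intersect(index["brutus"], index["caesar"])
print(index["brutus"])
print(both)
```

Unlike the term-document matrix, this stores only the 1 entries, and the merge touches each postings entry at most once.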
Quiz
• When a search engine returns 30 pages, only 20 of which are relevant, while failing to return 40 additional relevant pages, its precision = ……………. while its recall = …………………….
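For reference, the two measures in the quiz follow the standard definitions: precision is the fraction of returned pages that are relevant, and recall is the fraction of all relevant pages that were returned. The sketch below uses made-up counts, deliberately different from the quiz, and the function names are illustrative.

```python
def precision(relevant_returned, total_returned):
    """Fraction of the returned pages that are relevant."""
    return relevant_returned / total_returned


def recall(relevant_returned, relevant_missed):
    """Fraction of all relevant pages that were actually returned."""
    return relevant_returned / (relevant_returned + relevant_missed)


# Hypothetical counts: 10 pages returned, 8 of them relevant,
# and 12 relevant pages that the engine failed to return.
p = precision(relevant_returned=8, total_returned=10)
r = recall(relevant_returned=8, relevant_missed=12)
print(p, r)
```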