Information Retrieval - Lecture 2
Introduction to IR
BIS216E
Course: Information Retrieval & Search Engines
Sec. 1.1
Term-document incidence
Term       Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony     1                     1              0            0       0        1
Brutus     1                     1              0            1       0        0
Caesar     1                     1              0            1       1        1
Calpurnia  0                     1              0            0       0        0
Cleopatra  1                     0              0            0       0        0
mercy      1                     0              1            1       1        1
worser     1                     0              1            1       1        0
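
As a minimal sketch of how this table can be held in memory, the rows can be stored as lists of 0/1 values keyed by term, with one column per play. The names `PLAYS`, `INCIDENCE`, and `plays_containing` are illustrative assumptions, not part of the lecture.

```python
# Column order follows the header row of the table.
PLAYS = [
    "Antony and Cleopatra", "Julius Caesar", "The Tempest",
    "Hamlet", "Othello", "Macbeth",
]

# One row per term, copied from the table: 1 means the play contains the term.
INCIDENCE = {
    "Antony":    [1, 1, 0, 0, 0, 1],
    "Brutus":    [1, 1, 0, 1, 0, 0],
    "Caesar":    [1, 1, 0, 1, 1, 1],
    "Calpurnia": [0, 1, 0, 0, 0, 0],
    "Cleopatra": [1, 0, 0, 0, 0, 0],
    "mercy":     [1, 0, 1, 1, 1, 1],
    "worser":    [1, 0, 1, 1, 1, 0],
}


def plays_containing(term):
    """Return the plays whose cell for `term` is 1 (hypothetical helper)."""
    # A term missing from the table is treated as occurring in no play.
    row = INCIDENCE.get(term, [0] * len(PLAYS))
    return [play for play, bit in zip(PLAYS, row) if bit]


print(plays_containing("Calpurnia"))  # ['Julius Caesar']
```

Reading a row left to right gives the plays for one term; reading a column top to bottom gives the terms for one play.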
Incidence vectors
Term       Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony     1                     1              0            0       0        1
Brutus     1                     1              0            1       0        0
Caesar     1                     1              0            1       1        1
Calpurnia  0                     1              0            0       0        0
Cleopatra  1                     0              0            0       0        0
mercy      1                     0              1            1       1        1
worser     1                     0              1            1       1        0
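
Each row of the table is a 0/1 vector for one term, so a Boolean query can be answered with bitwise operations on those vectors. The sketch below assumes the query is Brutus AND Caesar AND NOT Calpurnia (the slides do not state the query at this point, so this is an assumption), and the names `VECTORS`, `mask`, and `matches` are illustrative only.

```python
# Column order of the incidence table.
PLAYS = [
    "Antony and Cleopatra", "Julius Caesar", "The Tempest",
    "Hamlet", "Othello", "Macbeth",
]

# Rows of the table written as bit strings, leftmost bit = first play.
VECTORS = {
    "Brutus":    "110100",
    "Caesar":    "110111",
    "Calpurnia": "010000",
}

n = len(PLAYS)
mask = (1 << n) - 1  # keeps the complement within n bits

brutus = int(VECTORS["Brutus"], 2)
caesar = int(VECTORS["Caesar"], 2)
not_calpurnia = ~int(VECTORS["Calpurnia"], 2) & mask  # 101111

# Assumed query: Brutus AND Caesar AND NOT Calpurnia.
result = brutus & caesar & not_calpurnia
result_bits = format(result, f"0{n}b")

matches = [play for play, bit in zip(PLAYS, result_bits) if bit == "1"]
print(result_bits)  # 100100
print(matches)      # ['Antony and Cleopatra', 'Hamlet']
```

Python integers are unbounded, so the complement has to be masked to the number of plays; without the mask, `~` would produce a negative number rather than a six-bit vector.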
Answers to query
• Antony and Cleopatra, Act III, Scene ii
Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring; and he wept
When at Philippi he found Brutus slain.
Bigger collections
• Consider N = 1 million documents, each with about 1000 words.
• Average of 6 bytes per word, including spaces and punctuation
– 10^6 documents × 1000 words × 6 bytes = 6 GB of data in the documents.
• Say there are M = 500K distinct terms among these.
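
The arithmetic behind these figures can be checked with a short sketch. It also works out how large a term-document incidence matrix would be at this scale, which is a derived figure rather than one stated on the slide. The variable names are assumptions, and the bound on the number of 1s assumes every word in a document is a distinct term, which overestimates.

```python
N = 1_000_000          # documents
WORDS_PER_DOC = 1_000  # approximate words per document
BYTES_PER_WORD = 6     # average, including spaces and punctuation
M = 500_000            # distinct terms

# Raw size of the collection in bytes.
collection_bytes = N * WORDS_PER_DOC * BYTES_PER_WORD
print(collection_bytes // 10**9, "GB")  # 6 GB

# A term-document incidence matrix has one cell per (term, document) pair.
matrix_cells = M * N
print(matrix_cells)  # 500000000000

# Upper bound on the number of 1s: a document of about 1000 words
# can contain at most about 1000 distinct terms.
max_ones = N * WORDS_PER_DOC
print(max_ones / matrix_cells)  # 0.002
```

Under these assumptions at most 0.2% of the cells can be 1, so the matrix would be extremely sparse.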