Information Retrieval - Lecture 2

Information Retrieval & Search Engines

Instructor: Prof. Shereen Taie

Introduction to IR
BIS216E

Course: Information Retrieval & Search Engines




Sec. 1.1

Unstructured data in 1680


• Query: Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
• One could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia.
• Why is that not the answer?
– Slow (for large corpora)
– NOT Calpurnia is non-trivial
– Other operations (e.g., find the word Romans near countrymen) are not feasible
– Ranked retrieval (best documents to return) is not supported; covered in later lectures


Term-document incidence

Term        Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Antony      1                      1               0             0        0         1
Brutus      1                      1               0             1        0         0
Caesar      1                      1               0             1        1         1
Calpurnia   0                      1               0             0        0         0
Cleopatra   1                      0               0             0        0         0
mercy       1                      0               1             1        1         1
worser      1                      0               1             1        1         0

Entry is 1 if the play contains the word, 0 otherwise.
Query: Brutus AND Caesar BUT NOT Calpurnia
Binary operation

Incidence vectors

• So we have a 0/1 vector for each term.


• To answer the query: take the vectors for Brutus, Caesar and Calpurnia (complemented), then bitwise AND them.
• Brutus: 110100
• Caesar: 110111
• Calpurnia: 010000
• NOT Calpurnia: 101111
• 110100 AND 110111 AND 101111 = 100100.
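
The bitwise steps above can be sketched in a few lines of Python. This is only an illustration: the vectors are copied from the slide, while the names PLAYS, VECTORS, MASK, complement and matching_plays are made up for this example and do not come from the lecture.

```python
# Incidence vectors copied from the slide; one bit per play, in column order.
PLAYS = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]
VECTORS = {
    "Brutus": 0b110100,
    "Caesar": 0b110111,
    "Calpurnia": 0b010000,
}
MASK = (1 << len(PLAYS)) - 1  # six 1 bits, keeps the complement to six bits


def complement(vector: int) -> int:
    """Bitwise NOT restricted to len(PLAYS) bits."""
    return ~vector & MASK


def matching_plays(vector: int) -> list[str]:
    """Map set bits back to play titles (leftmost bit = first play)."""
    width = len(PLAYS)
    return [play for i, play in enumerate(PLAYS)
            if vector & (1 << (width - 1 - i))]


# Brutus AND Caesar AND NOT Calpurnia
result = VECTORS["Brutus"] & VECTORS["Caesar"] & complement(VECTORS["Calpurnia"])
print(format(result, "06b"))   # 100100
print(matching_plays(result))  # ['Antony and Cleopatra', 'Hamlet']
```

The mask matters because Python integers are unbounded, so a plain ~ would produce a negative number rather than a six-bit complement.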
Exercise
• Antony OR Brutus

• Antony OR Brutus AND NOT Antony

• NOT Caesar AND Brutus

• Worser AND Mercy


Answers to query
• Antony and Cleopatra, Act III, Scene ii
Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring; and he wept
When at Philippi he found Brutus slain.

• Hamlet, Act III, Scene ii
Lord Polonius: I did enact Julius Caesar: I was killed i' the
Capitol; Brutus killed me.


Basic assumptions of Information Retrieval
• Collection: Fixed set of documents
• Goal: Retrieve documents with information that is relevant to the
user’s information need and helps the user complete a task


How good are the retrieved docs?
• Precision: Fraction of retrieved docs that are relevant to the user’s information need
• Recall: Fraction of relevant docs in the collection that are retrieved
• More precise definitions and measurements to follow in later lectures
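
As a small illustration of the two fractions, the sketch below computes them for a made-up result set. The document IDs and the function name precision_recall are hypothetical, and the convention of returning 0.0 for an empty set is an assumption of this example, not something the lecture defines.

```python
def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for one query.

    precision = |retrieved AND relevant| / |retrieved|
    recall    = |retrieved AND relevant| / |relevant|
    Empty denominators yield 0.0 (an assumption made for this sketch).
    """
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 docs retrieved, 3 docs relevant, 2 in common.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d7"}
p, r = precision_recall(retrieved, relevant)
print(p)            # 0.5
print(round(r, 3))  # 0.667
```

Here half of what was retrieved is relevant (precision 2/4), and two of the three relevant docs were found (recall 2/3).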


Bigger collections
• Consider N = 1 million documents, each with about 1000 words.
• Average of 6 bytes/word, including spaces and punctuation
– About 6 GB of data in the documents.
• Say there are M = 500K distinct terms among these.
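
The 6 GB figure follows directly from the three numbers above. A quick check, with constant names chosen only for this sketch:

```python
N_DOCS = 1_000_000      # N = 1 million documents
WORDS_PER_DOC = 1_000   # about 1000 words each
BYTES_PER_WORD = 6      # average, including spaces and punctuation

total_bytes = N_DOCS * WORDS_PER_DOC * BYTES_PER_WORD
print(total_bytes)          # 6000000000
print(total_bytes / 10**9)  # 6.0, i.e. about 6 GB (decimal gigabytes)
```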


Can’t build the matrix


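• A 500K x 1M matrix has half a trillion 0’s and 1’s.
• But it has no more than one billion 1’s. Why?
– Each document has about 1000 words, so it contributes at most 1000 1’s: 1M x 1000 = one billion.
– The matrix is therefore extremely sparse.
• What’s a better representation?
– We only record the 1 positions.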
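
One way to picture "recording only the 1 positions" is to store, for each term, the list of document numbers in which it occurs (IR texts often call this a postings list). The sketch below does this for the small Shakespeare matrix from the earlier slide; the names MATRIX and to_postings are invented for this example, and documents are numbered from 0 in column order.

```python
# Toy incidence matrix from the earlier slide: one row per term,
# one column per play (0 = Antony and Cleopatra, ..., 5 = Macbeth).
MATRIX = {
    "Antony":    [1, 1, 0, 0, 0, 1],
    "Brutus":    [1, 1, 0, 1, 0, 0],
    "Caesar":    [1, 1, 0, 1, 1, 1],
    "Calpurnia": [0, 1, 0, 0, 0, 0],
    "Cleopatra": [1, 0, 0, 0, 0, 0],
    "mercy":     [1, 0, 1, 1, 1, 1],
    "worser":    [1, 0, 1, 1, 1, 0],
}


def to_postings(matrix: dict[str, list[int]]) -> dict[str, list[int]]:
    """Keep only the column indices where each term's row holds a 1."""
    return {term: [doc_id for doc_id, bit in enumerate(row) if bit]
            for term, row in matrix.items()}


postings = to_postings(MATRIX)
print(postings["Brutus"])     # [0, 1, 3]
print(postings["Calpurnia"])  # [1]

# Scale check with the lecture's numbers: at most 1000 ones per document.
cells = 500_000 * 1_000_000   # half a trillion matrix entries
max_ones = 1_000_000 * 1_000  # at most one billion 1's
print(max_ones / cells)       # 0.002
```

With the lecture's numbers, at most 0.2% of the matrix entries can be 1, which is why storing only those positions saves so much space.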
