
Introduction to Information Retrieval
Introducing Information Retrieval and Web Search

Information Retrieval

IR vs. databases: structured vs. unstructured data
• Structured data tends to refer to information in "tables":

  Employee   Manager   Salary
  Smith      Jones     50000
  Chang      Smith     60000
  Ivy        Smith     50000

• Typically allows numerical range and exact-match (for text) queries, e.g.,
  Salary < 60000 AND Manager = Smith (a sketch follows below).
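As a toy illustration (not from the slides), the query can be phrased as a plain filter over rows; a real database would use SQL, but the logic is the same:

```python
# Toy rows mirroring the slide's Employee/Manager/Salary table.
rows = [
    {"employee": "Smith", "manager": "Jones", "salary": 50000},
    {"employee": "Chang", "manager": "Smith", "salary": 60000},
    {"employee": "Ivy",   "manager": "Smith", "salary": 50000},
]

# Salary < 60000 AND Manager = Smith
hits = [r for r in rows if r["salary"] < 60000 and r["manager"] == "Smith"]
print(hits)  # -> [{'employee': 'Ivy', 'manager': 'Smith', 'salary': 50000}]
```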
Unstructured data
• Typically refers to free text
• Allows:
  – Keyword queries, including operators
  – More sophisticated "concept" queries, e.g., find all web pages dealing with drug abuse
• This is the classic model for searching text documents
Semi-structured data
• In fact, almost no data is truly "unstructured"
• E.g., this slide has distinctly identified zones such as the Title and Bullets
• … to say nothing of linguistic structure
• This facilitates "semi-structured" search such as:
  – Title contains data AND Bullets contain search (sketched below)
Information Retrieval
• Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
  – These days we frequently think first of web search, but there are many other cases:
    • E-mail search
    • Searching your laptop
Unstructured (text) vs. structured (database) data in the mid-nineties
[chart]

Unstructured (text) vs. structured (database) data today
[chart]
Sec. 1.1

Basic assumptions of Information Retrieval
• Collection: a set of documents
  – Assume it is a static collection for the moment
• Goal: retrieve documents with information that is relevant to the user's information need and helps the user complete a task
The classic search model

User task: get rid of mice in a politically correct way
  ↓ (misconception?)
Info need: info about removing mice without killing them
  ↓ (misformulation?)
Query: how trap mice alive
  ↓
Search engine over the Collection → Results, with query refinement feeding back into the query
Sec. 1.1

How good are the retrieved docs?

§ Precision: fraction of retrieved docs that are relevant to the user's information need
§ Recall: fraction of relevant docs in the collection that are retrieved

(In other words: each retrieved doc is either right or wrong, i.e., relevant or not, and each relevant doc is either retrieved or not.)
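Both definitions translate directly into code. A minimal sketch, with made-up sets of document IDs:

```python
def precision(retrieved, relevant):
    # Fraction of retrieved docs that are relevant
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved, relevant):
    # Fraction of relevant docs that are retrieved
    return len(retrieved & relevant) / len(relevant)

retrieved = {1, 2, 3, 4, 5}   # docs the engine returned (hypothetical)
relevant  = {2, 3, 7, 9}      # docs the user actually needed (hypothetical)
print(precision(retrieved, relevant))  # 2/5 = 0.4
print(recall(retrieved, relevant))     # 2/4 = 0.5
```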

Introduction to Information Retrieval
Term-document incidence matrices
Sec. 1.1

Unstructured data in 1620
• One could grep all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia.
• Why is that not the answer?
  – Slow (for large corpora)
  – Queries like Romans NEAR countrymen are not trivial (they need the positions of terms)
  – Each new query repeats the linear scan, which takes too long (a sketch of that scan follows below)
  – No ranked retrieval (picking the best documents to return, e.g., by how often each word is repeated in a doc)
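A sketch of the brute-force scan being critiqued, over a made-up two-play corpus; note that every query re-reads every document:

```python
# Hypothetical miniature corpus; real plays would be far larger.
plays = {
    "Julius Caesar": "Brutus killed Caesar in the Capitol",
    "Hamlet": "the players recall how Brutus slew Caesar",
}

def linear_scan(query_terms, corpus):
    # O(total corpus size) on *every* query -- the slide's objection.
    return [name for name, text in corpus.items()
            if all(t.lower() in text.lower() for t in query_terms)]

print(linear_scan(["Brutus", "Caesar"], plays))  # both plays match
```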

Sec. 1.1

Term-document incidence matrices

Term        Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony               1                  1             0          0       0        1
Brutus               1                  1             0          1       0        0
Caesar               1                  1             0          1       1        1
Calpurnia            0                  1             0          0       0        0
Cleopatra            1                  0             0          0       0        0
mercy                1                  0             1          1       1        1
worser               1                  0             1          1       1        0

Query: Brutus AND Caesar BUT NOT Calpurnia. An entry is 1 if the play contains the word, 0 otherwise.
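One way such a matrix could be built, sketched over a tiny stand-in corpus (each "play" is reduced to a few words for illustration):

```python
# Tiny stand-in corpus; real plays would be full texts.
plays = {
    "Antony and Cleopatra": "antony brutus caesar cleopatra mercy worser",
    "Julius Caesar": "antony brutus caesar calpurnia",
    "Hamlet": "brutus caesar mercy worser",
}

names = list(plays)
terms = sorted({t for text in plays.values() for t in text.split()})

# incidence[term] is a 0/1 vector with one entry per play.
incidence = {t: [int(t in plays[n].split()) for n in names] for t in terms}

for term in terms:
    print(f"{term:10s} {incidence[term]}")
```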
Sec. 1.1

Incidence vectors
• So we have a 0/1 vector for each term.
• To answer the query, take the vectors for Brutus, Caesar, and Calpurnia (complemented), then bitwise AND them:
  – 110100 AND
  – 110111 AND
  – 101111 =
  – 100100
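The same computation, using Python integers as bit vectors; the mask is needed because Python's ~ complement produces an unbounded integer:

```python
MASK = 0b111111  # six plays -> six bits

brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000

# Brutus AND Caesar AND NOT Calpurnia
answer = brutus & caesar & (~calpurnia & MASK)
print(f"{answer:06b}")  # -> 100100: Antony and Cleopatra, Hamlet
```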

Sec. 1.1

Answers to query

• Antony and Cleopatra, Act III, Scene ii
  Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
  When Antony found Julius Caesar dead,
  He cried almost to roaring; and he wept
  When at Philippi he found Brutus slain.

• Hamlet, Act III, Scene ii
  Lord Polonius: I did enact Julius Caesar: I was killed i' the
  Capitol; Brutus killed me.
Sec. 1.1

Can't build the matrix
• The matrix is extremely sparse: most entries are 0 (about 99.8%).
• What's a better representation?
  – We only record the 1 positions.
Introduction to Information Retrieval
The Inverted Index: the key data structure underlying modern IR
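A minimal sketch of the idea, assuming a toy three-document collection: each term maps to a postings list of docIDs kept in sorted order, and an AND query is answered by merging two lists:

```python
from collections import defaultdict

docs = {1: "new home sales top forecasts",
        2: "home sales rise in july",
        3: "increase in home sales in july"}

index = defaultdict(list)
for doc_id in sorted(docs):            # visit docs in docID order...
    for term in sorted(set(docs[doc_id].split())):
        index[term].append(doc_id)     # ...so each postings list stays sorted

def intersect(p1, p2):
    # Classic merge of two sorted postings lists (linear time).
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

print(intersect(index["home"], index["july"]))  # -> [2, 3]
```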
Quiz
• When a search engine returns 30 pages, only 20 of which are relevant, while failing to return 40 additional relevant pages, its precision = ……………. while its recall = …………………….
