Lec 1 IR
Introducing Information Retrieval and Web Search

Information Retrieval
• Information Retrieval (IR) is finding material
(usually documents) of an unstructured nature
(usually text) that satisfies an information need
from within large collections (usually stored on
computers).
Unstructured (text) vs. structured (database) data in the mid-nineties

Unstructured (text) vs. structured (database) data today
The classic search model

User task: get rid of mice in a politically correct way
  (misconception? the user may misunderstand their own task)
Info need: info about removing mice without killing them
  (misformulation? the need may be turned into a poor query)
Query: "how trap mice alive"

The query is sent to the search engine, which returns results from the
collection; the user may then do query refinement and search again.
Term-document incidence matrices
Incidence vectors
• So we have a 0/1 vector for each term.
• To answer the query Brutus AND Caesar AND NOT Calpurnia: take the vectors
  for Brutus, Caesar, and Calpurnia (complemented) and bitwise AND them:
  – 110100 AND
  – 110111 AND
  – 101111 =
  – 100100
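In Python, this query amounts to three integer operations; a minimal sketch using the slide's vectors (the variable names are mine):

# Incidence vectors from the slide, one bit per play (leftmost bit = doc 1).
brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000

# Brutus AND Caesar AND NOT Calpurnia; complement within a 6-bit word.
mask = (1 << 6) - 1
result = brutus & caesar & (~calpurnia & mask)

print(format(result, "06b"))   # -> 100100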
Answers to query

• 100100 picks out documents 1 and 4: the plays Antony and Cleopatra and Hamlet.
Bigger collections

• Consider, say, 1 million documents of about 1,000 words each
  (1000 * 1 million term-document cells).
  – The matrix is extremely sparse: most entries are 0 (about 99.8%).
• What's a better representation?
  – We only record the 1 positions.
The Inverted Index

The key data structure underlying modern IR
Inverted index
• For each term t, we must store a list of all
documents that contain t.
– Identify each doc by a docID, a document serial
number
• Can we use fixed-size arrays for this?
Brutus → 1, 2, 4, 11, 31, 45, 173, 174
Caesar → 1, 2, 4, 5, 6, 16, 57, 132
Calpurnia → 2, 31, 54, 101

What happens if the word Caesar is added to document 14?
Inverted index
• We need variable-size postings lists
  – On disk, a contiguous run of postings is normal and best
  – In memory, can use linked lists or variable length arrays
    (each docID entry in a list is called a posting)
• Some tradeoffs in size/ease of insertion

Brutus → 1, 2, 4, 11, 31, 45, 173, 174
Caesar → 1, 2, 4, 5, 6, 16, 57, 132
Calpurnia → 2, 31, 54, 101

The terms on the left form the dictionary; the docID lists are the postings.
Sorted by docID (more later on why).
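A minimal in-memory sketch of this in Python, with variable-length lists as the postings (an illustration of the idea, not the lecture's code):

from collections import defaultdict

# Dictionary -> variable-length postings lists (Python lists grow as needed).
# Processing documents in increasing docID order keeps each list sorted.
index = defaultdict(list)

def add_posting(term: str, doc_id: int) -> None:
    postings = index[term]
    if not postings or postings[-1] != doc_id:   # skip duplicates within a doc
        postings.append(doc_id)

# A stream of (docID, term) pairs, as produced while scanning the collection.
# Adding Caesar to document 14 is just another append, nothing to outgrow.
for doc_id, term in [(1, "Brutus"), (2, "Brutus"), (2, "Calpurnia"),
                     (4, "Brutus"), (14, "Caesar"), (31, "Calpurnia")]:
    add_posting(term, doc_id)

print(index["Calpurnia"])   # -> [2, 31]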
Linguistic modules

Modified tokens: friend, roman, countryman

The indexer then builds the inverted index:
friend → 2, 4
roman → 1, 2
countryman → 13, 16
Initial stages of text processing

• Tokenization
  – Cut character sequence into word tokens
  • Deal with cases like "John's", "a state-of-the-art solution"
• Normalization
  – Map text and query terms to the same form
  • You want U.S.A. and USA to match
• Stemming
  – We may wish different forms of a root to match
  • authorize, authorization
• Stop words
  – We may omit very common words (or not)
  • the, a, to, of
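A toy Python version of these stages (the token regex, stop list, and suffix stripping below are crude stand-ins for real linguistic modules, chosen purely for illustration):

import re

STOP_WORDS = {"the", "a", "to", "of"}          # tiny illustrative stop list

def tokenize(text: str) -> list[str]:
    # Cut the character sequence into word tokens.
    return re.findall(r"[A-Za-z']+", text)

def normalize(token: str) -> str:
    # Map variant forms to one canonical form.
    return token.lower().strip("'")

def stem(token: str) -> str:
    # Extremely crude stand-in for a real stemmer (e.g. Porter).
    for suffix in ("ization", "ation", "ize", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text: str) -> list[str]:
    tokens = (normalize(t) for t in tokenize(text))
    return [stem(t) for t in tokens if t not in STOP_WORDS]

# -> ['friend', 'roman', 'and', 'countrymen']
# (a real stemmer would also map countrymen -> countryman)
print(preprocess("Friends, Romans and countrymen"))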
Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.

Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.
Result of indexing: the dictionary stores the terms and their counts;
pointers connect each term to its postings, a list of docIDs.

IR system implementation
• How do we index efficiently?
• How much storage do we need?
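As a rough illustration (not the lecture's implementation), a sort-based indexer in Python for the two documents above; it produces the dictionary terms with document frequencies and the postings lists of docIDs:

from collections import defaultdict

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.",
}

# Step 1: emit (term, docID) pairs; Step 2: sort them;
# Step 3: collapse duplicates into postings lists, recording document frequency.
pairs = sorted(
    (term.strip(".,;:'").lower(), doc_id)
    for doc_id, text in docs.items()
    for term in text.split()
)

postings = defaultdict(list)
for term, doc_id in pairs:
    if term and (not postings[term] or postings[term][-1] != doc_id):
        postings[term].append(doc_id)

for term in sorted(postings):
    print(f"{term}  df={len(postings[term])}  ->  {postings[term]}")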
Query processing with an inverted index
The merge

• Walk through the two postings lists simultaneously, in time linear in
  the total number of postings entries

Brutus → 2, 4, 8, 16, 32, 64, 128
Caesar → 1, 2, 3, 5, 8, 13, 21, 34

If the list lengths are x and y, the merge takes O(x+y) operations.
Crucial: postings must be sorted by docID.
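A direct Python rendering of this merge, the textbook's INTERSECT algorithm (variable names are mine):

def intersect(p1: list[int], p2: list[int]) -> list[int]:
    """Merge two postings lists sorted by docID in O(len(p1) + len(p2)) time."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:          # docID in both lists: emit it
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:         # advance the pointer with the smaller docID
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))   # -> [2, 8]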
The Boolean Retrieval Model & Extended Boolean Models
Phrase queries
• We want to be able to answer queries such as
“stanford university” – as a phrase
• Thus the sentence “I went to university at
Stanford” is not a match.
– The concept of phrase queries has proven easily
understood by users; one of the few “advanced
search” ideas that works
• For this, it no longer suffices to store only
<term : docs> entries
Positional index example:

<be: 993427;
  1: 7, 18, 33, 72, 86, 231;
  2: 3, 149;
  4: 17, 191, 291, 430, 434;
  5: 363, 367, …>

Which of docs 1, 2, 4, 5 could contain "to be or not to be"?
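A rough sketch of a two-word phrase check over such entries; the "be" positions below come from the slide, while the "to" positions are invented purely for illustration:

# term -> {docID: positions}; a tiny positional index.
pos_index = {
    "be": {1: [7, 18, 33, 72, 86, 231], 2: [3, 149], 4: [17, 191, 291, 430, 434]},
    "to": {1: [4, 30], 2: [2, 148], 4: [100, 400]},   # invented data
}

def phrase_docs(w1: str, w2: str) -> list[int]:
    """Docs in which w2 occurs at the position immediately after w1."""
    hits = []
    for doc in pos_index[w1].keys() & pos_index[w2].keys():
        after = set(pos_index[w2][doc])
        if any(p + 1 in after for p in pos_index[w1][doc]):
            hits.append(doc)
    return sorted(hits)

# -> [2] with this toy data: "to" at position 2 immediately precedes "be" at 3
print(phrase_docs("to", "be"))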
Rules of thumb

• A positional index is 2–4 times as large as a non-positional index
Combination schemes

• The two approaches, biword indexes and positional indexes, can be
  profitably combined
  – For particular phrases ("Michael Jackson", "Britney Spears") it is
    inefficient to keep merging positional postings lists
  • Even more so for phrases like "The Who"
• Williams et al. (2004) evaluate a more sophisticated mixed indexing scheme
  – A typical web query mixture was executed in ¼ of the time needed with
    just a positional index
  – It required 26% more space than a positional index alone
Structured vs. Unstructured Data
IR vs. databases: structured vs. unstructured data

• Structured data tends to refer to information in "tables"

  Employee   Manager   Salary
  Smith      Jones     50000
  Chang      Smith     60000
  Ivy        Smith     50000
Semi-structured data

• In fact almost no data is truly "unstructured"
• E.g., this slide has distinctly identified zones such as the Title and Bullets
• … to say nothing of linguistic structure
• Facilitates "semi-structured" search such as
  – Title contains data AND Bullets contain search
• Or even
  – Title is about Object Oriented Programming AND Author something like stro*rup
  – where * is the wild-card operator
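A rough sketch of how such a wild-card could be evaluated, translating * into a regular expression in Python (the author strings are invented):

import re

def wildcard_match(pattern: str, value: str) -> bool:
    # Turn a wild-card pattern like "stro*rup" into an anchored regex.
    regex = "^" + ".*".join(re.escape(part) for part in pattern.split("*")) + "$"
    return re.match(regex, value) is not None

# Invented author values, for illustration only.
for author in ("Stroustrup", "Stallman"):
    print(author, wildcard_match("stro*rup", author.lower()))
    # -> Stroustrup True, Stallman False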