Lect 2 Boolean Retrieval

The Boolean retrieval model lets users pose queries as Boolean expressions of terms combined with the operators AND, OR, and NOT, treating each document as a set of words. Grepping, a linear scan through the documents, is the simplest retrieval method, but it is too slow for large collections and supports neither flexible matching nor ranked retrieval. Indexing the documents in advance, for example with a term-document incidence matrix, avoids rescanning the text for every query.

Lect 2: Boolean Retrieval

Dr. Subrat Kumar Nayak

Associate Professor
Department of CSE, ITER, SOADu
Boolean Retrieval Model
• The Boolean retrieval model is a model for information retrieval in
which we can pose any query that is in the form of a Boolean
expression of terms, that is, in which terms are combined with the
operators AND, OR, and NOT.
• The model views each document as just a set of words.
• Queries are Boolean expressions, e.g., CAESAR AND BRUTUS
• The search engine returns all documents that satisfy the Boolean
expression.
• The simplest way to find those documents is a linear scan through them.
Boolean Retrieval: Grepping
• The simplest form of document retrieval is for a computer to do a linear scan through
the documents. This process is commonly referred to as grepping through text.
• It is named after the Unix command grep, which performs this process.
• Grepping through text can be a very effective process, especially given the speed of
modern computers.
• It often allows useful possibilities for wildcard pattern matching through the use of
regular expressions.
Example:
• Suppose you wanted to determine which documents contain the words Information AND
Retrieval AND NOT Boolean.
• One way to do that is to start at the beginning and read through all the text, noting for
each document whether it contains Information and Retrieval, and excluding it from
consideration if it contains Boolean.
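The linear scan described in the example can be sketched in a few lines of Python. This is only an illustration: the four toy documents, their names, and the helper names `words` and `linear_scan` are assumptions made for the sketch, not part of the lecture.

```python
import re

# Toy collection; the document names and texts are invented for illustration.
docs = {
    "d1": "Information retrieval is about finding material of interest.",
    "d2": "Boolean information retrieval uses AND, OR and NOT.",
    "d3": "Retrieval of information by linear scan is simple.",
    "d4": "Grep is a Unix command.",
}

def words(text):
    """Lowercase a text and return the set of words it contains."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def linear_scan(docs, required, excluded):
    """Read every document; keep those with all required and no excluded words."""
    hits = []
    for name, text in docs.items():
        w = words(text)
        if all(t in w for t in required) and not any(t in w for t in excluded):
            hits.append(name)
    return hits

# Information AND Retrieval AND NOT Boolean
result = linear_scan(docs, ["information", "retrieval"], ["boolean"])
print(result)  # ['d1', 'd3']
```

Every query rereads every document, which is why the cost grows with the size of the collection.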
Boolean Retrieval: Grepping
• But for many purposes, you do need more:
❑ To process large document collections quickly. The amount of online data has
grown at least as quickly as the speed of computers, and we would now like to be
able to search collections that total in the order of billions to trillions of words.
❑ To allow more flexible matching operations. For example, it is impractical to
perform the query Romans NEAR countrymen with grep, where NEAR might be
defined as “within 5 words” or “within the same sentence”.
❑ To allow ranked retrieval: in many cases you want the best answer to an
information need among many documents that contain certain words.
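To make the NEAR operator concrete, here is a minimal Python sketch of one possible reading of "within 5 words": two terms match if their token positions differ by at most k. The function name `near`, the tokenization rule, and the default k = 5 are assumptions for illustration only.

```python
import re

def near(text, a, b, k=5):
    """Return True if words a and b occur within k token positions of each other."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    return any(abs(i - j) <= k for i in pos_a for j in pos_b)

line = "Friends, Romans, countrymen, lend me your ears"
print(near(line, "romans", "countrymen"))  # True: adjacent words
print(near(line, "friends", "ears"))       # False: six positions apart
```

The check needs word positions, which a plain line-oriented grep does not provide.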
Unstructured data in 1620
• Which plays of Shakespeare contain the words Brutus AND Caesar
but NOT Calpurnia?
• One could grep all of Shakespeare’s plays for Brutus and Caesar,
then strip out lines containing Calpurnia?
• Why is that not the answer?
❑ Slow (for large corpora)
❑ NOT Calpurnia is non-trivial
❑ Other operations (e.g., find the word Romans near countrymen)
are not feasible
❑ Ranked retrieval (best documents to return): covered in later lectures
Boolean Retrieval: Some terminology
• Documents: the units we have decided to build a retrieval system
over. They might be individual memos or chapters of a book.
• Collection/Corpus: We will refer to the group of documents over
which we perform retrieval as the (document) collection. It is
sometimes also referred to as a corpus (a body of texts).

Let us consider Shakespeare’s Collected Works, and use it to
introduce the basics of the Boolean retrieval model.
Boolean Retrieval: Term-document incidence matrices
• The way to avoid linearly scanning the texts for each query is to index the documents in
advance.
• The binary term-document incidence matrix records, for each document (here a play of
Shakespeare’s), whether it contains each word out of all the words Shakespeare used
(about 32,000 different words).
Term       Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony     1                     1              0            0       0        1
Brutus     1                     1              0            1       0        0
Caesar     1                     1              0            1       1        1
Calpurnia  0                     1              0            0       0        0
Cleopatra  1                     0              0            0       0        0
mercy      1                     0              1            1       1        1
worser     1                     0              1            1       1        0

An entry is 1 if the play contains the word, 0 otherwise.
Query: Brutus AND Caesar BUT NOT Calpurnia
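As a rough illustration of how such a matrix is recorded, the Python sketch below rebuilds the seven rows shown above from a per-play set of terms. The variable names (`plays`, `terms_in_play`, `matrix`) are assumptions for the sketch, and only the seven example terms are included rather than the full vocabulary.

```python
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# Which of the seven example terms each play contains (read off the table above).
terms_in_play = {
    "Antony and Cleopatra": {"Antony", "Brutus", "Caesar", "Cleopatra", "mercy", "worser"},
    "Julius Caesar": {"Antony", "Brutus", "Caesar", "Calpurnia"},
    "The Tempest": {"mercy", "worser"},
    "Hamlet": {"Brutus", "Caesar", "mercy", "worser"},
    "Othello": {"Caesar", "mercy", "worser"},
    "Macbeth": {"Antony", "Caesar", "mercy"},
}

# The vocabulary is the union of all terms seen in any play.
vocabulary = sorted(set().union(*terms_in_play.values()), key=str.lower)

# One 0/1 row per term, one column per play.
matrix = {
    term: [1 if term in terms_in_play[p] else 0 for p in plays]
    for term in vocabulary
}

for term, row in matrix.items():
    print(f"{term:<10}", *row)
```

The matrix is built once, in advance; queries then read rows instead of rescanning the plays.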
Boolean Retrieval: Incidence vectors
• So we have a 0/1 vector for each term.
• To answer the query: take the vectors for Brutus, Caesar and Calpurnia
(complemented) ➔ bitwise AND.
    110100   (Brutus)
AND 110111   (Caesar)
AND 101111   (NOT Calpurnia)
  = 100100
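The same bitwise computation can be checked with Python integers. This is a small sketch under the assumption that the leftmost bit stands for the first play in the table; the names `mask` and `matches` are illustrative.

```python
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# Incidence vectors from the table, leftmost bit = first play.
brutus = 0b110100
caesar = 0b110111
calpurnia = 0b010000

# Python's ~ yields a negative number, so mask the complement to six bits.
mask = (1 << len(plays)) - 1
not_calpurnia = ~calpurnia & mask          # 0b101111

answer = brutus & caesar & not_calpurnia   # 0b100100
bits = format(answer, "06b")
matches = [play for play, bit in zip(plays, bits) if bit == "1"]

print(bits)     # 100100
print(matches)  # ['Antony and Cleopatra', 'Hamlet']
```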
Answers to query
Antony and Cleopatra, Act III, Scene ii
Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring; and he wept
When at Philippi he found Brutus slain.

Hamlet, Act III, Scene ii
Lord Polonius: I did enact Julius Caesar: I was killed i’ the
Capitol; Brutus killed me.
Bigger collections
• Consider N = 1 million documents, each with about 1000 words.
• Avg 6 bytes/word including spaces/punctuation
❑ 6 GB of data in the documents.
• Say there are M = 500K distinct terms among these.
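The sizes implied by these figures can be worked out directly. The short Python sketch below only restates the slide's numbers; the variable names are assumptions, and GB is taken as 10^9 bytes.

```python
N = 1_000_000          # documents
words_per_doc = 1_000  # approximate words per document
bytes_per_word = 6     # average, including spaces/punctuation
M = 500_000            # distinct terms

corpus_bytes = N * words_per_doc * bytes_per_word
matrix_cells = M * N   # one 0/1 entry per (term, document) pair

print(corpus_bytes / 10**9, "GB")  # 6.0 GB
print(matrix_cells)                # 500000000000
```

A full term-document matrix over this collection would therefore need M x N = 5 x 10^11 entries.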
Can’t build the matrix
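• A 500K x 1M matrix has half a trillion 0’s and 1’s.
• But it has no more than one billion 1’s. Why?
❑ Each document has about 1000 words, so it contributes at most 1000 ones:
1M x 1000 = 1 billion.
❑ The matrix is extremely sparse.
• What’s a better representation?
❑ We only record the 1 positions.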
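Recording only the 1 positions can be sketched as a mapping from each term to the list of documents that contain it (such lists are commonly called postings lists in the IR literature). The Python below is a minimal illustration using the seven-term Shakespeare table; the names `incidence`, `postings`, and `hits` are assumptions for the sketch.

```python
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# Rows of the incidence matrix, leftmost bit = first play.
incidence = {
    "Antony": "110001", "Brutus": "110100", "Caesar": "110111",
    "Calpurnia": "010000", "Cleopatra": "100000",
    "mercy": "101111", "worser": "101110",
}

# Keep only the positions of the 1s: term -> sorted list of document ids.
postings = {
    term: [i for i, bit in enumerate(row) if bit == "1"]
    for term, row in incidence.items()
}

# Brutus AND Caesar AND NOT Calpurnia, evaluated with set operations.
all_docs = set(range(len(plays)))
hits = sorted(
    set(postings["Brutus"])
    & set(postings["Caesar"])
    & (all_docs - set(postings["Calpurnia"]))
)

stored = sum(len(p) for p in postings.values())
print(postings["Brutus"])          # [0, 1, 3]
print([plays[i] for i in hits])    # ['Antony and Cleopatra', 'Hamlet']
print(stored, "entries instead of", len(incidence) * len(plays))  # 22 entries instead of 42
```

On this tiny example the saving is small, but on the 500K x 1M matrix the stored entries drop from half a trillion to at most one billion.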
