IRWS Lecture 03 - Indexing and Transcrib

Information Retrieval and Web Search
Ahmed Olalekan
[email protected]
Lecture 03 – Indexing (Introduction)
10/08/2024 Information Retrieval and Web Search - IRWS - Griffith College 1

Recap
 What is the IR process?
 Preprocessing
 Gathering
 Tokenization
 Stopwords
 Stemming
 Indexing
 Retrieval
 Boolean Models
 Vector Space Models
 Probabilistic Models
 Evaluation

Today
 Overview of indexing
 Inverted index
 Processing Boolean queries
 Query optimisation

Definition of information retrieval
Information retrieval (IR) is finding material (usually documents) of an
unstructured nature (usually text) that satisfies an information need from within
large collections.

Boolean retrieval
 The Boolean model is arguably the simplest model to base an information
retrieval system on.
 Queries are Boolean expressions, e.g., information AND retrieval
 The search engine returns all documents that satisfy the Boolean expression.
 Question:
 Does Google use the Boolean model?

Inverted index

The central problem in search
Query terms (concepts): “tragic love story”
Document terms (concepts): “faithful star-crossed romance”
Do these represent the same concepts?

Abstract IR Architecture

How do we represent text?
 Remember: computers don’t “understand” anything!
 “Bag of words”
 Treat all the words in a document as index terms
 Assign a “weight” to each term based on “importance” (or, in simplest case,
presence/absence of word)
 Disregard order, structure, meaning, etc. of the words
 Simple, yet effective!
 Assumptions
 Term occurrence is independent
 Document relevance is independent
 “Words” are well-defined
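A minimal sketch of the bag-of-words idea, assuming a very simple tokenizer (lowercase, split on anything that is not a letter or digit). The function name bag_of_words and the sample sentence are illustrative only. The weight used here is the raw count of each term.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split it into word tokens, and count each term."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)

doc = "The quick brown fox jumps over the lazy dog. The dog sleeps."
bag = bag_of_words(doc)

# Weights are plain counts; a term that never occurs has weight 0.
print(bag["the"], bag["dog"], bag["cat"])  # 3 2 0

# Word order is disregarded: both sentences give the same bag.
print(bag_of_words("dog bites man") == bag_of_words("man bites dog"))  # True
```

The last line shows the cost of disregarding order: two sentences with different meanings become indistinguishable.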

What’s a word?

Sample Document
Bag of Words

Unstructured data
 A query: which plays of Shakespeare contain the words Brutus and Caesar,
but not Calpurnia?
 A solution: one could grep all of Shakespeare's plays for Brutus and Caesar, then
strip out lines containing Calpurnia.
 However, why is grep not the solution?
 Slow for large collections.
 grep is line-oriented; IR is document-oriented.
 “NOT Calpurnia” is non-trivial.
 Other operations (e.g., find the word Romans near Countryman) are not feasible.

Term-Document Incidence Matrix (TDIM)

Incidence vectors
 So we have a 0/1 vector for each term.
 To answer the query Brutus AND Caesar AND NOT Calpurnia:
 Take the vectors for Brutus, Caesar and Calpurnia
 Complement the vector of Calpurnia
 Do a (bitwise) AND on the three vectors
 110100 AND 110111 AND 101111 = 100100
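The bitwise computation above can be reproduced with Python integers. This is only a sketch: the three vectors are the ones on the slide (Calpurnia's vector 010000 is the complement of 101111), and the 6-bit mask is needed because Python integers have no fixed width.

```python
# Incidence vectors from the slide: one bit per play, leftmost bit = first play.
N_DOCS = 6
brutus = 0b110100
caesar = 0b110111
calpurnia = 0b010000

# Complement Calpurnia, keeping only the 6 bits that correspond to documents.
mask = (1 << N_DOCS) - 1
not_calpurnia = ~calpurnia & mask

result = brutus & caesar & not_calpurnia
print(format(not_calpurnia, "06b"))  # 101111
print(format(result, "06b"))         # 100100

# Positions (0-based, from the left) of the matching plays.
matching = [i for i, bit in enumerate(format(result, "06b")) if bit == "1"]
print(matching)  # [0, 3]
```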

0/1 vector

Bigger collections
 Consider N = 10^6 documents, each with about 1,000 tokens
 → total of 10^9 tokens
 On average 6 bytes per token, including spaces and punctuation
 → size of the document collection is about 6 × 10^9 bytes = 6 GB
 Assume there are M = 500,000 distinct terms in the collection
 → the incidence matrix has M × N = 500,000 × 10^6 = 0.5 trillion 0s and 1s

A Waste of Space and Effort
 But the matrix has no more than one billion 1s.
 Matrix is extremely sparse
 What is a better representation?
 We only record the 1s.
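The arithmetic on the last two slides can be checked in a few lines. The figure of 1,000 tokens per document is an inference from the stated total of 10^9 tokens over 10^6 documents; the variable names are illustrative.

```python
N = 10**6                # documents
tokens_per_doc = 1_000   # assumed average, inferred from the 10^9 token total
bytes_per_token = 6      # including spaces and punctuation
M = 500_000              # distinct terms

total_tokens = N * tokens_per_doc
collection_bytes = total_tokens * bytes_per_token
matrix_cells = M * N

print(total_tokens)      # 1000000000   (10^9 tokens)
print(collection_bytes)  # 6000000000   (about 6 GB)
print(matrix_cells)      # 500000000000 (0.5 trillion cells)

# Each token can set at most one cell to 1, so at most 10^9 cells are 1.
max_density = total_tokens / matrix_cells
print(max_density)       # 0.002
```

At most 0.2% of the cells can be 1, which is why storing only the 1s pays off.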

Reduce the number of Tokens
 Remember:
 Stopwords removal
 Stemming
 Use of more efficient data structure (inverted index)

Inverted Index: An example

Inverted Index
 The inverted index of a document collection is a data structure that
associates each distinct term with a list of all documents that contain the
term.
 Thus, in retrieval, looking up the documents that contain a query term takes
roughly constant time (e.g., with a hash-based dictionary).
 Multiple query terms are also easy to handle, as we will see soon.
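A minimal sketch of building such an index in memory, assuming a toy tokenizer and three made-up documents. A Python dict plays the role of the dictionary, and each value is a postings list of document IDs kept in sorted order.

```python
import re
from collections import defaultdict

def build_inverted_index(docs):
    """Map each distinct term to a sorted list of IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term].add(doc_id)  # a set records each document only once
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical mini-collection, for illustration only.
docs = {
    1: "Brutus killed Caesar",
    2: "Caesar praised Calpurnia",
    3: "Brutus and Calpurnia met Caesar",
}
index = build_inverted_index(docs)

print(index["brutus"])          # [1, 3]
print(index["caesar"])          # [1, 2, 3]
print(index.get("romans", []))  # []
```

Looking up a term is a single dict access, which is average-case constant time in Python.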

Processing Boolean Queries

Simple conjunctive query (two terms)
 Consider the query: BRUTUS AND CALPURNIA
 To find all matching documents using inverted index:
 Locate BRUTUS in the dictionary
 Retrieve its postings list from the postings file
 Locate CALPURNIA in the dictionary
 Retrieve its postings list from the postings file
 Intersect the two postings lists
 Return intersection to user

Intersect the two postings lists
 This is linear in the length of the postings lists.

 Note: This only works if postings lists are sorted.
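A sketch of the linear-time merge, assuming both postings lists are sorted by ascending document ID. The two example lists are illustrative, not taken from a real collection.

```python
def intersect(p1, p2):
    """Merge-intersect two postings lists sorted by ascending document ID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])  # document contains both terms
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                # advance the pointer at the smaller ID
        else:
            j += 1
    return answer

brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))  # [2, 31]
```

Each step advances at least one pointer, so the loop runs at most len(p1) + len(p2) times. If the lists were unsorted, advancing the pointer at the smaller ID could skip a match.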

Intersect the two postings lists

Query processing: Exercise
Compute hit list for [(paris AND NOT france) OR lear]
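One way to approach this kind of query is to treat each postings list as a set of document IDs. The postings below are hypothetical (the exercise's actual postings are not reproduced here), so the printed hit list only illustrates the method.

```python
# Hypothetical postings, for illustration only.
paris = {1, 3, 5, 8}
france = {3, 4, 8}
lear = {2, 5, 9}

# "x AND NOT y" is a set difference, so no complement over the whole
# collection is needed; OR is a set union.
hits = sorted((paris - france) | lear)
print(hits)  # [1, 2, 5, 9]
```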

Boolean Queries

Boolean queries
 The Boolean retrieval model can answer any query that is a Boolean
expression.
 Boolean queries are queries that use AND, OR and NOT to join query terms
 Views each document as a set of terms.
 Is precise: Document matches condition or not.
 Primary commercial retrieval tool for 3 decades
 Many professional searchers (e.g., lawyers) still like Boolean queries
 You know exactly what you are getting.
 Many search systems you use are also Boolean: email for instance.

Commercially successful Boolean retrieval: Westlaw
 A commercial legal search service owned by Thomson Reuters.
 Currently, Westlaw supports natural language and Boolean searches.
 The service was started in 1975.

Westlaw: Example queries
 Information need: Information on the legal theories involved in preventing the
disclosure of trade secrets by employees formerly employed by a competing
company
 Query: “trade secret” /s disclos! /s prevent /s employe!
 Information need: Requirements for disabled people to be able to access a
workplace
 Query: disab! /p access! /s work-site work-place (employment /3 place)
 Information need: Cases about a host’s responsibility for drunk guests
 Query: host! /p (responsib! liab!) /p intoxicat! drunk /p guest

Westlaw: Comments
 Proximity operators: /3 = within 3 words, /s = within a sentence, /p = within a
paragraph
 Space is disjunction, not conjunction! (This was the default in search
pre-Google.)
 Long, precise queries: incrementally developed, not like web search
 Why do professional searchers often like Boolean search?
 Precision, transparency, control
 When are Boolean queries the best way of searching?
 Depends on: information need, searcher, document collection, ...

Query Optimisation

Overview
 Consider a query that is an AND of n terms, n > 2
 For each of the terms, get its postings list, then AND them together
 Example query: BRUTUS AND CALPURNIA AND CAESAR
 What is the best order for processing this query?

Query optimization - Example
 Example query: BRUTUS AND CALPURNIA AND CAESAR
 Simple and effective optimization: Process in order of increasing frequency
 Start with the shortest postings list, then keep cutting further
 In this example, first CAESAR, then CALPURNIA, then BRUTUS
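A sketch of this heuristic, assuming sorted postings lists. The postings are illustrative and were chosen so that their lengths give the processing order named on the slide; intersect_many is a hypothetical helper name.

```python
def intersect(p1, p2):
    """Merge-intersect two postings lists sorted by ascending document ID."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

def intersect_many(postings_lists):
    """AND several postings lists together, shortest list first."""
    if not postings_lists:
        return []
    ordered = sorted(postings_lists, key=len)
    result = ordered[0]
    for postings in ordered[1:]:
        if not result:  # nothing left to match, stop early
            break
        result = intersect(result, postings)
    return result

# Illustrative postings only.
postings = {
    "BRUTUS": [1, 2, 4, 11, 31, 45, 173, 174],
    "CALPURNIA": [2, 31, 54, 101],
    "CAESAR": [5, 31],
}
order = sorted(postings, key=lambda term: len(postings[term]))
print(order)                                    # ['CAESAR', 'CALPURNIA', 'BRUTUS']
print(intersect_many(list(postings.values())))  # [31]
```

The intermediate result can never be longer than the shortest list, so starting there keeps every later merge small.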

Optimized intersection algorithm for conjunctive queries

More general optimization
 Example query: (MADDING OR CROWD) AND (IGNOBLE OR STRIFE)
 Get frequencies for all terms
 Estimate the size of each OR by the sum of its frequencies (conservative)
 Process in increasing order of OR sizes
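A sketch of the ordering step, with made-up document frequencies. The query is represented as a list of OR-groups that are ANDed together; both that representation and the name or_size_estimate are assumptions for illustration.

```python
# Hypothetical document frequencies, for illustration only.
df = {"MADDING": 10, "CROWD": 120, "IGNOBLE": 5, "STRIFE": 40}

# (MADDING OR CROWD) AND (IGNOBLE OR STRIFE) as a list of OR-groups.
query = [("MADDING", "CROWD"), ("IGNOBLE", "STRIFE")]

def or_size_estimate(group):
    """Upper bound on the size of an OR: the sum of its terms' frequencies."""
    return sum(df[term] for term in group)

ordered = sorted(query, key=or_size_estimate)
print([or_size_estimate(group) for group in query])  # [130, 45]
print(ordered)  # [('IGNOBLE', 'STRIFE'), ('MADDING', 'CROWD')]
```

The sum is conservative because a document containing both terms of a group is counted twice, so the real union can only be smaller.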

Term Frequency Inverse Document Frequency (TF.IDF)

Positional Indexes
 Store term position in postings
 Supports richer queries (e.g., proximity)
 Naturally, leads to larger indexes…
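A minimal sketch of a positional index and a proximity check, assuming a toy tokenizer and two made-up documents. The names build_positional_index and within_k are hypothetical, and the position comparison is a simple nested loop rather than an optimized merge.

```python
import re
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {doc_id: [positions]}, using 0-based token positions."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def within_k(index, t1, t2, k):
    """IDs of documents where t1 and t2 occur at most k tokens apart."""
    docs1, docs2 = index.get(t1, {}), index.get(t2, {})
    hits = []
    for doc_id in sorted(docs1.keys() & docs2.keys()):
        if any(abs(p1 - p2) <= k
               for p1 in docs1[doc_id] for p2 in docs2[doc_id]):
            hits.append(doc_id)
    return hits

# Hypothetical documents, for illustration only.
docs = {
    1: "friends romans countrymen lend me your ears",
    2: "the romans marched while distant countrymen watched",
}
index = build_positional_index(docs)
print(index["romans"])                             # {1: [1], 2: [1]}
print(within_k(index, "romans", "countrymen", 1))  # [1]
print(within_k(index, "romans", "countrymen", 4))  # [1, 2]
```

Every occurrence of every term is now stored, not just one entry per document, which is where the extra index size comes from.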

Positional Indexes

Retrieval: Document-at-a-Time
 Evaluate documents one at a time (score all query terms)
 Tradeoffs
 Small memory footprint (good)
 Must read through all postings (bad), but skipping possible
 More disk seeks (bad), but blocking possible
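A rough sketch of document-at-a-time scoring, under several assumptions not stated on the slide: each posting is a (doc_id, weight) pair sorted by doc_id, a document's score is the sum of its term weights, and only the top k results are kept in a heap. The weights are made up, and skipping and blocking are not shown.

```python
import heapq

def document_at_a_time(postings, query_terms, k):
    """Score one document at a time across all query terms; keep the top k."""
    lists = [postings.get(term, []) for term in query_terms]
    pointers = [0] * len(lists)
    top = []  # min-heap of (score, doc_id), never larger than k after a step
    while True:
        heads = [lst[p][0] for lst, p in zip(lists, pointers) if p < len(lst)]
        if not heads:
            break
        doc_id = min(heads)  # next unscored document
        score = 0
        for i, lst in enumerate(lists):
            p = pointers[i]
            if p < len(lst) and lst[p][0] == doc_id:
                score += lst[p][1]
                pointers[i] += 1
        heapq.heappush(top, (score, doc_id))
        if len(top) > k:
            heapq.heappop(top)  # drop the current lowest score
    ranked = sorted(top, key=lambda pair: (-pair[0], pair[1]))
    return [(doc_id, score) for score, doc_id in ranked]

# Hypothetical postings with made-up weights.
postings = {
    "brutus": [(1, 2), (2, 1), (4, 3)],
    "caesar": [(1, 1), (4, 1), (5, 2)],
}
print(document_at_a_time(postings, ["brutus", "caesar"], 2))  # [(4, 4), (1, 3)]
```

Only the k best scores are held at any time, which is the small memory footprint mentioned above. Every posting is still visited once.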

Retrieval: Query-at-a-Time (also called Term-at-a-Time)
 Evaluate documents one query term at a time
 Usually, starting from the rarest term (often with tf-sorted postings)
 Tradeoffs
 Early termination heuristics (good)
 Large memory footprint (bad), but filtering heuristics possible
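A rough sketch of the query-at-a-time strategy above (often called term-at-a-time), under assumptions not stated on the slide: each posting is a (doc_id, weight) pair, a document's score is the sum of its term weights, and partial scores are kept in one accumulator per document. The weights are made up, and early termination and filtering heuristics are not shown.

```python
from collections import defaultdict

def term_at_a_time(postings, query_terms, k):
    """Process one query term's postings list at a time, rarest term first."""
    accumulators = defaultdict(int)  # doc_id -> partial score
    by_rarity = sorted(query_terms, key=lambda t: len(postings.get(t, [])))
    for term in by_rarity:
        for doc_id, weight in postings.get(term, []):
            accumulators[doc_id] += weight
    ranked = sorted(accumulators.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]

# Hypothetical postings with made-up weights.
postings = {
    "brutus": [(1, 2), (2, 1), (4, 3)],
    "caesar": [(1, 1), (4, 1), (5, 2)],
}
print(term_at_a_time(postings, ["brutus", "caesar"], 2))  # [(4, 4), (1, 3)]
```

The accumulator table holds an entry for every document that matches any query term, which is the large memory footprint noted above.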

Summary
 Overview of indexing
 Inverted index
 Processing Boolean queries
 Query optimisation
