IRWS Lecture 03 - Indexing and Transcrib


Information Retrieval and Web Search

Ahmed Olalekan

Lecture 03 – Indexing (Introduction)

10/08/2024 Information Retrieval and Web Search - IRWS - Griffith College 1


Recap
 What is the IR process?
 Preprocessing
 Gathering
 Tokenization
 Stopwords
 Stemming
 Indexing
 Retrieval
 Boolean Models
 Vector Space Models
 Probabilistic Models
 Evaluation

Today
 Overview of indexing
 Inverted index
 Processing Boolean queries
 Query optimisation



Definition of information retrieval

Information retrieval (IR) is finding material (usually documents) of an
unstructured nature (usually text) that satisfies an information need from within
large collections.



Boolean retrieval
 The Boolean model is arguably the simplest model to base an information
retrieval system on.
 Queries are Boolean expressions, e.g., information AND retrieval
 The search engine returns all documents that satisfy the Boolean expression.

 Question:
 Does Google use the Boolean model?



Inverted index



The central problem in search

Concepts expressed as query terms: “tragic love story”
Concepts expressed as document terms: “faithful star-crossed romance”

Do these represent the same concepts?



Abstract IR Architecture



How do we represent text?
 Remember: computers don’t “understand” anything!
 “Bag of words”
 Treat all the words in a document as index terms
 Assign a “weight” to each term based on “importance” (or, in simplest case,
presence/absence of word)
 Disregard order, structure, meaning, etc. of the words
 Simple, yet effective!
 Assumptions
 Term occurrence is independent
 Document relevance is independent
 “Words” are well-defined
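
As a rough illustration of the bag-of-words idea above, the following Python sketch counts the words of a document and discards their order. The tokenisation rule (lowercase, alphabetic runs only), the function name `bag_of_words`, and the sample sentence are assumptions made for this example, not part of the lecture.

```python
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text, split it into alphabetic tokens, and count them."""
    # Assumption: a "word" is a maximal run of the letters a-z.
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


doc = "To be, or not to be: that is the question."
bag = bag_of_words(doc)   # term -> count, one simple choice of "weight"
presence = set(bag)       # simplest case: presence/absence of each word

print(bag["to"], bag["be"], bag["question"])  # 2 2 1
```

Because order is ignored, `bag_of_words("b a")` and `bag_of_words("a b")` produce the same bag.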



What’s a word?



Sample Document

Bag of Words



Unstructured data
 A query: which plays of Shakespeare contain the words Brutus and Caesar,
but not Calpurnia?
 A solution: one can grep all of Shakespeare's plays for Brutus and Caesar, then
strip out lines containing Calpurnia

 However, why is grep not the solution?

 Slow for large collections.
 grep is line-oriented, IR is document-oriented.
 “NOT Calpurnia” is non-trivial
 Other operations (e.g. find the word Romans near Countryman) are not feasible



Term-Document Incidence Matrix (TDIM)



Incidence vectors
 So we have a 0/1 vector for each term.
 To answer the query Brutus AND Caesar AND NOT Calpurnia:
 Take the vectors for Brutus, Caesar and Calpurnia
 Complement the vector of Calpurnia
 Do a (bitwise) AND on the three vectors
 110100 AND 110111 AND 101111 = 100100
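
The bitwise computation above can be checked with a short Python sketch. The Brutus and Caesar vectors are taken from the slide; the Calpurnia vector 010000 is inferred from its complement 101111, and the variable names are assumptions made for this example.

```python
# Incidence vectors, one bit per play (leftmost bit = first play).
brutus = 0b110100
caesar = 0b110111
calpurnia = 0b010000  # inferred: its complement is the 101111 on the slide

N_DOCS = 6
MASK = (1 << N_DOCS) - 1  # keeps the complement within 6 bits

not_calpurnia = ~calpurnia & MASK
result = brutus & caesar & not_calpurnia
result_bits = format(result, f"0{N_DOCS}b")

print(result_bits)  # 100100
```

The mask is needed because Python integers are unbounded, so `~calpurnia` alone would be a negative number rather than a 6-bit vector.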



0/1 vector



Bigger collections
 Consider N = 10^6 documents, each with about 1,000 tokens
 → total of 10^9 tokens
 On average 6 bytes per token, including spaces and punctuation
 → size of the document collection is about 6 × 10^9 bytes = 6 GB
 Assume there are M = 500,000 distinct terms in the collection
 → the term-document matrix has M × N = 500,000 × 10^6 = 0.5 trillion 0s and 1s
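
The arithmetic on this slide can be reproduced in a few lines of Python. This sketch assumes about 1,000 tokens per document, which is the value implied by the stated totals; the variable names are chosen for this example only.

```python
N = 10**6                # number of documents
tokens_per_doc = 1_000   # assumption: implied by 10^9 tokens / 10^6 documents
bytes_per_token = 6      # average, including spaces and punctuation
M = 500_000              # distinct terms

total_tokens = N * tokens_per_doc                  # 10^9 tokens
collection_bytes = total_tokens * bytes_per_token  # 6 * 10^9 bytes, about 6 GB
matrix_cells = M * N                               # 5 * 10^11 = 0.5 trillion cells

print(total_tokens, collection_bytes, matrix_cells)
```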



A Waste of Space and Effort
 But the matrix has no more than one billion 1s.
 Matrix is extremely sparse
 What is a better representation?
 We only record the 1s.



Reduce the number of Tokens
 Remember:
 Stopwords removal
 Stemming
 Use of more efficient data structure (inverted index)



Inverted Index: An example



Inverted Index
 The inverted index of a document collection is a data structure that
associates each distinct term with a list of all documents that contain the
term.
 Thus, in retrieval, a dictionary lookup (roughly constant time with a hash
table) finds the documents that contain a query term.
 Multiple query terms are also easy to handle, as we will see soon.
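
A minimal sketch of such a structure, assuming whitespace tokenisation and a Python dictionary as the term dictionary. The function name `build_inverted_index` and the three toy documents are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs: dict[int, str]) -> dict[str, list[int]]:
    """Map each distinct term to the sorted list of doc IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():  # assumption: split on whitespace
            postings[term].add(doc_id)
    # Sorted postings lists make linear-time merging possible later on.
    return {term: sorted(ids) for term, ids in postings.items()}


docs = {
    1: "brutus killed caesar",
    2: "caesar married calpurnia",
    3: "brutus and caesar were romans",
}
index = build_inverted_index(docs)

print(index["caesar"])  # [1, 2, 3]
print(index["brutus"])  # [1, 3]
```

Only the 1s of the incidence matrix are stored: a term that occurs in one document costs one posting, not one cell per document.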



Processing Boolean Queries



Simple conjunctive query (two terms)
 Consider the query: BRUTUS AND CALPURNIA
 To find all matching documents using the inverted index:
 Locate BRUTUS in the dictionary
 Retrieve its postings list from the postings file
 Locate CALPURNIA in the dictionary
 Retrieve its postings list from the postings file
 Intersect the two postings lists
 Return the intersection to the user



Intersect the two postings lists

 This is linear in the total length of the two postings lists.


 Note: This only works if postings lists are sorted.
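
The linear-time intersection can be sketched as a merge of two sorted lists. The function name `intersect` and the sample postings are assumptions for this example; the original slide showed the lists as a figure that is not reproduced here.

```python
def intersect(p1: list[int], p2: list[int]) -> list[int]:
    """Intersect two postings lists that are sorted by doc ID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])  # doc ID present in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                # advance the pointer with the smaller doc ID
        else:
            j += 1
    return answer


# Hypothetical postings lists, already sorted.
brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]

print(intersect(brutus, calpurnia))  # [2, 31]
```

Each step advances at least one pointer, so the loop runs at most len(p1) + len(p2) times. With unsorted lists the "advance the smaller" rule would skip matches.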



Intersect the two postings lists



Query processing: Exercise

Compute hit list for [(paris AND NOT france) OR lear]
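
One way to check an answer to this kind of exercise is to treat postings as sets. The postings in this sketch are made up for illustration (the lists from the original slide are not available here), so its output is not the solution to the exercise itself.

```python
# Hypothetical postings (doc IDs) for the three terms.
postings = {
    "paris": {1, 3, 5, 7},
    "france": {3, 7, 9},
    "lear": {2, 5, 10},
}

# (paris AND NOT france) is a set difference; OR is a set union.
paris_not_france = postings["paris"] - postings["france"]
hits = sorted(paris_not_france | postings["lear"])

print(hits)  # [1, 2, 5, 10]
```

Writing AND NOT as a difference avoids materialising the complement of the france postings, which would contain almost every document in a large collection.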



Boolean Queries



Boolean queries
 The Boolean retrieval model can answer any query that is a Boolean
expression.
 Boolean queries are queries that use AND, OR and NOT to join query terms
 The model views each document as a set of terms.
 It is precise: a document either matches the condition or it does not.
 Primary commercial retrieval tool for 3 decades
 Many professional searchers (e.g., lawyers) still like Boolean queries
 You know exactly what you are getting.
 Many search systems you use are also Boolean: email for instance.



Commercially successful Boolean retrieval: Westlaw
 A commercial legal search service owned by Thomson Reuters.
 Currently, Westlaw supports natural language and Boolean searches.
 The service was started in 1975.



Westlaw: Example queries
 Information need: Information on the legal theories involved in preventing the
disclosure of trade secrets by employees formerly employed by a competing
company
 Query: “trade secret” /s disclos! /s prevent /s employe!
 Information need: Requirements for disabled people to be able to access a
workplace
 Query: disab! /p access! /s work-site work-place (employment /3 place)
 Information need: Cases about a host’s responsibility for drunk guests
 Query: host! /p (responsib! liab!) /p intoxicat! drunk /p guest



Westlaw: Comments
 Proximity operators: /3 = within 3 words, /s = within a sentence, /p = within a
paragraph
 Space is disjunction, not conjunction! (This was the default in search
pre-Google.)
 Long, precise queries: incrementally developed, not like web search
 Why do professional searchers often like Boolean search?
 Precision, transparency, control
 When are Boolean queries the best way of searching?
 Depends on: information need, searcher, document collection, ...



Query Optimisation



Overview
 Consider a query that is an AND of n terms, n > 2
 For each of the terms, get its postings list, then AND them together
 Example query: BRUTUS AND CALPURNIA AND CAESAR
 What is the best order for processing this query?



Query optimization - Example
 Example query: BRUTUS AND CALPURNIA AND CAESAR
 Simple and effective optimization: process terms in order of increasing
document frequency
 Start with the shortest postings list, then keep cutting further
 In this example, first CAESAR, then CALPURNIA, then BRUTUS
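
A minimal sketch of this ordering heuristic: sort the query terms by postings-list length, then intersect from shortest to longest. The helper names and the sample postings are assumptions; the lists were chosen so that CAESAR is shortest and BRUTUS longest, matching the order stated above.

```python
def intersect(p1: list[int], p2: list[int]) -> list[int]:
    """Merge-intersect two postings lists sorted by doc ID."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


def processing_order(index: dict[str, list[int]], terms: list[str]) -> list[str]:
    """Order terms by increasing document frequency (postings-list length)."""
    return sorted(terms, key=lambda t: len(index.get(t, [])))


def intersect_many(index: dict[str, list[int]], terms: list[str]) -> list[int]:
    """AND together the postings of all terms, shortest list first."""
    if not terms:
        return []
    ordered = processing_order(index, terms)
    result = list(index.get(ordered[0], []))
    for term in ordered[1:]:
        if not result:  # the intermediate result can only shrink
            break
        result = intersect(result, index.get(term, []))
    return result


# Hypothetical postings lists.
index = {
    "BRUTUS": [1, 2, 4, 11, 31, 45, 173, 174],
    "CALPURNIA": [2, 31, 54, 101],
    "CAESAR": [5, 31],
}
query = ["BRUTUS", "CALPURNIA", "CAESAR"]

print(processing_order(index, query))  # ['CAESAR', 'CALPURNIA', 'BRUTUS']
print(intersect_many(index, query))    # [31]
```

Starting with the shortest list keeps every intermediate result no longer than that list, so later merges touch fewer candidates.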



Optimized intersection algorithm for conjunctive queries



More general optimization
 Example query: (MADDING OR CROWD) AND (IGNOBLE OR STRIFE)
 Get frequencies for all terms
 Estimate the size of each OR by the sum of its terms' frequencies (a
conservative upper bound)
 Process in increasing order of OR sizes
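
The estimate can be sketched as follows. The document frequencies are invented for illustration, and the function names are assumptions for this example.

```python
def or_size_estimate(freq: dict[str, int], group: tuple[str, ...]) -> int:
    """Upper bound on the size of an OR: the sum of its terms' frequencies."""
    return sum(freq.get(term, 0) for term in group)


def order_or_groups(freq, groups):
    """Order OR groups by increasing estimated size, for AND-ing smallest first."""
    return sorted(groups, key=lambda g: or_size_estimate(freq, g))


# Hypothetical document frequencies.
freq = {"madding": 10, "crowd": 5000, "ignoble": 40, "strife": 300}
groups = [("madding", "crowd"), ("ignoble", "strife")]

ordered = order_or_groups(freq, groups)
print(ordered)  # [('ignoble', 'strife'), ('madding', 'crowd')]
```

The sum is conservative because documents containing both terms of an OR are counted twice; the true union can only be smaller.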



Term Frequency - Inverse Document Frequency (TF.IDF)



Positional Indexes
 Store term position in postings
 Supports richer queries (e.g., proximity)
 Naturally, leads to larger indexes…
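
A small sketch of a positional index and a "within k words" proximity check, assuming whitespace tokenisation and zero-based word positions. The function names and the two toy documents are hypothetical.

```python
from collections import defaultdict


def build_positional_index(docs: dict[int, str]):
    """Map term -> doc ID -> list of word positions (0-based, ascending)."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index


def within_k(index, t1: str, t2: str, k: int) -> list[int]:
    """Doc IDs where some occurrence of t1 is at most k words from t2."""
    docs1, docs2 = index.get(t1, {}), index.get(t2, {})
    hits = []
    for doc_id in sorted(docs1.keys() & docs2.keys()):
        # Brute-force comparison of positions; fine for a small illustration.
        if any(abs(p1 - p2) <= k for p1 in docs1[doc_id] for p2 in docs2[doc_id]):
            hits.append(doc_id)
    return hits


docs = {
    1: "friends romans countrymen lend me your ears",
    2: "the romans marched while distant countrymen watched",
}
pos_index = build_positional_index(docs)

print(within_k(pos_index, "romans", "countrymen", 1))  # [1]
print(within_k(pos_index, "romans", "countrymen", 4))  # [1, 2]
```

Every occurrence now costs one stored position instead of one doc ID per document, which is why the index grows.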



Positional Indexes



Retrieval: Document-at-a-Time
 Evaluate documents one at a time (score all query terms)

 Tradeoffs
 Small memory footprint (good)
 Must read through all postings (bad), but skipping possible
 More disk seeks (bad), but blocking possible
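
A simplified sketch of document-at-a-time evaluation. It assumes postings are lists of (doc ID, term frequency) pairs sorted by doc ID, that a document's score is the sum of the term frequencies of the matching query terms, and that only the top k results are kept. The scoring rule, function name, and data are assumptions for illustration; skipping and blocking are not shown.

```python
import heapq


def daat_top_k(postings, query_terms, k):
    """Score one document at a time by walking all postings lists in parallel."""
    lists = [postings.get(t, []) for t in query_terms]
    cursors = [0] * len(lists)
    heap = []  # min-heap of (score, doc_id); never holds more than k entries
    while True:
        current = [lst[c][0] for lst, c in zip(lists, cursors) if c < len(lst)]
        if not current:
            break
        doc_id = min(current)  # next document in doc-ID order
        score = 0
        for i, lst in enumerate(lists):
            c = cursors[i]
            if c < len(lst) and lst[c][0] == doc_id:
                score += lst[c][1]  # add this term's contribution
                cursors[i] += 1
        heapq.heappush(heap, (score, doc_id))
        if len(heap) > k:
            heapq.heappop(heap)  # drop the lowest-scoring candidate
    return sorted(heap, key=lambda pair: (-pair[0], pair[1]))


# Hypothetical postings: term -> [(doc_id, tf), ...] sorted by doc_id.
postings = {
    "brutus": [(1, 2), (4, 1)],
    "caesar": [(1, 1), (2, 4), (4, 1)],
}

print(daat_top_k(postings, ["brutus", "caesar"], 2))  # [(4, 2), (3, 1)]
```

Each result is a (score, doc ID) pair. Memory stays small because only the cursors and a k-entry heap are held at any time, but every posting of every query term is read.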



Retrieval: Query-at-a-Time
 Evaluate documents one query term at a time
 Usually, starting from the rarest term (often with tf-sorted postings)

 Tradeoffs
 Early termination heuristics (good)
 Large memory footprint (bad), but filtering heuristics possible
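
For comparison, a simplified sketch of this strategy, which is often called term-at-a-time. It uses the same assumed data layout as before: postings are (doc ID, term frequency) pairs and the score is the sum of term frequencies. The function name and data are hypothetical, and early termination and filtering heuristics are not implemented.

```python
from collections import defaultdict


def taat_scores(postings, query_terms):
    """Process one query term at a time, accumulating partial document scores."""
    # Rarest term first, i.e. the term with the shortest postings list.
    ordered = sorted(query_terms, key=lambda t: len(postings.get(t, [])))
    accumulators = defaultdict(int)  # doc_id -> partial score
    for term in ordered:
        for doc_id, tf in postings.get(term, []):
            accumulators[doc_id] += tf
    # Highest score first; ties broken by doc ID.
    return sorted(accumulators.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical postings: term -> [(doc_id, tf), ...].
postings = {
    "brutus": [(1, 2), (4, 1)],
    "caesar": [(1, 1), (2, 4), (4, 1)],
}

print(taat_scores(postings, ["brutus", "caesar"]))  # [(2, 4), (1, 3), (4, 2)]
```

Each result is a (doc ID, score) pair. The accumulator table holds an entry for every document that matches any term, which is the large memory footprint mentioned above.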

Summary
 Overview of indexing
 Inverted index
 Processing Boolean queries
 Query optimisation

