
Information Retrieval Detailed Lecture Nov 2023

This document provides an overview of information retrieval, including definitions, terminology, data types, background, logical views of documents, retrieval vs filtering models, and the Boolean model. It defines information retrieval as dealing with representation, organization, storage, and access of unstructured information items like text. Key concepts covered include structured vs unstructured vs semi-structured data, indexing documents with keywords or terms, different logical views of documents, high-level retrieval system architectures, and classic retrieval models like Boolean, vector, and probabilistic.

Advanced Databases
Information Retrieval
Dr David Hamill
Overview

• Definitions for Information Retrieval (IR)
• IR terminology
• Data/record types
• IR background
• Logical Views of documents
• Retrieval vs Filtering
• IR models
• The Boolean Model
• Inverted Index
• Web crawling
Introduction

Data vs Information
Data are raw facts.
Information comes when data is processed, organized, and structured in some way.
Data becomes information when it is given context and meaning.

For example, look at these numbers: 2, 3, 5, 7, 11, 13, 17, 19

By itself, without being presented in a context, this list of numbers has no implied meaning: it is a set of data.

‘The set of prime numbers less than 20 appears in the list above’. When described in this manner we have some information.
Introduction
• Information is something that:
• Is represented by a set of symbols
• Has some structure
• Can be read and to some extent understood by users of information
• Information retrieval (IR) involves finding material (often documents) of an unstructured nature (often text) that satisfies an information need from within large collections (usually stored on computers). (An Introduction to Information Retrieval, Cambridge University Press)
• Information retrieval (IR) is the scientific discipline that deals with the analysis, design, and implementation of computerized systems that address the representation, organization and access to large amounts of heterogeneous information encoded in digital format. (Rijsbergen, C.J., Information Retrieval, Butterworths, London, 1979)
• Information retrieval (IR) deals with the representation, storage, organization of, and access to information items. (Modern Information Retrieval)
Introduction – More definitions
• IR refers to the retrieval of unstructured records:
• Free-form natural language text predominantly.
• Can also include other types of unstructured data:
• Images
• Sound
• Video
• User Information Need: a natural language declaration of the information need of a user.
• For example: “find documents which discuss the political implications of the Monica Lewinsky scandal in the results of the 1998 elections for the US congress”
• To meet an information need it is often necessary to translate the requirement into a query that can be processed by the IR system.
• This query is often composed of keywords/index terms summarising the user information need.
Introduction – Terminology
• Documents: the records that IR
systems often process.
• Collection: an organised
repository used by IR systems
to retrieve documents.
• Archive, corpus, digital library are
terms also used in this context.
• Documents that satisfy a query
in the judgement of the user are
said to be relevant.
• The emphasis for IR is
the retrieval of
information as opposed
to data.
Introduction – Terminology
• Ranking: an established order of the documents retrieved.
• IR systems must rank information items according to a degree of relevance to the user.
• The IR Problem: retrieve all items relevant to a user query, while retrieving as few non-relevant items as possible.
Types of data
• Structured records: consist of named components that are organised according to a well-defined syntax:
• Each component of a
record will have a
definite meaning and a
specific type.
• E.g. Relational Database
table records.
Types of data
• Unstructured records: do not
have a well-defined syntax.
• No well-defined meaning is attributed to each syntactic component.
• E.g. emails, chapters
from books, reviews,
audio etc.
Types of data
• Semi-Structured records: follow a general standard form but have no enforced data model.
• E.g. JSON or XML documents, as commonly stored in NoSQL databases. They can contain tagged fields but there is no enforcement of a particular schema.
IR Background

• Early goals: indexing text and searching for useful documents in a collection.
• Modern research: modelling (sentiment analysis of text), web search, text classification, user interfaces, data visualization, filtering, and languages.
• Libraries were among the first institutions to adopt IR systems.
• Initially, these consisted of an automation of existing processes like card cataloguing for searching.
• Increased functionality was later added, including subject headings, keywords, and query operators.
IR Background

• Until recently IR was mainly of interest to librarians and information experts.
• The element that changed this was the introduction of the web, the largest repository of knowledge in human history.
• Because of its enormous size, finding information on the web requires running searches.
Logical View of Documents
• Documents in a collection are often represented through a set of
index terms or keywords:
• Can be extracted automatically or manually.
• IR systems can adopt different logical views of documents:
• Full text
• Representative keywords:
• Eliminate stopwords: words that occur frequently in text documents. Examples: articles
(a, an, the), prepositions (in, at, on, of, to), and conjunctions (and, or, but, if, when).
• Stemming: a technique for reducing words to their grammatical roots.
• Identification of noun groups: eliminate adjectives, adverbs, and verbs.
• These text operations reduce the complexity of the document representation
and allow moving the logical view from full text to a set of index terms.
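The text operations above can be sketched in a few lines of Python. This is only an illustration: the stopword list is taken from the examples on this slide, and `naive_stem` is a hypothetical suffix stripper standing in for a real stemming algorithm (it is not Porter's stemmer and will mis-stem many words).

```python
import re

# Stopwords taken from the examples on the slide (an illustrative list only).
STOPWORDS = {"a", "an", "the", "in", "at", "on", "of", "to",
             "and", "or", "but", "if", "when"}

# Suffixes removed by the naive stemmer below (an assumption for this sketch).
SUFFIXES = ("ing", "ed", "s")


def naive_stem(word):
    """Strip one common suffix, provided at least three letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Move from the full-text view to a sorted set of index terms."""
    tokens = re.findall(r"[a-z]+", text.lower())        # tokenize
    kept = [t for t in tokens if t not in STOPWORDS]    # eliminate stopwords
    return sorted({naive_stem(t) for t in kept})        # stem and de-duplicate


terms = index_terms("The crawler fetched pages and indexed the links on the pages")
print(terms)  # ['crawler', 'fetch', 'index', 'link', 'page']
```

Eleven words of full text are reduced to five index terms, which is the reduction in representation complexity the slide describes.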
Logical View of Documents
High-Level Architecture of an IR System
The Retrieval Process
• Retrieval is the “matching” process between document keywords and
words in queries.
Retrieval vs Filtering
• A distinction between ad-hoc
retrieval and filtering is often
made.
• Ad-hoc retrieval refers to the
application of arbitrary
queries to a fixed collection
of documents:
• Static documents, new
queries.
• Formerly called
retrospective retrieval
Retrieval vs Filtering
• Filtering refers to having a
fixed number of queries that
are applied to a stream of
changing documents:
• Static queries, new documents
• Can be based on a ‘user profile’
• The documents are classified
according to which query they
most closely match and routed
accordingly.
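A minimal sketch of filtering, assuming each standing query (user profile) and each incoming document is just a set of terms, and that "most closely match" means the largest term overlap. The profile names and documents are made up for illustration.

```python
def route(document_terms, profiles):
    """Return the name of the profile sharing the most terms with the
    document, or None when no profile shares any term."""
    best_name, best_overlap = None, 0
    for name, query_terms in profiles.items():
        overlap = len(document_terms & query_terms)
        if overlap > best_overlap:
            best_name, best_overlap = name, overlap
    return best_name


# Static queries (hypothetical user profiles).
profiles = {
    "sports": {"football", "match", "goal"},
    "finance": {"market", "shares", "bank"},
}

# A stream of new documents, already reduced to sets of terms.
stream = [
    {"late", "goal", "wins", "match"},
    {"bank", "shares", "fall"},
    {"weather", "rain"},
]

routed = [route(doc, profiles) for doc in stream]
print(routed)  # ['sports', 'finance', None]
```

The queries stay fixed while the documents change, which is the reverse of ad-hoc retrieval.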
Information Retrieval Models
• Modelling in IR is a complex process aimed at producing a ranking
function.
• Ranking function: a function that assigns scores to documents with regard to a given query.
• Modelling consists of two main tasks:
• The conception of a logical framework for representing documents and queries.
• The definition of a ranking function that allows quantifying the similarities among
documents and queries.
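As a rough illustration of the two tasks, the sketch below assumes the simplest possible framework (documents and queries are sets of index terms) and a hypothetical ranking function that scores a document by the number of query terms it contains. Real systems use more refined scores; the collection here is invented.

```python
def score(query_terms, doc_terms):
    """Ranking function: number of query terms that occur in the document."""
    return len(query_terms & doc_terms)


def rank(query_terms, collection):
    """Return document ids ordered by descending score; zero scores are dropped."""
    scored = [(score(query_terms, terms), doc_id)
              for doc_id, terms in collection.items()]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for s, doc_id in ordered if s > 0]


# Hypothetical collection: each document is represented by its index terms.
collection = {
    "d1": {"information", "retrieval", "model"},
    "d2": {"boolean", "model"},
    "d3": {"web", "crawling"},
}

ranking = rank({"retrieval", "model"}, collection)
print(ranking)  # ['d1', 'd2']
```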
Information Retrieval Models
• IR systems usually adopt index terms to index and retrieve
documents.
• Index term:
• Restrictively: a keyword that has some meaning on its own; usually plays the
role of a noun.
• Generally: any word that appears in a document
• Retrieval based on index terms can be implemented efficiently (see
indexing lecture).
• Index terms are simple to refer to in a query.
• Simplicity is important because it reduces the effort of query formulation.
Information Retrieval Models
• A ranking is an ordering of the documents that (hopefully) reflects their relevance to a user query.
• Any IR system has to deal with the problem of predicting which documents users will find relevant.
• This problem naturally embodies a degree of uncertainty and vagueness.
Information Retrieval Models
• Three classic models:
• Boolean model: documents and queries are sets of index terms.
• Set theoretic
• Vector model: documents and queries exist in N-dimensional
space.
• Algebraic
• Probabilistic model: based on probability theory.
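To make the vector model a little more concrete, here is a small sketch in which each document and query is a vector of raw term counts, one dimension per term, compared by cosine similarity. Raw counts are an assumption made for brevity; the slides do not specify the weighting, and practical systems usually use weighted terms.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical query and documents.
query = Counter("boolean model".split())
doc1 = Counter("the boolean model uses boolean algebra".split())
doc2 = Counter("web crawling".split())

sim1 = cosine(query, doc1)   # about 0.75: the vectors share two terms
sim2 = cosine(query, doc2)   # 0.0: no terms in common
```

Unlike the Boolean model, the similarity is graded, so documents can be ranked by it.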
The Boolean
Model
• The Boolean retrieval model is a
model for information retrieval
which can pose any query in the
form of a Boolean expression of
terms.
• Terms are combined with the
operators AND, OR, and NOT.
• Based on set theory and Boolean
algebra.
• The model views each document
as a set of words
• Queries are specified as Boolean
expressions
The Boolean Model– An Example
Suppose we want to determine which plays of Shakespeare contain the words Brutus AND Caesar AND NOT Calpurnia.
• One way is to start at the beginning and read through all the text, noting for each play whether it contains Brutus and Caesar and excluding it if it contains Calpurnia.
• Linear scan through the documents.
• May be appropriate and effective for some queries.
• Many purposes require a different approach.
• Large collection of documents to process.
• Need a ranked retrieval.
• Need other matching operations e.g. proximity match.
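The linear scan can be sketched as follows. The three "plays" are tiny made-up strings, not Shakespeare's text, and splitting on whitespace is a simplifying assumption.

```python
def linear_scan(plays):
    """Read every play in full and keep those matching
    Brutus AND Caesar AND NOT Calpurnia."""
    hits = []
    for title, text in plays.items():
        words = set(text.lower().split())
        if "brutus" in words and "caesar" in words and "calpurnia" not in words:
            hits.append(title)
    return hits


# Hypothetical miniature collection.
plays = {
    "play_1": "brutus and caesar meet",
    "play_2": "caesar brutus calpurnia speak",
    "play_3": "caesar alone",
}

print(linear_scan(plays))  # ['play_1']
```

Every query re-reads the whole collection, which is why the cost becomes a problem for large collections.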
The Boolean Model– Document incidence matrix

To avoid a linear scan we index the documents in advance.

Suppose we record for each document whether it contains each word out of all the words used in the collection (in our example Shakespeare used about 32,000 different words).
The result is a binary term-document incidence matrix.

Matrix element (t,d) is 1 if the play in column d contains the word in row t, and is 0 otherwise.

Depending on whether we look at the matrix rows or columns, we can have a vector for each term, which shows the documents it appears in, or a vector for each document, showing the terms that occur in it.
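A small sketch of building such a matrix, assuming whitespace tokenization and a made-up three-document collection. Rows are stored in a dictionary keyed by term; columns follow the sorted document ids.

```python
def incidence_matrix(docs):
    """Return (doc_ids, matrix) where matrix[term] is a list of 0/1 entries,
    one per document, in the order given by doc_ids."""
    doc_ids = sorted(docs)
    token_sets = {d: set(docs[d].lower().split()) for d in doc_ids}
    vocabulary = sorted(set().union(*token_sets.values()))
    matrix = {
        term: [1 if term in token_sets[d] else 0 for d in doc_ids]
        for term in vocabulary
    }
    return doc_ids, matrix


# Hypothetical collection.
docs = {"d1": "brutus caesar", "d2": "caesar calpurnia", "d3": "brutus"}
doc_ids, matrix = incidence_matrix(docs)

# Row view: the vector for a term shows the documents it appears in.
brutus_row = matrix["brutus"]                          # [1, 0, 1]

# Column view: the vector for a document shows the terms that occur in it.
d1_column = [matrix[term][0] for term in sorted(matrix)]  # [1, 1, 0]
```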
The Boolean Model– Vectors
• To answer the original question, i.e. Brutus AND Caesar AND NOT Calpurnia:
• We take the vectors for Brutus, Caesar, and Calpurnia, complement the last, and then do a bitwise AND:
• 110100 AND 110111 AND 101111 = 100100
• The answer to this query is two plays: Antony and Cleopatra and Hamlet.
• This is an exact match system
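The bitwise computation above can be checked directly. The bit vectors come from the slide (the Calpurnia vector 010000 is the complement of the 101111 shown); the column order of the plays is an assumption, following the usual textbook version of this example, chosen so that the leftmost bit is the first play.

```python
# Assumed column order of the incidence matrix (leftmost bit first).
PLAYS = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

brutus = 0b110100
caesar = 0b110111
calpurnia = 0b010000

# Python integers are unbounded, so the complement is masked to six bits.
mask = (1 << len(PLAYS)) - 1
result = brutus & caesar & (~calpurnia & mask)

# Decode the set bits back into play titles.
matches = [play for i, play in enumerate(PLAYS)
           if result & (1 << (len(PLAYS) - 1 - i))]

print(format(result, "06b"))  # 100100
print(matches)                # ['Antony and Cleopatra', 'Hamlet']
```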
The Boolean Model
• Terms are present or absent.
• Exact match system; documents are predicted to be relevant or non-relevant.
• Retrieval based on binary decision criteria with no notion of partial matching.
• No ranking of documents is provided (no grading scale).
• Information need has to be translated into a Boolean expression, which most users find awkward.
• The Boolean queries formulated by the users are most often too simplistic.
• The model frequently returns either too few or too many documents in response to a user query.
“…Term-document incidence matrix takes large space to store the information of which document
contains a certain term, and it becomes easily unmanageable and unusable for a large dataset. Also
most of the terms are not contained in most of the documents, which makes the matrix sparse thus
wasting a large amount of storage.

On the other hand, inverted index only records the documents contains a certain term. This makes a
good use of storage and makes the index smaller when compared to the term-document incidence
matrix.”

https://www.quora.com/Why-inverted-index-structure-is-more-efficient-than-Term-Document-incidence-matrix-for-IR-systems
Inverted Index

Now assume we want to create a term-document incidence matrix for a collection of 1 million documents with a vocabulary of about 500,000 distinct terms.

The matrix would have approx. half a trillion 0’s and 1’s (500,000 terms × 1,000,000 documents = 5 × 10^11 entries).
• It is not practical to store such a data structure in computer memory.

The matrix will be extremely sparse (most entries will be 0).
• Rather than storing all the 1’s and 0’s we will limit the information recorded to only the things that do occur (store only the 1’s).
• We create an inverted index.
Inverted Index
• The basic idea of an inverted index:
• Keep a dictionary/vocabulary of terms.
• For each term, keep a list that records which documents the term occurs in.
• Each item in the list, which records that a term appeared in a document (and often the position in the document), is conventionally called a posting.
• The list is then called a postings list (or inverted list), and all the postings lists taken together are referred to as the postings.
Figure: The two parts of an inverted index (labels: Posting, Postings List). The dictionary is commonly kept in memory, with pointers to each postings list, which is stored on disk.
Constructing an Inverted Index
To gain the speed benefits of indexing at retrieval time, we have to build the index in advance. The major steps in this are:
• Collect the documents to be indexed.

• Tokenize the text, turning each document into a list of tokens. Also remove stopwords. Stopwords are short words that occur frequently and add little meaning, e.g. the, a, in.

• Do linguistic pre-processing, producing normalized tokens (stemming).

• Index the documents that each term occurs in by creating an inverted index (dictionary and postings).
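The four steps can be sketched as below, under some loud assumptions: the collection is three invented one-line documents, tokenization is a whitespace split, and `normalise` only lowercases, standing in for real stemming. The dictionary is a Python dict and each postings list is a sorted list of document ids.

```python
from collections import defaultdict

# Stopwords from the slide's examples (illustrative only).
STOPWORDS = {"the", "a", "in"}


def normalise(token):
    """Placeholder for linguistic pre-processing; a real system would stem here."""
    return token.lower()


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):                       # step 1: collected documents
        tokens = docs[doc_id].split()                 # step 2: tokenize
        terms = {normalise(t) for t in tokens}        # step 3: normalise
        for term in terms - STOPWORDS:                # step 2: drop stopwords
            index[term].append(doc_id)                # step 4: add a posting
    return dict(index)


# Hypothetical collection.
docs = {
    1: "Brutus killed Caesar",
    2: "Caesar praised Calpurnia",
    3: "The noble Brutus",
}
index = build_inverted_index(docs)

print(index["brutus"])     # [1, 3]
print(index["caesar"])     # [1, 2]
print(index["calpurnia"])  # [2]
```

Because documents are visited in increasing id order and each term is added at most once per document, every postings list comes out sorted, which the intersection of postings lists relies on.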
Inverted Index - Example
• Consider the following conjunctive query:
• Brutus AND Calpurnia

Over the inverted index:


Inverted Index - Example
• Processing Boolean Queries:
1. Locate Brutus in the dictionary
2. Retrieve its postings
3. Locate Calpurnia in the dictionary
4. Retrieve its postings
5. Intersect the two postings lists
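Step 5 is usually done with a linear merge of the two sorted postings lists. The sketch below assumes postings are sorted lists of document ids; the small index is made up for illustration and is not taken from the figure on the slide.

```python
def intersect(p1, p2):
    """Merge two sorted postings lists, keeping ids present in both."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1          # advance the pointer at the smaller id
        else:
            j += 1
    return answer


# Hypothetical inverted index: term -> sorted postings list.
index = {
    "brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "calpurnia": [2, 31, 54, 101],
}

# Steps 1-4: locate each term in the dictionary and retrieve its postings.
# Step 5: intersect the two postings lists.
result = intersect(index["brutus"], index["calpurnia"])
print(result)  # [2, 31]
```

Each list is walked once, so the work is proportional to the combined length of the two postings lists rather than to the size of the collection.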
The inverted index structure is more efficient than the term-document incidence matrix for IR systems for several reasons:

1. Space efficiency: An inverted index takes up less space than a term-document incidence matrix because it only stores the documents in which a particular term appears, rather than storing a value for every term in every document.
2. Speed: Retrieving information from an inverted index is faster because a postings list gives direct access to the documents containing a particular term, rather than requiring a scan of a long, mostly-zero matrix row to find them.
3. Scalability: Inverted indexes can be easily distributed and scaled to handle large amounts of data, whereas term-document incidence matrices become increasingly difficult to work with as the amount of data grows.
4. Flexibility: Postings can carry extra information such as term positions, so inverted indexes support advanced search features such as proximity search, which a binary term-document incidence matrix cannot express. Boolean operators are supported by both structures.
Overall, the inverted index structure is a more efficient and flexible solution for IR systems.
Web Crawling
• Gathering pages from the web in order to index them and support a search engine.
• Gather as many useful web-pages as possible, quickly and efficiently, together with the link structure that interconnects them.
• Web-crawlers are also known as spiders.
• Web-crawlers must have the following features:
• Robustness: crawlers must not get caught in spider-traps (pages that mislead crawlers into fetching an infinite number of pages from some domain).
• Politeness: web servers have policies regulating the rate at which a web-crawler can visit them. These policies must be respected.
Web Crawling
• Features web-crawlers should provide:
• Distributed: crawlers should have the capability to execute in a distributed fashion (across multiple machines).
• Scalable: the crawler architecture should permit scaling up the crawl rate by adding extra machines and bandwidth.
• Performance and efficiency: crawlers should make efficient use of system resources including processors, storage, and network bandwidth.
• Quality: crawlers should be biased towards fetching useful information.
• Freshness: crawlers should operate in a continuous mode and fetch fresh copies of previously fetched pages.
• Extensible: crawlers should be designed to cope with new data formats and new fetch protocols.
Web Crawling Operation
• Crawlers begin with one or more URLs that constitute a seed set.
• The crawler picks a URL from the seed set and fetches the web-page at that URL.
• Fetched pages are parsed, and text and links are extracted from the page.
• The extracted text is fed to the text indexer.
• Extracted links are added to the URL frontier, which consists of URLs whose pages have yet to be fetched by the crawler.
• Initially the URL frontier contains the seed set.
• As pages are fetched the corresponding URLs are deleted from the URL frontier.
• Continuous crawling: the URL of a fetched page is not deleted from the URL frontier but is fetched again in future.
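The crawl loop can be sketched without touching the network by replacing the web with a small in-memory link graph. Everything here is hypothetical: `TOY_WEB`, its URLs, and the `seen` set, which is an added assumption to stop the same URL entering the frontier twice. The sketch shows the non-continuous case, where a URL leaves the frontier once fetched, and it ignores politeness and robustness.

```python
from collections import deque

# Hypothetical stand-in for the web: URL -> (page text, outgoing links).
TOY_WEB = {
    "http://a.example/": ("home page",
                          ["http://a.example/about", "http://b.example/"]),
    "http://a.example/about": ("about us", ["http://a.example/"]),
    "http://b.example/": ("other site", []),
}


def crawl(seed_set, web):
    """Return (url, text) pairs in fetch order; the text is what would be
    handed to the text indexer."""
    frontier = deque(seed_set)      # initially the frontier is the seed set
    seen = set(seed_set)            # assumption: remember URLs already queued
    indexed = []
    while frontier:
        url = frontier.popleft()    # the fetched URL is deleted from the frontier
        if url not in web:
            continue                # simulated failed fetch
        text, links = web[url]      # "fetch" and "parse" the page
        indexed.append((url, text))
        for link in links:          # extracted links join the frontier
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return indexed


fetched = [url for url, _ in crawl(["http://a.example/"], TOY_WEB)]
print(fetched)
# ['http://a.example/', 'http://a.example/about', 'http://b.example/']
```

For continuous crawling, a fetched URL would be re-queued for a later revisit instead of being dropped.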
Robots Exclusion Protocol
• Many hosts on the web place portions of their site off-limits to crawling, under a standard known as the Robots Exclusion Protocol.
• This is done by placing a robots.txt file at the root of the URL hierarchy of the site.

• E.g. No robot should visit any URL whose position in the file hierarchy starts with /yoursite/temp/, except for the robot called searchengine.
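The rule described above could be written in a robots.txt file roughly as shown in `ROBOTS_TXT` below. The file content is a reconstruction from the description, not copied from the slide, and `otherbot` is a made-up robot name. Python's standard `urllib.robotparser` module is used to check what each robot may fetch.

```python
from urllib.robotparser import RobotFileParser

# Reconstructed from the description: everyone is barred from /yoursite/temp/,
# except the robot called searchengine, which has nothing disallowed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())   # parse from memory, no network access


def allowed(robot_name, path):
    """True if the named robot may fetch the given path under these rules."""
    return parser.can_fetch(robot_name, path)


print(allowed("searchengine", "/yoursite/temp/page.html"))  # True
print(allowed("otherbot", "/yoursite/temp/page.html"))      # False
print(allowed("otherbot", "/yoursite/index.html"))          # True
```

A polite crawler would run a check like this before adding a URL to its frontier or fetching it.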
Exercise
1. Create an example document incidence matrix for a poem of your choice. Page numbers are to be replaced with line numbers.
2. Create an inverted index of your favourite song lyrics, stored across three separate documents. Use the index to intersect the postings for a word or words of your choice across the three documents.
