Information Retrieval Detailed Lecture Nov 2023

This document provides an overview of information retrieval, including definitions, terminology, data types, background, logical views of documents, retrieval vs filtering models, and the Boolean model. It defines information retrieval as dealing with representation, organization, storage, and access of unstructured information items like text. Key concepts covered include structured vs unstructured vs semi-structured data, indexing documents with keywords or terms, different logical views of documents, high-level retrieval system architectures, and classic retrieval models like Boolean, vector, and probabilistic.

Uploaded by

mccreary.michael95


Advanced Databases
Information Retrieval
Dr David Hamill
Overview

• Definitions for Information Retrieval (IR)

• IR terminology
• Data/record types
• IR background
• Logical Views of documents
• Retrieval vs Filtering
• IR models
• The Boolean Model
• Inverted Index
• Web crawling
Introduction

Data vs Information
Data are raw facts.
Information comes when data is processed, organized, and structured in some way.
Data can be posed as information when it is given context and meaning.

For example, look at these numbers: 2, 3, 5, 7, 11, 13, 17, 19

By itself, without being presented in a context, this list of numbers has no implied meaning: it is a set of data.

‘The set of prime numbers less than 20 appears in the list above.’ When described in this manner we have some information.
Introduction
• Information is something that:
  • Is represented by a set of symbols
  • Has some structure
  • Can be read and to some extent understood by users of information
• Information retrieval (IR) involves finding material (often documents) of an unstructured nature (often text) that satisfies an information need from within large collections (usually stored on computers). (An Introduction to Information Retrieval, Cambridge University Press.)
• Information retrieval (IR) is the scientific discipline that deals with the analysis, design, and implementation of computerized systems that address the representation, organization and access to large amounts of heterogeneous information encoded in digital format. (RIJSBERGEN, C.J., Information Retrieval, Butterworths, London, 1979.)
• Information retrieval (IR) deals with the representation, storage, organization of, and access to information items. (Modern Information Retrieval)
Introduction – More definitions
• IR refers to the retrieval of unstructured records:
  • Predominantly free-form natural language text.
  • Can also include other types of unstructured data:
    • Images
    • Sound
    • Video
• User Information Need: a natural language declaration of the information need of a user.
  • For example: “find documents which discuss the political implications of the Monica Lewinsky scandal in the results of the 1998 elections for the US congress”
• For information retrieval needs it is often necessary to translate the requirement into a query that can be processed by the IR system.
  • This query is often composed of keywords/index terms summarising the user information need.
Introduction – Terminology
• Documents: the records that IR
systems often process.
• Collection: an organised
repository used by IR systems
to retrieve documents.
• Archive, corpus, digital library are
terms also used in this context.
• Documents that satisfy a query
in the judgement of the user are
said to be relevant.
Introduction – Terminology
• The emphasis for IR is the retrieval of information as opposed to data.
• Ranking: an established order of the documents retrieved.
• IR systems must rank information items according to a degree of relevance to the user.
• The IR Problem: retrieve all items relevant to a user query, while retrieving as few non-relevant items as possible.
Types of data
• Structured records: consist of named components that are organised according to some well-defined syntax:
  • Each component of a record has a definite meaning and a specific type.
  • E.g. relational database table records.
Types of data
• Unstructured records: do not have a well-defined syntax.
  • There is no well-defined meaning attributed to each syntactical element.
  • E.g. emails, chapters from books, reviews, audio etc.
Types of data
• Semi-Structured records: follow a general standard form but have no rigid model.
  • E.g. JSON or XML documents, as used by NoSQL data models. They can contain tagged fields but there is no enforcement of a particular schema.
IR Background

• Early goals: indexing text and searching for useful documents in a collection.
• Modern research: modelling (sentiment analysis of text), web search, text classification, user interfaces, data visualization, filtering, and languages.
• Libraries were among the first institutions to adopt IR systems.
  • Initially, these consisted of an automation of existing processes like card cataloguing for searching.
  • Increased functionality was added later, including subject headings, keywords, and query operators.
IR Background

• Until recently IR was mainly of interest to librarians and information experts.
• The element that changed this was the introduction of the web, the largest repository of knowledge in human history.
• Because of its enormous size, finding information on the web requires running searches.
Logical View of Documents
• Documents in a collection are often represented through a set of
index terms or keywords:
• Can be extracted automatically or manually.
• IR systems can adopt different logical views of documents:
• Full text
• Representative keywords:
• Eliminate stopwords: words that occur frequently in text documents. Examples: articles
(a, an, the), prepositions (in, at, on, of, to), and conjunctions (and, or, but, if, when).
• Stemming: a technique for reducing words to their grammatical roots.
• Identification of noun groups: eliminate adjectives, adverbs, and verbs.
• These text operations reduce the complexity of the document representation
and allow moving the logical view from full text to a set of index terms.
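The text operations above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stopword set is just the handful of examples listed on the slide, and `naive_stem` is a hypothetical, deliberately crude suffix stripper standing in for a real stemming algorithm.

```python
import re

# Assumed, deliberately tiny stopword list built from the examples on the slide.
STOPWORDS = {"a", "an", "the", "in", "at", "on", "of", "to",
             "and", "or", "but", "if", "when"}

# Assumed toy suffix list; a real system would use an established stemmer.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Lowercase, tokenize, drop stopwords, and stem the remaining tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]


print(index_terms("The crawler fetched the documents and indexed the links"))
# ['crawler', 'fetch', 'document', 'index', 'link']
```

The full text of nine words is reduced to five index terms, which is the move from the full-text view to the index-term view described above.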
Logical View of Documents
High-Level Architecture of an IR System
The Retrieval Process
• Retrieval is the “matching” process between document keywords and
words in queries.
Retrieval vs Filtering
• A distinction between ad-hoc
retrieval and filtering is often
made.
• Ad-hoc retrieval refers to the
application of arbitrary
queries to a fixed collection
of documents:
• Static documents, new
queries.
• Formerly called
retrospective retrieval
Retrieval vs Filtering
• Filtering refers to having a
fixed number of queries that
are applied to a stream of
changing documents:
• Static queries, new documents
• Can be based on a ‘user profile’
• The documents are classified
according to which query they
most closely match and routed
accordingly.
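As a rough sketch of filtering, the snippet below keeps a fixed set of queries (user profiles) and routes each incoming document to the profile it overlaps with most. The profile names, their terms, the sample documents, and the overlap-count matching rule are all assumptions made for illustration; the slides do not prescribe a particular matching rule.

```python
def route(document, profiles):
    """Return the profile sharing the most terms with the document, or None."""
    words = set(document.lower().split())
    best_name, best_overlap = None, 0
    for name, terms in profiles.items():
        overlap = len(words & terms)
        if overlap > best_overlap:
            best_name, best_overlap = name, overlap
    return best_name


# Static queries (hypothetical user profiles).
profiles = {
    "sports": {"football", "match", "goal"},
    "finance": {"market", "shares", "bank"},
}

# A stream of new documents (invented examples).
stream = [
    "late goal wins the football match",
    "bank shares fall as market dips",
    "weather stays mild",
]

routed = [route(doc, profiles) for doc in stream]
print(routed)  # ['sports', 'finance', None]
```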
Information Retrieval Models
• Modelling in IR is a complex process aimed at producing a ranking
function.
• Ranking function: a function that assigns scores to documents with
regard to a given query:
• This process consists of two main tasks:
• The conception of a logical framework for representing documents and queries.
• The definition of a ranking function that allows quantifying the similarities among
documents and queries.
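To make the idea of a ranking function concrete, here is a minimal sketch in which documents and queries are both represented as sets of terms and the score is simply the number of query terms a document contains. The scoring rule, the function names, and the sample documents are assumptions for illustration, not one of the classic models.

```python
def score(query_terms, doc_terms):
    """Assumed toy ranking function: number of distinct query terms in the document."""
    return len(set(query_terms) & set(doc_terms))


def rank(query, docs):
    """Return ids of documents with a non-zero score, best first."""
    q = query.lower().split()
    scored = [(score(q, text.lower().split()), doc_id)
              for doc_id, text in docs.items()]
    # Highest score first; ties broken by document id for a stable order.
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for s, doc_id in ordered if s > 0]


docs = {
    "d1": "boolean retrieval model",
    "d2": "vector space retrieval model ranking",
    "d3": "web crawling",
}

print(rank("ranking retrieval model", docs))  # ['d2', 'd1']
```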
Information Retrieval Models
• IR systems usually adopt index terms to index and retrieve
documents.
• Index term:
• Restrictively: a keyword that has some meaning on its own; usually plays the
role of a noun.
• Generally: any word that appears in a document
• Retrieval based on index terms can be implemented efficiently (see
indexing lecture).
• Index terms are simple to refer to in a query.
• Simplicity is important because it reduces the effort of query formulation.
Information Retrieval Models
• A ranking is an ordering of the documents that (hopefully) reflects their relevance to a user query.
• Any IR system has to deal with the problem of predicting which documents users will find relevant.
• This problem naturally embodies a degree of uncertainty and vagueness.
Information Retrieval Models
• Three classic models:
• Boolean model: documents and queries are sets of index terms.
• Set theoretic
• Vector model: documents and queries exist in N-dimensional
space.
• Algebraic
• Probabilistic model: based on probability theory.
The Boolean Model
• The Boolean retrieval model is a
model for information retrieval
which can pose any query in the
form of a Boolean expression of
terms.
• Terms are combined with the
operators AND, OR, and NOT.
• Based on set theory and Boolean
algebra.
• The model views each document
as a set of words
• Queries are specified as Boolean
expressions
The Boolean Model– An Example
Suppose we want to determine which plays of Shakespeare contain the words Brutus AND Caesar AND NOT Calpurnia.
• One way is to start at the beginning and read through all the text, noting for each play whether it contains Brutus and Caesar and excluding it if it contains Calpurnia.
  • This is a linear scan through the documents.
  • It may be appropriate and effective for some queries.
• Many purposes require a different approach:
  • A large collection of documents to process.
  • A need for ranked retrieval.
  • A need for other matching operations, e.g. proximity match.
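A linear scan for this kind of query might look like the sketch below. The three short documents are invented placeholders rather than text from the plays, and splitting on whitespace is an assumed, simplified tokenization.

```python
def linear_scan(docs, required, excluded):
    """Read every document in turn and keep those satisfying the query."""
    hits = []
    for title, text in docs.items():
        words = set(text.lower().split())
        has_all = all(term in words for term in required)
        has_excluded = any(term in words for term in excluded)
        if has_all and not has_excluded:
            hits.append(title)
    return hits


# Invented placeholder documents.
toy_docs = {
    "doc1": "brutus meets caesar",
    "doc2": "brutus caesar calpurnia",
    "doc3": "caesar alone",
}

print(linear_scan(toy_docs, ["brutus", "caesar"], ["calpurnia"]))  # ['doc1']
```

Every query re-reads every document, which is why the cost grows with the size of the collection.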
The Boolean Model– Document incidence matrix

To avoid a linear scan, we index the documents in advance.

Suppose we record, for each document, whether it contains each word out of all the words used in the collection (in our example, Shakespeare used about 32,000 different words).
The result is a binary term-document incidence matrix.

Matrix element (t,d) is 1 if the play in column d contains the word in row t, and is 0 otherwise.

Depending on whether we look at the matrix rows or columns, we can have a vector for each term, which shows the documents it appears in, or a vector for each document, showing the terms that occur in it.
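The following sketch builds a small term-document incidence matrix. The documents are invented placeholders, and storing the matrix as a dictionary of rows (one list of 0s and 1s per term) is just one assumed representation.

```python
def incidence_matrix(docs):
    """Return (doc_ids, matrix) where matrix[term] is that term's row of 0s and 1s."""
    doc_ids = sorted(docs)
    doc_words = {d: set(docs[d].lower().split()) for d in doc_ids}
    vocabulary = sorted(set().union(*doc_words.values()))
    matrix = {
        term: [1 if term in doc_words[d] else 0 for d in doc_ids]
        for term in vocabulary
    }
    return doc_ids, matrix


# Invented placeholder documents.
toy_docs = {
    "doc1": "brutus meets caesar",
    "doc2": "brutus caesar calpurnia",
    "doc3": "caesar alone",
}

doc_ids, matrix = incidence_matrix(toy_docs)
print(doc_ids)              # ['doc1', 'doc2', 'doc3']
print(matrix["brutus"])     # [1, 1, 0]  (a row: the term vector)
print(matrix["calpurnia"])  # [0, 1, 0]

# A column gives the vector for one document.
doc2_column = {term: row[1] for term, row in matrix.items()}
print(doc2_column["alone"])  # 0
```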
The Boolean Model– Vectors
• To answer the original question, i.e. Brutus AND Caesar AND NOT Calpurnia:
  • We take the vectors for Brutus, Caesar, and Calpurnia, complement the last, and then do a bitwise AND:
  • 110100 AND 110111 AND 101111 = 100100
• The answer to this query is two plays: Antony and Cleopatra and Hamlet.
• This is an exact match system.
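The same calculation can be reproduced with Python integers used as bit vectors. The bit patterns for Brutus and Caesar come from the slide, and Calpurnia's vector is the complement of the slide's 101111. The column order of the plays is an assumption taken from the standard textbook version of this example, chosen so that it agrees with the answer stated above.

```python
# Assumed column order (leftmost bit first).
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

brutus = 0b110100
caesar = 0b110111
calpurnia = 0b010000

# Python integers are unbounded, so mask the complement to six bits.
mask = (1 << len(plays)) - 1
not_calpurnia = ~calpurnia & mask   # 0b101111

result = brutus & caesar & not_calpurnia
print(format(result, "06b"))  # 100100

# Read off the plays whose bit is set, leftmost bit first.
matches = [play for i, play in enumerate(plays)
           if (result >> (len(plays) - 1 - i)) & 1]
print(matches)  # ['Antony and Cleopatra', 'Hamlet']
```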
The Boolean Model
• Terms are present or absent.
• Retrieval is based on binary decision criteria with no notion of partial matching.
• No ranking of documents is provided (no grading scale).
• The model frequently returns either too few or too many documents in response to a user query.
• Exact match system; documents are predicted to be relevant or non-relevant.
• The information need has to be translated into a Boolean expression, which most users find awkward.
• The Boolean queries formulated by users are most often too simplistic.
“…Term-document incidence matrix takes large space to store the information of which document
contains a certain term, and it becomes easily unmanageable and unusable for a large dataset. Also
most of the terms are not contained in most of the documents, which makes the matrix sparse thus
wasting a large amount of storage.

On the other hand, inverted index only records the documents contains a certain term. This makes a
good use of storage and makes the index smaller when compared to the term-document incidence
matrix.”

https://www.quora.com/Why-inverted-index-structure-is-more-efficient-than-Term-Document-incidence-matrix-for-IR-systems
Inverted Index

Now assume we want to create a term-document incidence matrix for a collection of 1 million documents whose vocabulary contains about 500,000 distinct terms.

The matrix would have approximately half a trillion 0’s and 1’s (500,000 rows × 1,000,000 columns).
• It is not practical to store such a data structure in computer memory.

The matrix will be extremely sparse (most entries will be 0).
• Rather than storing all the 1’s and 0’s, we record only the things that do occur (store only the 1’s).
• We create an inverted index.
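The size estimate can be checked with a few lines of arithmetic. This sketch uses the two figures given above (1 million documents, 500,000 terms); the one-bit-per-cell storage cost is an assumption added for illustration and is the most compact dense encoding possible.

```python
num_docs = 1_000_000
num_terms = 500_000

# One matrix cell per (term, document) pair.
cells = num_terms * num_docs
print(f"{cells:,} cells")  # 500,000,000,000 cells

# Assumed best case for a dense matrix: one bit per cell.
bytes_needed = cells // 8
print(f"{bytes_needed / 1e9:.1f} GB")  # 62.5 GB
```

Even at one bit per cell the dense matrix needs tens of gigabytes, almost all of it spent on zeros.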
Inverted Index
• The basic idea of an inverted index:
  • Keep a dictionary/vocabulary of terms.
  • For each term, keep a list that records which documents the term occurs in.
  • Each item in the list, which records that the term appeared in a document (and often the position in the document), is conventionally called a posting.
  • The list is then called a postings list (or inverted list), and all the postings lists taken together are referred to as the postings.

Figure: the two parts of an inverted index. The dictionary is commonly kept in memory, with pointers to each postings list, which is stored on disk.
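One way to picture these two parts is a dictionary that maps each term to its postings list. In the sketch below the `Posting` class, its field names, and the sample document ids and positions are all hypothetical; everything is held in memory, whereas the slide notes that postings lists are usually stored on disk.

```python
from dataclasses import dataclass, field


@dataclass
class Posting:
    """One posting: a document id plus (optionally) the term's positions in it."""
    doc_id: int
    positions: list = field(default_factory=list)


# The dictionary: term -> postings list, kept sorted by document id.
index = {
    "brutus": [Posting(1, [0]), Posting(2, [3, 7])],
    "calpurnia": [Posting(2, [5])],
}


def doc_ids(term):
    """Look the term up in the dictionary and return the ids in its postings list."""
    return [posting.doc_id for posting in index.get(term, [])]


print(doc_ids("brutus"))  # [1, 2]
print(doc_ids("hamlet"))  # []
```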
Constructing an Inverted Index
To gain the speed benefits of indexing at retrieval time, we have to build the index in advance. The major steps in this are:
• Collect the documents to be indexed.

• Tokenize the text, turning each document into a list of tokens. Also remove stopwords. Stopwords are short words that occur frequently and add little meaning, e.g. the, a, in.

• Do linguistic pre-processing, producing normalized tokens (stemming).

• Index the documents that each term occurs in by creating an inverted index (dictionary and postings).
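The four steps can be sketched as follows. The three documents are invented examples, the stopword set contains only the three words named above, and `normalize` is a hypothetical stand-in that merely lowercases, where a real system would apply stemming.

```python
import re
from collections import defaultdict

STOPWORDS = {"the", "a", "in"}  # only the examples named in the text


def normalize(token):
    """Stand-in for linguistic pre-processing (assumed: lowercasing only)."""
    return token.lower()


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():                   # step 1: collected documents
        for token in re.findall(r"[A-Za-z]+", text):    # step 2: tokenize
            term = normalize(token)                     # step 3: normalize
            if term not in STOPWORDS:
                index[term].add(doc_id)                 # step 4: record the posting
    return {term: sorted(ids) for term, ids in sorted(index.items())}


# Invented example documents.
docs = {
    1: "Brutus killed Caesar",
    2: "Caesar in the Capitol",
    3: "The noble Brutus",
}

inverted = build_inverted_index(docs)
print(inverted)
# {'brutus': [1, 3], 'caesar': [1, 2], 'capitol': [2], 'killed': [1], 'noble': [3]}
```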
Inverted Index - Example
• Consider the following conjunctive query:
• Brutus AND Calpurnia

Over the inverted index:

Inverted Index - Example
• Processing Boolean Queries:
1. Locate Brutus in the dictionary
2. Retrieve its postings
3. Locate Calpurnia in the dictionary
4. Retrieve its postings
5. Intersect the two postings lists
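Step 5 is commonly done with a single merge-style pass over the two sorted postings lists. The sketch below shows that approach; the document ids are illustrative values, not data recovered from the slide's figure.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by document id."""
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # document contains both terms
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the list with the smaller id
        else:
            j += 1
    return answer


# Illustrative postings lists (document ids are assumed values).
brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]

print(intersect(brutus, calpurnia))  # [2, 31]
```

Because both lists are sorted, each is walked at most once, so the work is proportional to the combined length of the two lists.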
The inverted index structure is more efficient than the term-document incidence matrix for IR systems for several reasons:

1. Space efficiency: An inverted index takes up less space than a term-document incidence matrix because it only stores the documents in which a particular term appears, rather than storing a value for every term in every document.
2. Speed: Retrieving information from an inverted index is faster because the index gives direct access to the list of documents containing a particular term, rather than requiring a scan across a matrix row that is mostly zeros.
3. Scalability: Inverted indexes can be distributed and scaled to handle large amounts of data, whereas term-document incidence matrices become increasingly difficult to work with as the amount of data grows.
4. Flexibility: Inverted indexes make advanced search features such as proximity search easy to implement (for example, by storing term positions in the postings), whereas a binary term-document incidence matrix records only presence or absence and cannot support them.
Overall, the inverted index structure is a more efficient and flexible solution for IR systems.
Web Crawling
• Gathering pages from the web in order to index them and support a search engine.
• Gather as many useful web-pages as possible, quickly and efficiently, together with the link structure that interconnects them.
• Web-crawlers are also known as spiders.
• Web-crawlers must have the following features:
  • Robustness: crawlers must not get caught in spider-traps (pages that mislead crawlers into fetching an infinite number of pages from some domain).
  • Politeness: web servers have policies regulating the rate at which a web-crawler can visit them. These policies must be respected.
Web Crawling
• Features web-crawlers should provide:
  • Distributed: crawlers have the capability to execute in a distributed fashion (across multiple machines).
  • Scalable: the crawler architecture should permit scaling up the crawl rate by adding extra machines and bandwidth.
  • Performance & efficiency: crawlers should make efficient use of system resources including processors, storage, and network bandwidth.
  • Quality: crawlers should be biased towards fetching useful information.
  • Freshness: crawlers should operate in a continuous mode and fetch fresh copies of previously fetched pages.
  • Extensible: crawlers should be designed to cope with new data formats and new fetch protocols.
Web Crawling Operation
• Crawlers begin with one or more URLs that constitute a seed set.
• The crawler picks a URL from the seed set and fetches the web-page at that URL.
• Fetched pages are parsed, and text and links are extracted from the page.
• The extracted text is fed to the text indexer.
• Extracted links are added to the URL frontier, which consists of URLs whose pages have yet to be fetched by the crawler.
  • Initially the URL frontier contains the seed set.
  • As pages are fetched, the corresponding URLs are deleted from the URL frontier.
• Continuous crawling: the URL of a fetched page is not deleted from the URL frontier but is fetched again in future.
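The crawl loop can be sketched without touching the network by substituting a small in-memory "web". Everything named here is hypothetical: the `FAKE_WEB` dictionary, its URLs, and the `fetch` callable stand in for real HTTP fetching and HTML parsing, and the `seen` set is an added assumption that stops the same URL from being queued twice. Politeness and continuous crawling are left out.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
FAKE_WEB = {
    "http://a.example/": ("home page", ["http://a.example/b", "http://a.example/c"]),
    "http://a.example/b": ("page b", ["http://a.example/", "http://a.example/c"]),
    "http://a.example/c": ("page c", []),
}


def crawl(seed_urls, fetch, max_pages=100):
    """Fetch pages breadth-first starting from the seed set."""
    frontier = deque(seed_urls)   # URLs whose pages are yet to be fetched
    seen = set(seed_urls)         # URLs already queued or fetched
    indexed = []                  # stands in for the text indexer
    while frontier and len(indexed) < max_pages:
        url = frontier.popleft()  # a fetched URL leaves the frontier
        page = fetch(url)
        if page is None:          # unknown URL: nothing to parse
            continue
        text, links = page
        indexed.append((url, text))
        for link in links:        # extracted links join the frontier
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return indexed


pages = crawl(["http://a.example/"], FAKE_WEB.get)
print([url for url, _ in pages])
# ['http://a.example/', 'http://a.example/b', 'http://a.example/c']
```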
Robots Exclusion Protocol
• Many hosts on the web place portions of their site off-limits to crawling, under a standard known as the Robots Exclusion Protocol.
• This is done by placing a robots.txt file at the root of the URL hierarchy of the site.

• E.g. no robot should visit any URL whose position in the file hierarchy starts with /yoursite/temp/, except for the robot called searchengine.
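A robots.txt file expressing that rule might look like the one embedded in the sketch below, which checks it with Python's standard `urllib.robotparser` module. The file contents are an assumed reconstruction that matches the description above, and the host name and the agent name `otherbot` are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt: everyone is barred from /yoursite/temp/,
# except the robot called searchengine, which has no restrictions.
ROBOTS_TXT = """\
User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse from memory, no network access

url = "http://www.example.com/yoursite/temp/page.html"  # hypothetical URL
print(parser.can_fetch("searchengine", url))  # True
print(parser.can_fetch("otherbot", url))      # False
print(parser.can_fetch("otherbot", "http://www.example.com/index.html"))  # True
```

A polite crawler would run a check like this before adding a URL to its frontier or fetching it.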
Exercise
1. Create an example document incidence matrix for a poem of your choice. Use line numbers in place of page numbers.
2. Create an inverted index of your favourite song lyrics stored across three separate documents. Use the index to intersect the postings for a word or words of your choice across the three documents.
