Information Retrieval Detailed Lecture Nov 2023
Databases
Information Retrieval
Dr David Hamill
Overview
Data vs Information
Data are raw facts.
Information is produced when data is processed, organized, and structured in some way.
Data becomes information when it is given context and meaning.
Information Retrieval Models
Any IR system has to deal with the problem of predicting which documents users will find relevant.
Matrix element (t, d) is 1 if the play in column d contains the word in row t, and is 0 otherwise.
The Boolean Model
• Retrieval is based on binary decision criteria with no notion of partial matching.
• No ranking of documents is provided (no grading scale).
• The information need has to be translated into a Boolean expression, which most users find awkward.
• The Boolean queries formulated by users are most often too simplistic.
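As a minimal sketch (not from the original slides), the term-document incidence matrix above and a Boolean AND query over it could be represented as follows in Python; the plays, terms, and matrix values are assumed purely for illustration:

# Minimal sketch of a term-document incidence matrix and Boolean retrieval.
# The plays, terms, and 0/1 values below are illustrative assumptions.
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest"]

# incidence[t][d] is 1 if the play in column d contains term t, 0 otherwise.
incidence = {
    "brutus":    [1, 1, 0],
    "caesar":    [1, 1, 0],
    "calpurnia": [0, 1, 0],
}

def boolean_and(term_a, term_b):
    # Answer "term_a AND term_b" by AND-ing the two rows of the matrix.
    row_a, row_b = incidence[term_a], incidence[term_b]
    return [plays[d] for d in range(len(plays)) if row_a[d] and row_b[d]]

print(boolean_and("brutus", "calpurnia"))  # -> ['Julius Caesar']

Note that the matrix stores a 0 or 1 for every (term, document) pair, which is what makes it impractical at web scale.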
"On the other hand, an inverted index only records the documents that contain a certain term. This makes good use of storage and makes the index smaller when compared to the term-document incidence matrix."
https://www.quora.com/Why-inverted-index-structure-is-more-efficient-than-Term-Document-incidence-matrix-for-IR-systems
Inverted Index
The matrix would have approx. half a trillion 0’s and 1’s.
• It is not practical to store such a data structure in computer memory.
• Tokenize the text, turning each document into a list of tokens. Also remove stopwords. Stopwords are short words that occur frequently and add little meaning, e.g. the, a, in.
• Index the documents that each term occurs in by creating an inverted index (dictionary and postings), as sketched below.
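A minimal Python sketch of these steps, tokenizing, dropping stopwords, and building a dictionary with postings lists; the toy documents, document IDs, and stopword list are assumptions for illustration only:

import re
from collections import defaultdict

# Hypothetical toy corpus; the document IDs and text are illustrative.
docs = {
    1: "I did enact Julius Caesar: I was killed in the Capitol.",
    2: "So let it be with Caesar. The noble Brutus hath told you.",
}

STOPWORDS = {"the", "a", "in", "i", "it", "be", "was", "so", "with", "you"}

def tokenize(text):
    # Lower-case the text, split it into word tokens, and drop stopwords.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def build_inverted_index(documents):
    # Map each term (dictionary entry) to a sorted postings list of doc IDs.
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

index = build_inverted_index(docs)
print(index["caesar"])  # -> [1, 2]
print(index["brutus"])  # -> [2]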
Inverted Index - Example
• Consider the following conjunctive query:
• Brutus AND Calpurnia
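One way to answer such a conjunctive query is to intersect the sorted postings lists of the two terms. The sketch below walks both lists once; the postings values are illustrative, not taken from the lecture:

def intersect(postings_a, postings_b):
    # Merge two sorted postings lists in O(len(a) + len(b)) time.
    answer, i, j = [], 0, 0
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            answer.append(postings_a[i])   # doc contains both terms
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    return answer

# Hypothetical postings lists for the two query terms.
brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))  # -> [2, 31]

Because both postings lists are kept sorted by document ID, the merge never has to scan the whole collection, which is one reason retrieval from an inverted index is fast.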
1. Space Efficiency: An inverted index takes up less space than a Term-Document incidence matrix because it only stores the documents in which a particular term appears, rather than storing a value for every term in every document.
2. Speed: Retrieving information from an inverted index is faster than from a Term-Document incidence matrix because the inverted index allows for direct access to the documents containing a particular term, rather than having to scan through the entire matrix to find the documents.
3. Scalability: Inverted indexes can be easily distributed and scaled to handle large amounts of data, whereas Term-Document incidence matrices become increasingly difficult to work with as the amount of data grows.
4. Flexibility: Inverted indexes allow for easy implementation of advanced search features such as Boolean operators and proximity search, which are difficult or impossible to implement with a Term-Document incidence matrix.
Overall, the inverted index structure is a more efficient and flexible solution for IR systems.
Web Crawling
• Gathering pages from the web in order to index them and support a search engine.
• Gather as many useful web-pages as possible, quickly and efficiently, together with the link structure that interconnects them.
• Web-crawlers are also known as spiders.
• Web-crawlers must have the following features:
• Robustness – crawlers must not get caught in spider-traps (pages that mislead crawlers into fetching an infinite number of pages from some domain).
• Politeness – web servers have policies regulating the rate at which a web-crawler can visit them. These policies must be respected.
Web Crawling
• Features web-crawlers should provide:
• Distributed – crawlers have the capability to execute in a distributed fashion (across multiple machines).
• Scalable – the crawler's architecture should permit scaling up the crawl rate by adding extra machines and bandwidth.
• Performance & efficiency – crawlers should make efficient use of system resources including processors, storage, and network bandwidth.
• Quality – crawlers should be biased towards fetching useful information.
• Freshness – crawlers should operate in a continuous mode and fetch fresh copies of previously fetched pages.
• Extensible – crawlers should be designed to cope with new data formats and new fetch protocols.
Web Crawling Operation
• Crawlers begin with one or more URLs that constitute a seed set.
• The crawler picks a URL from the seed set and fetches the web-page at that URL (a sketch of this loop follows after this list).
• Fetched pages are parsed, and text and links are extracted from the page.
• The extracted text is fed to the text indexer.
• Extracted links are added to the URL frontier, which consists of URLs whose pages have yet to be fetched by the crawler.
• Initially the URL frontier contains the seed set.
• As pages are fetched, the corresponding URLs are deleted from the URL frontier.
• Continuous crawling: the URL of a fetched page is not deleted from the URL frontier but is fetched again in the future.
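The following Python sketch illustrates this fetch–parse–extract loop with a simple URL frontier. It is a toy illustration of the operation described above, not production crawler code; the link-extraction regex, page limit, delay, and seed URL are assumptions:

import re
import time
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

def crawl(seed_urls, max_pages=10, delay=1.0):
    # Fetch a URL from the frontier, extract its text and links, and push
    # unseen links back onto the frontier. Robustness and politeness are
    # reduced to a fixed delay here; a real crawler also needs robots.txt
    # handling, spider-trap detection, per-host rate limits, distribution.
    frontier = deque(seed_urls)        # URL frontier, initialised with the seed set
    seen = set(seed_urls)
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
        except Exception:
            continue                   # skip pages that cannot be fetched
        fetched += 1
        index_text(url, html)          # hand the page text to the indexer (stub below)
        for href in re.findall(r'href="([^"#]+)"', html):
            link = urljoin(url, href)  # resolve relative links
            if link.startswith("http") and link not in seen:
                seen.add(link)
                frontier.append(link)  # add newly discovered URLs to the frontier
        time.sleep(delay)              # crude politeness: wait between requests

def index_text(url, html):
    # Stand-in for the text indexer.
    print(f"indexing {url} ({len(html)} bytes)")

# Example usage (hypothetical seed set):
# crawl(["https://example.com/"])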
Robots Exclusion Protocol
• Many web hosts place portions of their site off-limits to crawling, under a standard known as the Robots Exclusion Protocol.
• This is done by placing a robots.txt file at the root of the URL hierarchy of the site.
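A crawler can honour robots.txt using Python's standard-library robotparser; the sketch below is a minimal illustration, and the site URL and user-agent name are hypothetical:

from urllib import robotparser

# Sketch of checking the Robots Exclusion Protocol before fetching a page.
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # robots.txt sits at the root of the URL hierarchy
rp.read()

# Only fetch the page if the site's robots.txt allows our user agent to.
if rp.can_fetch("MyCrawler", "https://example.com/some/page.html"):
    print("allowed to crawl")
else:
    print("disallowed by robots.txt")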