Information Storage and Retrieval Techniques
Course Code: INT204
K. Chakrapani
Department of IT
SOC
SASTRA
7845445557
UNIT - I
Information Storage: Overview – Abstraction – From Data to Wisdom -
Document and Query Forms – Query Structures
Information Retrieval Systems: Introduction – Web Search – Other Search
Applications – IR Applications – IR System Architecture – Documents and
Update – Working with Electronic Text – Test Collections – Open Source IR
Systems – Basic IR Techniques: Inverted Indices – Retrieval and Ranking –
Vector Space Model – Proximity Ranking – Boolean Retrieval - Evaluation:
Recall and Precision – Effectiveness Measures for Ranked Retrieval –
Building a Test Collection – Efficiency Measures
Information storage:
• It can be defined as the component of an accounting system that keeps
data in a form accessible to information processors.
• Information storage is usually thought of in digital form, but it can still
be in paper format. Information storage can be on the cloud or off the
cloud.
From Data to Wisdom
• It involves understanding and the ability to make use of data and
information to answer questions, solve problems, make decisions, and
so on.
• Wisdom has to do with using one's knowledge in a responsible (wise)
manner.
1.1 What Is Information Retrieval?
• Information retrieval (IR) is concerned with representing, searching, and manipulating
large collections of electronic text and other human-language data.
• IR systems and services are now widespread, with millions of people depending on
them daily to facilitate business, education, and entertainment.
• Web search engines — Google, Bing, and others — are by far the most popular and
heavily used IR services, providing access to up-to-date technical information, locating
people and organizations, summarizing news and events, and simplifying comparison
shopping.
Web Search:
• Regular users of Web search engines casually expect to receive accurate and near-instantaneous
answers to questions and requests merely by entering a short query (a few words) into a text box
and clicking on a search button.
• Underlying this simple and intuitive interface are clusters of computers, comprising thousands of
machines, working cooperatively to generate a ranked list of those Web pages that are likely to
satisfy the information need embodied in the query.
• These machines identify a set of Web pages containing the terms in the query, compute a score
for each page, eliminate duplicate and redundant pages, generate summaries of the remaining
pages, and finally return the summaries and links back to the user for browsing.
Other Search Applications:
• Desktop and file system search provides browsing facilities for files stored on a local
hard disk and possibly on disks connected over a local network.
• Lying between the desktop and the general Web, enterprise-level IR systems provide
document management and search services across businesses and other organizations.
• Digital libraries and other specialized IR systems support access to collections of high-
quality material, often of a proprietary nature. This material may consist of newspaper
articles, medical journals, maps, or books that cannot be placed on a generally
available Web site due to copyright restrictions.
IR Applications:
• Whereas a typical search application evaluates incoming queries against a given document
collection, a routing, filtering, or dissemination system compares newly created or
discovered documents to a fixed set of queries supplied in advance by users,
identifying those that match a given query closely enough to be of possible interest to
the users.
• The difference between clustering and categorization stems from the information
provided to the system. Categorization systems are provided with training data
illustrating the various classes; clustering systems are given no such data and must
group related documents on their own.
1.2 IR System Architecture
• Using the inverted index, collection statistics, and other data, the search engine
accepts queries from its users, processes these queries, and returns ranked lists
of results. To perform relevance ranking, the search engine computes a score,
sometimes called a Retrieval Status Value (RSV), for each document.
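The scoring-and-ranking step can be illustrated with a minimal sketch. The scoring function here is a deliberately crude placeholder (the number of query-term occurrences in the document), not the formula used by any real search engine, and the names rank, documents, and k are assumptions made for this example.

```python
from collections import Counter

def rank(query, documents, k=10):
    """Return up to k (doc_id, rsv) pairs, highest score first.

    The RSV used here is a toy: the total number of times any query term
    occurs in the document. Real systems use far more refined formulas.
    """
    query_terms = query.lower().split()
    scored = []
    for doc_id, text in documents.items():
        counts = Counter(text.lower().split())
        rsv = sum(counts[term] for term in query_terms)
        if rsv > 0:  # documents with no matching term are not returned
            scored.append((doc_id, rsv))
    # Sort by descending score; break ties by document id for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]

documents = {
    "d1": "the quick brown fox",
    "d2": "the lazy dog and the quick dog",
    "d3": "a brown cow",
}
print(rank("quick dog", documents))  # [('d2', 3), ('d1', 1)]
```

This sketch scans every document for every query; a real engine would use the inverted index to visit only documents that contain the query terms.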
Documents and Update:
• We use “document” as a generic term to refer to any self-contained unit that can be returned to the user
as a search result.
• In practice, a particular document might be an e-mail message, a Web page, a news article, or even a
video.
• When predefined components of a larger object may be returned as individual search results, such as
pages or paragraphs from a book, we refer to these components as elements.
• When arbitrary text passages, video segments, or similar material may be returned from larger
objects, we refer to them as snippets.
• Documents may be added or deleted in their entirety. Once a document has been added to the search
engine, its contents are not modified.
Performance Evaluation:
• There are two principal aspects to measuring IR system performance:
efficiency and effectiveness.
• Efficiency may be measured in terms of time and space. The most visible aspect of
efficiency is the response time experienced by a user between issuing a query and
receiving the results.
• Efficiency may also be considered in terms of storage space, measured by the bytes of
disk and memory required.
• Effectiveness is more difficult to measure than efficiency, since it depends entirely on
human judgment. The key idea behind measuring effectiveness is the notion of
relevance.
• To determine relevance, a human assessor reviews a document/topic pair and
assigns a relevance value. The relevance value may be binary or graded.
1.3 Working with Electronic Text:
• Human-language data in the form of electronic text represents the raw
material of information retrieval.
• Building an IR system requires an understanding of both electronic
text formats and the characteristics of the text they encode.
Text format:
• This figure presents the play as it might appear on a printed page. From the
perspective of an IR system, there are two aspects of this page that must be
considered when it is represented in electronic form, and ultimately when it
is indexed by the system.
• The first aspect, the content of the page, is the sequence of words in the
order they might normally be read.
• The second aspect is the structure of the page: the breaks between lines and
pages, the labelling of speeches with speakers, the stage directions, the act
and scene numbers, and even the page number.
• The content and structure of electronic text may be encoded in myriad document formats
supported by various word processing programs and desktop publishing systems.
• Two formats are of special interest to us. The first, HTML (HyperText Markup Language), is
the fundamental format for Web pages.
• The second format, XML (Extensible Markup Language), is not strictly a document format
but rather a metalanguage for defining document formats.
A Simple Tokenization of English Text
• Regardless of a document’s format, the construction of an inverted index that
can be used to process search queries requires each document to be converted
into a sequence of tokens.
• For English-language documents, a token usually corresponds to a sequence of
alphanumeric characters (A to Z and 0 to 9), but it may also encode structural
information, such as XML tags, or other characteristics of the text.
• To tokenize the XML in the figure, we treat each XML tag and each sequence of
consecutive alphanumeric characters as a token.
• We convert uppercase letters outside tags to lowercase in order to simplify the
matching process, meaning that “FIRST”, “first” and “First” are treated as
equivalent. The result of our tokenization is shown in the table.
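The tokenization rule described above can be sketched in a few lines. The regular expression, the function name tokenize, and the sample markup are assumptions made for illustration; the sample is not the exact XML used in the course material.

```python
import re

# A token is either a whole XML tag or a run of alphanumeric characters.
# The tag alternative comes first so that text inside "<...>" is never split.
TOKEN_RE = re.compile(r"</?[A-Za-z][^>]*>|[A-Za-z0-9]+")

def tokenize(xml_text):
    """Split XML text into tag tokens and lowercased word tokens."""
    tokens = []
    for match in TOKEN_RE.finditer(xml_text):
        token = match.group(0)
        # Tags keep their case; text outside tags is folded to lowercase.
        tokens.append(token if token.startswith("<") else token.lower())
    return tokens

sample = ("<SPEECH><SPEAKER>First Witch</SPEAKER>"
          "<LINE>When shall we three meet again</LINE></SPEECH>")
print(tokenize(sample))
# ['<SPEECH>', '<SPEAKER>', 'first', 'witch', '</SPEAKER>', '<LINE>',
#  'when', 'shall', 'we', 'three', 'meet', 'again', '</LINE>', '</SPEECH>']
```

Because only alphanumeric runs count as words, punctuation is dropped and a contraction such as "o'er" becomes the two tokens "o" and "er".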
Shakespeare’s 37th play
• A first-order language model for terms is equivalent to the zero-order model for
term bigrams estimated using the same technique:
and equivalently
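One way to check this equivalence numerically is sketched below. It assumes maximum-likelihood estimation as "the same technique" (an assumption; the text above does not name one). The first-order probability of t2 following t1 is obtained by dividing the zero-order probability of the bigram (t1, t2) by the total zero-order probability of all bigrams that start with t1. All function names are hypothetical.

```python
from collections import Counter, defaultdict
import math

def zero_order_bigram_model(tokens):
    """MLE zero-order model over bigrams: P(t1 t2) = count(t1 t2) / #bigrams."""
    bigrams = list(zip(tokens, tokens[1:]))
    counts = Counter(bigrams)
    return {bigram: c / len(bigrams) for bigram, c in counts.items()}

def first_order_direct(tokens):
    """MLE first-order model over terms: P(t2 | t1) = count(t1 t2) / count(t1 *)."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    context_counts = Counter(tokens[:-1])  # how often each term starts a bigram
    return {(t1, t2): c / context_counts[t1]
            for (t1, t2), c in bigram_counts.items()}

def first_order_from_bigrams(m0):
    """Derive P(t2 | t1) from the zero-order bigram model by normalizing on t1."""
    context_mass = defaultdict(float)
    for (t1, _), p in m0.items():
        context_mass[t1] += p
    return {(t1, t2): p / context_mass[t1] for (t1, t2), p in m0.items()}

tokens = "a b a c a b".split()
direct = first_order_direct(tokens)
derived = first_order_from_bigrams(zero_order_bigram_model(tokens))
# Both routes give the same conditional probabilities (up to rounding).
print(all(math.isclose(direct[k], derived[k]) for k in direct))  # True
```

In this toy sequence the term "a" is followed by "b" twice and by "c" once, so both routes give P(b | a) = 2/3.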
Inverted Indices
The inverted index (sometimes called an inverted file) is the central data
structure in virtually every information retrieval system. At its simplest,
an inverted index provides a mapping between terms and their
locations of occurrence in a text collection C.
Fig. An inverted index for Shakespeare’s plays.
This is the simplest type of index, known as a schema-independent index
because it makes no assumptions about the structure of the documents.
Inverted Indices
• first(t) returns the first position at which the term t occurs in the
collection
• last(t) returns the last position at which t occurs in the collection
• next(t, current) returns the position of t’s first occurrence after the
current position
• prev(t, current) returns the position of t’s last occurrence before the
current position
first(“hurlyburly”) = 316669
next(“witch”, 745429) = 745451
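The four methods above can be sketched over a toy collection as follows. This is a minimal in-memory illustration, not the course's implementation. The class name InvertedIndex, positions numbered from 1, binary search over sorted position lists, and the use of infinity as a "no such occurrence" marker are all assumptions made for this example.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict
import math

INF = math.inf  # assumed marker for "no such occurrence"

class InvertedIndex:
    """Schema-independent index: maps each term to its sorted positions."""

    def __init__(self, tokens):
        self.postings = defaultdict(list)
        # Positions are numbered from 1 across the whole token sequence.
        for position, term in enumerate(tokens, start=1):
            self.postings[term].append(position)

    def first(self, t):
        plist = self.postings.get(t)
        return plist[0] if plist else INF

    def last(self, t):
        plist = self.postings.get(t)
        return plist[-1] if plist else -INF

    def next(self, t, current):
        """Position of t's first occurrence strictly after current."""
        plist = self.postings.get(t, [])
        i = bisect_right(plist, current)
        return plist[i] if i < len(plist) else INF

    def prev(self, t, current):
        """Position of t's last occurrence strictly before current."""
        plist = self.postings.get(t, [])
        i = bisect_left(plist, current)
        return plist[i - 1] if i > 0 else -INF

tokens = "when shall we three meet again in thunder lightning or in rain".split()
index = InvertedIndex(tokens)
print(index.first("in"))     # 7
print(index.last("in"))      # 11
print(index.next("in", 7))   # 11
print(index.prev("in", 11))  # 7
```

Keeping each position list sorted is what lets next and prev use binary search rather than scanning the whole list.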
Phrase search