
INFORMATION STORAGE AND

RETRIEVAL TECHNIQUES
Course Code: INT204
K. Chakrapani
Department of IT
SOC
SASTRA
[email protected]
7845445557
UNIT - I
Information Storage: Overview – Abstraction – From Data to Wisdom -
Document and Query Forms – Query Structures
Information Retrieval Systems: Introduction – Web Search – Other Search
Applications – IR Applications – IR System Architecture – Documents and
Update – Working with Electronic Text – Test Collections – Open Source IR
Systems – Basic IR Techniques: Inverted Indices – Retrieval and Ranking –
Vector Space Model – Proximity Ranking – Boolean Retrieval - Evaluation:
Recall and Precision – Effectiveness Measures for Ranked Retrieval –
Building a Test Collection – Efficiency Measures
Information storage:
• It can be defined as a component of an accounting system that keeps
data in a form accessible to information processors.
• Information storage is usually thought of as digital, but it can still be in
paper format. Information storage can be on or off the cloud.
From Data to Wisdom
• It involves the understanding and ability to make use of data and
information to answer questions, solve problems, make decisions, and
so on.
• Wisdom has to do with using one's knowledge in a responsible (wise)
manner.
1.1 What Is Information Retrieval?
• Information retrieval (IR) is concerned with representing, searching, and manipulating
large collections of electronic text and other human-language data.

• IR systems and services are now widespread, with millions of people depending on
them daily to facilitate business, education, and entertainment.

• Web search engines — Google, Bing, and others — are by far the most popular and
heavily used IR services, providing access to up-to-date technical information, locating
people and organizations, summarizing news and events, and simplifying comparison
shopping.
Web Search:
• Regular users of Web search engines casually expect to receive accurate and near-instantaneous
answers to questions and requests merely by entering a short query of a few words into a text box
and clicking on a search button.

• Underlying this simple and intuitive interface are clusters of computers, comprising thousands of
machines, working cooperatively to generate a ranked list of those Web pages that are likely to
satisfy the information need embodied in the query.

• These machines identify a set of Web pages containing the terms in the query, compute a score
for each page, eliminate duplicate and redundant pages, generate summaries of the remaining
pages, and finally return the summaries and links back to the user for browsing.
Other Search Applications:
• Desktop and file system search provides browsing facilities for files stored on a local
hard disk and possibly on disks connected over a local network.

• Lying between the desktop and the general Web, enterprise-level IR systems provide
document management and search services across businesses and other organizations.

• Digital libraries and other specialized IR systems support access to collections of high-
quality material, often of a proprietary nature. This material may consist of newspaper
articles, medical journals, maps, or books that cannot be placed on a generally
available Web site due to copyright restrictions.
IR application
• A typical search application evaluates incoming queries against a given document
collection. In contrast, a routing, filtering, or dissemination system compares newly
created or discovered documents to a fixed set of queries supplied in advance by users,
identifying those that match a given query closely enough to be of possible interest to
the users.

• The difference between clustering and categorization stems from the information
provided to the system. Categorization systems are provided with training data
illustrating the various classes.
1.2 IR System Architecture

• Starting with an information need, the user constructs and issues a
query to the IR system. Typically, this query consists of a
small number of terms, with two to three terms being
typical for a Web search.

• Depending on the information need, a query term may
be a date, a number, a musical note, or a phrase.
Wildcard operators and other partial-match operators
may also be permitted in query terms.
Cont.....

• A major task of a search engine is to maintain and manipulate an inverted index
for a document collection. This index forms the principal data structure used by
the engine for searching and relevance ranking.

• To support relevance ranking algorithms, the search engine maintains collection
statistics associated with the index, such as the number of documents containing
each term and the length of each document.

• Using the inverted index, collection statistics, and other data, the search engine
accepts queries from its users, processes these queries, and returns ranked lists
of results. To perform relevance ranking, the search engine computes a score,
sometimes called a Retrieval Status Value (RSV), for each document.
Documents and Update:
• We use “document” as a generic term to refer to any self-contained unit that can be returned to the user
as a search result.

• In practice, a particular document might be an e-mail message, a Web page, a news article, or even a
video.

• When predefined components of a larger object may be returned as individual search results, such as
pages or paragraphs from a book, we refer to these components as elements.

• When arbitrary text passages, video segments, or similar material may be returned from larger
objects, we refer to them as snippets.

• Documents may be added or deleted in their entirety. Once a document has been added to the search
engine, its contents are not modified.
Performance Evaluation:
• There are two principal aspects to measuring IR system performance:
Efficiency and Effectiveness
• Efficiency may be measured in terms of time and space. The most visible aspect of
efficiency is the response time experienced by a user between issuing a query and
receiving the results
• Efficiency may also be considered in terms of storage space, measured by the bytes of
disk and memory
• Effectiveness is more difficult to measure than efficiency, since it depends entirely on
human judgment. The key idea behind measuring effectiveness is the notion of
relevance
• To determine relevance, a human assessor reviews a document/topic pair and
assigns a relevance value. The relevance value may be binary or graded
1.3 Working with Electronic Text:
• Human-language data in the form of electronic text represents the raw
material of information retrieval.
• Building an IR system requires an understanding of both electronic
text formats and the characteristics of the text they encode.
Text format:
• The figure presents the play as it might appear on a printed page. From the
perspective of an IR system, there are two aspects of this page that must be
considered when it is represented in electronic form, and ultimately when it
is indexed by the system.
• The first aspect, the content of the page, is the sequence of words in the
order they might normally be read
• The second aspect is the structure of the page: the breaks between lines and
pages, the labelling of speeches with speakers, the stage directions, the act
and scene numbers, and even the page number
• The content and structure of electronic text may be encoded in myriad document formats
supported by various word processing programs and desktop publishing systems.

• These myriad document formats include Microsoft Word, HTML, XML,
XHTML, LaTeX, MIF, RTF, PDF, PostScript, SGML, and others.

• Although a detailed description of these formats is beyond our scope, a basic
understanding of their impact on indexing and retrieval is important.

• Two formats are of special interest to us. The first, HTML (Hyper Text Markup Language), is
the fundamental format for Web pages.

• The second format, XML (Extensible Markup Language), is not strictly a document format
but rather a metalanguage for defining document formats.
A Simple Tokenization of English Text
• Regardless of a document’s format, the construction of an inverted index that
can be used to process search queries requires each document to be converted
into a sequence of tokens.
• For English-language documents, a token usually corresponds to a sequence of
alphanumeric characters (A to Z and 0 to 9), but it may also encode structural
information, such as XML tags, or other characteristics of the text.
• To tokenize the XML in Figure, we treat each XML tag and each sequence of
consecutive alphanumeric characters as a token.
• We convert uppercase letters outside tags to lowercase in order to simplify the
matching process, meaning that “FIRST”, “first” and “First” are treated as
equivalent. The result of our tokenization is shown in the table below.
Fig. A tokenization of Shakespeare’s Macbeth, Shakespeare’s 37th play.
Table. The twenty most frequent terms in Bosak’s XML version of Shakespeare.
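• To make this scheme concrete, here is a minimal tokenizer sketch in Python; the regular expression and the tag-handling rule are simplifying assumptions for illustration, not the exact rules of any particular IR system.

```python
import re

# A token is either an XML tag or a maximal run of alphanumeric
# characters; text outside tags is lowercased so that "FIRST",
# "first", and "First" are treated as equivalent.
TOKEN = re.compile(r"<[^>]+>|[A-Za-z0-9]+")

def tokenize(text):
    tokens = []
    for match in TOKEN.finditer(text):
        tok = match.group()
        if not tok.startswith("<"):   # tags keep their case
            tok = tok.lower()
        tokens.append(tok)
    return tokens

print(tokenize("<SPEAKER>First Witch</SPEAKER> When shall we three meet?"))
# ['<SPEAKER>', 'first', 'witch', '</SPEAKER>', 'when', 'shall',
#  'we', 'three', 'meet']
```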
Term Distributions
• The frequency of the tags is determined by the
structural constraints of the collection.
• The relationship between frequency and rank
represented by this line is known as Zipf’s law.
Mathematically, the relationship may be expressed as

log(frequency) = C − α · log(rank),

or equivalently as

Fi ∼ c / i^α,

where Fi is the frequency of the ith most frequent term.

Fig. Frequency of words in the Shakespeare collection, by rank order. The dashed
line corresponds to Zipf’s law with α = 1.
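• Zipf’s law is easy to check empirically on any tokenized collection. The following is a small sketch; `tokens` stands for the output of a tokenizer such as the one above.

```python
import math
from collections import Counter

def zipf_table(tokens, k=20):
    """Print rank, frequency, and log-log coordinates for the k most
    frequent terms. Under Zipf's law the (log rank, log frequency)
    points fall close to a line with slope -alpha."""
    counts = Counter(tokens)
    for rank, (term, freq) in enumerate(counts.most_common(k), start=1):
        print(f"{rank:4d}  {term:15s} {freq:8d}  "
              f"log(rank)={math.log(rank):5.2f}  "
              f"log(freq)={math.log(freq):5.2f}")
```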
Language Modeling
• There are 912,052 tokens in Bosak’s XML version of Shakespeare’s plays,
excluding tags
• If we pick a token uniformly at random from these plays, the probability of
picking “the” is 28,317/912,052 ≈ 3.1%, whereas the probability of picking
“zwaggered” is only 1/912,052 ≈ 0.00011%.
• Predictions concerning the content of unseen text may be made by way of a
special kind of probability distribution known as a language model. The simplest
language model is a fixed probability distribution M(σ) over the symbols σ in the
vocabulary V, so that

Σσ∈V M(σ) = 1.
Higher-order models
• Higher-order language models allow us to take this context into account. A first-
order language model consists of conditional probabilities that depend on the
previous symbol. For example, the probability of the next term being “witch”
may depend on whether the previous term is “first”.

• A first-order language model for terms is equivalent to the zero-order model for
term bigrams estimated using the same technique:

M1(σ2 | σ1) = M0(σ1 σ2) / Σσ′ M0(σ1 σ′)

• More generally, every nth-order language model may be expressed in terms of a
zero-order (n + 1)-gram model:

Mn(σn+1 | σ1 · · · σn) = M0(σ1 · · · σn+1) / Σσ′ M0(σ1 · · · σn σ′)
Smoothing (textbook page 20, Eq. 1.13)
• One solution to this problem is to smooth the first-order model M1 with the
corresponding zero-order model M0. Our smoothed model M1’ is then a linear
combination of M0 and M1:

M1’(σ2 | σ1) = γ · M1(σ2 | σ1) + (1 − γ) · M0(σ2)

and equivalently, for the corresponding bigram models,

M0’(σ1 σ2) = γ · M0(σ1 σ2) + (1 − γ) · M0(σ1) · M0(σ2)

where γ in both cases is a smoothing parameter (0 ≤ γ ≤ 1).
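• The following sketch estimates M0 and M1 by maximum likelihood from a token list and combines them as in Eq. 1.13; the maximum-likelihood estimation and the helper names are assumptions made for illustration.

```python
from collections import Counter

def zero_order(tokens):
    # M0(s): maximum-likelihood unigram probabilities
    counts, total = Counter(tokens), len(tokens)
    return {t: c / total for t, c in counts.items()}

def first_order(tokens):
    # M1(s2 | s1): maximum-likelihood bigram probabilities
    bigrams = Counter(zip(tokens, tokens[1:]))
    unigrams = Counter(tokens[:-1])
    return {(a, b): c / unigrams[a] for (a, b), c in bigrams.items()}

def smoothed_first_order(tokens, gamma=0.5):
    # M1'(s2 | s1) = gamma * M1(s2 | s1) + (1 - gamma) * M0(s2)
    m0, m1 = zero_order(tokens), first_order(tokens)
    def prob(s1, s2):
        return gamma * m1.get((s1, s2), 0.0) + (1 - gamma) * m0.get(s2, 0.0)
    return prob
```

• A first-order model of this kind is exactly what the Markov model in the next slide depicts: states connected by transitions labelled with terms and probabilities.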


Markov models
• Figure illustrates a Markov model, another
important method for representing term
distributions.
• Markov models are essentially finite-state
automata augmented with transition
probabilities.
• When used to express a language model, each
transition is labelled with a term, in addition to
the probability.
1.4 Test Collections
• The Text REtrieval Conference (TREC) is a series of experimental evaluation
efforts conducted annually since 1991 by the U.S. National Institute of
Standards and Technology (NIST).
• TREC provides a forum for researchers to test their IR systems on a broad
range of problems.
• TREC provides at least two important benefits to the IR community. First,
it focuses researchers on common problems, using common data, thus
providing a forum for them to present and discuss their work and
facilitating direct inter-system comparisons.
• As a second benefit, TREC aims to create reusable test collections that can
be used by participating groups to validate further improvements and by
non-participating groups to evaluate their own work.
TREC Tasks
The TREC Tasks Track is an attempt at devising mechanisms for evaluating the
quality of retrieval systems in terms of
(1) how well they can understand the underlying task that led the user to
submit a query, and
(2) how useful they are for helping users complete their tasks.
1.5 Open-Source IR Systems
• There exists a wide variety of open-source information retrieval
systems that you may use for exercises in this book and to start
conducting your own information retrieval experiments. Three of
them are discussed here:
• Lucene
• Indri
• Wumpus
• All three systems are available for download from the Web and may
be used free of charge, according to their respective licenses.
• Lucene is an indexing and search system implemented in Java, with ports
to other programming languages
• Its retrieval framework is based on the concept of fields: every
document is a collection of fields, such as its title, body, and URL.
• Indri is an academic information retrieval system written in C++.
• It can handle multiple fields per document, such as title, body, and
anchor text, which is important in the context of Web search
• Wumpus is an academic search engine written in C++ and developed at
the University of Waterloo. Unlike the other two systems, it does not
assume a fixed division of the collection into documents.
• This makes the system particularly attractive for search tasks in which the
ideal search result may not always be a whole document, but may be a
section, a paragraph, or a sequence of paragraphs within a document.
Chapter 2
2.1 Basic IR techniques
2.1 Inverted Indices

The inverted index (sometimes called an inverted file) is the central data
structure in virtually every information retrieval system. At its simplest,
an inverted index provides a mapping between terms and their
locations of occurrence in a text collection C.
Inverted Indices

Fig. An inverted index for Shakespeare’s plays. This is the simplest type of
index, known as a schema-independent index, because it makes no
assumptions about document structure.
Inverted Indices
• first(t) returns the first position at which the term t occurs in the
collection
• last(t) returns the last position at which t occurs in the collection
• next(t, current) returns the position of t’s first occurrence after the
current position
• prev(t, current) returns the position of t’s last occurrence before the
current position
first(“hurlyburly”) = 316669
next(“witch”, 745429) = 745451
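• A minimal in-memory sketch of this four-method interface over sorted postings lists, using binary search; the class name and the use of ±infinity as "no such occurrence" sentinels are illustrative assumptions.

```python
from bisect import bisect_left, bisect_right

class InvertedIndex:
    """Schema-independent index: maps each term to the sorted list of
    positions (1, 2, 3, ...) at which it occurs in the collection."""

    def __init__(self, tokens):
        self.postings = {}
        for pos, term in enumerate(tokens, start=1):
            self.postings.setdefault(term, []).append(pos)

    def first(self, t):
        return self.postings[t][0] if t in self.postings else float('inf')

    def last(self, t):
        return self.postings[t][-1] if t in self.postings else float('-inf')

    def next(self, t, current):
        # first occurrence of t strictly after `current`
        p = self.postings.get(t, [])
        i = bisect_right(p, current)
        return p[i] if i < len(p) else float('inf')

    def prev(self, t, current):
        # last occurrence of t strictly before `current`
        p = self.postings.get(t, [])
        i = bisect_left(p, current) - 1
        return p[i] if i >= 0 else float('-inf')
```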
Phrase search

• Most commercial Web search engines, as well as many other IR
systems, treat a list of terms enclosed in double quotes ("...") as a
phrase
• To process a query that contains a phrase, the IR system must identify
the occurrences of the phrase in the collection
Phrase search
• For example, the interval [914823, 914829] from the index might represent the
text
O Romeo, Romeo! wherefore art thou Romeo?
Given the phrase “t1t2...tn”, consisting of a sequence of n terms, our algorithm
works through the postings lists for the terms from left to right, making a call to the
next method for each term, and then from right to left, making a call to the prev
method for each term. After each pass from left to right and back, it has computed
an interval in which the terms appear in the correct order and as close together as
possible. It then checks whether the terms are in fact adjacent.
If they are, an occurrence of the phrase has been found; if not, the algorithm
moves on.
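• A sketch of this two-pass procedure, built on the InvertedIndex sketch above; the Python rendering and function name are mine, though the textbook presents essentially this algorithm as nextPhrase.

```python
def next_phrase(index, terms, position):
    """Next occurrence of the phrase `terms` after `position`.
    Returns the interval (u, v), or (inf, inf) if there is none."""
    inf = float('inf')
    v = position
    for t in terms:                      # left-to-right pass
        v = index.next(t, v)
    if v == inf:
        return (inf, inf)
    u = v
    for t in reversed(terms[:-1]):       # right-to-left pass
        u = index.prev(t, u)
    if v - u == len(terms) - 1:          # terms adjacent and in order
        return (u, v)
    return next_phrase(index, terms, u)  # otherwise, keep looking

# All occurrences can be enumerated by repeatedly calling next_phrase,
# feeding the u component of each result back in as the new position.
```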
Implementing Inverted Indices
• When a collection will never change and when it is small enough to be
maintained entirely in memory, an inverted index may be implemented with very
simple data structures.
• The dictionary may be stored in a hash table or similar structure, and the postings
list for each term t may be stored in a fixed array whose length is the number of
occurrences of t.
• For example, such an array can represent the term “witch” in the index of
Shakespeare’s plays.
Fig. Access patterns for three approaches to solving prev(“witch”, 745429) = 745407:
(a) binary search, (b) sequential scan, and (c) galloping. For (b) and (c), the
algorithms start at an initial cached position of 1.
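• For illustration, here is a sketch of the galloping variant of next over a sorted postings array; the cache-handling details are assumptions, chosen to match the idea of starting from the position where the previous call left off.

```python
from bisect import bisect_right

def galloping_next(postings, current, cache=0):
    """next(t, current) over the sorted array `postings` by galloping:
    start near the cached array index, double the jump width until we
    pass `current`, then binary-search inside the final gap. Returns
    (position, new_cache); cost is logarithmic in the distance moved."""
    n = len(postings)
    if n == 0 or postings[n - 1] <= current:
        return float('inf'), cache
    # Restart from the front if the cached position is already too far.
    low = cache if 0 <= cache < n and postings[cache] <= current else 0
    if postings[low] > current:          # the very first posting works
        return postings[0], 0
    jump = 1                             # gallop forward
    high = low + jump
    while high < n and postings[high] <= current:
        low, jump = high, jump * 2
        high = low + jump
    high = min(high, n - 1)
    # Binary search in (low, high] for the first posting > current.
    i = bisect_right(postings, current, low + 1, high + 1)
    return postings[i], i
```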
Document and other elements
• Most IR systems and algorithms operate over a standard unit of
retrieval: the document
• Depending on these requirements, a document might be an e-mail
message, a Web page, a newspaper article, a collection of books, etc.
• In the case of our collection of Shakespeare’s plays, the most natural
course is probably to treat each play as a document, but acts, scenes,
speeches, and lines might all be appropriate units of retrieval in some
circumstances
Document and other elements
• For the purposes of a simple example, assume we are interested in speeches and
wish to locate those spoken by the “first witch”
• The phrase “first witch” first occurs at [745406, 745407]
• Computing the speech that contains this phrase is reasonably straightforward.
Using the methods of our inverted index ADT, we determine that the start of a
speech immediately preceding this phrase is located at
prev(“<SPEECH>”, 745406) = 745404
• The end of this speech is located at
next(“</SPEECH>”, 745404) = 745425
Document and other elements
• Most IR research assumes that the text collection naturally divides
into documents, which are considered to be atomic units for retrieval
• In a system for searching e-mail, messages form this basic retrieval
unit. In a file system, files do; on the Web, Web pages
• Treating documents as atomic units also permits a collection to be
partitioned into multiple subcollections for parallel retrieval, and permits
documents to be reordered to improve efficiency, perhaps by grouping all
documents from a single source or Web site.
Document-oriented Indices
• Because document retrieval represents such an important special
case, indices are usually optimized around it
• To accommodate this optimization, the numbering of positions in a
document collection may be split into two components: a document
number and an offset within the document.
In the resulting n:m notation, n is a document identifier (or docid) and
m is a within-document position (offset).
• For example, next(“witch”, 22:288) = 22:310
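• A small sketch of this split, assuming we know the sorted start position of each document; the doc_starts array and the 1-based numbering are illustrative assumptions.

```python
from bisect import bisect_right

def to_docid_offset(position, doc_starts):
    """Split a flat collection position into (docid, offset).
    `doc_starts[i]` is the start position of document i+1; docids and
    offsets are 1-based, matching the n:m notation above. Assumes
    position >= doc_starts[0]."""
    n = bisect_right(doc_starts, position)   # document number
    m = position - doc_starts[n - 1] + 1     # offset within the document
    return n, m

# With hypothetical boundaries, to_docid_offset(pos, doc_starts)
# might return (22, 310), written 22:310.
```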
2.2 Retrieval and Ranking
• In this section, we will learn about three simple retrieval methods, namely:
1. Vector Space Model
2. Proximity Ranking
3. Boolean Retrieval
• The first two methods produce ranked results, sorting the documents according
to the expected relevancy to the query
• The third retrieval method allows Boolean filters to be applied to the collection
Vector Space Model
• The vector space model is one of the oldest and best known of the
information retrieval models
• The method was developed and promulgated by Gerald Salton, who
was perhaps the most influential of the early IR researchers.
• Given a query vector and a set of document vectors, one for each
document in the collection, we rank the documents by computing a
similarity measure between the query vector and each document
vector, comparing the angle between them.
Vector Space model

Fig. Document similarity under the vector space model. Angles are computed
between a query vector q and two document vectors d1 and d2. Because θ1 < θ2,
d1 should be ranked higher than d2.
Vector Space Model
• In representing a document or query as a vector, a weight must be
assigned to each term that represents the value of the corresponding
component of the vector. Throughout the long history of the vector
space model, many formulae for assigning these weights have been
proposed and evaluated. With few exceptions, these formulae may be
characterized as belonging to a general family known as TF-IDF
weights.
• TF-IDF weight is a product of functions of term frequency and inverse
document frequency.
Vector Space Model

• The IDF component is typically computed as IDF = log(N/Nt), where N is the
total number of documents in the collection and Nt is the number of documents
containing the term t.
• The TF component is a function of ft,d, the number of times the term t occurs in
the document d; a common choice is TF = log(ft,d) + 1 when ft,d > 0, and 0
otherwise.
• The TF-IDF weight of a term in a document is then the product of these two
functions.
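• Putting the pieces together, here is a small sketch of vector space ranking with TF-IDF weights and cosine similarity; the TF and IDF formulas are the common logarithmic variants noted above, and the function names are mine.

```python
import math
from collections import Counter

def cosine_rank(query_terms, docs):
    """Rank documents (lists of tokens) against a query by the cosine
    of the angle between their TF-IDF vectors. TF = log(f) + 1 and
    IDF = log(N / Nt); vectors are normalized to unit length, so the
    dot product equals the cosine similarity."""
    N = len(docs)
    df = Counter(t for d in docs for t in set(d))   # Nt for each term

    def vector(tokens):
        counts = Counter(t for t in tokens if df[t] > 0)
        v = {t: (math.log(f) + 1) * math.log(N / df[t])
             for t, f in counts.items()}
        norm = math.sqrt(sum(w * w for w in v.values())) or 1.0
        return {t: w / norm for t, w in v.items()}

    q = vector(query_terms)
    scores = []
    for i, d in enumerate(docs):
        dv = vector(d)
        scores.append((sum(w * dv.get(t, 0.0) for t, w in q.items()), i))
    return sorted(scores, reverse=True)
```

• Note that a term occurring in every document gets IDF = log(1) = 0 and so contributes nothing to the score, which is exactly the intended effect of the IDF component.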
Proximity Ranking
• The vector space ranking method from the previous section explicitly
depends only on TF and IDF. In contrast, this method explicitly
depends only on term proximity.
• When the components of a term vector (t1, t2, . . . , tn) appear in close
proximity within a document, it suggests that the document is more
likely to be relevant than one in which the terms appear farther apart.
• Given a term vector (t1, t2, . . . , tn), we define a cover for the vector as an
interval [u, v] in the collection that contains a match to all the terms,
without containing a smaller interval [u′, v′], u ≤ u′ ≤ v′ ≤ v, that also
contains a match to all the terms.
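• Covers can be enumerated with the same next/prev interface used for phrase search; this sketch follows the textbook's nextCover idea, though the generator wrapper and scoring comment are my additions.

```python
def next_cover(index, terms, position):
    """Next cover of `terms` after `position`: jump forward to the
    latest `next`, then shrink backward with `prev` to get the
    candidate interval. Returns (inf, inf) when no cover remains."""
    inf = float('inf')
    v = max(index.next(t, position) for t in terms)
    if v == inf:
        return (inf, inf)
    u = min(index.prev(t, v + 1) for t in terms)
    return (u, v)

def covers(index, terms):
    """Generate all covers; a proximity score can then reward short
    covers, e.g. by accumulating 1 / (v - u + 1) per document."""
    u = float('-inf')
    while True:
        u, v = next_cover(index, terms, u)
        if u == float('inf'):
            return
        yield (u, v)
```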
Boolean Retrieval
• In contrast to ranked retrieval, Boolean retrieval returns sets of documents rather
than ranked lists. Under the Boolean retrieval model, a term t is considered to
specify the set of documents containing it. The standard Boolean operators (AND,
OR, and NOT) are used to construct Boolean queries, which are interpreted as
operations over these sets, as follows:
A AND B: intersection of A and B (A ∩ B)
A OR B: union of A and B (A ∪ B)
NOT A: complement of A with respect to the document collection (Ā)
where A and B are terms or other Boolean queries.
Boolean Retrieval
• For example, from a given set of journal articles, a user who needs the
articles that contain the words “retrieval”, “Rocchio”, and “Naïve Bayes”,
but not “matrix”, can express the query in the following way:

(“retrieval” AND “Rocchio” AND “Naïve Bayes”) AND NOT “matrix”
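• Because Boolean retrieval operates on sets of docids, such a query maps directly onto set operations; the postings and docids below are hypothetical, purely for illustration.

```python
# Hypothetical postings: term -> set of docids containing it.
postings = {
    "retrieval":   {1, 2, 5, 7},
    "rocchio":     {2, 5, 9},
    "naive bayes": {2, 4, 5},
    "matrix":      {5, 8},
}

# AND is intersection, OR is union, AND NOT is set difference.
result = (postings["retrieval"]
          & postings["rocchio"]
          & postings["naive bayes"]) - postings["matrix"]
print(result)   # {2}
```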


2.3 Evaluation
• An implementation of a retrieval method must be efficient enough to
compute the results of a typical query in adequate time to satisfy the
user, and possible trade-offs between efficiency and effectiveness
must be considered
• A user may not wish to wait for a longer period of time — additional
seconds or even minutes — in order to receive a result that is only
slightly better than a result they could have received immediately.
• These trade-offs between speed and accuracy are explored through the
following topics.
Recall and Precision
• An assessor reads the document which is given for a query and judges
it relevant or not relevant with respect to a topic.
• TREC experiments generally use these binary judgments, with a
document being judged relevant if any part of it is relevant.

Recall and precision are then defined as

recall = |Rel ∩ Res| / |Rel|,  precision = |Rel ∩ Res| / |Res|,

where Res is the set of documents returned by the query and Rel is the set of
relevant documents for the topic contained in the collection.
Effectiveness Measures for Ranked Retrieval
• If the user is interested in reading only one or two relevant
documents, ranked retrieval may provide a more useful result than
Boolean retrieval.
• To extend our notions of recall and precision to the ordered lists
returned by ranked retrieval algorithms, we consider the top k
documents returned by a query, Res[1..k], and define:

recall@k = |Rel ∩ Res[1..k]| / |Rel|
P@k = |Rel ∩ Res[1..k]| / k
Effectiveness Measures for Ranked Retrieval
• By definition, recall@k increases monotonically with respect to k.
• Conversely, if a ranked retrieval method adheres to the Probability
Ranking Principle defined in Chapter 1 (i.e., ranking documents in
order of decreasing probability of relevance), then P@k will tend to
decrease as k increases.
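• These definitions translate directly into code; a small sketch with hypothetical docids and relevance judgments:

```python
def precision_at_k(ranked, relevant, k):
    """P@k = |Rel ∩ Res[1..k]| / k"""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def recall_at_k(ranked, relevant, k):
    """recall@k = |Rel ∩ Res[1..k]| / |Rel|"""
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)

ranked = [12, 3, 48, 7, 21]   # hypothetical ranked result list (docids)
relevant = {3, 7, 90}         # hypothetical judged-relevant set
print(precision_at_k(ranked, relevant, 5))   # 2/5 = 0.4
print(recall_at_k(ranked, relevant, 5))      # 2/3 ≈ 0.67
```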
Building a Test Collection
• Given the difficulty of judging a topic over the entire collection, TREC
and other retrieval efforts depend on a technique known as pooling
to limit the number of judgments required.
• Participating groups each submit one or more experimental runs; each of
these runs consists of the top 1,000 or 10,000 documents for each topic.
• The goal of the pooling method is to reduce the amount of
assessment effort while retaining the ability to make reasonable
estimates of precision and recall. For runs contributing to a pool, the
values for P@k and recall@k are accurate at least down to the pool
depth.
Building a Test Collection
• The creation of a test collection at TREC and similar evaluation efforts generally
proceeds as follows:
1. Obtain an appropriate document set either from public sources or through
negotiation with its owners.
2. Develop at least 50 topics, the minimum generally considered acceptable for
a meaningful evaluation.
3. Release the topics to the track participants and receive their experimental
runs.
4. Create the pools, judge the topics, and return the results to the participants.
Efficiency Measures
• From a user’s perspective the only efficiency measure of interest is
response time, the time between entering a query and receiving the
results.
• Throughput, the average number of queries processed in a given
period of time, is primarily of interest to search engine operators,
particularly when the search engine is shared by many users and must
cope with thousands of queries per second.
• A simple but reasonable procedure for measuring response time is to
execute a full query set, capturing the start time from the operating
system at a well-defined point just before the first query is issued, and
the end time just after the last result is generated.
• The time to execute the full set is then divided by the number of
queries to give an average response time per query.
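• A sketch of that procedure; `search` stands in for whatever engine is being measured, and the choice of clock is an assumption.

```python
import time

def average_response_time(search, queries):
    """Execute the full query set, reading the clock once just before
    the first query and once just after the last result, then divide
    by the number of queries."""
    start = time.perf_counter()
    for q in queries:
        search(q)               # results must be fully generated here
    end = time.perf_counter()
    return (end - start) / len(queries)
```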
• As an example, the textbook compares the average response time of a
schema-independent index versus a frequency index, using the Wumpus
implementation of the Okapi BM25 ranking function.
• The efficiency benefits of using the frequency index are obvious, particularly on
the larger GOV2 collection. The use of a schema-independent index requires the
computation at run-time of document and term statistics that are precomputed
in the frequency index. To a user, a 202 ms response time would seem
instantaneous, whereas a 4.7 sec response time would be a noticeable lag.
• However, with a frequency index it is not possible to perform phrase searches or
to apply ranking functions to elements other than documents.
