
Information Retrieval System

Summary:
Much of the research in Information Retrieval has concerned improvements to similarity
computations, statistics gathering, and term extraction, with the goal of improving effectiveness.
However, a simple examination of user characteristics readily shows that the method of
computing similarity is less important than the behavior of the system interface and
environmental factors. It is hypothesised that there must be knowledge of the relationship
between a query, its user, the environment, and the instantiation of the query and user in the real world.
This hypothesis and others are demonstrated. With facilities for interaction and feedback
appropriately incorporated, effectiveness of 100% can be achieved.

Introduction:
Information Retrieval is the science of locating, from a large document collection, those
documents that fulfil a specified information need [1, 2, 3, 4]. Much of Information Retrieval
research is concerned with proposing and testing methodologies intended to perform this
function. To perform such tests it is necessary to make assumptions about the behavior of users
and the properties of text. For reasons of experimental design (following the assumption that
"good" experiments should not have many variables) the user is often assigned the role of reader
with no part in the process that produces answers from the document collection.
It might be thought that a formal model of the relationships between queries, documents,
meaning, and relevance could be used as a foundation for information retrieval. It is argued that
there can be no such model: humans cannot be left out of the equation, yet cannot be modelled.
(This paper does not consider the information needs of non-humans, such as RoboCup
competitors.) This paper considers the basis and aims of information retrieval, examining
assumptions and, on the basis of these observations, describes user experiments showing just
how much effectiveness can be improved. These experiments justify great optimism for future
system measurement and design, with full, or at least 100%, effectiveness easily achieved.
Language and text and their impact on information retrieval are considered first, then
there is examination of the interaction of users, their environment, and relevance. The suggested
system design and experiments are then reported.
Definition:
Information retrieval (IR) is the science of searching for documents, for information within
documents, and for metadata about documents, as well as that of
searching relational databases and the World Wide Web. There is overlap in the usage of the
terms data retrieval, document retrieval, information retrieval, and text retrieval, but each also
has its own body of literature, theory, praxis, and technologies. IR is interdisciplinary, based on
computer science, mathematics, library science, information science, information architecture,
cognitive psychology, linguistics and statistics.
Automated information retrieval systems are used to reduce what has been called
"information overload". Many universities and public libraries use IR systems to provide access
to books, journals and other documents. Web search engines are the most visible IR applications.
Overview:
The use of digital methods for storing and retrieving information has led to the
phenomenon of digital obsolescence, where a digital resource ceases to be readable because the
physical media, the reader required to read the media, the hardware, or the software that runs on
it, is no longer available. The information is initially easier to retrieve than if it were on paper,
but is then effectively lost.
An information retrieval process begins when a user enters a query into the system.
Queries are formal statements of information needs, for example search strings in web search
engines. In information retrieval a query does not uniquely identify a single object in the
collection. Instead, several objects may match the query, perhaps with different degrees
of relevancy.
An object is an entity that is represented by information in a database. User queries are
matched against the database information. Depending on the application the data objects may be,
for example, text documents, images, audio, mind maps or videos. Often the documents
themselves are not kept or stored directly in the IR system, but are instead represented in the
system by document surrogates or metadata.
Most IR systems compute a numeric score for how well each object in the database matches
the query, and rank the objects according to this value. The top-ranking objects are then shown to
the user. The process may then be iterated if the user wishes to refine the query.
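The score-and-rank step described above can be sketched with a toy TF-IDF model. This is an illustrative choice only; real systems use many different similarity functions, and the document collection here is invented:

```python
import math
from collections import Counter

def rank(query, docs):
    """Score each document against the query with a simple TF-IDF sum
    and return (doc_id, score) pairs, best first."""
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for terms in docs.values():
        df.update(set(terms))
    scores = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)
        s = sum(tf[t] * math.log(n / df[t]) for t in query if t in tf)
        if s > 0:                       # keep only documents that match
            scores[doc_id] = s
    return sorted(scores.items(), key=lambda kv: -kv[1])

docs = {
    "d1": "information retrieval finds relevant documents".split(),
    "d2": "databases store structured records".split(),
    "d3": "retrieval of documents from large collections".split(),
}
rank("retrieval documents".split(), docs)  # d1 and d3 match; d2 does not
```

Iterating the process, as the text notes, simply means editing the query and calling the ranker again.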
Performance Measures:
Different measures for evaluating the performance of information retrieval systems have
been proposed. The measures require a collection of documents and a query. All common
measures described here assume a ground truth notion of relevancy: every document is known to
be either relevant or non-relevant to a particular query. In practice queries may be ill-posed and
there may be different shades of relevancy.
Precision
Precision is the fraction of the documents retrieved that are relevant to the user's
information need. Precision takes all retrieved documents into account. It can also be evaluated
at a given cut-off rank, considering only the topmost results returned by the system. This
measure is called precision at n or P@n.
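Precision and P@n follow directly from the definition; a minimal sketch, assuming the ranked results are a list of document identifiers and the relevance judgments are a set:

```python
def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    return len(set(retrieved) & relevant) / len(retrieved)

def precision_at(n, retrieved, relevant):
    """Precision over only the top n results (P@n)."""
    return precision(retrieved[:n], relevant)

retrieved = ["d1", "d2", "d7", "d3"]   # ranked result list (hypothetical ids)
relevant = {"d1", "d2"}
precision(retrieved, relevant)         # 2 of 4 retrieved are relevant: 0.5
precision_at(2, retrieved, relevant)   # both of the top 2 are relevant: 1.0
```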
Recall
Recall is the fraction of the documents that are relevant to the query that are successfully
retrieved. It is trivial to achieve recall of 100% by returning all documents in response to any
query. Therefore recall alone is not enough but one needs to measure the number of non-relevant
documents also, for example by computing the precision.
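Recall is the mirror-image computation, dividing by the number of relevant documents instead of the number retrieved (example ids are hypothetical):

```python
def recall(retrieved, relevant):
    """Fraction of the relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)

recall(["d1", "d7"], {"d1", "d2", "d3"})  # 1 of 3 relevant found: ~0.333
```

Returning every document in the collection makes the intersection equal to the relevant set, which is why 100% recall is trivially achievable.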
Fall-Out
The proportion of non-relevant documents that are retrieved, out of all non-relevant
documents available. It can be looked at as the probability that a non-relevant document is
retrieved by the query. It is trivial to achieve fall-out of 0% by returning zero documents in
response to any query.
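Fall-out needs one extra input beyond precision and recall: the full collection, from which the non-relevant set is derived (example ids are hypothetical):

```python
def fall_out(retrieved, relevant, collection):
    """Fraction of the non-relevant documents that were retrieved."""
    non_relevant = collection - relevant
    if not non_relevant:
        return 0.0
    return len(set(retrieved) & non_relevant) / len(non_relevant)

fall_out(["d1", "d2"], {"d1"}, {"d1", "d2", "d3", "d4", "d5"})  # 1 of 4: 0.25
```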
F-measure
The weighted harmonic mean of precision and recall. When recall and precision are evenly
weighted, it is known as the F1 measure.
Mean Average precision
Precision and recall are single-value metrics based on the whole list of documents
returned by the system. For systems that return a ranked sequence of documents, it is desirable to
also consider the order in which the returned documents are presented. Average precision
emphasizes ranking relevant documents higher. It is the average of the precision values
computed at the rank of each relevant document in the ranked sequence. This metric is also sometimes
referred to geometrically as the area under the Precision-Recall curve.
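A sketch of average precision for a single query, with relevant documents that are never retrieved contributing zero (example ids are hypothetical; averaging this value over a set of queries gives mean average precision):

```python
def average_precision(retrieved, relevant):
    """Average of the precision values at each rank where a relevant
    document appears, divided over all relevant documents."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank        # precision at this rank
    return total / len(relevant)

# hits at ranks 1 and 3: (1/1 + 2/3) / 2, roughly 0.833
average_precision(["d1", "d7", "d2"], {"d1", "d2"})
```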
Discounted cumulative gain
DCG uses a graded relevance scale of documents from the result set to evaluate the
usefulness, or gain, of a document based on its position in the result list. The premise of DCG is
that highly relevant documents appearing lower in a search result list should be penalized as the
graded relevance value is reduced logarithmically proportional to the position of the result.
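A sketch of the logarithmic discount. Note that two conventions exist; the one below discounts rank i by log2(i + 1), so the top position is undiscounted:

```python
import math

def dcg(gains):
    """Discounted cumulative gain for a ranked list of graded
    relevance values, discounting rank i by log2(i + 1)."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))

dcg([3, 2, 3, 0, 1])  # highly relevant documents early contribute most
```

Swapping a highly relevant document to a lower rank lowers the total, which is the penalty the premise above describes.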
THE HAIRCUT SYSTEM
HAIRCUT (The Hopkins Automated Information Retriever for Combing Unstructured
Text) is a Java based text retrieval engine developed at APL. We are particularly interested in
language-neutral techniques for HAIRCUT because we lack the resources to do significant
language-specific work.
HAIRCUT has a flexible tokenizer that supports multiple term types such as words, word
stems, and character n-grams. All text is read as Unicode using Java’s built-in Unicode facilities.
For alphabetic languages, the tokenizer is typically configured to break words at spaces,
downcase them, and remove diacritics. Punctuation is used to identify sentence boundaries and
then removed. Stop structure (the non-content-bearing part of a user’s query such as “find
documents that” or “I’m interested in learning about”) is then optionally removed. We manually
developed a list of 459 English stop phrases to be removed from queries. Each phrase was then
translated into the other supported languages using various commercial MT systems. We do not
have the means to verify the quality of such non-English stop structure, but its removal from
queries seems to improve accuracy.
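The normalization steps above (downcasing, diacritic removal, punctuation removal, stop-structure stripping) can be sketched as follows. This is not HAIRCUT's code; STOP_PHRASES is a two-entry stand-in for its 459-phrase English list:

```python
import unicodedata

STOP_PHRASES = ["find documents that", "i'm interested in learning about"]

def normalize(text):
    """Lowercase, strip diacritics via NFD decomposition, and drop
    punctuation, keeping letters, digits, and spaces."""
    text = unicodedata.normalize("NFD", text.lower())
    return "".join(c for c in text
                   if not unicodedata.combining(c)
                   and (c.isalnum() or c.isspace()))

def strip_stop_structure(query):
    """Remove known stop phrases from a normalized query string."""
    q = normalize(query)
    for phrase in STOP_PHRASES:
        q = q.replace(normalize(phrase), " ")
    return " ".join(q.split())

strip_stop_structure("I'm interested in learning about café prices")
# leaves only the content-bearing terms: "cafe prices"
```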
The resulting words, called raw words, are used as the main point of comparison with n-
grams. They also form the basis for the construction of n-grams. A space is placed at the
beginning and end of each sentence and between each pair of words. Each subsequence of length
n is then generated as an n-gram. A text with fewer than n − 2 characters generates no n-grams in
this approach. This is not problematic for 4-grams, but 6-grams are unable to respond, for
example, to the query “IBM.” A solution is to generate an additional indexing term for each
word of length less than n − 2; however, this is not part of our ordinary processing.
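The n-gram construction above (pad with spaces, slide a window of width n) can be sketched as:

```python
def char_ngrams(words, n):
    """Pad a word sequence with single spaces at the boundaries and
    between words, then emit every character subsequence of length n."""
    text = " " + " ".join(words) + " "
    return [text[i:i + n] for i in range(len(text) - n + 1)]

char_ngrams(["ibm"], 4)  # [' ibm', 'ibm ']
char_ngrams(["ibm"], 6)  # [] -- the "IBM" problem described above
```

The second call shows the limitation in the text: the padded string " ibm " has only 5 characters, so no 6-gram can be formed.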
Besides the character-level processing required by the tokenizer, and the removal of our
guesses at stop structure, HAIRCUT has no language-specific code. We have occasionally run
experiments using one of the Snowball stemmers, which attempt to conflate related words
with a common root using language-specific rules, but this is not a regular part of our processing.
Nor do we do any decompounding, lemmatization, part-of-speech tagging, chunking, parsing, or
other linguistically motivated techniques.
The HAIRCUT index is a typical inverted index: each indexing term is associated with a
postings list of all documents that contain that term. The dictionary is stored in a compressed B-
tree, which is paged to disk as necessary. Postings are stored on disk using gamma
compression to reduce disk use. Both document identifiers and term frequencies are
compressed. Only term counts are kept in our postings lists; we do not keep term position
information. We also store a bag-of-words representation of each document on disk to facilitate
blind relevance feedback and term relationship discovery.
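Gamma (Elias gamma) coding represents a positive integer as its binary form preceded by one zero per bit after the leading one, so small numbers take few bits. A minimal sketch over bit strings; a real postings implementation would pack bits and typically encode gaps between document identifiers rather than the identifiers themselves:

```python
def gamma_encode(n):
    """Elias gamma code of a positive integer as a bit string."""
    assert n >= 1
    b = bin(n)[2:]                      # binary without the '0b' prefix
    return "0" * (len(b) - 1) + b       # unary length prefix, then the bits

def gamma_decode(bits):
    """Decode a single gamma-coded integer from a bit string."""
    zeros = 0
    while bits[zeros] == "0":           # count the unary length prefix
        zeros += 1
    return int(bits[zeros:zeros + zeros + 1], 2)

gamma_encode(5)  # '00101'
```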
Blind relevance feedback for monolingual retrieval, and pre- and post-translation
expansion for bilingual retrieval, are accomplished in the same way. Retrieval is performed on
the initial query, and the top retrieved documents (typically 20) are selected. The terms in those
documents are weighted according to our affinity statistic. The highest-weighted terms (typically
50) are then selected as feedback terms.
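The feedback loop above can be sketched as follows. The affinity statistic is not specified in this text, so raw term frequency is used here purely as a stand-in weighting:

```python
from collections import Counter

def feedback_terms(ranked_docs, query_terms, top_docs=20, top_terms=50):
    """Sketch of blind relevance feedback: pool the terms of the top
    retrieved documents, weight them (here by raw frequency, standing
    in for HAIRCUT's affinity statistic), and return the
    highest-weighted terms as expansion terms."""
    pool = Counter()
    for terms in ranked_docs[:top_docs]:
        pool.update(terms)
    for t in query_terms:               # exclude the original query terms
        pool.pop(t, None)
    return [t for t, _ in pool.most_common(top_terms)]
```

The expansion terms are then appended to the original query and retrieval is run a second time; pre- and post-translation expansion apply the same loop on either side of query translation.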
Conclusions:
Much of the research in Information Retrieval has concerned improvements to similarity
computations, statistics gathering, and term extraction, with the goal of improving effectiveness.
However, a simple examination of user characteristics readily shows that the method of
computing similarity is less important than the behavior of the system interface and
environmental factors. It was hypothesised that there must be knowledge of the relationship between
a query, its user, the environment, and the instantiation of the query and user in the real world. This
hypothesis and others are demonstrated. With facilities for interaction and feedback
appropriately incorporated, effectiveness of 100% can be achieved.
