
Information Retrieval System

Summary:
Much of the research in Information Retrieval has concerned improvements to similarity
computations, statistics gathering, and term extraction, with the goal of improving effectiveness.
However, a simple examination of user characteristics readily shows that the method of
computing similarity is less important than the behavior of the system interface and
environmental factors. It is hypothesised that there must be knowledge of the relationship
between a query, its user, the environment, and the instantiation of the query and user in the real world.
This hypothesis and others are demonstrated. With facilities for interaction and feedback
appropriately incorporated, effectiveness of 100% can be achieved.

Introduction:
Information Retrieval is the science of locating, from a large document collection, those
documents that fulfil a specified information need [1, 2, 3, 4]. Much of Information Retrieval
research is concerned with proposing and testing methodologies intended to perform this
function. To perform such tests it is necessary to make assumptions about the behavior of users
and the properties of text. For reasons of experimental design (following the assumption that
"good" experiments should not have many variables) the user is often assigned the role of reader
with no part in the process that produces answers from the document collection.
It might be thought that a formal model of the relationships between queries, documents,
meaning, and relevance could be used as a foundation for information retrieval. It is argued that
there can be no such model: humans cannot be left out of the equation, yet cannot be modelled.
(This paper does not consider the information needs of non-humans, such as RoboCup
competitors.) This paper considers the basis and aims of information retrieval, examining
assumptions and, on the basis of these observations, describes user experiments showing just
how much effectiveness can be improved. These experiments justify great optimism for future
system measurement and design, with full, or at least 100%, effectiveness easily achieved.
Language and text and their impact on information retrieval are considered first, then
there is examination of the interaction of users, their environment, and relevance. The suggested
system design and experiments are then reported.
Definition:
Information retrieval (IR) is the science of searching for documents, for information within
documents, and for metadata about documents, as well as that of
searching relational databases and the World Wide Web. There is overlap in the usage of the
terms data retrieval, document retrieval, information retrieval, and text retrieval, but each also
has its own body of literature, theory, praxis, and technologies. IR is interdisciplinary, based on
computer science, mathematics, library science, information science, information architecture,
cognitive psychology, linguistics and statistics.
Automated information retrieval systems are used to reduce what has been called
"information overload". Many universities and public libraries use IR systems to provide access
to books, journals and other documents. Web search engines are the most visible IR applications.
Overview:
The use of digital methods for storing and retrieving information has led to the
phenomenon of digital obsolescence, where a digital resource ceases to be readable because the
physical media, the reader required to read the media, the hardware, or the software that runs on
it, is no longer available. The information is initially easier to retrieve than if it were on paper,
but is then effectively lost.
An information retrieval process begins when a user enters a query into the system.
Queries are formal statements of information needs, for example search strings in web search
engines. In information retrieval a query does not uniquely identify a single object in the
collection. Instead, several objects may match the query, perhaps with different degrees
of relevancy.
An object is an entity that is represented by information in a database. User queries are
matched against the database information. Depending on the application the data objects may be,
for example, text documents, images, audio, mind maps or videos. Often the documents
themselves are not kept or stored directly in the IR system, but are instead represented in the
system by document surrogates or metadata.
Most IR systems compute a numeric score for how well each object in the database matches
the query, and rank the objects according to this value. The top-ranking objects are then shown to
the user. The process may then be iterated if the user wishes to refine the query.
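The score-and-rank step described above can be sketched with a toy TF-IDF model. This is an illustrative choice only; real systems use many different similarity functions, and the document collection here is invented:

```python
import math
from collections import Counter

def rank(query, docs):
    """Score each document against the query with a simple TF-IDF sum
    and return (doc_id, score) pairs, best first."""
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for terms in docs.values():
        df.update(set(terms))
    scores = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)
        s = sum(tf[t] * math.log(n / df[t]) for t in query if t in tf)
        if s > 0:                       # keep only documents that match
            scores[doc_id] = s
    return sorted(scores.items(), key=lambda kv: -kv[1])

docs = {
    "d1": "information retrieval finds relevant documents".split(),
    "d2": "databases store structured records".split(),
    "d3": "retrieval of documents from large collections".split(),
}
rank("retrieval documents".split(), docs)  # d1 and d3 match; d2 does not
```

Iterating the process, as the text notes, simply means editing the query and calling the ranker again.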
Performance Measures:
Different measures for evaluating the performance of information retrieval systems have
been proposed. The measures require a collection of documents and a query. All common
measures described here assume a ground truth notion of relevancy: every document is known to
be either relevant or non-relevant to a particular query. In practice queries may be ill-posed and
there may be different shades of relevancy.
Precision
Precision is the fraction of the documents retrieved that are relevant to the user's
information need. Precision takes all retrieved documents into account. It can also be evaluated
at a given cut-off rank, considering only the topmost results returned by the system. This
measure is called precision at n or P@n.
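Precision and P@n follow directly from the definition; a minimal sketch, assuming the ranked results are a list of document identifiers and the relevance judgments are a set:

```python
def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    return len(set(retrieved) & relevant) / len(retrieved)

def precision_at(n, retrieved, relevant):
    """Precision over only the top n results (P@n)."""
    return precision(retrieved[:n], relevant)

retrieved = ["d1", "d2", "d7", "d3"]   # ranked result list (hypothetical ids)
relevant = {"d1", "d2"}
precision(retrieved, relevant)         # 2 of 4 retrieved are relevant: 0.5
precision_at(2, retrieved, relevant)   # both of the top 2 are relevant: 1.0
```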
Recall
Recall is the fraction of the documents that are relevant to the query that are successfully
retrieved. It is trivial to achieve recall of 100% by returning all documents in response to any
query. Therefore recall alone is not enough but one needs to measure the number of non-relevant
documents also, for example by computing the precision.
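Recall is the mirror-image computation, dividing by the number of relevant documents instead of the number retrieved (example ids are hypothetical):

```python
def recall(retrieved, relevant):
    """Fraction of the relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)

recall(["d1", "d7"], {"d1", "d2", "d3"})  # 1 of 3 relevant found: ~0.333
```

Returning every document in the collection makes the intersection equal to the relevant set, which is why 100% recall is trivially achievable.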
Fall-Out
The proportion of non-relevant documents that are retrieved, out of all non-relevant
documents available. It can be looked at as the probability that a non-relevant document is
retrieved by the query. It is trivial to achieve fall-out of 0% by returning zero documents in
response to any query.
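Fall-out needs one extra input beyond precision and recall: the full collection, from which the non-relevant set is derived (example ids are hypothetical):

```python
def fall_out(retrieved, relevant, collection):
    """Fraction of the non-relevant documents that were retrieved."""
    non_relevant = collection - relevant
    if not non_relevant:
        return 0.0
    return len(set(retrieved) & non_relevant) / len(non_relevant)

fall_out(["d1", "d2"], {"d1"}, {"d1", "d2", "d3", "d4", "d5"})  # 1 of 4: 0.25
```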
F-measure
The weighted harmonic mean of precision and recall. When recall and precision are evenly
weighted, it is known as the F1 measure.
Mean Average precision
Precision and recall are single-value metrics based on the whole list of documents
returned by the system. For systems that return a ranked sequence of documents, it is desirable to
also consider the order in which the returned documents are presented. Average precision
emphasizes ranking relevant documents higher. It is the average of the precision values
computed at the rank of each relevant document in the ranked sequence. This metric is also sometimes
referred to geometrically as the area under the Precision-Recall curve.
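A sketch of average precision for a single query, with relevant documents that are never retrieved contributing zero (example ids are hypothetical; averaging this value over a set of queries gives mean average precision):

```python
def average_precision(retrieved, relevant):
    """Average of the precision values at each rank where a relevant
    document appears, divided over all relevant documents."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank        # precision at this rank
    return total / len(relevant)

# hits at ranks 1 and 3: (1/1 + 2/3) / 2, roughly 0.833
average_precision(["d1", "d7", "d2"], {"d1", "d2"})
```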
Discounted cumulative gain
DCG uses a graded relevance scale of documents from the result set to evaluate the
usefulness, or gain, of a document based on its position in the result list. The premise of DCG is
that highly relevant documents appearing lower in a search result list should be penalized as the
graded relevance value is reduced logarithmically proportional to the position of the result.
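A sketch of the logarithmic discount. Note that two conventions exist; the one below discounts rank i by log2(i + 1), so the top position is undiscounted:

```python
import math

def dcg(gains):
    """Discounted cumulative gain for a ranked list of graded
    relevance values, discounting rank i by log2(i + 1)."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))

dcg([3, 2, 3, 0, 1])  # highly relevant documents early contribute most
```

Swapping a highly relevant document to a lower rank lowers the total, which is the penalty the premise above describes.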
THE HAIRCUT SYSTEM
HAIRCUT (The Hopkins Automated Information Retriever for Combing Unstructured
Text) is a Java based text retrieval engine developed at APL. We are particularly interested in
language-neutral techniques for HAIRCUT because we lack the resources to do significant
language-specific work.
HAIRCUT has a flexible tokenizer that supports multiple term types such as words, word
stems, and character n-grams. All text is read as Unicode using Java’s built-in Unicode facilities.
For alphabetic languages, the tokenizer is typically configured to break words at spaces,
downcase them, and remove diacritics. Punctuation is used to identify sentence boundaries and
then removed. Stop structure (the non-content-bearing part of a user’s query such as “find
documents that” or “I’m interested in learning about”) is then optionally removed. We manually
developed a list of 459 English stop phrases to be removed from queries. Each phrase was then
translated into the other supported languages using various commercial MT systems. We do not
have the means to verify the quality of such non-English stop structure, but its removal from
queries seems to improve accuracy.
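The normalization steps above (downcasing, diacritic removal, punctuation removal, stop-structure stripping) can be sketched as follows. This is not HAIRCUT's code; STOP_PHRASES is a two-entry stand-in for its 459-phrase English list:

```python
import unicodedata

STOP_PHRASES = ["find documents that", "i'm interested in learning about"]

def normalize(text):
    """Lowercase, strip diacritics via NFD decomposition, and drop
    punctuation, keeping letters, digits, and spaces."""
    text = unicodedata.normalize("NFD", text.lower())
    return "".join(c for c in text
                   if not unicodedata.combining(c)
                   and (c.isalnum() or c.isspace()))

def strip_stop_structure(query):
    """Remove known stop phrases from a normalized query string."""
    q = normalize(query)
    for phrase in STOP_PHRASES:
        q = q.replace(normalize(phrase), " ")
    return " ".join(q.split())

strip_stop_structure("I'm interested in learning about café prices")
# leaves only the content-bearing terms: "cafe prices"
```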
The resulting words, called raw words, are used as the main point of comparison with n-
grams. They also form the basis for the construction of n-grams. A space is placed at the
beginning and end of each sentence and between each pair of words. Each subsequence of length
n is then generated as an n-gram. A text with fewer than n − 2 characters generates no n-grams in
this approach. This is not problematic for 4-grams, but 6-grams are unable to respond, for
example, to the query “IBM.” A solution is to generate an additional indexing term for each
word of length less than n − 2; however, this is not part of our ordinary processing.
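The n-gram construction above (pad with spaces, slide a window of width n) can be sketched as:

```python
def char_ngrams(words, n):
    """Pad a word sequence with single spaces at the boundaries and
    between words, then emit every character subsequence of length n."""
    text = " " + " ".join(words) + " "
    return [text[i:i + n] for i in range(len(text) - n + 1)]

char_ngrams(["ibm"], 4)  # [' ibm', 'ibm ']
char_ngrams(["ibm"], 6)  # [] -- the "IBM" problem described above
```

The second call shows the limitation in the text: the padded string " ibm " has only 5 characters, so no 6-gram can be formed.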
Besides the character-level processing required by the tokenizer, and the removal of our
guesses at stop structure, HAIRCUT has no language-specific code. We have occasionally run
experiments using one of the Snowball stemmers, which attempt to conflate related words
with a common root using language-specific rules, but this is not a regular part of our processing.
Nor do we do any decompounding, lemmatization, part-of-speech tagging, chunking, parsing, or
other linguistically motivated techniques.
The HAIRCUT index is a typical inverted index: each indexing term is associated with a
postings list of all documents that contain that term. The dictionary is stored in a compressed B-
tree, which is paged to disk as necessary. Postings are stored on disk using gamma
compression to reduce disk use. Both document identifiers and term frequencies are
compressed. Only term counts are kept in our postings lists; we do not keep term position
information. We also store a bag-of-words representation of each document on disk to facilitate
blind relevance feedback and term relationship discovery.
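Gamma (Elias gamma) coding represents a positive integer as its binary form preceded by one zero per bit after the leading one, so small numbers take few bits. A minimal sketch over bit strings; a real postings implementation would pack bits and typically encode gaps between document identifiers rather than the identifiers themselves:

```python
def gamma_encode(n):
    """Elias gamma code of a positive integer as a bit string."""
    assert n >= 1
    b = bin(n)[2:]                      # binary without the '0b' prefix
    return "0" * (len(b) - 1) + b       # unary length prefix, then the bits

def gamma_decode(bits):
    """Decode a single gamma-coded integer from a bit string."""
    zeros = 0
    while bits[zeros] == "0":           # count the unary length prefix
        zeros += 1
    return int(bits[zeros:zeros + zeros + 1], 2)

gamma_encode(5)  # '00101'
```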
Blind relevance feedback for monolingual retrieval, and pre- and post-translation
expansion for bilingual retrieval, are accomplished in the same way. Retrieval is performed on
the initial query, and the top retrieved documents (typically 20) are selected. The terms in those
documents are weighted according to our affinity statistic. The highest-weighted terms (typically
50) are then selected as feedback terms.
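The feedback loop above can be sketched as follows. The affinity statistic is not specified in this text, so raw term frequency is used here purely as a stand-in weighting:

```python
from collections import Counter

def feedback_terms(ranked_docs, query_terms, top_docs=20, top_terms=50):
    """Sketch of blind relevance feedback: pool the terms of the top
    retrieved documents, weight them (here by raw frequency, standing
    in for HAIRCUT's affinity statistic), and return the
    highest-weighted terms as expansion terms."""
    pool = Counter()
    for terms in ranked_docs[:top_docs]:
        pool.update(terms)
    for t in query_terms:               # exclude the original query terms
        pool.pop(t, None)
    return [t for t, _ in pool.most_common(top_terms)]
```

The expansion terms are then appended to the original query and retrieval is run a second time; pre- and post-translation expansion apply the same loop on either side of query translation.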
Conclusions:
Much of the research in Information Retrieval has concerned improvements to similarity
computations, statistics gathering, and term extraction, with the goal of improving effectiveness.
However, a simple examination of user characteristics readily shows that the method of
computing similarity is less important than the behavior of the system interface and
environmental factors. It was hypothesised that there must be knowledge of the relationship between
a query, its user, the environment, and the instantiation of the query and user in the real world. This
hypothesis and others are demonstrated. With facilities for interaction and feedback
appropriately incorporated, effectiveness of 100% can be achieved.
